Psychology Today defines synesthesia as a neurological condition in which stimulation of one sensory or cognitive pathway involuntarily triggers another sensory or cognitive pathway. For example, a person with synesthesia may see colours when hearing different sounds. This project is inspired by the idea of synesthesia, and its initial aim was to let the user experience this effect without having synesthesia. Although this seems impossible, something similar was achieved by Neil Harbisson back in 2012.

Neil Harbisson

Neil Harbisson is an artist who was born completely colour blind. In his TED talk, 'I listen to colour', he demonstrates the device that he wears on his head at all times. It is a colour sensor that turns the colours it detects into audible frequencies, so while he sees the world in greyscale, he can hear symphonies of colour. This allows him to listen to and recognise colours, and even faces and paintings.


Visible Light Spectrum

The goal is not to replicate what has been done before but to do the opposite: to map the different frequencies of a sound source onto different colours and visualise them for the user. The aim is to let the user understand how sound and colour are related and to mimic the experience that a person with synesthesia has.

As the visible light spectrum, modified from Philip Ronan's 'EM spectrum', suggests, visible light has wavelengths ranging from 400 nm to 700 nm depending on the colour.

Visible Light Frequency Range

As a result, each colour has a corresponding frequency. The frequency table shows that visible light spans frequencies from roughly 400 THz to 789 THz across the different colours.
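The wavelength and frequency views are linked by f = c/λ. As a quick sketch (in Python, which the project itself does not use), converting the 400 nm and 700 nm endpoints quoted above gives roughly 749 THz and 428 THz; the 789 THz upper figure in the table corresponds to a slightly deeper violet of about 380 nm.

```python
# Convert visible-light wavelengths to frequencies via f = c / wavelength.
# The 400 nm and 700 nm endpoints are taken from the spectrum above.

C = 299_792_458  # speed of light in m/s

def wavelength_nm_to_thz(wavelength_nm: float) -> float:
    """Return the frequency in THz for a wavelength given in nanometres."""
    return C / (wavelength_nm * 1e-9) / 1e12

print(round(wavelength_nm_to_thz(700)))  # red end: 428 THz
print(round(wavelength_nm_to_thz(400)))  # violet end: 749 THz
```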

The Spectrum of Speech

The voice spectra show that for males, audible speech has frequencies ranging from about 250 Hz to 2 kHz; for females, it ranges from around 500 Hz to 4 kHz. However, this depends on the vocal effort, which is represented by the different colours in the graph: cyan is shouted, purple is loud, green is raised, red is normal and blue is casual.

Time Domain vs. Frequency Domain

The next step is to map the frequency spectrum of a sound source onto the frequencies of the colours of visible light. To do so, we need to extract the frequency components of a source. When sound is recorded, it is usually in the time domain; in other words, when it is visualised in 2D, the x-axis is time and the y-axis is amplitude, as seen above. Therefore, the first step is to transform the sound source into the frequency domain, where the x-axis becomes frequency instead of time.

Fast Fourier Transform (FFT)

One straightforward way to do so is the FFT, 'an algorithm that computes the Discrete Fourier Transform of a sequence'. It can be used to convert the sound signal from the time domain into the frequency domain. It is a form of Fourier analysis, which is based on the Fourier series introduced by Joseph Fourier.
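As an illustration outside Max, the same time-to-frequency transformation can be sketched with NumPy's FFT routines; the signal and sample rate below are arbitrary examples, not values from the patch:

```python
# A minimal illustration of moving from the time domain to the frequency
# domain with an FFT, using NumPy (the patch itself uses vb.FFTWbuf~ in Max).
import numpy as np

sample_rate = 8000                        # samples per second
t = np.arange(sample_rate) / sample_rate  # one second of time axis
# Time-domain signal: a 440 Hz tone plus a quieter 880 Hz overtone.
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.fft.rfft(signal)            # frequency-domain representation
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The strongest bin sits at the dominant frequency of the tone.
dominant = freqs[np.argmax(np.abs(spectrum))]
print(dominant)  # 440.0
```

With one second of audio, each FFT bin is exactly 1 Hz wide, which is why the 440 Hz tone lands cleanly on a single bin.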

Max 8 by Cycling' 74

An FFT is impractical to compute by hand, so a program called Max 8 is used. Max 8 is a visual programming language developed by Cycling '74 that is widely used for music and multimedia development, and it is best known for its neat presentation of a program's structure and its graphical user interface.


A Max 8 patch was created as a prototype of the idea described earlier. It demonstrates two ways of distorting a live video using both live and recorded audio. The first changes the colour of the live video captured by your camera using recorded audio.

The system diagram for this feature is as follows:

1. Record a 10-second clip into a buffer (buffer~) using record~, while jit.grab digitises video from the camera.

2. Perform an FFT on the data stored inside the buffer using a third-party extension, vb.FFTWbuf~.

3. This transformed recording is then sent to a JavaScript file, which extracts the maximum, mean and minimum frequency values.

4. These frequencies are scaled to the range 85 to 255 to match RGB colour values.

5. The RGB values are converted into an HSL value, which jit.pwindow takes as input.

6. As a result, the video changes colour according to the frequency content of the recorded audio.
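The frequency-to-colour steps of this pipeline can be sketched in Python rather than Max/JavaScript; the speech-band input range of 250 Hz to 2 kHz, the frequency statistics, and all names here are illustrative assumptions, not values from the patch:

```python
# A sketch of the colour-mapping stage: scale three frequency statistics
# into the 85-255 range used for RGB, then convert to HSL. Which statistic
# drives which channel is an assumption for illustration.
import colorsys

def scale(value, in_min, in_max, out_min=85, out_max=255):
    """Linearly map value from [in_min, in_max] to [out_min, out_max]."""
    ratio = (value - in_min) / (in_max - in_min)
    return out_min + ratio * (out_max - out_min)

# Assumed frequency statistics (Hz) extracted from the FFT of the buffer,
# using the male speech band 250-2000 Hz as the input range.
f_min, f_mean, f_max = 250.0, 900.0, 2000.0

r = scale(f_max, 250, 2000)
g = scale(f_mean, 250, 2000)
b = scale(f_min, 250, 2000)

# colorsys works on 0-1 values and uses HLS ordering (hue, lightness, sat).
h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
print(round(r), round(g), round(b))  # 255 148 85
```

Clamping the lower bound at 85 rather than 0 keeps every channel reasonably bright, so the video never goes fully dark on any channel.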

The second feature demonstrated in this patch is an addition to the prototype that was not discussed earlier: it extracts the amplitude of live audio received from the microphone and rescales the live video captured by the camera. As mentioned previously, when a sound source is detected, it is usually in the time domain, so extracting the amplitude requires no additional computation.

The system diagram for this feature is as follows:

1. ezadc~ is used to receive the live sound source from the microphone.

2. This audio is then passed through cascade~, a user-customisable series of biquad filters, to clean up the sound to the user's preference.

3. The peak amplitude of this sound is extracted using peakamp~.

4. This value is then scaled up and added to the originally defined size of the video shown in jit.pwindow.
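The amplitude-to-size mapping in steps 3 and 4 can be sketched as follows; the base size and scaling factor are illustrative assumptions, with peak_amplitude standing in for the value peakamp~ reports in Max:

```python
# A sketch of growing the video window in proportion to the peak amplitude.
# BASE_WIDTH, BASE_HEIGHT and SCALE_FACTOR are assumed values, not taken
# from the patch.

BASE_WIDTH, BASE_HEIGHT = 320, 240   # assumed original jit.pwindow size
SCALE_FACTOR = 200                   # assumed gain applied to the peak

def scaled_size(peak_amplitude: float) -> tuple[int, int]:
    """Return a window size enlarged by the peak amplitude (0.0 to 1.0)."""
    offset = int(peak_amplitude * SCALE_FACTOR)
    return BASE_WIDTH + offset, BASE_HEIGHT + offset

print(scaled_size(0.0))   # (320, 240) - silence leaves the size unchanged
print(scaled_size(0.5))   # (420, 340) - louder input enlarges the video
```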

Finally, using a crossover design in Max 8, the two features are combined, and the output is demonstrated in the video above.

 Download the Max 8 patch  

Future Implementation

Because this is a student project, its scope is limited. However, some thought has been put into future perspectives.

The first idea involves adding a mirror as the input and output. A smart mirror would be added to the prototype to replace the computer, turning it into a play mirror. The smart mirror would have a built-in camera and speaker to record the video and audio of the user and display the distorted image. The user can sing, shout or speak to the mirror and see a distorted self inside it.

The second idea is a possible expansion of the prototype: turning it into a live, interactive art piece.

In this setup, the user sits in front of a camera and a monitor. The camera captures live video and audio from the user, and the monitor outputs video and sound back to them. The user is asked to repeat the tongue twister displayed on the monitor as closely as possible, and the computer constantly checks for mistakes. While the user repeats the tongue twister, the projector shows a live feed of the user's face on an active canvas behind them. Whenever a mistake is spotted, it distorts the displayed image and moves the canvas with actuators. The spectators can see the live changes to the canvas, and the user is shown a replay of their attempt and the distortion afterwards.