A team of researchers at the UW has developed an artificial intelligence (AI) framework named Audeo that generates music from silent video recordings of piano performances.
The research was conducted by doctoral students Kun Su and Xiulong Liu, and assistant professor of applied mathematics and electrical and computer engineering Eli Shlizerman. Audio samples produced by Audeo were recognized by the music identification software SoundHound 85.6% of the time, compared with a 92.6% recognition rate for the original audio.
To reproduce the audio at this high level of precision, Audeo processes the visual input in three stages, as outlined in the paper.
First, a neural network processes multiple consecutive video frames to detect which keys are pressed in the middle frame, repeating this for the duration of the video.
The second stage corrects errors from the first stage and fills in other details, such as the eventual decay of the sound produced when a key is sustained for a long time.
“Music is much faster and more fine than the visual input, which means that there are many details in between the frames that we need to guess,” Shlizerman said.
Once this representation of the audio is complete, musical instrument digital interface (MIDI) synthesizers convert the data into music.
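The three stages above can be sketched in simplified form. This is a toy illustration, not the team's implementation: the real system uses trained neural networks, while here the stage-one detector is a dummy stand-in, the stage-two cleanup is a simple run-length filter, and the stage-three output is a list of note events rather than actual synthesized audio. Function names and parameters are invented for illustration.

```python
import numpy as np

NUM_KEYS = 88  # keys on a standard piano

def detect_keys(frames):
    """Stage 1 (stand-in): map each video frame to per-key press
    probabilities. A dummy random output replaces the trained network
    that inspects consecutive frames around each middle frame."""
    num_frames = frames.shape[0]
    rng = np.random.default_rng(0)
    return rng.random((num_frames, NUM_KEYS))  # shape (T, 88)

def correct_roll(probs, threshold=0.9, min_len=2):
    """Stage 2 (stand-in): binarize the probabilities into a piano roll
    and drop spurious detections shorter than min_len frames. (The
    paper's correction stage does far more, e.g. modeling sound decay.)"""
    roll = (probs > threshold).astype(int)
    for k in range(NUM_KEYS):
        col = roll[:, k]
        start = None
        for t in range(len(col) + 1):
            on = t < len(col) and col[t]
            if on and start is None:
                start = t
            elif not on and start is not None:
                if t - start < min_len:
                    col[start:t] = 0  # erase a too-short run
                start = None
    return roll

def roll_to_note_events(roll, fps=25):
    """Stage 3 (stand-in): convert the cleaned piano roll into
    (key, onset_seconds, offset_seconds) events of the kind a MIDI
    synthesizer would render into sound."""
    events = []
    for k in range(NUM_KEYS):
        col = roll[:, k]
        start = None
        for t in range(len(col) + 1):
            on = t < len(col) and col[t]
            if on and start is None:
                start = t
            elif not on and start is not None:
                events.append((k, start / fps, t / fps))
                start = None
    return events

# Run the toy pipeline on 10 dummy video frames.
frames = np.zeros((10, 4, 4))          # (T, H, W) grayscale placeholder
probs = detect_keys(frames)            # stage 1
roll = correct_roll(probs)             # stage 2
events = roll_to_note_events(roll)     # stage 3
```

Each stage consumes the previous stage's output, mirroring the article's description: frames in, a corrected key-press representation in the middle, and synthesizer-ready note data out.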
While Audeo was primarily tested on videos recorded by pianist Paul Barton, Shlizerman said the team plans to make the system work “in the wild,” meaning it would be adaptable to any other pianist and even other instruments. This requires training the system on a larger dataset.
Additionally, Shlizerman said the team hopes to make Audeo fast enough to be used in real time. He proposed the idea of a virtual piano, where someone without access to a piano could simulate the experience by using the technology to translate their hand movements into the sound they would produce on a real keyboard.
“We are not there yet, but I feel like this is really exciting — this is a new experience that we could have in the virtual world and a new way to interact with music,” Shlizerman said.
Currently, the AI-generated music sounds, as expected, more robotic than a human performance, but the technology's ability to capture the essence of the piece — the sequence of keys pressed — signifies a step forward in uniting the visual and audio streams, according to Shlizerman.
Shlizerman pointed out that professional film editors approach video and audio editing as separate processes, despite the two modalities being deeply interrelated. He suggested that AI technology could be used for generating music to accompany visual scenes, like adding soundtracks to a film.
Though his team is specifically focusing on synthesizing music, Shlizerman said that Audeo could inspire similar technologies for generating speech or other types of audio.
“The visual-audio space is quite new, in terms of computational capabilities,” Shlizerman said. “Some automatic tools are appearing in both spaces separately, but when you try to combine them, it’s a much harder task … we are showing that it is possible, and the next step will be all the cool applications that come out.”
Reach contributing writer Anna Wang at email@example.com. Twitter: @annaw_ng