Neural networks are already beating us at games, organizing our smartphone photos, and answering our emails. Eventually, they could be filling jobs in Hollywood.
Over at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), a team of six researchers created a machine-learning system that matches sound effects to video clips. Before you get too excited, the CSAIL algorithm can't do its audio work on any old video, and the sound effects it produces are limited. For the project, CSAIL PhD student Andrew Owens and postgrad Phillip Isola recorded videos of themselves whacking a bunch of things with drumsticks: stumps, tables, chairs, puddles, banisters, dead leaves, the dirty ground.
The team fed that initial batch of 1,000 videos through its AI algorithm. By analyzing the physical appearance of objects in the videos, the movement of each drumstick, and the resulting sounds, the computer was able to learn connections between physical objects and the sounds they make when struck. Then, by "watching" different videos of objects being whacked, tapped, and scraped by drumsticks, the system was able to calculate the appropriate pitch, volume, and aural properties of the sound that should accompany each clip.
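To make that training idea concrete, here's a minimal sketch, assuming made-up stand-in data and a deliberately simple linear predictor rather than the team's deep networks: pair the visual features of each frame with the sound features recorded at the same moment, then fit a model that maps one to the other.

```python
# A toy illustration of the learning setup, not the team's code: every
# variable below is a hypothetical stand-in for the real video and audio data.
import numpy as np

rng = np.random.default_rng(0)

# Each row pairs one frame's visual features (object appearance, drumstick
# motion) with the sound features (pitch, volume, etc.) heard at that moment.
visual_features = rng.normal(size=(1000, 64))                  # 1,000 frames
true_mapping = rng.normal(size=(64, 8))
sound_features = visual_features @ true_mapping + 0.1 * rng.normal(size=(1000, 8))

# Fit a least-squares predictor; the real system trains deep networks, but the
# objective is the same flavor: minimize the gap between predicted and
# observed sound features.
weights, *_ = np.linalg.lstsq(visual_features, sound_features, rcond=None)

# Predict sound features for frames from a new, silent clip.
new_frames = rng.normal(size=(5, 64))
predicted_sound = new_frames @ weights
print(predicted_sound.shape)  # (5, 8): one sound-feature vector per frame
```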
The algorithm doesn't make its own sounds; it pulls from a database of tens of thousands of audio clips. The clip it chooses doesn't have to come from the same kind of object being struck, either: as you can see around the 1:20 mark of the video above, the algorithm gets creative, selecting sound effects as varied as a rustling plastic bag and a smacked stump for a sequence in which a shrub gets a thorough drumsticking.
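Here's what that retrieval step might look like, sketched with hypothetical names (the real system matches much richer audio features than the toy vectors here): predict sound features for the video, then grab whichever database clip sits closest.

```python
# Hedged sketch of retrieval: instead of synthesizing a waveform, pick the
# stored clip whose sound features best match the prediction for the video.
import numpy as np

rng = np.random.default_rng(1)

database_features = rng.normal(size=(30000, 8))            # tens of thousands of clips
database_clips = [f"clip_{i}.wav" for i in range(30000)]   # placeholder filenames

def best_matching_clip(predicted_features: np.ndarray) -> str:
    """Return the clip whose features are closest to the predicted ones."""
    distances = np.linalg.norm(database_features - predicted_features, axis=1)
    return database_clips[int(np.argmin(distances))]

print(best_matching_clip(rng.normal(size=8)))  # prints one of the placeholder clip names
```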
Owens says the research team used a convolutional neural network to analyze the video frames and a recurrent neural network to predict the sound that should accompany them. They leaned heavily on the Caffe deep-learning framework, and the project was funded by the National Science Foundation and Shell. One of the team members works for Google Research, and Owens was part of the Microsoft Research fellowship program.
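For the curious, here is a minimal sketch of that kind of architecture in PyTorch; the team built theirs on Caffe, and the layer sizes and 8-dimensional sound features below are illustrative guesses, not the published design. A CNN summarizes each frame, and an LSTM reads those summaries in order to predict per-frame sound features.

```python
# Illustrative CNN-plus-RNN sound predictor (assumed shapes, not the paper's).
import torch
import torch.nn as nn

class SoundPredictor(nn.Module):
    def __init__(self, sound_dim: int = 8):
        super().__init__()
        # CNN: turn each RGB frame into a compact feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # RNN: track how those per-frame features change over time.
        self.rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, sound_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        per_frame = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        hidden, _ = self.rnn(per_frame)
        return self.head(hidden)   # (batch, time, sound_dim)

model = SoundPredictor()
dummy_clip = torch.randn(1, 10, 3, 64, 64)  # one 10-frame, 64x64-pixel clip
print(model(dummy_clip).shape)              # torch.Size([1, 10, 8])
```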
"We're mostly applying existing techniques in deep learning to a new domain," Owens says. "Our goal isn't to develop new deep learning methods."
Matching realistic sounds to video has primarily been the domain of Foley artists---the post-production audio wizards who record the footsteps, door creaks, and flying roundhouse kicks you see (and hear) in a polished Hollywood movie. A skilled Foley artist can make a sound that precisely matches the visual, fooling the viewer into thinking that the sound was captured on the set.
MIT's bot isn't nearly that adept. The research team ran an online study in which 400 participants were shown two versions of the same video, one with the original audio and one with the algorithm-generated sounds, and asked to pick which had the real sound. The fake audio was selected 22 percent of the time. That's very far from perfect (a sound indistinguishable from the real thing would be picked about half the time), but still twice as effective as a previous version of the algorithm.
According to Owens, those test results are a good sign that the computer-vision algorithm can detect the materials an object is made of, as well as the different physics of tapping, whacking, and scraping an object. Still, certain things tripped the system up. Sometimes it thought the drumstick was striking an object when it actually wasn't, and its fake sounds for soft materials like leaves and dirt fooled more people than its fake sounds for harder, more solid objects.
The project has a deeper purpose than fun sound effects. Owens thinks that, if perfected, the computer-vision tech could help robots identify the materials and physical properties of an object by analyzing the sounds it makes. "We'd like these algorithms to learn by watching this physical interaction occur and observing the response," Owens says. "Think of it as a toy version of learning about the world the way that babies do, by banging, stomping, and playing with things."