Of the many tools available to spies, among the most impressive are the ones that detect how windows vibrate. Why? Because that’s how you listen in on someone without having to bug them.
Whether it’s lasers, ultrasound or radio waves, the idea is the same – someone speaking in a room creates a noise, a vibration in the air. Those vibrations move through the room towards the periphery, where they bounce off the walls and windows. The resulting movements are tiny, on the scale of a hundredth of a millimetre, but they’re large enough to disturb the way that other light or sound waves bounce off the window from the other side. A spy can aim one of these devices at a hotel window from across the street, and piece together a reasonably good (if often very low fidelity) recording of what’s happening inside the room.
Researchers at MIT, however, have managed to go one further than this – they’ve done it with a video camera, recording everyday objects like crisp packets and the leaves on a houseplant at high speed. Vibrations on the scale of a five-thousandth of a pixel can be picked up after controlling for background “noise” with a suitable algorithm, giving a decent recording of the noise in the room at the time the video was made. In the announcement of the paper, due to be presented at computer graphics conference Siggraph next week, electrical engineer and lead author Abe Davis said:
“When sound hits an object, it causes the object to vibrate. The motion of this vibration creates a very subtle visual signal that’s usually invisible to the naked eye. People didn’t realize that this information was there.”
Using cameras that record between 2,000 and 3,000 frames per second, the team – led by Davis, and also including researchers from Microsoft and Adobe – tested out indirect listening with a range of different materials. With “Mary Had A Little Lamb” played (or its lyrics spoken) near each object, footage from the high-speed cameras could be used to retrieve impressively audible sound clips. With the crisp packet, the team even had it on the other side of a piece of soundproof glass 15 feet away, and they were still able to pick up the audio. They even managed to pick up “Ice Ice Baby” from a pair of headphones plugged into a laptop, with Shazam able to identify the song from the low-quality reconstruction.
For those examples, the frequency of the audio signal being decoded had to be lower than the number of frames per second that the camera was recording at. Anything higher will be lost in the gaps between images. However, the team also succeeded in recreating sound from video clips recorded with basic consumer cameras – ones that record at a mere 60 frames per second – by taking advantage of a quirk of how their shutters work.
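To make the frame-rate limit concrete, here is a minimal Python sketch of what happens when a tone is sampled once per frame. It is an illustration of standard sampling theory, not the MIT team’s code: the function names are made up for this example, and it uses the stricter textbook rule (the Nyquist limit) that a tone must sit below half the frame rate to be captured faithfully.

```python
def nyquist_limit(frame_rate_hz: float) -> float:
    """Highest tone a camera sampling once per frame can capture faithfully."""
    return frame_rate_hz / 2.0


def is_recoverable(signal_hz: float, frame_rate_hz: float) -> bool:
    """True if the tone sits below the Nyquist limit for this frame rate."""
    return signal_hz < nyquist_limit(frame_rate_hz)


def apparent_frequency(signal_hz: float, frame_rate_hz: float) -> float:
    """Frequency the tone *appears* to have after per-frame sampling.

    Tones above the Nyquist limit fold back ("alias") to a lower
    frequency: the distance to the nearest multiple of the frame rate.
    """
    nearest_multiple = round(signal_hz / frame_rate_hz) * frame_rate_hz
    return abs(signal_hz - nearest_multiple)


if __name__ == "__main__":
    tone = 440.0  # hypothetical test tone, in Hz
    for fps in (2000.0, 60.0):
        print(
            f"{fps:.0f} fps: recoverable={is_recoverable(tone, fps)}, "
            f"appears as {apparent_frequency(tone, fps):.0f} Hz"
        )
```

Under these assumptions, a 440 Hz tone survives intact at 2,000 frames per second, but at 60 frames per second it masquerades as a much lower tone, which is why an ordinary camera needs a different trick.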
Unlike cameras with physical film, which expose the entire negative to light simultaneously, most consumer digital cameras use what’s called a “rolling shutter”. They break the image down into individual pixels, and record the image sequentially – that is, they start at the top left, move along the row, then do the next row left to right, then the next one, and so on. The whole process for a camera with several million pixels takes a fraction of a second, but it can have weird side effects if the camera is taking a photo of something that moves faster than its shutter speed. It’s a bit like how the human eye, watching a car tyre, will see the hubcap rotate one way up to a certain speed, but then see it strangely slow down and appear to rotate in the opposite direction to the way it’s actually moving. (Here’s one of several videos on YouTube showing a similar effect, with an iPhone struggling to record a strobe light.)
Imagine the straight edge of a crisp packet, in front of a dark background – with a rolling shutter, that straight line will appear slightly curved as it vibrates from nearby sound. Using that knowledge (derived from an earlier MIT study which found that video of someone’s face could be used to determine their pulse) it’s possible to derive a recording of sounds up to five times the frequency that the camera should be able to pick up. The resulting sound reconstructions are significantly poorer than those made with the high-speed camera, yet still very much discernible.
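The curved-edge effect can be sketched in a few lines of Python. This is a toy model rather than the researchers’ method: the function names, the 1,080-row sensor, the 300 Hz tone and the 0.01-pixel vibration are all assumptions chosen for illustration, and the readout is assumed to take the whole frame period. The point is simply that each row of a rolling-shutter frame is captured at a slightly different moment, so a single frame holds many time samples of the vibrating edge.

```python
import math


def row_capture_times(frame_index, fps, rows, readout_fraction=1.0):
    """Capture time (seconds) of each row in one rolling-shutter frame.

    readout_fraction is the (assumed) share of the frame period spent
    reading rows; 1.0 means rows are read back to back all frame long.
    """
    frame_start = frame_index / fps
    line_time = readout_fraction / (fps * rows)
    return [frame_start + row * line_time for row in range(rows)]


def edge_offset(t, tone_hz, amplitude_px):
    """Sideways displacement (pixels) of the edge at time t, for a pure tone."""
    return amplitude_px * math.sin(2 * math.pi * tone_hz * t)


def rolling_shutter_edge(frame_index, fps, rows, tone_hz, amplitude_px):
    """Edge position seen by each row: every row samples a different instant."""
    times = row_capture_times(frame_index, fps, rows)
    return [edge_offset(t, tone_hz, amplitude_px) for t in times]


def global_shutter_edge(frame_index, fps, rows, tone_hz, amplitude_px):
    """Edge position when the whole frame is exposed at the same instant."""
    offset = edge_offset(frame_index / fps, tone_hz, amplitude_px)
    return [offset] * rows


if __name__ == "__main__":
    rolling = rolling_shutter_edge(0, fps=60, rows=1080, tone_hz=300, amplitude_px=0.01)
    frozen = global_shutter_edge(0, fps=60, rows=1080, tone_hz=300, amplitude_px=0.01)
    print("rolling shutter edge wobble (px):", max(rolling) - min(rolling))
    print("global shutter edge wobble (px): ", max(frozen) - min(frozen))
```

In this toy model the global-shutter frame shows a perfectly straight edge, while the rolling-shutter frame records the edge wobbling from row to row, which is the extra timing information a 60 frames-per-second camera would otherwise throw away.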
We’re still a way away from seeing any consumer implications for this research, but the espionage and law enforcement communities are likely to be interested – just imagine how useful it would be if it were suddenly possible to recover audio from silent CCTV footage.
The MIT team has released a video demonstrating its discovery. It’s well worth a watch: