Perceptual Bases for Virtual Reality: Part 1, Audio

An important part of creating a truly immersive VR experience is the accurate representation of sounds in space to the user. If a sound source is in motion in virtual space, it stands to reason that we ought to hear the sound source moving.

One solution to this problem is to use an array of loudspeakers arranged in space around the user. This technique, so-called ‘ambisonics’, is not only expensive, but also requires far more space than the footprint of the average user seated at a consumer-grade computer. For example, Tod Machover’s (MIT) setup is shown below, and is typical of some ambisonic setups. The 5.1 standard for surround sound in home theatres (or related extensions, such as 7.1, meaning 7 speakers plus a subwoofer) is consumer-grade technology which operates on a similar principle. Clever mixing and editing of movie soundtracks aims to trick the listener into perceiving tighter sound-image associations: a sound whose source is apparent from the visual content being displayed or projected is placed at the location in the sonic field that corresponds to its virtual source.

Ambisonic sound setup with a circular array of Bowers and Wilkins loudspeakers surrounding a listener
Tod Machover’s Ambisonic Setup (Source:

It might seem counterintuitive, but most of the psycho-acoustical cues that humans use to localize sounds in space can be replicated using headphones. This follows from the unsubtle observation that we have only two ears, and the slightly more subtle reflection on the results of experiments designed to establish precisely which sources of information our brains depend on in determining the perceived location of a sound source. This behavior is known in the related psychological literature as acoustic (or sound) source localization.

Jobbing programmers, however, don’t have to wade through the reams of scientific research that substantiate the details of the various mechanisms of acoustic source localization, as well as their limitations and contingencies. The 3D Audio Rendering and Evaluation Guidelines (Level 1) spec provides baseline requirements for a minimally convincing 3D audio rendering, and provides physiological and psychological justifications for these requirements. Whilst it is considerably outdated, it still provides a useful overview of the important perceptual bases for VR audio simulation. In particular, this specification is one of the motivating documents in the design of the (erstwhile) open source OpenAL 3D audio API and its descendants. In the remainder of this post, I briefly describe the most important binaural (i.e. two-ear) audio cues which are thought to facilitate acoustic source localization in the human brain.

Interaural Intensity Difference

In plain terms: the intensity of the sound entering your ears will be different for each ear, depending on the location of the sound with respect to your head. This is due to two factors:

  1. sound attenuates in intensity with distance as it passes through a medium, and your ears are a non-zero distance apart, so the ear farther from the source receives a slightly weaker signal
  2. (more significantly) your head may ‘shadow’ the source of the sound when the source is off-center

You might think that you don’t have a big head, but it’s big enough to make a difference!
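
To get a feel for the size of the first factor, here is a minimal sketch that computes the level difference between the ears from distance alone. It assumes a point source, inverse-square (spherical) spreading, and an ear spacing of 0.18 m; the function names and the coordinate convention are my own illustrative choices, not something taken from the spec. It deliberately ignores head shadowing, which depends on frequency and needs a far more detailed model.

```python
import math

# Assumed distance between the ears, in metres (illustrative value).
EAR_SPACING_M = 0.18


def ear_distances(source_x, source_y, ear_spacing=EAR_SPACING_M):
    """Return (left, right) distances in metres from a point source to each ear.

    Convention (an assumption for this sketch): the head is centred on the
    origin, the ears lie on the x axis, and positive x is the listener's right.
    """
    half = ear_spacing / 2.0
    left = math.hypot(source_x + half, source_y)
    right = math.hypot(source_x - half, source_y)
    return left, right


def distance_level_difference_db(source_x, source_y, ear_spacing=EAR_SPACING_M):
    """Level at the right ear minus level at the left ear, in dB.

    Uses inverse-square spreading only: intensity is proportional to 1/d^2,
    so the difference is 10*log10(d_left^2 / d_right^2) = 20*log10(d_left / d_right).
    A source placed exactly at an ear position is not handled.
    """
    left, right = ear_distances(source_x, source_y, ear_spacing)
    return 20.0 * math.log10(left / right)


if __name__ == "__main__":
    for x in (1.0, 10.0):
        diff = distance_level_difference_db(x, 0.0)
        print(f"source {x:4.1f} m to the right: {diff:.2f} dB")
```

Under these assumptions, a source one metre directly to the right yields a difference of about 1.5 dB, and the figure shrinks quickly as the source moves away. That is consistent with the point above that head shadowing is the more significant of the two factors.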

Interaural Time Difference

Since sound travels at a roughly constant speed through the most common media that we may wish to model virtually, the time that it takes for sound to propagate from the source to one ear differs, very slightly, from the time that it takes to reach the other ear whenever the source is off-center. Our brains are sensitive to these differences, perhaps owing to the evolutionary utility of knowing the location of noisy predators (or prey). Knowing that the speed of sound is roughly constant, the brain performs a rudimentary triangulation in order to locate the sound source in the relevant plane, relative to the listener.
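
To put rough numbers on this, the sketch below uses Woodworth's classic spherical-head approximation, ITD = (r / c)(θ + sin θ), where θ is the azimuth of a distant source. This is one common textbook model rather than anything prescribed by the spec discussed above, and the head radius (0.0875 m), the speed of sound (343 m/s), and the function names are all assumptions chosen for illustration.

```python
import math

# Assumed values for this sketch (not taken from the original post).
HEAD_RADIUS_M = 0.0875        # a commonly quoted average head radius
SPEED_OF_SOUND_M_S = 343.0    # dry air at roughly 20 degrees Celsius


def itd_seconds(azimuth_deg,
                head_radius=HEAD_RADIUS_M,
                speed_of_sound=SPEED_OF_SOUND_M_S):
    """Interaural time difference for a distant source (Woodworth's model).

    azimuth_deg: 0 is straight ahead, positive is to the listener's right.
    The approximation is only used here for azimuths in [-90, 90].
    A positive result means the sound reaches the right ear first.
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must be between -90 and 90 degrees")
    theta = math.radians(azimuth_deg)
    return (head_radius / speed_of_sound) * (theta + math.sin(theta))


def itd_samples(azimuth_deg, sample_rate_hz=48000):
    """The same delay expressed as a whole number of audio samples."""
    return round(itd_seconds(azimuth_deg) * sample_rate_hz)


if __name__ == "__main__":
    for azimuth in (0, 30, 60, 90):
        delay_us = itd_seconds(azimuth) * 1e6
        print(f"{azimuth:3d} deg: {delay_us:6.1f} us, {itd_samples(azimuth)} samples")
```

With these assumed values, the largest delay (a source directly to one side) comes out at roughly 0.66 ms, or about 31 samples at 48 kHz. A headphone renderer could approximate the cue by delaying the far-ear channel by that many samples.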

Audio-Visual Synergy

Finally, a less physiological cue: the co-incidence of aural and visual stimuli tricks the brain into attributing the contemporaneous sound to the source denoted or signified by the visual stimulus. By keeping latency between aural and visual stimuli low, we improve the likelihood of the perception of audio-visual synergy. This, in combination with the careful modeling of the above phenomena (amongst many others), contributes towards a more immersive aural experience. In turn, this improves the credibility of VR simulations that have an aural component.
