Why is it so hard to make unintelligible audio intelligible?


When we listen to speech in a noisy environment, our ears filter out the noise and allow us to concentrate on the flow of words produced by the speaker.

This is called the cocktail party effect, and like many other amazing feats of our auditory system, it goes completely unnoticed by our conscious minds. We believe we hear the speech as separate from the background noise because it really is separate from the background noise, giving no credit to the role our minds have played in separating it out for us to listen to. But there’s more to that separation than meets the ear.

A microphone can’t do what our ears (and minds) do so effortlessly: pay attention to a particular stream of sound and ignore the rest.

Unless we deliberately record the background noise and the speech to different tracks (impossible to do covertly), the speech and the noise get mixed into a single stream of sound.
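To see why a single stream is so hard to pull apart again, here is a minimal Python sketch. It assumes a simplified model in which the microphone just adds the sources together sample by sample (real rooms and microphones also add echo and distortion). The pure tones standing in for 'speech', the sample rate, and all function names are illustrative assumptions, not real speech data or any particular tool's method.

```python
import math
import random

SAMPLE_RATE = 8000  # samples per second (illustrative value)


def tone(freq_hz, n_samples, amplitude=1.0):
    """Stand-in for 'speech': a pure tone, used only so the signal is known."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


def white_noise(n_samples, amplitude=1.0, seed=0):
    """Stand-in for background noise: seeded random samples."""
    rng = random.Random(seed)
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def mix(a, b):
    """Simplified single-microphone model: one number per sample, the sum."""
    return [x + y for x, y in zip(a, b)]


speech = tone(200, 1000)
noise = white_noise(1000, amplitude=2.0)
recording = mix(speech, noise)

# A completely different 'speech' signal, paired with whatever 'noise'
# is left over, adds up to the very same recording.
other_speech = tone(300, 1000)
other_noise = [r - s for r, s in zip(recording, other_speech)]
rebuilt = mix(other_speech, other_noise)

# Largest sample-by-sample difference between the two recordings
# (zero apart from floating-point rounding).
max_diff = max(abs(r - q) for r, q in zip(recording, rebuilt))
print("largest difference between the two mixes:", max_diff)
```

The recording on its own does not say which of these two splits is the 'right' one. Any separation has to bring in outside assumptions about what speech and noise are like, which is what our minds do without our noticing.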

Up to a point, our minds can still do a surprisingly good job of following the speech and ignoring the noise, but it is far harder with a recording of a cocktail party than at a real cocktail party. You may have noticed this if you have ever made a recording of a conversation: speech that seemed perfectly clear at the time is hard to follow in the recording.

What can audio engineers do?

The siren is the strong line like a row of mountains running left to right

Audio engineers can do a great job of ‘cleaning up’ noisy recordings. Every year their tools and skills get more sophisticated. Here’s an example from a tutorial showing how to remove the sound of a passing siren from a street recording.

Now, removing that siren noise is certainly an impressive achievement with many practical applications. However, there is one very important thing to notice. The siren may be annoying or unpleasant to listen to, but it does not affect the intelligibility of the speech.

In general, the background sounds that can be removed from a recording are sounds that do not make the recording unintelligible. In fact, it is rarely, if ever, possible to make unintelligible audio intelligible – especially if the ‘ground truth’ of what was said is not known with certainty.

Here is a visual demonstration to show why it is so difficult to make unintelligible audio intelligible. The spectrograms below represent 2-second excerpts from different recordings of different quality. The visual clarity of each image is a good analogue of the audio clarity.

Think about what you would have to do to the second image to make it similar in clarity to the first (and recall the photography example we saw earlier).

Good quality recording, easy to understand
Barely intelligible covert recording
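For readers curious about what a spectrogram actually is, the following Python sketch computes one in the simplest possible way: it cuts the signal into short overlapping frames and measures how much energy each frame has at each frequency. This is only an illustration under stated assumptions. The frame length, hop size, test tone and function names are all hypothetical choices, and real audio tools use much faster and more refined methods.

```python
import cmath
import math


def hann(n):
    """Hann window: fades each frame in and out to reduce edge artefacts."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitude at each frequency bin from 0 up to half the sample rate."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_len=64, hop=32):
    """One list of magnitudes per time frame (one column of the image)."""
    window = hann(frame_len)
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        columns.append(dft_magnitudes(frame))
    return columns


# Illustrative input: a steady 1000 Hz tone sampled 8000 times per second.
sample_rate = 8000
frame_len = 64
signal = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(512)]

spec = spectrogram(signal, frame_len=frame_len, hop=32)

# The strongest bin in the first frame, converted back to a frequency in Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_len
print("frames:", len(spec), "bins per frame:", len(spec[0]), "peak Hz:", peak_hz)
```

A steady tone shows up as a single strong horizontal line, much like the siren in the earlier image. Speech is far more complicated: its energy is spread across many frequencies that change from moment to moment, so when noise covers the same region there is no clean line to pick out.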

A useful analogy

It gets technical to explain the nitty gritty of ‘enhancing’ and its problems, but there is a very useful analogy (discussed in heaps more detail in Rethink Speech 101: Unlearning) that can help courts make better judgments about the effects of enhancing.

According to ‘common knowledge’, speech is a sequence of discrete (separate) words, each made up of a sequence of discrete sounds (‘phonemes’), a bit like the print on this page.

On that assumption, a poor quality recording might look something like this:

degraded print

Even with half the print blotted out, it is not difficult to discern what it might say. It is easy to imagine how an audio engineer could fill in the missing sections and make the whole thing far more intelligible than it presently is.

But this ‘common knowledge’ is wrong!

In reality, speech is nothing like print, as we demonstrate in detail in Observing Speech and other modules of Rethink Speech 101: Unlearning.

A far better (though still imperfect) analogy is that the clearest of speech is like very messy handwriting with no gaps between words. As shown by the examples below, similar levels of blotting out have very different effects on handwriting. Do you think you could reconstruct the text in these examples with the same degree of confidence as the examples above?

degraded handwriting1

degraded handwriting2

In fact, audio engineers are rarely able to provide any objective improvement in the intelligibility of poor quality audio recordings. Here’s a before-and-after pair for you to evaluate yourself: can you tell which has been ‘enhanced’ and which is the original? Can you make out the words in either of them?

Unfortunately, ‘enhancing’ of this kind is worse than a waste of time. When listeners believe audio has been ‘enhanced’, they are more likely to trust their perception – even if that perception is demonstrably inaccurate.

This is just one reason why it is useful to replace the ‘speech is like print’ analogy that pervades the ‘common knowledge’ used as the basis of legal decisions about speech evidence.