Our brains have a remarkable ability to pick out one voice from among many.
Now, a team of Columbia University neuroengineers has uncovered the steps that take place in the brain to make this feat possible.
Today’s discovery helps solve a long-standing scientific question: how the auditory cortex, the brain’s listening center, can decode and amplify one voice over others at lightning-fast speeds.
This new-found knowledge also stands to spur development of hearing-aid technologies and brain-computer interfaces that more closely resemble the brain.
These findings were reported today in Neuron.
“Our capacity to focus in on the person next to us at a cocktail party while eschewing the surrounding noise is extraordinary, but we understood so little about how it all works,” said Nima Mesgarani, PhD, the paper’s senior author and a principal investigator at Columbia’s Mortimer B. Zuckerman Mind Brain Behavior Institute.
“Today’s study brings that much-needed understanding, which will prove critical to scientists and innovators working to improve speech and hearing technologies.”
The auditory cortex is the brain’s listening hub.
The inner ear sends this brain region electrical signals that represent a jumble of sound waves from the external world.
The auditory cortex must then pick out meaningful sounds from that jumble.
“Studying how the auditory cortex sorts out different sounds is like trying to figure out what is happening on a large lake — in which every boat, swimmer and fish is moving, and how quickly — by only having the patterns of ripples in the water as a guide,” said Dr. Mesgarani, who is also an associate professor of electrical engineering at Columbia Engineering.
Today’s paper builds on the team’s 2012 study showing that the human brain is selective about the sounds it hears.
That study revealed that when a person listens to someone talking, their brain waves change to pick out features of the speaker’s voice and tune out other voices.
The researchers wanted to understand how that happens within the anatomy of the auditory cortex.
“We’ve long known that areas of auditory cortex are arranged in a hierarchy, with increasingly complex decoding occurring at each stage, but we haven’t observed how the voice of a particular speaker is processed along this path,” said James O’Sullivan, PhD, the paper’s first author who completed this work while a postdoctoral researcher in the Mesgarani lab.
“To understand this process, we needed to record the neural activity from the brain directly.”
The researchers were particularly interested in two parts of the auditory cortex’s hierarchy: Heschl’s gyrus (HG) and the superior temporal gyrus (STG).
Information from the ear reaches HG first, passing through it and arriving at STG later.
To understand these brain regions, the researchers teamed up with neurosurgeons Ashesh Mehta, MD, PhD, Guy McKhann, MD, and Sameer Sheth, MD, PhD, neurologist Catherine Schevon, MD, PhD, as well as fellow co-authors Jose Herrero, PhD and Elliot Smith, PhD. Based at Columbia University Irving Medical Center and Northwell Health, these doctors treat epilepsy patients, some of whom must undergo regular brain surgeries. For this study, patients volunteered to listen to recordings of people speaking while Drs. Mesgarani and O’Sullivan monitored their brain waves via electrodes implanted in the patients’ HG or STG regions.
The electrodes allowed the team to identify a clear distinction between the two brain areas’ roles in interpreting sounds.
The data showed that HG creates a rich and multi-dimensional representation of the sound mixture, whereby each speaker is separated by differences in frequency.
This region showed no preference for one voice or another. However, the data gathered from STG told a distinctly different story.
“We found that it’s possible to amplify one speaker’s voice or the other by correctly weighting the output signal coming from HG. Based on our recordings, it’s plausible that the STG region performs that weighting,” said Dr. O’Sullivan.
Image: a visualization of brain activity in a multi-speaker environment.
Taken together, these findings reveal a clear division of duties between these two areas of auditory cortex: HG represents, while STG selects. It all happens in around 150 milliseconds, which seems instantaneous to a listener.
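The division of labor described above (HG represents, STG selects) can be caricatured in a few lines of code. The following is a toy sketch, not the authors' model: it assumes, purely for illustration, that two speakers occupy disjoint frequency channels and that selection amounts to binary attention weights on those channels.

```python
import numpy as np

# Toy sketch (not the authors' model): "HG" is a bank of frequency
# channels responding to a two-speaker mixture; "STG" then applies
# attention weights to those channels to amplify the attended speaker.
rng = np.random.default_rng(0)
n_samples, n_channels = 1000, 8

# Simplifying assumption: each speaker occupies a disjoint subset of channels.
speaker_a = np.zeros((n_samples, n_channels))
speaker_b = np.zeros((n_samples, n_channels))
speaker_a[:, :4] = rng.standard_normal((n_samples, 4))
speaker_b[:, 4:] = rng.standard_normal((n_samples, 4))

# "HG" representation: the full mixture, speakers separated by frequency,
# with no preference for either voice.
hg_output = speaker_a + speaker_b

# "STG" selection: weight the channels carrying the attended speaker.
attend_a = np.array([1.0] * 4 + [0.0] * 4)
stg_output = hg_output * attend_a

# Under these assumptions, the weighted output recovers speaker A exactly.
print(np.allclose(stg_output, speaker_a))  # True
```

Real cortical selection is of course far messier (speakers overlap in frequency, and the weighting is learned and dynamic), but the sketch captures the two-stage represent-then-select structure the study reports.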
The researchers also found an additional role for STG. After selection, STG formed an auditory object, a representation of the sound that is analogous to our mental representations of the objects we see with our eyes.
This demonstrates that even when a voice is obscured by another speaker — such as when two people talk over each other — STG can still represent the desired speaker as a unified whole that is unaffected by the volume of the competing voice.
The information gleaned here could be used as the basis for algorithms that replicate this biological process artificially, such as in hearing aids. Earlier this year Dr. Mesgarani and his team announced the development of a brain-controlled hearing aid, which utilizes one such algorithm to amplify the sounds of one speaker over another.
The researchers plan to study HG and STG activity in increasingly complex scenarios that have more speakers or include visual cues.
These efforts will help to create a detailed and precise picture of how each area of the auditory cortex operates.
“Our end goal is to better understand how the brain enables us to hear so well, as well as create technologies that help people — whether it’s so stroke survivors can speak to their loved ones, or so the hearing impaired can converse more easily in a crowded party,” said Dr. Mesgarani. “And today’s study is a critical waypoint along that path.”
Funding: This research was supported by the National Institutes of Health (NIDCD-DC014279, S10 OD018211), The Pew Charitable Trusts and Pew Biomedical Scholars Program.
The authors report no financial or other conflicts of interest.
In everyday life, we are often faced with multiple speaker situations, for instance, when dining in a crowded restaurant or talking to a friend while hearing a radio in the background. Such situations require segregation of speech streams originating from different sources and selection of one of the streams for further processing. The neural mechanisms through which this type of attentional selection is achieved are not yet fully understood (e.g., Rimmele et al., 2015).
A meta-analysis (Alho et al., 2014) of functional magnetic resonance imaging (fMRI) studies on stimulus-dependent sound processing and attentional related modulations in the auditory cortex showed that speech and voice processing activate overlapping areas in the mid superior temporal gyrus and sulcus bilaterally (STG and STS, respectively). Furthermore, selective attention to continuous speech appeared to modulate activity predominantly in the same areas (Alho et al., 2014).
Importantly, selectively attending to a particular speaker in a multi-talker situation results in STG activity that represents the spectral and temporal features of the attended speech, as if participants were listening only to that speech stream (Mesgarani and Chang, 2012).
In other words, the human auditory system restores the representation of an attended speaker while suppressing irrelevant or competing speech.
In addition to STG/STS, selective attention to non-speech sounds engages prefrontal and parietal cortical areas (Alho et al., 1999, Degerman et al., 2006, Tzourio et al., 1997, Zatorre et al., 1999), which has been associated with top-down control needed to select attended sounds and reject irrelevant sounds. Selective attention to continuous speech, however, does not appear to markedly engage prefrontal and superior parietal areas (Alho et al., 2003, Alho et al., 2006, Scott et al., 2004).
This is likely because selective listening to speech is a highly automatized process, less dependent on fronto-parietal attentional control (Alho et al., 2006; see also Mesgarani and Chang, 2012). Such automaticity might be due to listeners’ lifelong experience in listening to speech. However, initial orienting of attention to one of three concurrent speech streams has yielded enhanced activation in the fronto-parietal network, hence, purportedly engaging an attentional top-down control mechanism (Alho et al., 2015, Hill and Miller, 2010).
Natural situations with multiple speakers might not only be complicated by a demand to listen selectively to one speech stream while ignoring competing speech, but also by degraded quality of the attended speech (e.g., when talking in a noisy café on the phone with a poor signal).
Studies addressing the comprehension of degraded (e.g., noise-vocoded) speech involving only one speech stream have reported increased activity in the posterior parietal cortex (Obleser et al., 2007) and frontal operculum (Davis and Johnsrude, 2003) as compared to more intelligible speech. Listening to degraded, yet intelligible and highly predictable speech, in turn, elicits activity in the dorsolateral prefrontal cortex, posterior cingulate cortex, and angular gyrus (e.g., Obleser et al., 2007). Moreover, the amount of spectral detail in speech signal was found to correlate with STS and left inferior frontal gyrus (IFG) activity, regardless of semantic predictability (Obleser et al., 2007).
McGettigan and colleagues (2012) observed increasing activity along the length of left dorsolateral temporal cortex, in the right dorsolateral prefrontal cortex and bilateral IFG, but decreasing activation in the middle cingulate, middle frontal, inferior occipital, and parietal cortices associated with increasing auditory quality. Listening to degraded speech has also activated the left IFG, attributed to higher-order linguistic comprehension (Davis and Johnsrude, 2003) and the dorsal fronto-parietal network, related to top-down control of attention (Obleser et al., 2007).
Overall, increased speech intelligibility enhances activity in the STS (McGettigan et al., 2012, Obleser et al., 2007, Scott et al., 2000), STG (Davis and Johnsrude, 2003), middle temporal gyrus (MTG; Davis and Johnsrude, 2003), and left IFG (Davis and Johnsrude, 2003, McGettigan et al., 2012, Obleser et al., 2007). Enhanced activity in these areas may therefore be related to enhanced speech comprehension with increasing availability of linguistic information.
The studies described above, however, used only single-speaker paradigms. Recently, Evans and colleagues (2016) examined how different masking sounds are processed in the human brain. They used a selective attention paradigm with two speech streams, namely, a masked stream and a target stream. The target speech was always clear, whilst the masked speech was either clear, spectrally rotated or noise-modulated. Increased intelligibility of the masked speech activated the left posterior STG/STS, although less extensively than clear speech presented alone.
This was taken to suggest that syntactic and other higher order properties of masking speech are not actively processed and the masker sounds may be actively suppressed already at early processing stages (see also Mesgarani and Chang, 2012).
In contrast, the masked speech yielded increased activation in the frontal (bilateral middle frontal gyrus, left superior orbital gyrus, right IFG), parietal (left inferior and superior parietal lobule) and middle/anterior cingulate cortices, as well as in the frontal operculum and insula. These activations were suggested to reflect increased attentional and control processes.
The results corroborate those from earlier positron emission tomography (PET) studies (e.g., Scott et al., 2004) on selective attention to a target speaker in the presence of another speaker (speech-in-speech) or noise (speech-in-noise).
More specifically, Scott and colleagues (2004) found more activity in the bilateral STG for speech-in-speech than speech-in-noise, whereas speech-in-noise elicited more activity in the left prefrontal and right parietal cortex than speech-in-speech. Scott and colleagues suggested that these additional areas might be engaged to facilitate speech comprehension or that they are related to top-down attentional control. Correspondingly, Wild and colleagues (2012) reported activations in frontal areas (including the left IFG) that were only present when the participants selectively attended to the target speech among non-speech distractors.
In contrast to studies reporting increased left IFG activations to increased intelligibility of degraded speech (Davis and Johnsrude, 2003, McGettigan et al., 2012, Obleser et al., 2007), Wild and colleagues (2012) found greater activity in the left IFG for degraded than for clear target speech. By contrast, STS activity was increased with decreasing speech intelligibility, regardless of attention. Increased activity for attended degraded speech was proposed to reflect “the improvement in intelligibility afforded by explicit, effortful processing, or by additional cognitive processes (such as perceptual learning) that are engaged under directed attention” (Wild et al., 2012, p. 14019).
The authors further suggested that top-down influences on early auditory processing might facilitate speech comprehension in difficult listening situations.
The majority of fMRI studies on selective attention to speech have used only auditory speech stimuli (e.g., Alho et al., 2003, Alho et al., 2006, Evans et al., 2016, Puschmann et al., 2017, Wild et al., 2012). However, natural conversations often also include visual speech information. Integrating a voice with mouth movements (i.e., visual speech) facilitates speech understanding in relation to mere listening (Sumby and Pollack, 1954). In accordance, fMRI studies on listening to speech have shown that the presence of visual speech enhances activity in the auditory cortex and higher order speech-processing areas (e.g., Bishop and Miller, 2009, McGettigan et al., 2012).
A related magnetoencephalography (MEG) study showed that the presence of visual speech enhances auditory-cortex activity that follows the temporal amplitude envelope of attended speech (Zion Golumbic et al., 2013; for similar electroencephalography (EEG) evidence, see O’Sullivan et al., 2015). Facilitation of speech comprehension by visual speech holds especially true for noisy situations (e.g., Sumby and Pollack, 1954) and degraded quality of attended speech (e.g., McGettigan et al., 2012, Zion Golumbic et al., 2013). Some fMRI studies have suggested maximal facilitation of speech comprehension by visual speech at intermediate signal-to-noise ratios of auditory information (McGettigan et al., 2012, Ross et al., 2007).
Degraded speech increases demands for fronto-parietal top-down control (Davis and Johnsrude, 2003, Evans et al., 2016), whereas adding visual speech appears to facilitate selective attention (Sumby and Pollack, 1954, Zion Golumbic et al., 2013). However, it is still unknown whether fronto-parietal areas are activated during selective attention to visually degraded speech.
Moreover, an earlier study that employed a factorial design with different levels of auditory and visual clarity in sentences (McGettigan et al., 2012) did not include an unmodulated (clear) visual and auditory condition. Hence, to our knowledge, brain responses to continuous naturalistic dialogues with varying audio-visual speech quality have not been systematically examined before.
In the current study, we collected whole-head fMRI data in order to identify brain regions critical for selective attention to natural audiovisual speech.
More specifically, we examined attention-related modulations in the auditory cortex and associated fronto-parietal activity during selective attention to audiovisual dialogues. In addition, we assessed an interplay between auditory and visual quality manipulations.
We also included clear auditory and visual stimulus conditions to investigate brain areas activated during selective attention to naturalistic dialogues in the presence of irrelevant clear speech in the background. Our experimental setup might be regarded as mimicking watching a talk show on TV while a radio program is playing in the background.
Comparing brain activity during attention to the dialogues with activity during control conditions, in which the dialogues were ignored and a fixation cross was attended, allowed us to determine attention-related top-down effects and distinguish them from stimulus-dependent bottom-up effects (Alho et al., 2014).
We predicted that both increased speech intelligibility and increased amount of visual speech information in the attended speech would be associated with stronger stimulus-dependent activity in the STG/STS as well as subsequent activity in brain areas involved in linguistic processing.
Moreover, we hypothesized that degrading auditory or visual quality of attended speech might be related to increased fronto-parietal activity due to enhanced attentional demands.
Finally, we were interested to see whether attention to audiovisual speech and the quality of this speech would have interactions in some brain areas involved in auditory, visual or linguistic processing, or in the control of attention.
We investigated brain areas activated during selective attention to audiovisual dialogues. In particular, we sought attention-related modulations in the auditory cortex and fronto-parietal activity during selective attention to naturalistic dialogues with varying auditory and visual quality.
Behaviorally, we observed that increased quality of both auditory and visual information resulted in improved accuracy in answering questions about the content of the dialogues. Hence, expectedly, both increased auditory quality (e.g., Davis and Johnsrude, 2003) and increased visual quality (Sumby and Pollack, 1954) facilitated speech comprehension. However, no significant interaction between Auditory Quality and Visual Quality was observed.
Thus, our results do not fully support maximal facilitation of speech processing by visual speech at intermediate signal-to-noise ratios, as reported, for instance, by McGettigan and colleagues (2012) and Ross and colleagues (2007).
In the fMRI analysis, the main effect of Auditory Quality showed that increasing speech quality was associated with increased activity in the (bilateral) STG/STS, which corroborates previous studies on speech intelligibility (e.g., Scott et al., 2000, Davis and Johnsrude, 2003, Obleser et al., 2007, Okada et al., 2010, McGettigan et al., 2012, Evans et al., 2014, Evans et al., 2016). Enhanced activity in the STG/STS bilaterally is most probably related to enhanced speech comprehension with increasing availability of linguistic information.
The STG/STS activity extended to the temporal pole, which might be associated with enhanced semantic processing with increasing speech quality (Patterson et al., 2007). The right STG/STS activity observed here might also be related to prosodic processing during attentive listening (Alho et al., 2006, McGettigan et al., 2013, Kyong et al., 2014).
The right temporal pole, particularly its most anterior part, has also been associated with social cognition (Olson et al., 2013), which may have been triggered by our naturalistic audiovisual dialogues. In addition to the temporal lobe activity, we observed increasing activity in the left angular gyrus and left medial frontal gyrus with increasing speech intelligibility. Enhanced activity in the left angular gyrus may reflect successful speech comprehension, stemming either from increased speech quality or from facilitated semantic processing due to improved speech quality (Humphries et al., 2007, Obleser and Kotz, 2010).
The left medial frontal gyrus, in turn, has been linked to semantic processing as part of a semantic network (Binder et al., 2009). Hence, an increase in these activations with improving speech quality implies successful integration of linguistic information into the existing semantic network and improved comprehension of the spoken input, extending beyond the STG/STS.
The main effect of Visual Quality demonstrated increasing activity in the bilateral occipital cortex and right fusiform gyrus with decreasing visual quality – areas related to object and face recognition (e.g., Weiner and Zilles, 2016).
Enhanced activity in these areas might be due to, for instance, noise-modulation of the videos that contained more random motion on the screen than good quality videos. Visual noise has been shown to activate primary regions in the occipital cortex more than coherent motion (e.g., Braddick et al., 2001).
It is, however, also possible that viewing masked visual speech required more visual attention than viewing the unmasked videos, causing enhanced activity in the degraded visual conditions. Nevertheless, activity in the middle occipital gyrus was higher for poor visual quality combined with poor auditory quality than for good visual and auditory quality even during attention to the fixation cross (Figure 4).
This suggests that increased visual cortex activity for poorer visual quality was at least partly caused by random motion of the masker.
Activity enhancements with poor (contra good) audiovisual quality were also observed in the left superior parietal lobule, precuneus and right inferior parietal lobule in both attention conditions, implying a contribution of random motion in the masker to these effects as well. Increased visual quality was also associated with enhanced activity in the bilateral STG/STS, corroborating other studies reporting these areas to be involved in multisensory integration (e.g., Beauchamp et al., 2004).
We also observed an increase in left IFG activity with increasing visual quality, an area related to the processing of high-order linguistic information (e.g., Obleser et al., 2007). Enhanced activity in the left IFG has also been observed during integration of speech and gestures (Willems et al., 2009), suggesting its involvement in multimodal integration in the current study as well.
We also performed a 2 × 2 ANOVA comparing brain activity during attention to speech with activity during attention to the fixation cross, for stimuli with poor and good audiovisual quality. The main effect of Audiovisual Quality in this ANOVA indicated higher bilateral STG/STS (extending to the temporal pole) activity for good auditory and visual quality than for poor auditory and visual quality.
The STG/STS effects were also observed during attention to the fixation cross, implying quite automatic speech processing with enhanced audiovisual quality.
Furthermore, we found that for poorer audiovisual quality, activity was higher in the left superior parietal lobule, the right inferior parietal lobule, the left precuneus, and the bilateral middle occipital gyrus, possibly reflecting automatic processing of facial information.
The main effect of Attention in the 2 × 2 ANOVA indicated enhanced activity during attention to speech in the left planum polare, angular and lingual gyri, as well as the right temporal pole. We also observed activity in the dorsal part of the right inferior parietal lobule and supramarginal gyrus, as well as in the orbitofrontal/ventromedial frontal gyrus and posterior cingulate bilaterally.
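The 2 × 2 factorial logic behind these contrasts (Attention × Audiovisual Quality) reduces to comparisons of cell means: two main effects and one interaction. A minimal numpy illustration with invented numbers (these values are hypothetical, not the study's data):

```python
import numpy as np

# Hypothetical mean responses in some region of interest for a
# 2 x 2 design. Rows = Attention (speech, fixation);
# columns = Audiovisual Quality (good, poor). Values are invented.
cell_means = np.array([
    [2.0, 1.2],   # attend speech:   good quality, poor quality
    [1.4, 0.8],   # attend fixation: good quality, poor quality
])

# Main effect of Attention: average over Quality, compare rows.
attention_effect = cell_means[0].mean() - cell_means[1].mean()

# Main effect of Quality: average over Attention, compare columns.
quality_effect = cell_means[:, 0].mean() - cell_means[:, 1].mean()

# Interaction: does the Quality effect differ between Attention levels?
interaction = (cell_means[0, 0] - cell_means[0, 1]) \
            - (cell_means[1, 0] - cell_means[1, 1])

# Here: attention_effect ~ 0.5, quality_effect ~ 0.7, interaction ~ 0.2.
print(attention_effect, quality_effect, interaction)
```

A full voxel-wise fMRI ANOVA additionally models within-subject error terms, but the contrasts reported above are interpreted exactly along these lines.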
One might wonder why attending to the dialogues in relation to attending to the fixation cross was not associated with activity enhancements in the STG/STS as in some previous studies on selective attention to continuous speech (e.g., Alho et al., 2003; Alho et al., 2006).
One possible explanation is the ease of the visual control task (i.e. counting the rotations of the fixation cross), eliminating the need to disregard audiovisual speech in the background altogether.
This interpretation is also supported by the STG/STS activations observed even during attention to the fixation cross, at least when the audiovisual quality in to-be-ignored speech was good (see Figure 4). Areas in the planum polare have been shown to be associated with task-related manipulations in relation to speech stimuli (Harinen et al., 2013, Wikman et al., 2019).
Auditory attention effects have also been reported outside the STG/STS, for instance, in the middle and superior frontal gyri, precuneus, and the inferior and superior parietal lobules (e.g., Degerman et al., 2006, Salmi et al., 2007).
These areas are at least partly involved in the top-down control of the auditory cortex during selective attention.
Interestingly, even though the participants attended to visual stimuli both during attention to speech and attention to the fixation cross, activity was higher in the lingual gyrus (approximately in areas V2/V3 of the visual cortex) during attention to speech. This effect is presumably explained by differences in visual attention between the tasks (see, e.g., Martínez et al., 1999).
In other words, while both tasks demanded visual attention, task-related processing of visual speech was presumably more attention-demanding, especially when the faces were masked, as compared to processing of fixation-cross rotations. In line with previous studies on selective attention to continuous speech (Alho et al., 2003, Alho et al., 2006, Scott et al., 2004), attention to audiovisual dialogues did not significantly engage dorsolateral prefrontal and superior parietal areas.
This may be due to high automaticity of selective listening to continuous speech, which might, hence, be quite independent of fronto-parietal attentional control (Alho et al., 2006).
However, for the present audiovisual attention to speech, we observed activation in the left inferior parietal lobule, which may be related to attentive auditory processing (e.g., Alain et al., 2010, Rinne et al., 2009).
Furthermore, attention to audiovisual speech elicited enhanced activity in the orbitofrontal/ventromedial prefrontal cortex in comparison with attention to the fixation cross. One possible explanation would be that this activity is related to processing of semantic information (e.g., Binder et al., 2009) in attended speech in contrast to visual information in the fixation cross.
Alternatively, this effect may be related to the social aspect of the attended dialogues, since the ventromedial frontal area is associated with social cognition, such as theory of mind and moral judgment (Bzdok et al., 2012), as well as evaluation of other persons’ traits (Araujo et al., 2013).
Moreover, enhanced activity in the posterior cingulate and right superior temporal pole observed here during attention to speech may be related to social perception, as both areas have been implicated in social cognition (Bzdok et al., 2012).
To our knowledge, no previous study has shown that attending to emotionally neutral dialogues would enhance activity in these three brain regions related to social perception and cognition.
To summarize, our study is the first to present findings on selective attention to natural audiovisual dialogues. Our results demonstrate that increased auditory and visual quality of speech facilitated selective listening to the dialogues, seen in enhanced brain activity in the bilateral STG/STS and the temporal pole.
Enhanced activity in the temporal pole might be related to semantic processing particularly in the left hemisphere, whereas in the right hemisphere, it may index processing of social information activated during attention to the dialogues.
The fronto-parietal network was associated with enhanced activity during attention to speech, reflecting top-down attentional control. Attention to audiovisual speech also activated the orbitofrontal/ventromedial prefrontal cortex – a region associated with social and semantic cognition.
Hence, our findings on selective attention in realistic audiovisual dialogues emphasize not only involvement of brain networks related to audiovisual speech processing and semantic comprehension but, as a novel observation, the social brain network.
Anne Holden – Zuckerman Institute
The image is credited to Zuckerman Institute.
Original Research: Closed access
“Hierarchical Encoding of Attended Auditory Objects in Multi-talker Speech Perception”. James O’Sullivan, Jose Herrero, Elliot Smith, Catherine Schevon, Guy M. McKhann, Sameer A. Sheth, Ashesh D. Mehta, Nima Mesgarani.