When your mother calls your name, you know it’s her voice – no matter the volume, even over a poor cell phone connection.
And when you see her face, you know it’s hers – if she is far away, if the lighting is poor, or if you are on a bad FaceTime call.
This robustness to variation is a hallmark of human perception.
On the other hand, we are susceptible to illusions: We might fail to distinguish between sounds or images that are, in fact, different. Scientists have explained many of these illusions, but we lack a full understanding of the invariances in our auditory and visual systems.
Deep neural networks have also performed speech recognition and image classification tasks with impressive robustness to variations in the auditory or visual stimuli.
But are the invariances learned by these models similar to the invariances learned by human perceptual systems?
A group of MIT researchers has discovered that they are different. They presented their findings yesterday at the 2019 Conference on Neural Information Processing Systems.
The researchers made a novel generalization of a classical concept: “metamers” – physically distinct stimuli that generate the same perceptual effect.
The most famous examples of metamer stimuli arise because most people have three different types of cones in their retinae, which are responsible for color vision.
The perceived color of any single wavelength of light can be matched exactly by a particular combination of three lights of different colors – for example, red, green, and blue lights.
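This matching has a compact linear-algebra reading that is easy to simulate: a cone's response is the inner product of the light's spectrum with the cone's sensitivity curve, so any spectral change lying in the null space of the three-row sensitivity matrix is invisible. The sensitivity values and spectra below are invented for illustration, not physiological data.

```python
import numpy as np

# Toy sensitivity matrix: 3 cone types sampled at 5 wavelengths.
# These numbers are invented for illustration, not real physiology.
cones = np.array([
    [0.1, 0.3, 0.6, 0.8, 0.4],   # long-wavelength cones
    [0.2, 0.5, 0.7, 0.4, 0.1],   # medium-wavelength cones
    [0.7, 0.6, 0.2, 0.1, 0.0],   # short-wavelength cones
])

def cone_response(spectrum):
    """Each cone type integrates the spectrum weighted by its
    sensitivity curve: a matrix-vector product."""
    return cones @ spectrum

# Any spectral change in the null space of the 3x5 matrix is invisible
# (we ignore physical non-negativity of spectra for simplicity).
_, _, vt = np.linalg.svd(cones)
invisible = vt[-1]                          # a direction the cones cannot detect

spectrum_a = np.array([0.6, 1.0, 0.5, 1.0, 0.7])
spectrum_b = spectrum_a + 0.5 * invisible   # physically different light

# Metamers: different spectra, identical cone responses.
assert not np.allclose(spectrum_a, spectrum_b)
assert np.allclose(cone_response(spectrum_a), cone_response(spectrum_b))
```

With only three cone types sampling a continuum of wavelengths, such invisible directions must exist, which is why three primaries suffice for the displays mentioned below.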
Nineteenth-century scientists inferred from this observation that humans have three different types of bright-light detectors in our eyes.
This is the basis for electronic color displays on all of the screens we stare at every day.
Another example in the visual system is that, when we fix our gaze on an object, surrounding scenes that differ only in the periphery can be perceived as identical.
In the auditory domain, something analogous can be observed.
For example, the “textural” sound of two swarms of insects might be indistinguishable, despite differing in the acoustic details that compose them, because they have similar aggregate statistical properties.
In each case, the metamers provide insight into the mechanisms of perception, and constrain models of the human visual or auditory systems.
In the current work, the researchers randomly chose natural images and sound clips of spoken words from standard databases, and then synthesized sounds and images so that deep neural networks would sort them into the same classes as their natural counterparts.
That is, they generated physically distinct stimuli that are classified identically by models, rather than by humans.
This is a new way to think about metamers, generalizing the concept by swapping a computer model for the human perceiver. They therefore called these synthesized stimuli “model metamers” of the paired natural stimuli. The researchers then tested whether humans could identify the words and images.
“Participants heard a short segment of speech and had to identify from a list of words which word was in the middle of the clip.
For the natural audio this task is easy, but for many of the model metamers humans had a hard time recognizing the sound,” explains first author Jenelle Feather, a graduate student in the MIT Department of Brain and Cognitive Sciences (BCS) and a member of the Center for Brains, Minds, and Machines (CBMM).
That is, humans would not put the synthetic stimuli in the same class as the spoken word “bird” or the image of a bird.
In fact, model metamers generated to match the responses of the deepest layers of the model were generally unrecognizable as words or images by human subjects.
Josh McDermott, associate professor in BCS and investigator in CBMM, makes the following case: “The basic logic is that if we have a good model of human perception, say of speech recognition, then if we pick two sounds that the model says are the same and present these two sounds to a human listener, that human should also say that the two sounds are the same. If the human listener instead perceives the stimuli to be different, this is a clear indication that the representations in our model do not match those of human perception.”
Joining Feather and McDermott on the paper are Alex Durango, a post-baccalaureate student, and Ray Gonzalez, a research assistant, both in BCS.
There is another type of failure of deep networks that has received a lot of attention in the media: adversarial examples (see, for example, “Why did my classifier just mistake a turtle for a rifle?”).
These are stimuli that appear similar to humans but are misclassified by a model network (by design—they are constructed to be misclassified).
They are complementary to the stimuli generated by Feather’s group, which sound or appear different to humans but are designed to be co-classified by the model network.
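The adversarial direction of failure can be sketched with a fast-gradient-sign step on an invented two-class linear scorer (nothing here comes from the paper's networks): a perturbation bounded to a small amount per input dimension, chosen by following the gradient of the score difference, flips the model's decision while leaving the input nearly unchanged.

```python
import numpy as np

# A hand-built two-class linear scorer; the weights are invented for
# illustration and stand in for a trained classifier.
W = np.array([[ 1.0, -0.5,  0.3],    # weights for class 0's score
              [-0.8,  0.9, -0.2]])   # weights for class 1's score

def predict(x):
    return int(np.argmax(W @ x))

x = np.array([0.5, 0.4, 0.4])        # the model calls this class 0

# Fast-gradient-sign step: move each input dimension a small amount in
# whichever direction most raises class 1's score over class 0's.
eps = 0.2
grad = W[1] - W[0]                   # gradient of (score_1 - score_0) w.r.t. x
x_adv = x + eps * np.sign(grad)

# The perturbation is bounded by eps per dimension, yet the label flips.
assert predict(x) == 0
assert np.abs(x_adv - x).max() <= eps + 1e-12
assert predict(x_adv) == 1
```

In the metamer setting the optimization is reversed: instead of a small physical change that the model treats as different, one seeks a large physical change that the model treats as the same.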
The vulnerabilities of model networks exposed by adversarial attacks are well known: face-recognition software might mistake identities; automated vehicles might not recognize pedestrians.
The importance of this work lies in improving models of perception beyond deep networks. Although the standard adversarial examples indicate differences between deep networks and human perceptual systems, the new stimuli generated by the McDermott group arguably represent a more fundamental model failure – they show that generic examples of stimuli classified as the same by a deep network produce wildly different percepts for humans.
The team also figured out ways to modify the model networks to yield metamers that were more plausible sounds and images to humans. As McDermott says, “This gives us hope that we may be able to eventually develop models that pass the metamer test and better capture human invariances.”
“Model metamers demonstrate a significant failure of present-day neural networks to match the invariances in the human visual and auditory systems,” says Feather. “We hope that this work will provide a useful behavioral measuring stick to improve model representations and create better models of human sensory systems.”
Vision science seeks to understand why things look as they do (Koffka, 1935). Typically, our entire visual field looks subjectively crisp and clear. Yet our perception of the scene falling onto the peripheral retina is actually limited by at least three distinct sources: the optics of the eye, retinal sampling, and the mechanism(s) giving rise to crowding, in which our ability to identify and discriminate objects in the periphery is limited by the presence of nearby items (Bouma, 1970; Pelli and Tillman, 2008).
Many other phenomena also demonstrate striking ‘failures’ of visual perception, for example change blindness (Rensink et al., 1997; O’Regan et al., 1999) and inattentional blindness (Mack and Rock, 1998), though there is some discussion as to what extent these are distinct from crowding (Rosenholtz, 2016). Whatever the case, it is clear that we can be insensitive to significant changes in the world despite our rich subjective experience.
Visual crowding has been characterised as compulsory texture perception (Parkes et al., 2001; Lettvin, 1976) and compression (Balas et al., 2009; Rosenholtz et al., 2012a). This idea entails that we cannot perceive the precise structure of the visual world in the periphery. Rather, we are aware only of some set of summary statistics or ensemble properties of visual displays, such as the average size or orientation of a group of elements (Ariely, 2001; Dakin and Watt, 1997).
One of the appeals of the summary statistic idea is that it can be directly motivated from the perspective of efficient coding as a form of compression.
Image-computable texture summary statistics have been shown to be correlated with human performance in various tasks requiring the judgment of peripheral information, such as crowding and visual search (Rosenholtz et al., 2012a; Balas et al., 2009; Freeman and Simoncelli, 2011; Rosenholtz, 2016; Ehinger and Rosenholtz, 2016). Recently, it has even been suggested that summary statistics underlie our rich phenomenal experience itself—in the absence of focussed attention, we perceive only a texture-like visual world (Cohen et al., 2016).
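The summary-statistic idea can be made concrete with a toy simulation: two noise "textures" drawn from the same process differ at every pixel, yet their aggregate statistics nearly coincide, so a system that retains only such statistics cannot tell them apart. The statistics below (mean, contrast, a coarse histogram) are deliberately crude stand-ins for the far richer texture statistics used in the models cited above.

```python
import numpy as np

rng = np.random.default_rng(1)

def summary_stats(patch):
    """Toy summary statistics: mean, contrast, and a coarse intensity
    histogram. Real texture models use far richer statistic sets."""
    hist, _ = np.histogram(patch, bins=8, range=(-4, 4), density=True)
    return np.concatenate([[patch.mean(), patch.std()], hist])

# Two physically different noise "textures" from the same process.
a = rng.normal(size=(128, 128))
b = rng.normal(size=(128, 128))

# Pixel by pixel the patches differ substantially...
assert np.abs(a - b).max() > 1.0
# ...but their aggregate statistics nearly coincide, so a purely
# statistic-based observer would treat them as the same texture.
assert np.allclose(summary_stats(a), summary_stats(b), atol=0.05)
```

This is the sense in which crowding has been called compulsory texture perception: within a pooling region, only the statistics survive, not the sample that produced them.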
Across many tasks, summary statistic representations seem to capture aspects of peripheral vision when the scaling of their pooling regions corresponds to ‘Bouma’s Law’ (Rosenholtz et al., 2012a; Balas et al., 2009; Freeman and Simoncelli, 2011; Wallis and Bex, 2012; Ehinger and Rosenholtz, 2016). Bouma’s Law states that objects will crowd (correspondingly, statistics will be pooled) over spatial regions corresponding to approximately half the retinal eccentricity (Bouma, 1970; Pelli and Tillman, 2008; though see Rosen et al., 2014).
Although the precise value of Bouma’s law can vary substantially even across different visual quadrants within an individual (see e.g. Petrov and Meleshkevich, 2011), we refer here to the broader notion that summary statistics are pooled over an area that increases linearly with eccentricity, rather than to the exact factor of this increase (the exact factor becomes important in the paragraph below).
If the visual system does indeed represent the periphery using summary statistics, then Bouma’s scaling implies that as retinal eccentricity increases, increasingly large regions of space are texturised by the visual system.
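Bouma's scaling itself is just a linear relation, which can be stated as a one-line function; the default factor of 0.5 encodes the 'half the eccentricity' rule of thumb, and, as noted above, the true factor varies.

```python
def pooling_diameter(eccentricity_deg, scale=0.5):
    """Bouma-style linear scaling: the region over which summary
    statistics are pooled grows in proportion to retinal eccentricity.
    scale=0.5 is the 'half the eccentricity' rule of thumb."""
    return scale * eccentricity_deg

# At 10 degrees of eccentricity, items within roughly 5 degrees crowd
# one another; at 20 degrees, the texturised region doubles in size.
assert pooling_diameter(10) == 5.0
assert pooling_diameter(20) == 10.0
```

The linearity is what makes the prediction strong: doubling eccentricity should double the region of space that is collapsed to a texture.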
If a model captured these statistics and their pooling, and the model was amenable to being run in a generative mode, then images could be created that are indistinguishable from the original despite being physically different (metamers).
Freeman and Simoncelli (2011) developed a model (hereafter, FS-model) in which texture-like summary statistics are pooled over spatial regions inspired by the receptive fields in primate visual cortex. The size of neural receptive fields in ventral visual stream areas increases as a function of retinal eccentricity, and as one moves downstream from V1 to V2 and V4 at a given eccentricity.
Each visual area therefore has a signature scale factor, defined as the ratio of the receptive field diameter to retinal eccentricity (Freeman and Simoncelli, 2011). Similarly, the pooling regions of the FS-model also increase with retinal eccentricity with a definable scale factor. New images can be synthesised that match the summary statistics of original images at this scale factor. As scale factor increases, texture statistics are pooled over increasingly large regions of space, resulting in more distorted synthesised images relative to the original (that is, more information is discarded).
The maximum scale factor for which the images remain indistinguishable (the critical scale) characterises perceptually-relevant compression in the visual system’s representation. If the scale factor of the model corresponded to the scaling of the visual system in the responsible visual area, and information in upstream areas was irretrievably lost, then the images synthesised by the model should be indistinguishable while discarding as much information as possible. That is, we seek the maximum compression that is perceptually lossless:

s_crit(I) = max { s : d(I, I^s) = 0 },

where s_crit(I) is the critical scale for an image I, I^s is a synthesised image at scale s and d is a perceptual distance. Larger scale factors discard more information than the relevant visual area and therefore the images should look different. Smaller scale factors preserve information that could be discarded without any perceptual effect.
Crucially, it is the minimum critical scale over images that is important for the scaling theory. If the visual system computes summary statistics over fixed (image-independent) pooling regions in the same way as the model, then the model must be able to produce metamers for all images. While images may vary in their individual critical scales, the image with the smallest critical scale determines the maximum compression for appearance to be matched by the visual system in general, assuming an image-independent representation:

s_system = min_I s_crit(I).
Freeman and Simoncelli showed that the largest scale factor for which two synthesised images could not be told apart was approximately 0.5, or pooling regions of about half the eccentricity. This scaling matched the signature of area V2, and also matched the approximate value of Bouma’s Law. Subsequently, this result has been interpreted as demonstrating a link between receptive field scaling, crowding, and our rich phenomenal experience (e.g. Block, 2013; Cohen et al., 2016; Landy, 2013; Movshon and Simoncelli, 2014; Seth, 2014).
These interpretations imply that the FS-model creates metamers for natural scenes. However, observers in Freeman and Simoncelli’s experiment never saw the original scenes, but only compared synthesised images to each other. Showing that two model samples are indiscriminable from each other could yield trivial results.
For example, two white noise samples matched to the mean and contrast of a natural scene would be easy to discriminate from the scene but hard to discriminate from each other. Furthermore, since synthesised images represent a specific subset of images, and the system critical scale s_system is the minimum over all possible images, the s_system estimated in Freeman and Simoncelli (2011) is likely to be an overestimate.
No previous paper has estimated s_system for the FS-model using natural images. Wallis et al. (2016) tested the related Portilla and Simoncelli (2000) model textures, and found that observers could easily discriminate these textures from original images in the periphery. However, the Portilla and Simoncelli model makes no explicit connection to neural receptive field scaling. In addition, relative to the textures tested by Wallis et al. (2016), the pooling region overlap used in the FS-model provides a strong constraint on the resulting syntheses, making the images much more similar to the originals. It is therefore still possible that the FS-model produces metamers for natural scenes at scale factors of 0.5.
Provided by Massachusetts Institute of Technology