Sound Interfaces and Communication Featured Signal of Change: SoC1079 April 2019

Author: Martin Schwirn (Send us feedback.)

Ubiquitous Interfaces, Constant Interactions (Signal of Change, February 2019) looks at the goal of achieving true ubiquity of interfaces, which gives users a constant connection to the digital world so they can extract information or interact with apps. In particular, speech-recognition technology has moved a lot closer toward providing users with seamless interaction with devices since 2003's Business Implications of New Directions in Speech Technology. Humans and Machines Converse from Signal of Change, December 2018 maintains that the ability to use natural language to engage with applications will spread to virtually every human interaction with digital and electronic systems. Several technologies could advance speech- and sound-driven interaction with devices and the environment. Scan™ mentioned some related conceptual approaches more than a decade ago. For instance, 2004's, Advances in Speech Technologies: Pointing the Way to New Applications highlights speechless speech recognition. The timing seems right to take a look at advanced speech- and sound-related interface technologies, which could drive a wide range of applications beyond smart speakers.

Researchers are working on enablers for solutions that could eventually enable the surreptitious transmission of sound.

Various approaches offer potential ways to facilitate the quiet and surreptitious speech control of devices. For example, researchers at the Massachusetts Institute of Technology (MIT; Cambridge, Massachusetts) are looking at the use of subvocal interfaces. Such interfaces make use of neuromuscular signals that occur when users verbalize commands internally. In simple terms, the technology makes use of physiological processes that occur when people talk to themselves in their minds. A subvocal interface can capture such information via electrodes on the user's jaw and face. The MIT researchers' device, which enables users to ask questions and control devices without speaking, also features bone-conduction headphones to enable the transmission of information to users without disrupting their hearing other tasks they are engaging in. The device looks very obtrusive, covering the lower part of one side of the user's face. Likely, such a device will find use in only very focused application areas—potentially in loud industrial or commercial environments where the speechless capture of commands could provide many benefits. Meanwhile, Microsoft (Redmond, Washington) patented a technology that would enable a person to use whispers to interact with speech-enabled devices, thereby preventing the person from disturbing other people in the area. Problematically, the technology, which requires positioning a microphone extremely close to the user's mouth, calls for the user to inhale while whispering to prevent distorting the input with exhaled airflow. Even if the technology were reliable, few people are likely to expend the effort necessary to train themselves to inhale while whispering.

Other approaches could also provide improvements in speech recognition and even lead to solutions that could enable the extraction of speech-related information from a distance. Researchers at Xi'an Jiaotong-Liverpool University (Suzhou, China)—an international joint university by the University of Liverpool (Liverpool, England) and Xi'an Jiaotong University (Xi'an, China)—are working on a technology that will improve hearing aids but could also support advanced speech-technology systems. The researchers are looking at a way to use visual information to augment the information that sound provides. To that end, a system the researchers are developing would use a camera to capture and track speakers' lip movement, providing information that could lead to a more robust speech-processing experience for wearers of hearing aids—particularly in loud situations. According to Andrew Abel from Xi'an Jiaotong-Liverpool University, "The idea is that an audio signal from a microphone and visual information from a camera can be fed into the same system which then processes the information, using the visual information to filter the 'noise' from the audio signal" ("Research: Developing the Cognitive Hearing Aid," Xi'an Jiaotong-Liverpool University, 4 December 2018; online). The researchers drew inspiration for their system from the cocktail-party effect, in which a person focuses on a conversation partner's speech and tunes out surrounding distracting noise. Such a system could not only help speech-recognition technologies in general but also offer the potential to extract information from video footage that has no sound, for instance. Another technology that could capture sound and improve the ability to extract information for a wide range of applications comes from Nokia Corporation (Espoo, Finland). Developers from Nokia created OZO Sound—software that enables smartphone users to zoom in on sound information while recording video much like the way smartphone users can zoom in on visual information while recording video or taking photographs. The software enables users to focus selectively on various points in a scene to "capture sound coming from in front (or even behind) the cellphone's cameras and microphones. What's more, the software allows the audio recording to automatically track the selected person, animal, or object. It also allows users to zoom in on a particular sound...and sync the audio zoom with the video zoom" ("Zoom and Focus: These Familiar Video Features Are Coming to Audio," IEEE Spectrum, 15 January 2019; online). For instance, a user could zoom in on his or her children at a playground and then hear what the children are saying to friends. The software even reduces surrounding noise. Technical complications prevent OZO Sound's rolling out as a downloadable app and require smartphone manufacturers to license the software for integration into their phones' built-in audio- and video-capturing software. These conditions create an adoption hurdle for OZO Sound.

Similarly, developers are looking at ways to transmit sound and information in surreptitious ways. For example, researchers at the University of California, Berkeley (Berkeley, California), demonstrated that they could hide speech-based commands for virtual assistants in recordings of music or spoken text. Although these commands are nearly inaudible to the human ear, they can make a virtual assistant interact with smartphones and some other connected devices. Meanwhile, researchers at the MIT Lincoln Laboratory are developing ways to use a laser to send an audible message directly to a person's ear across a distance of several feet. The laser-based approaches do not require the use of any receiver equipment, and they enable the targeted transmission of sound across noisy rooms. Conceivably, such methods could also find use in transmitting commands and information to speech-enabled systems in industrial facilities and other noisy environments. Other researchers are working on enablers for solutions that could eventually enable the surreptitious transmission of sound. Scientists from the City University of New York (New York, New York) developed a metamaterial capable of transmitting sound along its edges. The researchers believe that the material's capability to provide abnormally robust sound transmission gives the material multiple application areas, including ultrasound imaging and underwater acoustics.