As someone interested in real-time video communications, you may have been wowed by the recent news that Google’s Duo engineering team has developed an audio codec capable of delivering a reasonable facsimile of human speech at 3 kbps.
But — for now at least — the codec, dubbed Lyra, doesn’t look like anything to get excited about. Its development is far more interesting for where it might lead one day than for any impact it is likely to have anytime soon in the highly competitive social meeting app world occupied by Duo, Facetime, WhatsApp, and many others, let alone the video conferencing environment populated by Google Meet, Zoom, and Skype, among others.
Duo, like most video communications apps, relies on the WebRTC streaming protocol, as do most of the multidirectional, real-time use cases running on Red5 Pro’s experience delivery network (XDN) platform. Whether Lyra may one day be useful for XDN applications depends on where Google takes it from here and how those efforts compare to other advances in encoding technology.
The Shrinking Bit Rate Trend in Speech Coding
As things stand now in the obscure world of cutting-edge speech compression, 3 kbps isn’t all that unusual. By restricting algorithmic processing to all or some portion of the vocal frequency range, roughly 300 Hz to 18 kHz, voice codecs old and new are far more bandwidth efficient than audio codecs that support the full range of sound audible to humans. For example, the most widely used audio codec in video streams, Advanced Audio Coding (AAC), supports sample rates up to 96 kHz, more than enough to capture the entire audible spectrum, and can add a low frequency enhancement (LFE) channel, the subwoofer feed used in surround sound and other advanced acoustics.
AAC, which is specified alongside H.264/AVC in the MPEG-4 standards, consumes bandwidth at 96 kbps with typical settings for stereo sound at a 48 kHz encode sample rate, although music-focused apps often run AAC at higher sample rates, with bit rates extending all the way to 512 kbps. In contrast, Opus, the most widely used next-generation speech codec in WebRTC-streamed communications (including Duo’s), can replicate speech to near perfection at just 32 kbps and delivers viable voice communications at bit rates as low as 6 kbps.
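To put those bit rates in concrete terms, the short sketch below (illustrative arithmetic only, not tied to any particular encoder implementation) converts them into data consumed per minute of audio:

```python
# Approximate data consumed per minute of audio at the bit rates cited
# above (kbps = kilobits per second; 8 bits per byte).
def kilobytes_per_minute(kbps):
    return kbps * 60 / 8

rates = {"AAC stereo": 96, "Opus near-transparent speech": 32, "Opus low-rate speech": 6}
for name, kbps in rates.items():
    print(f"{name}: {kilobytes_per_minute(kbps):.0f} kB/min")
```

At 96 kbps, a minute of AAC stereo comes to 720 kB, while Opus at 6 kbps needs just 45 kB for the same minute of speech.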
Support for Opus, along with G.722 and G.711, is mandated by the WebRTC specifications, which means they are supported natively by the leading browsers. Codecs like Lyra can be used with WebRTC provided that they have app plug-in support, as is the case with Duo.
Many speech codecs, including Lyra and Opus, can accommodate severely bandwidth-constrained scenarios by limiting sound replication to reduced frequency ranges, such as 300 Hz to 8 kHz or even 500 Hz to 3 kHz, which are still wide enough to convey comprehensible, if awful-sounding, speech. These narrower ranges make it possible to lower the minimum bit rates used for intelligible speech to sub-3 kbps levels.
Codecs that can do this include the Defense Department’s enhanced Mixed-Excitation Linear Prediction (eMELP), 3GPP’s Adaptive Multi-Rate (AMR), and Speex, an open-source predecessor to Opus (both developed by Xiph.Org). In addition, the Code-Excited Linear Prediction (CELP) and Harmonic Vector Excitation Coding (HVXC) algorithms specified for voice-only coding by MPEG-4 Part 3 are designed to support transmission of viable speech at bit rates as low as 3.65 kbps and 2 kbps, respectively.
Comparing Lyra and Opus
In a recent blog post, the team behind Lyra begins its assessment of what makes Lyra special with the claim that at 3 kbps, the codec outperforms all others operating at that bit rate and delivers better quality than Opus at 6 kbps. “Other codecs are capable of operating at comparable bit rates to Lyra (Speex, MELP, AMR), but each suffer from increased artifacts and result in a robotic sounding voice,” state Google’s Alejandro Luebs, a software engineer, and Jamieson Brettle, product manager for Chrome.
But the test samples provided in the blog only include a short speech clip encoded by Lyra at 3 kbps, Opus at 6 kbps, and Speex at 3 kbps. These are the royalty-free options among the codecs mentioned here, which may explain why these test samples were the only ones included.
The differences in quality levels reported from these tests seem meaningful. The averages of mean opinion scores (MOS) generated by neutral listeners on a 1 to 5 scale showed Lyra at 3.5, Opus at 2.5, and Speex at 1.7. Still, if, as the writers maintain, additional tests demonstrated that Opus at 8 kbps is equivalent to Lyra at 3 kbps, one wonders whether the bit rate savings are enough to merit putting Lyra to work.
The Role for Lyra
Obviously, the Duo folks think Lyra is worth their time. Noting the Lyra 3 kbps equivalency with Opus at 8 kbps equates to a 60% reduction in consumed bandwidth, they assert that “billions of users in emerging markets can have access to an efficient low-bit-rate codec that allows them to have higher quality audio than ever before.”
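The arithmetic behind that figure is simple to check; this quick sketch reproduces it from the bit rates quoted above:

```python
# Bandwidth reduction from replacing Opus at 8 kbps with Lyra at 3 kbps,
# the equivalence reported by the Lyra team.
lyra_kbps = 3
opus_kbps = 8
savings = 1 - lyra_kbps / opus_kbps
print(f"{savings:.1%}")  # prints "62.5%"
```

The exact reduction is 62.5%, which Google rounds to roughly 60%.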
Fair enough. Better audio quality is a good thing; if a new codec can deliver the quality of another at significantly lower bit rates, all users, not just those in bandwidth-restricted markets, will benefit.
For now, however, Lyra’s real impact is likely to be on people who don’t have the bandwidth to support video communications but who would benefit from being able to have a decent audio chat connection. Indeed, Google is reported to be accelerating implementation of Lyra for just that purpose in regions where people are still on 2G connections or wireline dial-up links.
As for people on 3G connections, replacing Opus with Lyra is not likely to bring more consumers into the fold, given that 3G supports 240p video well within the throughput limits of that standard, whether at 350 kbps when H.264 is employed or at 200 kbps when VP9, the open-source video codec used by Duo, is in play. Saving 5 kbps by using Lyra for the lowest audio quality at 3 kbps versus Opus delivering the same quality at 8 kbps would not be decisive as to whether 3G users could participate in video chats.
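That conclusion is easy to verify with the figures just cited. The following sketch (an illustrative calculation, not a measurement) computes audio’s share of the combined stream:

```python
# Audio's share of a 240p VP9 video chat stream at 3G-friendly bit rates.
video_kbps = 200  # 240p VP9, per the figure cited above
for name, audio_kbps in [("Opus", 8), ("Lyra", 3)]:
    total = video_kbps + audio_kbps
    print(f"{name}: {total} kbps total; audio share {audio_kbps / total:.1%}")
```

Either way, audio accounts for only a few percent of the total stream, so the 5 kbps saving does little to change whether a connection can sustain the call.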
The Google team suggests that Lyra, used in combination with AV1, which improves coding efficiency by about 40% compared to VP9, would enable video chats “for users connecting to the internet via a 56 kbps dial-in modem.” But the AV1/Lyra combination wouldn’t work for people on 2G phones owing to the inability of such phones to support the processing required by AV1.
In fact, Google’s use of AV1, which it said it was implementing last year, will be limited to computers and 5G smartphones with enough processing power to handle AV1. It remains to be seen whether Lyra will matter in those high-bandwidth environments.
Efficient Speech Coding on the XDN Platform
Such considerations are irrelevant to providers who want to improve audio quality with apps delivered over an XDN infrastructure. They can do this at significant bandwidth savings by simply using Opus as a browser-supported WebRTC codec.
Whether Lyra might have implications for apps operating on XDN infrastructures depends on how Google employs the innovations it came up with to make Lyra possible. The Duo developers say they are “beginning to research how these technologies can lead to a low-bitrate general-purpose audio codec (i.e., music and other non-speech use cases).”
These efforts and others like them are well worth tracking. Lyra is in a new class of parametric codecs, which is to say codecs that recreate signals in the decoding process from a few key parameters extracted from raw speech rather than encoding waveforms directly, as is done with Opus. Lyra and other new parametric codecs use what is known as generative modeling to create a richer palette of parameters by producing more signals to work with in the recreation of speech during the decoding process.
How this is done while reducing the bit count rather than increasing it brings into play a dazzling array of techniques involving the creation of what are known as log mel spectrograms, logarithmic measures of a signal’s energy across frequency bands spaced to approximate human hearing. Generative models trained on thousands of hours of recorded voices parse these features with the aid of machine learning (ML) to replicate a specific voice track.
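As a rough illustration of the feature-extraction step described above, the sketch below computes a log mel spectrogram from scratch with NumPy. The sample rate, window size, hop length, and band count here are hypothetical choices for illustration, not Lyra’s actual parameters:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters centered at points evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Windowed short-time power spectrum, then mel warping, then log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel_energy + 1e-10)  # floor avoids log(0)

# 100 ms synthetic tone as a stand-in for a speech frame sequence.
sr = 16000
t = np.arange(int(0.1 * sr)) / sr
S = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr=sr)
print(S.shape)  # (frames, mel bands)
```

A real parametric codec sends compact features like these (plus pitch and energy parameters) to the decoder, where a generative model synthesizes the waveform rather than reconstructing it sample by sample.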
Google’s team came up with a way to improve on the verisimilitude of the speech recreated through these methods. The details of how these new methods work and what they say about the ramifications of ML and other aspects of AI in signal processing and other streaming-related functions will be explored at much greater depth in a forthcoming blog.
Meanwhile, to learn more about XDN technology, contact us at email@example.com or schedule a call.