Lyra audio codec

Created on 2023-02-28T05:49:43-06:00

The basic architecture of the Lyra codec is quite simple. Features, or distinctive speech attributes, are extracted from speech every 40ms and are then compressed for transmission. The features themselves are log mel spectrograms, a list of numbers representing the speech energy in different frequency bands, which have traditionally been used for their perceptual relevance because they are modeled after human auditory response. On the other end, a generative model uses those features to recreate the speech signal. In this sense, Lyra is very similar to other traditional parametric codecs, such as MELP.
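As a rough illustration of the feature extraction described above, here is a minimal sketch that turns one 40 ms frame into log mel band energies. The 16 kHz sample rate, the Hann window, the 8 bands, and all function names are assumptions chosen for the example; they are not taken from Lyra's implementation, and the naive DFT is used only to keep the sketch dependency-free.

```python
import cmath
import math

SAMPLE_RATE = 16000                          # assumed sample rate in Hz
FRAME_MS = 40                                # frame length from the text
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 640 samples per frame
NUM_BANDS = 8                                # illustrative band count


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of a Hann-windowed frame (bins 0..N/2)."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(num_bands, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (num_bands + 1)) for i in range(num_bands + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for b in range(num_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        weights = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def log_mel_features(frame):
    """Log energy in each mel band for a single frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(NUM_BANDS, len(frame), SAMPLE_RATE)
    # A small floor avoids log(0) on silent bands.
    return [math.log(sum(w * p for w, p in zip(ws, spec)) + 1e-10)
            for ws in bank]
```

Calling `log_mel_features` on a list of `FRAME_LEN` samples returns `NUM_BANDS` numbers, one per frequency band, which is the "list of numbers" the paragraph refers to.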

Additionally, WaveNetEQ, the generative model-based packet-loss-concealment system currently used in Duo, has demonstrated how this technology can be used in real-world scenarios.

In short: the encoder takes audio in 40 ms frames, extracts a log mel spectrogram from each frame, quantizes those features against a vector codebook, and transmits the resulting codebook indices as a bitstream.
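The quantization step can be sketched as a nearest-neighbour search against a codebook, followed by packing the chosen indices into bytes. This is only an illustration: the single flat codebook, the fixed-width MSB-first packing, and the function names are assumptions made for the example, not Lyra's actual quantizer or bitstream layout.

```python
import math


def nearest_index(vector, codebook):
    """Index of the codebook entry with the smallest squared distance."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def pack_indices(indices, bits_per_index):
    """Concatenate fixed-width indices, MSB first, into a byte string."""
    value = 0
    for idx in indices:
        if not 0 <= idx < (1 << bits_per_index):
            raise ValueError("index does not fit in bits_per_index")
        value = (value << bits_per_index) | idx
    total_bits = len(indices) * bits_per_index
    pad = (-total_bits) % 8          # zero bits appended to fill the last byte
    return (value << pad).to_bytes((total_bits + pad) // 8, "big")


def encode_frames(frames, codebook):
    """Quantize each feature vector and pack the indices into bytes."""
    bits = max(1, math.ceil(math.log2(len(codebook))))
    return pack_indices([nearest_index(f, codebook) for f in frames], bits)
```

With 40 ms frames there are 25 frames per second, so under this scheme a codebook index of b bits per frame costs 25 × b bits per second.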

The decoder reconstructs the log mel spectrogram from the bitstream and then uses a WaveRNN to resynthesize an approximation of the original speech.
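The first half of the decoder, recovering feature vectors from the bitstream, can be sketched as a table lookup. The example assumes a hypothetical bitstream of fixed-width codebook indices packed most-significant-bit first; that layout and the function names are illustrative assumptions, not Lyra's real format. The WaveRNN synthesis stage is not shown.

```python
def unpack_indices(payload, bits_per_index, count):
    """Read `count` fixed-width indices, MSB first, from a byte string."""
    value = int.from_bytes(payload, "big")
    total_bits = len(payload) * 8
    mask = (1 << bits_per_index) - 1
    indices = []
    for i in range(count):
        shift = total_bits - (i + 1) * bits_per_index
        if shift < 0:
            raise ValueError("payload too short for requested count")
        indices.append((value >> shift) & mask)
    return indices


def decode_features(payload, codebook, bits_per_index, count):
    """Map each received index back to its codebook vector."""
    return [list(codebook[i])
            for i in unpack_indices(payload, bits_per_index, count)]
```

The reconstructed vectors are only the nearest codebook entries, not the encoder's exact features, which is where the quantization loss enters before the generative model runs.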