Supplementary material for the paper

An Hybrid Audio Scheme using Hidden Markov Models of Waveforms

by

S. Molla and B. Torrésani

abstract
This paper reports on recent results related to audiophonic signals encoding using time-scale and time-frequency transform. More precisely, non-linear, structured approximations for tonal and transient components using local cosine and wavelet bases will be described, yielding expansions of audio signals in the form

tonal + transient + residual

We describe a general formulation involving hidden Markov models, together with corresponding rate estimates. Estimators for the balance transient/tonal are also discussed.

Motivation, Illustration:

Sparsity is not enough to separate transients from tonals: The figure below represents a glockenspiel signal (top) and the logarithm of the modulus of MDCT coefficients (middle); even though MDCT is a priori more adapted to the estimation of the tonal layer, "vertical" chains of large MDCT coefficients are nevertheless present, which do not correspond to tonals. A simple thresholding of MDCT coefficients clearly does not separate between these "vertical" and "horizontal" parts of the MDCT coefficients. The introduction of structures in the estimation procedure allows one to keep only "horizontal ridges" of large MDCT coefficients (bottom).

MDCT coefficients of glockenspiel

After substraction of the estimated tonal component, the same representation of the residual (bottom figure below) only shows the "vertical" structures, i.e. transients.

Model:

The signal is modeled (HWAM: Hybrid Waveform Audio Model) as a linear combination of a few wavelets and a few local cosines (MDCT basis functions), whose index sets belong to transient and tonal significance maps. The tonal significance map is assumed to form "horizontal ridges" (of significant MDCT coefficients), while the transient significance map is supposed to form "trees" of significant wavelet coefficients.

The observations are the samples of the hybrid signal, which may be represented by "observed" MDCT coefficients (i.e. the MDCT coefficients of the hybrid signal... not those of its MDCT part). The main idea is that these observed coefficients may have two different types of behaviors:

if the corresponding index belongs to the initial tonal significance map, the MDCT coefficient is expected to be large.
otherwise, the MDCT coefficient originates from the wavelets (the transient part), and is expected to be small.

To resolve the ambiguities, the structures (i.e. persistence properties) are exploited: roughly speaking, an MDCT coefficient is considered belonging to the tonal layer if it is large enough, and if its neighbours (in time) belong to the tonal layer. A similar strategy is used to select the wavelet coefficients for describing transients.

Tonal layer:

The signal is first expanded on a local cosine (MDCT) basis, and the largest coefficients are retained, provided they satisfy a time-persistence condition. More precisely, at each frequency, coefficients are modeled using a mixture of Gaussian random variables: T-type (T for tonal) coefficients have large variance, and R-type coefficients (R for residual, originating from the wavelet part, and non-sparse components, noise,...) have small variance. Time persistence is modelled by a Markov chain. The parameters of the model are the variances of the normal distributions at each frequency, and the probabilities of transition T -> R and R -> T.

Parameters are estimated using an adapted EM algorithm, and the locations of T-type coefficients are estimated by thresholding posterior probabilities. The tonal layer is reconstructed as the projection of the signal onto the linear span of MDCT atoms corresponding to T-type coefficients. A nontonal part of the signal is obtained by substracting the tonal layer from the signal.

Transient layer:

The nontonal part is expanded onto a wavelet basis, and the largest coefficients are retained, provided they satisfy a scale-persistence condition. More precisely, at each scale, coefficients are modeled using a mixture of Gaussian random variables: T-type (T now stands for transient) coefficients have large variance, and R-type coefficients (R for residual) have small variance. Scale-persistence is modelled by a Markov chain on the dyadic tree naturally associated with the wavelet basis. The parameters of the model are again the variances of the normal distributions at each frequency, and the probability of transition T -> R (the transition R -> T is forbidden, which ensures connectedness of the tree of T-type coefficients).

Again, parameters are estimated using an adapted EM algorithm, and the locations of T-type coefficients are estimated by thresholding posterior probabilities. The transient layer is reconstructed as the projection of the signal onto the linear span of wavelets corresponding to T-type coefficients. A residual part of the signal is obtained by substracting the transient layer from the nontonal signal.

Residual:

The residual is modeled using standard LPC procedure.

Hybrid decomposition

Test signal: mamavatu

Transientness index

First, a transientness index is estimated (see the article and supplementary material for details), which provides an oracle for the proportion of the bit budget to be spent on the transient part:

mamaindex

same index on a smaller portion of the signal:

smamaindex

Hybrid decomposition:

Given a global bit budget, split into transient bit budget, tonal bit budget as prescribed by the transientness index, and residual bit budget (constant), the three layers are estimated, and are displayed below:

Sound files:

the mamavatu signal

original signal
tonal layer
nontonal part (original - tonal)
transient layer
residual
reconstruction

Another example: the glockenspiel signal

original signal
tonal layer
nontonal part
transient layer
residual
reconstruction