This section provides a brief scientific overview of the speech signal analysis techniques involved in SPro with a particular focus on variable resolution spectral analysis. It also defines the equations and methods implemented in SPro.
2.1 Pre-emphasis and windowing: Short term windows and pre-emphasis
2.2 Variable resolution spectral analysis
2.3 Filter-bank analysis: Filter-bank speech analysis
2.4 Linear predictive analysis: Linear prediction speech analysis
2.5 Cepstral analysis
2.6 Deltas and normalization: Delta, acceleration and feature normalization
2.1 Pre-emphasis and windowing
Speech is intrinsically a highly non-stationary signal. Therefore, speech analysis, whether FFT-based or LPC-based, must be carried out on short segments across which the speech signal is assumed to be stationary. Typically, feature extraction is performed on 20 to 30 ms windows with a 10 to 15 ms shift between two consecutive windows. This principle is illustrated in the figure below.
The weighting windows are defined, for sample index i, as

HAMMING  | w_i = 0.54 - 0.46 \cos(2 \pi i / N)
HANNING  | w_i = (1 - \cos(2 \pi i / N)) / 2
BLACKMAN | w_i = 0.42 - 0.5 \cos(2 \pi i / N) + 0.08 \cos(4 \pi i / N)
Pre-emphasis is also traditionally used to compensate for the -6dB/octave spectral slope of the speech signal. This step consists of filtering the signal with a first-order high-pass filter H(z) = 1 - k z^{-1}, with k \in [0,1[. The pre-emphasis filter is applied to the input signal before windowing.
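As a minimal sketch of the two steps above, pre-emphasis followed by windowing can be written as below. The function names, the default k = 0.95, the choice N = length - 1 in the Hamming formula, and the standard 2 \pi i / N cosine argument are assumptions made for illustration, not values taken from SPro.

```python
import math


def pre_emphasize(signal, k=0.95):
    """Apply H(z) = 1 - k z^{-1}, i.e. y[n] = x[n] - k * x[n-1], with x[-1] = 0."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - k * prev)
        prev = x
    return out


def hamming(length):
    """Hamming weights 0.54 - 0.46 cos(2 pi i / N), with N = length - 1 (assumed)."""
    if length == 1:
        return [1.0]
    n = length - 1
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / n) for i in range(length)]


def windowed_frames(signal, length, shift):
    """Return Hamming-weighted frames of `length` samples taken every `shift` samples."""
    w = hamming(length)
    result = []
    for start in range(0, len(signal) - length + 1, shift):
        result.append([w[i] * signal[start + i] for i in range(length)])
    return result
```

With a 16 kHz signal, `windowed_frames(pre_emphasize(x), 400, 160)` would give 25 ms windows with a 10 ms shift, within the ranges quoted above.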
2.2 Variable resolution spectral analysis
Classical spectral analysis has a constant resolution over the frequency axis. The idea of variable resolution spectral analysis(1) is to vary the spectral resolution as a function of the frequency. This is achieved by applying a bilinear transformation of the frequency axis, the transformation being controlled by a single parameter a. The bilinear warping of the frequency axis is defined by
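As a minimal sketch, assuming the bilinear warping takes the standard first-order all-pass form, with frequencies normalized to [0, \pi] and |a| < 1, the warped frequency can be computed as follows. The function name `warp` and this particular parameterization are assumptions for illustration and may differ from the convention used in SPro.

```python
import math


def warp(omega, a):
    """Warped frequency for omega in [0, pi], using the phase of a first-order
    all-pass section (z^{-1} - a) / (1 - a z^{-1}) with |a| < 1 (assumed form)."""
    return omega + 2.0 * math.atan2(a * math.sin(omega), 1.0 - a * math.cos(omega))
```

Under this convention the endpoints 0 and \pi are left unchanged, a = 0 gives the identity mapping, and a positive a stretches the low-frequency region (finer resolution at low frequencies) while a negative a stretches the high-frequency region.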
Using variable resolution spectral analysis with a filter-bank is straightforward: the center frequencies of the filters are simply determined according to the warping. See section 2.3 Filter-bank analysis.
Linear predictive modeling with variable resolution spectral analysis is also possible. Briefly, the idea is to solve the normal equations on the generalized auto-correlation sequence rather than on the traditional auto-correlation sequence. The generalized auto-correlation r(p) is the correlation between the original signal filtered by a corrective filter \mu(z) = (1 - a^2) / (1 - a z^{-1})^2 and the latter filtered p times by a correction filter of response
2.3 Filter-bank analysis
Filter-bank analysis is a classical spectral analysis technique which consists of representing the signal spectrum by the log-energies at the output of a filter-bank, where the filters are overlapping band-pass filters spread along the frequency axis. This representation gives a rough approximation of the signal spectral shape while smoothing out the harmonic structure, if any. When using variable resolution analysis, the central frequencies of the filters are chosen so as to be evenly spread on the warped axis, and all filters share the same bandwidth on the warped axis. The same applies to MEL frequency warping, a very popular warping in speech analysis which mimics the spectral resolution of the human ear. The MEL warping is approximated by mel(f) = 2595 \log_{10}(1 + f / 700).
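The MEL approximation above, and the placement of filter centers evenly on the warped axis, can be sketched as follows. The function names and the choice of excluding the two band edges from the returned centers are assumptions for illustration.

```python
import math


def hz_to_mel(f):
    """mel(f) = 2595 log10(1 + f / 700), with f in Hz."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def center_frequencies(n_filters, f_min, f_max):
    """Center frequencies in Hz, evenly spaced on the mel axis strictly
    between f_min and f_max (the band edges themselves are not returned)."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + (i + 1) * step) for i in range(n_filters)]
```

Because the spacing is uniform in mel, the centers returned by `center_frequencies(24, 0.0, 8000.0)` are closer together at low frequencies than at high frequencies when expressed in Hz.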
SPro provides an implementation of filter-bank analysis with triangular filters applied to the FFT modulus, as depicted below.
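A minimal sketch of triangular filtering on an FFT modulus is given below. It assumes the filter edges are supplied as strictly increasing FFT bin indices, with each filter peaking at the next filter's lower edge; the function name, the `edges` layout and the energy floor are hypothetical choices and do not describe SPro's internal implementation.

```python
import math


def triangular_filterbank(magnitude, edges, floor=1e-10):
    """Log-energies of triangular filters applied to an FFT magnitude spectrum.

    `edges` holds n_filters + 2 strictly increasing bin indices: filter j rises
    from edges[j] to its peak at edges[j + 1] and falls to zero at edges[j + 2].
    """
    energies = []
    for j in range(len(edges) - 2):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
        acc = 0.0
        for b in range(lo, hi + 1):
            if b <= mid:
                weight = (b - lo) / (mid - lo)   # rising slope, 0 at lo, 1 at mid
            else:
                weight = (hi - b) / (hi - mid)   # falling slope, 0 at hi
            acc += weight * magnitude[b]
        # Floor the energy so that the logarithm is always defined.
        energies.append(math.log(max(acc, floor)))
    return energies
```

Consecutive filters overlap by half their support, which is what smooths out the harmonic structure while keeping the overall spectral shape.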
2.4 Linear predictive analysis
Linear prediction is a popular speech coding analysis method which relies on a source/filter model of the speech production process. The vocal tract is modeled by an all-pole filter of order p whose response is given by
The idea of the resolution algorithm is to iteratively estimate the prediction coefficients for each prediction order until the required order is reached. Assuming the prediction coefficients for order n-1 are known and yield a prediction error e_{n-1}, the estimation of the coefficients for order n relies on the n-th reflection coefficient defined as
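The order-recursive scheme described above has the general shape of the classical Levinson-Durbin recursion, sketched below. The sign convention (error filter A(z) = 1 + \sum_i a_i z^{-i}) and the function name are assumptions; other references use the opposite sign for a_i and k_n.

```python
def levinson_durbin(r, order):
    """Solve the normal equations from the auto-correlation values r[0..order].

    Returns (a, k, err): predictor coefficients a[0..order] with a[0] = 1,
    reflection coefficients k[1..order] (k[0] is unused), and the final
    prediction error. Assumes r[0] > 0.
    """
    a = [1.0] + [0.0] * order
    k = [0.0] * (order + 1)
    err = r[0]
    for n in range(1, order + 1):
        # Correlation between the order n-1 prediction error and the lag-n sample.
        acc = r[n] + sum(a[i] * r[n - i] for i in range(1, n))
        k[n] = -acc / err
        # Update the order n-1 coefficients to order n.
        prev = a[:]
        for i in range(1, n):
            a[i] = prev[i] + k[n] * prev[n - i]
        a[n] = k[n]
        # The prediction error shrinks by a factor (1 - k_n^2) at each order.
        err *= 1.0 - k[n] * k[n]
    return a, k, err
```

Passing the generalized auto-correlation sequence instead of the traditional one leaves the recursion itself unchanged.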
For variable resolution, the generalized auto-correlation sequence is used instead of the traditional auto-correlation. See section 2.2 Variable resolution spectral analysis for details on the generalized auto-correlation.
The all-pole filter coefficients can be represented in several equivalent ways. First, the linear prediction coefficients a_i can be used directly. The reflection (or partial correlation) coefficients k_i \in ]-1,1[ used in the resolution algorithm can also be used to represent the filter. The log-area ratio, defined as
2.5 Cepstral analysis
Probably the most popular features for speech recognition, the cepstral coefficients can be derived both from the filter-bank and linear predictive analyses. From the theoretical point of view, the cepstrum is defined as the inverse Fourier transform of the logarithm of the Fourier transform modulus. Therefore, by keeping only the first few cepstral coefficients and setting the remaining coefficients to zero, it is possible to smooth the harmonic structure of the spectrum(3). Cepstral coefficients are therefore very convenient for representing the speech spectral envelope.
In practice, cepstral coefficients can be obtained from the filter-bank energies e_i via a discrete cosine transform (DCT) given by
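As a minimal sketch, assuming the transform is a type-II DCT with a \sqrt{2/N} scaling (both the scaling and the half-sample offset are assumptions, as is the function name), cepstral coefficients c_1 to c_M can be computed from N filter-bank log-energies as follows.

```python
import math


def cepstrum_from_filterbank(log_energies, n_ceps):
    """Return [c_1, ..., c_n_ceps] from the filter-bank log-energies using
    c_i = sqrt(2/N) * sum_j e_j * cos(pi * i * (j - 1/2) / N), j = 1..N."""
    n = len(log_energies)
    scale = math.sqrt(2.0 / n)
    ceps = []
    for i in range(1, n_ceps + 1):
        # enumerate() is zero-based, so (j + 0.5) here equals (j - 1/2) for j = 1..N.
        acc = sum(e * math.cos(math.pi * i * (j + 0.5) / n)
                  for j, e in enumerate(log_energies))
        ceps.append(scale * acc)
    return ceps
```

A flat set of log-energies yields coefficients that are all zero for i >= 1, which reflects the fact that these coefficients describe the shape of the spectral envelope rather than its overall level.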
Cepstral coefficients have rather different dynamics, the higher coefficients showing the smallest variances. It may sometimes be desirable to have a constant dynamic across coefficients for modeling purposes. One way to reduce these differences is liftering, which consists of applying a weight to each coefficient. The weight for the i'th coefficient is defined in a parametric way according to
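A minimal sketch of liftering, assuming the widely used sinusoidal lifter h_i = 1 + (L/2) \sin(i \pi / L) with a single parameter L; this specific form, the default L = 22, and the function name are assumptions for illustration.

```python
import math


def lifter(ceps, L=22):
    """Weight each cepstral coefficient, with ceps[0] taken to be c_1.

    Uses h_i = 1 + (L / 2) * sin(pi * i / L) (assumed form).
    """
    return [(1.0 + 0.5 * L * math.sin(math.pi * i / L)) * c
            for i, c in enumerate(ceps, start=1)]
```

With this form the weights grow with i up to i = L / 2, boosting the higher-order coefficients whose variances are the smallest.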
2.6 Deltas and normalization
Feature normalization can be used to reduce the mismatch between signals recorded in different conditions. In SPro, normalization consists of mean removal and, optionally, variance normalization. Cepstral mean subtraction (CMS) is probably the most popular compensation technique for convolutive distortions. In addition, variance normalization consists of normalizing the feature variance to one and is a rather popular technique in speaker recognition to deal with noise and channel mismatch. Normalization can be global or local. In the first case, the mean and standard deviation are computed globally, while in the second case they are computed on a window centered around the current time.
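The global variant described above can be sketched as follows. The function name and the list-of-frames data layout are assumptions for illustration; the local variant would compute the same statistics over a sliding window centered on each frame.

```python
import math


def normalize(features, variance=False):
    """Global mean removal, optionally followed by scaling to unit variance.

    `features` is a non-empty list of equal-length frames (lists of floats).
    A new list is returned; the input is left untouched.
    """
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    out = [[f[d] - mean[d] for d in range(dim)] for f in features]
    if variance:
        for d in range(dim):
            std = math.sqrt(sum(f[d] ** 2 for f in out) / n)
            if std > 0.0:  # constant dimensions are left at zero
                for f in out:
                    f[d] /= std
    return out
```

Calling `normalize(ceps)` corresponds to cepstral mean subtraction, and `normalize(ceps, variance=True)` adds variance normalization.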
To account for the dynamic nature of speech, it is possible to append the first and second order derivatives of the chosen features to the original feature vector. In SPro, the first order derivative of a feature y_i is approximated using a second-order Taylor expansion given by
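A minimal sketch, assuming the approximation reduces to the symmetric difference (y_i(t+1) - y_i(t-1)) / 2; this formula, the replication of the first and last frames at the boundaries, and the function name are assumptions and may differ from what SPro computes.

```python
def deltas(feats):
    """First-order derivative of each feature by central difference.

    `feats` is a list of equal-length frames; the first and last frames are
    replicated at the boundaries (assumed edge handling).
    """
    n = len(feats)
    out = []
    for t in range(n):
        prev = feats[max(t - 1, 0)]
        nxt = feats[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out
```

Acceleration features can then be obtained by applying the same operator twice, as in `deltas(deltas(feats))`, and appended to each frame alongside the static and delta features.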