[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
3.1 File formats: Waveform and feature file formats
3.2 Common options: Tools common options
3.3 I/O via stdin and stdout: Standard input, standard output and pipes
3.4 Extracting features: Feature extraction with SPro
3.5 Manipulating feature streams: The scopy utility for manipulating feature streams
3.1.1 Waveform streams: Supported input waveform file formats
3.1.2 Feature streams: Output feature file format
This section describes the file formats manipulated by SPro. Most SPro tools input signal from a waveform stream and output feature vectors to a feature stream.
Waveform streams are files which contain the signal samples, either in raw PCM format or in an encoded format to save disk space. Currently, SPro supports raw, mono, 16 bits/sample files as well as WAVE and, optionally, SPHERE(4) files. The SPHERE format is only supported if SPro has been compiled with the SPHERE library (`--with-sphere' in configure). Raw format (i.e. with no header) with an 8 kHz sample rate is the default assumed by SPro if not otherwise specified.
Waveforms are considered as streams by SPro and are read via an input buffer, which means they can be of arbitrary (even infinite) length. Even file formats for which the number of samples is known in advance from the header are not entirely loaded into memory. In particular, this mechanism makes it possible to read waveforms from the standard input even though the number of samples is not known in advance. One particularly interesting consequence is the possibility of piping the output of an external command into the input of a SPro command. For example, a pipe can be used to handle file formats which are not supported by SPro. The following line
madplay --left --output=raw:- foo.mp3 | sfbcep -f 11025 - foo.mfcc
decodes an MP3 file with madplay and pipes the raw samples into the sfbcep tool, assuming the sample rate of the MP3 file is 11,025 Hz.
A feature stream is a file containing feature vectors. The format used to store the feature vectors is specific to SPro and consists of a header followed by data. The header itself is divided into two parts: an optional variable length header and a compulsory fixed length header.
To avoid byte-order problems, binary parts of the feature streams, such as the fixed length header and the feature vectors, are always stored in little-endian format (Intel-like processor) and therefore must be swapped if read on a big-endian (Motorola-like processor) machine. Byte swapping is automatically taken care of when using the library functions to read SPro streams. See section 4. The SPro library, for details on SPro stream I/O functions.
The variable length header is an optional ASCII header containing `attribute = value' statements, starting with a `<header>' tag and ending with `</header>'. The following is a sample variable length header:
<header>
a_field = an arbitrary value; # a comment
date = Wed Jul 23 14:59:12 CEST 2003; # this is the date
snr = 20 dB; # SNR
</header>
A variable length header stored in a text file can be prepended to a feature stream with the cat or bcat command. For example, the command
bcat header.txt foo.mfcc > bar.mfcc
prepends the header in header.txt to the feature stream foo.mfcc and stores the result in bar.mfcc.
The compulsory fixed length header is a 10 byte binary header containing the feature vector dimension(5) (unsigned short = 2 bytes), a flag describing the content of the feature vector (long = 4 bytes) and the frame rate in Hz (float = 4 bytes). The feature stream description flag is actually a field of bits with the following meaning:
bit | letter | description
1 | `E' | feature vector contains log-energy
2 | `Z' | mean has been removed
3 | `N' | static log-energy has been suppressed (always with `E' and `D')
4 | `D' | feature vector contains delta coefficients
5 | `A' | feature vector contains delta-delta coefficients (always with `D')
6 | `R' | variance has been normalized (always with `Z')
Feature vectors, or data, are stored after the header in time ascending order. A feature vector is a binary vector of floats as illustrated in the following example:
+-----------------+---+-----------------+----+-----------------+---+
|     static      | E |      delta      | dE |   delta delta   |ddE|
+-----------------+---+-----------------+----+-----------------+---+
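The layout described above can be illustrated with a short Python sketch. It is not part of SPro: the function names are hypothetical, the stream is assumed to have no variable length header, the 4-byte flag is read as an unsigned integer, and "bit 1" of the description flag is assumed to be the least significant bit.

```python
import struct

# Letters of the description flag, from bit 1 to bit 6.
# Assumption: "bit 1" is the least significant bit of the 4-byte flag.
FLAG_LETTERS = "EZNDAR"


def parse_fixed_header(buf):
    """Decode the 10-byte fixed header: dimension, flag and frame rate."""
    # '<' means little-endian without padding: H = 2 bytes, I = 4 bytes, f = 4 bytes.
    dim, flag, rate = struct.unpack("<HIf", buf[:10])
    return dim, flag, rate


def flag_to_letters(flag):
    """Return the letters of the bits that are set in the description flag."""
    return "".join(c for i, c in enumerate(FLAG_LETTERS) if flag & (1 << i))


def read_frames(buf, dim):
    """Split the data following the fixed header into vectors of dim floats."""
    data = buf[10:]
    size = 4 * dim
    nframes = len(data) // size
    return [struct.unpack("<%df" % dim, data[i * size:(i + 1) * size])
            for i in range(nframes)]


# Build a tiny stream in memory: dimension 3, flags E and D, 100 frames/s, 2 frames.
stream = struct.pack("<HIf", 3, 0x01 | 0x08, 100.0)
stream += struct.pack("<6f", 1, 2, 3, 4, 5, 6)

dim, flag, rate = parse_fixed_header(stream)
print(dim, flag_to_letters(flag), rate, read_frames(stream, dim))
```

Because the format string starts with `<`, the same code gives identical results on little-endian and big-endian machines, which is the purpose of the fixed byte order. For real applications, the SPro library functions remain the reference way to read streams.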
Here is a list of options common to all (or most of) the tools. The options of the scopy feature manipulation tool differ slightly from the list below, since most of the options listed here are concerned with waveform processing.
3.2.1 I/O options: Common I/O options
3.2.2 Waveform framing options: Common frame blocking options
3.2.3 Feature vector options: Common feature vector extraction options
3.2.4 Miscellaneous options: More common options
The following options are used to control the waveform and feature I/Os:
-F, --format=str
Specify the input waveform file format, where str is one of `PCM16', `wave' or `sphere', the latter being possible only if SPro was compiled with the SPHERE library. The argument is case insensitive. Default value is `PCM16'.
-f, --sample-rate=f
Set the input waveform sample rate to f Hz for `PCM16' waveform files. This option is ignored for waveform file formats for which the sample rate is specified in the header. Default value is 8,000 Hz.
-x, --channel=n
-B, --swap
-I, --input-bufsize=n
Set the input buffer size to n kbytes. The smaller the input buffer, the more disk accesses are needed and the slower the program runs, so there is a trade-off between speed and memory. Default is 10 Mbytes.
-O, --output-bufsize=n
Set the output buffer size to n kbytes. Again, a compromise between speed and memory requirements is needed. However, one important point is that global processing such as mean subtraction, energy normalization and delta computation is done on a per-buffer basis (i.e. such processing is done only when the buffer is full or at the end of the stream, whichever comes first), which introduces some inconsistencies at the buffer boundaries(6). A small output buffer can therefore result in many boundary problems, and it is recommended not to reduce the output buffer size below a couple of thousand frames. Default is 10 Mbytes.
Waveform framing is driven by the following options:
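-k, --pre-emphasis=f
Set the pre-emphasis coefficient to f. Default is 0.95.
-l, --length=f
Set the frame length to f ms. Default is 20.0 ms.
-d, --shift=f
Set the shift between successive frames to f ms. Default is 10.0 ms.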
-w, --window=str
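To make the framing options concrete, the following Python sketch converts a frame length and shift given in milliseconds into sample counts and derives the number of complete frames in a waveform. The function name is hypothetical, and both the rounding of durations and the dropping of a trailing incomplete frame are assumptions, not documented SPro behavior.

```python
def frame_layout(num_samples, sample_rate=8000.0, length_ms=20.0, shift_ms=10.0):
    """Return (frame length in samples, frame shift in samples, number of frames)."""
    # Assumption: durations are rounded to the nearest whole sample.
    length = int(round(length_ms * sample_rate / 1000.0))
    shift = int(round(shift_ms * sample_rate / 1000.0))
    if num_samples < length:
        return length, shift, 0
    # Assumption: only complete frames are counted.
    nframes = 1 + (num_samples - length) // shift
    return length, shift, nframes


# One second of signal with the default sample rate, length and shift.
print(frame_layout(8000))
```

With the defaults (8 kHz, 20 ms, 10 ms), a frame spans 160 samples and successive frames start 80 samples apart, so adjacent frames overlap by half.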
The following options are used to control the content of the output feature vectors, enabling global normalizations and dynamic feature computation:
-Z, --cms
-R, --normalize
-L, --segment-length=n
-D, --delta
-A, --acceleration
-N, --no-static-energy
Last but not least, here are some very practical options (especially the second one):
-v, --verbose
-h, --help
-V, --version
Every SPro command requires that input and output files be explicitly specified. However, in keeping with the Unix philosophy, the special symbol `-' (dash) can be used as the input file to specify that input is to be read from stdin, or as the output file to specify that output should be directed to stdout.
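The use of standard input and output makes it possible to pipe SPro commands one after the other, or even with external programs. For example, the command
sfbcep foo.lin - | scopy -o ascii - -
computes cepstral features from the waveform foo.lin with sfbcep and pipes them into scopy, which writes them in ASCII text format to the standard output.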
3.4.1 Filter-bank analysis tools: Tools for filter-bank derived features
3.4.2 LPC analysis tools: Tools for linear prediction derived features
The tools sfbank and sfbcep are dedicated to filter-bank based speech analysis.
Filter-bank log-magnitude features: All about sfbank
Filter-bank cepstral features: All about sfbcep
Options: sfbank and sfbcep options
The first filter-bank analysis tool, sfbank, takes a waveform as input and outputs filter-bank magnitude features. For each frame, the FFT is performed on the windowed signal, possibly after zero padding, and the magnitude is computed before being integrated using a triangular filter-bank. See section 2.3 Filter-bank analysis, for mathematical details. To avoid numerical problems, a threshold is used to keep channel log-magnitudes non-negative. The signal bandwidth may be artificially limited by specifying lower and upper frequencies using the `--freq-min' and `--freq-max' options respectively. In this case, the central frequencies of the filter-bank channels are regularly spaced in the specified bandwidth. Even if frequency warping is used, the lower and upper frequencies are specified in the linear frequency domain, though, of course, the filters' central frequencies will be regularly spaced in the transformed domain. Both MEL and bilinear frequency warping are possible with sfbank.
First and second order derivatives can be appended to the filter-bank log-magnitude features using `--delta' and `--acceleration' respectively.
The second filter-bank analysis tool, sfbcep, takes a waveform as input and outputs filter-bank derived cepstral features. The filter-bank processing is similar to what is done in sfbank (see previous section). The cepstral coefficients are computed by applying a DCT to the filter-bank log-magnitudes and may then be liftered.
Optionally, the log-energy can be added to the feature vector. In sfbcep, the frame energy is calculated as the sum of the squared waveform samples after windowing. As for the magnitudes in the filter-bank, the log-energies are thresholded to keep them non-negative. The log-energies may be scaled to avoid differences between recordings.
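As an illustration of the cepstral step described above, the following Python sketch applies a DCT to a vector of filter-bank log-magnitudes. The function name is hypothetical, and the particular DCT variant (type II, scaled by sqrt(2/N), coefficient 0 not output) is a common convention assumed here; the exact formula used by sfbcep is the one given in section 2.3.

```python
import math


def cepstra(log_mags, num_ceps):
    """Cepstral coefficients 1..num_ceps from filter-bank log-magnitudes."""
    n = len(log_mags)
    if num_ceps > n:
        raise ValueError("num_ceps must not exceed the number of channels")
    # Assumption: DCT-II with sqrt(2/N) scaling, skipping coefficient 0.
    scale = math.sqrt(2.0 / n)
    return [
        scale * sum(e * math.cos(math.pi * i * (j + 0.5) / n)
                    for j, e in enumerate(log_mags))
        for i in range(1, num_ceps + 1)
    ]


# A flat log-spectrum has no spectral shape, so all coefficients vanish.
print(cepstra([2.0] * 8, 5))
```

The check on `num_ceps` mirrors the constraint that the number of cepstral coefficients must not exceed the number of filter-bank channels.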
Mean and variance normalization of the static cepstral coefficients can be specified with the global `--cms' and `--normalize' options but do not apply to log-energies. The normalizations can be global (default) or based on a sliding window whose length is specified with `--segment-length'.
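A minimal Python sketch of global mean and variance normalization follows. It is only an illustration under stated assumptions: the function name is hypothetical, the static coefficients are assumed to be the first `num_static` components of each vector (so that a trailing log-energy is left untouched), and the variance is estimated by dividing by the number of frames.

```python
import math


def normalize(frames, num_static, variance=False):
    """Remove the mean (and optionally normalize the variance) of the
    first num_static components, computed over all frames."""
    n = len(frames)
    out = [list(f) for f in frames]
    for j in range(num_static):
        col = [f[j] for f in frames]
        mean = sum(col) / n
        # Assumption: population variance (division by n, not n - 1).
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        for row in out:
            row[j] -= mean
            if variance and std > 0.0:
                row[j] /= std
    return out


# Two frames of one static coefficient followed by a log-energy.
print(normalize([[1.0, 10.0], [3.0, 20.0]], num_static=1, variance=True))
```

A sliding-window version would compute `mean` and `std` over the frames surrounding each frame instead of over the whole stream.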
Finally, first and second order derivatives of the cepstral coefficients and of the log-energies can be appended to the feature vectors. When using delta features, the absolute log-energy can be suppressed using the `--no-static-energy' option.
The following options are available for both sfbank and sfbcep.
-n, --num-filters=n
-a, --alpha=f
Set the bilinear frequency warping parameter to f (f must be between 0 and 1). This option is incompatible with `--mel' and will be overridden by the latter. Default is no warping.
-m, --mel
-i, --freq-min=f
Set the lower frequency bound to f Hz. Default is no band limiting.
-u, --freq-max=f
Set the upper frequency bound to f Hz. Default is no band limiting.
-b, --fft-length=n
Set the FFT length to n samples. The FFT length must be a power of two and greater than or equal to the number of samples in a frame. If the FFT length is greater, the windowed frame samples are padded with zeroes before running the Fourier transform.
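The constraint on the FFT length can be checked with a few lines of Python. The sketch below, with hypothetical function names, finds the smallest FFT length that satisfies the rule for a given frame size and pads a frame accordingly; it says nothing about the value SPro chooses when the option is not given.

```python
def smallest_fft_length(frame_samples):
    """Smallest power of two greater than or equal to frame_samples."""
    n = 1
    while n < frame_samples:
        n *= 2
    return n


def zero_pad(frame, fft_length):
    """Append zeroes to a windowed frame up to fft_length samples."""
    if fft_length < len(frame):
        raise ValueError("FFT length must be at least the frame length")
    return list(frame) + [0.0] * (fft_length - len(frame))


# A 20 ms frame at 8 kHz has 160 samples and needs at least a 256-point FFT.
n = smallest_fft_length(160)
print(n, len(zero_pad([1.0] * 160, n)))
```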
The following options are also available for sfbcep.
-p, --num-ceps=n
Set the number of output cepstral coefficients to n. n must be less than or equal to the number of channels in the filter bank. Default is 12.
-r, --lifter=n
Set the liftering value to n. Default is no liftering.
-e, --energy
-s, --scale-energy=f
sfbank supports the `--delta' and `--acceleration' options. In addition, sfbcep also supports the `--cms' and `--normalize' options. See section 3.2 Common options, for a description of these options and for additional ones.
SPro provides two different tools, slpc and slpcep, for linear predictive analysis of speech signals.
Linear prediction coefficients: All about slpc
Linear prediction cepstrum: All about slpcep
Options: slpc and slpcep options
The tool slpc takes a waveform as input and outputs linear prediction derived features. For each frame, the signal is windowed after pre-emphasis and the generalized correlation is computed and further used to estimate the reflection and the prediction coefficients which can, in turn, be transformed into log area ratios or line spectrum frequencies. See section 4.7.1 Linear prediction, for mathematical details. The default is to output the linear prediction coefficients; however, reflection coefficients can be obtained with the `--parcor' option, log-area ratios with the `--lar' option and line spectrum pairs with the `--lsp' option.
Optionally, the log-energy can be added to the feature vector. In slpc, the log-energy is taken as the linear prediction filter gain, which is also the variance of the prediction error, and thresholded to be non-negative. The log-energies may be scaled to avoid differences between recordings using the `--scale-energy' option.
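The relation between correlation, reflection coefficients, prediction coefficients and prediction error can be illustrated with the classical Levinson-Durbin recursion. The Python sketch below is a textbook version, not the slpc implementation: it works on a plain autocorrelation sequence rather than the generalized correlation, the function name is hypothetical, and the sign convention (error filter A(z) = 1 + a1 z^-1 + ... + ap z^-p) is an assumption.

```python
def levinson_durbin(r, order):
    """Return (prediction coefficients a[0..order], reflection coefficients,
    prediction error variance) from autocorrelation values r[0..order]."""
    a = [1.0] + [0.0] * order
    refl = []
    err = r[0]
    for i in range(1, order + 1):
        # Correlation between the current prediction error and lag i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        refl.append(k)
        # Update the coefficients of orders 1..i-1, then append the new one.
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        # The error variance shrinks by (1 - k^2) at each order.
        err *= 1.0 - k * k
    return a, refl, err


# Autocorrelation of a first-order process: the second-order term is zero.
print(levinson_durbin([1.0, 0.5, 0.25], 2))
```

The last returned value is the prediction error variance, which is the quantity the text above refers to as the linear prediction filter gain.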
The program slpcep takes a waveform as input and outputs cepstral coefficients derived from the linear prediction filter coefficients. The linear prediction processing steps are as in slpc (see previous section) and cepstral coefficients are computed from the linear prediction coefficients using the recursion previously described. The required number of cepstral coefficients must be less than or equal to the prediction order.
As for slpc, the log-energy, taken as the gain of the linear prediction filter, can be added to the feature vectors.
Mean and variance normalization of the static cepstral coefficients can be specified with the global `--cms' and `--normalize' options but do not apply to log-energies. The normalizations can be global (default) or based on a sliding window whose length is specified with `--segment-length'.
Finally, first and second order derivatives of the cepstral coefficients and of the log-energies can be appended to the feature vectors. When using delta features, the absolute log-energy can be suppressed using the `--no-static-energy' option.
The following options are available for both slpc and slpcep.
-n, --order=n
-a, --alpha=f
Set the frequency warping parameter to f (f must be between 0 and 1). Default is no warping.
-r, --parcor
-g, --lar
-p, --lsp
-e, --energy
-s, --scale-energy=f
The following options are also available for slpcep.
-p, --num-ceps=n
Set the number of output cepstral coefficients to n. n must be less than or equal to the prediction order. Default is 12.
-r, --lifter=n
Set the liftering value to n. Default is no liftering.
Also, slpcep supports the `--cms' and `--normalize' normalization options as well as `--delta' and `--acceleration'. See section 3.2 Common options, for a description of these options and for additional ones.
SPro provides a tool, scopy, for manipulating feature streams. More than a mere copy tool, scopy can also normalize features, add dynamic features, scale the features, apply a linear transformation to the feature vectors and extract some components of the feature vector. All of these operations are detailed below. In addition, scopy can import feature files from previous SPro releases, export files to alien formats such as HTK, or display the content of an SPro feature file in text format.
3.5.1 Operations on feature streams: Manipulating feature streams with scopy
3.5.2 Exporting features: Exporting features to alien formats with scopy
3.5.3 Importing from a previous SPro release: Compatibility questions
3.5.4 Copy options: scopy options
As mentioned in the introduction, scopy may be used for feature normalization, dynamic feature computation, multiplicative scaling, linear transformation and component extraction.
The first two operations, i.e. normalization and dynamic feature computation, are actually done at once when loading the input features. If normalization is specified, the static coefficients, not including energy, are normalized before delta and acceleration features are computed. If dynamic features are used, the static log-energy can be discarded using `--no-static-energy'. As in all the feature extraction tools, normalization is either global or based on a sliding window, depending on whether `--segment-length' was specified or not.
Multiplicative scaling is a simple operation which consists in multiplying every component of every feature vector by a scaling factor. This is sometimes used to reduce the variance of features with a high dynamic range in order to avoid numerical problems when computing a linear transformation for those features or when doing some modeling.
A linear transformation matrix can be specified using `--transform' to project the input feature vectors according to y'(t) = A z(t), where y'(t) is the transformed vector for frame t and z(t) is a column vector containing the input feature frame y(t) plus possibly some context frames(7). For example, assuming a context size k, z(t) will be the concatenation of input feature vectors y(t-k) to y(t+k). If m is the input feature dimension, possibly after adding the dynamic features if requested, and n the output dimension, the transformation matrix will have nrows = n rows and ncols = (2 k + 1) * m columns. The matrix A is stored in a text file with the following syntax
nrows ncols nsplice
A[1][1] A[1][2] ......... A[1][ncols]
.........
A[nrows][1] ......... A[nrows][ncols]
where nsplice is the context size.
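The projection y'(t) = A z(t) can be sketched in Python as follows. The function names are hypothetical, and the treatment of the first and last frames, where the context reaches outside the stream, is an assumption: the sketch simply repeats the edge frame, which may differ from what scopy does.

```python
def splice(frames, t, k):
    """Concatenate frames t-k .. t+k into the context vector z(t)."""
    last = len(frames) - 1
    z = []
    for u in range(t - k, t + k + 1):
        # Assumption: indices outside the stream are clamped to the edge frames.
        z.extend(frames[min(max(u, 0), last)])
    return z


def transform(frames, matrix, k):
    """Apply y'(t) = A z(t) to every frame; matrix has (2k + 1) * m columns."""
    out = []
    for t in range(len(frames)):
        z = splice(frames, t, k)
        out.append([sum(a * x for a, x in zip(row, z)) for row in matrix])
    return out


# One-dimensional features (m = 1), context size k = 1, one output dimension
# (n = 1): the 1 x 3 matrix sums each frame with its two neighbors.
print(transform([[1.0], [2.0], [3.0]], [[1.0, 1.0, 1.0]], 1))
```

With k = 0 the context vector reduces to the frame itself and the operation is a plain matrix product.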
Component extraction consists of extracting some components of the feature vectors. The extraction pattern is specified using the `--extract=str' option where str is a comma separated list of components to keep. The latter are specified either as a single component index or as an index range using a dash (`-'). Component indices start at 1. For example, the command
scopy --extract=1-12,25-36 foo.prm bar.prm
keeps components 1 to 12 and 25 to 36 of each feature vector in foo.prm and writes the result to bar.prm.
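The extraction pattern syntax can be illustrated with a small Python parser. This is a sketch of the syntax as described above, with hypothetical function names; it is not the parser used by scopy and its error handling is deliberately minimal.

```python
def parse_extract(pattern):
    """Turn a pattern such as '1-12,25-36' into a list of 0-based indices."""
    indices = []
    for item in pattern.split(","):
        first, sep, last = item.partition("-")
        lo = int(first)
        hi = int(last) if sep else lo
        if lo < 1 or hi < lo:
            raise ValueError("invalid component range: %r" % item)
        # Component indices start at 1 in the pattern, at 0 in Python.
        indices.extend(range(lo - 1, hi))
    return indices


def extract(vector, pattern):
    """Keep only the components of vector selected by pattern."""
    return [vector[i] for i in parse_extract(pattern)]


print(len(parse_extract("1-12,25-36")))
print(extract(list(range(1, 11)), "1-3,7"))
```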
When performing either linear transformation or component extraction, the content of the resulting feature vector can no longer be described using a feature description flag. Indeed, stating whether a vector has delta features after a linear transformation makes no sense. For this reason, the output stream description flag is arbitrarily set to zero if at least one of these operations is specified.
If several operations are specified, they are applied in the order in which they are listed above. Therefore, delta coefficients are computed before the linear transformation if both are specified. For now, there is unfortunately no direct and easy way to change the order of these operations. In particular, it is not possible to add delta coefficients after a linear transformation, which would be a reasonable thing to do. The easiest, though CPU consuming, way to change the processing order is to use scopy several times, possibly with pipes. For example, the line
scopy --transform=pca.mat foo.prm - | scopy -ZD - bar.prm
will first apply the linear transformation in pca.mat to the features in foo.prm (first scopy) and then remove the mean of the static features before adding the delta features and store the result in `bar.prm' (second scopy).
Exporting feature streams to alien formats is also possible with scopy. Currently, three alien formats are supported, namely HTK(8), Sirocco(9) and ASCII text format.
Export to HTK and Sirocco file formats is only possible on seekable streams, i.e. regular files on which the C function fseek works. The reason for this constraint is that those formats include the number of frames in the header. Since the number of frames is not in the SPro header, scopy uses fseek to seek to the end of the input feature stream in order to determine the number of frames. As a consequence, it is not possible to export to one of these alien formats when reading from a pipe. On the other hand, no seek is necessary in the output file, so the output of scopy can be piped into another command. This is particularly useful with HTK, where setting the environment variable HPARMFILTER to `scopy -o HTK $ -' makes it possible to read SPro files directly with HTK. See section "Input/Output via Pipes and Networks" in the HTK 3.2 book for details.
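The need for a seekable input can be made concrete with a short Python sketch that derives the number of frames from the size of a stream. It is an illustration under assumptions, not the scopy code: the function name is hypothetical, the stream is assumed to have no variable length header, and every frame is assumed to hold exactly dim 4-byte floats.

```python
import io
import os
import struct


def count_frames(stream):
    """Number of frames in a seekable SPro stream without variable header."""
    dim, _flag, _rate = struct.unpack("<HIf", stream.read(10))
    data_start = stream.tell()
    # Seeking to the end is what fails on a pipe.
    end = stream.seek(0, os.SEEK_END)
    stream.seek(data_start)
    return (end - data_start) // (4 * dim)


# In-memory stream: dimension 2, empty flag, 100 frames/s, 5 frames.
demo = io.BytesIO(struct.pack("<HIf", 2, 0, 100.0) + struct.pack("<10f", *range(10)))
print(count_frames(demo))
```

On a non-seekable input such as a pipe, the call to `seek` raises an error, which corresponds to the restriction described above.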
Export to ASCII is useful to list the content of a feature stream in an (almost) human-readable way. In particular, the ASCII output can be combined with the `--info' option, which gives information about the content of the stream. This option is also useful to visualize the different operations performed on the input feature streams and their order. For example, the command
scopy -i -ZDA -t xxx.mat -x 1-3,7 -z foo.prm -
produces the following output:
sample_rate = 100.000000
input: dim=12 (<nil>)
convert: dim=36 (ZDA)
transform: dim=10 (xxx.mat)
extract: dim=4 (1-3,7)
As mentioned in 3.1 File formats, SPro feature files are always in little-endian byte order. In contrast, exported files are written in the machine's natural byte order. As both HTK and Sirocco expect files in big-endian byte order(10), the option `--swap' can be used to swap the byte order before writing the file in alien file formats. This option is ignored if the output file format is ASCII (obviously) or SPro.
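What byte swapping means for the feature values can be shown with a few lines of Python. The sketch below, with a hypothetical function name, only reverses the bytes of 4-byte floats; alien file headers also contain integer fields of other sizes, which are not handled here.

```python
import struct


def swap_floats(data):
    """Reverse the byte order of every 4-byte value in data."""
    if len(data) % 4:
        raise ValueError("data length must be a multiple of 4")
    return b"".join(data[i:i + 4][::-1] for i in range(0, len(data), 4))


# Two floats written in little-endian order, then swapped to big-endian.
little = struct.pack("<2f", 1.5, -2.0)
big = swap_floats(little)
print(little.hex(), big.hex())
print(struct.unpack(">2f", big))
```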
The option `--compatibility' makes it possible to read feature files from previous versions of SPro. When this option is used, the entire feature file is loaded into memory at once, as was the case in previous versions. Using this option with large files may therefore be quite memory consuming (and slow as well). All the processing capabilities (normalization, dynamic features, linear transform, etc.) remain available when importing files from previous SPro versions.
The following options are available in scopy:
-c, --compatibility
-I, --bufsize=n
-i, --info
-z, --suppress
-B, --swap
-o, --output-format=str
Set the output format, where str is one of ascii, htk or sirocco. Default is the native SPro format.
-m, --scale=f
Multiply every feature component by f.
-t, --transform=str
Apply the linear transformation stored in file str.
-x, --extract=str
Extract components, where str is a comma separated list of components to extract. The components are specified either as a single index or as a range of indices using a dash (`-'). The index of the first component is 1.
-s, --start=n
Start copying at frame number n. Frame numbers start at zero. Default is 0.
-e, --end=n
Stop copying at frame number n (included). Frame numbers start at zero. Default is to copy to the end of the stream.