Ivan MAGRIN-CHAGNOLLEAU - Joachim WILKE - Frédéric BIMBOT
Télécom Paris (E.N.S.T.), Dépt. Signal - C.N.R.S., URA 820
46, rue Barrault - 75634 Paris cedex 13 - FRANCE - European Union
email: ivan@sig.enst.fr and bimbot@sig.enst.fr
Auto-Regressive (AR) Vector Models have been a significant subject of interest in the field of Speaker Recognition [1] [2] [3] [4] [5] [6] [7]. Whereas the idea of modeling a speaker by an AR-vector model estimated on sequences of speech frames is common to these works, the way of measuring the similarity between two speaker models is addressed very differently. Moreover, the use of AR-vector models is often motivated by the belief that such an approach is an efficient way to extract dynamic speaker characteristics, as opposed to static characteristics such as the distribution of speech frame parameters.
In this paper we report on a systematic investigation of similarity measures between AR-vector speaker models obtained as simple combinations of canonical quantities. We also design a protocol to examine the role of dynamic information in the performance of the AR-vector approach: we destroy the natural time order of speech frames by shuffling them randomly, and we evaluate the AR-vector approach on these temporally disorganised data. We finally compare both of these approaches to a (single) Gaussian Model [8] [9] [10] [11].
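The shuffling protocol described above can be illustrated with a minimal NumPy sketch. The function names, the seed handling and the toy random-walk data are assumptions made for illustration only; they are not taken from the experiments. The sketch shows that a random permutation of the frames leaves the mean and the (lag-0) covariance unchanged, while it destroys the correlation between consecutive frames.

```python
import numpy as np

def shuffle_frames(frames, seed=0):
    """Return a copy of `frames` (n_frames x p) with its rows in random order."""
    rng = np.random.default_rng(seed)
    return frames[rng.permutation(len(frames))]

def lag_covariance(frames, lag):
    """Average outer product of centred frames that are `lag` steps apart."""
    centred = frames - frames.mean(axis=0)
    n = len(centred) - lag
    return centred[lag:].T @ centred[:n] / n

# Toy sequence with strong frame-to-frame correlation (a random walk);
# this is illustrative data, not speech.
rng = np.random.default_rng(1)
frames = np.cumsum(rng.standard_normal((2000, 3)), axis=0)
shuffled = shuffle_frames(frames)

# Static statistics (mean, lag-0 covariance) are preserved by shuffling,
# whereas the lag-1 covariance, which carries the dynamics, collapses.
same_mean = np.allclose(frames.mean(axis=0), shuffled.mean(axis=0))
same_cov = np.allclose(lag_covariance(frames, 0), lag_covariance(shuffled, 0))
```

Any model fitted on the shuffled sequence can therefore only exploit static information, which is the purpose of the protocol.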
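Let $\{x_t\}_{1 \le t \le M}$ be a sequence of $p$-dimensional vectors. Let us define the centered vectors $x_t^* = x_t - \bar{x}$, where $\bar{x}$ is the mean vector of $\{x_t\}$. Let us denote by $X_0$ the covariance matrix of $\{x_t\}$:

$$X_0 = \frac{1}{M} \sum_{t=1}^{M} x_t^* \, {x_t^*}^T$$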
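As a minimal sketch of the centring and covariance computation described above, the following NumPy function may help. The function name is hypothetical, and the normalisation by the number of frames (rather than by the number of frames minus one) is an assumption.

```python
import numpy as np

def centre_and_covariance(frames):
    """Return the mean vector, the centred vectors and the covariance matrix.

    `frames` has one p-dimensional vector per row (shape: n_frames x p).
    The covariance is normalised by n_frames, which is an assumption.
    """
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)                 # mean vector
    centred = frames - mean                    # centred vectors
    cov = centred.T @ centred / len(frames)    # p x p covariance matrix
    return mean, centred, cov

# Usage on toy data (illustrative only): 100 vectors of dimension 24.
toy = np.random.default_rng(0).standard_normal((100, 24))
mean, centred, cov = centre_and_covariance(toy)
```

The resulting matrix is symmetric and of size p x p; with this normalisation it coincides with `np.cov(frames.T, bias=True)`.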
We now consider two speakers $\mathcal{X}$ and $\mathcal{Y}$, and we present a general formalism for expressing similarity measures between their AR-vector models.
Two families of similarity measures are investigated :
We use the first 63 speakers of TIMIT [14] and
NTIMIT [15] for our experiments (19 females and 44 males).
Each of them has read 10 sentences.
The signal is sampled at 16 kHz, on 16 bits, on a linear amplitude
scale. NTIMIT is a telephone-channel version of TIMIT.
Each sentence is analysed as follows: for each speech token, the speech signal is kept in its entirety; it is decomposed into frames of 31.5 ms, with one frame every 10 ms and no pre-emphasis. A Hamming window is applied to each frame. Then the modulus of a 504-point Fourier Transform is computed, from which 24 Mel-scale triangular filter bank coefficients are extracted.
The spectral vectors (of dimension p = 24) are formed from the logarithm of each filter output. These analysis conditions are identical to those used in [11].
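The analysis chain above (31.5 ms frames every 10 ms, Hamming window, modulus of a 504-point Fourier Transform, 24 triangular Mel filters, logarithm) can be sketched in NumPy as follows. The exact filter bank design is not specified in the text: the Mel formula, the 0 to 8 kHz range and the flooring constant used below are assumptions, as are all function names.

```python
import numpy as np

SAMPLE_RATE = 16000   # Hz
FRAME_LEN = 504       # 31.5 ms at 16 kHz
FRAME_SHIFT = 160     # 10 ms at 16 kHz
N_FILTERS = 24

def hz_to_mel(f):
    # One common Mel-scale formula (assumed, not given in the text).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=N_FILTERS, n_fft=FRAME_LEN, sample_rate=SAMPLE_RATE):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)   # frequency of each FFT bin
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_filterbank_features(signal):
    """Return one 24-dimensional log filter bank vector per frame."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    idx = np.arange(FRAME_LEN)[None, :] + FRAME_SHIFT * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(FRAME_LEN)            # no pre-emphasis
    magnitude = np.abs(np.fft.rfft(frames, n=FRAME_LEN, axis=1))
    energies = magnitude @ mel_filterbank().T
    return np.log(np.maximum(energies, 1e-10))              # floor avoids log(0)

# Usage on one second of synthetic noise (illustrative only).
features = log_filterbank_features(np.random.default_rng(0).standard_normal(16000))
```

In this sketch, trailing samples that do not fill a complete frame are dropped; the signal must contain at least 504 samples.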
For the TIMIT database, all 24 coefficients of each vector are kept. For NTIMIT, 24-dimensional vectors are also extracted, but we keep only the first 17 coefficients, which correspond to the telephone bandwidth. Experiments are also made on "FTIMIT", obtained by taking the first 17 coefficients of the vectors extracted from TIMIT.
A common training/test protocol is used for all the experiments. It is described
in detail in [11] (as protocol "long-short").
Training material consists of 5 sentences (i.e. 14.4 s) which are concatenated into a single reference per speaker. Tests are carried out on 5 × 1 sentence per speaker (i.e. 3.2 s per sentence), which are tested separately. The total number of independent tests is therefore 63 × 5 = 315. The decision rule is the 1-nearest neighbour.
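The test protocol and the 1-nearest-neighbour decision can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the speaker "models" are toy vectors, and the Euclidean distance is a placeholder that is not one of the similarity measures studied in this paper.

```python
import numpy as np

def identify(test_model, reference_models, dissimilarity):
    """1-nearest-neighbour rule: return the speaker with the closest reference."""
    return min(reference_models,
               key=lambda spk: dissimilarity(reference_models[spk], test_model))

def error_rate(tests, reference_models, dissimilarity):
    """Percentage of (true_speaker, test_model) pairs that are misidentified."""
    errors = sum(identify(model, reference_models, dissimilarity) != speaker
                 for speaker, model in tests)
    return 100.0 * errors / len(tests)

def euclidean(reference, test):
    # Placeholder dissimilarity, NOT one of the measures studied in the paper.
    return float(np.linalg.norm(reference - test))

# Toy data: 63 speakers, one reference each, 5 separate tests per speaker.
rng = np.random.default_rng(0)
n_speakers, n_tests_per_speaker = 63, 5
references = {s: 10.0 * rng.standard_normal(24) for s in range(n_speakers)}
tests = [(s, references[s] + 0.01 * rng.standard_normal(24))
         for s in range(n_speakers) for _ in range(n_tests_per_speaker)]

n_tests = len(tests)                                # 63 x 5 independent tests
rate = error_rate(tests, references, euclidean)
```

With 315 independent tests, each misidentified test sentence changes the error rate by 100/315, i.e. about 0.32 percentage points.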
Results of the experiments are given by database (Tables 1, 2 and 3). Performance is reported in terms of closed-set speaker identification error rates (in %) on the test set for the canonical measures and various combined measures, in their asymmetric and their best symmetric form. For the symmetrised measures, a superscript indicates which symmetrisation the result corresponds to.
Table 1

| function f | a | […] | g | […] | […] | […] | a-g |
|---|---|---|---|---|---|---|---|
| AR-vector model - spectral frames in their natural time order | | | | | | | |
| fX(B/A); fY(A/B) | 16.8; 8.6 | 16.8; 8.6 | 16.2; 7.6 | 16.2; 7.6 | 19.1; 10.8 | 23.8; 19.4 | 22.2; 17.5 |
| symmetrised | 3.5 | 4.1 | 4.1 | 4.1 | 3.2 | 7.9 | 7.3 |
| fY/X(A); fX/Y(B) | 75.6; 51.4 | 75.6; 51.4 | 88.3; 73.0 | 88.3; 73.0 | 15.2; 34.3 | 7.6; 18.7 | 15.2; 14.6 |
| symmetrised | 6.0 | 4.8 | 12.4 | 4.8 | 5.4 | 7.0 | 6.0 |
| AR-vector model - spectral frames in a random time order | | | | | | | |
| fX'(B'/A'); fY'(A'/B') | 2.5; 56.5 | 2.5; 56.5 | 4.1; 58.1 | 4.1; 58.1 | 2.5; 56.2 | 4.1; 55.9 | 3.5; 54.6 |
| symmetrised | 3.5 | 3.5 | 5.7 | 5.7 | 2.5 | 4.1 | 4.1 |
| fY'/X'(A'); fX'/Y'(B') | 42.5; 45.4 | 42.5; 45.4 | 98.1; 82.9 | 98.1; 82.9 | 1.3; 22.9 | 1.0; 6.7 | 3.2; 8.9 |
| symmetrised | 4.8 | 2.2 | 46.7 | 12.7 | 2.9 | 1.0 | 1.6 |
| Gaussian model | | | | | | | |
| fYo /Xo(I); fXo /Yo(I) | 37.5; 47.0 | 37.5; 47.0 | 98.4; 98.4 | 98.4; 98.4 | 0.6; 7.9 | 0.6; 3.2 | 2.9; 6.4 |
| symmetrised | 3.8 | 1.3 | 97.1 | 99.4 | 1.0 | 0.6 | 1.0 |
Table 2

| function f | a | […] | g | […] | […] | […] | a-g |
|---|---|---|---|---|---|---|---|
| AR-vector model - spectral frames in their natural time order | | | | | | | |
| fX(B/A); fY(A/B) | 38.7; 30.2 | 38.7; 30.2 | 37.1; 29.5 | 37.1; 29.5 | 42.5; 35.2 | 51.1; 50.8 | 49.5; 49.5 |
| symmetrised | 24.8 | 25.1 | 24.8 | 24.4 | 26.3 | 35.6 | 33.3 |
| fY/X(A); fX/Y(B) | 93.3; 86.0 | 93.3; 86.0 | 96.5; 94.6 | 96.5; 94.6 | 44.1; 69.8 | 41.6; 39.1 | 49.2; 39.1 |
| symmetrised | 23.5 | 21.3 | 32.4 | 25.4 | 24.4 | 34.6 | 33.0 |
| AR-vector model - spectral frames in a random time order | | | | | | | |
| fX'(B'/A'); fY'(A'/B') | 35.9; 82.2 | 35.9; 82.2 | 36.8; 81.3 | 36.8; 81.3 | 32.4; 83.5 | 34.6; 82.2 | 34.3; 81.6 |
| symmetrised | 39.1 | 39.1 | 40.0 | 40.0 | 34.3 | 33.3 | 33.3 |
| fY'/X'(A'); fX'/Y'(B') | 78.7; 71.4 | 78.7; 71.4 | 98.4; 93.7 | 98.4; 93.7 | 15.9; 43.8 | 13.3; 21.6 | 20.3; 27.3 |
| symmetrised | 21.9 | 14.6 | 69.8 | 52.4 | 14.0 | 13.3 | 14.3 |
| Gaussian model | | | | | | | |
| fYo /Xo(I); fXo /Yo(I) | 77.1; 71.8 | 77.1; 71.8 | 98.4; 98.4 | 98.4; 98.4 | 14.6; 27.3 | 12.7; 17.1 | 20.3; 21.3 |
| symmetrised | 15.6 | 11.8 | 97.8 | 98.4 | 12.7 | 12.4 | 14.3 |
Table 3

| function f | a | […] | g | […] | […] | […] | a-g |
|---|---|---|---|---|---|---|---|
| AR-vector model - spectral frames in their natural time order | | | | | | | |
| fX(B/A); fY(A/B) | 71.8; 54.6 | 71.8; 54.6 | 67.3; 54.3 | 67.3; 54.3 | 78.1; 58.4 | 83.8; 69.5 | 82.9; 67.9 |
| symmetrised | 51.8 | 52.1 | 50.5 | 50.2 | 57.5 | 66.0 | 65.1 |
| fY/X(A); fX/Y(B) | 96.8; 92.4 | 96.8; 92.4 | 97.1; 95.6 | 97.1; 95.6 | 67.3; 88.9 | 66.0; 78.7 | 75.2; 76.8 |
| symmetrised | 61.9 | 56.5 | 68.3 | 53.0 | 59.7 | 63.2 | 66.4 |
| AR-vector model - spectral frames in a random time order | | | | | | | |
| fX'(B'/A'); fY'(A'/B') | 64.4; 92.1 | 64.1; 92.1 | 65.4; 91.8 | 65.4; 91.8 | 61.9; 92.4 | 64.8; 93.3 | 64.4; 93.0 |
| symmetrised | 65.4 | 65.1 | 67.9 | 68.3 | 62.2 | 64.4 | 64.1 |
| fY'/X'(A'); fX'/Y'(B') | 94.0; 94.3 | 94.0; 94.3 | 98.4; 97.5 | 98.4; 97.5 | 47.0; 86.4 | 46.0; 63.2 | 56.8; 77.1 |
| symmetrised | 61.9 | 52.4 | 88.3 | 72.4 | 50.2 | 44.1 | 48.6 |
| Gaussian model | | | | | | | |
| fYo /Xo(I); fXo /Yo(I) | 93.0; 94.6 | 93.0; 94.6 | 98.4; 98.4 | 98.4; 98.4 | 44.1; 75.9 | 42.5; 59.7 | 56.2; 73.3 |
| symmetrised | 58.1 | 49.8 | 97.8 | 98.4 | 47.6 | 44.1 | 49.2 |
The following observations can be made :
In our experiments, we did not succeed in obtaining better speaker identification results with an AR-vector-model-based measure than with a single Gaussian model classifier. This observation contradicts the results reported in [7], but the divergence may be due to differences in signal pre-processing and analysis.
Moreover, we obtained better overall performance with the AR-vector model on spectral frames in a random time order than on frames kept in their natural time order. Therefore, the role of dynamic speaker characteristics in the success of the AR-vector model can be questioned, as our results suggest that AR-vector models tend to extract, indirectly, speaker characteristics of a static nature.
Finally, the influence of symmetrisation can be crucial, but its theoretical basis remains to be understood.