The recent use of Voice over IP [32], IP telephony [60], and Voice and Telephony over ATM [152] has created a strong need to assess audio quality in real time when audio is transmitted over a packet network. Moreover, the number of network applications that require audio quality assessment is growing rapidly. Despite the importance of this problem, few methods are available, and the few contributions in this field concentrate mainly on differentiating encoding algorithms, without taking network parameters into account.
Several objective speech quality measures have been proposed in the literature. The most commonly used ones are the Signal-to-Noise Ratio (SNR), the Segmental SNR (SNRseg), the Perceptual Speech Quality Measure (PSQM) [15], Measuring Normalizing Blocks (MNB) [139], the ITU E-model [68], the Enhanced Modified Bark Spectral Distortion (EMBSD) [155], the Perceptual Analysis Measurement System (PAMS) [112] and PSQM+ [13]. The main purpose of these measures is to evaluate the quality of speech signals distorted by encoding impairments. Some of them work well for this case, but when they are used to evaluate the quality of speech signals impaired by both encoding and network transmission, their performance degrades considerably (the correlation with subjective measures becomes poor). In addition, most of these measures are not suitable for real-time quality evaluation, as they are computationally intensive and require both the original signal and the processed one.
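To make this last point concrete, the sketch below (in Python) computes one of the simplest of these measures, the segmental SNR. The frame length and the clamping range are common choices rather than values taken from the references above; note that the measure needs the original and the degraded signals to be available and time-aligned, which illustrates why such comparison-based measures are ill-suited to real-time, in-network assessment.

import numpy as np

def segmental_snr(original, degraded, frame_len=256, eps=1e-10):
    """Minimal segmental SNR (SNRseg) sketch.

    Both signals are assumed to be time-aligned 1-D float arrays of
    equal length; frame_len and the [-10, 35] dB clamp are common
    choices, not values prescribed in this Chapter.
    """
    n_frames = len(original) // frame_len
    snrs = []
    for i in range(n_frames):
        s = original[i * frame_len:(i + 1) * frame_len]
        d = degraded[i * frame_len:(i + 1) * frame_len]
        noise = np.sum((s - d) ** 2) + eps
        snr = 10.0 * np.log10(np.sum(s ** 2) / noise + eps)
        snrs.append(np.clip(snr, -10.0, 35.0))  # usual clamping range
    return float(np.mean(snrs))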
Transmitting audio signals over a packet network falls into one of the following categories: a) unidirectional sessions, consisting of a sender that emits frames of audio signals and a receiver that plays back these frames (e.g. audio streaming [103]); b) bi-directional sessions, in which both the sender and the receiver can emit and play back speech frames, providing interactivity between the two ends; c) multi-party conferences, in which more than two ends contribute to the same session. In this Chapter, we are only concerned with category a).
The ``Quality-Affecting'' parameters of audio transmitted over a packet network can be classified as follows (a small data-structure sketch summarizing them is given after the list):
- Encoding and Compression Parameters: these are due to the impairments resulting from the encoding or compression of the original audio/speech signals (see Section 3.5 for details about speech and audio compression and codecs). They can be classified into the type of codec used (e.g. PCM, GSM, ADPCM, etc.), the sampling rate, the packetization interval of the signal and the number of bits per sample [28,16,43,80,86].
- Network Parameters: these are due to the transmission of speech streams over the packet network. The best-known parameters are the loss rate due to network congestion, the arrival jitter or delay variation, the loss distribution, and the end-to-end delay. Furthermore, the error concealment techniques employed (e.g. silence, noise, repetition, and waveform substitution for lost packets) [28,38,80] can affect audio quality (see Section 3.3.2).
- Other Parameters: such as echo (which may occur because of a long end-to-end delay), the crosstalk effect (when two or more people start talking at the same time), or the number of participating sources. These effects occur in bi-directional sessions or multi-party conferences [40].
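The following minimal sketch (in Python) gathers these quality-affecting parameters into a single record. The field names, types and example values are illustrative only; the exact parameters and ranges retained in this work are selected in Sections 5.2 and 5.3.

from dataclasses import dataclass

@dataclass
class QualityAffectingParams:
    """Illustrative record of quality-affecting parameters.

    Field names are hypothetical; they merely mirror the classification
    given in the list above.
    """
    # Encoding and compression parameters
    codec: str                # e.g. "PCM", "GSM", "ADPCM"
    sample_rate_hz: int       # sampling rate of the source signal
    packetization_ms: int     # packetization interval
    bits_per_sample: int
    # Network parameters
    loss_rate: float          # packet loss rate due to congestion
    mean_loss_burst: float    # loss distribution (e.g. mean burst size)
    jitter_ms: float          # arrival jitter / delay variation
    delay_ms: float           # end-to-end delay
    concealment: str          # e.g. "silence", "repetition"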
In the previous Chapter, we presented a new method to evaluate the quality of multimedia streams transmitted in real time over packet networks in general, regardless of the media type. In this Chapter, we aim to validate our approach in the case of speech quality assessment.
As with any problem to be solved by a NN, the development of a tool to evaluate speech quality in real time consists of two steps: the first is to prepare a set of examples (input-output pairs), and the second is to train and test a suitable NN using these examples. For the first step, we have to identify the speech-quality-affecting parameters (which represent the inputs to the system). We then have to build a database consisting of a set of impaired speech samples, the distortion of each sample corresponding to selected values of the quality-affecting parameters. A subjective quality test must then be carried out to obtain the MOS of all the distorted speech signals (the quality level represents the desired output of the system). In another database, which we call the quality database, the values of the quality-affecting parameters and the corresponding quality scores are stored. This constitutes the set of input-output pairs, i.e. the examples. For the second step, a suitable NN architecture must be selected, and the NN has to be trained and tested on the resulting quality database.
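As a rough illustration of these two steps, the sketch below trains a generic feed-forward network (scikit-learn's MLPRegressor) on a toy quality database. The parameter set, the MOS values and the network hyper-parameters are all placeholders, and the library is merely a stand-in for whichever NN architecture is actually selected.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical quality database: each row holds the values of the
# quality-affecting parameters for one distorted speech sample, and
# y holds the MOS obtained from the subjective test (placeholders).
X = np.array([
    [0.00,  5.0,  20.0, 20],   # loss rate, jitter (ms), delay (ms), packetization (ms)
    [0.02, 10.0,  40.0, 20],
    [0.05, 20.0,  80.0, 40],
    [0.10, 30.0, 120.0, 40],
    # ... one row per sample in the quality database
])
y = np.array([4.5, 4.1, 3.2, 2.3])   # corresponding MOS scores

# A generic feed-forward network as a stand-in for the architecture
# actually selected; hidden-layer size and iteration count are
# arbitrary choices for this sketch.
nn = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
nn.fit(X, y)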
The goal is to use the NN to learn how human subjects evaluate the quality under given levels of both network and encoding distortion. In addition, as a well-trained NN generalizes well, it can be used to evaluate or predict the quality over wide ranges of all the input parameters. Therefore, it can evaluate in real time the quality for new values of the quality-affecting parameters, i.e. values not present in the quality database.
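Continuing the sketch above, the trained network can then score parameter combinations that were never part of the quality database, which is what makes real-time, in-service assessment possible (the new parameter values below are, again, made up):

# Parameter values not present in the toy quality database
new_conditions = np.array([[0.03, 15.0, 60.0, 20]])
predicted_mos = nn.predict(new_conditions)[0]
print(f"predicted MOS: {predicted_mos:.2f}")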
This chapter is organized as follows:
We describe the experiment we used to identify the values of network
parameters and their ranges in Section 5.2. In
Section 5.3, we present the subjective quality
tests we did for different languages. We present the overall
procedure used to generate the different speech databases for the MOS
experiment, based on a specific network testbed. The results obtained
and the validation of our methodology are given in
Section 5.4. In
Section 5.5, we analyze the performance of some of the
existing objective speech quality measures. Finally, in Section 5.6 we provide the conclusions of this Chapter.