Abstract:
Single-channel source separation is a fairly recent problem that is attracting steadily growing interest in the scientific community. However, it is still far from being solved and, moreover, cannot be solved in full generality. Because the problem is highly underdetermined, the main difficulty is that strong prior knowledge about the sources is required in order to separate them. For a large class of existing separation methods, this knowledge is expressed through statistical source models, notably Gaussian Mixture Models (GMMs), which are learned from training examples.
The aim of this work is to study separation methods based on statistical models in general, and then to apply them to the particular problem of separating the singing voice from the background music in mono recordings of songs. This problem is quite difficult and has not yet been studied much. Satisfactory solutions to it would be very useful for simplifying the automatic analysis of song content, for example in the context of audio indexing.
Existing model-based methods give satisfactory separation performance, provided that the source models accurately match the statistical properties of the mixed signals. However, because representative training data and computational resources are limited, it is not always possible to build and use such models in practice.
To overcome this problem, this work proposes an adaptation scheme which, for each recording, adjusts the source models to the properties of the signals observed in the mix. A general formalism for source model adaptation is developed. As is done, for instance, in speaker (or channel) adaptation for speech recognition, this formalism is introduced in terms of a Maximum A Posteriori (MAP) adaptation criterion. It is then shown how to optimize this criterion using the EM algorithm at different levels of generality.
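As a rough illustration of such a criterion (the notation here is an assumption and is not taken from this work): writing $X$ for the observed mix, $\lambda$ for the source model parameters, and $p(\lambda)$ for a prior built around the models learned from training data, a MAP adaptation criterion can take the generic form

\[
\hat{\lambda} = \arg\max_{\lambda} \; p(X \mid \lambda)\, p(\lambda).
\]

With a flat prior this reduces to maximum likelihood estimation on the mix alone, while a sharply peaked prior keeps the adapted models close to the trained ones.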
This adaptation formalism is then applied, in some particular forms, to the voice/music separation task. The results show that, for this task, an adaptation scheme can significantly improve the separation performance (by at least 5 dB) compared with non-adapted models. In addition, it is observed that separating the singing voice simplifies the estimation of its fundamental frequency (pitch), and that model adaptation further improves this result.