Audio Segmentation

6:10 PM 3/1/2007, by Shi Yong

Annotated Bibliography

Aarts, R. M., and R. T. Dekkers. 1999. A real-time speech-music discriminator. J. Audio Eng. Soc., 47 (9):720-5.

This paper presents a real-time speech-music discrimination (SMD) system consisting of three parts: Filtering and Normalization, a Feature Extractor, and a Fuzzy Combiner. The input signal is normalized by a circuit made up of a band-pass filter, a band-stop filter, and a divider, and is then fed into the Feature Extractor. The speech and music features are extracted from the time-domain signal by a combined slope feature evaluation algorithm (only the speech feature evaluation is described). Finally, the speech and music features are combined by a simple adder to estimate the speech probability. A false-alarm probability of virtually zero is reported.

Alexandre, E., M. Rosa, L. Cuadra, and R. Gil-Pita. 2006. Application of Fisher Linear Discriminant Analysis to Speech/Music Classification. Paper read at the 120th Convention of Audio Engineering Society, at Paris, France.

For classification, the Fisher Linear Discriminant (FLD, related to LDA) is used. Its performance is compared with that of KNN, and the results are promising. For FLD, the best results are obtained using the MFCC and Voice2White features, with reported error probabilities of 4.09% and 4.91%. The paper also shows that very good results can be obtained with only one feature, which helps reduce the computational cost in real-time applications.
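
The two-class Fisher discriminant mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes exactly two features per clip, and the function names (fit_fld, predict) and the toy values are hypothetical. The projection direction is w = Sw^-1 (m1 - m0), where Sw is the within-class scatter, and the threshold is placed at the midpoint of the projected class means.

```python
def mean_vector(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def project(w, x):
    """Project a two-feature vector onto the discriminant direction."""
    return w[0] * x[0] + w[1] * x[1]


def scatter_2d(rows, mu):
    """2x2 scatter matrix sum((x - mu)(x - mu)^T) for two-feature rows."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - mu[0], r[1] - mu[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_fld(class0, class1):
    """Return (w, threshold) with w = Sw^-1 (m1 - m0)."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_2d(class0, m0), scatter_2d(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Assumed decision rule: midpoint of the two projected class means.
    threshold = 0.5 * (project(w, m0) + project(w, m1))
    return w, threshold


def predict(w, threshold, x):
    """Return 1 if x falls on the class1 side of the threshold, else 0."""
    return 1 if project(w, x) > threshold else 0
```

Because only a dot product and one comparison are needed at run time, a discriminant of this kind is cheap enough for real-time use.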

Brown, J. C. 1999. Computer identification of musical instruments using pattern recognition with cepstral coefficients as features. J. Acoust. Soc. Am.

This paper reports a musical instrument classification experiment on oboe and saxophone and provides computational details of the classifier. The goal is to classify unknown sounds into one of the two classes. Eighteen cepstral coefficients extracted from 23-ms frames were used as the feature vector. A Gaussian mixture model was used as the classifier; the k-means algorithm determined the cluster means and variances, while the number of clusters was set by an input parameter. The experiment showed that the machine identification results were very similar to the human results.
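
The k-means step used to obtain cluster means and variances can be sketched as below. This is an illustrative version only: the deterministic initialization (evenly spaced data points) and the function names are assumptions, and the toy vectors in any usage would be far smaller than the 18-coefficient vectors described in the paper.

```python
def assign(points, centers):
    """Group each point with its nearest center (squared Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        clusters[dists.index(min(dists))].append(p)
    return clusters


def kmeans(points, k, iters=20):
    """Plain k-means; k is supplied by the caller, as in the paper."""
    # Assumed initialization: k points spread evenly through the data.
    step = len(points) // k
    centers = [list(points[i * step]) for i in range(k)]
    for _ in range(iters):
        clusters = assign(points, centers)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster goes empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)


def cluster_variances(centers, clusters):
    """Per-dimension variance of each cluster around its center."""
    out = []
    for c, members in zip(centers, clusters):
        n = max(len(members), 1)
        out.append([sum((m[j] - c[j]) ** 2 for m in members) / n
                    for j in range(len(c))])
    return out
```

The resulting means and variances could then serve as the component parameters of a diagonal-covariance Gaussian mixture.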

Chen, S.S., and P.S. Gopalakrishnan. 1998. Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion: IBM T.J. Watson Research Center.

Huang, R., and J.H.L. Hansen. 2004. Unsupervised Audio Segmentation and Classification for Robust Spoken Document Retrieval. Paper read at IEEE ICASSP-2004: Inter. Conf. on Acoustics, Speech, and Signal Processing.

A new combined criterion was proposed to evaluate the performance of automatic audio stream segmentation. Three new features (PMVDR, SZCR, and FBLC) were proposed. A weighted mean distance (T^2-Mean) was used as the segmentation distance metric. False alarms were compensated for by using T^2-Mean for short segments and the regular T^2 for long segments. An improvement of more than 30% over a traditional BIC segmentation algorithm based on MFCC was reported.
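
A T^2-style distance between two adjacent segments can be sketched as below. Two simplifications are assumed and are not taken from the paper: the pooled covariance is treated as diagonal, and the use_variance=False branch (which drops the covariance and keeps only the weighted mean difference) is one plausible reading of a "weighted mean distance", not a confirmed definition of T^2-Mean.

```python
def column_means(frames):
    """Mean of each feature dimension over a segment."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def pooled_diag_variance(left, right, mu_l, mu_r):
    """Pooled per-dimension variance of two segments (diagonal assumption)."""
    dims = len(mu_l)
    total = [0.0] * dims
    for frames, mu in ((left, mu_l), (right, mu_r)):
        for f in frames:
            for j in range(dims):
                total[j] += (f[j] - mu[j]) ** 2
    dof = len(left) + len(right) - 2
    return [t / dof for t in total]


def t2_distance(left, right, use_variance=True):
    """(a*b/(a+b)) * (mu_l - mu_r)^T S^-1 (mu_l - mu_r) with diagonal S."""
    a, b = len(left), len(right)
    mu_l, mu_r = column_means(left), column_means(right)
    if use_variance:
        var = pooled_diag_variance(left, right, mu_l, mu_r)
    else:
        var = [1.0] * len(mu_l)  # assumed mean-only variant
    weight = a * b / (a + b)
    return weight * sum((x - y) ** 2 / v for x, y, v in zip(mu_l, mu_r, var))
```

A candidate boundary would be scored by calling t2_distance on the frames to its left and right; short segments give unreliable variance estimates, which is one motivation for a mean-only variant.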

Kemp, T., M. Schmidt, M. Westphal, and A. Waibel. 2000. Strategies for automatic segmentation of audio data. Paper read at IEEE International Conference on Acoustics, Speech, and Signal Processing.

Three strategies for broadcast news signal segmentation were summarized and evaluated: model-based, metric-based, and energy-based. Precision, recall, and F-measure were used as the evaluation metrics. The results showed that the model-based and metric-based algorithms outperformed the energy-based algorithm. A new hybrid algorithm was also proposed and reported to perform better.
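
The three evaluation metrics can be computed for detected segment boundaries as in the sketch below. The matching rule (a detected boundary counts as a hit if it lies within a tolerance of a not-yet-matched reference boundary) and the default tolerance of 0.5 s are assumptions for illustration, not values from the paper.

```python
def boundary_scores(detected, reference, tolerance=0.5):
    """Return (precision, recall, F-measure) for boundary times in seconds."""
    unmatched = list(reference)
    hits = 0
    for t in sorted(detected):
        # Match each detection to at most one reference boundary.
        match = next((r for r in unmatched if abs(r - t) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Precision penalizes false alarms, recall penalizes missed boundaries, and the F-measure is their harmonic mean.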

Omar, A.H. 2005. Audio Segmentation and Classification, Technical University of Denmark.

An automatic segmentation and classification system was implemented and tested. Three features extracted from overlapped frames (Short Time Energy, Zero Crossing Rate, and Mel Frequency Cepstral Coefficients) were used in the classification part. Two classification methods were evaluated: K-NN and GMM. GMM showed better classification performance and took less time than K-NN. A 95.65% correct classification rate was reported. The RMS feature extracted from non-overlapped frames was used in the segmentation part. Transitions were detected by measuring the similarity of the feature distributions of two adjacent frames; a Chi-Squared distribution model was used to reduce the computational load.
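
Two of the frame-level features named above, Short Time Energy and Zero Crossing Rate, can be sketched as follows. The frame size and hop are left to the caller, and the normalizations (mean squared amplitude, crossings per adjacent sample pair) are common conventions assumed here rather than the thesis's exact definitions.

```python
def split_frames(signal, size, hop):
    """Cut a signal into frames; hop < size gives overlapped frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)
```

For example, split_frames(signal, 512, 256) would give frames with 50% overlap, from which one feature value per frame can be collected.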

Saunders, J. 1996. Real-time discrimination of broadcast speech/music. Paper read at IEEE International Conference on Acoustics, Speech, and Signal Processing, at Atlanta, GA, USA.

Pioneering work on SMD. The average zero-crossing rate (ZCR) is used to discern voiced speech in FM station programs. Speech signals produce abrupt increases in the ZCR at the beginning and end of words. By taking the difference between the number of points that fall below a low threshold and the number that exceed a high threshold around the mean ZCR, speech can be identified by its higher difference ratio. A recursive convex hull algorithm is used to further improve the performance. A classification rate of 98% was reported.

Tritschler, A., and R. Gopinath. 1999. Improved Speaker Segmentation and Segments Clustering Using the Bayesian Information Criterion: IBM T.J. Watson Research Center.

Two improvements were made to the model-based segmentation algorithm based on BIC. First, a variable window scheme grows the window slowly while it is small and faster once it is large. Second, the BIC test can be skipped in some cases. The enhancements were reported to be in use in a commercial real-time application.
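
The BIC test that both IBM reports build on can be sketched for the simplest case. This version assumes a one-dimensional feature (so the covariance determinant reduces to a variance) and a fixed window; the variable window growth and the test-skipping heuristics of the report are not reproduced. The function names, the margin parameter, and the variance floor are hypothetical.

```python
import math


def variance(xs):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mu = sum(xs) / len(xs)
    return max(sum((x - mu) ** 2 for x in xs) / len(xs), 1e-12)


def delta_bic(window, split, penalty_weight=1.0):
    """Delta-BIC for one Gaussian versus two Gaussians split at an index."""
    n = len(window)
    left, right = window[:split], window[split:]
    gain = 0.5 * (n * math.log(variance(window))
                  - len(left) * math.log(variance(left))
                  - len(right) * math.log(variance(right)))
    # Penalty 0.5 * lambda * (d + d(d+1)/2) * log(n) with d = 1.
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


def best_change_point(window, margin=2, penalty_weight=1.0):
    """Return (index, score); index is None when no split has a positive score."""
    scores = {i: delta_bic(window, i, penalty_weight)
              for i in range(margin, len(window) - margin + 1)}
    split = max(scores, key=scores.get)
    return (split, scores[split]) if scores[split] > 0 else (None, scores[split])
```

A positive score means that modelling the window as two segments is worth the extra parameters, so a change point is declared at the best-scoring index.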

Tzanetakis, G., and P. Cook. 1999. Multifeature Audio Segmentation for Browsing and Annotation. Paper read at IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 17-20, 1999, at New Paltz, New York.

A general temporal audio segmentation method based on multiple temporal features is presented. The derivative of the Mahalanobis distance between the feature vectors of adjacent frames is used to detect texture changes, and the peaks of the distance are then used to mark segment boundaries. The proposed feature vector includes Spectral Centroid, Rolloff, and Flux, Zero Crossings, RMS, etc. Some possible applications are proposed.
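
The distance-and-peak idea can be sketched as below. Several simplifications are assumed: the covariance is taken as diagonal and estimated over the whole recording (so the Mahalanobis distance reduces to a variance-normalized Euclidean distance), the derivative step is omitted and peaks are picked directly on the distance curve, and the threshold is a free parameter.

```python
import math


def diag_variances(vectors):
    """Per-dimension variance over all frames, floored to stay positive."""
    n = len(vectors)
    means = [sum(col) / n for col in zip(*vectors)]
    return [max(sum((v[j] - means[j]) ** 2 for v in vectors) / n, 1e-12)
            for j in range(len(means))]


def adjacent_distances(vectors):
    """Diagonal-covariance Mahalanobis distance between successive frames."""
    var = diag_variances(vectors)
    return [math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(u, v, var)))
            for u, v in zip(vectors, vectors[1:])]


def pick_peaks(values, threshold):
    """Indices of local maxima above the threshold."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > threshold
            and values[i] > values[i - 1]
            and values[i] >= values[i + 1]]
```

Entry i of the distance list compares frame i with frame i + 1, so a peak at index i suggests a boundary at the start of frame i + 1.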

Tzanetakis, G., and P. Cook. 2002. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, July 2002.

The feature set based on Timbral Texture (usable in real time) includes spectral centroid, spectral rolloff, spectral flux, time-domain zero crossings, Mel-frequency cepstral coefficients (MFCC), and the low-energy feature, computed over analysis and texture windows, resulting in a 19-dimensional feature vector.
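
Three of the timbral features can be sketched from a magnitude spectrum as below. The definitions are commonly used ones (centroid as the magnitude-weighted mean bin, rolloff as the bin below which a fraction of the magnitude lies, flux as the squared difference of normalized spectra) and are assumed here; the 0.85 rolloff fraction and the naive DFT, which only suits short frames, are also illustrative choices.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) transform)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def spectral_centroid(mag):
    """Magnitude-weighted mean bin index."""
    total = sum(mag)
    return sum(k * m for k, m in enumerate(mag)) / total if total else 0.0


def spectral_rolloff(mag, fraction=0.85):
    """Lowest bin at which the cumulative magnitude reaches the fraction."""
    target = fraction * sum(mag)
    running = 0.0
    for k, m in enumerate(mag):
        running += m
        if running >= target:
            return k
    return len(mag) - 1


def spectral_flux(mag, prev_mag):
    """Squared difference between sum-normalized successive spectra."""
    def normalize(v):
        s = sum(v)
        return [x / s for x in v] if s else list(v)
    a, b = normalize(mag), normalize(prev_mag)
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Bin indices convert to Hz by multiplying by the sample rate divided by the frame length.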

The feature set based on Rhythmic Content (beat histogram) is computed through full-wave rectification, low-pass filtering, downsampling, mean removal, enhanced autocorrelation, peak detection, and histogram calculation, yielding the beat histogram features. A feature set based on Pitch Content (pitch histogram) is also useful for automatic music genre classification.

Classification: Standard statistical pattern recognition methods, such as a Gaussian classifier, GMM, and a K-NN classifier, were used to evaluate the feature sets.

Results: Classification accuracies of 61% (non-realtime) and 44% (realtime) were reported.
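
The K-NN classifier, which appears in this and several other entries, can be sketched in a few lines. The Euclidean distance, the default k = 3, and the majority-vote rule are common choices assumed here; the paper's exact settings are not given in these notes.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbours.

    train is a list of (feature_vector, label) pairs.
    """
    ranked = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

K-NN needs no training step, but every query is compared against all stored vectors, which is consistent with the longer running time noted for K-NN in the Omar entry.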

Wang, W.Q., W. Gao, and D.W. Ying. 2003. A fast and robust speech/music discrimination approach. Paper read at Proceedings of the 2003 Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia.

1, Feature extraction: Modified Low Energy Ratio (MLER), which introduces a coefficient into the low-energy feature evaluation.

2, Classifier: 1-D Bayes MAP classifier

3, Classification refinement: a context-based "Post-decision Method" that exploits the relevance of neighboring clips
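
The three steps above can be sketched as a small pipeline. Everything specific here is an assumption made for illustration: MLER is taken to be the fraction of frames whose energy falls below delta times the clip's mean energy, the class-conditional densities are modelled as 1-D Gaussians, the post-decision step is a majority vote over neighboring clips, and all parameter values and names are hypothetical.

```python
import math
from collections import Counter


def mler(frame_energies, delta=0.1):
    """Fraction of frames with energy below delta * mean energy of the clip."""
    mean = sum(frame_energies) / len(frame_energies)
    low = sum(1 for e in frame_energies if e < delta * mean)
    return low / len(frame_energies)


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def map_classify(x, classes):
    """Pick the label maximizing prior * likelihood.

    classes maps label -> (prior, mu, sigma).
    """
    return max(classes,
               key=lambda c: classes[c][0] * gaussian_pdf(x, classes[c][1], classes[c][2]))


def smooth_labels(labels, radius=1):
    """Replace each clip label by the majority label of its neighborhood."""
    out = []
    for i, current in enumerate(labels):
        counts = Counter(labels[max(0, i - radius):i + radius + 1])
        best = max(counts.values())
        winners = [label for label in counts if counts[label] == best]
        # On a tie, keep the clip's own label if it is among the winners.
        out.append(current if current in winners else winners[0])
    return out
```

Speech tends to contain more low-energy frames (pauses between words) than music, which is why a single ratio of this kind can separate the two classes reasonably well.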

On-line Resources

K-Means Clustering Algorithm [accessed 2007 March 1]

Many other good tutorials about machine learning.

A Tutorial on Clustering Algorithms [accessed 2007 March 1]

Tutorial on k-means, fuzzy c-means, hierarchical clustering, and mixture of Gaussians.

Statistical Data Mining Tutorial [accessed 2007 March 1]