Realtime recognition of orchestral instruments

This paper describes the culmination of a realtime timbre recognition project based on two previous experiments. For the recognition task, all experiments use the Lazy Learning Machine, an exemplar-based learning system that uses a k-nearest neighbor (k-NN) classifier, with a genetic algorithm to find the optimal set of feature weights and so improve its performance. Although the genetic algorithm needs a considerable amount of time to determine the weights, the computation time of the k-NN classifier itself is insignificant, and classification can be performed as soon as the required number of samples has been processed, making the system well suited to realtime applications.
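The feature-weighted k-NN step described above can be illustrated with a minimal sketch. It is not the Lazy Learning Machine itself: the function name, the weighted Euclidean distance, and the majority-vote rule are assumptions chosen for illustration, and the weights are supplied by hand rather than found by a genetic algorithm.

```python
import math
from collections import Counter


def weighted_knn_classify(query, exemplars, labels, weights, k=3):
    """Classify `query` by majority vote among its k nearest exemplars.

    Distance is a feature-weighted Euclidean distance. `weights` stands in
    for the per-feature weights that a genetic algorithm would search for
    (hypothetical values here, not taken from the paper).
    """
    distances = []
    for features, label in zip(exemplars, labels):
        d = math.sqrt(
            sum(w * (q - f) ** 2 for w, q, f in zip(weights, query, features))
        )
        distances.append((d, label))
    # Sort by distance only, then vote among the k closest exemplars.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy exemplars with two features each (values are made up).
exemplars = [[0.0, 0.0], [0.2, 1.0], [1.0, 0.0], [1.2, 1.0]]
labels = ["flute", "flute", "oboe", "oboe"]

# A weight of zero on the second feature makes the classifier ignore it.
print(weighted_knn_classify([0.9, 0.5], exemplars, labels, [1.0, 0.0], k=2))
```

Because classification only requires distance computations against the stored exemplars, the expensive weight search can be done offline, which is the property the realtime argument relies on.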

The training data comprised over 1200 notes from 39 different timbres (23 orchestral instruments, some with different articulations) taken from the McGill Master Samples CD library.

In the first experiment, only the spectral shape from a manually selected steady-state portion of each sound was used as input to the recognition process. The features calculated from the spectral data included the centroid and higher-order moments such as skewness and kurtosis. The recognition rate for the 39-instrument group was 50%.
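As an illustration of the spectral shape features named above, the following sketch computes the centroid, spread, skewness, and kurtosis of a magnitude spectrum by treating the normalized magnitudes as a distribution over frequency. The function name and the normalization are assumptions; the exact definitions used in the experiments are not given here.

```python
import math


def spectral_moments(freqs, mags):
    """Return (centroid, spread, skewness, kurtosis) of a magnitude spectrum.

    The magnitudes are normalized to sum to one and treated as a
    distribution over frequency (an assumed, commonly used convention).
    """
    total = sum(mags)
    if total <= 0:
        raise ValueError("spectrum must contain some energy")
    p = [m / total for m in mags]
    centroid = sum(f * w for f, w in zip(freqs, p))
    variance = sum((f - centroid) ** 2 * w for f, w in zip(freqs, p))
    spread = math.sqrt(variance)
    if spread == 0:
        # All energy in a single bin: higher moments are undefined, report 0.
        return centroid, 0.0, 0.0, 0.0
    skewness = sum((f - centroid) ** 3 * w for f, w in zip(freqs, p)) / spread ** 3
    kurtosis = sum((f - centroid) ** 4 * w for f, w in zip(freqs, p)) / spread ** 4
    return centroid, spread, skewness, kurtosis


# A symmetric three-bin spectrum: centroid at the middle bin, zero skewness.
print(spectral_moments([100.0, 200.0, 300.0], [1.0, 2.0, 1.0]))
```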

In the second experiment, two improvements were made. First, Miller Puckette's fiddle~ object was used as the basis of analysis for realtime recognition. Second, features of the dynamically changing spectral envelope, such as the velocity of the centroid, were added, improving the recognition rate to 63%.
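One plausible reading of "velocity of the centroid" is the frame-to-frame rate of change of the spectral centroid. The sketch below assumes that reading, using a simple finite difference over a fixed analysis hop; the function name and the hop parameter are hypothetical and do not describe the fiddle~ object's output.

```python
def centroid_velocity(centroids, hop_seconds):
    """Finite-difference rate of change of the centroid, in Hz per second.

    `centroids` holds one centroid value per analysis frame, and
    `hop_seconds` is the assumed time between consecutive frames.
    """
    if hop_seconds <= 0:
        raise ValueError("hop_seconds must be positive")
    return [(b - a) / hop_seconds for a, b in zip(centroids, centroids[1:])]


# Centroid rising then falling across three frames spaced 10 ms apart.
print(centroid_velocity([1000.0, 1100.0, 1050.0], 0.01))
```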

In the current experiment, additional enhancements include a more precise location of the attack point, the addition of spectral irregularity and tristimulus as features describing the spectral shape, and the incorporation of the time-domain envelope as a source of features. These changes have substantially improved the recognition rates. Most groups of three to ten instruments were recognized at rates in the 95%-100% range, while the recognition rate for the 39-instrument group increased by 5 percentage points to 68%. These results were achieved using only the first 500 ms of the sound.
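The two spectral shape features added in this experiment can be sketched from a list of harmonic partial amplitudes. Tristimulus is computed here in its usual three-band form (fundamental, partials 2 to 4, partials 5 and above, each relative to the total amplitude). Spectral irregularity has several definitions in the literature; this sketch assumes one common variant, the sum of squared differences between adjacent partial amplitudes divided by the total energy, which may differ from the one used in the experiments. The function names are hypothetical.

```python
def tristimulus(amps):
    """Return (T1, T2, T3) for harmonic amplitudes, fundamental first.

    T1 is the share of the fundamental, T2 that of partials 2-4, and
    T3 that of partials 5 and above, each relative to the amplitude sum.
    """
    total = sum(amps)
    if total <= 0:
        raise ValueError("amplitudes must contain some energy")
    t1 = amps[0] / total
    t2 = sum(amps[1:4]) / total
    t3 = sum(amps[4:]) / total
    return t1, t2, t3


def spectral_irregularity(amps):
    """Squared differences of adjacent partials, normalized by energy.

    This is an assumed definition; only adjacent pairs inside the list
    are compared (no implicit zero partial after the last one).
    """
    energy = sum(a * a for a in amps)
    if energy <= 0:
        raise ValueError("amplitudes must contain some energy")
    diffs = sum((a - b) ** 2 for a, b in zip(amps, amps[1:]))
    return diffs / energy


# A flat six-partial spectrum: perfectly regular, energy spread evenly.
flat = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(tristimulus(flat), spectral_irregularity(flat))
```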

A demonstration of the recognition system will be given using live input.