December 2015
This project was completed for MUMT 605, Digital Sound Synthesis and Audio Processing at McGill University. I provide an overview of Spectral Modeling Synthesis and develop a real-time spectral modeling synthesizer in Max/MSP.
Spectral Modeling Synthesis (SMS) is an analysis/synthesis technique that models a time-varying spectrum as a collection of sinusoids described by amplitude and frequency envelopes, plus a time-varying filtered noise component; these correspond to the deterministic and stochastic parts of the sound, respectively. The technique was introduced by Xavier Serra [1], expanded with procedural details a year later by Serra and Smith [2], and updated in [8].
Before explaining the digital implementation, I will provide a procedural outline of SMS. At the input of an SMS system, an audio source is first converted to the frequency domain. From here, sinusoidal trajectories are extracted by tracking frequency peaks and transformed into sine waves in the time domain, forming the deterministic part of the model. The spectral peaks are then removed from the original spectrum, leaving a residual noise floor.
This noise floor, defined as the stochastic part, is modeled as white noise passed through a time-varying filter derived from a linear approximation of the upper spectral envelope, with the filtered frames overlap-added to produce the stochastic signal. The stochastic and deterministic (sinusoidal trajectory) parts are then added to form the complete synthesized model of the input signal.
A critical step in the SMS procedure is the peak detection and continuation algorithm, followed by additive synthesis. If the frequency peaks are not correctly tracked, or are insufficiently subtracted from the original signal, the noise floor may retain harmonic content. With precise peak tracking, the output model will contain accurate harmonics and retain the tonal character of the input signal. Additionally, a number of control parameters should be provided for the peak tracking, in order to adapt the model to a variety of input sources such as voices with deep vibrato or audio containing multiple instruments.
Several varieties of peak detection algorithm are outlined in [2]. The one described by Serra and Smith detects local maxima in the magnitude spectrum, measures the height of each candidate relative to its neighboring bins, and passes through only the values that satisfy both criteria for a given height threshold. As an alternative, hidden Markov models can be used to determine peaks in the signal and offer improvements over the prior method, as described in [3]. Peak tracking based on linear predictive coding or Kalman filtering is shown in [13] and [11], respectively.
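The local-maximum-plus-threshold test described by Serra and Smith can be sketched in a few lines of Python (my own illustration of the idea, not code from [2]; the threshold value is a free parameter):

```python
def detect_peaks(mag_db, threshold_db=-60.0):
    """Return bin indices that are local maxima in a magnitude
    spectrum (in dB) and also exceed a height threshold."""
    peaks = []
    for k in range(1, len(mag_db) - 1):
        above_neighbors = mag_db[k] > mag_db[k - 1] and mag_db[k] > mag_db[k + 1]
        if above_neighbors and mag_db[k] > threshold_db:
            peaks.append(k)
    return peaks
```

A full implementation would also compare the peak height against the local noise floor rather than a single global threshold.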
Since it is rare for a peak frequency to fall exactly at the center of a bin (in which case it would be accurately defined), interpolation is needed to find the actual frequency of the peak. Zero padding can be used to increase the number of bins per Hz, but [2] recommends parabolic interpolation, which takes the peak bin and its two neighbors, fits a parabola through the three values, and outputs the location and height of the parabola's vertex.
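Using the standard formulation from [10], with alpha, beta, and gamma as the dB magnitudes of the bins below, at, and above the maximum, the interpolation reduces to two short expressions (a sketch, not the document's Max patch):

```python
def parabolic_interp(alpha, beta, gamma):
    """Fit a parabola through three dB magnitudes around a local
    maximum bin and return (fractional bin offset p, peak magnitude).
    For peak bin k, the interpolated frequency is (k + p) * fs / N."""
    p = 0.5 * (alpha - gamma) / (alpha - 2.0 * beta + gamma)
    peak_mag = beta - 0.25 * (alpha - gamma) * p
    return p, peak_mag
```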
Sinusoidal trajectories are extracted by connecting the peaks of successive FFT frames. Trajectory algorithms connect peaks using spectral guides that advance through time, searching each new frame for appropriate peak values and forming trajectories out of them. Peak continuation algorithms usually have three steps: guide advancement, guide updating, and new guide initialization. Such an algorithm also determines how peak information is routed to the synthesis oscillators.
For example, if an SMS program contains eight oscillators for the additive synthesis part, each peak must be allocated to one of those oscillators. One peak could increase in frequency (like a trombone player sliding up the scale) while another remains fixed. At the point where the rising peak meets the other, the trajectories cross and continue on. Because the algorithm identifies the changing frequency as a single sinusoidal track, the same oscillator continues to handle it rather than jumping to another peak. It is also important to guide genuine partial peaks rather than spurious ones caused by noise, which matter less to the deterministic part of the sound. Lastly, shortcomings in the peak detection algorithm can result in partial discontinuities or the formation of false partials [11]. The selection of a peak trajectory algorithm is therefore crucial.
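A much-simplified guide-advancement step might look like the following (my own sketch; Serra's full algorithm also lets guides sleep and die, and here an active guide that misses a frame simply frees its slot for a new peak):

```python
def continue_guides(guides, new_peaks, max_dev=50.0):
    """One frame of simplified peak continuation.
    guides: current guide frequencies in Hz (None = inactive slot).
    new_peaks: peak frequencies detected in this frame.
    Each guide claims the closest unclaimed peak within max_dev Hz;
    leftover peaks start new guides in the remaining free slots."""
    peaks = list(new_peaks)
    out = [None] * len(guides)
    # Guide advancement: match existing guides to nearby peaks.
    for i, g in enumerate(guides):
        if g is None or not peaks:
            continue
        j = min(range(len(peaks)), key=lambda k: abs(peaks[k] - g))
        if abs(peaks[j] - g) <= max_dev:
            out[i] = peaks.pop(j)
    # New guide initialization: unmatched peaks fill empty slots.
    for i in range(len(out)):
        if out[i] is None and peaks:
            out[i] = peaks.pop(0)
    return out
```

Each returned slot maps directly to one oscillator in the additive bank, which is what keeps a gliding partial on the same oscillator across frames.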
The stochastic part of the signal can be represented by an approximate shape of the analyzed noise floor spectrum. Because no frequency peaks remain to contribute harmonic qualities, only the shape of the noise contributes to the sound character of this component. Curve fitting is used to find this spectral envelope, which fills in the gaps created by subtracting out the additive part. A spectral envelope should ideally pass through all prominent maxima while remaining relatively smooth, without many oscillations or sharp peaks. In [2], a line-segment approximation is said to be accurate enough.
Other methods include spline interpolation and least-squares approximation; in the latter category, linear predictive coding (LPC) can be used, which fits an nth-order all-pole model to the magnitude spectrum. Line-segment approximation is more flexible than LPC, which is why Serra and Smith used it. It involves stepping through the magnitude spectrum, finding local maxima, and linearly interpolating between those points to create the spectral envelope.
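The line-segment approach can be sketched as follows (my own illustration; the step size is a free parameter chosen for the example):

```python
import numpy as np

def line_segment_envelope(mag, step=8):
    """Approximate a spectral envelope by stepping through the
    magnitude spectrum in fixed-size windows, taking each window's
    local maximum, and linearly interpolating between the maxima."""
    mag = np.asarray(mag, dtype=float)
    idx, val = [], []
    for start in range(0, len(mag), step):
        k = start + int(np.argmax(mag[start:start + step]))
        idx.append(k)
        val.append(mag[k])
    # Linear interpolation across all bins through the local maxima.
    return np.interp(np.arange(len(mag)), idx, val)
```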
Once the spectral envelope has been approximated and smoothed, the magnitude data is fed into an inverse Fourier transform. The phase used by the IFFT can be a random value between 0 and 2π, because the phase spectrum of noise is itself random. The IFFT thus outputs a time-domain signal that is essentially filtered white noise with an envelope approximating the shape of the original signal's stochastic part.
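In Python terms, one frame of this random-phase resynthesis could look like the following sketch (windowing and overlap-add of successive frames are omitted):

```python
import numpy as np

def stochastic_frame(envelope_db):
    """One frame of stochastic synthesis: pair the approximated
    envelope magnitudes (bins 0..N/2, in dB) with uniform random
    phases and inverse-transform to a real time-domain frame."""
    mag = 10.0 ** (np.asarray(envelope_db, dtype=float) / 20.0)
    phase = np.random.uniform(0.0, 2.0 * np.pi, size=mag.shape)
    spectrum = mag * np.exp(1j * phase)
    # irfft assumes conjugate symmetry, guaranteeing a real output.
    return np.fft.irfft(spectrum)
```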
Non-real-time audio signal processing programs, like AudioSculpt, provide high-precision spectral transformations and resynthesis procedures [7]. There are also open-source Matlab and C++ packages available, such as SMSTools, which include peak tracking and resynthesis algorithms. An interesting possibility within SMS is transforming the deterministic and stochastic data in the spectral domain before resynthesis and addition. Such transformations could be more valuable to a composer if made available in real-time environments.
Max/MSP facilitates real-time spectral modeling synthesis. Jitter, used mainly for image processing, helped to implement several parts of the SMS technique, primarily the storage and manipulation of Fourier analysis data. Fourier analysis gives us a two-dimensional array of complex numbers [5]. MSP buffers are one-dimensional, while Jitter matrices are two-dimensional arrays and therefore better suited to the following procedures [12]. Also, matrix operations can be driven at control rate by a metro object, independently of the audio signal rate. After storing FFT data, different transformations can thus be applied without significant delay, and the result converted back to the time domain with an inverse short-time Fourier transform.
My first major task was to create an algorithm that detects the peaks in a harmonic signal. I did this by first storing the frequency-domain data in a matrix using a pfft~ object, then unrolling the data into lists for processing with zl objects. The peak-finding algorithm sorts the magnitude data of each FFT frame from largest to smallest. The index, or cell location, of the largest value in the nth frame and its two adjacent cells are sent into a zl lookup object, which extracts the three magnitudes from the list of values for that frame.
Parabolic interpolation is then performed across those three bins and their magnitudes, based on equations from [10]. With α, β, and γ denoting the magnitudes (in dB) of the bins below, at, and above the local maximum, the fractional peak location in bins is p = (1/2)(α − γ)/(α − 2β + γ), and the interpolated peak magnitude is y(p) = β − (1/4)(α − γ)p.
The interpolated frequency and magnitude are then output and converted into a sinusoid with a cycle~ object.
One way to extend the algorithm to multiple peaks would be an iterative process that removes each peak after it is detected, shortening the list on every iteration so that the next pass picks the next-highest peak within that spectral frame. This approach may be computationally inefficient. Alternatively, I used a jit.iter object to test local maxima detection and peak location indexing, with good results that suggest multiple peak detection with Jitter matrices is promising future work.
In order to test the concept of SMS on signals with multiple harmonic peaks, I used the sigmund~ external object [4]. This object was written by Miller Puckette, originally for Pure Data and later ported to Max/MSP. It contains an FFT, a peak detector, and a trajectory handler. The original signal is sent into its input, and the output, a list of each peak frequency and amplitude, is sent into a patch that unpacks the list and routes each track to an oscillator. The oscillator bank handles routing for new and continuing sinusoid tracks, and its thirty oscillators, one per sinusoidal trajectory, are summed to a single output.
The original signal and the additive signal are sent into a pfft~ object, where each is windowed and converted to the frequency domain by a short-time Fourier transform within an fftin~ object. After converting the real and imaginary components to polar form, the additive signal's magnitudes are subtracted from the original magnitudes. The original, additive, and residual spectra are each sent to their own Jitter matrix, with two planes each for the magnitude and phase information. Performing the subtraction inside the pfft~ ensures correct synchronization of the signals (i.e., the same bin and frame numbers for each). The Jitter windows provide spectrograms that make it easy to spot partials, noise, and unwanted components in either the stochastic (spec3) or deterministic (spec2) parts.
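The bin-by-bin residual computation amounts to the following sketch (clipping at zero is my own choice, to keep the residual magnitudes non-negative):

```python
import numpy as np

def residual_magnitudes(original_mag, additive_mag):
    """Subtract the additive (deterministic) magnitudes from the
    original magnitudes bin by bin. For this to be meaningful the
    two analyses must share the same frame and bin alignment."""
    res = (np.asarray(original_mag, dtype=float)
           - np.asarray(additive_mag, dtype=float))
    return np.clip(res, 0.0, None)
```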
Differences between my implementation and the steps described by Serra are important to note. In the paper, the original signal is converted to the frequency domain by a single short-time Fourier transform at the start, and that data feeds both the peak detector and, eventually, the residual subtraction. In my implementation, a short-time Fourier transform is applied in two places: once inside the sigmund~ object and once on the original signal. Replacing the external object with my own multiple-peak detector would allow the steps of the original article to be followed exactly.
Returning to the pfft~ object feed, the original signal is delayed by the length of the STFT window used by the sigmund~ object. This delay ensures that the additive part does not lag behind the original and that changes in the peaks are subtracted at the right moment. Without the delay, harmonic content appears momentarily in the stochastic signal after a change in the original signal's fundamental frequency.
The last step of the SMS technique is to synthesize the stochastic part using the envelope of the residual noise frequency data to filter white noise. To model this part of the sound, such as the breath noise in woodwind instruments, a good time resolution is needed, and frequency resolution can be partially sacrificed.
A curve approximating the outline of the stochastic part's magnitude spectrum can be obtained through cepstral smoothing. The smoothed spectrum omits spurious peaks and jagged edges. To get the cepstrum of the signal, the magnitude of the noise's frequency data is converted to the quefrency domain with an IFFT, low-pass liftered with a windowing function, and then Fourier transformed back [6]. Figure 4 shows the overall process involved in obtaining the smoothed spectrum of a signal.
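The chain of transforms can be sketched as follows (my own illustration of the process in [6]; the cutoff quefrency is a free parameter, and a rectangular lifter is used for simplicity):

```python
import numpy as np

def cepstral_smooth(mag, cutoff=20):
    """Smooth an N-bin magnitude spectrum by low-pass liftering:
    log magnitude -> IFFT (cepstrum) -> keep low quefrencies
    -> FFT -> exponentiate. Assumes cutoff >= 2."""
    log_mag = np.log(np.maximum(np.asarray(mag, dtype=float), 1e-12))
    cep = np.fft.ifft(log_mag)
    lifter = np.zeros(len(cep))
    lifter[:cutoff] = 1.0
    lifter[-(cutoff - 1):] = 1.0  # keep the symmetric negative quefrencies
    smooth_log = np.fft.fft(cep * lifter).real
    return np.exp(smooth_log)
```

A flat spectrum passes through unchanged, since all of its cepstral energy sits at quefrency zero.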
A difficult part of the implementation was the cepstral smoothing process, mainly because of the multiple consecutive forward and inverse Fourier transforms. The pfft~ object consumes a considerable amount of processing power, so it was important to design the patch with as few pfft~ instances as possible. This led me to perform all of the Fourier analysis within a single main pfft~ sub-patch, which required careful synchronization of the signal streams and appropriate windowing.
The smoothed spectrum is fed into the pfft~ object, with no FFT inlets, along with phase values drawn at random between 0 and 2π. To avoid periodicity at the frame update rate, new random values are generated for each frame within the phase plane of the stochastic Jitter matrix. The object's output is the inverse Fourier transform: the stochastic time-domain part of the model. The signal output of the ifft~ is then added back to the additive part calculated earlier with the bank of oscillators controlled by the sigmund~ object, yielding a synthetic model of the original audio sample.
The quality of the sounds synthesized by my Max/MSP patch varies with the characteristics of the input waveform. Some deterministic content can be heard in the resultant stochastic part when the input has large vibrato sweeps (greater than 10 Hz in rate and 10 Hz in frequency variation). The patch produces a garbled sound for harmonically rich sources, such as a strummed acoustic guitar. This unwanted effect could stem from peak track variations, or from the sigmund~ object or oscillator bank sub-patch not handling every sinusoid adequately. Even though the oscillator bank ramps the signal up for each trajectory, the effect persists.
Also, some windowing should be applied in the smoothed-spectrum transform process, because abrupt changes between FFT windows cause audible clicking in the stochastic part. When I windowed this signal, however, the amplitude of the envelope dropped significantly. More work is needed to bring the smoothed spectrum to a point where it can be used in the output signal.
The biggest implementation challenge was finding how to track peaks after extracting FFT data into Jitter matrices. I think the jit.iter method I started can be expanded to form multiple partial trajectories. Another challenge was normalizing the magnitude data from the FFT once it was stored in the matrices, especially when taking the log of the magnitude for the spectral smoothing portion. Because of the non-constant data flow in this process, delays had to be used to sync the multiple signal paths. Jitter matrices helped control when each data package entered the MSP domain, though challenges remained with signal flow control.
Compared with real-time approaches, non-real-time spectral modification software is more precise and offers finer control over peak detection and additive synthesis. A major factor in the overall quality of the modeled sound is how well the SMS implementation is tuned to a particular type of audio signal. In my implementation, some parameters are easily adjusted to suit different input characteristics, such as the number of tracks and vibrato tracking within the sigmund~ object, which help refine the additive portion of SMS. More control parameters could be added with a peak tracking and continuation algorithm designed specifically for this system.
Spectral Modeling Synthesis is an analysis-based technique that captures the characteristics of sounds. At the end of [2], it is noted that a real-time implementation would allow the technique to be used in performance, but that the representation would have to be computed ahead of time and stored. Using Max/MSP for SMS is an interesting concept, and one that can be developed further for live spectral modeling performance. The Max/MSP implementation is already useful for demonstrating the technique and for effects that do not depend on precise harmonic partial tracking and residual synthesis. With transformation routines and more precise peak tracking added to my patch, this implementation could be valuable to musicians for live performance or composition.
[1] Serra, X. 1989. “A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic plus Stochastic Decomposition.” Ph.D. diss., Stanford University.
[2] Serra, X. and Smith, J. 1990. “Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition.” Computer Music Journal, Vol. 14, No. 4, Winter 1990.
[3] Depalle, P., G. Garcia, and X. Rodet. 1993. “Tracking of Partials for Additive Sound Synthesis Using Hidden Markov Models.” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-93), vol. 1, pp. 225–228.
[4] Puckette, M. 2005. “Sigmund~ 64-bit version based on v0.07.” URL: https://github.com/v7b1/sigmund_64bit-version/releases
[5] Puckette, M. 2006. “The Theory and Technique of Electronic Music.” URL: http://msp.ucsd.edu/techniques.htm
[6] Padovani, JH. “Spectral Envelope Extraction by means of Cepstrum Analysis and Filtering in Pure Data.” PdCon09, July 19-26, 2009, São Paulo, SP, Brazil.
[7] Bogaards, N., A. Röbel, and X. Rodet. 2004. “Sound Analysis and Processing with Audiosculpt 2.” Proceedings of the 2004 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 462–465.
[8] X. Serra. 1997. “Musical Sound Modelling with Sinusoids plus Noise,” in Musical Signal Processing, Exton, PA: Swets & Zeitlinger, pp. 91-122.
[9] Purwins, H. and Geiger, G. 2009. http://www.dtic.upf.edu/~ggeiger/InfoAudioMusica/lab-5.html
[10] Smith, J. 2007. “Spectral Audio Signal Processing.” https://ccrma.stanford.edu/~jos/sasp/
[11] Satar-Boroujeni, H. and B. Shafai. 2005. “Peak Tracking and Partial Formation of Music Signals.” Proceedings of the 13th European Signal Processing Conference, pp. 1–4.
[12] Charles, J.-F. 2008. “A Tutorial on Spectral Sound Processing Using Max/MSP and Jitter.” Computer Music Journal, 32:3, pp. 97–102, Fall 2008.
[13] Ananthapadmanabha, T. V. and B. Yegnanarayana. 1979. “Epoch Extraction from Linear Prediction Residual for Identification of Closed Glottis Interval.” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 4, pp. 309–319.