Building a spectral envelope of FFT'd audio

Question

I'm trying to figure out how to create a good spectral envelope of a signal.

Basically I am, at present, taking a windowed section of audio, applying an FFT, and generating a bar-chart-style representation of the magnitude of the spectrum.

However, to add to this, I'd really like to have a good spectral envelope overlaid on top. Unfortunately I just can't seem to come up with a good solution.

I tried a simple peak picking algorithm where I find every peak in the spectrum (I literally just look for points where the current value is greater than the values in the buckets on either side). This gives me a set of peaks, and I then use Catmull-Rom interpolation to draw the line over the peaks.
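For reference, the peak picking described above amounts to something like the following minimal sketch. The names `SpectralPeak` and `findLocalPeaks` are placeholders invented for this illustration and are not taken from the actual app; the input is assumed to be one magnitude value per FFT bin.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type for illustration only.
struct SpectralPeak
{
    std::size_t bin;     // index of the FFT bin
    float       height;  // magnitude at that bin
};

// Returns every bin whose magnitude is strictly greater than both neighbours.
// The first and last bins are skipped because they have only one neighbour.
std::vector<SpectralPeak> findLocalPeaks(const std::vector<float>& magnitudes)
{
    std::vector<SpectralPeak> peaks;
    for (std::size_t i = 1; i + 1 < magnitudes.size(); ++i)
    {
        if (magnitudes[i] > magnitudes[i - 1] && magnitudes[i] > magnitudes[i + 1])
            peaks.push_back({i, magnitudes[i]});
    }
    return peaks;
}
```

The resulting peaks are what get handed to the interpolation as control points.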

This follows the spectrum reasonably as you can see in the following screenshot:

Reasonable envelope

However, it's not ideal. In that image, for example, you can see that it doesn't follow the spectrum on the left-hand side.

When you zoom out:

Too jaggy

You see that the envelope is pretty jaggy. It's not terrible, but is this really the best way to build a spectral envelope?

Is there a better way of picking the ideal peaks so that it truly envelops the spectrum? If so, can anyone point me towards algorithms?

Also, is Catmull-Rom interpolation the best way of doing it? Can anyone suggest a better interpolation method and show me how to implement it?

The question is language agnostic but I'm writing in C++.

Edit: Ok I found an implementation of Auto-Regressive modelling.

I then implemented the method as follows:

    // Fit a 3rd-order AR model to the magnitude spectrum, treating the bins as a series
    // (using the MAXENTROPY method of the AutoRegression implementation).
    std::vector< float > coefficients( 3 );
    AutoRegression( &mSpecBuffer.front(), kFFTSizeDiv2, 3, &coefficients.front(), MAXENTROPY );

    mPeakBuffer.clear();

    // Seed the envelope with the first three spectrum bins.
    mPeakBuffer.push_back( Peak( 0, mSpecBuffer[0] ) );
    mPeakBuffer.push_back( Peak( 1, mSpecBuffer[1] ) );
    mPeakBuffer.push_back( Peak( 2, mSpecBuffer[2] ) );

    // Generate each remaining point from the three previously generated envelope values.
    for( int x = 3; x < kFFTSizeDiv2; ++x )
    {
        const float k3  = (mPeakBuffer.end() - 3)->peakHeight;
        const float k2  = (mPeakBuffer.end() - 2)->peakHeight;
        const float k1  = (mPeakBuffer.end() - 1)->peakHeight;

        const float k   = (k1 * coefficients[0]) + (k2 * coefficients[1]) + (k3 * coefficients[2]);

        mPeakBuffer.push_back( Peak( x, k ) );
    }

This gives me the following result:

Auto-Regressive enveloping

Which to my eye looks a HELL of a lot better.

Zoomed right out it still looks pretty damned good:

Auto-Regressive enveloping (zoomed out)

So I just want to check: is this correct? If so, I'll work along this path further :)

@JasonR: They are rendered using my iPhone/iPad app (https://itunes.apple.com/app/spectrumview/id472662922?mt=8). I'm in the process of upgrading it and adding a "pro" version with extra functionality. — Goz, Oct 21 '12 at 07:39

score 2 · Answer 1 · answered Oct 21 '12 at 00:55

Two techniques are commonly used for recovering the spectral envelope of an audio signal:

Low-order AR modeling. Estimate your AR coefficients, then plot the magnitude of the frequency response of the estimated all-pole filter.
Cepstral analysis. Take the logarithm of the magnitude of the FFT output. Take the FFT of that, truncate to keep only the first few coefficients, take the inverse FFT, then take the exponential. This is equivalent to applying a low-pass filter to the spectrum (with a log-scale magnitude).
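The first technique can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: it uses the autocorrelation method with the Levinson-Durbin recursion (Burg's method would be a common alternative), and the names `ArModel`, `autocorrelation`, `levinsonDurbin` and `arEnvelope` are made up for this example. It assumes `frame` is one windowed block of time-domain samples and `fftSize` is the size of the FFT used for the display.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container: A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p, with a[0] == 1.
struct ArModel
{
    std::vector<double> a;
    double error;  // residual (prediction error) energy
};

// Unnormalised autocorrelation r[0..order] of one windowed time-domain frame.
std::vector<double> autocorrelation(const std::vector<float>& frame, std::size_t order)
{
    std::vector<double> r(order + 1, 0.0);
    for (std::size_t lag = 0; lag <= order; ++lag)
        for (std::size_t n = lag; n < frame.size(); ++n)
            r[lag] += static_cast<double>(frame[n]) * frame[n - lag];
    return r;
}

// Levinson-Durbin recursion; r must be non-empty, the model order is r.size() - 1.
ArModel levinsonDurbin(const std::vector<double>& r)
{
    const std::size_t order = r.size() - 1;
    ArModel m{std::vector<double>(order + 1, 0.0), r[0]};
    m.a[0] = 1.0;
    for (std::size_t i = 1; i <= order && m.error > 0.0; ++i)
    {
        double acc = r[i];
        for (std::size_t j = 1; j < i; ++j)
            acc += m.a[j] * r[i - j];
        const double k = -acc / m.error;  // reflection coefficient

        const std::vector<double> prev = m.a;
        for (std::size_t j = 1; j < i; ++j)
            m.a[j] = prev[j] + k * prev[i - j];
        m.a[i] = k;
        m.error *= (1.0 - k * k);
    }
    return m;
}

// Magnitude response sqrt(error) / |A(e^{jw})| sampled at the first fftSize / 2 bin frequencies.
std::vector<float> arEnvelope(const ArModel& m, std::size_t fftSize)
{
    const double pi = 3.14159265358979323846;
    std::vector<float> envelope(fftSize / 2);
    for (std::size_t bin = 0; bin < envelope.size(); ++bin)
    {
        const double w = 2.0 * pi * bin / fftSize;
        double re = 0.0, im = 0.0;
        for (std::size_t j = 0; j < m.a.size(); ++j)
        {
            re += m.a[j] * std::cos(w * j);
            im -= m.a[j] * std::sin(w * j);
        }
        envelope[bin] = static_cast<float>(std::sqrt(m.error) / std::sqrt(re * re + im * im));
    }
    return envelope;
}
```

A call such as `arEnvelope(levinsonDurbin(autocorrelation(frame, 12)), fftSize)` would then give one envelope value per displayed bin; the order of 12 is only an example. Because the autocorrelation here is an unnormalised sum, the gain `sqrt(error)` gives the model the same total energy as the frame, which should put the curve on roughly the same scale as an unnormalised FFT magnitude of that frame. Any extra scaling applied by your FFT would need to be matched by hand.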

pichenettes


I've tried cepstral analysis and the envelope was utter crap. Can you go into more detail on the "Low-order AR modelling"? – Goz Oct 21 '12 at 07:41
I've also tried LPC analysis but I just can't get the LPC results to match up to the FFT based results. It also has waaaay too much overhead to be particularly useful in this case. – Goz Oct 21 '12 at 07:43
By low-order AR modeling, I meant using a small number of coefficients. – pichenettes Oct 21 '12 at 08:46
My suggestions are signal analysis methods, aimed at recovering a coarse, low-dimensionality, pitch-robust representation of a signal similar to the filter in a source-filter representation. – pichenettes Oct 21 '12 at 08:47
Can you go into more detail on that? For one I'm not at all sure how to "plot the magnitude of the frequency response of the estimated all-pole filter" ... I've had a good look up on AR modelling and I now know what it is but I'm none-the-wiser on how to use it ... – Goz Oct 21 '12 at 09:10
I edited my question .. can you comment on the new results? – Goz Oct 21 '12 at 09:39
What you did is totally different from what I had in mind (Read more about LPC spectrum here: http://www.scribd.com/doc/29323516/39/LPC-Spectrum) but it seems to work well for your use case - it is akin to considering your spectrum as a time-series and smoothing it. – pichenettes Oct 21 '12 at 09:54
Thanks for the link on the LPC spectrum. I have a question about that over here: http://dsp.stackexchange.com/questions/4714/matching-an-lpc-magnitude-spectrum-to-fft-magnitude-spectrum. The big problem with the LPC is I just can't get it to line up with the FFT spectrum (the magnitudes are different). All in though ... LPC is waay too expensive to be doing on an iPhone as well as an FFT ... this autoregressive modelling appears to be much much cheaper. – Goz Oct 21 '12 at 10:34
How does LPC analysis relate to this auto-regressive method btw? Are the 2 related somehow? My understanding of LPC is that it's all about extracting some sort of data (this section is the "magic" section for me ;)) from an autocorrelation. Seeing as an autocorrelation can be generated very similarly to a cepstrum, is LPC analysis similar to performing Burg's maximum entropy method on the cepstrum? I'm probably talking out my arse here but would be interested to know anyhow :) – Goz Oct 21 '12 at 18:25
And oh bollox, I'm still running my old peak picking algorithm :( This means my magnitude calculation listed above is completely wrong :( – Goz Oct 21 '12 at 18:47

score 1 · Answer 2 · answered Oct 21 '12 at 01:14

This seems to be more a question of style than anything really based in theory if I read this correctly. It's not clear exactly what you want. What information would you hope that your user could extract about the audio signal by looking at the display? If your signal of interest truly has a "jaggy" spectrum (and your example does), then I wouldn't see a problem with plotting a "jaggy" envelope.

You did point out the issue in the first image, where your interpolated envelope isn't strictly greater than or equal to all of the bins in your FFT. I think you could fix that by tweaking your algorithm for selecting control points in your spline interpolation. Instead of only selecting control points where there appears to be a peak, you could insert a control point to your spline on every bin $B_n$ where $B_n > B_{n-1}$. That is, for every bin, if it is greater than its left neighbor, make it a control point on your spline.

In the first image, then, the first 5 bins would all be control points, and the interpolated envelope would travel through each bin's actual value, which would likely give a better-looking "enveloping" of the underlying spectrum.
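A minimal sketch of that control-point selection might look like the following. The function name `risingBinControlPoints` is hypothetical, and always including bin 0 (so that the spline has a starting point) is an assumption added for the example rather than part of the rule above.

```cpp
#include <cstddef>
#include <vector>

// Returns the indices of the bins to use as spline control points:
// every bin that is greater than its left neighbour.
// Assumption for this sketch: bin 0 is always included as a starting point.
std::vector<std::size_t> risingBinControlPoints(const std::vector<float>& bins)
{
    std::vector<std::size_t> points;
    if (bins.empty())
        return points;

    points.push_back(0);
    for (std::size_t n = 1; n < bins.size(); ++n)
    {
        if (bins[n] > bins[n - 1])
            points.push_back(n);
    }
    return points;
}
```

The returned indices, together with the bin values at those indices, would then replace the peak list that is currently fed to the spline.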

Thanks, I would expect a somewhat smoother envelope, probably with far fewer control points in it for a smoother overall look. Your suggestion of adding control points would help in the above case but wouldn't help at all in the case of descending pitch. That leaves making every bin a control point as the only other option, which, I'm sure you'll agree, is not going to solve the smoothness problem. — Goz, Oct 21 '12 at 07:50

score 1 · Answer 3 · answered Oct 21 '12 at 09:36

It seems to me that your goal is to obtain a line which is always strictly above the spectrum. This goes against many signal processing methods which tend to have an energy conservation property - with these methods, lowering the resolution will spread a spectral peak over adjacent FFT bins (say a peak at 1.0 in bin n will become spread into 0.1 at bin n-1, 0.8 at bin n, 0.1 at bin n+1). Of course, "hacks", post-processing, etc., could certainly give something acceptable according to your visual criteria, the problem is that they would be meaningless from a signal processing perspective.

I would suggest a slightly different approach: a/ try to stay true to a well-defined, established signal-processing metric; and b/ have a "communication plan" about what your plot expresses and how it is useful to the user.

First, you could communicate that you are plotting a low-resolution representation, which puts less emphasis on individual spectral peaks and more on the overall frequency distribution - which would be the "footprint" of the sound in the ear, independently of pitch. For this you could use:

A lower-order FFT (say, with a size equal to the square root of the size used for the big FFT), sinc-interpolated to fit your full size.
Auditory spectrum. Take the output of your main FFT, sum the energy over overlapping triangular windows centered on the critical bands of hearing, convolve by a spread function, and translate back from Bark scale to linear frequency scale. See chapter 3 of Tristan Jehan's PhD for a quick introduction.

Secondly, you could make use of temporal information; and communicate that what you are plotting is the current spectrum plotted on the backdrop of its "context" from the past few seconds - allowing the user to notice changes in sound (this kind of visualization is actually useful when setting up a multiband compressor).

A low-pass filtered spectrum: background[n] = 0.9 * background[n] + 0.1 * current_fft_frame[n] or...
Maximum of each FFT bin over past $n$ analysis frames.

Note that this last option is the only one in my lists that is guaranteed to always be greater than or equal to the current frame's spectrum.
