
I need to reproduce the figure below, but with a different blue line (source). It shows the ideal normalized frequency spectrum (grey) of a spectral ripple with 0.7 ripples per octave, together with the electric spectrum of a speech coding strategy. Below the figure you can also find the time signal of the spectral ripple sound. Unfortunately, the sound is only half a second long, so I understand the frequency resolution I can achieve with a sampling frequency of 44100 Hz is 2 Hz (Fs/num_samples = 44100/22050).

[image: acoustic/electric spectrum]

[image: time signal]
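
As a quick sanity check of that bin spacing (a minimal sketch; the file name below is a placeholder, not the actual stimulus file):

from scipy.io import wavfile

# "spectral_ripple.wav" is a placeholder name for the 0.5 s ripple stimulus
Fs, x = wavfile.read("spectral_ripple.wav")
delta_f = Fs / len(x)   # 44100 / 22050 = 2 Hz bin spacing
print(delta_f)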

I have several questions on how to achieve this (in Python):

  1. Why does my frequency spectrum show such "spiky" behaviour instead of a sinusoid, as in the first figure?
import numpy as np
from scipy.io import wavfile
import matplotlib.pyplot as plt

Fs, audio_signal = wavfile.read(sound_name)  # Fs = 44100 Hz
FFT = np.fft.rfft(audio_signal)
abs_fourier_transform = np.abs(FFT)
power_spectrum = np.square(abs_fourier_transform)
frequency = np.linspace(0, Fs/2, len(abs_fourier_transform))
max_power = power_spectrum.max()
normalized_power = power_spectrum / max_power
plt.plot(frequency, np.squeeze(normalized_power), color='grey')
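
Since the target figure uses an octave-spaced frequency axis, a log-frequency plot of the same data (minimal sketch, reusing the variables above) may make the comparison easier:

plt.semilogx(frequency[1:], normalized_power[1:], color='grey')  # skip the 0 Hz bin on a log axis
plt.xlabel('Frequency (Hz)')
plt.ylabel('Normalized power')
plt.show()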

[image of the resulting spectrum]

  2. I thought about zero padding, but I read that this does not improve the frequency resolution ("zero padding can lead to an interpolated FFT result, which can produce a higher display resolution"). I'm not entirely sure what this entails, but with 22050 samples and zero padding up to 2^15 I think this would give an (interpolated) bin spacing of about 1.35 Hz (44100/32768). Why does the spectrum seem to be shifted to the left? Am I right to skip zero padding to improve my figure? Similarly, I tried repeating the signal, but this also seemed to distort the spectrum, and I read that this does not help either.

FFT = np.fft.fft(audio_signal, 2**15)

[image with zero padding]
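
One thing that can produce an apparent leftward shift is a mismatch between the FFT length and the frequency axis (for example, switching from rfft to the full fft while keeping an axis that runs from 0 to Fs/2). A minimal sketch that keeps the two consistent, reusing audio_signal and Fs from above:

import numpy as np

N_fft = 2**15                                  # zero-padded length
FFT = np.fft.rfft(audio_signal, n=N_fft)       # rfft keeps only non-negative frequencies
power_spectrum = np.abs(FFT)**2
frequency = np.fft.rfftfreq(N_fft, d=1/Fs)     # axis derived from the padded length
normalized_power = power_spectrum / power_spectrum.max()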

  3. I have applied a Blackman-Hanning window (the average of a Blackman and a Hann window) prior to performing the FFT, but I noticed this only reduces the spiking at the lower frequencies. Why is this? How can I extend this effect to higher frequencies?
window = 0.5 * (np.blackman(len(audio_signal)) + np.hanning(len(audio_signal)))  # average of a Blackman and a Hann window
audio_signal *= window  # apply the window before the FFT
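
Just as a sketch (Hann here is an assumption, not the window actually used): applying a single standard window and compensating its coherent gain keeps the overall level comparable to the un-windowed spectrum:

import numpy as np

window = np.hanning(len(audio_signal))                # a single standard (Hann) window
windowed = audio_signal * window                      # keep the original signal unmodified
FFT = np.fft.rfft(windowed)
power_spectrum = np.abs(FFT)**2 / np.sum(window)**2   # compensate the window's coherent gain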

[image after window]

  4. I don't recall the FFT itself attenuating power at higher frequencies, so why does the power decrease towards higher frequencies in my image? How can I counter this? I have applied a pre-emphasis filter, which helps slightly, but is there a better solution? Also, why is the spectrum suddenly cut off? This could have been present before, but the power may have been too small to notice.

import scipy.signal

audio_signal = scipy.signal.lfilter(coeff_numerator, coeff_denominator, audio_signal)  # pre-emphasis filter (coefficients not shown)

[image after pre-emphasis]
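
For reference, a typical first-order pre-emphasis filter looks like the sketch below; the 0.97 coefficient is a common textbook value, not necessarily the one actually used above:

from scipy.signal import lfilter

# y[n] = x[n] - a * x[n-1]; a = 0.97 is a common choice (an assumption here,
# not the coefficients used in the question)
a = 0.97
audio_signal_pre = lfilter([1.0, -a], [1.0], audio_signal)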

In short, is there a way I can improve my spectral output to make it more similar to the image I am trying to replicate?

  • Hi! Welcome here :) I don't understand what the blue line shows, even after looking at the poster. Could you explain? – Marcus Müller May 11 '22 at 08:30
  • To be honest, I'm not completely sure, but I interpreted it as the normalized power in that particular frequency channel extracted by the speech coding strategy. However, that is a separate issue from my questions listed above. – NonIntellego May 11 '22 at 08:50
  • Thanks for explaining it anyway! (it helps me, as I'm really not familiar with the mental framework of psychoacoustics, I might have overattributed too much to that line, so your comment helps me read the rest of the question, so again, thanks :) ) – Marcus Müller May 11 '22 at 08:52
  • hm, do you have any insight into the algorithms of the "Nucleus" Processor? To me it looks like that's just the result of how that device is designed to represent octave-spaced frequencies: Using discrete tones (quite possibly the output of an (I)FFT) on a coarser grid than you use to observe – hence you see "dips" between these tones. – Marcus Müller May 11 '22 at 09:04
  • Not sure I can follow, but the Nucleus has 22 channels and that is why you see 22 frequency bands in the first image. Per channel/band you see a blue line that probably is the average power the speech coding strategy (by means of FFT or bandpass filters) is able to extract in that band. I think they chose an "octave spaced" axis to represent the characteristics of the sound (not of the Nucleus). What do you mean by the dips? Do you mean the spiky behaviour in my figures? Because those are unrelated to the blue line as they belong to the sound. – NonIntellego May 11 '22 at 09:19
  • yes, exactly. So, the Nucleus gets the requirement to put this and that much power (or amplitude) in each of its 22 frequency bands. Then, it generates a (discrete) time-domain signal out of it, which is saved to a .wav file and you do your analysis on. The question I have is "Do you have an indication what specifically Nucleus does to go from "power per channel" to "time-domain signal"?". – Marcus Müller May 11 '22 at 10:11
  • The nucleus does not generate a discrete time-domain signal, it generates an electrodogram/pulse trains (see: https://www.semanticscholar.org/paper/Cochlear-Implant-Signal-Processing-ICs-Swanson-Baelen/20759e2cf380b76a1615d9a48c708dee62de601b/figure/1). An electrodogram consists of pulses of a certain amplitude that convey the power of the sound in that frequency band to the cochlear nerve via the electrode in the cochlear implant. If you look at Figure 2 of that source you see the general process of a speech coding strategy. – NonIntellego May 11 '22 at 11:12
  • Principally, you have a filter bank (FFT/bandpass) to extract the information, some additional processing to improve the sound perception for the cochlear implant user and then the loudness/power is conveyed by the pulse train amplitude. I am actually unsure what specific speech coding strategy this cochlear implant uses, I found, amongst others, MPEAK in articles. Here is an overview of that strategy: https://ecs.utdallas.edu/loizou/cimplants/tutorial/loifig21.gif, but here you can find more on the strategies in general: https://ecs.utdallas.edu/loizou/cimplants/tutorial/tutorial.htm – NonIntellego May 11 '22 at 11:12
  • ok, but where does your wav file come from, then? (not asking about the actual device, I'm asking about the simulation results, which is what you're analyzing here, it seems!) – Marcus Müller May 11 '22 at 11:13
  • It's from a spectral ripple test. I think the original authors of this test are listed as the first reference on the poster (https://link.springer.com/content/pdf/10.1007/s10162-007-0085-8.pdf). I did not generate the wav file, I merely want to reproduce the spectrum of the sound. – NonIntellego May 11 '22 at 11:19
  • First of all: that Cochlear implant processing is interesting! Thanks. Second of all: oh, dear. Yeah, that is pretty much what you describe and I simply hadn't wrapped my head around it the way you wrote it; sorry for the noise. Anyways, Won 2007 is pretty interesting; I could imagine the shape of the 500ms windowing (time-domain windowing = convolution with the window Fourier transform in frequency domain) leading to destructive interference on regular intervals – hence the gaps (what I called dips above) between your spiky spikes :) – Marcus Müller May 11 '22 at 11:44
  • Ah thank you! So the issue is inherent to the FFT and in particular the limited number of samples? Do you perhaps have a source where I can read more on this? And do you know why windowing removes this destructive interference at low frequencies only? – NonIntellego May 11 '22 at 11:55
  • ah, the problem might be inherent to the way the signal is generated – Marcus Müller May 11 '22 at 11:59
  • And why does windowing remove this interference at low frequencies only? – NonIntellego May 11 '22 at 12:11

0 Answers