Hello,
Apologies for my ignorance; I am trying to learn this subject for a final project I am undertaking.
Brief background:
I am developing a speech recognition algorithm that identifies whether someone is saying a particular word, in this case "Yes" or "No".
I am computing MFCCs, following this paper:
https://arxiv.org/pdf/1003.4083.pdf
What I have done so far:
- Pre-emphasis
- Framing
- Hamming Windowing
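
For reference, the three steps above can be sketched roughly as follows. This is a minimal pure-Python sketch, not the paper's code: the function names, the pre-emphasis coefficient of 0.97, the 8 kHz sample rate, and the frame length and hop are all assumed values chosen for illustration.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split into overlapping frames; trailing samples that do not fill a frame are dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n_points):
    """Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)), for N >= 2."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def window_frames(frames):
    """Multiply every frame sample-by-sample with a Hamming window."""
    w = hamming(len(frames[0]))
    return [[s * wn for s, wn in zip(frame, w)] for frame in frames]

# Toy input: 0.1 s of a 440 Hz tone at an assumed 8 kHz sample rate.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]

emphasized = pre_emphasis(signal)
frames = frame_signal(emphasized, frame_len=200, hop=80)  # 25 ms frames, 10 ms hop
windowed = window_frames(frames)
```

Each entry of `windowed` is still a time-domain frame; nothing has been transformed to the frequency domain yet.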
The part I am struggling with is "Step 4". If I take the FFT of each time-domain windowed frame and multiply the result by the Mel filters' frequency response, would that be enough?
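
To make the question concrete, here is a minimal sketch of one common way to do this: take the DFT of a windowed frame, keep its power spectrum, then weight it with triangular filters spaced evenly on the mel scale and sum per filter. Several things here are assumptions rather than details confirmed by the paper: the use of the power spectrum (rather than magnitude), the triangular filter shape, the bin-mapping formula, the filter count, and all function names. It uses a naive DFT from the standard library so it runs as-is; a real implementation would use an FFT routine.

```python
import cmath
import math

def hz_to_mel(f_hz):
    """Widely used mel mapping: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive O(N^2) DFT of one frame; returns |X[k]|^2 / N for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x_k) ** 2 / n)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with edges equally spaced in mel between 0 Hz and Nyquist."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters triangles need n_filters + 2 edge frequencies.
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edge_bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = [[0.0] * n_bins for _ in range(n_filters)]
    for m in range(1, n_filters + 1):
        left, centre, right = edge_bins[m - 1], edge_bins[m], edge_bins[m + 1]
        for k in range(left, centre):       # rising slope
            bank[m - 1][k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling slope, peak of 1.0 at centre
            bank[m - 1][k] = (right - k) / (right - centre)
    return bank

def filterbank_energies(frame, bank):
    """Weight the frame's power spectrum by each filter and sum: one value per filter."""
    spec = power_spectrum(frame)
    return [sum(w * p for w, p in zip(filt, spec)) for filt in bank]

# Toy frame: 64 samples of a Hamming-windowed 1 kHz tone at an assumed 8 kHz rate.
sample_rate, n_fft = 8000, 64
frame = [(0.54 - 0.46 * math.cos(2 * math.pi * n / (n_fft - 1)))
         * math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(n_fft)]

bank = mel_filterbank(n_filters=8, n_fft=n_fft, sample_rate=sample_rate)
energies = filterbank_energies(frame, bank)
```

In this sketch the multiplication happens in the frequency domain, on the spectrum of the windowed frame, and yields one energy per mel filter. MFCC pipelines typically go on to take the logarithm of these energies and apply a DCT, so this step on its own would not produce the final coefficients.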
I also have a problem with this equation:
For example, what does F represent: the FFT of the windowed frame, or the windowed frame in the time domain?
I hope someone can help. Sorry for my lack of understanding; I am still learning.