Real-time ASR Dev Blog: February 2022

Monday, 7 February 2022

Speech Recognition:

Deleted all IIR filters as they seemed to be adding noise. Now only using pre-calculated FIR filters.

There are 3 FIR low-pass filters at 1100, 2200 and 4200, with the sample data split into four ranges: 0-250 Hz, 250-1000 Hz, 1000-1500 Hz and 1500-3000 Hz. This reduces noise as much as possible, which matters because the processing mostly depends on how busy the wave data is.
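
For anyone wondering what "pre-calculated FIR" can look like, here is a minimal sketch using a windowed-sinc design. It is not the actual filter code from this project. It assumes the 1100, 2200 and 4200 figures are cutoff frequencies in Hz, and the 10 kHz sample rate, the 101-tap length and the Hamming window are placeholder choices; the names lowpass_fir, apply_fir and FIR_TAPS are made up for the example.

```python
import math

def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Windowed-sinc low-pass taps (Hamming window), scaled to unity gain at DC."""
    fc = cutoff_hz / sample_rate_hz          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2                 # centre tap index
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response; the centre tap is the limit value.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def apply_fir(samples, taps):
    """Direct-form FIR convolution; output has the same length as the input."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * samples[i - k]
        out.append(acc)
    return out

SAMPLE_RATE_HZ = 10_000  # assumption: the post does not state the sample rate

# Calculated once at start-up, then reused for every buffer.
FIR_TAPS = {cutoff: lowpass_fir(cutoff, SAMPLE_RATE_HZ) for cutoff in (1100, 2200, 4200)}
```

Because the taps are fixed, the per-buffer cost is just the convolution, and an FIR has no feedback path that could recirculate noise the way an IIR can.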

"Busyness" is calculated for the 250-1000 Hz range (sample data difference / RMS power). If the range is busy, the FFT is processed.
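
A rough sketch of one way to read "sample data difference / RMS power": the mean absolute difference between neighbouring samples, divided by the RMS of the buffer. This is an interpretation, not the exact formula used here. The function names and the BUSY_THRESHOLD value are hypothetical.

```python
import math

def busyness(samples):
    """Mean absolute sample-to-sample difference divided by RMS power."""
    if len(samples) < 2:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return 0.0  # pure silence: nothing to measure
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    return (sum(diffs) / len(diffs)) / rms

BUSY_THRESHOLD = 0.2  # hypothetical value, would need tuning on real audio

def should_run_fft(band_samples, threshold=BUSY_THRESHOLD):
    """Gate the FFT on how busy the band-limited samples are."""
    return busyness(band_samples) > threshold
```

Dividing by the RMS makes the measure independent of volume, so a quiet speaker and a loud one give the same value for the same waveform shape.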

The FFT runs on each of the four frequency ranges separately. If the four ranges are merged with only one FIR, the output is too noisy for the lower ranges.

After this, the FFT output data is smoothed, with the top frequencies averaged together, but I will be changing this to 12 fixed frequency ranges to suit the phoneme ranges.
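
The planned fixed-range smoothing could be sketched like this: average the FFT magnitude bins that fall inside each range. The bin spacing follows from the 102.4ms buffer (1 / 0.1024 s, about 9.77 Hz). The band edges below are evenly spaced placeholders, not the phoneme ranges, which are not decided in this post, and band_averages is a made-up name.

```python
import math

BUFFER_SECONDS = 0.1024
BIN_HZ = 1.0 / BUFFER_SECONDS  # FFT bin spacing, about 9.77 Hz

# Placeholder edges: 13 edges give 12 ranges of 250 Hz each, covering 0-3000 Hz.
BAND_EDGES_HZ = [250 * i for i in range(13)]

def band_averages(magnitudes, bin_hz=BIN_HZ, edges_hz=BAND_EDGES_HZ):
    """Average the FFT magnitudes whose bin frequency lies in each [lo, hi) range."""
    averages = []
    for lo, hi in zip(edges_hz, edges_hz[1:]):
        first = math.ceil(lo / bin_hz)                        # first bin at or above lo
        last = min(math.ceil(hi / bin_hz), len(magnitudes))   # first bin at or above hi
        bins = magnitudes[first:last]
        averages.append(sum(bins) / len(bins) if bins else 0.0)
    return averages
```

Swapping in real phoneme ranges would only mean replacing BAND_EDGES_HZ with 13 hand-picked edges.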

There's still a lot to change, but I thought I would post this as it's completely different and still only takes ~1.5ms to process (from a 102.4ms sample buffer). The NLP only takes 0.1-0.2ms on top. So it is about 98.5% idle, sitting in a thread::yield().
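
The idle figure is simple arithmetic on the timings above. A quick check, using only those numbers (the idle_percent helper is just for illustration):

```python
BUFFER_MS = 102.4        # length of one sample buffer
FFT_MS = 1.5             # approximate filtering + FFT time per buffer
NLP_MS = (0.1, 0.2)      # NLP time range on top of that

def idle_percent(busy_ms, buffer_ms=BUFFER_MS):
    """Share of each buffer period in which the thread has nothing to do."""
    return 100.0 * (1.0 - busy_ms / buffer_ms)

fft_only = idle_percent(FFT_MS)
with_nlp = [idle_percent(FFT_MS + nlp) for nlp in NLP_MS]
```

The FFT stage alone leaves about 98.5% of the period idle, and adding the NLP time brings that down only slightly, to between 98.3% and 98.4%.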

Normal speech recognition takes 0.5-0.7 seconds per sample. The problem seems to be in the autoregression/prediction modelling after the FFT and MFCC/Mel filter algorithms. It is precise, but slow.

In the picture, the long vowels "a-" and "e-" are detected.