Real-time ASR Dev Blog

Wednesday, 6 July 2022

Speech Recognition:

The frequencies for "cake" and "tree" are reasonably consistant now.

Response time is only 5-10ms from the end of recording the last sample to the end of processing the data.

"K-ay" (in "cake") and "t-r-ee" (in "tree") are 90-100% consistant with good volume. The second "k" sound in 'cake' is often a low-volume miss.

Robust in *mild* accent change, vagueness/clarity change, volume change, pitch change. Low volume is the worst.

16khz fixed input frequency.

FIR Hilbert, and FIR Low pass filters have been modified to work with factors of 8 to fit the frame sizes. Dont know why I didn't do that before. One Low Pass at 1000hz, the other Low Pass at 8000hz (dual samples feed the FFT).

Frame sizes are one 16ms frame, the rest are 32ms each.

16khz/368ms max. 5888 samples: (G1) 256-512, (G2) 512-512-512-512-512, (G3) 512-512-512-512-512.

Added a Plosive Detector to determine whether G1 samples are a plosive consonant.

Non-plosive consonants are now checked with vowels.

Seeing all the frequencies I need to for the consonants & vowels I want. I just need to add more and detect low volume samples better.

I'm aiming for a small list of words such as "up, down, left, right, start, stop, hi, bye..." so it can become useful straight away. It's also easier to handle strong accents and missing sounds if the word list is smaller. I have many ideas for solving it, but it'll at least be as good as current keyword-spotters. If this works out then more is possible.