Real-time ASR Dev Blog

Tuesday, 9 August 2022

Speech Recognition:

This month was spent on fine-tuning.

Processing takes approximately 1 ms per 32 ms frame, roughly 3% of real time.

Framing is unchanged: one 16 ms frame, the rest 32 ms.
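
As a rough illustration of that framing, here is a minimal sketch. The post only gives the frame lengths, so several things here are assumptions: a 16 kHz sample rate, the short frame coming first, and frames that do not overlap. The function name `frame_bounds` and its parameters are hypothetical.

```python
def frame_bounds(total_samples, sample_rate=16000, first_ms=16, rest_ms=32):
    """Return (start, end) sample index pairs for one short frame
    followed by longer frames. A trailing partial frame is dropped.

    Assumptions (not from the post): 16 kHz audio, the 16 ms frame
    comes first, and frames do not overlap.
    """
    first = sample_rate * first_ms // 1000   # 256 samples at 16 kHz
    rest = sample_rate * rest_ms // 1000     # 512 samples at 16 kHz
    bounds = []
    start, size = 0, first
    while start + size <= total_samples:
        bounds.append((start, start + size))
        start += size
        size = rest                          # every later frame is the long one
    return bounds


if __name__ == "__main__":
    for start, end in frame_bounds(1400):
        print(start, end)
```

With 1400 samples this gives one 256-sample frame followed by two 512-sample frames; the leftover 120 samples are not enough for another frame.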

Resolution has been increased to the values below (these take the place of an MFC):
438, 469, 500, 532, 563, 594, 625, 657, 688, 719, 750, 782, 1000, 1250, 1500, 1625, 1750, 1875, 2000, 2125, 2250, 2500, 3000, 3500, 4000, 5500, 8000
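
One thing worth noting about those values: they all line up with multiples of 31.25, which is the bin spacing you would get from a 32 ms analysis window (1000 / 32), rounded up to a whole number. The sketch below just checks that arithmetic. It assumes the values are in Hz and that they are being read as FFT bin centres; neither is stated above, and the names `BANDS`, `nearest_bin` and `bin_ceiling_hz` are hypothetical.

```python
import math

# Values copied from the list above; the unit is assumed to be Hz.
BANDS = [438, 469, 500, 532, 563, 594, 625, 657, 688, 719, 750, 782,
         1000, 1250, 1500, 1625, 1750, 1875, 2000, 2125, 2250,
         2500, 3000, 3500, 4000, 5500, 8000]

# Bin spacing for a 32 ms window: 1000 ms / 32 ms = 31.25 Hz.
BIN_SPACING = 1000 / 32


def nearest_bin(freq_hz):
    """Index of the 31.25 Hz bin closest to freq_hz."""
    return round(freq_hz / BIN_SPACING)


def bin_ceiling_hz(k):
    """Centre of bin k, rounded up to a whole number."""
    return math.ceil(k * BIN_SPACING)


bins = [nearest_bin(f) for f in BANDS]
all_match = all(bin_ceiling_hz(k) == f for k, f in zip(bins, BANDS))

if __name__ == "__main__":
    print(bins)
    print(all_match)
```

The spacing is one bin at the low end (bins 14 to 25) and widens progressively above that, ending at bin 256.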

Fletcher-Munson/equal-loudness calibration values for the above have been improved.

Frequency transition detection is unchanged.

More phonemes added: 26 unique patterns (23 vowels, 1 non-plosive consonant, 2 consonants).

That is basically all the vowels I need to add, including some duplicates for accents. I have more patterns for consonants, but I'm not testing those yet.

Finally added peak volume normalisation, which doubled the reliability/robustness.
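
For anyone unfamiliar with the idea, here is a minimal sketch of peak normalisation in its usual form: scale the signal so its loudest sample hits a fixed target. This is the generic technique, not necessarily how it is done in this recogniser. Whether the gain is computed per frame or per utterance is not specified above, and the name `peak_normalise` and the parameters `target_peak` and `silence_floor` are hypothetical.

```python
def peak_normalise(samples, target_peak=1.0, silence_floor=1e-9):
    """Scale samples so the largest absolute value equals target_peak.

    Near-silent input (peak below silence_floor) is returned unchanged
    so that background noise is not amplified to full scale.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < silence_floor:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


if __name__ == "__main__":
    quiet = [0.1, -0.25, 0.05]
    print(peak_normalise(quiet))
```

The point of doing this before pattern matching is that a quiet speaker and a loud speaker end up on the same amplitude scale, so level-dependent thresholds behave consistently.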

Saying "tree":