Real-time ASR Dev Blog

Tuesday, 7 November 2023

Speech Recognition (benchmark of "yes"/"no"):

C version/Console:
Added filtering for "knock/tap" noise
Changed Vol Peak Normalisation to use RMS x 2.4 instead of highest Vol Peak. Slightly more accurate versus background noise, faster, and phoneme/viseme loudness is represented better.
Updated framing. Now using 256 samples up to the first plosive in a syllable, then 512 samples after. This helps in Consonant framing.
Replaced fixed framing 256-512-512-512-[...] with dynamic method mentioned above.
Updated Consonant Plosive Find step values. Initial min step, and F1 group power from last frame.
Updated Consonant identification
Updated Vowel Formant 1 & 2 Focusing
Updated Equal Loudness (increased/doubled some 4khz resonance numbers to better match spectrograms in Praat. 656hz (550-650), 1333hz (1000-1333), 2000-2333hz (1666-2333), 8000hz was already double).

Speech Commands Benchmark:
NO (405 records):
"no" = 44.69% (goal 50%) . "n" or "oh" = 81.7% (goal 90%)
Error | "yes" = 0.99%

YES (419 records):
"ye"/"yeah" = 22.43% (goal 50%). "y", "e", or "air" = 67.5% (goal 90%)
Error | "no" = 2.63%

Time: 0-2ms each.

A lot of changes were made, especially in Consonant framing to tell the difference between "y" and "n" better. Small improvements in vowel detection.

Many of the problems in detecting more "yes" (the "y") are noise or volume related. Other issues, are in vowel transition detection, and finding the trailing "s" in "yes".

If "yes" improves to >40% I'll then test "on" & "off".