Real-time ASR Dev Blog: June 2022

Friday, 3 June 2022

Speech Recognition:

Last month I intended to output "cake" and "tree" but it was too inconsistant.

This month is half-way there.

Framing and Resolution was updated to support 16khz instead of 10khz, to better detect consonants at 4000-5500hz.

Sample sizes are double but frames are less, so the speed is roughly the same.

Sample framing is: (G1) 16, 32, 32, 32, 32ms. (G2) 64, 64, 64ms.

Resolution groups have been updated slightly. 15 instead of 13, starting from 400hz-5500hz.

No changes to vowel Transition model, but now also outputs which consonants/vowels are detected.

Consonant detection is mostly fine with a clear voice. Only "k" tested.

Vowel detection is inconsistant. As vowels rely on multiple frames, changing to 16khz as well as reducing frames didn't help, so will be testing 12khz in the future with shorter/more frames.

Saying "cake" (below) with phoneme output.