Speech Recognition:
Rewrote everything to better frame the data around noise at the beginning and end.
Using an RMS-derived 'busyness' measure and the volume peak to auto-adjust the noise floor: if you gradually raise your voice it will only output "*noise*", but if you suddenly speak from quiet it will process the voice.
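A minimal sketch of how I read that adaptive-floor behaviour (the names, constants, and structure here are my assumptions, not the actual implementation): the floor slowly tracks the ambient RMS level, so a gradual rise gets absorbed as noise, while a sudden jump well above the floor is treated as voice.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def classify_frames(frames, adapt_rate=0.05, trigger_ratio=3.0):
    """Label each frame 'noise' or 'voice' while adapting the floor.

    adapt_rate and trigger_ratio are illustrative guesses.
    """
    floor = rms(frames[0]) or 1e-9
    labels = []
    for frame in frames:
        level = rms(frame)
        if level > trigger_ratio * floor:
            labels.append("voice")  # sudden jump above the floor: process as speech
        else:
            labels.append("noise")
            # slowly follow the ambient level, so a gradual rise in volume
            # keeps being treated as noise
            floor += adapt_rate * (level - floor)
    return labels
```

With constant quiet frames followed by one sudden loud frame, only the loud frame comes out as "voice"; a slow ramp instead drags the floor up with it.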
The main differences now are in framing, as framing and pitch detection are the most important parts.
So without prediction, the main time consumer, all I have to work with is the FFT output. That makes it difficult to detect many sounds, including short-lived ones (plosive consonants), as FFTs need many samples to be reliable.
To detect consonants and vowels together, I'm trying a theory of framing, resolution, and sound transitions. Frame lengths are fixed at 16 ms, 32 ms, 64 ms, 64 ms (at 8 kHz: 128, 256, 512, 512 samples), starting from when the voice starts. Consonants are a one- or two-part short burst (the 16 ms and 32 ms frames); vowels are two-part (the two 64 ms frames). The theory is to use a ~5-group resolution in Hz for consonants and 12 for vowels, with the 12 vowel groups based on general vowel speaking ranges. Consonants aren't always at the beginning of something you say, but most involve plosives, so they'll always have silence before them, creating a new speech frame with the consonant at the beginning.
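The schedule and grouping above can be sketched like this (a rough reading of the description; the function names and the even-split grouping are assumptions, and in particular the real 12 vowel bands are based on speaking ranges rather than an even split):

```python
SAMPLE_RATE = 8000
FRAME_SAMPLES = (128, 256, 512, 512)  # 16 ms, 32 ms, 64 ms, 64 ms

def frame_bounds(voice_start):
    """Fixed-length frame boundaries, anchored at the voice onset sample."""
    bounds, pos = [], voice_start
    for n in FRAME_SAMPLES:
        bounds.append((pos, pos + n))
        pos += n
    return bounds

def group_bins(magnitudes, n_groups):
    """Average FFT magnitudes into n_groups coarse bands
    (~5 for consonant frames, 12 for vowel frames)."""
    size = len(magnitudes) // n_groups
    return [sum(magnitudes[i * size:(i + 1) * size]) / size
            for i in range(n_groups)]
```

So each detected onset yields four fixed frames, and each frame's spectrum is collapsed to a handful of band energies before any classification.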
Another problem was a low-Hz bias in the FFT output (linearly stronger toward lower frequencies), so levelling that out made a big difference.
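One simple way to level out a bias that grows toward low frequencies is to weight each bin by its centre frequency, so a 1/f-shaped tilt flattens. This is just a plausible correction under that assumption, not necessarily the one used here:

```python
def level_spectrum(magnitudes, sample_rate=8000):
    """Scale each FFT magnitude by its bin's centre frequency.

    Assumes `magnitudes` covers 0 Hz up to Nyquist (sample_rate / 2).
    """
    n = len(magnitudes)
    bin_hz = sample_rate / (2 * n)
    # bin k sits at roughly (k + 0.5) * bin_hz; multiplying by that
    # frequency cancels a bias that falls off linearly with frequency
    return [m * ((k + 0.5) * bin_hz) for k, m in enumerate(magnitudes)]
```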
Getting good results from vowels, but framing, FFT quality, and post-processing still need work.
