Real-time ASR Dev Blog: August 2023

Speech Recognition / Phoneme Extraction Tool:

Redesigned consonant plosive-find
Redesigned consonant identification
Updated consonant frequency definitions. Added 2000 Hz (the set is now 512, 1000, 1500, 2000, 3000, 4000, 5500, 8000 Hz).
Consonant viseme upper tone/cheek emphasis changed from boolean to variable.
Updated vowel frequency definitions. Added 2333, 2666 and 3500 Hz.
Updated vowel identification
Added vowel formant 2 (F2) focusing:
   1) F2 frequencies starting above 1000 Hz cannot transition to 1000 Hz, to avoid
      interference from nasal tone, and
   2) F2 frequencies where the last two values are the same can only transition
      up/down by one.
Improved vowel F1 transition analysis to include power and average centre frequency. Flat transitions now re-alias the centre frequency to a defined frequency group. This improves detection of fine frequency transitions and makes flat frequencies more resilient to errors.
Updated Phoneme Emphasis
Updated equal-loudness
Updated GUI
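
For anyone curious, the two F2 focusing rules above can be sketched roughly like this. This is not the tool's actual code, only a minimal illustration under a few assumptions: F2 values are stored as indices into a sorted list of frequency bins, "by one" means one bin step (staying flat is also allowed), and "starting above 1000 Hz" refers to the first value of the track. The bin list `F2_BINS` and the function name `f2_transition_allowed` are hypothetical; only 2333, 2666 and 3500 Hz come from the list above.

```python
from typing import Sequence

# Illustrative F2 bins in Hz. Only 2333, 2666 and 3500 are taken from the
# changelog; the remaining values are placeholders for this sketch.
F2_BINS = [1000, 1500, 2000, 2333, 2666, 3000, 3500]
NASAL_HZ = 1000  # frequency the post says is prone to nasal-tone interference


def f2_transition_allowed(track: Sequence[int], candidate: int,
                          bins: Sequence[int] = F2_BINS) -> bool:
    """Return True if the F2 track may move to the candidate bin index.

    track     -- bin indices chosen so far for F2 (oldest first)
    candidate -- proposed bin index for the next frame
    """
    if not track:
        return True  # nothing to constrain against yet

    # Rule 1 (assumed reading): a track that started above 1000 Hz
    # may not transition to 1000 Hz.
    if bins[track[0]] > NASAL_HZ and bins[candidate] == NASAL_HZ:
        return False

    # Rule 2 (assumed reading): if the last two values are the same,
    # the next value may differ by at most one bin.
    if len(track) >= 2 and track[-1] == track[-2]:
        if abs(candidate - track[-1]) > 1:
            return False

    return True
```

For example, with these placeholder bins a track of `[3, 3]` (2333 Hz twice) may move to index 4 (2666 Hz) but not to index 5 (3000 Hz), and a track that began at 2000 Hz is never allowed to drop to 1000 Hz.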

Working:
"kay, kee, kai"
"no" x3 variations
"yes" x3 variations ("ye")

A lot of range in "no" and "yes" (and by extension all other vowels & consonants) is supported now, and I will be setting up a benchmark to test it more.

Auto Phoneme Emphasis depends on the peak volume within a range of frames. There are three ranges (2, 5 and 5 frames). Volume detection in the first two frames is important because many loud plosives (e.g. "k", "t", "p") occur there and need to be represented accurately. It also has to account for mouth-open, meaning emphasis travels high in advance. The emphasis values are all precise to the audio and don't use general curves, so if they're wrong they're wrong, but when they're right they're really right. An emphasis multiplier determines the maximum emphasis.
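
As a rough illustration of the idea (not the tool's actual code), the peak-per-range step could look something like the sketch below. It assumes the three ranges are consecutive (frames 0-1, 2-6, 7-11), that per-frame volume is already normalised to 0..1, and that emphasis is simply the range peak scaled by the multiplier. The names `range_peaks` and `emphasis` are made up for this example.

```python
from typing import List, Sequence

# Frame counts of the three ranges described in the post (2 - 5 - 5).
RANGE_LENGTHS = (2, 5, 5)


def range_peaks(volumes: Sequence[float],
                lengths: Sequence[int] = RANGE_LENGTHS) -> List[float]:
    """Peak volume in each consecutive range of frames.

    Assumption: ranges follow one another with no overlap. A range with
    no frames available (short input) gets a peak of 0.0.
    """
    peaks = []
    start = 0
    for n in lengths:
        window = volumes[start:start + n]
        peaks.append(max(window) if window else 0.0)
        start += n
    return peaks


def emphasis(volumes: Sequence[float], multiplier: float = 1.0,
             lengths: Sequence[int] = RANGE_LENGTHS) -> List[float]:
    """Emphasis per range: clamped peak volume times the multiplier.

    Assumption: volumes are normalised to 0..1, so the multiplier is the
    maximum emphasis any range can reach.
    """
    return [min(max(p, 0.0), 1.0) * multiplier
            for p in range_peaks(volumes, lengths)]
```

With a loud plosive in the first two frames, e.g. volumes starting `[0.9, 0.4, ...]`, the first range's emphasis comes straight from that 0.9 peak rather than being smoothed by a general curve.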

"Yes" and "No":

Monday, 7 August 2023