Friday, 8 April 2022

Speech Recognition:

This month was mostly theory and expanding on Framing, Resolution, and sound Transitions.

An FIR Hilbert filter (at 2500 Hz) was added when finding the initial Framing/Noise Floor level. This replaced the entire 'Busyness' code, as the Hilbert filter removes bass, top end, and 2500 Hz breath noise in one go. The Busyness calculations relied on those components to work and now show no difference, so Volume Peaks are relied on instead.
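For reference, here is a minimal sketch of why a Hilbert FIR knocks out both bass and top end at once. It is not the project's filter: it assumes a textbook Hamming-windowed, odd-length (type III) Hilbert transformer with 31 taps and a 10 kHz sample rate, which puts the centre of the passband at 2500 Hz. The names `hilbert_taps` and `gain` are made up for the example, and it does not attempt to model the breath-noise behaviour.

```python
import cmath
import math

FS = 10_000  # Hz, assumed from the 10 kHz figure used elsewhere in this post


def hilbert_taps(num_taps=31):
    """Hamming-windowed type III FIR Hilbert transformer (odd length, antisymmetric)."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd for a type III design")
    mid = num_taps // 2
    taps = []
    for i in range(num_taps):
        n = i - mid  # tap index relative to the centre tap
        if n % 2 == 0:
            taps.append(0.0)  # ideal Hilbert response is zero at even offsets
        else:
            window = 0.54 + 0.46 * math.cos(math.pi * n / mid)
            taps.append(window * 2.0 / (math.pi * n))
    return taps


def gain(taps, freq_hz, fs=FS):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs
    return abs(sum(h * cmath.exp(-1j * w * k) for k, h in enumerate(taps)))


if __name__ == "__main__":
    taps = hilbert_taps()
    for f in (0, 100, 2500, 4900, 5000):
        print(f"{f:5d} Hz  gain {gain(taps, f):.3f}")
```

Because the taps are antisymmetric around the centre, the response is forced to zero at DC and at Nyquist (5000 Hz here) and sits close to unity around a quarter of the sample rate.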

Framing is now expanded to cover both short and long voice transitions. At idle/noise, a standard 1408 samples are captured (~0.7% duty cycle, ~1 ms per 140.8 ms frame at 10 kHz). During work, up to 10 frames or 3456 samples are captured (at ~25% duty cycle, including waiting for samples).

Frames are: G1: 12.8, 25.6, 25.6, 25.6, 25.6, 25.6 ms. G2: 51.2, 51.2, 51.2, 51.2 ms.
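The frame lengths and sample counts above line up if the sample rate is 10 kHz throughout. A quick check, with the variable names invented for the example and the ~1 ms figure taken from the idle case described earlier:

```python
FS = 10_000  # Hz, assumed to apply to every frame

G1_MS = [12.8, 25.6, 25.6, 25.6, 25.6, 25.6]
G2_MS = [51.2, 51.2, 51.2, 51.2]


def samples(duration_ms, fs=FS):
    """Number of samples in a frame of the given duration."""
    return round(duration_ms * fs / 1000)


g1 = [samples(ms) for ms in G1_MS]  # [128, 256, 256, 256, 256, 256]
g2 = [samples(ms) for ms in G2_MS]  # [512, 512, 512, 512]

idle_samples = sum(g1)             # 1408, the idle/noise capture
work_samples = sum(g1) + sum(g2)   # 3456, the full 10-frame capture
idle_duty = 1.0 / 140.8            # ~1 ms of work per 140.8 ms frame

if __name__ == "__main__":
    print(idle_samples, work_samples, len(g1) + len(g2))
    print(f"idle duty cycle: {idle_duty:.1%}")
```

So G1 on its own is the 1408-sample idle capture, and G1 plus G2 gives the 10 frames and 3456 samples.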

The FFT still works well at 256 samples.

There are now 13 frequency groups for vowels: 250, 350, 450, 550, 650, 750, 1000, 1333, 1666, 2000, 2500, 3500, 5000 Hz.
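To see how these groups sit on a 256-sample FFT, here is a rough sketch that maps each one to its nearest bin. It assumes a 10 kHz sample rate and treats each listed value as a centre frequency. The real code may treat them as band edges or sum several bins per group, and `nearest_bin` is a name made up for the example.

```python
FS = 10_000     # Hz, assumed sample rate
FFT_SIZE = 256  # samples per FFT, as stated above

VOWEL_GROUPS_HZ = [250, 350, 450, 550, 650, 750,
                   1000, 1333, 1666, 2000, 2500, 3500, 5000]

BIN_WIDTH_HZ = FS / FFT_SIZE  # 39.0625 Hz per bin


def nearest_bin(freq_hz, fs=FS, n=FFT_SIZE):
    """Index of the FFT bin whose centre is closest to freq_hz."""
    return round(freq_hz * n / fs)


bins = [nearest_bin(f) for f in VOWEL_GROUPS_HZ]

if __name__ == "__main__":
    for f, b in zip(VOWEL_GROUPS_HZ, bins):
        print(f"{f:5d} Hz -> bin {b:3d} (centre {b * BIN_WIDTH_HZ:7.1f} Hz)")
```

With these assumptions, every group lands on its own bin, the top group falls exactly on the Nyquist bin (128), and one 256-sample FFT covers 25.6 ms, the same length as the standard G1 frame.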

The Low Hz Removal filter was updated for natural ear resonance/equal loudness, as our ears naturally remove bass and boost 2-5 kHz, but mics don't.

The Consonant & Vowel detection model is: Silence-ConsonantGroup-VowelGroup-Silence ... or Silence-VowelGroup-Silence.
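As an illustration of that model, here is a small sketch that checks whether a run of per-frame labels has one of the two shapes. It assumes each frame has already been labelled 'S' (silence), 'C' (consonant group) or 'V' (vowel group). The labels and the function names are invented for the example and are not the project's.

```python
from itertools import groupby


def collapse(labels):
    """Collapse consecutive repeats, e.g. 'SSCCVVVS' -> 'SCVS'."""
    return "".join(key for key, _ in groupby(labels))


def classify(labels):
    """Return which of the two shapes a labelled segment matches, or None."""
    shape = collapse(labels)
    if shape == "SCVS":
        return "consonant+vowel"
    if shape == "SVS":
        return "vowel"
    return None


if __name__ == "__main__":
    print(classify("SSCCVVVVSS"))  # consonant+vowel
    print(classify("SVVVVS"))      # vowel
    print(classify("SVCS"))        # None
```

Anything that does not reduce to one of the two shapes is rejected here. A real detector would presumably need to tolerate mislabelled frames.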

Added 15 new consonants and a lot of vowel groups as combinations. Vowel groups are a WIP as there are potentially hundreds. The longest and most difficult connected vowels without silence are words/phrases like "who are you", "are you 'aight"/"are you alright", "oh yeah", "you wait"/"you ate". They contain consonants, but if they're not pronounced with pauses between them, the only way to detect them is as one long vowel with multiple sound transitions.

So once frequency transitions can be identified clearly, patterns of vowels are much easier to detect.
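One way to picture this is as a list of the loudest frequency group in each frame, reduced to the points where it changes and stays changed. The sketch below is only an illustration: the input numbers are made up, `stable_groups` and `min_run` are hypothetical, and the actual transition detection may work quite differently.

```python
from itertools import groupby


def stable_groups(dominant_groups, min_run=2):
    """Sequence of dominant groups that persist for at least min_run frames.

    Shorter runs are treated as glitches and skipped, and a group that
    resumes after a glitch is not counted twice.
    """
    stable = []
    for group, run in groupby(dominant_groups):
        long_enough = len(list(run)) >= min_run
        if long_enough and (not stable or stable[-1] != group):
            stable.append(group)
    return stable


if __name__ == "__main__":
    # Made-up per-frame group indices (0-12) for one long connected vowel.
    frames = [3, 3, 3, 4, 7, 7, 7, 7, 2, 2, 2]
    path = stable_groups(frames)
    print(path, "transitions:", len(path) - 1)  # [3, 7, 2] transitions: 2
```

A long phrase with no silences would then show up as a single segment with several entries in the path, rather than as separate words.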

It's a nightmare, but I'm close to word output now. If I can detect "k" regularly, I can output "cake" quickly and have it be reliably distinct from "tree".



Speech Recognition (benchmark of "yes"/"no"): C version/Console: Redesigned Volume Normalisation to improve clarity abov...