Real-time ASR Dev Blog

Thursday, 21 December 2023

Speech Recognition (benchmark of "yes"/"no"):

C version/Console:
Redesigned Volume Normalisation to improve clarity above 500hz, and per-frame clarity. Samples are now run through a Hilbert filter to reduce bass < 500hz, and Peak Volume Normalisation is applied at the end of each frame instead of once across all frames.
The new Hilbert filter and frame-based Peak Vol Normalisation improve these issues:
1: Deep voices no longer interfere with/reduce volume normalisation of frequencies above 500hz.
2: Initial frame is now normalised to itself. Other frames are normalised to the peak volume of the current & previous frames (whichever is louder; the peak volume does not adjust lower). As Vowels are detected to follow Consonants and are generally louder, both are now maximally normalised.
Deleted first frame 1.5x boost.
Updated Consonant Formants and identification.
Redesigned Plosive detection. Frame one plosive ("k"/"p"/"t") loudness minimum must match a fixed value. Frame two(+) plosive ("n"/"m"/"y"/"r") loudness minimum must match ~50% of the last plosive minimum + ~50% of the power of the last frame.
Updated Vowel F1 & F2 transition detection to gradually favour frequencies towards the end of the sound.
Range tuning. Better detection of plosives in consonants "k", "n", "y", detecting both sudden sound increases and rolling increases at any input sound volume range, noise, etc.
Other fixes.
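
A minimal sketch of the frame-based peak normalisation and the frame-two plosive minimum described above. The function names are hypothetical, samples are assumed to be floats in -1..1, "previous frames" is read as a running maximum, and the "~50%" weights are taken as exactly 0.5. The real code may differ on all of these.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Normalise one frame in place against a running peak.
 * *running_peak must be 0 at the start of a sound, so the first frame is
 * normalised to itself; after that the reference can only rise.
 * Returns the gain applied to the frame. */
float normalise_frame(float *frame, size_t n, float *running_peak)
{
    assert(frame != NULL && running_peak != NULL);

    float peak = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(frame[i]);
        if (a > peak)
            peak = a;
    }
    if (peak > *running_peak)
        *running_peak = peak;      /* never adjusts lower */
    if (*running_peak <= 0.0f)
        return 1.0f;               /* silence so far: leave untouched */

    float gain = 1.0f / *running_peak;
    for (size_t i = 0; i < n; i++)
        frame[i] *= gain;
    return gain;
}

/* Loudness minimum for a plosive in frame two or later:
 * ~50% of the last plosive minimum + ~50% of the last frame's power.
 * (Exact 0.5 weights are an assumption.) */
float next_plosive_minimum(float last_minimum, float last_frame_power)
{
    return 0.5f * last_minimum + 0.5f * last_frame_power;
}
```

With a quiet consonant frame followed by a louder vowel frame, the consonant is scaled to full range on its own, and the vowel then raises the reference for everything after it.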

Speech Commands Benchmark:
NO (405 records):
"no" = 48.40% (goal 50%). "n", "oh", "o", "uh" = 82% (goal 90%).
Error | "yes" = 0.99%

YES (419 records):
"ye"/"yer"/"yeah" = 28.64% (goal 50%). "y", "e", "er", "air", "ah", "a", "uh" = 70-80% (goal 90%).
Error | "no" = 2.15%

Time: 1-2ms each

+4% increase to "no", and +6% increase to "yes". Consonant identification needs to be reworked as the definitions for "y" and "n" are too close. "kay-key-kai" works.

The benchmark is more of a test of noise rejection and volume normalisation than anything else. Very happy with the robustness of this now.

New consonant identification should bring at least a 10% improvement to "yes" with less error. If error stays at or below about 1% then more alternate vowels can be used and a higher result is possible.

Tuesday, 7 November 2023

Speech Recognition (benchmark of "yes"/"no"):

C version/Console:
Added filtering for "knock/tap" noise
Changed Vol Peak Normalisation to use RMS x 2.4 instead of highest Vol Peak. Slightly more accurate versus background noise, faster, and phoneme/viseme loudness is represented better.
Updated framing. Now using 256 samples up to the first plosive in a syllable, then 512 samples after. This helps in Consonant framing.
Replaced fixed framing 256-512-512-512-[...] with dynamic method mentioned above.
Updated Consonant Plosive Find step values. Initial min step, and F1 group power from last frame.
Updated Consonant identification
Updated Vowel Formant 1 & 2 Focusing
Updated Equal Loudness (increased/doubled some 4khz resonance numbers to better match spectrograms in Praat. 656hz (550-650), 1333hz (1000-1333), 2000-2333hz (1666-2333), 8000hz was already double).

Speech Commands Benchmark:
NO (405 records):
"no" = 44.69% (goal 50%). "n" or "oh" = 81.7% (goal 90%)
Error | "yes" = 0.99%

YES (419 records):
"ye"/"yeah" = 22.43% (goal 50%). "y", "e", or "air" = 67.5% (goal 90%)
Error | "no" = 2.63%

Time: 0-2ms each.

A lot of changes were made, especially in Consonant framing to tell the difference between "y" and "n" better. Small improvements in vowel detection.
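
The dynamic framing from this entry (256 samples up to the first plosive, 512 after) could be sketched roughly as below. The function name, the idea of passing the plosive position in as a sample index, and the short trailing frame are all assumptions. In the real code the plosive is presumably found while framing.

```c
#include <assert.h>
#include <stddef.h>

enum { FRAME_SHORT = 256, FRAME_LONG = 512 };

/* Split a syllable of `total` samples into frames.
 * `plosive_at` is the sample index of the first plosive (hypothetical
 * input; pass `total` if no plosive was found).
 * Fills starts[] and lens[] and returns the number of frames written. */
size_t split_frames(size_t total, size_t plosive_at,
                    size_t *starts, size_t *lens, size_t max_frames)
{
    assert(starts != NULL && lens != NULL);

    size_t pos = 0, count = 0;
    int plosive_seen = 0;

    while (pos < total && count < max_frames) {
        size_t want = plosive_seen ? FRAME_LONG : FRAME_SHORT;
        size_t len = (total - pos < want) ? total - pos : want;

        starts[count] = pos;
        lens[count] = len;
        count++;

        /* The frame containing the plosive is the last short one. */
        if (!plosive_seen && plosive_at < pos + len)
            plosive_seen = 1;
        pos += len;
    }
    return count;
}
```

For 2000 samples with a plosive at sample 300 this gives frames of 256, 256, 512, 512 and a final 464.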

Many of the problems in detecting more "yes" (the "y") are noise or volume related. Other issues are in vowel transition detection, and in finding the trailing "s" in "yes".

If "yes" improves to >40% I'll then test "on" & "off".

Sunday, 15 October 2023

Speech Recognition (benchmark of "yes"/"no"):

C version/Console:
Custom wave file loader for benchmarking
Benchmarking a target word
Noise Floor now based on RMS instead of Volume Peak.
Noise Floor Raise 'step' changed from fixed value to 2x current noise floor RMS.
Voice volume minimum changed to 5x current noise floor RMS (range: 3x - 6x).
Removed one-frame "click/pop" noise.
Combined Vol Peak Normalisation with FFT loading for a speed improvement.
Updated Vowel F1 & F2 transition analysis
Updated Vowel frequency group definitions. First three (281, 375, 468hz...).
Updated Equal Loudness
Updated Consonant identification
Other fixes
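
One possible reading of the RMS noise floor rules above, as a sketch only. The 5x voice minimum and the 2x step come from this entry. The struct, the function name, and the rule that the floor rises toward a non-voice frame by at most one step per frame are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define VOICE_FACTOR 5.0f   /* voice minimum = 5x noise floor RMS */
#define STEP_FACTOR  2.0f   /* raise step = 2x noise floor RMS */

typedef struct {
    float floor_rms;        /* current noise floor, must start above 0 */
} NoiseGate;

/* Feed one frame's RMS level. Returns 1 if the frame counts as voice.
 * Non-voice frames louder than the floor pull the floor up, limited to
 * one step per frame (this raise rule is an assumption). */
int gate_update(NoiseGate *g, float frame_rms)
{
    assert(g != NULL && g->floor_rms > 0.0f);

    if (frame_rms >= g->floor_rms * VOICE_FACTOR)
        return 1;

    if (frame_rms > g->floor_rms) {
        float step = STEP_FACTOR * g->floor_rms;
        float rise = frame_rms - g->floor_rms;
        g->floor_rms += (rise < step) ? rise : step;
    }
    return 0;
}
```

Because both the step and the voice minimum scale with the floor itself, the same settings behave the same way on quiet and loud recordings.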

Speech Commands Benchmark:
NO (405 records), before and after:
"no" = 9.14% before, 36.30% after (goal 50%)
"n" or "oh" = 36.5% before, 86.9% after (goal 90%)

Time: 0-1ms each.

Very tough benchmark. Many samples are fraught with noise (blips, click/pop, static, paper sounds). Detection changed a lot just from switching the noise floor from Peak to RMS, and changed a lot again after redesigning consonant identification.

Work on noise elimination is needed, and testing "yes".

Wednesday, 13 September 2023

Lipsync:

Creating a Unity script for Face, Blink, and Viseme morphs to use with Daz3D models.
Face blend shapes (happy, sad/lost, angry, shock), plus four others through combination.
Blink blend shapes. Auto-blink with ranged values. Min, max, stop after open/close (stare/sleep), blink once now, squint.
Viseme blend shapes. Emphasis modifier.

Speech Recognition:

Added Visemes to Speech Recognition engine.
Added/Updated Word Hashing and Word Search.

Working: Viseme lipsync "no" x 3, "yes" x 3. (simulated from speech recognition data)

Lipsync is already as good as, if not better than, current SOTA lip syncing.

The default viseme blendshapes from Daz3D work well. However, they are single frames and some of the phonemes need two (start and end). So for "oh", "ah" and "w" are used.

Both the timing and the two-frame "oh" in "no" boost quality by quite a lot.
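
The two-frame idea can be illustrated with a small lookup, written in C here to match the recognition engine rather than the Unity script. Only the "oh" = "ah" + "w" pairing comes from this post. The table layout, the viseme names as plain strings, the second entry, and the linear crossfade are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *phoneme;
    const char *start_viseme;
    const char *end_viseme;   /* same as start for single-frame phonemes */
} VisemePair;

static const VisemePair VISEME_TABLE[] = {
    { "oh", "ah", "w" },      /* pairing described in the post */
    { "n",  "n",  "n" },      /* illustrative single-frame entry */
};

/* Look up the start/end visemes for a phoneme, or NULL if unknown. */
const VisemePair *find_viseme(const char *phoneme)
{
    assert(phoneme != NULL);
    size_t count = sizeof VISEME_TABLE / sizeof VISEME_TABLE[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(VISEME_TABLE[i].phoneme, phoneme) == 0)
            return &VISEME_TABLE[i];
    }
    return NULL;
}

/* Weight of the end viseme at progress t through the phoneme (0..1).
 * The start viseme gets 1 - weight. Linear blend is an assumption. */
float end_weight(float t)
{
    if (t < 0.0f) return 0.0f;
    if (t > 1.0f) return 1.0f;
    return t;
}
```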

The audio used is from a few samples of the Google Speech Commands dataset.

"no" x 3, "yes" x 3.

Monday, 7 August 2023

Speech Recognition / Phoneme Extraction Tool:

Redesigned consonant plosive-find
Redesigned consonant identification
Updated consonant frequency definitions. Added 2000 (512, 1000, 1500, 2000, 3000, 4000, 5500, 8000)
Consonant viseme upper tone/cheek emphasis changed from boolean to variable.
Updated vowel frequency definitions. Added 2333, 2666, 3500.
Updated vowel identification
Added Vowel formant 2 focusing:
   1) F2 frequencies starting above 1000hz cannot transition to 1000hz, to avoid interference from nasal tone, and
   2) F2 frequencies where the last two values are the same can only transition up/down by one.
Improved Vowel F1 transition analysis to include power & average centre, and for flat transitions to re-alias centre frequency to a defined frequency group. This improves fine frequency transition detection, and for flat frequencies resilience to errors.
Updated Phoneme Emphasis
Updated equal-loudness
Updated GUI
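
The two F2 focusing rules in the list above can be written as a single check. This is a sketch under assumptions: the function name is hypothetical, F2 values are tracked as indices into a frequency group table, "by one" is read as one group index, and rule 1 is applied to exactly 1000hz as written.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if F2 may move to next_idx, 0 if the move is rejected.
 * groups_hz    : frequency group table in Hz (caller supplied)
 * start_idx    : group the F2 track started at
 * prev_prev_idx, prev_idx : the last two groups on the track */
int f2_transition_allowed(const int *groups_hz, int start_idx,
                          int prev_prev_idx, int prev_idx, int next_idx)
{
    assert(groups_hz != NULL);

    /* Rule 1: tracks that started above 1000hz cannot move to 1000hz. */
    if (groups_hz[start_idx] > 1000 && groups_hz[next_idx] == 1000)
        return 0;

    /* Rule 2: after two equal values, only a move of one group is allowed. */
    if (prev_prev_idx == prev_idx && abs(next_idx - prev_idx) > 1)
        return 0;

    return 1;
}
```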

Working:
"kay, kee, kai"
"no" x3 variations
"yes" x3 variations ("ye")

A lot of range in "no" and "yes" (and by extension all other vowels & consonants) is supported now, and I will be setting up a benchmark to test it more.

Auto Phoneme Emphasis depends upon the peak volume in a range of frames. There are three ranges (2 - 5 - 5 frames). Volume detection in the first two frames is important as many loud plosives (eg "k", "t", "p") exist there and need to be represented accurately. It also has to account for mouth-open, meaning emphasis travels high in advance. The emphasis values are all precise to the audio and don't use general curves, so if they're wrong they're wrong, but when they're right they're really right. An emphasis multiplier determines max emphasis.
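
A small sketch of the three-range peak idea. The 2 - 5 - 5 frame ranges come from the paragraph above. Treating the ranges as consecutive, scaling each peak by the multiplier, and capping at 1.0 are assumptions, as is the function name.

```c
#include <assert.h>
#include <stddef.h>

enum { EMPHASIS_RANGES = 3 };

/* Number of frames in each range, as described in the post. */
static const size_t RANGE_LEN[EMPHASIS_RANGES] = { 2, 5, 5 };

/* frame_peaks : peak volume per frame (0..1), n_frames entries
 * multiplier  : emphasis multiplier
 * out         : one emphasis value per range, capped at 1.0 */
void phoneme_emphasis(const float *frame_peaks, size_t n_frames,
                      float multiplier, float out[EMPHASIS_RANGES])
{
    assert(frame_peaks != NULL && out != NULL);

    size_t pos = 0;
    for (size_t r = 0; r < EMPHASIS_RANGES; r++) {
        float peak = 0.0f;
        for (size_t i = 0; i < RANGE_LEN[r] && pos < n_frames; i++, pos++) {
            if (frame_peaks[pos] > peak)
                peak = frame_peaks[pos];
        }
        float emphasis = peak * multiplier;
        out[r] = (emphasis > 1.0f) ? 1.0f : emphasis;
    }
}
```

A loud plosive in the first two frames sets the first value on its own, without being averaged away by the longer vowel ranges that follow.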

"Yes" and "No":

Monday, 10 July 2023

Speech Recognition / Phoneme Extraction Tool:

Updated equal-loudness
Reduced noise floor dynamic step up/down amount
Lowered initial volume pick up
Replaced Inverse Blackman window with a 2x amplified Hann window. Similar quality, large speed gain.
Improved consonant framing (plosive find)
Redesigned consonant identification
Added Vowel formant 1 Focusing. Frequencies starting at or above 656hz are now solely detected to transition at or under this amount (to avoid interference from strong nasal tone).
Reduced nasal tone volume (875-1000hz) by half

Working:
"kay, kee, kai"
"no" x3 variations

Saturday, 17 June 2023

Speech Recognition / Phoneme Extraction Tool:

Conversion of main processing code (including FFT) from C to C# - 100%.
GUI/code: Loudness Curve
GUI/code: Phoneme output
GUI
Updated frequency group definitions (312, 468, 562, 656, 750, 875, 1000, 1333, 1666, 2000, 2500, 3000, 4000, 5500, 8000)
Updated equal-loudness
New equal-loudness calibration for the first 16ms frame (+50%)
Removed last FIR filter @ 8000hz. It did improve frequencies (loudness) slightly but not enough to justify the processing time.
Combined Find Zero-Crossing function with Noise Find. Speed & quality improvement.
Replaced Hann window with a custom wide lobe cosine window/Inverse Blackman window.
Improved syllable finding
Updated Wav file loading
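
As an illustration of how the frequency group definitions listed above might be used, the sketch below snaps a measured frequency to the nearest group. The table values are the ones from this entry. The nearest-match rule and the function name are assumptions, not a description of the actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Frequency groups in Hz, as listed in this entry. */
static const int GROUPS_HZ[] = {
    312, 468, 562, 656, 750, 875, 1000, 1333,
    1666, 2000, 2500, 3000, 4000, 5500, 8000
};
#define N_GROUPS (sizeof GROUPS_HZ / sizeof GROUPS_HZ[0])

/* Index of the group closest to hz (ties go to the lower group). */
size_t nearest_group(int hz)
{
    assert(hz >= 0);
    size_t best = 0;
    int best_dist = abs(hz - GROUPS_HZ[0]);
    for (size_t i = 1; i < N_GROUPS; i++) {
        int dist = abs(hz - GROUPS_HZ[i]);
        if (dist < best_dist) {
            best = i;
            best_dist = dist;
        }
    }
    return best;
}
```

For example, 700hz is 44hz from the 656 group and 50hz from the 750 group, so it lands on 656.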

Satisfied now with the "Kay, Kee, Kai" test.

Next target is finding "yes" & "no". This is part of the Google Speech Commands (Keyword Spotting) benchmark.