ASR / Speech Recognition

Use: Keyword Spotting. Phoneme Extraction. Lipsyncing.

Type: Non-prediction, dataless, procedural frequency & transition matching.

 

Features:

Pros: Offline, real-time, no data, no training, no learning. GDPR compliant. Tolerates mild accents, vague pronunciation, nasal tone, pitch and volume changes, and noise (static, click/pop, knock/tap).

Cons: Less accurate than trained models. No speaker identification. Requires a custom word dictionary.

Speed: 1-5 ms of processing per 80-368 ms syllable; 87%-98% of the time is spent waiting on the microphone buffer, equivalent to roughly 48-73x playback speed.

Requirements: Windows/Linux. MCU with a MEMS microphone.

Languages: C and C#.

 

Details:

Sample rate: 16 kHz

Noise: Automatic noise floor from RMS; upward steps capped below 2x.

Voice find: First sample exceeding 5x the noise-floor RMS.

Framing: One 16 ms frame plus up to eleven 32 ms frames; syllables span a minimum of 80 ms and a maximum of 368 ms.
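
A minimal C sketch of the noise-floor tracking and voice-find step described above. Only the 5x onset factor and the <2x step-up come from the notes; the function names, per-frame update strategy, and 16-bit sample type are assumptions.

    #include <math.h>
    #include <stddef.h>

    /* Noise floor: tracked from RMS, never allowed to step up by 2x or more. */
    static double update_noise_floor(double floor_rms, double frame_rms)
    {
        if (frame_rms < floor_rms)
            return frame_rms;                 /* follow the floor down freely */
        double capped = floor_rms * 2.0;      /* <2x step up per update       */
        return (frame_rms < capped) ? frame_rms : capped;
    }

    /* Voice find: index of the first sample exceeding 5x the noise-floor RMS,
       or -1 if the buffer stays below the threshold. */
    static long first_voice_sample(const short *samples, size_t count,
                                   double floor_rms)
    {
        double threshold = 5.0 * floor_rms;
        for (size_t i = 0; i < count; i++)
            if (fabs((double)samples[i]) > threshold)
                return (long)i;
        return -1;
    }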

FFT:

  • Modified KissFFT.
  • Frames: one 256-sample frame, up to eleven 512-sample frames.
  • Input: volume peak-normalised, with a 2x-amplified Hann window.
  • Output: low-resolution frequency groups (no MFCCs), with an equal-loudness modifier.
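
Assuming 16-bit frames of 256 or 512 samples (per the framing above), the pre-FFT input conditioning might look roughly like the sketch below; the helper name and buffer layout are illustrative, and only the peak normalisation, Hann window, and 2x gain come from the notes. The output buffer would then be passed to the modified KissFFT.

    #include <math.h>
    #include <stddef.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Peak-normalise the frame, then apply a Hann window with 2x gain. */
    static void prepare_fft_input(const short *samples, float *out, size_t count)
    {
        double peak = 1.0;                    /* avoid divide-by-zero on silence */
        for (size_t i = 0; i < count; i++) {
            double a = fabs((double)samples[i]);
            if (a > peak)
                peak = a;
        }
        for (size_t i = 0; i < count; i++) {
            double hann = 0.5 * (1.0 - cos(2.0 * M_PI * (double)i
                                           / (double)(count - 1)));
            out[i] = (float)((double)samples[i] / peak * hann * 2.0);
        }
    }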

Consonant & Vowel identification:

  • Consonants and vowels use separate frequency groups/spectrum resolutions, formant ranges, and identification logic.

Consonants: 

  • Very low resolution. 8 frequency groups.
  • Two frames only, identified by a plosive.
  • Plosives (sketched after this list):
    • Type one: initial kick (e.g. K/P/T)
    • Type two: 2x volume boost over the previous frame (e.g. N/M/L/Y/W)
  • Formant 0:
    • 0-1000 Hz boolean
  • Formant 1:
    • 1000-3000 Hz loudest frequency (1500 / 2000 / 3000)
  • Formant 2:
    • 4000 Hz boolean
  • Formant 3:
    • 5500-8000 Hz loudest frequency (5500 / 8000)
  • High tone identification (cheeks activated/emphasis in viseme). 0-255.
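
A sketch of the two plosive cues and the consonant formant-1 pick described above, using per-frame volume (RMS) and precomputed band energies. The enum names and the reuse of the 5x voice threshold for the "initial kick" are assumptions; the 2x boost and the 1500/2000/3000 Hz choices come from the notes.

    /* Plosive cues, checked across the consonant's two frames. */
    enum plosive { PLOSIVE_NONE = 0, PLOSIVE_TYPE_ONE, PLOSIVE_TYPE_TWO };

    static enum plosive classify_plosive(double prev_rms, double cur_rms,
                                         double noise_floor_rms)
    {
        /* Type one: initial kick, energy appearing out of near-silence (K/P/T). */
        if (prev_rms <= noise_floor_rms && cur_rms > 5.0 * noise_floor_rms)
            return PLOSIVE_TYPE_ONE;

        /* Type two: 2x volume boost over the previous frame (N/M/L/Y/W). */
        if (prev_rms > 0.0 && cur_rms >= 2.0 * prev_rms)
            return PLOSIVE_TYPE_TWO;

        return PLOSIVE_NONE;
    }

    /* Consonant formant 1: loudest of the coarse 1500 / 2000 / 3000 Hz groups. */
    static int consonant_formant1_hz(double e1500, double e2000, double e3000)
    {
        if (e1500 >= e2000 && e1500 >= e3000) return 1500;
        if (e2000 >= e3000) return 2000;
        return 3000;
    }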

Vowels:

  • Low resolution. 15 frequency groups.
  • Formant 0:
    • 0-281 Hz boolean
  • Formant 1:
    • 375-875 Hz min, max, transition direction (up / down / straight); see the sketch after this list.
  • Formant 2:
    • 1000-3000 Hz min, max, transition direction (up / down / straight).
  • High tone identification (cheeks activated/emphasis in viseme). Boolean.
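
A sketch of the min/max tracking and transition-direction decision used for the vowel formants above; the tolerance value is an assumption (roughly two bins of a 512-point FFT at 16 kHz).

    /* Running min/max of a formant's peak frequency over the syllable's frames. */
    struct formant_range { double min_hz, max_hz; };

    static void formant_range_update(struct formant_range *r, double hz)
    {
        if (hz < r->min_hz) r->min_hz = hz;
        if (hz > r->max_hz) r->max_hz = hz;
    }

    /* Transition direction: compare the formant at the start and end of the syllable. */
    enum transition { TRANS_STRAIGHT = 0, TRANS_UP, TRANS_DOWN };

    static enum transition formant_transition(double start_hz, double end_hz)
    {
        const double tolerance_hz = 62.5;     /* assumed: two 31.25 Hz FFT bins */
        if (end_hz > start_hz + tolerance_hz) return TRANS_UP;
        if (end_hz < start_hz - tolerance_hz) return TRANS_DOWN;
        return TRANS_STRAIGHT;
    }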

Word identification:

  • Two-phoneme syllables.
  • Vowel-swapping to common alternatives (sketched after this list).
  • Word short-listing (custom dictionary).
  • Word short-listing (dictionary of synonyms of expected words - NLP function).
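
A sketch of two-phoneme dictionary short-listing with vowel swapping, as listed above. The phoneme symbols, dictionary entries, and alternative-vowel table are purely illustrative, not the shipped dictionary.

    #include <string.h>

    /* A two-phoneme syllable is matched as (consonant symbol, vowel symbol). */
    struct entry { const char *consonant; const char *vowel; const char *word; };

    static const struct entry dictionary[] = {        /* custom word dictionary */
        { "Y", "EH", "yes" },
        { "N", "OH", "no"  },
    };

    /* Each row: a heard vowel followed by two common alternatives to try. */
    static const char *vowel_swaps[][3] = {
        { "AE", "EH", "IH" },
        { "AO", "OH", "UH" },
    };

    static const char *match_exact(const char *c, const char *v)
    {
        for (size_t i = 0; i < sizeof dictionary / sizeof dictionary[0]; i++)
            if (strcmp(dictionary[i].consonant, c) == 0 &&
                strcmp(dictionary[i].vowel, v) == 0)
                return dictionary[i].word;
        return NULL;
    }

    static const char *match_syllable(const char *c, const char *v)
    {
        const char *word = match_exact(c, v);
        if (word)
            return word;
        /* Vowel swapping: retry with the alternatives for the heard vowel. */
        for (size_t i = 0; i < sizeof vowel_swaps / sizeof vowel_swaps[0]; i++) {
            if (strcmp(vowel_swaps[i][0], v) != 0)
                continue;
            for (size_t j = 1; j < 3; j++)
                if ((word = match_exact(c, vowel_swaps[i][j])) != NULL)
                    return word;
        }
        return NULL;
    }

For example, a heard "Y" + "AE" syllable misses the exact entries but falls through to "yes" via the "EH" alternative.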

NLP functions: Synonyms, homophones, sentence start, missing word, intention/direction, chatbot.


Applications:

Realtime virtual character lipsyncing (network- or internet-based clients):

  • Enables client apps to compute lipsync data quickly enough to send it with the microphone audio chunk in a single network packet, without adding delay.
  • Enables a speed-of-sound, distance-based lipsync audio delay. Example: at 50 virtual metres, upon receiving each client audio chunk, the lipsync animation may play immediately while the audio is delayed by about 145 milliseconds. This works with multiple people standing at multiple distances because all clients receive the lipsync and audio data in the same packet.
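
A sketch of the distance-based delay computation; the ~344 m/s speed of sound is an assumed constant, chosen so that 50 m gives the roughly 145 ms quoted in the example above.

    #include <stdio.h>

    /* Delay the audio (not the lipsync animation) by the time sound would
       take to travel the virtual distance. */
    static int audio_delay_ms(double distance_m)
    {
        const double speed_of_sound_mps = 344.0;                /* assumed */
        return (int)(distance_m / speed_of_sound_mps * 1000.0 + 0.5);
    }

    int main(void)
    {
        printf("delay at 50 m: %d ms\n", audio_delay_ms(50.0)); /* prints 145 */
        return 0;
    }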

Realtime subtitles in any language:

  • Enables each client to translate the received language symbols into another language (NLP function).

Realtime chatbot/companion:

  • Enables realistic interruption (excitement/shock) of prerecorded NPC audio in response to the user's voice. Example: "yes" or "no" responses from the user can trigger a shock-interruption from the NPC for improved immersion, affinity, and storytelling.


Speech Recognition (benchmark of "yes"/"no"): C version/Console: Redesigned Volume Normalisation to improve clarity abov...