Friday, 26 November 2021

Speech Recognition:

I changed the IIR filters to resonators centered around 600 Hz, 1250 Hz, and 3150 Hz, and now have double the signal-to-noise ratio with more stable numbers.

This amplifies the signal for certain types of sounds, but for this to work I feel like I need about 10 filters, centered around different frequencies.

The signal is only 12.5%-25% above the background noise, and you need to speak close to the mic, so the SNR needs to improve by at least another factor of two to be usable.



Wednesday, 24 November 2021

Speech Recognition:

Update on audio speech recognition.

Traditional speech recognition uses transforms (FFT/DCT/DTT - the Fast Fourier Transform and its relatives) to decode audio into voice phonemes. These capture three voice formants (frequency ranges specific to a phoneme) from one 'signature'. However, these use nested loops and are slow to process. DTT is the fastest, but I want to try it another way.

Most spoken phonemes combine a range of different frequency areas to make the sound - bass/warmness, middle range, high range. E.g. "oh" is mostly bass, "ee" is bass-middle, "ss" is high.

The way I want to try is to separate the common frequency ranges first, then measure the power & complexity afterwards to tell if one range is loud/complex versus the others.

Separating the frequency ranges (band-passing) can be done in real time in just a few instructions, using pre-calculated IIR filters (http://www.schwietering.com/jayduino/filtuino/index.php). FIR filters are better quality but slow.

There is 20ms between recorded audio frames to process the data, so I'm aiming to get both phoneme and NLP processing out of the way in 1-10ms, using the same thread as the one capturing the data.

This is some captured data for the word "wikipedia". The asterisks (*) represent good power & complexity levels versus background noise.



Monday, 8 November 2021

NLP / WIC Benchmark:

Still working on this. I recently completed the concept for sorting English words into new grammar categories.

There are:

Four main groups for nouns, verbs/actions, and adjectives/modifiers. The groups are: Moving/living things, Analytical/laws/concepts, Logical subparts/binary actions, Light sense/stories/beautiful terms.

Eight other groups for the remainder: Numeric Quantifier, Scale Quantifier, Person/Agent, Question/Interrogative, Time-spatial, Direction-spatial, Conjunction/sentence breakers, Exclamation/greet/grunt.

The twelve groups are also encoded with present/future/past and precise/optimistic/explaining. All Present things are Precise, all Future things are Optimistic, all Past things are Explaining.

Finally, each word must be in one category only.

The concept is speed over quality... but as there's nothing for virtual experiences/high-speed scenarios between hard-coded dialogue trees and GPT-3, this will sit right in between.

There are 3500 words to individually convert over, so it will take until the new year as I'm also looking at speech recognition software.

Speech Recognition:

Speech recognition software today uses algorithms/n-grams/NNs, is really slow (1-3 seconds response time) and uses a lot of power... The NLP takes roughly 0.1ms to process a sentence, so if the speech recognition is fast then it's suitable for virtual environments even at a lower quality.

Combining the NLP with speech recognition is as simple as writing phonemes next to each word in the dictionary. If the user is speaking via voice, then the NLP's word-to-token search can be skipped, as the word is already found.

One of the benefits, if it works, is that a responding chatbot can completely vary its response time to suit the situation, including interrupting the user, which adds another layer of humanness.

Speech Recognition (benchmark of "yes"/"no"): C version/Console: Redesigned Volume Normalisation to improve clarity abov...