Real-time ASR Dev Blog

Wednesday, 24 November 2021

Speech Recognition:

Update on audio speech recognition.

Traditional speech recognition uses FFT/DCT/DTT Fast Fourier Transform's to decode audio into voice phonemes. These capture 3 voice formants (frequency ranges specific to a phoneme) from one 'signature'. However these use nested-loops and are slow to process. DTT is the fastest but I want to try it another way.

Most spoken phonemes have a range of different frequency areas combined to make the sound - bass/warmness, middle range, high range. EG. "oh" is mostly bass. "ee" bass-middle. "ss" high.

The way I want to try is separating common frequency ranges first initially, then measure the power & complexity afterwards to tell if one range is loud/complex versus the others.

Separating the frequency ranges (band-passing) can be done in real time using just a few instructions, using pre-calculated IIR filters (http://www.schwietering.com/jayduino/filtuino/index.php). FIR filters are better quality but slow.

There is 20ms inbetween recorded audio frames to process the data, so I'm aiming to get both phoneme and NLP processing out the way in 1-10ms. Using the same thread as the one capturing the data.

This is some captured data for the word "wikipedia". The asterisks (*) represent good power & complexity levels versus background noise.