Help me understand Voice Recognition tech
I am interested in getting an app that would allow me to make notes via voice-to-text. I work in a field with HIPAA protections. I’m having trouble figuring out the nuances of privacy related to these apps.
First off, is this kind of software considered “AI”? How does it even recognize that a sound equals a word? Do they use LLM tech? Does the tech learn to recognize my voice better over time? Does it use my recordings to learn to understand other’s voices? Is this all a black box? How can I take precautions such that no one except me hears the things I transcribe?
This is just such confusing tech! It seems like it’s fairly old and common but the more I think about it in relation to current age AI, the more creeped out I get! And yet my doctor uses one regularly… I’ll be asking her about it too, don’t worry.
Thank you!
It’s AI and your voice won’t be used for training if you use a local model.
Use Whisper STT. It runs on your computer, so nothing leaves your machine. You can pick the model size based on how powerful your computer is. The bigger the model, the better it will be at transcribing.
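For anyone wondering what picking a model size looks like in practice, here is a rough Python sketch. It assumes the open-source openai-whisper package and ffmpeg are installed. The memory figures are the approximate ones from the project README, and pick_model_size and transcribe_locally are made-up helper names, not part of Whisper. The model weights are downloaded once on first use; after that, transcription runs on your own machine.

```python
# Approximate memory needs (in GB) for the open-source Whisper model sizes,
# based on the figures in the project's README; treat them as rough guides.
WHISPER_SIZES = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]


def pick_model_size(available_gb: float) -> str:
    """Return the largest model name that fits in the given memory budget.

    Hypothetical helper for illustration; not part of Whisper itself.
    """
    chosen = WHISPER_SIZES[0][0]  # fall back to the smallest model
    for name, needed_gb in WHISPER_SIZES:
        if needed_gb <= available_gb:
            chosen = name
    return chosen


def transcribe_locally(audio_path: str, available_gb: float) -> str:
    """Transcribe an audio file on this machine and return the text."""
    import whisper  # pip install openai-whisper (also needs ffmpeg)

    model = whisper.load_model(pick_model_size(available_gb))
    result = model.transcribe(audio_path)
    return result["text"]
```

Something like transcribe_locally("note.wav", available_gb=4) would then load the "small" model and return the transcript as a plain string.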
That sounds interesting. I was hoping for something that I could use on a mobile app. I’m not sure what “adapting the model size” means so this might be more complicated than I’m looking for.
I work in a field with HIPAA protections.
Definitely need to be careful then.
is this kind of software considered “AI”?
The best voice recognition is based on AI — yes.
Before AI, voice recognition existed but it was generally pretty shit and really struggled with accents, low quality microphones, background noise, and people saying things that don’t strictly make sense. E.g. if you say “We’ll burn that bridge when we get to it,” a good AI might replace “burn” with the word “cross”… it will at least have the capability to do that; whether or not it does depends on your settings. Is “accuracy” about what someone said or what someone actually meant? That’s configurable in the best systems.
Does the tech learn to recognize my voice better over time?
Old software did. These days systems work so well that would just add cost with zero benefit. Good speech recognition will understand your speech perfectly as long as your microphone is decent and “learning” wouldn’t help much with that one potential problem area.
Some speech systems do learn in order to recognise/identify people (for example, a voice assistant might use it to figure out who “me” is in a command like “remind me to get milk when I get to the shops”). And a good transcription service will recognise different people talking in a single recording, and provide an appropriately annotated transcript. That’s about the extent of “recognising” your voice; it doesn’t generally learn from you over time.
Is this all a black box?
Kinda yeah. The researchers paid a huge number of people in third world countries to compare recordings to transcriptions, and make a “correct / incorrect” judgement call. Then fed all of that, and a whole bunch of other things (it’s believed every YouTube video ever uploaded might have been involved…) into a very complex model.
Tweaks are made but it’s just too much data (OpenAI says they used 680,000 hours of audio) to fully get your head around all of it. A bit like trying to understand how the human brain recognises speech — we have a broad idea but don’t really know.
Does it use my recordings to learn to understand other’s voices? How can I take precautions such that no one except me hears the things I transcribe?
Check the privacy statement for the service. They might, for example, send your recordings to be assessed for accuracy by employees/subcontractors. AFAIK (not a lawyer) that would be a breach of HIPAA.
AFAIK some Apple speech recognition features are HIPAA compliant. Look that up to verify it, but in general iPhones and Macs have AI speech processing hardware on the device, allowing fully local processing… but not all features are done locally and in some cases they may transmit “anonymised” (useless if you speak someone’s name…) speech to employees/contractors to improve the software. That can be disabled in settings.
Amazon and OpenAI do everything in the cloud but have fully HIPAA compliant versions of their services (I assume those are not cheap…)
You could try open source models — I don’t know how good they are in practice.
I work for a Canadian EMR company and we deal with a couple of options for medical voice software. I know Dragon NaturallySpeaking has a medical offering that likely would meet any regulatory requirements. There are also some subscription-based ones, though I don’t know if there are US versions of them. If you google the medical options you should be able to find some.
There are offline solutions that never transmit the data
Futo voice input https://app.futo.org/fdroid/repo
I haven’t tried it, but I really like the “local only” approach they’ve been using in other apps.
I just wish their licenses were more open (even AGPL would be fine). Here’s the source code in case someone is interested.
My kid’s doctor had service to transcribe the visits. Patients may opt out verbally. This is all through the hospital, so presumably it is HIPAA compliant.
Instead of creating your own solution that complies with HIPAA, it is probably easier to use one that already exists.
Well that’s why I said I will be asking my doctor what she uses! And I likely won’t be transcribing anything professional, but I do still have my phone on me in those settings. It’s more about the fact that I don’t want my own personal notes to be automatically handed to an LLM and regurgitated out into the world without my knowledge. If it can recognize and transcribe my speech, what’s to stop it from using that to train an LLM, which in turn notoriously plagiarizes its training data?
There is a lot to unpack in your post and this will be very long, sorry about that:
First off, what you are requesting is called “Automatic Speech Recognition”, or ASR for short, and the fundamental idea behind it is to receive a speech signal and convert it into a workable format. Usually this workable format means text or prompted tasks. Whether this is AI or not depends largely on how broadly you define AI. I wouldn’t classify it as AI because, at its core, it’s just statistical analysis. But AI can help fix errors, more on that later.
Traditional ASR works on a Hidden Markov Model (HMM), a statistical model in which each step depends only on the state reached in the previous step, so it’s recognizing previously observed patterns. These patterns are taught to the model by a training process.
The generalized process works like this:
- Cut the audio signal into small frames and analyze them according to a set of features like tonality, voicedness, and formants. This process is called feature extraction. Create data vectors that contain information about the features of the raw signal.
- Load these features into a decoder. The decoder is an acoustic model that looks up the phonemes it recognized from the features in a dictionary and computes the most likely word in that dictionary. These results are retained and sequences of words are compared to the decoder’s language model. What it recognizes and how well it recognizes signals depends on its own dictionary and the language model used afterwards.
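The framing and feature extraction step can be sketched in a few lines of Python. This is a simplified illustration, not what any particular product does: the two features here (energy and zero-crossing rate, a crude voicedness cue) are stand-ins chosen because they need no libraries, whereas real systems typically use richer features such as MFCCs. The test tone and the frame sizes are assumptions.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Cut a list of samples into overlapping frames of frame_len samples."""
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def frame_features(frame):
    """Return a small feature vector: (mean energy, zero-crossing rate)."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (energy, crossings / (len(frame) - 1))


# Hypothetical input: one second of a 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(sample_rate)]

# 25 ms frames (200 samples) taken every 10 ms (80 samples).
frames = split_into_frames(signal, frame_len=200, hop=80)
vectors = [frame_features(f) for f in frames]
```

Each entry in vectors is the kind of data vector that would be handed to the decoder in the second step.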
Language models are essentially just presets that dictate what is accepted as a valid signal input. For an activation phrase, this would be a very simple grammar-based model that recognizes only the exact predefined token for the activation and rejects everything else. For general use, you can write a more adaptive grammar, or many different grammars at the same time, but you will still run into cases where the model rejects an input because it cannot find a grammar that matches the signal. This is called out-of-grammar (OOG) speech.
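As a rough illustration of grammar-based matching and out-of-grammar rejection, here is a small Python sketch. It cheats a little: it matches already-decoded text with regular expressions, whereas a real recognizer applies the grammar while decoding the audio. The task names and phrases in GRAMMARS are invented.

```python
import re

# Hypothetical grammars: each maps a task name to the phrases it accepts.
GRAMMARS = {
    "wake": re.compile(r"^hey assistant$"),
    "new_note": re.compile(r"^(start|create) (a )?(new )?note$"),
    "set_timer": re.compile(r"^set (a )?timer for (?P<minutes>\d+) minutes?$"),
}


def match_grammar(utterance):
    """Return (task, slots) for the first grammar that accepts the utterance.

    Returns None when no grammar matches, i.e. the input is out-of-grammar.
    """
    text = utterance.lower().strip()
    for task, pattern in GRAMMARS.items():
        m = pattern.match(text)
        if m:
            return task, m.groupdict()
    return None


print(match_grammar("Set a timer for 5 minutes"))     # ('set_timer', {'minutes': '5'})
print(match_grammar("please remind me about lunch"))  # None (OOG)
```

Nothing here needs more than a lookup over a handful of patterns, which is why this style of recognition is cheap enough to run offline on small devices.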
To reduce OOG errors, you can train a statistical language model (SLM), which is basically just a huge library of natural language data, so it doesn’t rely on fixed grammars. A large language model (LLM) is like a very advanced SLM with a ridiculous amount of training data and trained, contextual connections between subjects. It’s called large because it requires an insane amount of data to function on even a very basic level. You can easily mix grammar-based and SLM approaches, so that you only need to use the SLM when an input is not recognized.
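A very small statistical language model can be sketched as a bigram counter in Python. This only shows the SLM half of the picture: given two candidate transcriptions that sound alike, it prefers the one whose word pairs it has seen more often. The three-sentence corpus and the candidates are invented, and add-one smoothing is just one simple choice for handling unseen word pairs.

```python
import math
from collections import Counter

# Tiny invented corpus; a real SLM is trained on vastly more text.
CORPUS = [
    "please recognize speech",
    "we recognize speech offline",
    "the beach is nice",
]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_score(sentence, unigrams, bigrams):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    vocab = len(unigrams)
    words = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(words, words[1:])
    )


unigrams, bigrams = train_bigrams(CORPUS)
candidates = ["recognize speech", "wreck a nice beach"]
best = max(candidates, key=lambda c: log_score(c, unigrams, bigrams))
print(best)  # recognize speech
```

In a mixed setup, a scorer like this would only be consulted after the fixed grammars have rejected the input.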
Source: Writing programs that recognize speech inputs and do tasks based upon them, like what your doctor probably has, was my last job until I quit. Whether we used a grammar-based approach or an SLM approach was entirely up to the specific use case. Purely grammar-based is more privacy-friendly because the computational work required is easily managed by most smartphones or other small portable devices and can easily be done offline. SLM solutions were generally not portable to handheld devices without relying on a cloud service doing the recognition (or at least not if you wanted an acceptable speed of input processing).
tl;dr If you want just plain speech-to-text where the program just writes down what it thinks you said and does not do any error correction, then you can do that offline (the language model my workplace used was from Dragon). If you want your assistant to “understand” what you were trying to say, you will require AI of some form and they are not very privacy-friendly.
That’s fascinating! Really cool explanation, thank you.
It sounds like Dragon has gotten a couple of shoutouts. I haven’t heard of them before. I wouldn’t mind starting with some plain offline speech-to-text program. I suppose Samsung already has that feature built into the phone, but that leads us back to the privacy concerns.
Since this is for work, I would start by asking whoever does IT stuff. You really don’t want to be sending HIPAA data off to who knows where without permission.
it is not for work.
Okay, if it’s personal use for yourself or friends or family, then I don’t think HIPAA is a concern because you’re not a HIPAA Covered Entity (https://www.hhs.gov/hipaa/for-professionals/covered-entities/index.html). You should be able to use any of the recommendations here, or others you may find in app stores like Google Play or F-Droid.
Makes sense. I think my main concern is how I can be certain it doesn’t listen to stuff when I don’t let it.
Ever have that weird phenomenon when you’re discussing something and somehow what you were talking about is the first suggested search result?
At least on my phone, with Android 14, there is an indicator when an app is using the microphone or camera.