Speech-to-Text (STT)
AI technology that converts spoken words into written text in real-time or from recordings.
What is Speech-to-Text (STT)?
Speech-to-Text (STT) is AI technology that converts spoken words into written text.
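It uses automatic speech recognition (ASR) models that analyze the audio signal, map its acoustic patterns to speech sounds (phonemes) or subword units, and predict the most likely words from the surrounding language context. Modern STT systems handle multiple speakers, varied accents, and background noise far better than older systems, although accuracy still drops on noisy or overlapping audio.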
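The step "predict words based on language context" can be made concrete with a toy sketch of classical noisy-channel decoding. Each candidate transcript gets an acoustic score plus a weighted language-model score, and the highest total wins. All probabilities below are made-up illustration values, and the bigram table is a hypothetical stand-in. End-to-end models such as Whisper learn this context inside a single network instead of using a separate table.

```python
import math

# Hypothetical acoustic scores, log P(audio | words). The two candidates sound
# almost identical, so the acoustic model barely separates them.
ACOUSTIC_LOGPROB = {
    "recognize speech": math.log(0.48),
    "wreck a nice beach": math.log(0.52),
}

# Hypothetical bigram language model, P(word | previous word).
BIGRAM_PROB = {
    ("<s>", "recognize"): 0.010,
    ("recognize", "speech"): 0.200,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.050,
    ("a", "nice"): 0.010,
    ("nice", "beach"): 0.020,
}
UNSEEN_PROB = 1e-6  # probability floor for bigrams missing from the table


def lm_logprob(sentence: str) -> float:
    """Sum of log bigram probabilities, starting from a sentence-start token."""
    words = ["<s>"] + sentence.split()
    return sum(
        math.log(BIGRAM_PROB.get((prev, cur), UNSEEN_PROB))
        for prev, cur in zip(words, words[1:])
    )


def decode(candidates, lm_weight: float = 1.0) -> str:
    """Return the candidate with the highest acoustic + weighted LM log score."""
    return max(
        candidates,
        key=lambda s: ACOUSTIC_LOGPROB[s] + lm_weight * lm_logprob(s),
    )


candidates = list(ACOUSTIC_LOGPROB)
print(decode(candidates))                  # language context on
print(decode(candidates, lm_weight=0.0))   # acoustic score only
```

With `lm_weight=0.0` the acoustically slightly stronger "wreck a nice beach" wins. With the language model switched on, "recognize speech" wins because it is a far more probable word sequence.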
Builders use STT to transcribe customer calls, create meeting notes, add voice commands to apps, or build voice-first products. Popular options include OpenAI's Whisper, Google Cloud Speech-to-Text, and Gladia.
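Here is a minimal sketch of the meeting-notes use case. It assumes the open-source `openai-whisper` Python package (`pip install openai-whisper`, with ffmpeg on the PATH), whose `whisper.load_model` and `model.transcribe` return a result with a `"segments"` list carrying `start`, `end`, and `text` for each segment. The helper names `to_timestamp`, `format_segments`, and `transcribe_to_notes` are hypothetical.

```python
def to_timestamp(seconds: float) -> str:
    """Format a segment start time as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(segments) -> str:
    """Turn Whisper-style segments (dicts with 'start' and 'text') into timestamped notes."""
    return "\n".join(
        f"[{to_timestamp(seg['start'])}] {seg['text'].strip()}" for seg in segments
    )


def transcribe_to_notes(audio_path: str, model_name: str = "base") -> str:
    """Transcribe an audio file with open-source Whisper and format it as notes."""
    import whisper  # assumption: openai-whisper is installed; imported lazily

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return format_segments(result["segments"])
```

The Whisper import sits inside `transcribe_to_notes`, so the formatting helper can be used and tested without downloading a model.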
Pricing ranges from free (Whisper's open-source model can be self-hosted with no licensing fee, though you pay for your own compute) to pay-per-minute cloud APIs, most of which charge roughly $0.006-0.024 per minute of audio.
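A back-of-the-envelope cost estimate can be built from the per-minute range above. The call volume and call length in the example are hypothetical, and real bills can differ because of free tiers, volume discounts, or per-request rounding, so treat this as a rough sketch. `Decimal` is used to avoid floating-point rounding in money amounts.

```python
from decimal import Decimal, ROUND_HALF_UP

LOW_RATE = Decimal("0.006")   # USD per audio minute, low end of the quoted range
HIGH_RATE = Decimal("0.024")  # USD per audio minute, high end of the quoted range


def monthly_cost(calls_per_month: int, avg_minutes_per_call: Decimal,
                 rate_per_minute: Decimal) -> Decimal:
    """Estimated monthly STT spend in USD, rounded to cents."""
    total_minutes = calls_per_month * avg_minutes_per_call
    return (total_minutes * rate_per_minute).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )


# Hypothetical workload: 500 customer calls a month, 6 minutes each.
print(monthly_cost(500, Decimal("6"), LOW_RATE))
print(monthly_cost(500, Decimal("6"), HIGH_RATE))
```

For 500 six-minute calls (3,000 minutes), that works out to $18.00 at the low end and $72.00 at the high end.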
Related Terms
Text-to-Speech (TTS): Technology that converts written text into spoken audio using AI-generated voices that sound increasingly human-like.
Prompt Engineering: The practice of crafting specific instructions to get better outputs from AI models like ChatGPT, Claude, or Gemini.
Voice Cloning: AI technology that creates a digital replica of someone's voice from audio samples, capturing their unique speech patterns and tone.
ElevenLabs: AI voice platform that generates ultra-realistic speech from text, clones voices, and dubs content into 29+ languages.
Whisper: OpenAI's open-source speech-to-text model that converts audio to text with high accuracy across 99 languages.