Skip to main content
The Speech to Text node converts audio input into text using AI transcription models. Use it to transcribe meeting recordings, earnings call audio, client voicemails, or any audio content — enabling downstream text processing like summarization, search indexing, or compliance review.

Core Functionality

  • Transcribe audio files to text using AI speech recognition models
  • Support multiple providers including OpenAI (Whisper) and Deepgram
  • Select specialized model tiers for different accuracy and speed tradeoffs
  • Use personal API keys for dedicated access

Tool Inputs

  • Provider * — (Enum (Dropdown), default: OpenAI) Select the transcription provider (OpenAI, Deepgram)
  • Model * — (Enum (Dropdown), default: whisper-1) Select the speech-to-text model
  • Audio * — (Audio) The audio file to transcribe
  • Tier — (Enum (Dropdown), default: general) Deepgram-specific: select the transcription tier. Only visible when provider is Deepgram
  • Use Personal API Key — (Boolean, default: No) Toggle to use your own API key
  • Api Key — (String) Your API key. Only visible when Use Personal API Key is enabled
* indicates a required field

Tool Outputs

  • text — (String) The transcribed text from the audio

Overview

The Speech to Text tool in agents allows the AI to transcribe audio files shared during conversations. The agent can process voice messages, meeting recordings, or any audio content and return the transcribed text.

Use Cases

  • Meeting transcription — Users share meeting recordings and the agent provides full transcriptions for note-taking.
  • Voicemail processing — Transcribe client voicemails for documentation and follow-up tracking.
  • Audio content search — Convert audio to text to enable search across audio archives.
  • Compliance recording review — Transcribe compliance-relevant recordings for review and documentation.

How It Works

  1. Add the tool to your agent. In the agent builder, click Add Tool and select Speech to Text from the available tools.
Agent tool panel showing Transcribe Speech (Speech to Text) tool in the tool list
  1. Configure input fields. Each field can either be filled automatically by the agent based on conversation context, or locked to a fixed value:
    • Provider — Select the transcription provider (e.g., OpenAI)
    • Model — Choose the model (e.g., whisper-1)
    • Audio — The agent uses audio files shared in the conversation
Transcribe Speech tool configuration showing fields with sparkle icon to toggle between dynamic and static values
  1. Write the Tool Description. Describe what the tool does so the agent knows when to use it.
  2. Set Auto Run behavior. Choose: Auto Run, Require User Approval, or Let Agent Decide.
  3. Test the tool. Share an audio file with the agent and ask it to transcribe.

Settings

SettingTypeDefaultDescription
ProviderDropdownOpenAIThe transcription provider.
ModelDropdownwhisper-1The speech-to-text model.
Use Personal API KeyBooleanNoUse your own API key.

Best Practices

  • Use OpenAI Whisper for general transcription. It provides strong accuracy across languages and accents.
  • Select Deepgram for specialized needs. Deepgram offers different tiers optimized for specific use cases.
  • Chain with an LLM node. Connect the text output to an LLM node for automatic summarization of transcribed content.

Earnings Call Insight and Sentiment Analyzer

Analyzes earnings call transcripts for sentiment, key themes, and forward-looking signals.

Common Issues

For troubleshooting common issues with this node, see the Common Issues documentation.