The Image to Text node uses AI to analyze images and generate text descriptions, summaries, or structured data. Use it to extract information from financial charts, read text from document scans, describe visual content, or convert image-based data into machine-readable formats — for example, transcribing handwritten notes from meeting whiteboards, extracting data from screenshot tables, or describing chart trends from financial reports.

Core Functionality

  • Analyze images and generate text descriptions using vision-capable AI models
  • Support multiple providers including OpenAI, Anthropic, Google, and xAI
  • Process system instructions and prompts to guide image analysis
  • Return structured JSON output with optional schema enforcement
  • Stream responses for real-time output
  • Track token usage per run

Tool Inputs

  • Provider * — (Enum (Dropdown), default: OpenAI) Select the model provider (OpenAI, Anthropic, Google, xAI)
  • Model * — (Enum (Dropdown), default: chatgpt-4o-latest) Select the vision model. Options vary by provider
  • System (Instructions) — (String) Instructions guiding how the model should analyze the image
  • Prompt — (String) Specific instructions for what to analyze in the image
  • Image * — (Image) The image to analyze
  • Use Personal API Key — (Boolean, default: No) Toggle to use your own API key
  • Api Key — (String) Your API key. Only visible when Use Personal API Key is enabled
  • JSON Schema — (String) JSON schema for structured output. Only visible when JSON Response is enabled
* indicates a required field
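
To make the required and conditional fields concrete, here is a minimal Python sketch that models the inputs above and checks them before a run. The class name `ImageToTextInputs`, the snake_case field names, and the `validate` helper are hypothetical and not part of the product. The rule that an API key must be present once Use Personal API Key is enabled is an assumption drawn from the field descriptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Providers listed in the Provider dropdown.
PROVIDERS = ("OpenAI", "Anthropic", "Google", "xAI")


@dataclass
class ImageToTextInputs:
    """Hypothetical model of the node's inputs (names are illustrative)."""
    image: bytes                        # Image * (required)
    provider: str = "OpenAI"            # Provider * (documented default)
    model: str = "chatgpt-4o-latest"    # Model * (documented default)
    system: Optional[str] = None        # System (Instructions)
    prompt: Optional[str] = None        # Prompt
    use_personal_api_key: bool = False  # Use Personal API Key (default: No)
    api_key: Optional[str] = None       # Api Key
    json_schema: Optional[str] = None   # JSON Schema


def validate(inputs: ImageToTextInputs) -> List[str]:
    """Return a list of problems; an empty list means the inputs look complete."""
    errors: List[str] = []
    if inputs.provider not in PROVIDERS:
        errors.append(f"unknown provider: {inputs.provider}")
    if not inputs.model:
        errors.append("model is required")
    if not inputs.image:
        errors.append("image is required")
    # Assumption: enabling the toggle without supplying a key is an error.
    if inputs.use_personal_api_key and not inputs.api_key:
        errors.append("api_key is required when use_personal_api_key is enabled")
    return errors
```

Calling `validate(ImageToTextInputs(image=b"..."))` with non-empty image bytes returns an empty list, because the provider and model fall back to the documented defaults.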

Tool Outputs

  • text — (String (or Stream<String> when streaming)) The generated text analysis of the image
  • tokens_used — (Integer) Total number of tokens consumed
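
Because text is a plain string in one mode and a stream of string chunks in the other, downstream code may need to handle both shapes. The sketch below shows one way to do that in Python. The function `collect_text` and its `on_chunk` callback are hypothetical names, and the stream is assumed to behave like an ordinary iterable of strings.

```python
from typing import Callable, Iterable, Optional, Union


def collect_text(
    text: Union[str, Iterable[str]],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Return the full text whether the node streamed it or not.

    on_chunk, if given, is called with each piece as it arrives,
    which is useful for showing partial output in real time.
    """
    if isinstance(text, str):
        # Non-streaming case: the whole result arrives as one piece.
        if on_chunk is not None:
            on_chunk(text)
        return text

    parts = []
    for chunk in text:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)
```

For example, `collect_text(stream, on_chunk=print)` prints each chunk as it arrives and still returns the complete description at the end.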

Overview

When added to an agent as a tool, Image to Text lets the agent analyze images shared during a conversation. The agent can interpret an image automatically from the conversation context or follow the analysis instructions you configure.

Use Cases

  • Financial chart interpretation — Users share charts and the agent describes trends, key data points, and anomalies.
  • Document scanning — Extract text content from photographed or scanned financial documents.
  • Receipt processing — Analyze expense receipts to extract amounts, vendors, and dates.
  • Visual compliance checks — Review marketing materials or document images for compliance issues.

How It Works

  1. Add the tool to your agent. In the agent builder, click Add Tool and select Image to Text from the available tools.
  2. Configure input fields. Each field can either be filled automatically by the agent based on conversation context, or locked to a fixed value:
    • Provider — Select the vision model provider
    • Model — Choose the vision model
    • System (Instructions) — Set analysis instructions
    • Prompt — The agent fills this based on the user’s request
    • Image — The agent uses images shared in the conversation
  3. Write the Tool Description. Describe what the tool does so the agent knows when to use it. For example: “Use this tool to analyze the content of images. Describe what you see in detail.”
  4. Set Auto Run behavior. Choose: Auto Run, Require User Approval, or Let Agent Decide.
  5. Test the tool. Share an image with the agent and ask it to analyze the content.
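
The distinction in step 2 between agent-filled and locked fields can be pictured as a simple merge in which locked values always win. The Python sketch below is illustrative only: `resolve_inputs` and the dictionary keys are hypothetical, and the agent builder's internal behavior may differ.

```python
from typing import Any, Dict


def resolve_inputs(locked: Dict[str, Any], agent_filled: Dict[str, Any]) -> Dict[str, Any]:
    """Combine fixed values with values the agent inferred from the conversation.

    A field that is locked keeps its fixed value even if the agent
    proposes something else; every other field uses the agent's value.
    """
    resolved = dict(agent_filled)
    resolved.update(locked)  # locked values override agent-supplied ones
    return resolved


# Example: provider, model, and system instructions are locked,
# while the agent supplies the prompt and the image reference.
locked = {
    "provider": "OpenAI",
    "model": "chatgpt-4o-latest",
    "system": "Describe financial charts precisely.",
}
agent_filled = {
    "prompt": "Summarize the revenue trend in this chart.",
    "image": "<image shared in the conversation>",
    "model": "some-other-model",  # ignored because model is locked
}
resolved = resolve_inputs(locked, agent_filled)
```

Locking the provider, model, and system instructions keeps runs consistent, while leaving the prompt and image to the agent lets it adapt to each request.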

Settings

| Setting | Type | Default | Description |
|---|---|---|---|
| Provider | Dropdown | OpenAI | The vision model provider. |
| Model | Dropdown | chatgpt-4o-latest | The vision model. |
| Max Tokens | Integer | 128000 | Maximum output tokens. |
| Temperature | Float | 0.7 | Controls response creativity. |
| Top P | Float | 0.9 | Controls token sampling diversity. |
| JSON Response | Boolean | Off | Return structured JSON output. |
| Stream Response | Boolean | Off | Stream the response. |

Best Practices

  • Write specific analysis prompts. Instead of “describe this image,” use “extract all numerical data from this financial chart including axis labels, data points, and trends.”
  • Choose the right provider for your task. GPT-4o models excel at detailed image analysis; Claude models are strong at document interpretation.
  • Use JSON mode for data extraction. When extracting structured data from images, enable JSON Response with a schema.
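
As an illustration of the JSON mode recommendation, the Python sketch below builds a schema for the receipt-processing use case and serializes it to a string that could be pasted into the JSON Schema field. The field names (`vendor`, `date`, `amount`, `currency`) and the `parse_receipt` helper are assumptions made for this example, and the exact schema dialect the node accepts is not specified here.

```python
import json

# Hypothetical schema for receipt extraction (amount, vendor, date).
receipt_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "date": {"type": "string", "description": "ISO 8601 date, e.g. 2024-01-31"},
        "amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "date", "amount"],
    "additionalProperties": False,
}

# The JSON Schema input is a String, so serialize the schema before use.
schema_string = json.dumps(receipt_schema, indent=2)


def parse_receipt(text: str) -> dict:
    """Parse the node's text output and check that required fields are present.

    This is a lightweight presence check, not full JSON Schema validation.
    """
    data = json.loads(text)
    missing = [key for key in receipt_schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data
```

Checking the parsed output in downstream steps is a cheap safeguard, even with schema enforcement enabled, because it surfaces malformed results early.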

Document Classification Agent

Automatically categorizes and tags incoming documents based on content and type.

Contract AI Analyst

Analyzes contracts to extract key terms, flag risks, and summarize obligations.

Validation Agent

Validates data and documents against predefined rules, schemas, or compliance standards.

Term Sheet Agent

Generates and reviews term sheets by extracting and validating key deal terms.

Common Issues

For troubleshooting common issues with this node, see the Common Issues documentation.