Image to Text
Leverage the Multimodal image processing capabilities of OpenAI's GPT4 Vision model
Last updated
Leverage the Multimodal image processing capabilities of OpenAI's GPT4 Vision model
Last updated
The ImageToText node has system and prompt inputs similar to the OpenAI LLM and an additional image input. The ImageToText node can accept images loaded from files as input as well as images generated by an Image Generation node. You can also load PDFs to be processed as images.
The following example pipeline generates an image using DALLE-3 and then uses the vision model to generate a description.
PDF file inputs will be converted into images. For example, we can create a pipeline to analyze the pdf of a research paper
Within the "System" prompt, provide instructions on how the model should behave. Within the "Prompt", you have the option to pass additional information to the "Prompt", such as a user message from an input node.
The ImageToText node is useful for the following applications
image classification
pdf processing
table extraction from images
For information on prompting the model and its limitations see the OpenAI documentation
The OpenAI vision model is billed based on input and output token usage similar to the LLM nodes. In addition, each image input adds cost in terms of input tokens.