Large Language Models (LLMs)

Large Language Models (LLMs) are AI models trained on large corpora of data that can generate text, images, videos, and more. In this section, we discuss LLMs that generate text.

Through careful prompting, LLMs can accomplish a wide variety of tasks. The VectorShift platform is LLM agnostic, meaning you can choose which model to use in your workflows (OpenAI, Anthropic, Google, Mistral, Llama, etc.) and pick the model and prompt best suited for your application.

The AI landscape changes quickly, and new models are regularly released by various research labs. The VectorShift team adds new LLMs as soon as they are released, and within pipelines you can swap between LLM models and providers with ease.

At its core, the LLM node needs at least a "System" prompt (instructions on how the LLM should behave) and a "Prompt" (where data is provided). Within either, you can create variables by using double curly braces ("{{}}"). When the pipeline runs, whatever is connected to a variable "replaces" that variable. Finally, the "output" edge of the LLM node provides the LLM's response.
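
As a rough mental model, variable substitution works like template filling: each "{{...}}" placeholder is swapped for the value connected to it. The sketch below is purely illustrative (the variable names and inputs are made up) and is not VectorShift's internal implementation:

```python
# Conceptual sketch: how "{{...}}" variables in a prompt are filled in at run time.
import re

prompt_template = (
    "Answer the question using the context.\n"
    "User Question: {{User_Question}}\n"
    "Context: {{Context}}"
)
connected_inputs = {
    "User_Question": "What are your support hours?",
    "Context": "Support is available 9am-5pm ET, Monday through Friday.",
}

# Replace each {{variable}} with whatever is "connected" to it.
filled_prompt = re.sub(r"\{\{(\w+)\}\}", lambda m: connected_inputs[m.group(1)], prompt_template)
print(filled_prompt)
```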

QuickStart

You can add an LLM to a pipeline in seconds using the VectorShift platform. Let's create a simple application designed to answer a user's questions.

Here, we have selected the OpenAI LLM node, chosen the model gpt-3.5-turbo (from the drop-down menu on the LLM node), and attached an input, output, and text node (to hold the system prompt) to the LLM. Based on the system instructions and the question, the LLM writes an answer and outputs it from the pipeline.

To see how you can easily create and run a chatbot using an LLM, see the Chatbot tutorial.

How to use an LLM?

To use an LLM, you must do the following things:

  1. System prompt: instruct how the LLM should behave, either in the system prompt within the LLM node or in a text box connected to the "System" input edge. Reference the data sources used in the Prompt within the System prompt (e.g., "Answer the User Question using the Context").

  2. Prompt: define variables using double curly braces (e.g., "{{User_Question}}", "{{Context}}") and label the data sources within the prompt ("User Question", "Context"). See the example below.

  3. Connections: connect the relevant data sources to the input edges (e.g., the input node to "User Question", the output of a knowledge base to "Context") and connect the LLM's output edge to the relevant downstream nodes (e.g., to the output node).
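
For example, a question-answering setup might use the following (the variable names here are illustrative):

System: You are a support assistant. Answer the User Question using only the provided Context. If the answer is not in the Context, say that you cannot answer.

Prompt:
User Question: {{User_Question}}
Context: {{Context}}

Here, the input node would be connected to "User_Question" and the output of a knowledge base to "Context".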

LLM Settings

System and Prompt

Some models (e.g., OpenAI) are trained to take two inputs: a "System" prompt that contains instructions for the model to follow, and a "Prompt" input that carries the various data sources (e.g., the user message, context, data sources, etc.). Other models (e.g., Gemini) have a single prompt where you place both the instructions and the data sources.

Token Limits

Each model has a maximum number of input and output tokens it supports. To adjust the limit for a particular model, alter the max tokens parameter. Note: you cannot increase max tokens beyond the maximum supported by a particular model. This setting is found in the gear on the LLM node.

Streaming

To stream output, click the gear on the LLM node and check "Stream Response".

Citations

You can display the sources the LLM uses by checking "Show Sources" in the gear on the LLM node.

JSON Response

To have the model return structured JSON output rather than plain text, check the "Json output" box in the gear on the LLM node.
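
In practice, JSON output works best when the prompt also spells out the structure you expect; for example, you might add to the system prompt: "Return a JSON object with the keys 'answer' and 'confidence'." (The key names here are only illustrative.)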

Temperature

Temperature controls the diversity of LLM generations. You can adjust the temperature setting for your models: to get more diverse or creative generations, increase the temperature; to get more deterministic responses, decrease it. This setting is found in the gear on the LLM node.

Top P

The Top P parameter constrains which tokens the LLM samples from at each generation step: only the smallest set of tokens whose cumulative probability reaches Top P is considered. For more diverse responses, increase Top P toward its maximum value of 1.0. This setting is found in the gear on the LLM node.
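
To build intuition for what these two settings do, the sketch below shows how temperature reshapes a token probability distribution and how Top P then trims it to the smallest set of tokens whose cumulative probability reaches the threshold. The numbers are made up, and this is not VectorShift or provider code:

```python
# Illustrative sketch of temperature scaling and top-p (nucleus) filtering.
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized candidate set

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
probs = softmax_with_temperature(logits, temperature=0.7)
print(top_p_filter(probs, top_p=0.9))
```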

For more details, see Understanding LLM Parameters.

Add Memory to LLMs

While LLMs do not hold an internal state, many applications (e.g., chatbots, search) require interacting with previous messages. To accomplish this, use the "Chat Memory" node and pass the chat history into the Prompt of the LLM to give it the relevant historical context.
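
For example, a chatbot prompt might include an extra variable for the history alongside the new message (variable names illustrative):

Conversation so far: {{Chat_History}}
New message: {{User_Question}}

with the Chat Memory node connected to "Chat_History" and the input node connected to "User_Question".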

LLM Providers

OpenAI

The following models are available from OpenAI:

  • gpt-3.5-turbo : The original model powering ChatGPT offers good performance at high speed and low cost

  • gpt-3.5-turbo-16k : The gpt-3.5 model trained to support longer inputs

  • gpt-3.5-turbo-instruct : A model trained to follow instructions; may be better for some applications

  • gpt-4 : The largest and most capable model from OpenAI

  • gpt-4-32k : gpt-4 trained to support longer inputs

  • gpt-4-turbo : Also referred to as gpt-4-1106-preview, this is the latest model from OpenAI; it is faster and cheaper than gpt-4 and supports a long context length

To use an OpenAI model, drag the OpenAI LLM node into the pipeline builder and select the model that you want to use.

Anthropic

We provide access to models trained by Anthropic.

  • claude-v2 : an advanced model trained by Anthropic that supports 100k input tokens

  • claude-instant : a fast and low priced model

  • claude-v2.1 : model with 200k input tokens

  • claude-3 : Anthropic's latest model

Meta

Llama2 models trained by Meta are available.

  • llama2-13b : 13 billion parameter model

  • llama2-70b : Larger 70 billion parameter model

  • llama2-chat-13b : Model trained for chat applications

  • llama2-chat-70b : Larger model trained for chat applications

Note: These models are fully open-sourced by Meta; anyone is free to download and run them on their own hardware.

Cohere

  • command : Cohere's flagship text generation model

AWS

  • titan-text-express : Small model trained by AWS

  • titan-text : AWS model with multilingual support

Open Source

We offer integrations with the following Open Source models trained by Mistral AI.

  • mistralai/Mistral-7B-v0.1 : Mistral AI's base 7B parameter model

  • mistralai/Mistral-7B-Instruct-v0.1 : Base model instruction tuned for improved performance

  • mistralai/Mistral-7B-Instruct-v0.2 : Instruction tuned model with context length extended to 32k

  • mistralai/Mixtral-8x7B-v0.1 : Base Mixture of Experts model

  • mistralai/Mixtral-8x7B-Instruct-v0.1 : Instruction tuned mixture of experts model

Note: These models are fully open-sourced by Mistral AI; anyone is free to download and run them on their own hardware.

Google

The following models are available from Google:

  • gemini-pro : Google's most advanced publicly available model

  • text-bison : Model based on Google's PaLM model

  • text-bison-32k : Text model extended to 32K context length

  • text-unicorn : Powerful model based on Google's largest PaLM model

Model Comparison

LLMs have different characteristics based on their training and hosting. Generally, larger models (e.g., GPT-4 Turbo, Claude 3) are more expensive and slower but have stronger reasoning capabilities, while smaller models (e.g., GPT-3.5 Turbo) are less expensive and faster but have weaker reasoning capabilities.

Want to use another open-source LLM or provider? Request additional models in our Discord server.

Model Pricing

Model usage is billed based on the number of tokens you use, counting both the tokens in the model input and the tokens generated in the model output. One token is approximately equal to 4 characters.
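
As a rough worked example: at about 4 characters per token, a 2,000-character prompt is roughly 500 input tokens; if the model then generates a 400-token answer, the request uses about 900 tokens in total. At the gpt-4 rate of $0.06 per 1,000 tokens, that single request would cost roughly $0.054.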

| Provider | Model | Price ($ per 1,000 tokens) |
| --- | --- | --- |
| OpenAI | gpt-3.5-turbo | 0.002 |
| OpenAI | gpt-3.5-turbo-16k | 0.002 |
| OpenAI | gpt-3.5-turbo-instruct | 0.002 |
| OpenAI | gpt-4 | 0.06 |
| OpenAI | gpt-4-32k | 0.12 |
| OpenAI | gpt-4-turbo | 0.03 |
| Anthropic | claude-v2 | 0.024 |
| Anthropic | claude-instant | 0.0024 |
| Anthropic | claude-v2.1 | 0.024 |
| Meta | llama2-13b | 0.001 |
| Meta | llama2-70b | 0.00256 |
| Meta | llama2-chat-13b | 0.001 |
| Meta | llama2-chat-70b | 0.00256 |
| Cohere | command | 0.002 |
| AWS | titan-text-express | 0.0004 |
| AWS | titan-text-lite | 0.0016 |
| Open Source | mistralai/Mistral-7B-v0.1 | 0.0002 |
| Open Source | mistralai/Mistral-7B-Instruct-v0.1 | 0.0002 |
| Open Source | mistralai/Mistral-7B-Instruct-v0.2 | 0.0002 |
| Open Source | mistralai/Mixtral-8x7B-v0.1 | 0.0006 |
| Open Source | mistralai/Mixtral-8x7B-Instruct-v0.1 | 0.0006 |
| Google | gemini-pro | 0.0005 |
| Google | text-bison | 0.0005 |
| Google | text-bison-32k | 0.0005 |
| Google | text-unicorn | 0.0075 |

Custom LLMs

Want to connect to a specialized model provider or a locally hosted LLM? Use the custom LLM node.

We support sending requests to models that are compatible with the OpenAI chat API format. You can use models from your own accounts with LLM providers such as TogetherAI and Replicate. The Custom LLM node requires the following parameters:

  • model

  • api key

  • base url

For example, using Together as the model provider, the base_url will be "https://api.together.xyz", the API key will be the key found on your account, and the model can be any of the models available from that provider.
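
If you want to sanity-check your credentials before filling in the Custom LLM node, any OpenAI-compatible client will do. The sketch below uses the official openai Python package; the base URL, API key, and model name are placeholders, and the exact path (for example, whether a trailing /v1 is required) varies by provider:

```python
# Smoke test for an OpenAI-compatible provider endpoint.
# base_url, api_key, and model are placeholders -- substitute your provider's values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # example: TogetherAI's OpenAI-compatible endpoint
    api_key="YOUR_PROVIDER_API_KEY",
)

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # any chat model your provider serves
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
)
print(response.choices[0].message.content)
```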

Local Models

Models hosted locally on your computer are good for prototyping, experimenting with new models, and saving costs. You can access your local models by setting up a connection to a locally running LLM server.

Make sure to find a secure way to forward your locally running server's port to the internet.

LM Studio

Follow the instructions to start a local LM Studio server.

The standard API key for LM Studio is lm-studio.

Ollama

Start a local LLM server using the Ollama CLI.
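
Once a local server is running, it typically exposes the same OpenAI-compatible chat endpoint, so you can test it the same way before pointing the Custom LLM node at it. In the sketch below, the port (1234 is LM Studio's usual default; Ollama commonly serves on 11434) and the model name are assumptions, so check your server's console output for the actual values:

```python
# Minimal local smoke test against an OpenAI-compatible server on localhost.
# Port, API key, and model name are assumptions -- check your server's output.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio default; for Ollama try http://localhost:11434/v1
    api_key="lm-studio",                  # LM Studio's standard key; other servers may accept any string
)

response = client.chat.completions.create(
    model="local-model",  # placeholder -- use the model name your local server reports
    messages=[{"role": "user", "content": "Reply with a single word: ready."}],
)
print(response.choices[0].message.content)
```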

Prompt Engineering Guidelines

Be as specific as possible: if the output should be one sentence, or if the output should be in the first person, include those instructions in the text block connected to the system prompt. Within the system prompt, you can also specify things like:

  • The tone you want the model to use (e.g., Respond in a professional manner).

  • Reference data sources provided in the prompt and how the model should use them (e.g., Use datasource X when the question is related to sales; use datasource Y when the question is related to customer support).

  • Provide specific information related to your company / situation that the model can reference (e.g., a Calendly link).

  • Specific text that you want the model to output in certain situations (e.g., if you are unable to answer the question, respond with "I am unable to answer the question").

  • What type of reasoning to use (here, it is important to think through step by step how you would actually perform the action, and then encode this into the system prompt).
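
Putting several of these guidelines together, a system prompt for a hypothetical support assistant might read: "You are a customer support assistant for Acme Inc. Respond in a professional tone and keep answers to at most three sentences. Answer the User Question using only the provided Context; if the answer is not in the Context, respond with 'I am unable to answer the question.' If the user asks to schedule a call, share this link: <your scheduling link>." (The company name and wording are only examples.)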

FAQs

  • What is the best model for my tasks?

    • The best model depends on your use case as well as cost/latency constraints. Make sure to evaluate the performance of the model on your task.

  • What should I do if I keep running into token limit errors?

    • Try reducing the amount of text passed to the LLM. You can use semantic search to feed only the most relevant input to the language model. See the Vector DB documentation and the Document Search Assistant tutorial.

  • Why won't the LLM follow my instructions?

    • Many language model applications require iterative development to find an effective prompt. Try stating your instructions in clear language. Additionally, you can try using a more powerful model (e.g., GPT-4 Turbo).

For more questions about developing applications with LLMs, check out the resources below or drop a question in our Discord server.

Further Reading
