Document Search Assistant

A chatbot that allows you to ask questions about a document

Scenario: We want to build a pipeline that answers questions about a document in a chat-like format. In this case, we have questions about Apple's annual report (an 80+ page document with various charts) and want them answered quickly. We anticipate questions about other large documents as well, so we decide to build a pipeline to automate this workflow.

At a high level, we need to create the following pipeline components:

  1. A way to feed files into the pipeline and embed them into a semantic search database.

  2. A way for the LLM to receive 1) relevant context retrieved from the semantic search database, 2) stored conversation history, and 3) the user's question

  3. A large language model instructed to be an analyst that answers questions based on a document
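
Before building this in the UI, it can help to see the same flow in plain code. The sketch below is a minimal, illustrative Python equivalent, assuming `sentence-transformers` for embeddings and the OpenAI Python client; the no-code builder wires together the equivalent pieces for you without any code.

```python
# Illustrative sketch only: the pipeline builder handles the equivalent
# embedding, retrieval, and prompting steps for you.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Embed document chunks into an in-memory semantic search "database".
chunks = ["Apple's net sales ...", "Gross margin ...", "Risk factors ..."]
chunk_vectors = embedder.encode(chunks)  # one vector per chunk

def retrieve(question: str, k: int = 2) -> list[str]:
    """2. Return the k chunks most semantically similar to the question."""
    q = embedder.encode([question])[0]
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question: str, history: str) -> str:
    """3. Ask the LLM, grounded in retrieved context and prior turns."""
    context = "\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an analyst chatbot "
             "specializing in answering a Question given Conversation "
             "History and Context."},
            {"role": "user", "content": f"Conversation History: {history}\n"
             f"Context: {context}\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```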

Step 1 - Open the Pipeline Builder

Within the "Pipelines" tab, click "+ New" >> "Create pipeline".

Step 2 - Feeding the File into a Semantic Search database

We need to create a structure that allows the pipeline to accept a file, feed the file into a semantic search database, and allow for relevant information to be queried from the semantic search database.

  1. We use a file data loader (under the "data loader" tab). We click "upload file" to upload the file we wish to embed in the semantic database. Note that uploaded files will appear under your storage tab and can be used in other pipelines as well.

    1. Note: this can also be accomplished through the "files" sub-tab by clicking "+New" >> "Upload Files".

  2. Connect the File node to the semantic database node.

    • File node: ensure you check "Process Files into Text", as the semantic database only accepts text.

    • Semantic search node (within the knowledge base tab): this loads the file into a temporary semantic database and allows queries to return relevant context that an LLM can leverage.

We connect the file input node to the file loader, which is then connected to the "documents" edge of the semantic search node.
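
For intuition, this node chain roughly corresponds to the sketch below, where `pypdf` stands in (an assumption on our part) for the "Process Files into Text" option and a naive fixed-size splitter stands in for the builder's own chunking.

```python
# Rough equivalent of the File -> File Loader -> Semantic Search chain.
from pypdf import PdfReader

def load_chunks(path: str, chunk_size: int = 1000) -> list[str]:
    """Extract text from a PDF and split it into fixed-size chunks."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# These chunks are what the semantic search node embeds, and what it
# searches when a question arrives on its "query" edge.
documents = load_chunks("apple_annual_report.pdf")  # hypothetical filename
```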

Step 3 - Rest of pipeline: LLM, chat memory, prompt

We need the following nodes to complete the pipeline:

  1. We will use OpenAI's GPT-3.5 Turbo for this use case. The OpenAI node can be found in the LLMs tab in the no-code builder. This node has two fields, "System" and "Prompt". The "System" field allows you to tell the LLM how to behave. The "Prompt" field allows you to input the prompt for the LLM. Within either, you can use double curly braces "{{}}" to create variables. Whatever you place within the braces will automatically appear on the left-hand side of the block (as an edge). Hence, the data connected to each named edge will "replace" the corresponding curly braces when the pipeline runs.

    1. Prompt field within the OpenAI node: for the pipeline to answer questions about the document in a chat-like format, the LLM needs 1) the conversation history, 2) the context provided by the semantic database, and 3) the user's question. Thus, we label each piece of data (e.g., "Context") and then create a variable (e.g., "{{Context}}") for each of the three pieces of information the LLM needs. See the template sketch after this list for reference.

    2. System field within the OpenAI node: here, we explain how the LLM should behave. The crux of the system prompt is "You are an analyst chatbot specializing in answering a Question given Conversation History and Context". Note that we reuse the labels from the prompt field in the system prompt (maintaining the same spelling and capitalization).

  2. The Chat Memory node (found in the "Chat" tab) allows the LLM to "remember" the previous conversation. We connect it to the OpenAI node (to the created "Conversation History" edge) so that the previous conversation is also passed to the LLM.

  3. The Input node (found in the "General" tab) allows the user to ask questions. We need to connect it both to the semantic search node (the "query" edge) and to the OpenAI node (the created "Question" edge). This is because:

    • Passing the question into the semantic search node allows the semantic search to retrieve relevant information, whose output is passed to the LLM, and

    • We need to pass the question into the prompt so the LLM knows what question to answer.

  4. Finally, we connect an output node (found in the "General" tab) to the "response" edge of the LLM node.
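
To make the variable mechanics concrete, here is a small sketch of how the two fields fit together. The template text is our own illustration; only the double-curly-brace substitution mirrors the builder's documented behavior.

```python
# Illustrative "System" and "Prompt" fields. Each {{...}} variable surfaces
# as an edge on the OpenAI node, and incoming data replaces it at run time.
system = (
    "You are an analyst chatbot specializing in answering a Question "
    "given Conversation History and Context."
)
prompt = (
    "Conversation History: {{Conversation History}}\n"
    "Context: {{Context}}\n"
    "Question: {{Question}}"
)

def fill(template: str, values: dict[str, str]) -> str:
    """Mimic the builder's substitution of {{Name}} with the edge's data."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", value)
    return template

filled = fill(prompt, {
    "Conversation History": "(empty on the first turn)",
    "Context": "(passages retrieved from the report)",
    "Question": "What were Apple's net sales this year?",
})
```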

Step 4 - Run and Share

There are four ways to deploy this pipeline. You can:

  1. Run within pipeline builder

  2. Run as a form

  3. Generate an API call

  4. Use this pipeline as the "backend" for a chatbot

Run within the pipeline builder

Click "Run" in the top right of the pipeline builder. In this case, you can directly ask a question in the input box and click "Run" to run the pipeline.

Run as a form

Access the pipeline as a form. In the "Pipelines" tab, select the pipeline you just created and click "Run". Ask questions directly in the input box.

Generate an API call

Click the three dots next to the pipeline and select "Generate API Call". You can find your API key under "Settings" (by clicking on your profile on the top right).
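
The generated call is specific to your pipeline and account. As a rough, hypothetical shape (the URL, headers, and field names below are placeholders, not the product's real endpoint), a pipeline API call is an authenticated HTTP POST:

```python
# Hypothetical sketch; copy the real call from "Generate API Call".
import requests

response = requests.post(
    "https://example.com/api/pipelines/<pipeline-id>/run",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # key from "Settings"
    json={"input": "What were Apple's net sales this year?"},
)
print(response.json())
```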

Backend for a chatbot

Access the pipeline in chatbot format. Click the "Chatbot" tab and click "Add". In the popup, in the "pipeline" field, select the pipeline you just created. After creating the chatbot, click "Run" to start it. The chatbot can also be accessed via API (click your profile on the top right, then "Settings", to get your API key).

Finally, to publish your pipeline to the marketplace, go back to the “Pipelines” tab and find the pipeline you would like to publish. Then, click the three dots on the right-hand side and click “publish”.

Additionally, if you would like to make a new document search pipeline for a new file, you can duplicate the pipeline (by clicking the three dots on the right-hand side of the pipeline and clicking "Duplicate pipeline") and switch out the file within the pipeline.
