Ready to make your content searchable? Click the + New Knowledge Base button in the top-right corner of the Knowledge Base listing page.
You can be up and running in under a minute — just give it a name and click Create. Or, if you want more control over how your documents are processed and searched, configure advanced settings and metadata before creating.
Naming your knowledge base
- Name: Choose a clear, descriptive name so your team can easily find this knowledge base later. It appears in listings, search interfaces, and anywhere else it’s referenced.
VectorShift auto-fills a default name (e.g., “New Knowledge Base 1”). You can rename it at any time.
Click Create to get started with default settings, or explore the Advanced Settings and Configure Metadata tabs to customize behavior upfront.
Advanced settings
These settings shape how your documents are processed, chunked, and searched. They’re divided into two groups: permanent settings that are locked after creation (because they affect the underlying data structure), and default settings you can adjust anytime.
Permanent settings
These are locked once you create the knowledge base, so choose carefully.
Embedding model (required)
The embedding model determines how well your search understands the meaning behind queries. It converts document chunks into vectors — and since all your data is embedded with this model, changing it later would require re-processing everything.
The default is openai/text-embedding-3-small.
Models are available from OpenAI, VoyageAI, Cohere, and Google, grouped by provider in the dropdown.
The default works well for most use cases. If you’re working with multilingual content or specialized domains, explore the VoyageAI and Cohere options for potentially better results.
Advanced document analysis (beta)
Get richer, more accurate search results by enabling advanced document analysis. This uses advanced techniques to understand document structure — especially helpful for complex documents like PDFs with tables, headers, and mixed layouts. When enabled, VectorShift generates a short summary and a long summary for each indexed item, which improve metadata extraction accuracy and search relevance.
This feature uses LLM calls to generate summaries for each indexed item, which will incur additional charges.
Hybrid search
Help your users find exactly what they’re looking for, even when specific terms matter. Hybrid search combines semantic (meaning-based) search with keyword matching — so searches for product names, error codes, or policy numbers return the right results even when the exact wording matters.
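For intuition, hybrid scoring can be pictured as a weighted blend of a semantic similarity score and a keyword-overlap score. This is an illustrative model only, not VectorShift’s actual ranking (which is internal); keyword_score, hybrid_score, and alpha are hypothetical names:

```python
# Illustrative sketch of hybrid search scoring. Real systems use a
# tuned keyword algorithm (e.g., BM25) and a learned fusion; this toy
# version just blends two scores with a weight alpha.

def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query words that appear verbatim in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def hybrid_score(semantic: float, query: str, chunk: str, alpha: float = 0.5) -> float:
    """Weighted sum of a semantic score (0..1) and keyword overlap."""
    return alpha * semantic + (1 - alpha) * keyword_score(query, chunk)

# Exact terms like an error code boost the match even when the
# semantic score alone is middling.
print(hybrid_score(0.8, "error code E404", "Troubleshooting error code E404"))  # → 0.9
```

A purely semantic search might rank a paraphrase above the chunk containing the literal code; the keyword component pulls exact-term matches back up.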
Default settings
These apply to all new documents but can be adjusted later — either globally in Settings or per individual document. Start with sensible defaults and fine-tune as you learn what works best for your content.
Chunk size
Controls how much content goes into each searchable piece (measured in tokens). The default is 400.
- Smaller chunks (200–300) → more precise answers for fact-based questions like “What is our refund policy?”
- Larger chunks (500–800) → better for questions that need surrounding context like “Summarize the Q3 report findings”
Chunk overlap
Prevents important information from being lost when a document is split across chunk boundaries. The default is 0 (no overlap). Increase this if you notice search results missing context that spans two chunks. Must be less than the chunk size.
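The interaction between chunk size and overlap can be sketched as a sliding window, here using whitespace-separated words as a stand-in for real tokenizer tokens (chunk_tokens is a hypothetical helper, not a VectorShift API):

```python
# Sketch of fixed-size chunking with overlap: each chunk starts
# (size - overlap) tokens after the previous one, so the last
# `overlap` tokens of one chunk repeat at the start of the next.

def chunk_tokens(tokens, size=400, overlap=0):
    if overlap >= size:
        raise ValueError("overlap must be less than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(10)]
# size=4, overlap=1: chunks share one token across each boundary.
print(chunk_tokens(tokens, size=4, overlap=1))
```

With overlap 0 the chunks tile the document edge to edge; any sentence straddling a boundary is split, which is exactly the failure mode overlap guards against.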
Splitter method (required)
Choose the method that best matches your document structure for the most relevant search results:
| Method | Best for |
|---|---|
| Sentence | Unstructured text like emails, transcripts, or plain-text docs |
| Markdown | Documents with clear heading structure — splits respect headings, paragraphs, and lists |
| Dynamic | Mixed or varied formats — automatically adapts its splitting strategy to the content |
Not sure? Start with Dynamic — it handles most document types well. Switch to Markdown if your docs are well-structured, or Sentence for raw text.
Code files (Python, JavaScript, TypeScript, Go, Rust, SQL, YAML, Dockerfiles, and 100+ other formats) are automatically split along meaningful boundaries like functions and classes, regardless of the splitter you choose here.
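For a rough idea of the Sentence method, it behaves roughly like splitting on sentence-final punctuation; the real splitter is more sophisticated, and split_sentences here is purely illustrative:

```python
import re

# Toy sentence splitter: break after ., !, or ? followed by
# whitespace. Real splitters also handle abbreviations, quotes,
# decimals, etc.

def split_sentences(text: str):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("Refunds take 5 days. Contact support! Any questions?"))
```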
Processing model (required)
The processing model determines how accurately text is extracted from your files. Choosing the right one for your content type can significantly improve search quality:
| Model | Best for |
|---|---|
| Default | General purpose text extraction |
| Llama Parse | Structured documents with complex layouts |
| Textract | AWS-powered extraction, good for forms and tables |
| Docling | Document understanding with layout awareness |
| Mistral OCR | Scanned documents and images with text |
| Contextual AI | Context-aware document processing |
| Reducto | High-fidelity document parsing with layout understanding |
| Unstructured | Flexible extraction for a wide range of unstructured document types |
Generate chunk-level metadata
Automatically tag each chunk with metadata during indexing so your users can filter search results by specific attributes — helpful for narrowing results by category, date, or document type.
Max rows per chunk (tabular files)
Keep spreadsheet search results focused by limiting how many rows from CSVs or Excel files go into a single chunk. Without this, large spreadsheets can produce oversized chunks that dilute search relevance.
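The row cap can be pictured with this sketch, which splits a CSV into groups of at most max_rows data rows and repeats the header in each group so every chunk stands alone (chunk_csv is a hypothetical helper, not VectorShift’s implementation):

```python
import csv
import io

# Sketch: cap the number of data rows per chunk, carrying the header
# into every chunk so each one is interpretable on its own.

def chunk_csv(text: str, max_rows: int):
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return [[header] + data[i:i + max_rows] for i in range(0, len(data), max_rows)]

sample = "sku,price\nA,1\nB,2\nC,3\n"
print(chunk_csv(sample, max_rows=2))  # two chunks, each starting with the header row
```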
Apify key
If you plan to scrape URLs and want to use your own Apify account instead of VectorShift’s built-in scraping, enter your API key here. This is optional.
Configuring metadata
Save hours of manual tagging by letting VectorShift automatically extract structured metadata from every document. Set up fields like category, author, date, or any custom property — and the AI labels each document for you as it’s indexed.
Click the Configure Metadata tab in the creation dialog to define your schema.
Builder mode
Define your metadata schema visually — no JSON required. Just add fields, pick types, and describe what the AI should extract.
Auto-generated metadata schema properties
Define what metadata fields the LLM should extract from each document:
| Column | What to enter |
|---|---|
| Property Name (required) | The name of the metadata field (e.g., “category”, “author”, “document_type”) |
| Type | The data type of the field (see supported types below) |
| Description | A description to guide the LLM on what to extract |
Click + Add Property to add more fields.
Supported property types:
| Type | Description |
|---|---|
| String | Text values. Allows additional configuration for Pattern (regex validation) and Format. |
| Number | Numeric values. Allows setting constraints (see below). |
| Boolean | True/false values. |
| Object | Nested structured data. |
| Array | A list of values. |
| Enum | A predefined set of allowed values. |
String type options
When you select String as the type, two additional fields appear:
- Pattern: A regex pattern to validate extracted values (e.g., ^\d{4}-\d{2}-\d{2}$ for date strings).
- Format: A predefined format constraint. Available formats: DateTime, Date, Time, Duration, Email, Hostname, IPv4, IPv6, UUID.
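For example, the date pattern shown above behaves like this Python check (illustrative only; VectorShift applies the pattern on its side, and matches is a hypothetical name):

```python
import re

# Validate an extracted value against the example date pattern
# ^\d{4}-\d{2}-\d{2}$ (four digits, dash, two digits, dash, two digits).
DATE_PATTERN = r"^\d{4}-\d{2}-\d{2}$"

def matches(value: str) -> bool:
    return re.fullmatch(DATE_PATTERN, value) is not None

print(matches("2024-09-30"))     # True
print(matches("Sept 30, 2024"))  # False
```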
Number type constraints
When you select Number as the type, constraint fields appear:
| Constraint | What it does |
|---|---|
| Minimum | The lowest allowed value (inclusive) |
| Maximum | The highest allowed value (inclusive) |
| Exclusive Minimum | The lowest allowed value (exclusive, meaning the value must be greater than this) |
| Exclusive Maximum | The highest allowed value (exclusive, meaning the value must be less than this) |
| Multiple Of | The value must be a multiple of this number |
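Taken together, the constraints in the table behave like this validation sketch (check_number is a hypothetical helper that mirrors the table’s semantics, not a VectorShift function):

```python
# Mirror of the Number constraints: inclusive min/max, exclusive
# min/max, and multiple-of. Returns True when the value passes all
# constraints that were supplied.

def check_number(value, minimum=None, maximum=None,
                 exclusive_minimum=None, exclusive_maximum=None,
                 multiple_of=None):
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    if exclusive_minimum is not None and value <= exclusive_minimum:
        return False
    if exclusive_maximum is not None and value >= exclusive_maximum:
        return False
    if multiple_of is not None and value % multiple_of != 0:
        return False
    return True

print(check_number(10, minimum=0, maximum=100, multiple_of=5))  # True
print(check_number(5, exclusive_minimum=5))                     # False: must be > 5
```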
Advanced configuration
Fine-tune how the AI extracts metadata to get more accurate and consistent results:
- LLM provider: Choose the AI provider (e.g., OpenAI).
- Model: Pick the specific model (e.g., gpt-4.1-mini).
- Schema description: Give the AI additional context about your intent — what kinds of documents it will see and what the metadata is for.
- Extraction instructions: Guide the AI’s behavior with specific rules (e.g., “Always extract dates in ISO 8601 format” or “If the author is not explicitly stated, leave the field empty”).
- Query instructions: Tell the search system how to apply metadata filters when users search — this ensures filters work the way your users expect.
Context configuration
Give the AI more surrounding context to improve metadata accuracy — especially useful when a document’s meaning depends on its relationship to other documents.
These options require Advanced Document Analysis to be enabled in the knowledge base’s permanent settings.
Document item context
Choose how much of each document the AI sees when extracting metadata:
| Option | What it provides |
|---|---|
| Short Summary | A brief overview — fast and cost-effective |
| Long Summary | A detailed summary — better accuracy for nuanced content |
| Full Document | The complete content — most accurate but uses more tokens |
You can select one or more of these options.
Sibling context
Enable this when related documents in the same folder share context. For example, if a folder contains multiple chapters of a report, the AI can use the other chapters to better understand and tag each one.
Parent context
Enable this when your folder structure carries meaning. For example, documents inside a “Legal” folder can automatically be recognized as legal documents, improving tagging accuracy.
JSON mode
Click the JSON tab to switch to a read-only JSON view of your metadata schema. This shows the generated JSON schema based on the properties you defined in the Builder. You can copy the JSON from here for use in API calls or external tools, but edits must be made through the Builder tab.
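As an illustration only (the field names are made up and the exact generated output may differ), a Builder schema with a category string and a priority number could serialize along these lines:

```python
import json

# Hypothetical example of the kind of JSON Schema the Builder might
# generate from two properties; not VectorShift's exact output format.
schema = {
    "type": "object",
    "properties": {
        "category": {
            "type": "string",
            "description": "High-level topic of the document",
        },
        "priority": {
            "type": "number",
            "minimum": 1,
            "maximum": 5,
            "description": "Urgency from 1 (low) to 5 (high)",
        },
    },
}

print(json.dumps(schema, indent=2))
```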
Click Save to save your metadata configuration, or Cancel to discard changes.
You do not need to configure metadata during creation. You can always set it up later using the Configure Metadata button on the knowledge base detail page.