Core concepts behind splitting and segmenting text
["Text", "splitting", "is", "useful."]
["Text splitting is useful.", "It helps AI process information."]
["Text splitting is useful.\nIt helps AI process information.", "Another key concept is chunk overlap."]
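The three lists above read like the output of word-level, sentence-level, and paragraph-level segmentation. A minimal Python sketch that reproduces them is shown here; the input strings and the regular expression are assumptions chosen for illustration, not a definitive splitter.

```python
import re

# Assumed sample inputs, chosen so the results match the three lists above.
sentence = "Text splitting is useful."
two_sentences = "Text splitting is useful. It helps AI process information."
two_paragraphs = (
    "Text splitting is useful.\nIt helps AI process information."
    "\n\nAnother key concept is chunk overlap."
)

# Word segmentation: split on runs of whitespace.
words = sentence.split()

# Sentence segmentation: split on whitespace that follows ., ! or ?
# (a naive rule that mishandles abbreviations such as "e.g.").
sentences = re.split(r"(?<=[.!?])\s+", two_sentences)

# Paragraph segmentation: split on blank lines.
paragraphs = two_paragraphs.split("\n\n")

print(words)
print(sentences)
print(paragraphs)
```

Coarser units keep more context per piece, while finer units make it easier to hit a target size.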
Chunk Size and Chunk Overlap settings.
Let’s walk through a simple example using Sentence Segmentation and aiming for a Chunk Size of around 15 words, with a Chunk Overlap of 5 words.
Original Text:
“Text splitting breaks down large documents. This makes them easier for AI to understand. AI models often have input limits. Therefore, chunking is essential for analysis. Proper splitting maintains context.”
1. Segmentation (by Sentence):
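A minimal Python sketch of this walkthrough: it splits the original text into sentences, then packs whole sentences into chunks of at most 15 words, carrying the last 5 words of each chunk into the next one. The helper names `split_sentences` and `chunk_sentences`, the greedy packing rule, and the word-level overlap are assumptions made for illustration; real splitters may count tokens or characters and may overlap by whole sentences instead.

```python
import re

TEXT = (
    "Text splitting breaks down large documents. "
    "This makes them easier for AI to understand. "
    "AI models often have input limits. "
    "Therefore, chunking is essential for analysis. "
    "Proper splitting maintains context."
)


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def chunk_sentences(sentences, chunk_size=15, overlap=5):
    """Greedily pack whole sentences into chunks of at most chunk_size words.

    Each new chunk starts with the last `overlap` words of the previous one.
    A single sentence longer than the remaining budget is kept whole, so a
    chunk can exceed chunk_size in that case.
    """
    chunks = []
    current = []  # words in the chunk being built, including carried overlap
    fresh = 0     # words in `current` that are not carried-over overlap
    for sentence in sentences:
        words = sentence.split()
        if fresh and len(current) + len(words) > chunk_size:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
            fresh = 0
        current.extend(words)
        fresh += len(words)
    if fresh:
        chunks.append(" ".join(current))
    return chunks


sentences = split_sentences(TEXT)
chunks = chunk_sentences(sentences, chunk_size=15, overlap=5)
for i, chunk in enumerate(chunks, start=1):
    print(i, len(chunk.split()), chunk)
```

Under these assumptions the five sentences become three chunks of 14, 11, and 15 words, and each chunk after the first begins with the last five words of the chunk before it.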
# Headers, * Lists, code blocks, tables). It tries to keep these structural elements intact and often starts new chunks at major headers. It might segment by headers, then code blocks, then tables, then paragraphs, and finally sentences if needed to meet size constraints.
# Top Level Header would likely signal the start of a new chunk.
Chunk Size while minimizing awkward breaks.
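The structure-aware behaviour described above can be sketched in Python as follows. This is a simplified illustration, not any particular library's implementation: the function name `split_markdown_by_headers` is hypothetical, it only handles the first level of the hierarchy (top-level headers), and it keeps fenced code blocks intact by ignoring header-like lines inside them. Falling back to tables, paragraphs, and sentences to meet a size limit is left out.

```python
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def split_markdown_by_headers(markdown):
    """Start a new chunk at each top-level header found outside a code fence."""
    chunks = []
    current = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            # Entering or leaving a fenced code block.
            in_fence = not in_fence
        elif not in_fence and line.startswith("# ") and current:
            # A top-level header closes the chunk collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


doc = "\n".join([
    "# Intro",
    "Some text.",
    FENCE + "python",
    "# not a header, just a comment",
    "print('hi')",
    FENCE,
    "# Usage",
    "* item one",
    "* item two",
])

md_chunks = split_markdown_by_headers(doc)
for chunk in md_chunks:
    print(chunk)
    print("-----")
```

The comment line inside the fenced block starts with `#` but does not trigger a split, so the code block stays in one piece with the section that introduces it.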