Once you’ve loaded documents, you’ll often want to transform them to better suit your application. The simplest example is splitting a long document into smaller chunks that fit into your model’s context window.
Working with long pieces of text means splitting that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep semantically related pieces of text together.
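To make that complexity concrete, here is a minimal sketch in plain Python (not LangChain code). The helper names `split_fixed` and `split_by_sentence` are hypothetical and exist only for this illustration. The first cuts every fixed number of characters, so it can break a sentence in half. The second greedily packs whole sentences into chunks, using a simple regular expression as a rough stand-in for sentence detection.

```python
import re


def split_fixed(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Naive splitter: cut every `chunk_size` characters, ignoring meaning."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` characters after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def split_by_sentence(text: str, chunk_size: int) -> list[str]:
    """Greedy splitter: keep whole sentences together where possible.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    A single sentence longer than `chunk_size` is kept whole, not cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Cats sleep a lot. Dogs like to play. Birds can fly."
print(split_fixed(sample, 20))        # first chunk ends mid-sentence
print(split_by_sentence(sample, 20))  # one sentence per chunk
```

With a 20-character limit, the fixed-size splitter ends its first chunk in the middle of the second sentence, while the sentence-based splitter keeps each sentence intact. Real text is messier than this (abbreviations, headings, code, tables), which is where purpose-built splitters come in.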
LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents. In this two-part practical article, we will explore why document splitting matters, survey the available LangChain text splitters, and look at four of them in depth.
Table of Contents:
Why do we need document splitting?
Different types of LangChain splitters
Introduction to recursive character text splitter & the character text splitter
Diving deep in recursive splitting
PDF loading & splitting [Covered in part 2]
Token splitting [Covered in part 2]
Context-aware splitting [Covered in part 2]