Once you’ve loaded documents, you’ll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model’s context window.
When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together.
LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents. In this two-part practical article, we will explore the importance of document splitting, and the available LangChain text splitters and will explore four of them in-depth.
Table of Contents:
Introduction to recursive character text splitter & the character text splitter [Covered in Part 1]
PDF loading & splitting
Token splitting
Context-aware splitting