
Let’s walk through the example code from LangChain’s website. I’ll explain how the
ParentDocumentRetriever works along the way.
Two splitters instead of one
Previously we only used one TextSplitter to split the raw text into multiple documents.
Now we need two of them: one for the larger chunks that carry more context (let's call these larger
chunks parents, or just docs) and another for the smaller chunks that carry sharper semantic meaning
(let's call these smaller chunks children, or sub docs).
from langchain.text_splitter import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
Storing both larger and smaller chunks
Now let’s create the vectorstore for storing the smaller chunks and an InMemoryStore docstore to store
the larger chunks.
from langchain.embeddings import OpenAIEmbeddings
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
Vector stores hold embeddings. Since we only embed the smaller chunks (they capture semantic meaning
better once embedded), the vectorstore stores only the small chunks, not the large ones.
For the larger ones, though, we use an InMemoryStore. It's a dictionary-like KEY-VALUE data
structure that lives in memory while the program is running.
In the InMemoryStore:
- Each key is a unique uuid for a large chunk
- Each value is the actual content of the corresponding large chunk

In the vectorstore:
- Alongside each small chunk's embedding, we store the unique uuid of its parent large chunk as
metadata. That parent is the large chunk the small chunk was split from.
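To make this layout concrete, here's a rough sketch of how one parent chunk and one of its children
would sit in the two stores. The texts are made up for illustration, and "doc_id" is the metadata
key the retriever uses by default:

import uuid

from langchain.schema import Document

parent = Document(page_content="...a large chunk of up to 2000 characters...")
parent_id = str(uuid.uuid4())

# docstore: maps the parent's uuid to the parent chunk itself
store.mset([(parent_id, parent)])

# vectorstore: the small chunk is embedded and carries its parent's uuid
child = Document(
    page_content="...a small chunk of up to 400 characters...",
    metadata={"doc_id": parent_id},
)
vectorstore.add_documents([child])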
Create the ParentDocumentRetriever
Now let’s create the ParentDocumentRetriever
from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
We pass the following arguments to the constructor:
- vectorstore: where the embeddings of the small chunks will be stored
- docstore: where the large chunks will be stored as KEY-VALUE pairs
- child_splitter: the text splitter that produces the small chunks
- parent_splitter: the text splitter that produces the large chunks
Adding the documents
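In LangChain's example, docs is a list of Document objects produced by loading text files. A minimal
stand-in, assuming a local file named state_of_the_union.txt (the file name here is illustrative):

from langchain.document_loaders import TextLoader

docs = TextLoader("state_of_the_union.txt").load()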
retriever.add_documents(docs)
The following happens under the hood when the add_documents() method is called (a simplified sketch
follows the list):
- docs is split into large chunks using parent_splitter
- For each large chunk:
  - a unique uuid is generated
  - the KEY-VALUE pair of that uuid and the large chunk is stored in the docstore
  - the large chunk is then further split into smaller chunks using child_splitter
  - all these smaller chunks are added to the vectorstore; while adding, each small chunk's doc_id
metadata field (the retriever's default id_key) is set to the uuid of its parent large chunk
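Conceptually, the indexing loop boils down to something like this. This is a simplified sketch under
the assumptions above, not LangChain's actual implementation:

import uuid

def add_documents_sketch(docs):
    # split the raw docs into large (parent) chunks
    parent_chunks = parent_splitter.split_documents(docs)
    for parent in parent_chunks:
        # give the parent chunk a unique id and store it in the docstore
        parent_id = str(uuid.uuid4())
        store.mset([(parent_id, parent)])
        # split the parent into small (child) chunks
        children = child_splitter.split_documents([parent])
        # tag each child with its parent's id, then embed and store it
        for child in children:
            child.metadata["doc_id"] = parent_id
        vectorstore.add_documents(children)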
After adding, we can see there are 66 keys in the store. So 66 large chunks have been added.
len(list(store.yield_keys()))
# outputs 66
Also, if we run a similarity search on the vectorstore itself, we get back only the small chunks.
sub_docs = vectorstore.similarity_search("justice breyer")
# only returns small chunks
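As a quick sanity check, the top hit should be no longer than the child splitter's chunk_size of
400 characters:

len(sub_docs[0].page_content)
# should be at most 400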
Retrieving relevant documents
retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)
# outputs 1849, which confirms we're indeed getting the larger chunks as the final output
The following happens when we call get_relevant_documents() on the ParentDocumentRetriever (a sketch
follows the list):
- First, the small chunks most similar to the query are fetched from the vectorstore
- Then we iterate over those small chunks in the order they're returned by the vectorstore
- For each small chunk, we add its parent large chunk's id to a list, skipping ids already in it (so
each parent is returned only once)
- Then, using those parent ids as keys, we fetch the corresponding large chunks from the docstore
and return them
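Here's a rough sketch of that retrieval logic (the real implementation lives in LangChain's
MultiVectorRetriever, which ParentDocumentRetriever builds on):

def get_relevant_documents_sketch(query):
    # similarity search runs over the embedded small chunks only
    sub_docs = vectorstore.similarity_search(query)
    # collect parent ids in result order, skipping duplicates
    parent_ids = []
    for d in sub_docs:
        if d.metadata["doc_id"] not in parent_ids:
            parent_ids.append(d.metadata["doc_id"])
    # swap the small chunks for their full parent chunks
    parents = store.mget(parent_ids)
    return [p for p in parents if p is not None]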
Thus we use the small chunks (with their sharper semantic meaning) for vector similarity matching,
but return their corresponding larger chunks, which carry the bigger picture and more context.
Hopefully the ParentDocumentRetriever will help you get a better set of relevant documents while
using LangChain for Retrieval Augmented Generation (RAG).