
Let’s walk through the example code from LangChain’s website. I’ll explain how the
ParentDocumentRetriever works along the way.
Two splitters instead of one
Previously we only used one TextSplitter to split the raw text into multiple documents.
Now we need two of them: one for the larger chunks that carry more context (let's call these larger
chunks parents, or just docs) and another for the smaller chunks that carry sharper semantic meaning
(let's call these smaller chunks children, or sub docs).
from langchain.text_splitter import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
Storing both larger and smaller chunks
Now let’s create the vectorstore for storing the smaller chunks and an InMemoryStore docstore to store
the larger chunks.
from langchain.embeddings import OpenAIEmbeddings
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
Vector stores hold embeddings. Since we only embed the smaller chunks (they capture semantic meaning
better once embedded), the vectorstore stores only the small chunks, not the large ones.
For the larger ones, though, we use an InMemoryStore. It's a dictionary-like KEY-VALUE data
structure that lives in memory while the program is running.
In the InMemoryStore:
- Each key is a unique uuid for a large chunk
- Each value is the actual content of the corresponding large chunk

In the vectorstore:
- Alongside each small chunk's embedding, we store the unique uuid of its parent large chunk as
metadata. That parent is the large chunk the small chunk was split from.
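To make this layout concrete, here's a rough sketch of how one parent chunk and one of its children
would sit in the two stores. The texts are made up for illustration, and "doc_id" is the metadata
key the retriever uses by default:

import uuid

from langchain.schema import Document

parent = Document(page_content="...a large chunk of up to 2000 characters...")
parent_id = str(uuid.uuid4())

# docstore: maps the parent's uuid to the parent chunk itself
store.mset([(parent_id, parent)])

# vectorstore: the small chunk is embedded and carries its parent's uuid
child = Document(
    page_content="...a small chunk of up to 400 characters...",
    metadata={"doc_id": parent_id},
)
vectorstore.add_documents([child])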
Create the ParentDocumentRetriever
Now let’s create the ParentDocumentRetriever
from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
We pass the following arguments to the constructor:
- vectorstore: where the embeddings of the small chunks will be stored
- docstore: where the large chunks will be stored as KEY-VALUE pairs
- child_splitter: the text splitter that produces the small chunks
- parent_splitter: the text splitter that produces the large chunks
Adding the documents
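In LangChain's example, docs is a list of Document objects produced by loading text files. A minimal
stand-in, assuming a local file named state_of_the_union.txt (the file name here is illustrative):

from langchain.document_loaders import TextLoader

docs = TextLoader("state_of_the_union.txt").load()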
retriever.add_documents(docs)
The following happens under the hood when the add_documents() method is called (a simplified sketch
follows the list):
- docs is split into large chunks using parent_splitter
- For each large chunk:
  - a unique uuid is generated
  - the KEY-VALUE pair of that uuid and the large chunk is stored in the docstore
  - the large chunk is then further split into smaller chunks using child_splitter
  - all these smaller chunks are added to the vectorstore; while adding, each small chunk's doc_id
metadata field (the retriever's default id_key) is set to the uuid of its parent large chunk
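Conceptually, the indexing loop boils down to something like this. This is a simplified sketch under
the assumptions above, not LangChain's actual implementation:

import uuid

def add_documents_sketch(docs):
    # split the raw docs into large (parent) chunks
    parent_chunks = parent_splitter.split_documents(docs)
    for parent in parent_chunks:
        # give the parent chunk a unique id and store it in the docstore
        parent_id = str(uuid.uuid4())
        store.mset([(parent_id, parent)])
        # split the parent into small (child) chunks
        children = child_splitter.split_documents([parent])
        # tag each child with its parent's id, then embed and store it
        for child in children:
            child.metadata["doc_id"] = parent_id
        vectorstore.add_documents(children)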
After adding, we can see there are 66 keys in the store. So 66 large chunks have been added.
len(list(store.yield_keys()))
# outputs 66
Also, if we run a similarity search on the vectorstore itself, we get back only the small chunks.
sub_docs = vectorstore.similarity_search("justice breyer")
# only returns small chunks
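As a quick sanity check, the top hit should be no longer than the child splitter's chunk_size of
400 characters:

len(sub_docs[0].page_content)
# should be at most 400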
Retrieving relevant documents
retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)
# outputs 1849, which confirms we're indeed getting the larger chunks as the final output
The following happens when we call get_relevant_documents() on the ParentDocumentRetriever (a sketch
follows the list):
- First, the small chunks most similar to the query are fetched from the vectorstore
- Then we iterate over those small chunks in the order they're returned by the vectorstore
- For each small chunk, we add its parent large chunk's id to a list, skipping ids already in it (so
each parent is returned only once)
- Then, using those parent ids as keys, we fetch the corresponding large chunks from the docstore
and return them
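Here's a rough sketch of that retrieval logic (the real implementation lives in LangChain's
MultiVectorRetriever, which ParentDocumentRetriever builds on):

def get_relevant_documents_sketch(query):
    # similarity search runs over the embedded small chunks only
    sub_docs = vectorstore.similarity_search(query)
    # collect parent ids in result order, skipping duplicates
    parent_ids = []
    for d in sub_docs:
        if d.metadata["doc_id"] not in parent_ids:
            parent_ids.append(d.metadata["doc_id"])
    # swap the small chunks for their full parent chunks
    parents = store.mget(parent_ids)
    return [p for p in parents if p is not None]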
Thus we use the small chunks (with their sharper semantic meaning) for vector similarity matching,
but return their corresponding larger chunks, which carry the bigger picture and more context.
Hopefully the ParentDocumentRetriever will help you get a better set of relevant documents while
using LangChain for Retrieval Augmented Generation (RAG).