Integrating Vector Databases with LLMs: A Hands-On Guide
LLMs have been a game-changer in the tech world, driving innovation in application development. However, their full potential is often untapped when used in isolation. This is where Vector Databases step in, enhancing LLMs to produce not just any response, but the right one.
Typically, LLMs are trained on a wide array of data, which gives them a broad understanding but can lead to gaps in specific knowledge areas. Sometimes, they might even churn out information that’s off-target or biased — a byproduct of learning from the vast, unfiltered web. To address this, we introduce the concept of Vector Databases. These databases store data in a unique format known as ‘vector embeddings,’ which enable LLMs to grasp and utilize information more contextually and accurately.
This guide shows how to build an LLM application with a Vector Database and how that flow improves what the LLM can do. We'll look at how combining the two can make LLMs more accurate and useful, especially for specific topics.
Next, we offer a brief overview of Vector Databases, explaining the concept of vector embedding and its role in enhancing AI and machine learning applications. We’ll show you how these databases differ from traditional databases and why they are better suited for AI-driven tasks, particularly when working with unstructured data like text, images, and complex patterns.
Further, we’ll explore the practical application of this technology in building a Closed-QA bot. This bot, powered by Falcon-7B and ChromaDB, demonstrates the effectiveness of LLMs when coupled with the right tools and techniques.
By the end of this guide, you’ll have a clearer understanding of how to harness the power of LLMs and Vector Databases to create applications that are not only innovative but also context-aware and reliable. Whether you’re an AI enthusiast or a seasoned developer, this guide is tailored to help you navigate this exciting field with ease and confidence.
Overview of Vector Databases
Before diving into what a vector database is, it’s essential to understand the concept of vector embedding.
Vector embeddings are essential in machine learning for transforming raw data into a numerical format that AI systems can understand. This involves converting data, like text or images, into a series of numbers, known as vectors, in a high-dimensional space.
High-dimensional data refers to data that has many attributes or features, each representing a different dimension. These dimensions help in capturing the nuanced characteristics of the data.
The process of creating vector embeddings works in two steps:
1. Start with the input data, which could be anything from words in a sentence to pixels in an image.
2. Large Language Models and other AI algorithms analyze this data and identify its key features.
For example, in text data, this might involve understanding the meanings of words and their context within a sentence. The embedding model then translates these features into a numerical form, creating a vector for each piece of data. Each number in a vector represents a specific feature of the data, and together, these numbers encapsulate the essence of the original input in a format that the machine can process.
These vectors are high-dimensional because they contain many numbers, each corresponding to a different feature of the data. This high dimensionality allows the vectors to capture complex, detailed information, making them powerful tools for AI models. The models use these embeddings to recognize patterns, relationships, and underlying structures in the data.
Vector databases are engineered to provide optimized storage and querying abilities tailored for the distinct nature of vector embeddings. They excel in offering efficient search capabilities, high performance, scalability, and data retrieval by drawing comparisons and identifying similarities among data points.
These numerical representations of complex, high-dimensional information distinguish vector databases from traditional systems that primarily store data in formats like text and numbers. Their primary strength is in managing and querying data types such as images, videos, and text, particularly useful when these are transformed into vector format for machine learning and AI applications.
In the next illustration, we present the conversion of text into word vectors. This step is fundamental in natural language processing, enabling us to quantify and analyze linguistic relationships. For example, the vector representation of ‘puppy’ would be positioned closer in vector space to ‘dog’ than to ‘house,’ reflecting their semantic proximity. This approach extends to analogical relationships as well. The vector distance and direction between ‘man’ and ‘woman’ can be analogous to that between ‘king’ and ‘queen.’ This illustrates how word vectors not only represent words but also allow for a meaningful comparison of their semantic relationships in a multidimensional vector space.
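To make this concrete, here is a minimal sketch (assuming the sentence-transformers library and the all-MiniLM-L6-v2 model, which also appears later in this guide) that embeds three words and compares them with cosine similarity:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["puppy", "dog", "house"])   # one 384-dimensional vector per word

# cosine similarity: a higher score means the words are semantically closer
print(util.cos_sim(vectors[0], vectors[1]))         # puppy vs dog   -> relatively high
print(util.cos_sim(vectors[0], vectors[2]))         # puppy vs house -> relatively low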
What is a Vector Database?
📚 Resources: https://www.pinecone.io/learn/vector-database/
A vector database indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations, metadata filtering, horizontal scaling, and serverless deployment.
What’s the difference between a vector index and a vector database?
Standalone vector indices like FAISS (Facebook AI Similarity Search) can significantly improve the search and retrieval of vector embeddings, but they lack capabilities that exist in any database. Vector databases, on the other hand, are purpose-built to manage vector embeddings, providing several advantages over using standalone vector indices:
Data management: Vector databases offer well-known and easy-to-use features for data storage, like inserting, deleting, and updating data. This makes managing and maintaining vector data easier than using a standalone vector index like FAISS, which requires additional work to integrate with a storage solution.
Metadata storage and filtering: Vector databases can store metadata associated with each vector entry. Users can then query the database using additional metadata filters for finer-grained queries.
Scalability: Vector databases are designed to scale with growing data volumes and user demands, providing better support for distributed and parallel processing. Standalone vector indices may require custom solutions to achieve similar levels of scalability (such as deploying and managing them on Kubernetes clusters or other similar systems). Modern vector databases also use serverless architectures to optimize cost at scale.
Real-time updates: Vector databases often support real-time data updates, allowing for dynamic changes to the data to keep results fresh, whereas standalone vector indexes may require a full re-indexing process to incorporate new data, which can be time-consuming and computationally expensive. Advanced vector databases can use performance upgrades available via index rebuilds while maintaining freshness.
Backups and collections: Vector databases handle the routine operation of backing up all the data stored in the database. Pinecone also allows users to selectively choose specific indexes that can be backed up in the form of “collections,” which store the data in that index for later use.
Ecosystem integration: Vector databases can more easily integrate with other components of a data processing ecosystem, such as ETL pipelines (like Spark), analytics tools (like Tableau and Segment), and visualization platforms (like Grafana), streamlining the data management workflow. They also enable easy integration with other AI-related tooling like LangChain, LlamaIndex, Cohere, and many others.
Data security and access control: Vector databases typically offer built-in data security features and access control mechanisms to protect sensitive information, which may not be available in standalone vector index solutions. Multitenancy through namespaces allows users to partition their indexes fully and even create fully isolated partitions within their own index.
In short, a vector database provides a superior solution for handling vector embeddings by addressing the limitations of standalone vector indices, such as scalability challenges, cumbersome integration processes, and the absence of real-time updates and built-in security measures, ensuring a more effective and streamlined data management experience.
Vector Databases before the rise of LLMs
Vector databases, designed to handle vector embeddings, have several key use-cases, especially in the field of machine learning and AI:
Similarity Search:
This is a core function where vector databases excel. They can quickly find data points that are similar to a given query in a high-dimensional space. This is crucial for applications like image or audio retrieval, where you want to find items similar to a particular input. Here are some industry use-case examples:
- E-Commerce: Enhancing product discovery by allowing customers to search for products visually similar to a reference image.
- Music Streaming Services: Finding and recommending songs with audio features similar to a user’s favorite tracks.
- Healthcare Imaging: Assisting radiologists by retrieving medical images (like X-rays or MRIs) that display similar pathologies for comparative analysis.
Recommendation Systems:
Vector databases support recommendation systems by handling user and item embeddings. They can match users with items (like products, movies, or articles) that are most similar to their interests or past interactions. Here are some industry use-cases:
- Streaming Platforms: Personalizing viewing experiences by recommending movies and TV shows based on a viewer’s watching history.
- Online Retailers: Suggesting products to shoppers based on their browsing and purchase history, enhancing cross-selling and up-selling opportunities.
- News Aggregators: Delivering personalized news feeds by matching articles with a reader’s past engagement patterns and preferences.
Content-Based Retrieval:
Here, vector databases are used to search for content based on its actual substance rather than traditional metadata. This is particularly relevant for unstructured data like text and images, where the content itself needs to be analyzed for retrieval. Here are a few industry use-cases:
- Digital Asset Management: Enabling companies to manage vast libraries of digital media by facilitating search and retrieval of images or videos based on visual or audio content characteristics.
- Legal and Compliance: Searching through large volumes of documents to find specific information or documents that are contextually related to legal cases or compliance inquiries.
- Academic Research: Assisting researchers in finding scholarly articles and research papers that are contextually similar to their work, even if specific keywords are not mentioned.
This last point about content-based retrieval is increasingly significant and facilitates a novel application:
Enhancing LLMs with Contextual Understanding:
By storing and processing text embeddings, vector databases enable LLMs to perform more nuanced and context-aware information retrieval. They help in understanding the semantic content of large volumes of text, which is pivotal in tasks like answering complex queries, maintaining conversation context, or generating relevant content. This application is rapidly becoming a prominent use-case for vector databases, showcasing their ability to augment the capabilities of advanced AI systems like LLMs.
Vector vs. Traditional Databases
Traditional SQL databases excel in structured data management, thriving on exact matches and well-defined conditional logic. They maintain data integrity and suit applications needing precise, structured data handling. However, their rigid schema design makes them less adaptable to the semantic and contextual nuances of unstructured data, which is crucial in AI applications like LLMs and Generative AI.
NoSQL databases, on the other hand, offer more flexibility compared to traditional SQL systems. They can handle semi-structured and unstructured data, like JSON documents, which makes them somewhat more adaptable to AI and machine learning use cases. Despite this, even NoSQL databases can fall short in certain aspects of handling the complex, high-dimensional vector data essential for LLMs and Generative AI, which often involves interpreting context, patterns, and semantic content beyond simple data retrieval.
Vector databases fill this gap. Tailored for AI-centric scenarios, they process data as vectors, allowing them to effectively manage the intricacies of unstructured data. When working with LLMs, vector databases support operations like similarity search and contextual understanding, offering capabilities beyond both traditional SQL and flexible NoSQL databases. Their proficiency in working with approximations and pattern recognition makes them particularly suitable for AI applications where nuanced data interpretation is more important than retrieving exact data matches.
Improving Vector Database Performance
Optimizing the performance of vector databases is important for applications that rely on fast and accurate retrieval of high-dimensional data. This involves improving query speed, ensuring high accuracy, and maintaining scalability to handle growing data volumes and user requests efficiently. A significant part of this optimization revolves around indexing strategies, which are techniques used to organize and search through vector data more efficiently. Below, we expand on these indexing strategies and how they contribute to improving vector database performance.
Indexing Strategies
Indexing strategies in vector databases are designed to facilitate quick and accurate retrieval of vectors that are similar to a query vector. These strategies can dramatically affect both the speed and accuracy of search operations.
Quantization: Quantization involves mapping vectors to a finite set of reference points in the vector space, effectively compressing the vector data. This strategy reduces the storage requirements and speeds up the search process by limiting the search to a subset of reference points rather than the entire dataset. There are various forms of quantization, including Scalar Quantization and Vector Quantization, each with its trade-offs between search speed and accuracy.
Quantization is particularly effective for applications managing large-scale datasets where storage and memory efficiency are critical. It excels in environments where a balance between query speed and accuracy is acceptable, making it ideal for speed-sensitive applications that can tolerate some loss of precision. However, it is less recommended for use cases demanding the highest levels of accuracy and minimal information loss, such as precise scientific research, due to the inherent trade-offs between data compression and search precision.
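As an illustration, here is a rough sketch of product quantization with the faiss library (introduced later in this guide), using random vectors as a stand-in for real embeddings:
import numpy as np
import faiss

d = 128                                               # vector dimensionality
xb = np.random.random((10_000, d)).astype("float32")  # stand-in for real embeddings

m, nbits = 8, 8                                       # 8 sub-quantizers, 8 bits each: 8 bytes per vector
index = faiss.IndexPQ(d, m, nbits)
index.train(xb)                                       # learn the quantization codebooks
index.add(xb)                                         # store the compressed codes

distances, ids = index.search(xb[:1], 5)              # approximate 5 nearest neighbours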
Hierarchical Navigable Small World (HNSW) Graphs: HNSW is an indexing strategy that constructs a layered graph where each layer represents a different granularity of the dataset. Searches start from the top layer, which has fewer, more distant points, and move down to more detailed layers. This approach allows for rapid traversal of the dataset, significantly reducing the search time by quickly narrowing down the candidate set of similar vectors.
HNSW graphs strike an excellent balance between query speed and accuracy, making them well-suited for real-time search applications and recommendation systems that require immediate response times. They perform well with moderate to large datasets, offering scalable search capabilities. However, their memory consumption can become a limitation for extremely large datasets, making them less ideal for scenarios where memory resources are constrained or the dataset size significantly exceeds the practical in-memory capacity.
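A comparable sketch of an HNSW index in faiss (again with random stand-in vectors; the graph and search parameters are illustrative only):
import numpy as np
import faiss

d = 128
xb = np.random.random((10_000, d)).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)                    # 32 graph neighbours per node
index.hnsw.efSearch = 64                              # search breadth: higher is more accurate but slower
index.add(xb)                                         # HNSW needs no separate training step

distances, ids = index.search(xb[:1], 5)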
Inverted File Index (IVF): The IVF approach divides the vector space into a predefined number of clusters using algorithms like k-means. Each vector is assigned to the nearest cluster, and during a search, only vectors in the most relevant clusters are considered. This method reduces the search scope, improving query speed. Combining IVF with other techniques, such as Quantization (resulting in IVFADC — Inverted File Index with Asymmetric Distance Computation), can further enhance performance by reducing the computational cost of distance calculations.
The Inverted File Index (IVF) approach is recommended for handling high-dimensional data in scalable search environments, efficiently narrowing down search spaces by clustering similar items. It is particularly beneficial for datasets that are relatively static, where the overhead of occasional re-clustering is manageable. However, IVF may not be the best choice for low-dimensional data due to potential over-segmentation or for applications that demand the lowest possible latency, as the clustering process and the need to search across multiple clusters can introduce additional query time.
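And a sketch of an IVF index in faiss (random stand-in vectors again; nlist and nprobe are illustrative values you would tune):
import numpy as np
import faiss

d, nlist = 128, 100                                   # 100 clusters (Voronoi cells)
xb = np.random.random((10_000, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)                      # used to assign vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                       # k-means clustering of the dataset
index.add(xb)

index.nprobe = 10                                     # clusters to visit per query: speed/recall trade-off
distances, ids = index.search(xb[:1], 5)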
Additional Considerations for Optimization
Dimensionality Reduction: Before applying indexing strategies, reducing the dimensionality of vectors can be beneficial. Techniques like PCA or autoencoders help in preserving the essential features of the data while reducing its complexity, which can improve both the efficiency of indexing and the speed of search operations.
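For example, a short sketch of reducing embedding dimensionality with PCA (assuming scikit-learn and, again, random vectors standing in for real embeddings):
import numpy as np
from sklearn.decomposition import PCA

xb = np.random.random((10_000, 768)).astype("float32")  # e.g. 768-dimensional embeddings

pca = PCA(n_components=128)                              # keep the 128 strongest directions
xb_reduced = pca.fit_transform(xb)                       # shape (10_000, 128), ready for indexing
print(pca.explained_variance_ratio_.sum())               # fraction of variance the reduction preserves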
Parallel Processing: Many indexing strategies can be parallelized, either on CPUs with multiple cores or on GPUs. This parallel processing capability allows for handling multiple queries simultaneously, significantly improving throughput and reducing response times for large-scale applications.
Dynamic Indexing: For databases that frequently update their data, dynamic indexing strategies that allow for efficient insertion and deletion of vectors without significant reorganization of the index can be crucial. This ensures that the database remains responsive and up-to-date with minimal performance degradation over time.
Improving vector database performance through these indexing strategies and considerations involves a deep understanding of both the underlying data and the specific requirements of the application. By carefully selecting and tuning these strategies, developers can significantly enhance the responsiveness and scalability of their vector-based applications, ensuring that they meet the demands of real-world use cases.
Addressing Diversity — MMR (Maximum Marginal Relevance)
Similarity search returns the responses closest to your question. But to give the model complete information, you might not want to focus only on the most similar texts. For example, for the question “breakfast in Travelodge Farringdon”, the top five customer reviews might all be about coffee. If we look only at them, we will miss other comments mentioning eggs or staff behaviour and get a somewhat limited view of the customer feedback.
We could use the MMR (Maximum Marginal Relevance) approach to increase the diversity of customer comments. It is pretty straightforward:
First, we fetch the fetch_k documents most similar to the question using similarity_search.
Then, we pick the k most diverse documents among them.
If we want to use MMR, we should call max_marginal_relevance_search instead of similarity_search and specify the fetch_k number. It's worth keeping fetch_k relatively small so that you don't get irrelevant answers in the output. That's it.
query_docs = vectordb.max_marginal_relevance_search(
    'politeness of staff', k=3, fetch_k=30
)
Let’s look at the examples for the same query. We got more diverse feedback this time. There’s even a comment with negative sentiment.
Addressing specificity — LLM-aided retrieval
The other problem is that we don't take the metadata into account while retrieving documents. To solve it, we can ask the LLM to split the initial question into two parts:
a semantic filter based on the document texts,
a filter based on the metadata we have.
This approach is called “Self querying”.
First, let’s add a manual filter specifying a source parameter with the filename related to Travelodge Farringdon hotel.
query_docs = vectordb.similarity_search(
    'breakfast in Travelodge Farringdon',
    k=5,
    filter={'source': 'hotels/london/uk_england_london_travelodge_london_farringdon'}
)
Now, let’s try to use LLM to come up with such a filter automatically. We need to describe all our metadata parameters in detail and then use SelfQueryRetriever.
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="All sources start with 'hotels/london/uk_england_london_', "
                    "followed by the hotel chain, the constant 'london_' and the location.",
        type="string",
    )
]
document_content_description = "Customer reviews for hotels"
llm = OpenAI(temperature=0.1) # low temperature to make model more factual
# by default 'text-davinci-003' is used
retriever = SelfQueryRetriever.from_llm(
llm,
vectordb,
document_content_description,
metadata_field_info,
verbose=True
)
question = "breakfast in Travelodge Farringdon"
docs = retriever.get_relevant_documents(question, k = 5)
Our case is tricky since the source parameter in the metadata consists of multiple fields: country, city, hotel chain and location. It’s worth splitting such complex parameters into more granular ones in such situations so that the model can easily understand how to use metadata filters.
However, with a detailed prompt, it worked and returned only documents related to Travelodge Farringdon. But I must confess, it took me several iterations to achieve this result.
Let’s switch on debug and see how it works. To enter debug mode, you just need to execute the code below.
import langchain
langchain.debug = True
The complete prompt is pretty long, so let’s look at the main parts of it. Here’s the prompt’s start, which gives the model an overview of what we expect and the main criteria for the result.
Then, the few-shot prompting technique is used, and the model is provided with two examples of input and expected output. Here’s one of the examples.
We are not using a chat model like ChatGPT but a general LLM (not fine-tuned on instructions). It's trained simply to predict the next tokens of the text. That's why we finish our prompt with our question and the string Structured output:, expecting the model to complete it with the answer.
Addressing size limitations — Compression
The other technique for retrieval that might be handy is compression. Even though GPT 4 Turbo has a context size of 128K tokens, it’s still limited. That’s why we might want to preprocess documents and extract only relevant parts.
The main advantages are:
You will be able to fit more documents and information into the final prompt since they will be condensed.
You will get better, more focused results because the non-relevant context will be cleaned during preprocessing.
These benefits come at a cost: you will make more calls to the LLM for compression, which means lower speed and higher price.
You can find more info about this technique in the docs.
Actually, we can even combine techniques and use MMR here. We used ContextualCompressionRetriever to get results. Also, we specified that we want just three documents in return.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectordb.as_retriever(search_type = "mmr",
search_kwargs={"k": 3})
)
question = "breakfast in Travelodge Farringdon"
compressed_docs = compression_retriever.get_relevant_documents(question)
As usual, understanding how it works under the hood is the most exciting part. If we look at actual calls, there are three calls to LLM to extract only relevant information from the text. Here’s an example.
In the output, we got only part of the sentence related to breakfast, so compression helps.
There are many more beneficial approaches for retrieval, for example, techniques from classic NLP: SVM or TF-IDF. Different retrievers might be helpful in different situations, so I recommend you compare different versions for your task and select the most suitable one for your use case.
Facebook AI Similarity Search
FAISS Tutorial
Facebook AI Similarity Search (Faiss) is one of the most popular implementations of efficient similarity search.
Faiss is a library, developed by Facebook AI, that enables efficient similarity search. It even supports GPUs!
So, given a set of vectors, we can index them using Faiss — then using another vector (the query vector), we search for the most similar vectors within the index.
Now, Faiss not only allows us to build an index and search — but it also speeds up search times to ludicrous performance levels — something we will explore throughout this article.
Building Some Vectors
The first thing we need is data. We'll be concatenating several datasets from this semantic text similarity hub repo, downloading each dataset and extracting the relevant text columns into a single list.
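As a rough sketch (assuming the sentences from those datasets have already been collected into a Python list, and that the sentence-transformers library with the all-MiniLM-L6-v2 model is installed), building the vectors could look like this:
from sentence_transformers import SentenceTransformer

# stand-in for the full list of texts extracted from the STS datasets
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder works here
sentence_embeddings = model.encode(sentences)     # numpy array, one row per sentence
print(sentence_embeddings.shape)                  # (num_sentences, embedding_dimension)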
IndexFlatL2
IndexFlatL2 measures the L2 (or Euclidean) distance between our query vector and every vector loaded into the index. It's simple and very accurate, but not too fast.
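Continuing the sketch above (reusing the hypothetical sentence_embeddings and model from the previous snippet), a flat L2 index can be built and queried like this:
import numpy as np
import faiss

d = sentence_embeddings.shape[1]              # embedding dimensionality
index = faiss.IndexFlatL2(d)                  # exact (brute-force) L2 search, no training step

index.add(np.asarray(sentence_embeddings, dtype="float32"))
print(index.ntotal)                           # number of vectors in the index

query = model.encode(["Someone is eating"]).astype("float32")
distances, ids = index.search(query, 3)       # ids of the 3 closest sentences
print(ids)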
Pinecone, LangChain
One difficulty with LLMs is that they only know what they learned during training. So how do we get them to use private data? One way is to make new text data discoverable by the LLM. The typical way to do this is to convert all private data into embeddings stored in a vector database. The process is as follows:
Chunk the data into small pieces
Pass that data through an LLM. The resulting final layer of the network can be used as a semantic vector representation of the data
The vector representation can then be stored in a database and later used to recover that piece of data
To store the data, I use Pinecone. You can create a free account and automatically get API keys with which to access the database:
In the “indexes” tab, click on “create index.” Give it a name and a dimension matching your embeddings. I used “1536” for the dimension, as that is the size of the chosen embedding from the OpenAI embedding model. Use 384 if using the HuggingFace all-MiniLM-L6-v2 embeddings, as in the code below. I use the cosine similarity metric to search for similar documents.
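You can also create the index from code instead of the console. Here is a minimal sketch, assuming the v3+ pinecone client and a serverless index; the cloud and region values are placeholders you should adjust:
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

# create the index once; dimension 384 matches the HuggingFace all-MiniLM-L6-v2 embeddings used below
if "d384" not in pc.list_indexes().names():
    pc.create_index(
        name="d384",
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholder cloud/region
    )
With the index in place, we can embed the document chunks and load them into it: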
# embed the document chunks and store them in the Pinecone index
import os
from pinecone import Pinecone
from langchain_community.vectorstores import Pinecone as PineconeVectorStore
from langchain_community.embeddings import HuggingFaceEmbeddings

pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

# we use the HuggingFace embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", model_kwargs={'device': 'cpu'})

# docs_split is the list of chunked documents created with the text splitter shown further down
doc_db = PineconeVectorStore.from_documents(
    docs_split,
    embeddings,
    index_name='d384'
)
Query
We can now search for relevant documents in that database using the cosine similarity metric
query = "What were the most important events for Google in 2021?"
search_docs = doc_db.similarity_search(query)
search_docs
Retrieving data with LLMs
from langchain_community.llms import LlamaCpp
from langchain.chains import RetrievalQA

llm = LlamaCpp(
    model_path="e:/models/llama/llama-2-7b-chat.Q6_K.gguf",
    n_gpu_layers=40,
    n_ctx=2048,
    n_batch=256,  # Batch size for model processing
)

# wire the LLM to the Pinecone vector store with a RetrievalQA chain ("stuff" chain type, explained below)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=doc_db.as_retriever(),
)

query = "What were the earnings in 2022?"
result = qa.run(query)
result
RetrievalQA is actually a wrapper around a specific prompt. The chain type “stuff” puts all of the retrieved context into a single prompt, assuming the whole text fits into the context window. It uses the following prompt template:
Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}
{question}
Here {context} will be populated with the retrieved documents found in the database and {question} with the user's question. You can use other chain types: “map_reduce”, “refine”, and “map_rerank” if the text is longer than the context window.
To produce docs_split in the first place, split the documents into chunks; each chunk corresponds to one embedding vector.
from langchain.text_splitter import CharacterTextSplitter

# `docs` is assumed to be a list of LangChain Documents produced by a document loader
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0
)
docs_split = text_splitter.split_documents(docs)
docs_split
ChromaDB
Where Vector Databases Come In
Okay, where do vector stores and vector databases come into all of this? As mentioned earlier, LLMs and computers in general do not understand human text; they only work with numbers. Hence we need to embed each piece of information in the database. To store the embedded data, or embeddings, we need a vector database rather than just a traditional database. Why? Because they provide additional features, such as semantic search and retrieval options, that traditional SQL or NoSQL databases do not come with.
There are different vector databases out there; they differ in many ways, such as the algorithms used to perform similarity searches, the embedding functions they support, and much more.
Let's take the example of a chatbot that answers user questions based on a PDF document. To create such a system, we first need to create embeddings for the PDF document and store them in a vector store. When a user question comes into the system, we convert that question from text into an embedding (we embed the user question). We then query the vector database with a similarity search to find the vectors most similar to the question embedding.
The relevant documents are then passed on to the LLM, which uses this information to generate a response (the Generation part of RAG).
Creating A Collection
You can think of a collection as a table, just like the way we have it in Relational Database Management Systems(SQL databases) or documents in No-SQL databases.
Chroma lets you manage collections of embeddings, using the collection primitive.
Chroma uses collection names in the url, so there are a few restrictions on naming them:
The length of the name must be between 3 and 63 characters.
The name must start and end with a lowercase letter or a digit, and it can contain dots, dashes, and underscores in between.
The name must not contain two consecutive dots.
The name must not be a valid IP address. (official chroma docs)
In Chroma, collections are created with a name and an optional embedding function. If you do not pass an embedding function, the sentence transformer all-MiniLM-L6-v2 is used by default. Embedding functions do two main things: tokenization and embedding.
You can pass in your own embedding function if you wish, but note that you must pass the same embedding function when you later retrieve the collection with get_collection.
Let’s see the code needed to create a collection:
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# create collection
collection = client.create_collection(
name="my_programming_collection"
)
# get collection
collection = client.get_collection(
name="my_programming_collection"
)
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
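If you want to be explicit about the embedding function (or swap in a different model), you can pass one when creating and getting the collection. A short sketch using the sentence-transformers function that ships with Chroma (the collection name my_custom_ef_collection is just an example):
import chromadb
from chromadb.utils import embedding_functions

# on disk client
client = chromadb.PersistentClient(path="./vectorstore")

# an explicit sentence-transformers embedding function (same model as Chroma's default)
sentence_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# create (or fetch) the collection with the explicit embedding function
collection = client.get_or_create_collection(
    name="my_custom_ef_collection",
    embedding_function=sentence_ef
)

# later, pass the same embedding function again when getting the collection
collection = client.get_collection(
    name="my_custom_ef_collection",
    embedding_function=sentence_ef
)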
Convenience Methods With Collections
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# Get a list of first 10 items
first_ten = collection.peek()
print(first_ten)
# Get a count of all items in the collection
collection_count = collection.count()
print(collection_count)
# modify() renames the collection and returns None
# (here we pass the same name, so nothing actually changes)
collection.modify(name="my_programming_collection")
Collection Convenience Methods Output
You can clearly see that there is nothing inside the vector store yet. Let's learn how to add content to it.
Adding Documents To A Collection
Now that we are able to create these collections, check their content and have them saved on our computer, let's move on to adding documents to the collections we just created.
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# Get the first ten items before adding content inside
first_ten = collection.peek()
print(first_ten)
# Get the total count before adding content inside
collection_count = collection.count()
print(collection_count)
collection.add(
documents=[
"Python is a create interpreted language",
"Working with files in Python, can be done using a context manager",
"Type-hints is a great way to document your code"
],
metadatas=[{"page": 2, "paragraph": 6},
{"page": 100, "paragraph": 8},
{"page": 150, "paragraph": 1}],
ids=["1xx", "2xx", "3xx"]
)
# Get the first ten items after adding content inside
first_ten = collection.peek()
print(first_ten)
# Get the total count after adding content inside
collection_count = collection.count()
print(collection_count)
Content Inside Collection Output
You can see the count of the items in the collection is 3 and we also have the embeddings printed out on the screen.
Querying A Collection
We have done a lot so far including adding documents to the collection. The most important thing is being able to query the collection and retrieve relevant documents we can use to answer a user query.
Chroma collections can be queried in a variety of ways, using the .query method.
The n_results argument specifies the number of matches you want returned from the query. Matches are ranked by the distance between the query embedding and the data in the vector store, in ascending order: the lower the distance, the closer the match. The top n_results matches are then returned.
You can query by query_embeddings or by query_texts. When using query_texts, Chroma will first create an embedding from the provided query text using the collection's embedding function, and then use the generated embedding to query the collection.
Here’s how we can query the collection
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# query collection
result = collection.query(
query_texts=["Python"],
n_results=2
)
print(result)
Collection Query Results Output
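If you already have a vector, you can query with query_embeddings instead of query_texts. A small sketch, assuming sentence-transformers is installed; the embedding must come from the same model as the collection's embedding function (here the default all-MiniLM-L6-v2):
import chromadb
from sentence_transformers import SentenceTransformer

# on disk client
client = chromadb.PersistentClient(path="./vectorstore")
collection = client.get_or_create_collection(name="my_programming_collection")

# embed the query text ourselves with the same model as the collection's default embedding function
model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode("Python").tolist()

result = collection.query(
    query_embeddings=[query_embedding],
    n_results=2
)
print(result)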
We can also retrieve documents using the document ID we passed when creating the documents in the collection. Here’s how we can go about this:
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# query collection
result = collection.get(
ids=["1xx", "3xx"]
)
print(result)
Collection Query With ID Results Output
The get method also supports a where clause. Here is an example using it.
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# query collection
result = collection.get(
    ids=["1xx", "3xx"],
    where={"paragraph": 6}  # matches item "1xx", which was added with paragraph 6 in its metadata
)
print(result)
Collection Query With Where Results Output
You can use the include parameter to control which fields come back. In this case we only want the documents; you can also include the metadatas or embeddings. By default, include returns the documents and metadatas.
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# query collection
result = collection.get(
ids=["1xx", "3xx"],
where={"page": 2},
include=["documents"]
)
print(result)
Chroma supports filtering queries by metadata and document contents. The where filter is used to filter by metadata, and the where_document filter is used to filter by document contents.
Using Filters On Metadata
ChromaDB provides us with a list of filter operators we can use to narrow the data down to only the relevant documents we need. Here are a couple of these operators:
$eq - equal to (string, int, float)
$ne - not equal to (string, int, float)
$gt - greater than (int, float)
$gte - greater than or equal to (int, float)
$lt - less than (int, float)
$lte - less than or equal to (int, float)
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# query collection
result = collection.get(
ids=["1xx", "3xx"],
where={"page": {"$eq": 2}},
)
print(result)
Filter Using Where $eq
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# query collection
result = collection.get(
where={"page": {"$lt": 150}},
)
print(result)
Filter Using Where $lt
Hope these two examples give you the gist of how the filters work. Feel free to experiment with the other filters available.
Where filters only match items whose metadata contains the given key. For example, if you run collection.get(where={"version": {"$ne": 1}}), items whose metadata does not have the key version will not be returned.
Filtering By Document Content
To filter based on the content of a document, we have to specify the where_document and pass in the filter we want to use to filter the information. Here’s a quick example:
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# query collection
result = collection.get(
where_document={"$contains": "great way"},
)
print(result)
Filtering based on document content using $contains
Using Logical Operators
Just like in a normal RDBMS, Chroma has logical operators too, namely $or and $and. Let's see an example of how to apply them.
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# query collection
result = collection.get(
where={"$or": [
{
"page": {
"$lt": 100
}
},
{
"paragraph": {
"$lt": 8
}
}
]},
)
print(result)
Filtering With Logical Operators
Filtering Using Inclusion Operators
The following inclusion operators are supported:
$in - the value is in a predefined list (string, int, float, bool)
$nin - the value is not in a predefined list (string, int, float, bool)
Here’s a quick example:
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# query collection
result = collection.get(
where={"page": {
"$in": [100, 150]}},
)
print(result)
Updating Data In Collection
In ChromaDB, we can perform collection content updates as part of the CRUD functionality provided to us. Here’s an example of how to update the content of a collection:
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# update collection
result = collection.update(
documents=["Updated document content"],
ids=["2xx"],
metadatas=[{"page": 205, "paragraph": 4}]
)
result = collection.get(
where={"page": {
"$in": [205, 150]}},
include=["documents"]
)
print(result)
NOTE: if a provided ID is not found, the update for it is ignored and an error is logged. If no embeddings are provided, the collection's embedding function will be used to create them from the documents.
ChromaDB also provides the upsert method which allows us to update a given document or create a new item in the collection in case the provided id does not exist. Here’s an example of how this works.
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# update collection
result = collection.upsert(
documents=["Updated document content"],
ids=["8xx"],
metadatas=[{"page": 205, "paragraph": 4}]
)
result = collection.get(
where={"page": {
"$in": [205, 150]}},
include=["documents"]
)
print(result)
I ran this code twice, first with the id 2xx, which already exists, and then with the id 8xx, which does not exist. Here is a picture of the output.
Upsert Method Result Output
Deleting Items In A Collection
On to the last CRUD operation: delete enables us to remove items from a collection. This is a destructive operation and cannot be undone.
We can delete items by simply providing the id of the item we wish to delete from the collection. If the id does not exist, the delete operation is ignored.
import chromadb
# on disk client
client = chromadb.PersistentClient(
path="./vectorstore") # path defaults to .chroma
# we can do the creation and getting of a collection in one line
collection = client.get_or_create_collection(name="my_programming_collection")
# delete collection
result = collection.delete(ids=["8xx"])
result = collection.get(
ids=["8xx"]
)
print(result)
.delete also supports the where filter. If no ids are supplied, it will delete all items in the collection that match the where filter.
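For example, a quick sketch of deleting by metadata instead of by id (the $gt value is arbitrary; this removes every matching item from the example collection):
import chromadb

# on disk client
client = chromadb.PersistentClient(path="./vectorstore")
collection = client.get_or_create_collection(name="my_programming_collection")

# delete every item whose metadata matches the filter
collection.delete(where={"page": {"$gt": 100}})
print(collection.count())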
Python Http-only Client
If you do not require the whole functionality that Chroma provides, you can use the more lightweight HTTP-only client in Python. To use the HTTP client, we first need to install one additional library. You do not need to have the Chroma library installed before installing the HTTP client library; it's a standalone library.
Here’s how you can go about installing it using the command below.
$ pip install chromadb-client
Here’s how to connect to the HTTP-only client
import chromadb

client = chromadb.HttpClient(host='localhost', port=8000)
If you want the full Chroma functionality, you can install the chromadb package instead. Most importantly, the HTTP-only client has no default embedding function: if you add() documents without embeddings, you must manually specify an embedding function and install its dependencies.
Vector Stores In LangChain
Using ChromaDB in LangChain
LangChain is a Python library for working with Large Language Models. It provides us with a ton of functionality, making our work much easier when interacting with LLMs. We can also use ChromaDB when working with LangChain. Here's the code for a simple setup. Make sure to read the code and install the libraries mentioned in the comments.
Make sure you specify the document you wish to use in the PyPDFLoader class. I am using an http_protocols.pdf document, which is about HTTP methods; you can use any PDF document you want. Again, I won't be going in depth about what this code does; that deserves an article all of its own.
# pip install pypdf
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
# pip install chroma
from langchain.vectorstores import Chroma
# pip install sentence-transformers
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
# load documents
loader = PyPDFLoader(file_path="../docs/http_protocols.pdf")
documents = loader.load()
# split into chunks
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
docs = text_splitter.split_documents(documents=documents)
embedding_function = SentenceTransformerEmbeddings(
model_name="all-MiniLM-L6-v2")
# load into chroma
db = Chroma.from_documents(documents=docs,
embedding=embedding_function,
collection_name="basic_langchain_chroma",
persist_directory="vector_docs"
)
query = "What is the long form of http"
docs = db.similarity_search(query)
# print(docs)
# print(docs[0].page_content)
print(docs[0].page_content.replace("\n", " "))
LangChain and ChromaDB Response Output
Create A Vector Store Asynchronously
Reading from and writing to a database can take time, and usually you want these operations to happen in the background, without your application waiting for them to complete before it can process other tasks in the queue. In such cases, an asynchronous approach will do the trick. Just like traditional databases, ChromaDB and other vector stores are databases at heart, right? So when developing applications with libraries like FastAPI in Python, you will want an asynchronous vector store. Here's how to do this; the example below uses the Qdrant vector store through LangChain's async API.
$ pip install qdrant-client
You can read more about the library we just installed from here. Just to give you a quick description of what Qdrant is, here is a text description from the official docs:
Powering the next generation of AI applications with advanced and high-performant vector similarity search technology.
# pip install pypdf
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
# pip install qdrant-client
from langchain.vectorstores import Qdrant
# pip install sentence-transformers
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
import asyncio
# load documents
loader = PyPDFLoader(file_path="../docs/http_protocols.pdf")
documents = loader.load()
# split into chunks
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
docs = text_splitter.split_documents(documents=documents)
embedding_function = SentenceTransformerEmbeddings(
model_name="all-MiniLM-L6-v2")
async def asynchronous_vector_store(docs, embedding_function) -> None:
    # build an in-memory Qdrant store and run an asynchronous similarity search
    db = Qdrant.from_documents(docs, embedding_function, location=":memory:")
    query = "What is the long form of http"
    docs = await db.asimilarity_search(query)
    print(docs[0].page_content.replace("\n", " "))

asyncio.run(asynchronous_vector_store(
    docs=docs, embedding_function=embedding_function))
Conclusion
Congratulations for making it to the end!! It's a long article, and the sad thing is, I am not even done yet. There's a project I wish to add before I truly wrap this up; I think I'll make a separate article on that project soon.
Anyway, I hope this was helpful and that you gained something from it. If you enjoyed this article, kindly follow me for weekly posts like this one. If you prefer video content, follow me on YouTube as well, where I create video versions of my articles.
Reference: https://medium.com/aimonks/introduction-to-chromadb-vector-store-for-generative-ai-llms-28f90535086