Evolution, Visualisation & Applications of Embeddings

As human beings, we can read and understand texts (at least some of them). Computers, in contrast, "think in numbers", so they can't automatically grasp the meaning of words and sentences. If we want computers to understand natural language, we need to convert this information into a format that computers can work with — vectors of numbers. People learned how to convert texts into a machine-readable format many years ago (one of the first versions was ASCII). Such an approach helps render and transfer texts but doesn't encode their meaning. Back then, the standard search technique was keyword search, where you simply looked for all the documents that contained specific words or N-grams. Then, decades later, embeddings emerged. We can calculate embeddings for words, sentences, and even images. Embeddings are also vectors of numbers, but they can capture the meaning. So you can use them to do semantic search and even work with documents in different languages. In this article, I would like to dive deeper into the embedding topic and discuss all the details:

- what preceded embeddings and how they evolved,
- how to calculate embeddings using OpenAI tools,
- how to define whether sentences are close to each other,
- how to visualise embeddings,
- and, the most exciting part, how you could use embeddings in practice.

Let's move on and learn about the evolution of embeddings.

Tokens and Embeddings in LLMs

In the world of natural language processing, a token is the smallest unit of analysis that we define. What you call a token depends on your tokenization method; plenty of such methods exist. Creating tokens is basically the first step for most NLP tasks.

Tokenization Methods in NLP

# Example string for tokenization
example_string = "It's over 9000!"

# Method 1: White Space Tokenization
# This method splits the text based on white spaces
white_space_tokens = example_string.split()

# Method 2: WordPunct Tokenization
# This method splits the text into words and punctuation
from nltk.tokenize import WordPunctTokenizer
wordpunct_tokenizer = WordPunctTokenizer()
wordpunct_tokens = wordpunct_tokenizer.tokenize(example_string)

# Method 3: Treebank Word Tokenization
# This method uses the standard word tokenization of the Penn Treebank
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(example_string)

white_space_tokens, wordpunct_tokens, treebank_tokens

(["It's", 'over', '9000!'],
 ['It', "'", 's', 'over', '9000', '!'],
 ['It', "'s", 'over', '9000', '!'])

Why do we need to tokenize strings?

- To break down complex text into manageable units.
- To present text in a format that is easier to analyze or perform operations on.
- To support specific linguistic tasks like part-of-speech tagging, syntactic parsing, and named entity recognition.
- To uniformly preprocess text in NLP applications and create structured training data.
Most NLP systems perform some operations on these tokens to carry out a specific task. For example, we can design a system to take a sequence of tokens and predict the next token. We can also convert the tokens into their phonetic representation as part of a text-to-speech system. Tokens also underpin many other NLP tasks, such as keyword extraction and translation.
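To make the "predict the next token" idea concrete, here is a toy sketch (purely illustrative, and nothing like how LLMs actually do it): a bigram model that counts which token follows which in a tiny made-up, whitespace-tokenized corpus and predicts the most frequent successor.

from collections import Counter, defaultdict

# Tiny whitespace-tokenized corpus (made up for illustration)
corpus = "the cat sat on the mat the cat slept on the sofa".split()

# Count how often each token follows each other token (a bigram model)
successors = defaultdict(Counter)
for current_token, next_token in zip(corpus, corpus[1:]):
    successors[current_token][next_token] += 1

def predict_next(token):
    # Return the most frequent successor seen in the corpus, if any
    counts = successors.get(token)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' (seen twice after 'the' in the corpus)
print(predict_next("on"))   # 'the'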

How do we actually use these tokens to build these systems in the first place?

1. Feature Extraction: Tokens are used to extract features that are fed into machine learning models. Features might include the tokens themselves, their frequency, their part-of-speech tags, their position in a sentence, etc. For instance, in sentiment analysis, the presence of certain tokens might be strongly indicative of positive or negative sentiment.
2. Vectorization: In many NLP tasks, tokens are converted into numerical vectors using techniques like Bag of Words (BoW), TF-IDF (Term Frequency-Inverse Document Frequency), or word embeddings (like Word2Vec, GloVe). This process turns text data into numbers that machine learning models can understand and work with (see the short sketch after this list).
3. Sequence Modeling: For tasks like language modeling, machine translation, and text generation, tokens are used in sequence models like Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), or Transformers. These models learn to predict sequences of tokens, understanding the context and the likelihood of token occurrence.
4. Training the Model: In the training phase, models are fed tokenized text and corresponding labels or targets (like categories for classification tasks or next tokens for language models). The models learn patterns and associations between the tokens and the desired output.
5. Context Understanding: Advanced models like BERT and GPT use tokens to understand context and generate embeddings that capture the meaning of a word in a specific context. This is crucial for tasks where the same word can have different meanings based on its usage.

In very simple terms, we have text strings that we convert to independent units called tokens. This makes it easier to convert them to "numbers" later, which the computer understands.
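As promised in point 2 above, here is a short vectorization sketch. It assumes scikit-learn is installed; the two example documents are made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Bag of Words: each document becomes a vector of raw token counts
bow = CountVectorizer()
bow_matrix = bow.fit_transform(documents)
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

# TF-IDF: counts are reweighted so tokens shared by every document matter less
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)
print(tfidf_matrix.toarray().round(2))

Each row of the resulting matrices is a numerical vector representing one document, which is exactly the form a downstream machine learning model can consume.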

ChatGPT and Tokens

What do tokens look like in the context of LLMs like ChatGPT? The tokenization methods used for LLMs differ from those used in general NLP. Broadly speaking, we can call it "subword tokenization": we create tokens that are not necessarily complete words, as they would be in whitespace tokenization. This is precisely why one word is not equal to one token. When people say GPT-4 Turbo has a 128K-token context length, that is not exactly 128K words, but a number close to it.

Why use such different and more complicated tokenization methods?

1. Subword tokens are more intricate representations of language than complete words.
2. They help cover a large vocabulary, including rare and unknown words.
3. Working with smaller subunits is computationally more efficient.
4. They help with better contextual understanding.
5. They are more adaptable across languages that can be quite different from English.
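To see subword tokenization in action, here is a small sketch. It assumes the tiktoken package is installed; encoding_for_model("gpt-4") loads the tokenizer family used by GPT-4-era OpenAI models, so the exact token counts here are illustrative rather than guaranteed.

import tiktoken

# Load the tokenizer used by GPT-4-era OpenAI models
enc = tiktoken.encoding_for_model("gpt-4")

text = "Tokenization of uncommon words like 'electroencephalography'"
token_ids = enc.encode(text)

# One word is usually not one token: rare words get split into subword pieces
print(len(text.split()), "words ->", len(token_ids), "tokens")
print([enc.decode([token_id]) for token_id in token_ids])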

Byte-Pair-Encoding (BPE)

Many open-source models, like Meta's Llama 2 and the older GPT models, use a version of this method. In a real-world context, BPE analyzes a large amount of text to determine the most common pairs.

from transformers import GPT2Tokenizer

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "It's over 9000!"

# Tokenize the text
token_ids = tokenizer.encode(text, add_special_tokens=True)

# Output the token IDs
print("Token IDs:", token_ids)

# Convert token IDs back to raw tokens and output them
raw_tokens = [tokenizer.decode([token_id]) for token_id in token_ids]
print("Raw tokens:", raw_tokens)

Token IDs: [1026, 338, 625, 50138, 0]
Raw tokens: ['It', "'s", ' over', ' 9000', '!']

Let's break down how this process works.

Building the "vocabulary" (this is basically part of the BPE method):

- Starting with Characters: Initially, the vocabulary consists of individual characters (like letters and punctuation).
- Finding Common Pairs: The training data (a large corpus of text) is scanned to find the most frequently occurring pairs of characters. For example, if 'th' appears often, it becomes a candidate to be added to the vocabulary.
- Merging and Creating New Tokens: These common pairs are then merged to form new tokens. The process continues iteratively, each time identifying and merging the next most frequent pair. The vocabulary grows from individual characters to common pairings and eventually to larger structures like common words or parts of words.
- Limiting the Vocabulary: There's a limit to the vocabulary size (e.g., 50,000 tokens in GPT-2). Once this limit is reached, the process stops, resulting in a fixed-size vocabulary that includes a mix of characters, common pairings, and more complex tokens.

Assigning token IDs:

- Indexing the Vocabulary: Each unique token in the final vocabulary is assigned a unique numerical index or ID. This is done straightforwardly, much like indexing in a list or array.
- Token ID Representation: In the context of GPT-2, each piece of text (like a word or part of a word) is represented by the ID of the corresponding token in this vocabulary. If a word is not in the vocabulary, it's broken down into smaller tokens that are in the vocabulary.
- Special Tokens: Special tokens (like those representing the start and end of a text or unknown words) are also assigned unique IDs.

The key point is that the assignment of token IDs is not arbitrary but based on the frequency of occurrence and combination patterns in the language data the model was trained on. This allows GPT-2 and similar models to efficiently process and generate human language using a manageable and representative set of tokens. Here, the "vocabulary" refers to all the unique tokens that the model can recognize and work with; it's essentially the set of tokens created from the training data using the given tokenization method.

Phew! That's a lot of stuff to process. Most current-generation LLMs use some variation of BPE; for example, the Mistral model uses a byte-fallback BPE tokenizer. Other tokenization approaches beyond BPE include Unigram, SentencePiece, and WordPiece. Let's not worry about all that. For now, what's important to know is that creating tokens is one of the first steps when dealing with NLP or LLMs, and that different tokenization methods exist to create tokens, which are also assigned token IDs.
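To make the vocabulary-building loop concrete, here is a simplified toy BPE sketch in the spirit of the classic subword algorithm. The corpus, word frequencies, and number of merges are made up for illustration, and real tokenizers (like GPT-2's byte-level BPE) add many details on top of this.

import re
from collections import Counter

# Toy corpus: each word is split into characters plus an end-of-word marker,
# and mapped to its frequency (all values made up for illustration)
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

def get_pair_counts(vocab):
    # Count how often each adjacent pair of symbols occurs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation,
    # matching only whole symbols (hence the whitespace lookarounds)
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

for step in range(5):
    best_pair = get_pair_counts(vocab).most_common(1)[0][0]  # most frequent adjacent pair
    vocab = merge_pair(best_pair, vocab)
    print(f"merge {step + 1}: {best_pair} -> {''.join(best_pair)}")

print(list(vocab))  # words now written with the merged subword symbols

Each merge adds one new symbol to the vocabulary; running the loop to a fixed vocabulary size is exactly the "limiting the vocabulary" step described above.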