What is Deep Learning?

Machine learning is a discipline in which we define a program not by writing it entirely ourselves, but by learning from data. Deep learning is a specialty within machine learning that uses neural networks with multiple layers as the model. A representative example is image classification (also known as image recognition). We start with labeled data: a set of images for which we have assigned a label to each image, indicating what it represents. Our goal is to produce a program, called a model, that, given a new image, will make an accurate prediction about what that new image represents.

Every model starts with a choice of architecture, a general template for how that kind of model works internally. The process of training (or fitting) the model is the process of finding a set of parameter values (or weights) that specialize that general architecture into a model that works well for our particular kind of data. To define how well a model does on a single prediction, we need a loss function, which determines how we score a prediction as good or bad. To make the training process go faster, we might start with a pretrained model: a model that has already been trained on someone else's data. We can then adapt it to our data by training it a bit more on our data, a process called fine-tuning.

When we train a model, a key concern is to ensure that it generalizes: that it learns general lessons from our data that also apply to new items it will encounter, so it can make good predictions on those items. The risk is that if we train our model badly, instead of learning general lessons it effectively memorizes what it has already seen, and then it will make poor predictions about new images. Such a failure is called overfitting. To avoid this, we always divide our data into two parts, the training set and the validation set. We train the model by showing it only the training set, and then we evaluate how well it is doing by seeing how well it performs on items from the validation set. In this way, we check whether the lessons the model learns from the training set generalize to the validation set. For a person to assess how well the model is doing on the validation set overall, we define a metric. Each complete pass through the training set during training is called an epoch.

All these concepts apply to machine learning in general: they apply to all sorts of schemes for defining a model by training it with data. What makes deep learning distinctive is a particular class of architectures: those based on neural networks. In particular, tasks like image classification rely heavily on convolutional neural networks, which we will discuss shortly.
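To make these terms concrete, here is a minimal sketch in fastai-style Python that trains an image classifier along the lines of the steps listed in the next paragraph. The choice of resnet34, the 20% validation split, the 224-pixel resize, the single epoch, and the is_cat labelling rule (cat filenames in this dataset start with an uppercase letter) are illustrative assumptions rather than required settings.

```python
from fastai.vision.all import *

# Download and extract a small labeled dataset of cat and dog photos.
path = untar_data(URLs.PETS)/'images'

# Labelling rule for this dataset: cat image filenames start with an uppercase letter.
def is_cat(filename): return filename[0].isupper()

# Hold out 20% of the images as a validation set; resize every item to 224x224.
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

# Start from a pretrained ResNet-34 architecture and track the error rate
# (a metric) on the validation set.
learn = vision_learner(dls, resnet34, metrics=error_rate)

# Fine-tune the pretrained weights for one epoch on our training set.
learn.fine_tune(1)
```

Running this trains for one epoch and reports the error-rate metric on the held-out validation set, which is how we check that the model is generalizing rather than overfitting.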
To train such a model:

1. The Oxford-IIIT Pet Dataset, which contains 7,349 images of cats and dogs from 37 breeds, will be downloaded from the fast.ai datasets collection to the GPU server you are using, and will then be extracted.
2. A pretrained model that has already been trained on 1.3 million images, using a competition-winning model, will be downloaded from the internet.
3. The pretrained model will be fine-tuned using the latest advances in transfer learning, to create a model that is specially customized for recognizing dogs and cats.

Another key piece of context is that deep learning is just a modern area in the more general discipline of machine learning. Machine learning is, like regular programming, a way to get computers to complete a specific task. But how would we use regular programming to do what we just did in the preceding section: recognize dogs versus cats in photos? We would have to write down for the computer the exact steps necessary to complete the task. Right back at the dawn of computing, in 1949, an IBM researcher named Arthur Samuel started working on a different way to get computers to complete tasks, which he called machine learning. In his classic 1962 essay "Artificial Intelligence: A Frontier of Automation," he wrote:
    Programming a computer for such computations is, at best, a difficult task, not primarily because of any inherent complexity in the computer itself but, rather, because of the need to spell out every minute step of the process in the most exasperating detail. Computers, as any programmer will tell you, are giant morons, not giant brains.
Machine Learning "recommendation system" that can predict what products a user might purchase. This is often used in ecommerce, such as to customize products shown on a home page by showing the highest-ranked items. But such a model is generally created by looking at a user and their buying history (inputs) and what they went on to buy or look at (labels), which means that the model is likely to tell you about products the user already has, or already knows about, rather than new products that they are most likely to be interested in hearing about. That’s very different from what, say, an expert at your local bookseller might do, where they ask questions to figure out your taste, and then tell you about authors or series that you’ve never heard of before. Another critical insight comes from considering how a model interacts with its environment. This can create feedback loops, as described here: 1. A predictive policing model is created based on where arrests have been made in the past. In practice, this is not actually predicting crime, but rather predicting arrests, and is therefore partially simply reflecting biases in existing policing processes. 2. Law enforcement officers then might use that model to decide where to focus their policing activity, resulting in increased arrests in those areas. 3. Data on these additional arrests would then be fed back in to retrain future versions of the model. This is a positive feedback loop: the more the model is used, the more biased the data becomes, making the model even more biased, and so forth.