CUDA

Quick checks to verify that PyTorch can see the GPU:

import torch
torch.cuda.is_available()   # True if a CUDA device is visible to PyTorch
torch.zeros(1).cuda()       # allocate a small tensor on the GPU to confirm it actually works
torch.__version__           # installed PyTorch version
model.device                # device on which a model's parameters currently live

Outside Python, nvidia-smi shows GPU utilisation and the driver/CUDA versions.
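Below is a small illustrative sketch (not part of the original notes) of the usual follow-up pattern: pick a device once and move the model and its inputs onto it.

import torch

# Fall back to the CPU when no CUDA device is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Any torch.nn.Module can be moved the same way; the Linear layer here is just a placeholder.
model = torch.nn.Linear(4, 2).to(device)
x = torch.zeros(1, 4, device=device)
print(model(x).device)  # confirms the forward pass runs on the chosen device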
Streamlit

A small app demonstrating the most common Streamlit input widgets:
import streamlit as st
import datetime
st.title("Form for the Users")
st.write("Here, you can answer to some questions in this form.")
user_id = st.text_input("ID", value="Your ID", max_chars=7)
age = st.number_input("Age", min_value=18, max_value=100, step=1)
b_date = st.date_input("Date of Birth", min_value=datetime.date(1921, 1, 1), max_value=datetime.date(2033, 12, 31))
smoke = st.checkbox("Do you smoke?")
genre = st.radio("Which movie genre do you like?", options=['horror', 'adventure', 'romantic'])
weight = st.slider("Choose your weight", min_value=40., max_value=150., step=0.5)
p_form = st.selectbox("Select level of your physical condition", options=["Bad", "Normal", "Good"])
colors = st.multiselect('What are your favorite colors', options=['Green', 'Yellow', 'Red', 'Blue', 'Pink'])
info = st.text_area("Share some information about you", "Put information here", help='You can write about your hobbies or family')
image = st.file_uploader("Upload your photo", type=['jpg', 'png'])
click = st.sidebar.button('Click me!')
if click:
    st.sidebar.write("You clicked the button")
col1, col2 = st.columns(2)
with col1:
    st.image("https://static.streamlit.io/examples/cat.jpg", width=300)
    st.button("Like cats")
with col2:
    st.image("https://static.streamlit.io/examples/dog.jpg", width=355)
    st.button("Like dogs")
submit = st.button("Submit")
if submit:
    st.write("You submitted the form")
Chroma vector store

First, we create a Chroma vector store from the chunked documents and persist it to disk.

from langchain.vectorstores import Chroma

persist_directory = 'vector_store'

# split_docs (the documents already split into chunks) and embedding
# (the embeddings object used to encode them) are assumed to be defined earlier.
vectordb = Chroma.from_documents(
    documents=split_docs,
    embedding=embedding,
    persist_directory=persist_directory
)
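Depending on the Chroma and LangChain versions, you may also need to flush the collection to disk explicitly; newer Chroma versions persist automatically, so treat this as an optional, version-dependent step:

vectordb.persist()  # write the collection to persist_directory (older Chroma versions only)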
The next time you need the data, you can load it straight from disk with the following code.
from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
The database initialisation might take a couple of minutes since Chroma needs to load all the documents and
get their embeddings using the OpenAI API.
We can see that all documents have been loaded.
print(vectordb._collection.count())
12890
Now, we could use a similarity search to find top customer comments about staff politeness.
query_docs = vectordb.similarity_search('politeness of staff', k=3)
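To eyeball what came back, we can print the source and the beginning of each returned chunk (a quick inspection snippet, not from the original article):

for doc in query_docs:
    # Each result is a Document carrying the chunk text and its metadata.
    print(doc.metadata['source'])
    print(doc.page_content[:200])
    print('---')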
The documents look pretty relevant to the question.

So far, we've used vectordb.similarity_search to retrieve the chunks most related to the question. In most cases, such an approach will work for you, but there could be some nuances:
- Lack of diversity: the model might return extremely close texts (even duplicates), which won't add much new information to the LLM.
- Ignoring metadata: similarity_search doesn't take into account the metadata information we have. For example, if I query the top-5 comments for the question “breakfast in Travelodge Farringdon”, only three comments in the result will have the source equal to uk_england_london_travelodge_london_farringdon.
- Context size limitation — as usual, we have limited LLM context size and need to fit our documents
into it.
Let’s discuss techniques that could help us to solve these problems.
Addressing Diversity — MMR (Maximum Marginal Relevance)
MMR first fetches the fetch_k documents most similar to the question using similarity_search. Then, it picks the k most diverse documents among them.
If we want to use MMR, we should call max_marginal_relevance_search instead of similarity_search and specify the fetch_k number. It's worth keeping fetch_k relatively small so that you don't end up with irrelevant answers in the output. That's it.
query_docs = vectordb.max_marginal_relevance_search(
    'politeness of staff', k=3, fetch_k=30
)
Let's look at the results for the same query. We got more diverse feedback this time; there's even a comment with negative sentiment.
To take metadata into account, we can pass an explicit filter to similarity_search. For example, let's limit the search to reviews of Travelodge Farringdon:

query_docs = vectordb.similarity_search(
    'breakfast in Travelodge Farringdon',
    k=5,
    filter={'source': 'hotels/london/uk_england_london_travelodge_london_farringdon'}
)
Now, let's try to use an LLM to come up with such a filter automatically. We need to describe all our metadata parameters in detail and then use SelfQueryRetriever.
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Describe each metadata field so the LLM knows how to build filters on it.
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="All sources start with 'hotels/london/uk_england_london_', "
                    "followed by the hotel chain, the constant 'london_' and the location.",
        type="string",
    )
]
document_content_description = "Customer reviews for hotels"

llm = OpenAI(temperature=0.1)  # low temperature to make model more factual
# by default 'text-davinci-003' is used

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)
question = "breakfast in Travelodge Farringdon"
docs = retriever.get_relevant_documents(question, k=5)
Our case is tricky since the source parameter in the metadata consists of multiple fields: country, city, hotel chain and location. In such situations, it's worth splitting complex parameters into more granular ones so that the model can easily understand how to use metadata filters; a possible sketch is shown below.
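As an illustrative sketch (the split_source helper and the granular field names are assumptions, not part of the original pipeline), the composite source string could be decomposed into separate metadata fields before indexing:

def split_source(source: str) -> dict:
    # e.g. 'hotels/london/uk_england_london_travelodge_london_farringdon'
    # A naive split that assumes single-word hotel chain names like 'travelodge'.
    filename = source.split('/')[-1]
    country, region, city, chain, _, location = filename.split('_', 5)
    return {'country': country, 'region': region, 'city': city,
            'chain': chain, 'location': location}

# Enrich each chunk's metadata before building the vector store.
for doc in split_docs:
    doc.metadata.update(split_source(doc.metadata['source']))

Each granular field could then be described with its own AttributeInfo entry, which is much easier for the model to use in filters than a single composite string.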
However, even with the composite field, a sufficiently detailed description worked and returned only documents related to Travelodge Farringdon.
But I must confess, it took me several iterations to achieve this result.
Let’s switch on debug and see how it works. To enter debug mode, you just need to execute the code
below.
import langchain
langchain.debug = True
The complete prompt is pretty long, so let's look at its main parts. The prompt starts with an overview of what we expect from the model and the main criteria for the result. Then, the few-shot prompting technique is used: the model is provided with two examples of input and expected output.
We are not using a chat model like ChatGPT here but a general-purpose LLM (not fine-tuned to follow instructions). Such a model is trained simply to predict the next tokens for a text. That's why the prompt ends with our question followed by the string "Structured output:", expecting the model to complete it with the answer.