Evolution, Visualisation & Applications of Embeddings

As human beings, we can read and understand texts (at least some of them). Computers, in contrast, "think in numbers", so they can't automatically grasp the meaning of words and sentences. If we want computers to understand natural language, we need to convert this information into a format that computers can work with — vectors of numbers. People learned how to convert texts into a machine-readable format many years ago (one of the first versions was ASCII). Such an approach helps render and transfer texts but doesn't encode their meaning. Back then, the standard search technique was keyword search, where you simply looked for all the documents that contained specific words or N-grams. Then, decades later, embeddings emerged. We can calculate embeddings for words, sentences, and even images. Embeddings are also vectors of numbers, but they can capture the meaning. So you can use them to do semantic search and even work with documents in different languages. In this article, I would like to dive deeper into the embedding topic and discuss all the details:

- what preceded embeddings and how they evolved,
- how to calculate embeddings using OpenAI tools,
- how to define whether sentences are close to each other,
- how to visualise embeddings,
- and, the most exciting part, how you could use embeddings in practice.

Let's move on and learn about the evolution of embeddings.

Tokens and Embeddings in LLMs

In the world of natural language processing, a token is the smallest unit of analysis that we define. What you call a token depends on your tokenization method; plenty of such methods exist. Creating tokens is basically the first step for most NLP tasks.

Tokenization Methods in NLP

# Example string for tokenization
example_string = "It's over 9000!"

# Method 1: White Space Tokenization
# This method splits the text based on white spaces
white_space_tokens = example_string.split()

# Method 2: WordPunct Tokenization
# This method splits the text into words and punctuation
from nltk.tokenize import WordPunctTokenizer
wordpunct_tokenizer = WordPunctTokenizer()
wordpunct_tokens = wordpunct_tokenizer.tokenize(example_string)

# Method 3: Treebank Word Tokenization
# This method uses the standard word tokenization of the Penn Treebank
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(example_string)

white_space_tokens, wordpunct_tokens, treebank_tokens

(["It's", 'over', '9000!'],
 ['It', "'", 's', 'over', '9000', '!'],
 ['It', "'s", 'over', '9000', '!'])

Why do we need to tokenize strings?

- To break down complex text into manageable units.
- To present text in a format that is easier to analyze or perform operations on.
- To support specific linguistic tasks like part-of-speech tagging, syntactic parsing, and named entity recognition.
- To uniformly preprocess text in NLP applications and create structured training data.
Most NLP systems perform some operations on these tokens to carry out a specific task. For example, we can design a system to take a sequence of tokens and predict the next token. We can also convert the tokens into their phonetic representation as part of a text-to-speech system. Tokens also underpin many other NLP tasks, such as keyword extraction and translation.
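To make the "predict the next token" idea concrete, here is a toy sketch (purely illustrative, and nothing like how LLMs actually do it): a bigram model that counts which token follows which in a tiny made-up, whitespace-tokenized corpus and predicts the most frequent successor.

from collections import Counter, defaultdict

# Tiny whitespace-tokenized corpus (made up for illustration)
corpus = "the cat sat on the mat the cat slept on the sofa".split()

# Count how often each token follows each other token (a bigram model)
successors = defaultdict(Counter)
for current_token, next_token in zip(corpus, corpus[1:]):
    successors[current_token][next_token] += 1

def predict_next(token):
    # Return the most frequent successor seen in the corpus, if any
    counts = successors.get(token)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' (seen twice after 'the' in the corpus)
print(predict_next("on"))   # 'the'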

How do we actually use these tokens to build these systems in the first place?

1. Feature Extraction: Tokens are used to extract features that are fed into machine learning models. Features might include the tokens themselves, their frequency, their part-of-speech tags, their position in a sentence, etc. For instance, in sentiment analysis, the presence of certain tokens might be strongly indicative of positive or negative sentiment.
2. Vectorization: In many NLP tasks, tokens are converted into numerical vectors using techniques like Bag of Words (BoW), TF-IDF (Term Frequency-Inverse Document Frequency), or word embeddings (like Word2Vec, GloVe). This process turns text data into numbers that machine learning models can understand and work with (see the short sketch after this list).
3. Sequence Modeling: For tasks like language modeling, machine translation, and text generation, tokens are used in sequence models like Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), or Transformers. These models learn to predict sequences of tokens, understanding the context and the likelihood of token occurrence.
4. Training the Model: In the training phase, models are fed tokenized text and corresponding labels or targets (like categories for classification tasks or next tokens for language models). The models learn patterns and associations between the tokens and the desired output.
5. Context Understanding: Advanced models like BERT and GPT use tokens to understand context and generate embeddings that capture the meaning of a word in a specific context. This is crucial for tasks where the same word can have different meanings based on its usage.

In very simple terms, we have text strings that we convert to independent units called tokens. This makes it easier to convert them to "numbers" later, which the computer understands.
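As promised in point 2 above, here is a short vectorization sketch. It assumes scikit-learn is installed; the two example documents are made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Bag of Words: each document becomes a vector of raw token counts
bow = CountVectorizer()
bow_matrix = bow.fit_transform(documents)
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

# TF-IDF: counts are reweighted so tokens shared by every document matter less
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)
print(tfidf_matrix.toarray().round(2))

Each row of the resulting matrices is a numerical vector representing one document, which is exactly the form a downstream machine learning model can consume.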

ChatGPT and Tokens

What do tokens look like in the context of LLMs like ChatGPT? The tokenization methods used for LLMs differ from those used in general NLP. Broadly speaking, we can call it "subword tokenization": we create tokens that are not necessarily complete words, as they would be in whitespace tokenization. This is precisely why one word is not equal to one token. When people say GPT-4 Turbo has a 128K-token context length, that is not exactly 128K words, but a number close to it.

Why use such different and more complicated tokenization methods?

1. Subword tokens are more intricate representations of language than complete words.
2. They help cover a large vocabulary, including rare and unknown words.
3. Working with smaller subunits is computationally more efficient.
4. They help with better contextual understanding.
5. They are more adaptable across languages that can be quite different from English.
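To see subword tokenization in action, here is a small sketch. It assumes the tiktoken package is installed; encoding_for_model("gpt-4") loads the tokenizer family used by GPT-4-era OpenAI models, so the exact token counts here are illustrative rather than guaranteed.

import tiktoken

# Load the tokenizer used by GPT-4-era OpenAI models
enc = tiktoken.encoding_for_model("gpt-4")

text = "Tokenization of uncommon words like 'electroencephalography'"
token_ids = enc.encode(text)

# One word is usually not one token: rare words get split into subword pieces
print(len(text.split()), "words ->", len(token_ids), "tokens")
print([enc.decode([token_id]) for token_id in token_ids])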

Byte-Pair-Encoding (BPE)

Many open-source models, like Meta's Llama 2 and the older GPT models, use a version of this method. In a real-world context, BPE analyzes a large amount of text to determine the most common pairs.

from transformers import GPT2Tokenizer

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "It's over 9000!"

# Tokenize the text
token_ids = tokenizer.encode(text, add_special_tokens=True)

# Output the token IDs
print("Token IDs:", token_ids)

# Convert token IDs back to raw tokens and output them
raw_tokens = [tokenizer.decode([token_id]) for token_id in token_ids]
print("Raw tokens:", raw_tokens)

Token IDs: [1026, 338, 625, 50138, 0]
Raw tokens: ['It', "'s", ' over', ' 9000', '!']

Let's break down how this process works.

Building the "vocabulary" (this is basically part of the BPE method):

- Starting with Characters: Initially, the vocabulary consists of individual characters (like letters and punctuation).
- Finding Common Pairs: The training data (a large corpus of text) is scanned to find the most frequently occurring pairs of characters. For example, if 'th' appears often, it becomes a candidate to be added to the vocabulary.
- Merging and Creating New Tokens: These common pairs are then merged to form new tokens. The process continues iteratively, each time identifying and merging the next most frequent pair. The vocabulary grows from individual characters to common pairings and eventually to larger structures like common words or parts of words.
- Limiting the Vocabulary: There's a limit to the vocabulary size (e.g., 50,000 tokens in GPT-2). Once this limit is reached, the process stops, resulting in a fixed-size vocabulary that includes a mix of characters, common pairings, and more complex tokens.

Assigning token IDs:

- Indexing the Vocabulary: Each unique token in the final vocabulary is assigned a unique numerical index or ID. This is done straightforwardly, much like indexing in a list or array.
- Token ID Representation: In the context of GPT-2, each piece of text (like a word or part of a word) is represented by the ID of the corresponding token in this vocabulary. If a word is not in the vocabulary, it's broken down into smaller tokens that are in the vocabulary.
- Special Tokens: Special tokens (like those representing the start and end of a text or unknown words) are also assigned unique IDs.

The key point is that the assignment of token IDs is not arbitrary but based on the frequency of occurrence and combination patterns in the language data the model was trained on. This allows GPT-2 and similar models to efficiently process and generate human language using a manageable and representative set of tokens. Here, the "vocabulary" refers to all the unique tokens that the model can recognize and work with; it's essentially the set of tokens created from the training data using the given tokenization method.

Phew! That's a lot of stuff to process. Most current-generation LLMs use some variation of BPE; for example, the Mistral model uses a byte-fallback BPE tokenizer. Other tokenization approaches beyond BPE include Unigram, SentencePiece, and WordPiece. Let's not worry about all that. For now, what's important to know is that creating tokens is one of the first steps when dealing with NLP or LLMs, and that different tokenization methods exist to create tokens, which are also assigned token IDs.
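To make the vocabulary-building loop concrete, here is a simplified toy BPE sketch in the spirit of the classic subword algorithm. The corpus, word frequencies, and number of merges are made up for illustration, and real tokenizers (like GPT-2's byte-level BPE) add many details on top of this.

import re
from collections import Counter

# Toy corpus: each word is split into characters plus an end-of-word marker,
# and mapped to its frequency (all values made up for illustration)
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

def get_pair_counts(vocab):
    # Count how often each adjacent pair of symbols occurs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation,
    # matching only whole symbols (hence the whitespace lookarounds)
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

for step in range(5):
    best_pair = get_pair_counts(vocab).most_common(1)[0][0]  # most frequent adjacent pair
    vocab = merge_pair(best_pair, vocab)
    print(f"merge {step + 1}: {best_pair} -> {''.join(best_pair)}")

print(list(vocab))  # words now written with the merged subword symbols

Each merge adds one new symbol to the vocabulary; running the loop to a fixed vocabulary size is exactly the "limiting the vocabulary" step described above.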