Exploring GPT Models: Understanding Their Inner Workings
Chapter 1: Introduction to GPT Models
In recent years, artificial intelligence has captured significant attention, primarily due to the emergence of GPT-based large language models. While this technique isn't entirely novel—LSTM (Long Short-Term Memory) networks were introduced in 1997, and the influential paper "Attention is All You Need" appeared in 2017—these developments laid the groundwork for modern natural language processing. It wasn’t until 2020 that GPT-3's performance became impressive enough to be applicable in real-world scenarios, not just in academic contexts.
Nowadays, many users interact with GPT via web browsers, yet only a small fraction truly understand its underlying mechanisms. The model's clever and engaging responses can create the illusion of conversing with an intelligent entity, but is that actually the case? The best way to grasp its workings is to delve into its architecture. In this article, we will set up a GPT model from OpenAI locally and explore its functionality step by step.
This piece is designed for beginners and those interested in programming and data science. I will use Python for demonstrations, but extensive knowledge of Python isn't a prerequisite. Let’s dive in!
Loading the Model
For our exploration, we'll utilize the GPT-2 "Large" model developed by OpenAI in 2019. While this model was once considered cutting-edge, it has since lost its commercial relevance and can now be freely downloaded from HuggingFace. Importantly, the architecture of GPT-2 resembles that of its successors, albeit with different parameter counts:
- The GPT-2 "Large" model contains 0.7 billion parameters (GPT-3 boasts 175 billion, and GPT-4 is rumored to have 1.7 trillion).
- GPT-2 consists of 36 layers with 20 attention heads (GPT-3 has 96 layers, while GPT-4 is rumored to have 120).
- It has a context length of 1024 tokens (GPT-3 supports 2048, and GPT-4 is speculated to manage 128K).
While GPT-3 and GPT-4 excel in benchmark tests compared to GPT-2, they are not readily available for download. Even if they were, operating a model with 175 billion parameters necessitates a highly advanced computing setup. Therefore, for our understanding, GPT-2 suffices.
To implement the model in Python, we require two components: the model itself and the tokenizer.
from transformers import GPT2LMHeadModel, GPT2Tokenizer, set_seed
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
The transformers library efficiently handles the model download during the initial setup. Now, let’s explore how to utilize it.
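As a quick sanity check of the numbers listed above, we can count the parameters of the loaded model and read its configuration. This is a small sketch; the attribute names (n_layer, n_head, n_ctx) come from the GPT-2 configuration class in transformers:
# Count the parameters and inspect the key architecture settings of gpt2-large
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")
#> 0.77B parameters
print(model.config.n_layer, model.config.n_head, model.config.n_ctx)
#> 36 20 1024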
Understanding the Tokenizer
A tokenizer is essential in any language model. Neural networks can't process text in its raw form; instead, the tokenizer translates text into a numerical array.
print(tokenizer("Paris will", return_tensors="pt"))
#> tensor([[40313, 481]])
In this example, the phrase "Paris will" is transformed into the tensor [40313, 481]. Here, "Paris" corresponds to the single token 40313. Conversely, we can decode the token back into text:
print(tokenizer.decode([40313]))
#> Paris
Interestingly, encoding the lowercase "paris" yields two tokens:
print(tokenizer.encode("paris"))
#> [1845, 271]
This indicates that only the most common words receive single-token representations, while others are split into smaller parts. This approach allows the model to handle new, unfamiliar, or misspelled words effectively, as illustrated below.
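To see this in action (a small illustration; the exact split depends on the vocabulary), we can encode a misspelled word and decode each resulting piece separately:
tokens = tokenizer.encode("Lodnon")  # a misspelled "London"
print(len(tokens))  # more than one token
print([tokenizer.decode([t]) for t in tokens])  # the individual sub-word pieces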
For those interested, the complete GPT-2 vocabulary is available in JSON format on GitHub, featuring 50,257 entries. Notably, GPT-3 and GPT-4 utilize a different tokenizer, known as tiktoken, although it follows a similar logic:
import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokenizer.encode("Paris")
#> [60704]
tokenizer.encode("paris")
#> [1768, 285]
However, tiktoken also provides the original GPT-2 encoding, so it can be used with the GPT-2 model as well:
tiktoken.get_encoding("gpt2").encode("Paris")
#> [40313]
Generating Output with the GPT Model (Simple Approach)
Now that we can convert text into tokens, we can use the GPT model to generate output. With the transformers library, this can be accomplished in just a few lines of code:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
text = "Paris will"
model_input = tokenizer.encode(text, return_tensors="pt")
set_seed(42)
output = model.generate(model_input,
max_length=32,
pad_token_id=tokenizer.eos_token_id)[0]
print(output)
#> tensor([40313, 481, 307, 262, 717, 1748, 287, 262, 995,
#> 284, 423, 257, 3938, 16359 ... ])
print(tokenizer.decode(output))
#> Paris will be the first city in the world to have a fully
#> automated train system ...
As previously mentioned, the model's output is also a tensor, necessitating a reverse conversion to yield text. The set_seed function initializes the internal random number generator, ensuring consistent responses from the model for identical prompts.
Generating Output with the GPT Model (Advanced Approach)
While we successfully generated output using the GPT model, the process remains somewhat obscured within the library's implementation. Let's explore the generation process at a more granular level, analyzing it token by token.
First, we need to load the model and prepare the prompt:
import torch
import torch.nn.functional as F
text = "London was"
model_input = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # add a batch dimension
#> tensor([[23421, 373]])
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
model.eval()  # inference mode: disables dropout
Next, we will initiate the generation:
outputs = model(model_input, labels=model_input)
loss, logits = outputs[:2]
The logits output contains raw, unnormalized scores, one per vocabulary token, which become probabilities after a softmax. To analyze it further:
print(logits)
#> tensor([[[ 1.1894, 4.7469, 0.0803, ..., -6.4411, -5.3999, 1.9996],
#> [-0.1938, 1.7015, -3.6939, ..., -6.9758, -3.3617, 0.0359]]])
print(logits.shape)
#> torch.Size([1, 2, 50257])
This output illustrates that GPT is a language model trained on vast amounts of text, providing output probabilities. For instance, given the phrase "London was," the next token "a" is statistically more likely than "no."
Moreover, GPT is an auto-regressive model that generates tokens one step at a time. The output is "shifted right": we fed in two tokens and received two sets of scores, each predicting the token that follows its position. The output shape of [1, 2, 50257] therefore means that for each of the two input positions we get a score for every token in the 50257-entry vocabulary.
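As a quick check of that claim (a small sketch reusing the logits computed above), we can compare the probabilities the model assigns to the tokens " a" and " no" as continuations of "London was":
# Normalize the scores for the last position and look up two candidate tokens
probs = F.softmax(logits[0, -1, :], dim=-1)
id_a = tokenizer.encode(" a")[0]   # note the leading space: GPT-2 tokens include it
id_no = tokenizer.encode(" no")[0]
print(probs[id_a].item(), probs[id_no].item())  # " a" receives a noticeably higher probability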
In practice, we focus on the last token generated. To obtain the next token, we will extract the last tensor and identify the tokens with the highest probabilities:
logits = logits[:, -1, :]
#> tensor([[-0.1938, 1.7015, -3.6939, ..., -6.9758, -3.3617, 0.0359]])
top_k = 5
indices_to_remove = logits[0] < torch.topk(logits[0], top_k)[0][..., -1, None]  # True for every token scoring below the k-th best
logits[:, indices_to_remove] = -float("Inf")  # mask them out so softmax gives them zero probability
next_tokens = torch.multinomial(F.softmax(logits, dim=-1), num_samples=5)  # sample from what remains (here, all of the top 5)
print(tokenizer.decode(next_tokens.squeeze()))
#> "one", "the", "a", "also", "not"
Here, we display the five most probable tokens corresponding to our prompt "London was." Typically, we only need one. We also append the new token to the input for the next iteration:
next_token = torch.tensor([[next_tokens[0][0]]])
model_input = torch.cat((model_input, next_token), dim=1)
#> [23421, 373, 530]
We are now ready for the next step. The process remains consistent; the only difference is we will input three tokens:
outputs = model(model_input, labels=model_input)
loss, logits = outputs[:2]
print(logits)
#> tensor([[[ 2.1159, 4.8143, -0.3819, ..., -8.6419, -5.5092, 1.1465],
#> [ 0.4149, 1.4974, -2.9283, ..., -7.9501, -3.9587, 0.1875],
#> [-1.2257, 0.9350, -4.2245, ..., -7.4362, -4.9682, -0.8710]]])
print(logits.shape)
#> torch.Size([1, 3, 50257])
Having sent three tokens, we receive three probability arrays, each of size 50257, even though only the last one is needed to pick the next token. Recomputing everything at every step is somewhat inefficient, which is part of why generation benefits from high-performance GPUs.
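As an aside (a minimal sketch, not part of the step-by-step walkthrough), the transformers implementation can return its attention cache via past_key_values, so each subsequent step only has to process the newest token instead of re-running the whole sequence; model.generate uses this cache by default. Greedy picks are used here for brevity:
# Generate a few tokens while reusing the attention cache (KV cache)
ctx = torch.tensor(tokenizer.encode("London was")).unsqueeze(0)
with torch.no_grad():
    out = model(ctx, use_cache=True)
    for _ in range(5):
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
        ctx = torch.cat((ctx, next_id), dim=1)
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)  # only the new token is fed in
print(tokenizer.decode(ctx.squeeze()))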
Let's again take the logits for the last position and retrieve the top five candidates for the next word:
logits = logits[:, -1, :]
indices_to_remove = logits[0] < torch.topk(logits[0], top_k)[0][..., -1, None]
logits[:, indices_to_remove] = -float("Inf")
next_tokens = torch.multinomial(F.softmax(logits, dim=-1), num_samples=5)
print(tokenizer.decode(next_tokens.squeeze()))
#> "of", "city", "such", "place", "the"
We can then append the latest token to our input:
next_token = torch.tensor([[next_tokens[0][0]]])
model_input = torch.cat((model_input, next_token), dim=1)
#> [23421, 373, 530, 286]
By repeating this process, we will eventually generate a complete phrase:
print(model_input)
#> tensor([[23421, 373, 530, 286, 262, 1178, 4113, 326,
#> 550, 262, 11917, 284, 1302, 510, 284, 262, 1230, 13]])
print(tokenizer.decode(model_input.squeeze()))
#> London was one of the few places that had the courage to stand
#> up to the government.
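For completeness, here is the whole procedure combined into a single loop (a sketch; because of the sampling step, the exact phrase will vary between runs and seeds):
# A minimal top-k generation loop combining the steps shown above
model_input = torch.tensor(tokenizer.encode("London was")).unsqueeze(0)
top_k = 5
for _ in range(16):
    with torch.no_grad():
        logits = model(model_input).logits[:, -1, :]  # scores for the next position only
    indices_to_remove = logits[0] < torch.topk(logits[0], top_k)[0][..., -1, None]
    logits[:, indices_to_remove] = -float("Inf")  # keep only the top-k candidates
    next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)  # sample one of them
    model_input = torch.cat((model_input, next_token), dim=1)
print(tokenizer.decode(model_input.squeeze()))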
Sampling from the top-k most probable tokens is merely one method of selecting the next token; alternatives include greedy search, beam search, and top-p (nucleus) sampling. Readers seeking further information can refer to a 2020 blog post from HuggingFace on text-generation strategies.
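For example, top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability reaches a threshold, rather than a fixed number of candidates. A minimal sketch (the 0.9 threshold is just an example value):
# Top-p (nucleus) sampling for the next token after the current model_input
with torch.no_grad():
    last_logits = model(model_input).logits[:, -1, :]
top_p = 0.9
probs = F.softmax(last_logits, dim=-1)
sorted_probs, sorted_ids = torch.sort(probs, descending=True, dim=-1)
cumulative = torch.cumsum(sorted_probs, dim=-1)
keep = cumulative - sorted_probs < top_p  # tokens whose preceding cumulative mass is still below top_p
sorted_probs[~keep] = 0.0  # drop the long tail; multinomial renormalizes the rest
sample_pos = torch.multinomial(sorted_probs, num_samples=1)
next_token = sorted_ids.gather(-1, sample_pos)
print(tokenizer.decode([next_token.item()]))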
Conclusion
In this article, we successfully operated the GPT model, generating output token by token. I hope this guide helps readers grasp the intricacies of how GPT functions.
With this knowledge, we may also ponder whether this model possesses any form of consciousness. The answer seems clear. On one hand, the GPT model has been trained on terabytes of data, giving it extensive factual knowledge. On the other hand, several issues in the generation process indicate it lacks true consciousness:
- The GPT model is static; its knowledge is constrained to the date of its training. It exists as an array of weights stored in a file, incapable of having thoughts or intentions on its own.
- The model lacks memory and cannot acquire new information. The model file is "read-only," and every new request is computed from scratch. While modern libraries like LangChain can summarize prior conversations and integrate them into subsequent prompts, the GPT model itself does not retain memory.
- Most critically, there is an absence of self-awareness. Although GPT can generate impressive responses to prompts, it doesn't produce anything autonomously. Repeatedly sending the same prompt with the same random seed yields identical responses, because the model has no inner state or "mind" that could change between requests.
Could these challenges be addressed in the future? This remains a billion-dollar question. The pace of AI advancements is rapid, and predictions regarding the arrival of Artificial General Intelligence (AGI) vary widely—from claims of achieving it within five years to estimates suggesting it may take decades (30-50+ years). Currently, numerous teams and individuals worldwide are striving toward this goal, but the timeline for success remains uncertain.
Thank you for reading! If you found this article insightful, consider subscribing to Medium for updates on my future writings, along with access to countless stories from other authors. Feel free to connect with me on LinkedIn. For those interested in the complete source code related to this and other posts, visit my Patreon page.
Additionally, those keen on language models and natural language processing may enjoy my other articles:
- A Weekend AI Project (Part 1): Running Speech Recognition and a LLaMA-2 GPT on a Raspberry Pi
- A Weekend AI Project (Part 2): Using Speech Recognition, PTT, and a Large Action Model on a Raspberry Pi
- A Weekend AI Project (Part 3): Making a Visual Assistant for People with Vision Impairments
- LLMs for Everyone: Running LangChain and a MistralAI 7B Model in Google Colab
- LLMs for Everyone: Running the LLaMA-13B model and LangChain in Google Colab
- 16, 8, and 4-bit Floating Point Formats — How Does it Work?
This video titled "How GPT3 Works - Easily Explained with Animations" provides a visual explanation of the inner workings of GPT-3, breaking down complex concepts into digestible animations.
The video "GPT - Explained!" offers a concise overview of GPT technology, its applications, and the principles behind its operation.