4. Technical Foundations


Work in progress

This section is under construction. This information hasn’t been reviewed or edited yet!


Introduction

Now that we’ve explored how AI evolved into its current form, let’s lift the hood and examine the engine that powers large language models (LLMs). These systems are marvels of engineering, built on a foundation of interconnected components that work together to process and generate human-like text.

What will I get out of this?

By the end of this section, you will be able to:

  1. Explain the process of tokenization in LLMs, including how it transforms text into computational units and its impact on model performance and cost.
  2. Describe the role of context windows in LLMs, comparing their evolution from early models to modern systems with extended capacities.
  3. Analyze how the Transformer architecture and attention mechanisms enable LLMs to process text more effectively than sequential models.
  4. Differentiate between various types of “memory” implementations in LLMs, including semantic, episodic, and procedural memory, and explain strategies for effective context management.
  5. Evaluate the importance of quality training data in LLM development, considering factors such as diversity and accuracy.
  6. Identify the key components of LLM learning, including weights and biases, and explain their role in pattern recognition and text generation.

How Do LLMs Actually Produce Human-Like Text?

Large language models (LLMs) like GPT and PaLM generate human-like text by leveraging advanced neural network architectures, vast training datasets, and probabilistic techniques. At their core, these models predict the next word (or token) in a sequence based on the context provided, producing coherent and contextually relevant text. Here’s a simplified breakdown of how this process works:

Tokenization:

When we read a sentence, our minds don’t process it letter by letter. Instead, we recognize meaningful chunks—words, phrases, or even entire ideas. LLMs do something similar by breaking text into tokens. A token might be a single word, part of a word, or even a sequence of words depending on how common that sequence is in the model’s training data.

Concept: Tokens

Tokens are the fundamental units of text that LLMs process. They enable the model to break down language into manageable pieces for analysis and generation. Tokenization is the first step in transforming raw text into numerical representations for computation.

Remember: different tokenizers will split the text differently, and this can lead to different results!

For instance, the word ‘understanding’ might be split into ‘under’ and ‘standing,’ while other words such as ‘quantum’ or ‘cognition’ might be kept as a single token each. This varies from tokenizer to tokenizer. This flexibility allows LLMs to handle both familiar and unfamiliar text efficiently.

tokens
Example of tokenization from OpenAI's tokenizer

Contextual Understanding via Embeddings

Once tokenized, the input is transformed into embeddings—mathematical representations of tokens in a high-dimensional space.

These embeddings capture semantic relationships between words:

  • Words with similar meanings (e.g., “cat” and “feline”) are placed closer together in this vector space.
  • Positional encodings are added to these embeddings to help the model understand the order of words in a sentence.
Embeddings

We’ll revisit the topic of embeddings later in this chapter, but for now the important thing to remember is that embeddings form the foundation of transformer-based models like GPT, BERT, LLaMA, etc.

Think of embeddings as the “language” that AI models speak internally. Whether doing simple completion or complex retrieval tasks, the model is always working with embeddings under the hood.

Transformer Architecture

Before 2017, AI models processed text sequentially—like reading a book one word at a time. This approach struggled with long-range dependencies in language (e.g., understanding how the start of a paragraph relates to its end). The Transformer architecture changed everything by enabling models to process entire sequences simultaneously. Introduced by Google’s researchers Vaswani et al. in 2017, it revolutionized the field of natural language processing.

transformer
Vaswani et al. in 2017

Imagine analyzing a painting: instead of focusing on one corner at a time, you take in the whole image while zooming in on details that matter most. Transformers use this principle to understand context across an entire input sequence.

In other words, the Transformer architecture processes input data in parallel rather than sequentially. It uses mechanisms like attention to focus on relevant parts of the input, enabling efficient handling of long-range dependencies in text.

Attention Mechanisms

Here’s an example:

“The trophy didn’t fit in the suitcase because it was too big.”

What does “it” refer to—the trophy or the suitcase? Humans rely on context to resolve this ambiguity. Attention mechanisms give LLMs a similar ability by assigning importance (or “attention”) to different parts of an input sequence.

Concept: Attention Mechanisms

Attention mechanisms allow models to weigh the relevance of different tokens within an input sequence. This capability enables nuanced understanding by focusing on contextually important information.

Transformers take this further with multi-head attention, which lets them analyze multiple aspects of context simultaneously—like examining color, texture, and composition in a painting all at once.

So in our previous “trophy” example, self-attention mechanisms help the model focus on “trophy” when interpreting “it” as the subject of the sentence.

Transformers use multiple layers of attention (multi-head attention) to analyze different aspects of context simultaneously, such as syntax, semantics, or long-range dependencies. Once the model has processed the input, it generates text token by token in an autoregressive manner.

Autoregressive Text Generation

Autoregressive generation refers to a mechanism used by LLMs in which they predict the next token in a sequence based on the previous tokens. This process follows these steps:

  • The model predicts the most likely next token based on the input and previously generated tokens.
  • This predicted token is added to the sequence.
  • The process repeats until a stopping criterion is met, such as reaching a maximum length or encountering an end-of-sequence (EOS) token.
Concept: EOS token

An end-of-sequence token is a special marker that signals to the model that it should stop generating further tokens. It acts as a boundary for text completion, ensuring that the output ends at an appropriate point. For example, it might appear as [EOS] or </s> to mark the end of the generated text that fits the input prompt.

So, in our previous example, the model might predict the next token to be “I” and add it to the sequence, and so on. In our example, the model might generate the following text:

“The trophy didn’t fit in the suitcase because it was too big. I’m going to have to buy a new suitcase.”

Context Windows

A model’s context window represents its ability to “remember” and process information. For instance, a model with a 4,000-token context window can work with roughly 3,000 words of text at once. This capacity affects everything from document analysis to conversation memory.

Concept: Context Windows

Context windows define the maximum amount of input text an LLM can process at once, imposing practical limits on long-form tasks.

Modern models have pushed these boundaries significantly:

  • Early models managed only 512 tokens
  • Mid-generation models reached 2,048-4,096 tokens
  • Modern models such as Google’s Gemini boast more than 2 million token context windows

For perspective, each token represents approximately 4 characters or about 0.75 words in English. This means Gemini 1.5 Pro can theoretically process around 1.5 million words at once - equivalent to multiple novels in a single context window.

Note

This is an important concept to understand, as it is not only the basis of how LLMs work, but also usually the basis on how LLM usage is billed. On API usage from a provider, you usually will pay for the number of tokens processed. Normally input tokens are cheaper, and output tokens are more expensive.


Memory

Memory in LLMs refers to how these models maintain and manage information during conversations or interactions. However, it’s crucial to understand that this “memory” is fundamentally different from human memory.

LLMs are Stateless

LLMs are inherently stateless - each inference (generation of text) is completely independent. They have no built-in ability to remember previous interactions or maintain ongoing conversations. What we call “memory” in LLMs is actually implemented through external systems and careful management of inputs.

Types of Memory Implementation

  1. Semantic Memory (Facts & Knowledge):

    • Stores specific facts and information
    • Example: User preferences, biographical details, or domain-specific knowledge
    • Implementation methods:
      • Profile approach: Single, updated document with user information
      • Collection approach: Multiple documents with specific pieces of information
  2. Episodic Memory (Experiences & Interactions):

    • Records specific interactions or conversations
    • Example: Previous troubleshooting steps in a support conversation
    • Implementation methods:
      • Conversation history management
      • Few-shot example storage and retrieval
  3. Procedural Memory (Instructions & Rules):

    • Defines how the model should behave
    • Example: System prompts, formatting rules, response guidelines
    • Implementation methods:
      • System prompts
      • Instruction fine-tuning
      • Dynamic prompt adjustment

Memory Storage Methods

Remember that LLMs are stateless, so the memory ultimately is an addendum to the prompt. It is an external mechanism that is built to dynamically change the prompt based on the conversation history and the information the LLM may need to know at any one time.

There are several ways to implement memory, but they all essentially boil down to the same thing:

  1. Vector Databases:
    • Store text as numerical representations
    • Enable semantic search
    • Efficient retrieval of relevant information
Concept: Vector Databases

Vector databases store text as numerical representations and enable semantic search, allowing for efficient retrieval of relevant information. We’ll cover them to more detail in the Vector Operations and Embeddings section below.

  1. Traditional Databases:

    • Store structured data
    • Quick access to specific facts
    • Good for user profiles and preferences
  2. Document Collections:

    • Store raw text and documents
    • Flexible and easy to update
    • Useful for maintaining conversation history

Here’s how a chatbot might implement memory. It is not exhaustive, but it shows that there are different use cases for memory beyond keeping track of the conversation history, each requiring different implementation strategies:

1. Store user preference: "Prefers technical explanations"
2. Save conversation history in chunks
3. Before each response:
    - Retrieve relevant preferences
    - Load recent conversation context
    - Generate response considering both
4. After response:
    - Update conversation history
    - Extract and store new learned information

Memory Management Strategies

  1. Summarization:

    • Condense long conversations into key points
    • Example: “Previously discussed: user’s project requirements, budget constraints, and timeline”
    • Helps maintain context while saving tokens
  2. Selective Retention:

    • Keep important information, discard less relevant
    • Prioritize based on:
      • Recency (newer messages)
      • Relevance to current topic
      • Explicit user instructions
  3. Memory Augmentation:

    • External vector databases
    • Knowledge bases
    • API integrations
    • Allows “remembering” beyond context limits

Context Management

Memory management is a crucial aspect of LLMs, and one of the greatest challenges in building with them. It is how the model is able to maintain a coherent conversation with the user. Effective context management is crucial for maintaining coherent conversations. It can be achieved through a combination of techniques:

  • Keep track of token usage
  • Use summarization for long conversations
  • Prioritize recent and relevant information
Think about it this way…

Imagine trying to remember a phone number someone is telling you: Miller’s Law tells us that we can probably hold about 7 digits in our short-term memory easily. If they keep adding more numbers (like an extension, country code, etc.), you’ll start dropping the earlier digits unless you write them down or group them meaningfully. LLMs work similarly with their context window - they can only “remember” a certain amount of information at once.

For example, if you’re in a long conversation about planning a party, you might summarize earlier details like “We decided on Saturday at 7 PM” while keeping recent discussion about food choices fresh in your memory. This is exactly how context management works - keeping the most relevant information while condensing older details to make room for new ones, or removing details that are no longer relevant.


Learning and Training

Understanding how LLMs manage context leads us to a deeper question: how do these models learn to understand and process information in the first place? Just as we humans learn to manage information through experience and training, AI models must be trained to recognize patterns and relationships in data. Let’s explore how this learning process works.

How Does the AI Learn These Relationships?

To understand how AI models learn, let’s use a simple analogy: imagine teaching a child to recognize animals.

Weights: Think of weights as the “strength” of connections in the AI’s “brain”:

  • Just as a child learns that “having fur” is strongly connected to “being a mammal”
  • The AI learns that certain patterns in text are strongly connected to certain meanings
  • These connections are represented by numbers (weights) that the AI adjusts as it learns
Think of it This Way…

A child learning that “meows” is strongly connected to “cats” but weakly connected to “dogs” (because both are sounds animals make). The strength of these connections represents the weights in an AI model 🐱🐶

Biases: These are like the AI’s default assumptions or starting points:

  • They help the model make better predictions based on common patterns
  • For example, after “The sun is…” the model might default to words like “bright” or “hot”
  • However, biases can also reflect problematic assumptions from training data
Important Note About Bias

Just like humans can develop biases from their environment, AI models can learn biases from their training data. For example, they might associate certain jobs with specific genders or favor certain cultural perspectives. Addressing these biases is an important part of AI development.

The Importance of Quality Data

Just as a child learns better from good educational materials, AI models need high-quality training data to perform well. Here’s what makes data “good”:

  1. Diversity: The data should include:

    • Different writing styles (formal, casual, technical)
    • Various topics and perspectives
    • Multiple languages and dialects
  2. Quality: Like choosing good textbooks for students:

    • Accurate and reliable information
    • Well-written and clear content
    • Free from errors and inappropriate material
Think of it This Way…

Imagine learning a new language only from news articles 📰 You might become great at understanding formal writing but struggle with everyday conversations, slang, or technical terminology. AI models face the same challenge - they need exposure to many different types of language use to develop well-rounded understanding.

Quality and Challenges

Like any technology, the quality of an AI model depends on how well it’s built and trained. The weights and biases it learns, combined with the quality of its training data, determine how well it can understand and generate text.

Looking Ahead

In later chapters, we’ll explore more deeply how AI models are trained, how they learn these weights and biases, and how we can ensure they learn from high-quality, unbiased data.


Quiz

Let’s test your understanding!

Want to test your understanding of the technical foundations that power Large Language Models? This quiz focuses on conceptual comprehension rather than memorizing facts.

## When deploying an LLM-powered application that needs to process large technical documents, which factor would most significantly affect both performance and cost? > Hint: Think about how text is prepared for processing and how this affects computational resources. 1. [ ] The programming language used to deploy the model > Not quite. While the programming language might affect implementation efficiency, it's not the primary factor affecting performance and cost when processing large documents. 1. [ ] The tokenization method used to process the text > Not exactly. While tokenization does affect how text is broken into computational units, another factor has a more direct impact on cost and performance. 1. [ ] The visual interface design of the application > Incorrect. While UI design affects user experience, it has minimal impact on the underlying model's performance or cost when processing large documents. 1. [x] The context window token count > Correct! How you manage the context window - deciding what portions of a document to include in each API call and how to segment large documents - directly impacts token usage, which determines cost. It also affects the model's ability to maintain coherence across sections, which impacts performance and output quality. ## Why might an application with a 4K token context window struggle to analyze a 20-page legal document, even though the raw text is fewer than 4,000 words? > Hint: Consider the relationship between words, characters, and tokens. 1. [ ] The model isn't trained on legal documents > This isn't the primary technical reason. While domain expertise matters, the question is about context window limitations, not knowledge limitations. 1. [ ] The context window is measured in bytes, not words > Not accurate. Context windows are measured in tokens, not bytes. 1. [x] Tokens don't map directly to words, and specialized legal terminology may require more tokens per word > Correct! Tokens are roughly 0.75 words in English on average, but specialized or uncommon terminology (like legal jargon) often gets split into multiple tokens. A 4,000-word legal document might easily exceed 6,000-7,000 tokens due to complex terminology and formatting. 1. [ ] The context window includes both input and output tokens > While technically true that context windows include both input and output, this doesn't explain why a document with fewer words than the token limit would exceed it. ## An LLM chatbot suddenly seems to "forget" information a user provided earlier in a conversation. What is the most likely technical explanation? > Hint: Consider the fundamental nature of how LLMs manage information over time. 1. [ ] The model's weights were updated during the conversation > Incorrect. Model weights aren't typically updated during inference (conversation) - that happens during training. 1. [ ] The user's earlier messages were corrupted in the database > This is possible but not the most likely technical explanation based on how LLMs function. 1. [x] The conversation exceeded the model's context window, and earlier messages were truncated > Correct! LLMs are stateless and rely on their context window to "remember" previous interactions. Once a conversation exceeds this limit, earlier messages are typically removed, causing the model to "forget" information that was trimmed from its context. 1. [ ] The model experienced catastrophic forgetting > Not accurate in this context. Catastrophic forgetting refers to a phenomenon during training where new information overwrites previously learned information, not to limitations during a conversation. ## Why do modern LLMs use attention mechanisms instead of simply processing text sequentially? > Hint: Think about the limitations of sequential processing versus the advantages of seeing relationships across an entire input. 1. [ ] Attention mechanisms require less computational power > Incorrect. Attention mechanisms are actually more computationally intensive than sequential processing, but the benefits outweigh the costs. 1. [ ] Sequential processing is impossible with neural networks > Not true. Earlier models like RNNs and LSTMs processed text sequentially. 1. [x] Attention allows the model to identify relationships between words regardless of their distance in text > Correct! Attention mechanisms enable models to consider relationships between all words in a sequence simultaneously, rather than forgetting earlier context as in sequential models. This helps resolve references like "it" in "The trophy didn't fit in the suitcase because it was too big" by weighing relevance across the entire input. 1. [ ] Attention mechanisms are easier to implement > Incorrect. Attention mechanisms are actually more complex to implement than sequential processing methods.

Coming up next

Now that we understand how LLMs represent and process information, we’re ready to explore how they’re actually used in real-world applications. In the next section, we’ll look at practical examples of AI systems and their impact across different industries.