6. Inference Techniques


Work in progress

This section is under construction. This information hasn’t been reviewed or edited yet!


Introduction

Now that we’ve explored the fundamentals of LLMs, key players in the market, deployment considerations, technical foundations, and the art of prompt engineering, it’s time to dive into how these models actually operate in real-world applications. This section will examine the technical aspects of inference—the process where LLMs generate responses to our inputs—focusing on API integration patterns, response handling strategies, knowledge integration techniques, and optimization methods. Understanding inference techniques is crucial for implementing LLMs effectively, whether you’re building a simple chatbot or a sophisticated enterprise application with access to proprietary knowledge.

What will I get out of this?

By the end of this section, you will be able to:

  1. Differentiate between the Completion API and Chat API, describing their respective features, use cases, and advantages for various LLM integration scenarios.
  2. Analyze response handling strategies (synchronous vs. streaming) and evaluate their suitability for different application requirements.
  3. Explain knowledge integration techniques (Base Model Only, In-Prompt Static Data, Retrieval-Augmented Generation), including their strengths, limitations, and practical applications.
  4. Describe the concept of embeddings, including how they represent semantic relationships between concepts in AI systems.
  5. Evaluate inference optimization techniques (e.g., batching, caching, request throttling, early stopping) and recommend appropriate strategies for improving performance and cost-efficiency in LLM-powered applications.
  6. Apply the concept of Retrieval-Augmented Generation (RAG) by outlining how it integrates external knowledge into LLMs during inference to enhance accuracy and relevance.

API Integration Types

When integrating with Large Language Models (LLMs), you have two primary approaches: the Completion API and the Chat API. While both facilitate interactions with LLMs, they are designed for different types of tasks and use cases.


Completion API

The Completion API is ideal for single-turn interactions, where you send one prompt and receive one response. It offers precise control over the input structure and is best suited for isolated tasks.

Key Features:

  • Simplicity: Each interaction is self-contained, with no built-in conversation structure.
  • Flexibility: You can design prompts exactly as needed for your application.
  • Token Efficiency: Typically uses fewer tokens, making it more efficient for short, standalone tasks.
  • Use Cases: Generating content, performing text summarization, or answering one-off queries.

Enhancing Completion API with Templates

Developers often employ templating techniques to improve prompt clarity and consistency. Two common methods include:

  • ChatML-Style Format:

    system
    You are a helpful coding tutor who explains concepts clearly.
    
    user
    What is a for loop?
    
    assistant
    A for loop is a control flow statement that repeats a block of code a specified number of times...
    
    user
    Can you give me an example?
    
    assistant  

    This approach uses role labels to separate the different parts of the prompt. In actual ChatML, each message is additionally wrapped in special tokens such as <|im_start|> and <|im_end|>.

  • Custom Delimiters:

    ### System ###
    You are a helpful coding tutor who explains concepts clearly.
    
    ### User ###
    What is a for loop?
    
    ### Assistant ###
    A for loop is a control structure used to iterate over a sequence...
    
    ### User ###
    Can you give me an example?
    
    ### Assistant ###

    Custom delimiters enable you to define clear sections within your prompt.

Did you Notice?

Leaving the final ‘assistant’ section open is one way to ensure the model begins its completion immediately, following the pattern that preceded it, rather than breaking the expected format or sequence.
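
As a rough illustration of the Custom Delimiters template, the sketch below assembles a prompt from a system message and a list of turns, and leaves the last Assistant section open. The function name `build_prompt` and the exact delimiter spelling are assumptions made for this example, not part of any particular API.

```python
def build_prompt(system, turns):
    """Render a conversation as one completion prompt with '### Role ###' delimiters.

    `turns` is a list of (role, text) pairs. The prompt ends with an open
    Assistant section so the model continues from there.
    """
    sections = [f"### System ###\n{system}"]
    for role, text in turns:
        sections.append(f"### {role.capitalize()} ###\n{text}")
    sections.append("### Assistant ###\n")  # left open for the model to complete
    return "\n\n".join(sections)


prompt = build_prompt(
    "You are a helpful coding tutor who explains concepts clearly.",
    [
        ("user", "What is a for loop?"),
        ("assistant", "A for loop is a control structure used to iterate over a sequence..."),
        ("user", "Can you give me an example?"),
    ],
)
print(prompt)
```

Keeping the template in one function means every request is formatted the same way, which is the consistency benefit described above.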


Chat API

The Chat API is specifically built for multi-turn, interactive conversations. It organizes messages into discrete units, each assigned a role, to maintain a coherent dialogue during a session.

Key Features:

  • Structured Dialogue: Uses well-defined message roles—system, user, and assistant—to organize interactions.
  • Multi-Turn Interactions: Designed to handle a series of conversational exchanges, providing a natural conversational flow.
  • Enhanced Clarity: Each message is explicitly tagged, which aids in maintaining context within a session.
  • Use Cases: Ideal for chatbots, interactive assistants, and applications requiring a back-and-forth dialogue.

Chat API Conversation Example

Below is a sample conversation using the Chat API format:

[
  {
    "role": "system",
    "content": "You are a helpful coding tutor who explains concepts clearly."
  },
  {
    "role": "user",
    "content": "What is a for loop?"
  },
  {
    "role": "assistant",
    "content": "A for loop is a control structure that allows you to execute a block of code repeatedly."
  },
  {
    "role": "user",
    "content": "Can you show me an example?"
  }
]

This format clearly delineates each message, which helps in maintaining a structured conversation throughout the session.
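
To make the multi-turn idea concrete, here is a minimal sketch of how an application might keep the message list and resend it on every turn. `fake_chat_model` is a hypothetical stand-in for a real Chat API call, and the field name that holds the text (`content` here) is an assumption that varies by provider.

```python
def fake_chat_model(messages):
    """Stand-in for a real Chat API call (hypothetical); echoes the last user message."""
    last_user = [m for m in messages if m["role"] == "user"][-1]
    return f"(model reply to: {last_user['content']})"


def send(history, user_text, model=fake_chat_model):
    """Append the user's turn, call the model with the full history, store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


history = [
    {"role": "system", "content": "You are a helpful coding tutor who explains concepts clearly."}
]
send(history, "What is a for loop?")
send(history, "Can you show me an example?")
print([m["role"] for m in history])
```

The point of the sketch is that the application, not the model, carries the conversation: each call receives the whole history so far.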


Choosing the Right API

These are general guidelines rather than hard rules. Depending on the model and your specific requirements, there may be exceptions, but in most cases you can choose based on the following criteria:

When to Use the Completion API:

  • For simple, one-off tasks.
  • When detailed conversation history is not required.
  • When you need fine-grained control over the prompt structure.
  • For tasks that demand token efficiency.

When to Use the Chat API:

  • For applications requiring multi-turn conversations.
  • When a structured dialogue is essential.
  • For interactive systems, such as chatbots or virtual assistants.
  • When clarity in the conversation flow is needed.

So… Which One is Better?

Both the Completion API and the Chat API serve distinct purposes in LLM integration. The Completion API offers simplicity and efficient token usage for standalone tasks, while the Chat API provides a structured format suited for interactive, multi-turn conversations. By selecting the appropriate API type based on your application’s requirements, you can enhance both the functionality and user experience of your LLM-driven projects.


Response Handling

The way you receive and process responses from an LLM can significantly impact your application’s user experience and performance. There are two main approaches:

  1. Synchronous (Non-Streaming)

    • Wait for complete response before processing
    • Best for batch processing and applications where immediate feedback isn’t critical
    • Advantages:
      • Simpler to implement
      • Easier to validate and process complete responses
      • Better for systems that need to analyze full responses before proceeding
    • Disadvantages:
      • Can lead to longer perceived latency
      • No intermediate feedback to users
  2. Streaming

    • Real-time token delivery
    • Ideal for interactive applications and chat interfaces
    • Advantages:
      • Better user experience with immediate feedback
      • Allows for progressive rendering of responses
      • Can implement typing indicators or similar UI elements
    • Disadvantages:
      • More complex to implement
      • Requires handling partial responses
      • May need additional error handling for interrupted streams
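
The difference between the two approaches can be sketched with a simulated model. `fake_stream` below is a hypothetical stand-in for a streaming API that yields a reply in small chunks; real APIs deliver chunks in provider-specific formats.

```python
import time


def fake_stream(text, delay=0.0):
    """Hypothetical stand-in for a streaming API: yields the reply piece by piece."""
    for word in text.split(" "):
        time.sleep(delay)  # simulates generation time per chunk
        yield word + " "


def respond_sync(text):
    """Synchronous handling: wait for everything, then return the whole reply."""
    return "".join(fake_stream(text)).strip()


def respond_streaming(text, on_chunk):
    """Streaming handling: pass each chunk to the UI as it arrives, keep a full copy."""
    parts = []
    for chunk in fake_stream(text):
        on_chunk(chunk)      # e.g. append to the chat window
        parts.append(chunk)  # partial responses must be accumulated by the caller
    return "".join(parts).strip()


reply = "A for loop repeats a block of code."
full = respond_sync(reply)
shown = []
streamed = respond_streaming(reply, shown.append)
print(full == streamed, len(shown))
```

Both paths end with the same text. The streaming path simply exposes the intermediate chunks, which is where the extra implementation work comes from.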

The choice between synchronous and streaming responses often depends on your application’s needs and how you plan to integrate external knowledge, which leads us to our next topic: knowledge integration strategies.


Knowledge Integration

When working with LLMs, you’ll need to decide how to provide the model with the information it needs to generate accurate and relevant responses. There are three main approaches:

  1. Base Model Only

    • Uses model’s pre-trained knowledge exclusively
    • Limited to training cutoff date
    • Best for general tasks and reasoning
    • Advantages:
      • Simplest to implement
      • Lowest latency
      • No additional infrastructure needed
    • Disadvantages:
      • Knowledge may be outdated
      • Can’t access proprietary or specific information
  2. In-prompt Static Data

    • Using the context window of the LLM to include necessary information
    • Suitable for small amounts of specific data
    • Advantages:
      • Simple to implement
      • Direct control over information provided
      • Good for specific, contained use cases
    • Disadvantages:
      • Limited by context window size
      • Can be expensive in terms of tokens
      • Not scalable for large amounts of data
  3. Retrieval-Augmented Generation (RAG)

    • Enhances responses with external knowledge
    • Keeps information up-to-date
    • Ideal for domain-specific applications
    • Advantages:
      • Access to large amounts of current information
      • Can handle proprietary data
      • More accurate and specific responses
    • Disadvantages:
      • More complex architecture
      • Requires additional infrastructure
      • May have higher latency

The first two are fairly self-explanatory, but the third requires a little more explanation. That’s where we’ll focus next.

Why is this Important?

Retrieval-Augmented Generation, in its many kinds and flavors, is the foundation of most current enterprise applications. It allows the LLM to access proprietary data, records, and archives that are variable or frequently updated, rather than only data that remains static long term. Static data could potentially be baked into a model by fine-tuning, but doing that regularly for data that changes often would be impractical and costly.


Retrieval-Augmented Generation (RAG)

Imagine asking an LLM about current events, something it might not know if its training data is outdated. Ultimately, the way to address this is to provide the information it needs to answer the question, and this is done through the prompt.

However, it is neither feasible nor practical to include in the prompt all the proprietary or up-to-date information the model might need to answer any possible question. As we learned before, the context window is limited.

This is where RAG comes in. RAG bridges this gap by combining the model’s generative capabilities with real-time information retrieval. Here’s how it works:

  1. Relevant documents are retrieved from an external database based on the user’s query.
  2. These documents are fed into the model alongside the query.
  3. The model generates a response that incorporates both its learned knowledge and the retrieved information.

For example, a customer support chatbot could use RAG to pull answers from internal company documents while maintaining conversational fluency.
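
The three steps above can be sketched in a few lines. This is a toy under stated assumptions: the `DOCUMENTS` list is a hypothetical knowledge base, and retrieval is done by simple word overlap, whereas production systems typically retrieve with embeddings and a vector database.

```python
import re

DOCUMENTS = [  # hypothetical internal knowledge base
    "Refunds are processed within 5 business days of approval.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Premium plans include priority support and extra storage.",
]


def words(text):
    """Lowercase a text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=1):
    """Step 1: rank documents by how many words they share with the query."""
    q = words(query)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query, documents, k=1):
    """Step 2: place the retrieved text in the prompt alongside the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


prompt = build_rag_prompt("How long do refunds take?", DOCUMENTS)
print(prompt)  # Step 3 would send this prompt to the LLM
```

Only the most relevant document reaches the prompt, which is how RAG works around the context window limit.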

Concept: RAG

Retrieval-Augmented Generation (RAG) enhances LLMs by integrating external information sources into the prompt during inference, enabling them to provide accurate and context-specific responses beyond their static training data.


Technical Components

Vector Operations and Embeddings

To understand how LLMs process and understand text, we need to grasp two fundamental concepts: vectors and embeddings. While related, they are not exactly the same thing.

What is a Vector?

A vector is simply a list of numbers that can represent a point in space, like coordinates on a map. For example:

  • A 2D vector [3, 4] represents a point 3 units east and 4 units north
  • A 3D vector [1, 2, 3] represents a point in three-dimensional space

These are simple examples we can visualize. In LLMs, we use vectors with hundreds or thousands of dimensions. But at their core, vectors are just containers for numbers - they don’t inherently mean anything. Just like the coordinates [3, 4] could represent:

  • A point on a map
  • A measurement of temperature and humidity
  • The height and width of a rectangle
  • Any other pair of numbers

The meaning comes from how we choose to interpret and use these numbers.

Think of it This Way…

Think of vectors like barcodes - they’re just sequences of numbers that don’t mean anything on their own. Embeddings are like barcodes that have been configured to represent specific products. When you scan a barcode, the numbers suddenly have meaning because they’ve been trained to represent that tasty chocolate bar 🍫

All embeddings are vectors, but not all vectors are embeddings! Just like how all barcodes are number sequences, but not all number sequences are meaningful barcodes.


What is an Embedding?

An embedding is a specific use of vectors for representing meaning. If vectors are like empty containers, embeddings are those containers filled with carefully chosen numbers to represent specific concepts. What makes embeddings special is that:

  1. They are learned through training:

    • The model learns what numbers to put in each vector
    • These numbers aren’t arbitrary - they’re optimized to capture meaning
  2. They have meaningful relationships:

    • Similar concepts get similar numbers
    • The differences between embeddings represent meaningful relationships

Here’s a concrete example:

# These are just vectors (containers of numbers):
v1 = [0.2, 0.5, 0.1]
v2 = [0.3, 0.4, 0.2]

# The same vectors become embeddings once they are used to represent words:
embeddings = {
    "cat": [0.2, 0.5, 0.1],  # Now it's an embedding for "cat"
    "dog": [0.3, 0.4, 0.2],  # Now it's an embedding for "dog"
}

The key difference:

  • The raw vectors [0.2, 0.5, 0.1] and [0.3, 0.4, 0.2] are just lists of numbers
  • They become embeddings when we train them to represent “cat” and “dog” in a way where:
    • Their similarity reflects that both are pets
    • Their difference reflects that they’re different species
    • Their relationship to other word embeddings captures meaningful patterns
  • Each number in the vector occupies one dimension, so these example vectors have three dimensions

Key Takeaway

The power of embeddings lies in how they capture relationships between concepts. Just like how we naturally understand that “kitten” is related to “cat” and “puppy” is related to “dog”, embeddings allow AI models to understand these connections through carefully chosen numbers.
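
One common way to measure how “similar” two embeddings are is cosine similarity, which compares the directions of two vectors. The sketch below uses the toy “cat” and “dog” vectors from the example, plus a made-up “airplane” vector that is purely an assumption for illustration. Real embeddings come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


embeddings = {
    "cat": [0.2, 0.5, 0.1],
    "dog": [0.3, 0.4, 0.2],
    "airplane": [0.9, -0.3, 0.1],  # hypothetical value for an unrelated concept
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_plane = cosine_similarity(embeddings["cat"], embeddings["airplane"])
print(cat_dog > cat_plane)
```

With these toy numbers, “cat” scores much closer to “dog” than to “airplane”, which is the kind of relationship a trained embedding space is meant to capture.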

Understanding Dimensions in Embeddings

To understand dimensions, let’s start with a simplified analogy: imagine if each dimension represented a physical trait - the first dimension being the number of legs, the second being the number of ears, the third being the size, and so on. This helps us grasp why we might need multiple dimensions to fully describe something.

For example, in a trained embedding space:

  • The numbers in the “cat” embedding are specifically chosen so that:
    • It’s close to “kitten” (similar concept)
    • It’s somewhat close to “dog” (both are pets)
    • It’s far from “airplane” (unrelated concept)
    • The difference between “cat” and “kitten” represents the concept of “young animal”

Reality Check

In practice, dimensions in embeddings don’t actually correspond to such clear-cut physical or conceptual features. While it’s tempting to think of dimensions as representing specific traits (like our legs and ears example), the reality is that the dimensions work together in complex ways to capture semantic relationships. The more dimensions we have, the more subtle nuances and relationships the embedding can represent.

Think of it like how our brain processes faces - while we might try to describe a face in terms of specific features (nose shape, eye color, etc.), our actual recognition system works in much more complex and interconnected ways that are harder to break down into individual components.


Putting It All Together

Now that we understand vectors and embeddings, let’s see how they work in practice. When you interact with an AI like ChatGPT, here’s what’s happening behind the scenes:

  1. Words to Numbers: Your text is split into tokens, and each token is converted into an embedding (one of those special number lists we talked about)
  2. Finding Relationships: The AI uses these numbers to understand:
    • Which words are similar in meaning
    • How different concepts relate to each other
    • What makes sense in the context you’ve provided

These calculations are repeated for every token the AI reads and generates. While the underlying math is complex, the basic idea is simple: we’re converting words into numbers in a way that captures their meaning and relationships.

Embeddings: Beyond RAG

It’s important to understand that embeddings are not used exclusively for RAG or vector database applications!

They are actually fundamental to how all modern LLMs work internally. Even when using a “base model only” approach with no external retrieval, the model is constantly creating and manipulating embeddings internally.

Also, it is important to note that while we’ve focused on text embeddings, similar techniques are used for images, audio, and multimodal AI systems.

So for example, when you ask ChatGPT “What’s the capital of France?” here’s what happens:

  1. Your words are converted into embeddings
  2. The AI finds relationships between concepts like “capital,” “France,” and “cities”
  3. It scores the possible next words against its internal representation of the question, and “Paris” comes out as the best match
  4. The selected result is converted back into text: “Paris”
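
The last two steps can be pictured with a deliberately tiny sketch. It assumes a three-word vocabulary with made-up two-dimensional embeddings and a made-up internal vector (`hidden`). Scoring each candidate by a dot product and normalizing with softmax is a common design, but the numbers here are illustrative only.

```python
import math

# Hypothetical 2-dimensional token embeddings for a 3-word vocabulary.
VOCAB = {
    "Paris": [0.9, 0.1],
    "London": [0.6, 0.5],
    "banana": [-0.4, 0.8],
}


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(hidden_state, vocab):
    """Score each token by its dot product with the hidden state, then normalize."""
    tokens = list(vocab)
    scores = [sum(h * v for h, v in zip(hidden_state, vocab[t])) for t in tokens]
    return dict(zip(tokens, softmax(scores)))


# Assumed internal vector after the model has read "What's the capital of France?"
hidden = [2.0, -0.5]
probs = next_token_probs(hidden, VOCAB)
best = max(probs, key=probs.get)
print(best)
```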

Inference Optimization Techniques

When using LLMs in applications, several techniques are often used to improve performance and reduce costs. It is important to be aware of them when considering GenAI-integrated systems:

  • Batching: Batch processing is like baking multiple trays of cookies at once. Batching can process more requests overall, though individual requests might take a bit longer. This approach is useful for applications where overall throughput is more important than the speed of each individual response.

One way to think about batching is to imagine a bakery. The baker could make one cookie at a time, or they could put multiple batches of cookies in the oven simultaneously.
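
A minimal sketch of the idea, assuming a hypothetical `fake_model_batch` function that can answer several prompts in a single call:

```python
def fake_model_batch(prompts):
    """Hypothetical stand-in for one model call that handles several prompts at once."""
    return [f"answer to: {p}" for p in prompts]


def run_in_batches(prompts, batch_size):
    """Group prompts so the model is called once per batch, not once per prompt."""
    results, calls = [], 0
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        results.extend(fake_model_batch(batch))
        calls += 1
    return results, calls


prompts = [f"question {i}" for i in range(10)]
answers, calls = run_in_batches(prompts, batch_size=4)
print(len(answers), calls)
```

Ten prompts are served with three model calls instead of ten. The trade-off is that a prompt may wait for its batch to fill before it is processed.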

  • Caching: Caching saves time and money by reusing previous answers instead of generating them again. It’s especially valuable when users often ask similar questions or when certain responses are used repeatedly.

Imagine a tour guide who memorizes answers to frequently asked questions. When asked “When was this building constructed?”, they can answer immediately without having to look it up again. This is the same principle used by caching.
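
In code, the simplest form of this is a dictionary keyed by a normalized version of the prompt. `fake_model` and the normalization rule are assumptions for this sketch. Real systems often add expiry times or match on meaning rather than exact text.

```python
calls = {"count": 0}  # tracks how often the (expensive) model is actually called


def fake_model(prompt):
    """Hypothetical stand-in for a slow, paid model call."""
    calls["count"] += 1
    return f"answer to: {prompt}"


_cache = {}


def normalize(prompt):
    """Treat prompts that differ only in case or spacing as the same question."""
    return " ".join(prompt.lower().split())


def cached_answer(prompt):
    """Return a stored answer when available; otherwise call the model and store it."""
    key = normalize(prompt)
    if key not in _cache:
        _cache[key] = fake_model(prompt)
    return _cache[key]


first = cached_answer("When was this building constructed?")
second = cached_answer("  when was this building   constructed? ")
print(first == second, calls["count"])
```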

  • Request Throttling: Throttling prevents system overloads by controlling how many requests are processed at once. This helps maintain stable performance and manages costs by preventing sudden usage spikes.

Consider highway metering lights that regulate how many cars can enter the freeway. Without them, too many cars entering at once would cause traffic jams. Request throttling works on the same principle.
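
One simple way to throttle is to cap how many requests may be in flight at the same time. The sketch below does this with a semaphore from Python's standard library. The limit of 2 and the simulated model call are assumptions for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 2  # assumed limit on simultaneous model calls
_slots = threading.Semaphore(MAX_CONCURRENT)
_lock = threading.Lock()
_active = 0
peak_active = 0  # highest number of simultaneous calls observed


def throttled_call(prompt):
    """Run a simulated model call, waiting first if the limit is already reached."""
    global _active, peak_active
    with _slots:  # blocks while MAX_CONCURRENT calls are already in flight
        with _lock:
            _active += 1
            peak_active = max(peak_active, _active)
        time.sleep(0.01)  # stands in for the model call
        with _lock:
            _active -= 1
    return f"answer to: {prompt}"


with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(throttled_call, [f"q{i}" for i in range(8)]))
print(len(answers), peak_active <= MAX_CONCURRENT)
```

Eight requests arrive at once, but no more than two are ever processed simultaneously. The rest wait their turn, like cars at the metering light.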

  • Early Stopping: AI systems can stop generating text once they’ve provided a sufficient answer. Early stopping reduces unnecessary processing, saving time and resources. It’s particularly useful for yes/no questions or classification tasks where a complete response isn’t always needed.

To understand Early Stopping, imagine a chef tasting soup as it cooks. Once it reaches the right flavor, there’s no need to continue cooking.
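
Early stopping is often implemented with a stop condition and a token budget. The sketch below assumes a hypothetical token stream and stops at the first period. Many LLM APIs expose comparable controls, but the names and behavior here are illustrative.

```python
def fake_token_stream():
    """Hypothetical model output for a yes/no question, one token at a time."""
    yield from ["Yes", ".", " The", " reason", " is", " that", " ..."]


def generate(stream, stop_tokens=(".",), max_tokens=50):
    """Collect tokens until a stop token appears or the token budget runs out."""
    out = []
    for token in stream:
        if token in stop_tokens or len(out) >= max_tokens:
            break  # stop consuming: the remaining tokens are never requested
        out.append(token)
    return "".join(out)


answer = generate(fake_token_stream())
print(answer)
```

For a yes/no question, the application gets its answer after one token and skips the explanation it does not need.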

  • Model Optimization: AI models can be “compressed” for specific tasks. Optimized models require less computing power and memory while still delivering good results. This can make AI more accessible, faster, and less expensive to operate.

As an analogy, consider transportation choices. While a large truck can carry more, a compact car is more efficient for daily commuting.


Practical Takeaways

Understanding these optimization techniques helps you:

  • Evaluate AI solutions: Know what questions to ask about performance and efficiency
  • Set expectations: Understand the trade-offs between speed, quality, and cost
  • Plan resources: Anticipate computing needs for AI applications

Even if you’re not implementing these techniques yourself, knowing they exist helps you make informed decisions about AI deployments.


Quiz

Let’s test your understanding!

Want to test your understanding of inference techniques and their applications? This quiz focuses on the practical implications of different implementation choices.

## A company needs to build a customer support system that handles sensitive financial information. Which approach would best balance responsiveness with data security?

> Hint: Consider where and how information is processed in different integration patterns.

1. [ ] Using a public LLM through an API with in-prompt data inclusion of customer financial details
   > This approach poses significant security risks. Including sensitive financial information directly in prompts means this data is being sent to external servers, potentially violating financial data regulations and privacy policies.
1. [x] Implementing a RAG system with a private vector database that retrieves only relevant, anonymized information
   > Correct! This approach provides the best balance of security and functionality. The RAG system can retrieve relevant information without exposing all customer data, while anonymization adds an extra layer of protection. The model can still provide helpful responses by reasoning over the selectively retrieved information.
1. [ ] Using the base model only and avoiding any integration with customer data systems
   > While this eliminates direct data security concerns, it would severely limit the system's ability to provide specific, personalized support for financial queries, making it ineffective for the intended purpose.
1. [ ] Caching all possible customer queries and responses to avoid real-time processing
   > This is impractical for financial support (which often involves unique situations) and still doesn't address how the cached responses would be generated securely in the first place. It also creates a new security risk by storing potentially sensitive pre-computed responses.

## An e-commerce company is implementing an AI product recommendation system. Which of the following would most effectively improve both performance and user experience?

> Hint: Consider optimization techniques that specifically enhance responsiveness and relevance.

1. [ ] Increasing the temperature parameter for all recommendation requests
   > Simply increasing temperature would make recommendations more random but not necessarily more relevant or performant. It might even reduce the quality of recommendations by introducing too much variability.
1. [ ] Implementing a non-streaming response pattern to ensure recommendations are only shown when complete
   > While this might ensure completeness, it would likely create a perceived latency issue as users wait for recommendations without any visual feedback, degrading the user experience.
1. [x] Combining caching of common recommendations with request batching during peak shopping periods
   > Correct! This approach addresses both performance and user experience. Caching frequently requested recommendations reduces computation time for common queries, while batching during high-traffic periods optimizes throughput without sacrificing individual response quality. Together, these make the system more responsive and efficient.
1. [ ] Sending every product in the catalog as context with each query
   > This would overwhelm the context window, dramatically increase token usage and costs, and likely slow down response times as the model processes excessive information.

## A company is building a real-time collaborative document editor with an integrated AI assistant. Which implementation approach would be most effective for providing continuous writing suggestions while minimizing API costs?

> Hint: Consider how different API types and response patterns affect both user experience and resource usage.

1. [ ] Using synchronous Completion API calls triggered after every paragraph
   > This would create noticeable pauses in the user experience as they wait for suggestions after completing each paragraph, and doesn't efficiently handle real-time collaboration.
1. [ ] Sending the entire document with each API call to maintain full context
   > While this ensures the AI has complete context, it would dramatically increase token usage and costs, especially for longer documents, making it financially impractical.
1. [x] Implementing streaming responses with the Chat API and early stopping when sufficient suggestions are generated
   > Correct! This approach provides immediate feedback to users through streaming, maintains conversation context efficiently with Chat API, and optimizes costs by using early stopping to avoid generating unnecessary content once sufficient suggestions are available. This combination is ideal for real-time collaborative environments where responsiveness and cost efficiency are both critical.
1. [ ] Using separate independent API calls for each collaborator to avoid context confusion
   > This approach would miss important collaborative context, potentially leading to contradictory suggestions, and would multiply API costs unnecessarily by treating each user's interactions as completely separate.

## A data scientist is creating embeddings for a technical documentation search system. Which statement most accurately describes the relationship between vectors and embeddings in this context?

> Hint: Consider the fundamental distinction between these two related concepts.

1. [ ] Vectors and embeddings are interchangeable terms for the same concept in AI
   > Incorrect. While related, these terms have distinct meanings and implications in AI systems.
1. [ ] Embeddings are always three-dimensional, while vectors can have any number of dimensions
   > This is factually incorrect. Embeddings typically have hundreds or thousands of dimensions, not just three.
1. [x] All embeddings are vectors, but embeddings specifically have been trained to capture semantic relationships
   > Correct! This accurately describes the relationship. Embeddings are a specific use of vectors where the values have been learned through training to represent semantic meaning. The numbers in embeddings are carefully chosen to ensure that similar concepts have similar vector representations, whereas generic vectors are simply lists of numbers without this semantic property.
1. [ ] Vectors are used for storing text, while embeddings are used for storing numerical data
   > This mischaracterizes both concepts. Both vectors and embeddings are numerical representations, and neither directly "stores" text.

Coming up next

Now that we understand how LLMs represent and process information, we’re ready to explore how they’re actually used in real-world applications. In the next section, we’ll examine the evolution of AI systems toward greater autonomy and agency, and how this shift will impact various industries.