5. Prompt Engineering


Work in progress

This section is under construction. This information hasn’t been reviewed or edited yet!


Introduction

At their core, LLMs work by responding to “prompts” - text inputs that tell the model what we want it to do. Think of a prompt as a conversation starter or instruction that guides the AI’s response. However, there’s more complexity to prompts than meets the eye, especially when working with different API types and managing conversations.

What will I get out of this?

By the end of this section, you will be able to:

  1. Explain the concept of prompts and their role in guiding LLM responses.
  2. Describe the key components of an effective prompt, including task instructions, context, and format specifications.
  3. Analyze the impact of essential parameters like temperature and top-P sampling on LLM outputs.
  4. Identify best practices for prompt engineering, including clarity, specificity, and error handling.
  5. Differentiate between traditional prompt engineering techniques and approaches optimized for modern reasoning models.
  6. Evaluate the appropriate use cases for different prompt engineering strategies based on task requirements and model capabilities.
Prompt Engineering: as much Art as Science

Prompt Engineering is a surprisingly complex discipline! Different models, different methods of inference, different tasks - all are criteria that influence the creation of a good prompt. While going into extreme minutiae on this is outside the scope of this course, we’ll cover general good practices.

Ultimately, the best way to craft a good prompt will involve a lot of experimentation and evaluation!

Anatomy of an Effective Prompt

A well-structured prompt typically includes several key components:

  1. Task Instructions:

    • Clear, specific directions about what you want
    • Example: “Analyze this code for security vulnerabilities”
  2. Context and Background:

    • Relevant information the model needs
    • Previous conversation history (in chat contexts)
    • Example: “Given a Python web application using Flask…”
  3. Format Specifications:

    • How you want the output structured
    • Example: “Provide your answer in bullet points”
  4. Examples (Few-Shot Learning):

    • Demonstrations of desired input-output pairs
    • Helps the model understand patterns
    Input: "Hello"
    Output: "Hi there! How can I help?"
    
    Input: "What's the weather?"
    Output: "I don't have access to current weather data."

Context Management

Managing context in longer conversations requires careful consideration:

  1. Context Window Limits:

    • Models have maximum token limits
    • Need to strategically manage conversation history
    • Consider summarizing or pruning older messages
  2. Conversation Memory:

    • Recent messages are more influential than older ones
    • Important to maintain relevant context while removing unnecessary details
    • Example strategy:
      Keep: Last 3-5 exchanges
      Summarize: Earlier important points
      Remove: Off-topic or resolved discussions
Effective Context Management
  • Keep track of token usage
  • Prioritize recent and relevant information
  • Use summarization for long conversations
  • Consider implementing memory systems for persistent knowledge

Best Practices

  1. Clarity and Specificity:

    • Be explicit about what you want
    • Avoid ambiguous instructions
    • Example: “Generate a Python function that calculates the Fibonacci sequence up to n terms”
  2. Safety and Control:

    • Include guardrails in system messages
    • Specify output constraints
    • Example: “Never generate executable code without safety checks”
  3. Error Handling:

    • Plan for edge cases
    • Include fallback instructions
    • Example: “If you’re unsure, ask for clarification rather than making assumptions”
Common Pitfalls
  • Overloading context windows with unnecessary information
  • Mixing multiple tasks in a single prompt
  • Assuming the model remembers previous conversations without proper context
  • Not setting clear boundaries in system messages
A Note on API Types

When implementing LLMs, you’ll use either a Completion API (for single-turn interactions) or a Chat API (for multi-turn conversations). Each has strengths for different scenarios. We’ll explore these integration patterns in detail in the next section on Inference Techniques, but it’s important to consider which API you’ll use as it affects how you structure your prompts.

The art of crafting effective prompt engineering is crucial because:

  • The same question asked differently can yield vastly different results
  • Prompts can include context, examples, or specific formatting instructions
  • The way we phrase prompts can help prevent or inadvertently enable harmful outputs
  • Different API types require different prompt structures

Advanced Prompt Engineering Techniques

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting is a technique that encourages LLMs to break down complex problems into step-by-step reasoning. Instead of jumping straight to an answer, the model explains its thinking process.

Example:

Basic prompt: "What's 123 × 456?"
CoT prompt: "Let's solve 123 × 456 step by step:
1. First, let's break down 456: 400 + 50 + 6
2. Now multiply 123 by each part:
   - 123 × 400 = 49,200
   - 123 × 50 = 6,150
   - 123 × 6 = 738
3. Finally, add these results:
   49,200 + 6,150 + 738 = 56,088"

This technique is particularly effective for:

  • Mathematical problems
  • Logical reasoning tasks
  • Complex decision-making
  • Debugging code
  • Analysis requiring multiple steps
Evolution of Reasoning

While Chain-of-Thought prompting has been a breakthrough in getting traditional LLMs to show their work, newer reasoning models are changing this paradigm entirely. Let’s explore how these models are reshaping our approach to prompt engineering.


Prompt Engineering for Reasoning Models

Modern reasoning models (like OpenAI’s o3 and DeepSeek R1) have built-in multi-step reasoning capabilities that fundamentally change how we should approach prompt engineering. In fact, many of the explicit CoT techniques we just covered may actually hinder these models’ performance.

Key Differences from Traditional Models

Aspect Traditional LLMs Reasoning Models
Reasoning Process Needs explicit CoT prompting Has automatic internal reasoning
Best Prompt Style Detailed instructions + examples Concise, direct queries
Few-Shot Learning Generally improves performance Can actually reduce quality
Processing Style Single-pass prediction Multi-step deliberation
Error Handling Requires manual iteration Has built-in verification

Optimizing for Reasoning Models

  1. Keep It Simple

    # Instead of:
    "Please solve this equation step by step: 3x + 7 = 22"
    
    # Use:
    "Solve 3x + 7 = 22"

    The model will automatically break down complex problems - explicit instructions can interfere with this process.

  2. Embrace Conciseness

    # Instead of:
    "Please carefully analyze all aspects and provide a detailed report..."
    
    # Use:
    "Analyze [topic] with supporting evidence"
  3. Model-Specific Considerations

Model Best For Prompt Style
o3 Structured coding tasks JSON schema constraints
R1 Mathematical reasoning Open-ended questions
When to Use Reasoning Models

Consider these models when:

  • Tasks require complex logical steps
  • Output consistency is critical
  • You need automated verification
  • Dealing with mathematical or coding challenges

Stick with traditional LLMs for:

  • Simple text transformations
  • Creative writing
  • Tasks where cost is a primary concern

Essential Parameters

While understanding how LLMs work is crucial, effectively using them requires mastering their control parameters. These parameters shape how models generate and process text:

Temperature

Temperature controls the randomness in the model’s responses:

  • Low values (e.g., 0.2) → more predictable, focused responses
  • High values (e.g., 0.8) → more creative, diverse responses
Think of it This Way…

At low temperatures, LLMs stick to the most probable responses (like saying “the sky is blue”). At higher temperatures, it might get more unpredictable with the words it uses (like “the sky is a canvas painted in azure hues”) 🎨 It is why it is often correlated with the “creativity” of the model.

Top-P (Nucleus) Sampling

While temperature affects overall randomness, top-P sampling controls which words the model considers:

  • Setting p = 0.9 means only the top 90% most likely tokens are considered
  • Lower values → more focused, conservative text
  • Higher values → more diverse vocabulary

For example:

Question: "What color is the sky?"
Top-P = 0.5: "blue" (sticks to most common answer)
Top-P = 0.9: "azure", "cerulean", "sapphire" (considers more options)

Response Length

Controls how much text the model generates:

  • Set by maximum token count
  • Longer isn’t always better
  • Consider context window limits
Context Window Trade-offs

Remember that longer responses consume more of your context window. A 1000-token response means 1000 fewer tokens available for future context in the conversation.


Quiz

Let’s test your understanding!

Want to test your understanding of prompt engineering principles? This quiz focuses on applying these concepts in real-world scenarios.

## A financial analyst needs to extract specific data points from quarterly reports. Which prompt structure would be most effective? > Hint: Consider which prompt elements would help the model understand the exact format needed. 1. [ ] A simple instruction: "Extract financial data from this quarterly report." > This is too vague. Without specifying which data points to extract or how to format the output, the model would likely provide inconsistent or incomplete results. 1. [ ] A detailed explanation of why the data is needed > While context can be helpful, simply explaining the purpose without specifying the format or examples wouldn't effectively guide the model to extract the specific data points needed. 1. [x] Clear instructions with format specifications and examples > Correct! An effective prompt would include: 1) Specific instructions about which data points to extract, 2) The exact format for the output (e.g., JSON with specific fields), and 3) One or two examples showing the expected input-output pattern. This structure guides the model to consistently extract the required information in the desired format. 1. [ ] Chain-of-thought reasoning that walks through the report section by section > While chain-of-thought can be useful for complex reasoning, for data extraction tasks, clear formatting instructions and examples are more important than walking through the reasoning process. ## You're building a customer service chatbot that needs to remember details from earlier in the conversation. What's the most effective approach to manage this? > Hint: Consider the inherent limitations of LLMs regarding memory. 1. [ ] Instruct the model to "remember everything the customer says" > This won't work because LLMs don't have persistent memory between requests. Simply instructing the model to remember doesn't create an actual memory mechanism. 1. [ ] Increase the temperature parameter to improve creativity > Temperature affects randomness/creativity, not memory capabilities. Changing this parameter won't help the model remember previous interactions. 1. [x] Implement context management by storing and selectively including relevant conversation history > Correct! Since LLMs are stateless, you need to implement external memory by: 1) Storing previous exchanges, 2) Selectively including relevant parts of the conversation history in each new prompt, and 3) Managing token usage to prevent exceeding context limits. This might include summarizing older parts of the conversation to save tokens while preserving key information. 1. [ ] Switch from a standard model to a reasoning model > The choice between model types affects how responses are processed and constructed, and may affect the quality of the answer, but doesn't impact the model's ability to retain information between requests. ## A software developer is using an LLM to help debug complex code. Which approach would likely yield the most accurate assistance? > Hint: Think about how different LLMs process reasoning tasks. 1. [ ] Using high temperature settings (0.8-1.0) to get creative solutions > Higher temperatures increase randomness, which is generally counterproductive for debugging tasks where precision and accuracy are critical. This would likely introduce more errors. 1. [x] Using a modern reasoning model with a direct, concise prompt > Correct! For complex debugging tasks, modern reasoning models (like o3 or R1) with built-in verification capabilities would perform best when given concise prompts. These models automatically employ multi-step reasoning without needing explicit prompting instructions, making them ideal for code analysis where accuracy is crucial. 1. [ ] Providing detailed step-by-step instructions on how to analyze the code > For modern reasoning models, explicit step-by-step instructions can actually interfere with their built-in reasoning capabilities. Overly detailed prompting can constrain the model's approach and reduce effectiveness. 1. [ ] Using a Chat API with multiple detailed few-shot examples > While few-shot examples can help traditional LLMs, they can actually reduce the quality of responses from reasoning models on technical tasks like debugging. These advanced models perform better with direct questions that allow them to employ their internal reasoning processes. ## For an AI application that generates product descriptions for an e-commerce site, which parameter configuration would be most appropriate? > Hint: Consider the balance between consistency and creativity needed for this specific task. 1. [ ] Temperature: 0.1, Top-P: 0.5 > This combination creates highly deterministic, conservative outputs. While consistency is important for product descriptions, this setting might produce overly generic or repetitive content that doesn't effectively highlight product features. 1. [ ] Temperature: 1.0, Top-P: 1.0 > These settings maximize randomness and diversity, which could result in unpredictable and potentially inaccurate product descriptions. This level of creativity is inappropriate for factual content like product information. 1. [x] Temperature: 0.7, Top-P: 0.9 > Correct! This balanced configuration provides enough creativity to make engaging, varied product descriptions while maintaining sufficient consistency to ensure accuracy. It allows for some stylistic flair (beneficial for marketing) without straying into fabrication or extreme variability. 1. [ ] Temperature: 0, No Top-P filtering > Zero temperature produces completely deterministic results, essentially taking the most probable token at each step. This would create extremely rigid, mechanical descriptions lacking any engaging qualities needed for marketing.
Coming up next

Now that we’ve explored how to effectively communicate with AI models through well-crafted prompts, let’s dive into the technical approaches for integrating these models into applications. In the next section, we’ll examine different inference techniques and API integration patterns.