Activity 1.3: Local Inference with Ollama

Work in progress

This section is under construction. This information hasn’t been reviewed or edited yet!


Practical Activity Overview

This application demonstrates how to run a local inference using Ollama. We’ll be building and running this application locally on your machine.

Prerequisites

  • Python 3.8 or higher installed on your system
  • Basic familiarity with command line/terminal
  • A text editor or IDE of your choice
  • At least 8GB RAM (16GB recommended for better performance)
  • At least 10GB free disk space

Activities

Step 1: Set Up Your Development Environment

1.1 Make sure you are using the virtual environment we created in the previous activity:

  • On Windows:
.\venv\Scripts\activate
  • On macOS/Linux:
source venv/bin/activate

Step 2: Install Ollama

2.1 Ollama is an easy-to-use tool for running LLMs locally. Let’s install it first:

  1. Download the installer from https://ollama.com/download
  2. Run the installer and follow the prompts

2.2 After installation, verify Ollama is running by opening a terminal and typing:

ollama --version

Step 3: Pull a Model

3.1 Before we can use a model, we need to download it. Let’s pull a small but capable model:

ollama pull llama3.2:1b
Model Selection

We’re using Llama 3.2 1B parameter model because it’s relatively small (about 1GB) but still performs well. It should run well even on modest computing resources and CPU, and it’ll make our future tasks much quicker.

You can explore other models with ollama list after installation.

Step 4: Install Required Packages

4.1 Add ‘requests’ to our requirements.txt file. This will handle making calls to our Ollama endpoint.

streamlit 
google-generativeai 
python-dotenv
requests

4.2 Install the new packages:

pip install -r requirements.txt
Ollama Authentication

By default, Ollama doesn’t require an API key when running locally. We’ll access it directly via its local API endpoint.

Step 5: Create the Enhanced Streamlit App

5.1 Create a file named app.py with the following code:

import os
import streamlit as st
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.llms import Ollama
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Check for API key
if not os.getenv('GOOGLE_API_KEY'):
    st.error("Please set your Google API Key in the .env file!")
    st.stop()

st.title("Chat App")
model_type = st.sidebar.selectbox("Model", ["Gemini", "Ollama"])

# Initialize the appropriate model based on selection
def get_model():
    if model_type == "Gemini":
        return ChatGoogleGenerativeAI(model="gemini-2.0-flash")
    else:
        return Ollama(model="llama3.2:1b", base_url="http://localhost:11434")

# Here we create our chat input
user_input = st.chat_input("Type your message here...")

# Generate and display response
if user_input:
    st.chat_message("user").write(user_input)
    try:
        model = get_model()
        response = model.invoke(user_input)
        
        # Handle different response formats from different model types
        content = response.content if hasattr(response, 'content') else response
        st.chat_message("assistant").write(content)
    except Exception as e:
        st.error(str(e))

Step 6: Run the Application

6.1 Make sure your virtual environment is activated

6.2 Ensure Ollama is running (it should start automatically after installation)

6.3 Run the Streamlit application:

streamlit run app.py

6.4 Your default web browser should open to http://localhost:8501

Step 7: Experiment with Different Models

7.1 Try comparing responses between:

  • Gemini (cloud-based)
  • Ollama (local)

7.2 Pay attention to:

  • Response quality
  • Response time
  • Memory capabilities

Key Learning Points

When you test the app, you’ll notice something important: the models don’t remember previous messages in the conversation! This is because our implementation treats each message independently.

Why doesn’t the model remember previous messages?

  • LLMs are inherently stateless - they don’t maintain memory between requests
  • Each request is processed independently without context from previous interactions
  • Our current implementation doesn’t pass conversation history to the model

In the next activity, we’ll learn how to implement conversation memory by:

  1. Maintaining a history of messages
  2. Sending the entire conversation context with each request
  3. Implementing proper conversation management
Troubleshooting
  • If Ollama isn’t responding, make sure the service is running
  • For Windows users, check if Ollama appears in your system tray
  • If you get “connection refused” errors, try restarting the Ollama service
  • Verify the model was downloaded successfully with ollama list
Security Note

Never commit your .env file to version control. Add it to your .gitignore file if you’re using Git.