Your Own Coding Assistant
Estimated time: 5–6 hours | Difficulty: Intermediate
What You Will Learn
- Understand how large language models (LLMs) work at a high level — tokens, parameters, and context windows
- Explain why model size, quantization, and hardware matter for running AI locally
- Install and run Ollama to host an LLM on your own machine
- Interact with a local model through the terminal and through its REST API
- Build a working chat interface using HTML, CSS, and JavaScript that talks to your local model
- Craft system prompts that turn a general model into a focused coding assistant
- Understand the trade-offs between local and cloud-hosted AI
1. Why Run Your Own AI?
You have almost certainly used an AI assistant by now. Maybe you have asked ChatGPT to explain a concept, or used GitHub Copilot to autocomplete a function. These tools are powerful, but they all have something in common: your code leaves your computer. Every prompt you send travels over the internet to someone else's server, gets processed there, and the response comes back. The company running that server can see your code, your questions, and your mistakes.
For learning, that is usually fine. But for professional work? Many companies ban cloud AI tools outright because they do not want proprietary code sitting on someone else's infrastructure. Hospitals cannot send patient data to external APIs. Government agencies have strict rules about where data can go. Even as an individual, you might not want a corporation reading every question you ask while debugging at 2 AM.
Running an AI model locally changes the equation entirely. Your data never leaves your machine. There is no API key to pay for. There is no rate limit. There is no terms-of-service update that changes what happens to your conversations. The model runs on your hardware, responds to your prompts, and that is the end of the story.
But privacy is only part of the reason. Running a model locally teaches you how AI actually works in a way that using a cloud API never can. You will see exactly how much memory a model needs. You will feel the difference between a 1-billion-parameter model and a 7-billion-parameter model. You will understand why your Raspberry Pi struggles with tasks that a cloud GPU handles instantly. This hands-on understanding is becoming essential knowledge for developers, and you are about to build it from the ground up.
2. How LLMs Actually Work
Before you install anything, you need a mental model of what you are actually running. An LLM is not magic. It is a very large mathematical function — a neural network — that predicts the next word (or more precisely, the next token) in a sequence.
Tokens
LLMs do not read words the way you do. They break text into tokens — chunks that are usually parts of words. The sentence "The developer debugged the application" might become six tokens: ["The", " developer", " debug", "ged", " the", " application"]. Common words often get their own token. Rare or long words get split into pieces.
This matters for two reasons. First, the model has a context window — a maximum number of tokens it can process at once. A model with a 4,096-token context window can handle roughly 3,000 words of combined input and output. If you paste a 500-line Java file and ask the model to review it, you are using a big chunk of that window. Second, the model generates output one token at a time. That is why AI responses appear to stream in word by word — the model is literally predicting one token, then the next, then the next.
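To get a feel for these numbers, you can roughly estimate token counts from character counts. The sketch below is a JavaScript approximation, not a real tokenizer: a common rule of thumb is that one token is about four characters of English text or code, but the exact count depends on the model's tokenizer.

// Rough approximation only: real tokenizers vary by model.
// Rule of thumb: roughly 4 characters per token for English text and code.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

const contextWindow = 4096;
const fileContents = 'public class OrderService { /* imagine 500 lines of Java here */ }';
const promptTokens = estimateTokens(fileContents);
console.log(`~${promptTokens} tokens used, ~${contextWindow - promptTokens} left for the response`);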
Parameters
When people say a model has "7 billion parameters," they are describing the number of numerical values (weights) inside the neural network. Think of parameters as the model's accumulated knowledge — everything it learned during training. More parameters generally means the model can handle more complex reasoning, understand subtler context, and generate more coherent responses.
But more parameters also means more memory. Each parameter is a number, and numbers take up space. Here is the rough math:
- 1B parameter model — ~2 GB of RAM at full precision, ~0.5–1 GB quantized
- 3B parameter model — ~6 GB at full precision, ~2 GB quantized
- 7B parameter model — ~14 GB at full precision, ~4 GB quantized
- 13B parameter model — ~26 GB at full precision, ~7–8 GB quantized
- 70B parameter model — ~140 GB at full precision, ~35–40 GB quantized
That last number should jump out at you. A 70-billion-parameter model needs more RAM than most computers have, even after optimization. This is why hardware matters so much for local AI.
Quantization
You noticed the word "quantized" in that list. Quantization is the process of reducing the precision of each parameter to make the model smaller and faster. Instead of storing each weight as a 16-bit or 32-bit floating-point number, you store it as a 4-bit or 8-bit integer. You lose some accuracy, but you gain enormous savings in memory and speed.
Think of it like image compression. A raw photograph might be 25 megabytes. A JPEG version might be 2 megabytes. You lose some detail, but for most purposes the JPEG looks fine. Quantization does the same thing to model weights.
When you see model names like llama3.2:3b-q4_K_M, that suffix tells you the quantization level. Q4 means 4-bit quantization (aggressive compression, smallest size). Q8 means 8-bit (less compression, better quality). For most coding tasks, Q4 models perform surprisingly well — the difference in code quality between Q4 and Q8 is usually small enough that the memory savings are worth it.
Key insight: The quality of an AI model is not just about parameter count. A well-quantized 7B model often outperforms a poorly quantized 13B model, because the 7B model fits comfortably in RAM while the 13B model constantly swaps to disk. A model that fits in your RAM will always feel faster and more responsive than a bigger model that does not.
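If you want to reproduce the memory estimates above, the arithmetic is simple: multiply the parameter count by the number of bytes each weight occupies. The JavaScript sketch below is a ballpark calculation only; real models also need extra memory for the context (the KV cache) and other runtime overhead.

// Ballpark RAM estimate: parameters x bytes per weight, ignoring overhead
function estimateRamGB(billionParams, bitsPerWeight) {
  const bytesPerWeight = bitsPerWeight / 8;   // fp16 = 2 bytes, 4-bit = 0.5 bytes
  return (billionParams * 1e9 * bytesPerWeight) / 1e9;
}

for (const size of [1, 3, 7, 13, 70]) {
  console.log(
    `${size}B params: ~${estimateRamGB(size, 16).toFixed(1)} GB at fp16, ` +
    `~${estimateRamGB(size, 4).toFixed(1)} GB at 4-bit`
  );
}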
3. Choosing Your Hardware
This lesson works on any computer that can run Linux, macOS, or Windows. But your experience will vary dramatically depending on what hardware you have. Here is what to expect at each tier:
Raspberry Pi 5 (8 GB RAM)
The Raspberry Pi is the most constrained option, and that is exactly what makes it interesting. With 8 GB of RAM, you can run models up to about 3 billion parameters at 4-bit quantization. That means models like llama3.2:1b or llama3.2:3b.
The responses will be slow — expect a few tokens per second rather than the near-instant responses you get from cloud services. The model's coding ability will be limited compared to larger models. It will handle simple questions well ("What does git rebase do?") but struggle with complex multi-step reasoning ("Refactor this 200-line class to use the strategy pattern").
But here is why the Pi is still a great choice: it makes the constraints visible. You will feel the model loading into RAM. You will notice how response speed drops as the conversation gets longer and the context window fills up. You will understand, viscerally, why companies spend millions on GPU clusters. And when you are done learning, you have a tiny, silent, always-on AI assistant sitting on your desk that costs pennies in electricity to run.
Pi setup: Make sure you have a Raspberry Pi 5 with 8 GB RAM (the 4 GB version will not work well). Use a good-quality microSD card (or better, an NVMe SSD via the Pi's M.2 HAT) and make sure you have adequate cooling — the Pi 5 will thermal throttle under sustained AI workloads without a heatsink or fan.
Older Laptop or Desktop (8–16 GB RAM)
If you have a laptop or desktop from the last five or six years with at least 8 GB of RAM, you are in a much better position. With 16 GB, you can comfortably run 7- to 8-billion-parameter models at 4-bit quantization, which are genuinely useful for coding assistance. Models like llama3.1:8b, codellama:7b, or deepseek-coder:6.7b will run well and produce helpful results.
If you have a dedicated GPU (even an older one with 6+ GB of VRAM), Ollama can offload computation to it, which dramatically speeds up generation. A $200 used GPU can make a local model feel nearly as responsive as a cloud API for small to medium queries.
Modern Desktop (32+ GB RAM or Dedicated GPU)
If you have a modern desktop with 32 GB of RAM or a GPU with 12+ GB of VRAM, you can run 13-billion-parameter models or even larger. At this tier, the local experience starts to rival cloud-hosted models for many coding tasks. Models like codellama:13b or deepseek-coder:33b produce excellent code and can handle complex refactoring, debugging, and architecture discussions.
No matter what hardware you have, start small. Begin with a 1B or 3B model, get everything working, then try larger models if your hardware can handle them. You can always pull a bigger model later — Ollama makes switching models trivial.
4. Installing Ollama
Ollama is an open-source tool that makes running LLMs locally as simple as running a Docker container. It handles downloading models, managing memory, quantization, and exposing a clean API — all in a single binary. It is the easiest path from "I have never run an AI model" to "I have a working local AI" in under five minutes.
Linux (including Raspberry Pi and WSL)
Open your terminal and run:
curl -fsSL https://ollama.com/install.sh | sh
This downloads and runs the official Ollama install script. It detects your system architecture (x86, ARM for Pi), installs the binary, and sets up Ollama as a system service that starts automatically.
After installation, verify it is running:
ollama --version
macOS
Download Ollama from the official website. It installs as a standard macOS application. After installing, open it once from your Applications folder — it will set up the command-line tool automatically. Then verify in your terminal:
ollama --version
Windows
Download the installer from the Ollama website and run it. After installation, open a new PowerShell or Command Prompt window and verify:
ollama --version
WSL users: You have two options. You can install Ollama inside WSL (using the Linux instructions) or install it on Windows directly. If you install on Windows, the Ollama API will be accessible from both Windows and WSL. If you install inside WSL, it stays within your Linux environment. Either works — pick whichever feels more natural.
Pulling Your First Model
Ollama works like a package manager for AI models. You pull a model to download it, then run it to start chatting. Let us start with a small, fast model so you can see results quickly regardless of your hardware:
# Pull a small, fast model (about 2 GB download)
ollama pull llama3.2:3b
This downloads a 3-billion-parameter version of Meta's Llama 3.2, quantized to about 2 GB. It will run on virtually any hardware, including a Raspberry Pi 5. The download might take a few minutes depending on your internet connection.
Once the download finishes, you can see what models you have installed:
ollama list
You should see llama3.2:3b in the list with its size. This model is now stored locally on your machine — you will never need to download it again, and it works completely offline.
5. Your First Conversation
Time to talk to your model. Run:
ollama run llama3.2:3b
You will see a prompt that looks like >>>. Type a question and press Enter:
>>> What is a REST API? Explain it like I'm a beginner.
Watch what happens. The response streams in token by token. On a fast machine, it might feel instant. On a Raspberry Pi, you will see words appear one at a time, maybe a few per second. This is the model running entirely on your hardware. No internet connection is needed. No API key. No cloud server. The neural network is loaded into your RAM and performing billions of mathematical operations to generate each token.
Try a few more prompts to get a feel for the model:
>>> Write a Java method that reverses a string without using StringBuilder.
>>> What is the difference between == and .equals() in Java?
>>> Explain CSS flexbox in three sentences.
Notice the quality of the responses. A 3B model can handle straightforward knowledge questions and simple code generation well. It will start to struggle with very complex or nuanced requests. That is expected — you are running a model that is orders of magnitude smaller than what powers cloud services like ChatGPT.
To exit the chat, type /bye and press Enter.
Trying a Larger Model (If Your Hardware Allows)
If you have 16 GB or more of RAM, try pulling a larger model:
# An 8B model — needs ~6 GB of RAM (about 5 GB download)
ollama pull llama3.1:8b
# Or a model specifically trained for code
ollama pull deepseek-coder:6.7b
Run it the same way and ask the same questions you asked the 3B model. Compare the responses. The larger model will give more detailed, more accurate answers, especially for code. This is the trade-off in action: more parameters equals better quality, but more memory and slower speed.
If your machine freezes or becomes very slow after pulling a model, the model is too large for your available RAM. Press Ctrl + C to stop, then try a smaller model. You can remove a model with ollama rm model-name.
6. The Ollama API
Chatting in the terminal is useful, but the real power comes from the API. When Ollama is running, it exposes a REST API on http://localhost:11434. This should sound familiar — it is the same concept you learned when building your Spring Boot API in the Resumator lessons. Ollama is a server. It accepts HTTP requests. It returns JSON responses. You already know how this works.
Open a new terminal window (keep Ollama running in the background) and try this:
# Check that Ollama is running
curl http://localhost:11434
# You should see: "Ollama is running"
Now send a prompt to the model using the /api/generate endpoint:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Write a JavaScript function that checks if a string is a palindrome.",
"stream": false
}'
The stream: false option tells Ollama to wait until the full response is generated and return it as a single JSON object. Without it, Ollama streams the response as a series of JSON objects, one per token — useful for real-time display, which you will use later.
The response comes back as JSON:
{
"model": "llama3.2:3b",
"response": "Here's a JavaScript function...",
"done": true,
"total_duration": 4521887000,
"eval_count": 156
}
Look at those fields. response contains the generated text. eval_count tells you how many tokens the model generated. total_duration is in nanoseconds — divide by 1,000,000,000 to get seconds. You can calculate tokens per second: eval_count / (total_duration / 1e9). This is a concrete, measurable metric for how fast your hardware runs AI inference.
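If you would rather measure this in code than by hand, here is a small sketch that does the same calculation from JavaScript. It assumes Node.js 18 or newer (for the built-in fetch) and that Ollama is running locally; save it with a .mjs extension so top-level await works, then run it with node.

// measure.mjs: rough tokens-per-second measurement against the local Ollama API
const res = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2:3b',
    prompt: 'Explain what a hash map is in two sentences.',
    stream: false
  })
});

const data = await res.json();
const seconds = data.total_duration / 1e9;          // nanoseconds to seconds
const tokensPerSecond = data.eval_count / seconds;
console.log(`${data.eval_count} tokens in ${seconds.toFixed(1)}s (~${tokensPerSecond.toFixed(1)} tokens/sec)`);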
The Chat API
For multi-turn conversations (where the model remembers what you said before), use the /api/chat endpoint:
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2:3b",
"messages": [
{"role": "system", "content": "You are a helpful coding tutor."},
{"role": "user", "content": "What is a for loop?"},
{"role": "assistant", "content": "A for loop repeats a block of code a specific number of times..."},
{"role": "user", "content": "Can you show me one in Java?"}
],
"stream": false
}'
The messages array contains the entire conversation history. Each message has a role: system (sets the model's behavior), user (your messages), or assistant (the model's previous replies). By including the full history, the model has context about what was discussed before and can give a coherent follow-up response.
This is the exact same pattern used by the OpenAI API, the Anthropic API, and virtually every other AI chat API. The skill you are building right now — sending structured messages to an AI endpoint — transfers directly to any cloud AI service.
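As a bridge to the next section, here is the same chat request expressed in JavaScript rather than curl. This is a minimal, non-streaming sketch you can run in Node.js 18+ or paste into the browser console while Ollama is running.

// Minimal non-streaming chat request, same structure as the curl example above
const res = await fetch('http://localhost:11434/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.2:3b',
    messages: [
      { role: 'system', content: 'You are a helpful coding tutor.' },
      { role: 'user', content: 'Can you show me a for loop in Java?' }
    ],
    stream: false
  })
});

const reply = await res.json();
console.log(reply.message.content);   // the assistant's reply text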
7. Building a Chat Interface
Now you are going to combine everything you have learned in this course — HTML, CSS, and JavaScript — to build a real chat interface that talks to your local Ollama model. This is not a toy. It is a working application that you can use every day while you code.
Create a new directory for this project and open it in VS Code:
mkdir ~/coding-assistant
cd ~/coding-assistant
code .
Create a file called index.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>My Coding Assistant</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<div class="chat-container">
<header class="chat-header">
<h1>Coding Assistant</h1>
<p class="model-name">llama3.2:3b</p>
</header>
<div class="chat-messages" id="chat-messages">
<div class="message assistant">
<p>Hi! I'm your local coding assistant. Ask me anything about programming. I'm running entirely on your machine — nothing leaves your computer.</p>
</div>
</div>
<form class="chat-input" id="chat-form">
<textarea id="user-input" placeholder="Ask a coding question..." rows="2"></textarea>
<button type="submit" id="send-btn">Send</button>
</form>
</div>
<script src="chat.js"></script>
</body>
</html>
This is straightforward HTML. A container with a header, a scrollable message area, and an input form at the bottom. Nothing you have not seen before.
Now create style.css:
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
background: #1a1a2e;
color: #e0e0e0;
height: 100vh;
display: flex;
justify-content: center;
align-items: center;
}
.chat-container {
width: 100%;
max-width: 800px;
height: 100vh;
display: flex;
flex-direction: column;
background: #16213e;
}
.chat-header {
padding: 16px 24px;
background: #0f3460;
border-bottom: 1px solid #1a1a4e;
display: flex;
justify-content: space-between;
align-items: center;
}
.chat-header h1 {
font-size: 18px;
font-weight: 600;
}
.model-name {
font-size: 13px;
color: #7a7adb;
font-family: monospace;
}
.chat-messages {
flex: 1;
overflow-y: auto;
padding: 24px;
display: flex;
flex-direction: column;
gap: 16px;
}
.message {
padding: 12px 16px;
border-radius: 12px;
max-width: 85%;
line-height: 1.6;
white-space: pre-wrap;
word-wrap: break-word;
}
.message p {
margin: 0;
}
.message.user {
background: #0f3460;
align-self: flex-end;
border-bottom-right-radius: 4px;
}
.message.assistant {
background: #1a1a4e;
align-self: flex-start;
border-bottom-left-radius: 4px;
}
.message.assistant code {
background: #0d1117;
padding: 2px 6px;
border-radius: 4px;
font-size: 14px;
}
.message.assistant pre {
background: #0d1117;
padding: 12px;
border-radius: 8px;
overflow-x: auto;
margin: 8px 0;
}
.message.assistant pre code {
background: none;
padding: 0;
}
.chat-input {
padding: 16px 24px;
background: #0f3460;
border-top: 1px solid #1a1a4e;
display: flex;
gap: 12px;
align-items: flex-end;
}
#user-input {
flex: 1;
padding: 12px 16px;
background: #16213e;
border: 1px solid #1a1a4e;
border-radius: 8px;
color: #e0e0e0;
font-size: 15px;
font-family: inherit;
resize: none;
outline: none;
}
#user-input:focus {
border-color: #7a7adb;
}
#send-btn {
padding: 12px 24px;
background: #7a7adb;
color: #fff;
border: none;
border-radius: 8px;
font-size: 15px;
cursor: pointer;
font-weight: 600;
}
#send-btn:hover {
background: #6a6acb;
}
#send-btn:disabled {
background: #3a3a5e;
cursor: not-allowed;
}
.typing-indicator {
color: #7a7adb;
font-style: italic;
}
This gives you a dark-themed chat interface that looks and feels like a real application. The layout uses flexbox — the same CSS concept you learned in the web basics track. Messages from the user align right, messages from the assistant align left.
Now the important part — create chat.js:
const OLLAMA_URL = 'http://localhost:11434/api/chat';
const MODEL = 'llama3.2:3b';
const SYSTEM_PROMPT = `You are a helpful coding assistant. You give clear,
concise answers with code examples when appropriate. When showing code,
always specify the language. Keep explanations beginner-friendly.`;
// Conversation history — sent with every request so the model has context
const messages = [
{ role: 'system', content: SYSTEM_PROMPT }
];
const chatMessages = document.getElementById('chat-messages');
const chatForm = document.getElementById('chat-form');
const userInput = document.getElementById('user-input');
const sendBtn = document.getElementById('send-btn');
// Send message on form submit
chatForm.addEventListener('submit', (e) => {
e.preventDefault();
const text = userInput.value.trim();
if (!text) return;
// Add user message to the conversation
addMessage('user', text);
messages.push({ role: 'user', content: text });
userInput.value = '';
sendBtn.disabled = true;
// Send to Ollama and stream the response
getResponse();
});
// Allow Enter to submit, Shift+Enter for new line
userInput.addEventListener('keydown', (e) => {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
chatForm.dispatchEvent(new Event('submit'));
}
});
function addMessage(role, content) {
const div = document.createElement('div');
div.className = `message ${role}`;
const p = document.createElement('p');
p.textContent = content;
div.appendChild(p);
chatMessages.appendChild(div);
chatMessages.scrollTop = chatMessages.scrollHeight;
return div;
}
async function getResponse() {
// Create a placeholder message for the assistant's response
const responseDiv = addMessage('assistant', '');
const responseP = responseDiv.querySelector('p');
responseP.textContent = 'Thinking...';
responseP.classList.add('typing-indicator');
try {
const response = await fetch(OLLAMA_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: MODEL,
messages: messages,
stream: true
})
});
if (!response.ok) {
throw new Error(`Ollama returned ${response.status}. Is the model running?`);
}
// Read the streaming response
const reader = response.body.getReader();
const decoder = new TextDecoder();
let fullResponse = '';
responseP.classList.remove('typing-indicator');
while (true) {
const { done, value } = await reader.read();
if (done) break;
// Ollama streams newline-delimited JSON objects
const chunk = decoder.decode(value, { stream: true });
const lines = chunk.split('\n').filter(line => line.trim());
for (const line of lines) {
try {
const data = JSON.parse(line);
if (data.message && data.message.content) {
fullResponse += data.message.content;
responseP.textContent = fullResponse;
chatMessages.scrollTop = chatMessages.scrollHeight;
}
} catch (parseErr) {
// Skip malformed JSON chunks
}
}
}
// Save the assistant's full response to conversation history
messages.push({ role: 'assistant', content: fullResponse });
} catch (err) {
responseP.textContent = `Error: ${err.message}`;
responseP.style.color = '#ff6b6b';
}
sendBtn.disabled = false;
userInput.focus();
}
Let us walk through what this code does, because every line connects to something you have already learned:
- The configuration constants: The Ollama URL, model name, and system prompt sit at the top as constants, just like you learned in the module pattern. The system prompt defines how the assistant behaves.
- The messages array: This holds the full conversation history. Every time you or the model speaks, the message gets pushed onto this array. The entire array is sent with each request, which is how the model "remembers" the conversation.
- The submit event listener: Event handling on the form, just like the event handling you learned in JavaScript basics. It prevents the default form submission, grabs the input text, and calls the API.
- The getResponse function: It uses the Fetch API with streaming. Instead of waiting for the entire response, it reads chunks as they arrive using a ReadableStream reader. Each chunk is a JSON object containing one or more tokens. This is why the response appears to type itself out in real time.
Open the index.html file in your browser. You can do this by double-clicking it in your file manager, or if you have a local server set up:
cd ~/coding-assistant
python3 -m http.server 8080
Then open http://localhost:8080 in your browser. Make sure Ollama is running in the background (ollama serve if it is not running as a system service). Type a question and watch the response stream in.
CORS note: If you open the HTML file directly (via file://), some browsers may block the request to localhost:11434. If you get a CORS error, use the Python HTTP server method above. Ollama allows cross-origin requests from localhost by default.
Congratulations. You just built a working AI chat application. The HTML structures the page. The CSS makes it look professional. The JavaScript handles user interaction, API communication, and real-time streaming. Every skill from the web basics track came together here.
8. Crafting a Coding Assistant
Right now, your chat interface uses a generic system prompt. It works, but you can make it much more useful by crafting a prompt that turns the model into a specialized coding assistant. This is called prompt engineering, and it is one of the most valuable skills you can develop as a developer working with AI.
The System Prompt
The system prompt is the first message in the conversation. It sets the model's persona, defines its boundaries, and shapes every response that follows. A good system prompt is specific, includes examples of the behavior you want, and sets clear expectations.
Try replacing the SYSTEM_PROMPT in your chat.js with something more detailed:
const SYSTEM_PROMPT = `You are a coding tutor helping a student who is learning
web development (HTML, CSS, JavaScript) and backend development (Java, Spring Boot).
Rules:
- When the student asks a question, explain the concept FIRST, then show code.
- Always specify the programming language in code blocks.
- If the student shares code with a bug, do not just give the fix — explain
what is wrong and why, so they learn from it.
- Keep explanations concise. Use analogies when they help.
- If you are not sure about something, say so. Do not make up information.
- When showing Java code, use the conventions from the course: meaningful variable
names, proper indentation, and comments only where the logic is not obvious.`;
The difference in output quality can be dramatic, especially with smaller models. Without a system prompt, a 3B model might give you a generic textbook answer. With a well-crafted prompt, the same model tailors its responses to your specific learning context.
Sending Code Context
The most useful thing a coding assistant can do is help you with your code. When you are stuck on a bug or want feedback on your implementation, you can paste your code directly into the chat. The model sees it as part of the conversation and responds in context.
Try pasting something like this into your chat interface:
Here is my Java method. It is supposed to find the average of an array
of integers, but it returns 0 for the input [1, 2, 3, 4, 5]. What is wrong?
public static int average(int[] numbers) {
int sum = 0;
for (int i = 0; i < numbers.length; i++) {
sum += numbers[i];
}
return sum / numbers.length;
}
Even a 3B model should be able to spot the integer division bug here (dividing two int values discards the decimal). The system prompt you wrote tells the model to explain the bug, not just fix it, which makes the interaction educational rather than just transactional.
Understanding the Limits
Local models are incredibly useful, but they are not magic. Here is an honest assessment of what to expect:
- 1–3B models — Good for: explaining concepts, simple code generation, answering factual programming questions. Struggle with: multi-file refactoring, complex debugging, architectural decisions, anything requiring deep context.
- 7B models — Good for: code review, bug identification, writing functions, explaining error messages. Struggle with: very large codebases, subtle logic errors, performance optimization advice.
- 13B+ models — Genuinely useful for most day-to-day coding tasks. Can handle code review, refactoring suggestions, test writing, and architectural discussions. Still not as capable as the largest cloud models for truly complex reasoning.
The right mental model is this: a local AI is like a junior developer who has read a lot of documentation. It is fast at recalling syntax, good at spotting common patterns, and helpful for rubber-duck debugging. But it will not architect your system for you, and you should always verify its suggestions by actually running the code.
When to use cloud vs. local: Use your local model for quick questions, code snippets, concept explanations, and anything involving proprietary code. Use cloud-hosted models (when appropriate and allowed) for complex multi-step reasoning, large-scale refactoring, or when you need the highest quality output. Many professional developers use both — local for speed and privacy, cloud for heavy lifting.
9. Going Further
You now have a working local AI assistant and the knowledge to understand what is happening under the hood. Here are some directions you can take this:
Always-On Assistant
If you set up Ollama on a Raspberry Pi or a dedicated machine, it can run 24/7 on your network. Other devices on your local network can connect to it by using the machine's local IP address instead of localhost. To enable this, set the OLLAMA_HOST environment variable:
# Allow connections from other devices on your network
OLLAMA_HOST=0.0.0.0 ollama serve
Then update the OLLAMA_URL in your chat.js to point to the machine's IP address (e.g., http://192.168.1.50:11434/api/chat). Now you have a private AI server running on your network that you can access from your laptop, your phone, or any device with a browser.
Try Different Models
Ollama has a library of models you can experiment with. Some are general-purpose, others are specialized for code:
# A model specifically trained for code
ollama pull deepseek-coder:6.7b
# A model good at following instructions
ollama pull mistral
# See all available models
ollama list
# Remove a model you no longer need
ollama rm model-name
Each model has different strengths. Try asking the same coding question to different models and compare the responses. You will quickly develop an intuition for which models work best for which tasks.
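One way to make that comparison systematic is to script it. The sketch below (Node.js 18+, saved as a .mjs file, non-streaming) sends the same prompt to a list of models and prints each response with its timing. Adjust the model names to whatever ollama list shows on your machine.

// compare.mjs: send one prompt to several local models and compare the answers
const models = ['llama3.2:3b', 'deepseek-coder:6.7b'];   // adjust to your installed models
const prompt = 'Write a JavaScript function that removes duplicates from an array.';

for (const model of models) {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt, stream: false })
  });
  const data = await res.json();
  const seconds = (data.total_duration / 1e9).toFixed(1);
  console.log(`\n=== ${model} (${seconds}s) ===\n${data.response}`);
}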
Enhance Your Chat Interface
The chat interface you built is functional, but there is plenty of room to make it better. Consider adding:
- Markdown rendering — Model responses often include markdown formatting. A library like marked.js can render headings, bold text, and code blocks properly.
- Syntax highlighting — Use a library like highlight.js to color-code the code blocks in responses.
- Model selector — A dropdown that lets you switch between models without changing the code.
- Conversation history — Save conversations to localStorage so they persist between sessions (you already know how to do this from the progress tracking module); a minimal sketch follows below.
- Copy button — A button on code blocks that copies the code to your clipboard.
Each of these improvements is a small project in itself, and each one uses skills you have already learned. This is what real software development looks like — you build a working version first, then iterate.
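For the conversation-history idea in the list above, here is a minimal sketch assuming the messages array from chat.js. The storage key name is arbitrary; call saveConversation() after each completed reply and loadConversation() when the page loads.

// Persist the conversation across page reloads using localStorage
const STORAGE_KEY = 'coding-assistant-history';   // arbitrary key name

function saveConversation(messages) {
  localStorage.setItem(STORAGE_KEY, JSON.stringify(messages));
}

function loadConversation() {
  const saved = localStorage.getItem(STORAGE_KEY);
  return saved ? JSON.parse(saved) : null;   // null if nothing was saved yet
}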
Knowledge Check
1. A model has 7 billion parameters and is quantized to 4-bit precision. Approximately how much RAM will it need?
Correct answer: About 4 GB. At full 16-bit precision, each parameter takes 2 bytes, so 7B parameters = ~14 GB. Quantizing to 4-bit reduces this by roughly 4x, bringing it down to about 3.5–4 GB. This is why quantization is essential for running models on consumer hardware.
2. What protocol does Ollama use to expose its local API?
Correct answer: HTTP REST. Ollama exposes a REST API at http://localhost:11434 that accepts JSON requests and returns JSON responses. This is the same request/response pattern you used when building the Resumator API with Spring Boot.
3. In the chat API, what is the purpose of the system role message?
Correct answer: It sets the model's behavior, persona, and instructions. The system prompt is the first message in the conversation and shapes how the model responds to everything that follows. A well-crafted system prompt can dramatically improve the usefulness of even small models.
Deliverable
Your deliverable for this side quest is the working chat interface you built in section 7, customized with your own system prompt from section 8.
Your project should include:
- index.html — the chat interface structure
- style.css — the styling (customize the colors and layout to make it yours)
- chat.js — the JavaScript that communicates with Ollama
Beyond the working interface, you should be able to:
- Explain what tokens, parameters, and quantization mean in the context of LLMs
- Describe why a model that fits in RAM outperforms a larger model that does not
- Use the Ollama CLI to pull, run, list, and remove models
- Send requests to the Ollama REST API using curl
- Explain the difference between the system, user, and assistant roles in the chat API
- Articulate when to use a local model vs. a cloud-hosted model
AI is not replacing developers. But developers who understand how AI works — who can run it, configure it, build interfaces for it, and reason about its strengths and limitations — will have a significant advantage. You just built that understanding from the ground up, on your own hardware, with your own code. That is something no cloud API tutorial can give you.