Running llama.cpp in Docker on Raspberry Pi

Running large language models on a Raspberry Pi isn’t just possible—it’s fun. Whether you're a hacker exploring local AI, a developer prototyping LLM workflows, or just curious about how far you can push a Pi, this tutorial is for you.

We’ll show you how to build and run llama.cpp in Docker on an ARM-based Pi to get a full LLM experience in a tiny, reproducible container. No weird dependencies. No system pollution. Just clean, fast, edge-side inference.

If you're looking for a bare-metal installation on the Raspberry Pi instead, check out this guide: https://rmauro.dev/running-llm-llama-cpp-natively-on-raspberry-pi/

Dockerfile

The following Dockerfile builds llama.cpp from source within an Ubuntu 22.04 base image. It includes all required dependencies and sets the container entrypoint to the compiled CLI binary.

# Ubuntu 22.04 base image (arm64 when built on the Pi)
FROM ubuntu:22.04

# Avoid interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive

# Install the build dependencies for llama.cpp
RUN apt update && apt upgrade -y && \
    apt install -y --no-install-recommends \
    ca-certificates git build-essential cmake wget curl \
    libcurl4-openssl-dev && \
    apt clean && rm -rf /var/lib/apt/lists/*

# Fetch the llama.cpp sources
WORKDIR /opt
RUN git clone https://github.com/ggerganov/llama.cpp.git
WORKDIR /opt/llama.cpp

# Configure and compile in Release mode using all available cores
RUN cmake -B build
RUN cmake --build build --config Release -j$(nproc)

# Start in the directory containing the compiled binaries and run the CLI by default
WORKDIR /opt/llama.cpp/build/bin
ENTRYPOINT ["./llama-cli"]

Build the Docker Image

Run the following command in the same directory as your Dockerfile to build the image:

docker build -t llama-cpp-pi .
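
On a Raspberry Pi the compile step can take a while. Once the build finishes, a quick sanity check confirms the image exists and the CLI entrypoint responds:

docker images llama-cpp-pi
docker run --rm llama-cpp-pi --help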

Download a Quantized Model (on Host)

You need a quantized .gguf model to perform inference. Run this command from your host system:

mkdir -p models
wget -O models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf

This creates a models directory and downloads the 4-bit (Q4_0) GGUF quantization of TinyLlama 1.1B Chat, a model small enough to run comfortably on edge devices.
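
Before starting the container, you can confirm the file arrived intact; the Q4_0 quantization of TinyLlama 1.1B should come in at roughly 0.6 GB:

ls -lh models/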

Run Inference from Docker

Mount the models directory and run the container, specifying the model and prompt:

docker run --rm -it \
  -v $(pwd)/models:/models \
  llama-cpp-pi \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf -p "Hello from Docker!"
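
Generation can be tuned with the usual llama-cli flags, for example limiting the number of predicted tokens and matching the thread count to the Pi's four cores (the values below are just a starting point):

docker run --rm -it \
  -v $(pwd)/models:/models \
  llama-cpp-pi \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  -p "Explain Docker in one sentence." \
  -n 128 -t 4 -c 2048 --temp 0.7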

To use a different model:

MODEL=your-model-name.gguf

docker run --rm -it \
  -v $(pwd)/models:/models \
  llama-cpp-pi \
  -m /models/$MODEL -p "Hello with custom model!"
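
Optional: Run the llama.cpp Server

The image can also serve an HTTP API, since the build produces llama-server next to llama-cli in the same directory. Here is a minimal sketch that overrides the entrypoint and publishes the server's default port (8080):

docker run --rm -it \
  -p 8080:8080 \
  -v $(pwd)/models:/models \
  --entrypoint ./llama-server \
  llama-cpp-pi \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  --host 0.0.0.0 --port 8080

Once it is up, other machines on your network can send requests to port 8080 on the Pi.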

Conclusion

This Docker-based setup enables efficient deployment of llama.cpp on ARM-based devices like the Raspberry Pi.

It abstracts away system-level configuration while preserving the flexibility to swap models, test prompts, or integrate with other AI pipelines.

For developers, researchers, and students, this is an ideal workflow to explore the capabilities of local LLM inference.