Running llama.cpp in Docker on Raspberry Pi

Running large language models on a Raspberry Pi isn’t just possible—it’s fun. Whether you're a hacker exploring local AI, a developer prototyping LLM workflows, or just curious about how far you can push a Pi, this tutorial is for you.
We’ll show you how to build and run llama.cpp in Docker on an ARM-based Pi to get a full LLM experience in a tiny, reproducible container. No weird dependencies. No system pollution. Just clean, fast, edge-side inference.
If you’re looking for a bare-metal installation on the Raspberry Pi instead, check out https://rmauro.dev/running-llm-llama-cpp-natively-on-raspberry-pi/
Dockerfile
The following Dockerfile builds llama.cpp from source within an Ubuntu 22.04 base image. It includes all required dependencies and sets the container entrypoint to the compiled CLI binary.
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt update && apt upgrade -y && \
    apt install -y --no-install-recommends \
    ca-certificates git build-essential cmake wget curl \
    libcurl4-openssl-dev && \
    apt clean && rm -rf /var/lib/apt/lists/*
WORKDIR /opt
RUN git clone https://github.com/ggerganov/llama.cpp.git
WORKDIR /opt/llama.cpp
RUN cmake -B build
RUN cmake --build build --config Release -j$(nproc)
WORKDIR /opt/llama.cpp/build/bin
ENTRYPOINT ["./llama-cli"]
Build the Docker Image
Run the following command in the same directory as your Dockerfile to build the image:
docker build -t llama-cpp-pi .
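The build can take a while on a Pi, since llama.cpp is compiled from source inside the container. Once it finishes, a quick sanity check confirms the image exists and was built for the ARM architecture:
# List the freshly built image
docker image ls llama-cpp-pi

# Show the architecture the image targets (should report arm64 on 64-bit Raspberry Pi OS)
docker image inspect llama-cpp-pi --format '{{.Architecture}}'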
Download a Quantized Model (on Host)
You need a quantized .gguf model to perform inference. Run this command from your host system:
mkdir -p models
wget -O models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf
This creates a models directory and downloads a compact, Q4_0-quantized build of TinyLlama that is small enough for edge devices.
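Before moving on, it’s worth checking that the download completed; a truncated file will typically make llama.cpp fail with a model-loading error. The Q4_0 TinyLlama file should weigh in at a few hundred megabytes:
# Confirm the model file exists and has a plausible size
ls -lh models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf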
Run Inference from Docker
Mount the models directory and run the container, specifying the model and prompt:
docker run --rm -it \
  -v $(pwd)/models:/models \
  llama-cpp-pi \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf -p "Hello from Docker!"
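Everything after the image name is passed straight to llama-cli, so the usual llama.cpp flags are available. On a Pi it often helps to cap the number of generated tokens and match the thread count to the number of CPU cores; the values below are just starting points to experiment with.
# -n limits generated tokens, -t sets CPU threads, -c sets the context window size
docker run --rm -it \
  -v $(pwd)/models:/models \
  llama-cpp-pi \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  -p "Hello from Docker!" \
  -n 128 -t 4 -c 2048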
To use a different model:
MODEL=your-model-name.gguf
docker run --rm -it \
  -v $(pwd)/models:/models \
  llama-cpp-pi \
  -m /models/$MODEL -p "Hello with custom model!"
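The image’s entrypoint is the interactive CLI, but the default CMake build normally also produces llama-server, llama.cpp’s HTTP server, in the same bin directory. If you’d rather talk to the model over HTTP, you can override the entrypoint and publish a port; this is a sketch that assumes llama-server was built alongside llama-cli, which the default configuration typically does.
# Run the bundled HTTP server instead of the CLI and expose it on port 8080
docker run --rm -it \
  -p 8080:8080 \
  -v $(pwd)/models:/models \
  --entrypoint ./llama-server \
  llama-cpp-pi \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  --host 0.0.0.0 --port 8080
Once it’s running, you can open the built-in web UI at http://<pi-address>:8080 from another machine on your network, or call its completion endpoint from scripts.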
Conclusion
This Docker-based setup enables efficient deployment of llama.cpp on ARM-based devices like the Raspberry Pi.
It abstracts away system-level configuration while preserving the flexibility to swap models, test prompts, or integrate with other AI pipelines.
For developers, researchers, and students, this is an ideal workflow to explore the capabilities of local LLM inference.