Running llama.cpp in Docker on Raspberry Pi

Running large language models on a Raspberry Pi isn’t just possible—it’s fun. Whether you're a hacker exploring local AI, a developer prototyping LLM workflows, or just curious about how far you can push a Pi, this tutorial is for you.

We’ll show you how to build and run llama.cpp in Docker on an ARM-based Pi to get a full LLM experience in a tiny, reproducible container. No weird dependencies. No system pollution. Just clean, fast, edge-side inference.

If you're looking for a bare-metal installation on the Raspberry Pi instead, check out this guide: https://rmauro.dev/running-llm-llama-cpp-natively-on-raspberry-pi/

Dockerfile

The following Dockerfile builds llama.cpp from source within an Ubuntu 22.04 base image. It includes all required dependencies and sets the container entrypoint to the compiled CLI binary.

# Ubuntu 22.04 base image (arm64 when built on the Pi)
FROM ubuntu:22.04

# Avoid interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive

# Install the build dependencies for llama.cpp
RUN apt update && apt upgrade -y && \
    apt install -y --no-install-recommends \
    ca-certificates git build-essential cmake wget curl \
    libcurl4-openssl-dev && \
    apt clean && rm -rf /var/lib/apt/lists/*

# Fetch the llama.cpp sources
WORKDIR /opt
RUN git clone https://github.com/ggerganov/llama.cpp.git
WORKDIR /opt/llama.cpp

# Configure and compile in Release mode using all available cores
RUN cmake -B build
RUN cmake --build build --config Release -j$(nproc)

# Start in the directory containing the compiled binaries and run the CLI by default
WORKDIR /opt/llama.cpp/build/bin
ENTRYPOINT ["./llama-cli"]

Build the Docker Image

Run the following command in the same directory as your Dockerfile to build the image:

docker build -t llama-cpp-pi .
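
On a Raspberry Pi the compile step can take a while. Once the build finishes, a quick sanity check confirms the image exists and the CLI entrypoint responds:

docker images llama-cpp-pi
docker run --rm llama-cpp-pi --help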

Download a Quantized Model (on Host)

You need a quantized .gguf model to perform inference. Run this command from your host system:

mkdir -p models
wget -O models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf

This creates a models directory and downloads the 4-bit (Q4_0) GGUF quantization of TinyLlama 1.1B Chat, a model small enough to run comfortably on edge devices.
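
Before starting the container, you can confirm the file arrived intact; the Q4_0 quantization of TinyLlama 1.1B should come in at roughly 0.6 GB:

ls -lh models/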

Run Inference from Docker

Mount the models directory and run the container, specifying the model and prompt:

docker run --rm -it \
  -v $(pwd)/models:/models \
  llama-cpp-pi \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf -p "Hello from Docker!"
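
Generation can be tuned with the usual llama-cli flags, for example limiting the number of predicted tokens and matching the thread count to the Pi's four cores (the values below are just a starting point):

docker run --rm -it \
  -v $(pwd)/models:/models \
  llama-cpp-pi \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  -p "Explain Docker in one sentence." \
  -n 128 -t 4 -c 2048 --temp 0.7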

To use a different model:

MODEL=your-model-name.gguf

docker run --rm -it \
  -v $(pwd)/models:/models \
  llama-cpp-pi \
  -m /models/$MODEL -p "Hello with custom model!"
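
Optional: Run the llama.cpp Server

The image can also serve an HTTP API, since the build produces llama-server next to llama-cli in the same directory. Here is a minimal sketch that overrides the entrypoint and publishes the server's default port (8080):

docker run --rm -it \
  -p 8080:8080 \
  -v $(pwd)/models:/models \
  --entrypoint ./llama-server \
  llama-cpp-pi \
  -m /models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf \
  --host 0.0.0.0 --port 8080

Once it is up, other machines on your network can send requests to port 8080 on the Pi.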

Conclusion

This Docker-based setup enables efficient deployment of llama.cpp on ARM-based devices like the Raspberry Pi.

It abstracts away system-level configuration while preserving the flexibility to swap models, test prompts, or integrate with other AI pipelines.

For developers, researchers, and students, this is an ideal workflow to explore the capabilities of local LLM inference.