Docs

Everything you need to know about running hoof models.

What you get

Each hoof model is delivered as a standalone executable — a single file that contains the model weights, tokeniser, inference engine, and a built-in web UI. No Python, no GPU drivers, no internet connection required. Double-click and it runs.

# Windows
joke-teller.exe
# macOS / Linux
chmod +x joke-teller
./joke-teller

The executable starts a local web server and opens a chat interface in your browser. All processing happens on your machine — nothing is sent to the cloud.

Under the hood

Each executable is powered by the hoof runtime — a lightweight inference engine written in Rust. It handles:

  • Loading and decompressing the model weights at startup
  • Tokenisation (byte-level BPE, compatible with LLaMA, GPT-2, and similar models)
  • Transformer inference with SIMD acceleration (AVX2 + FMA on x86)
  • Quantised execution — 8-bit or 4-bit weights, dequantised on the fly
  • LoRA adapter application — finetuned corrections are baked into the model
  • A local web server with streaming generation and chat history
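To make the LoRA step concrete: merging an adapter into a base weight matrix amounts to adding a scaled low-rank product, W' = W + (alpha/rank) · B·A. A minimal dependency-free sketch follows — the function name, argument layout, and alpha/rank scaling follow the common LoRA convention and are assumptions, not the runtime's actual internals:

```python
def merge_lora(W, A, B, alpha: float, rank: int):
    """Fold a LoRA adapter into a base weight matrix: W' = W + (alpha/rank) * B @ A.

    Shapes (illustrative): W is d_out x d_in, A is rank x d_in, B is d_out x rank.
    Plain nested lists keep the sketch self-contained; a real runtime would
    use a tensor library and do this once at load time.
    """
    scale = alpha / rank
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]  # copy so the base weights stay untouched
    for i in range(d_out):
        for j in range(d_in):
            merged[i][j] += scale * sum(B[i][r] * A[r][j] for r in range(rank))
    return merged
```

Because the product B·A has rank at most `rank`, the adapter stores far fewer numbers than a full weight delta, which is why corrections can be shipped and baked in cheaply.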

The runtime is not currently available separately — it's embedded in each executable we deliver. This keeps the models self-contained and ensures consistent behaviour across deployments.
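As a rough illustration of on-the-fly dequantisation: 4-bit weights are typically stored as packed integer codes plus a per-block scale, and recovered at inference time with one multiply per weight. The block size, zero-point convention, and function names below are assumptions for illustration, not the actual .hoof layout:

```python
def quantize_q4_block(weights: list[float]) -> tuple[float, list[int]]:
    """Compress one block of weights to 4-bit codes (0..15) plus a scale.

    The scale is chosen so the largest-magnitude weight maps to +/-7;
    codes are stored offset by 8 so they fit in an unsigned nibble.
    """
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid scale 0 for all-zero blocks
    codes = [max(0, min(15, round(w / scale) + 8)) for w in weights]
    return scale, codes

def dequantize_q4_block(scale: float, codes: list[int]) -> list[float]:
    """Recover approximate weights: recentre each code around zero, then rescale."""
    return [scale * (c - 8) for c in codes]
```

The round trip is lossy — each weight is recovered to within about half a scale step — which is the storage-for-precision trade that makes 4-bit execution practical on CPU.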

The .hoof format

Internally, each executable contains a .hoof model file — a custom binary format that stores everything needed for inference in a single file:

Model weights: Quantised (Q4K or Q8) for efficient storage and fast loading
Tokeniser: Vocabulary, merge table, and special tokens
Configuration: Architecture params, chat template, RoPE settings
LoRA adapters: Finetuned correction layers (when applicable)

Web API

Each executable exposes a local REST API alongside the web UI. This means you can integrate hoof models into your own applications:

GET /health Liveness check
GET /api/info Model metadata
POST /api/generate Run inference (JSON response)
POST /api/generate/stream Streaming inference (SSE)
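A minimal client sketch for the streaming endpoint. The port, request fields (`prompt`), and per-event payload shape (`token`) are assumptions about the schema — check /api/info on your build. The SSE parsing is split into its own helper so it can be exercised without a running server:

```python
import json
import urllib.request

def parse_sse_events(raw: str) -> list[dict]:
    """Split a Server-Sent Events body into decoded JSON payloads.

    SSE events are separated by blank lines; each data line carries a
    'data: ' prefix. Assumes every data payload is a JSON object.
    """
    events = []
    for block in raw.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data: "):
                events.append(json.loads(line[len("data: "):]))
    return events

def stream_generate(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to /api/generate/stream and concatenate the streamed tokens.

    'prompt' in the request and 'token' in each event are guessed field
    names, not documented ones.
    """
    req = urllib.request.Request(
        f"{base_url}/api/generate/stream",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        events = parse_sse_events(resp.read().decode())
    return "".join(e.get("token", "") for e in events)
```

Because everything runs locally, there is no API key or TLS to configure — the only moving part is the port the executable chose at startup.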

System requirements

OS: Windows 10+, macOS, Linux
CPU: Any modern 64-bit (AVX2 recommended)
RAM: 4 GB minimum, 8 GB recommended
GPU: Not required

Need a model built for your use case?

Make an Enquiry