Docs
Everything you need to know about running hoof models.
What you get
Each hoof model is delivered as a standalone executable — a single file that contains the model weights, tokeniser, inference engine, and a built-in web UI. No Python, no GPU drivers, no internet connection required. Double-click and it runs.
The executable starts a local web server and opens a chat interface in your browser. All processing happens on your machine — nothing is sent to the cloud.
Under the hood
Each executable is powered by the hoof runtime — a lightweight inference engine written in Rust. It handles:
- Loading and decompressing the model weights at startup
- Tokenisation (byte-level BPE, compatible with LLaMA, GPT-2, and similar models)
- Transformer inference with SIMD acceleration (AVX2 + FMA on x86)
- Quantised execution — 8-bit or 4-bit weights, dequantised on the fly
- LoRA adapter application — finetuned corrections are baked into the model
- A local web server with streaming generation and chat history
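The quantised-execution step can be illustrated with a small sketch. This is not the hoof runtime's actual scheme; it assumes simple symmetric 8-bit quantisation with one scale factor per tensor, where each stored integer weight is converted back to a float only at the moment it is used.

```python
# Illustrative sketch of on-the-fly dequantisation (an assumption for
# illustration, not the actual hoof quantisation scheme): symmetric
# 8-bit weights with a single per-tensor scale, applied inside a dot
# product so no full-precision copy of the weights is ever materialised.

def quantise(weights, bits=8):
    """Map float weights to signed integers plus a scale factor."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dot_dequantised(q, scale, x):
    """Dot product that dequantises each weight as it is consumed."""
    return sum((qi * scale) * xi for qi, xi in zip(q, x))

w = [0.5, -1.0, 0.25, 0.75]
x = [1.0, 2.0, 3.0, 4.0]
q, scale = quantise(w)
approx = dot_dequantised(q, scale, x)
exact = sum(wi * xi for wi, xi in zip(w, x))
# approx tracks exact closely; quantisation adds only a small error
```

The same idea extends to 4-bit weights with per-block scales; the trade-off is smaller files and less memory bandwidth in exchange for a bounded rounding error.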
The runtime is not currently available separately — it's embedded in each executable we deliver. This keeps the models self-contained and ensures consistent behaviour across deployments.
The .hoof format
Internally, each executable contains a .hoof model file: a custom binary format that stores everything needed for inference in a single file.
Web API
Each executable exposes a local REST API alongside the web UI. This means you can integrate hoof models into your own applications:
- /health — Liveness check
- /api/info — Model metadata
- /api/generate — Run inference (JSON response)
- /api/generate/stream — Streaming inference (SSE)

System requirements
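Because the API is plain HTTP on your own machine, it can be scripted from any client. The sketch below is a hedged Python example: the port (8080) and the request and response field names are assumptions for illustration, not a documented schema, so check /api/info on your own build. The SSE parser is demonstrated on a canned byte stream rather than a live server.

```python
# Hedged sketch of a local API client. The port and the "prompt" field
# name are assumptions for illustration; consult /api/info for the
# schema your build actually exposes.
import json
import urllib.request

def generate(prompt, base="http://localhost:8080"):
    """POST a prompt to /api/generate and return the parsed JSON reply."""
    req = urllib.request.Request(
        base + "/api/generate",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def parse_sse(raw):
    """Extract the data payloads from a Server-Sent Events byte stream,
    the framing /api/generate/stream uses for streaming inference."""
    events = []
    for line in raw.decode("utf-8").splitlines():
        if line.startswith("data: "):
            events.append(line[len("data: "):])
    return events

# Offline demonstration of the SSE framing (events separated by blank lines):
sample = b"data: Hello\n\ndata: world\n\n"
tokens = parse_sse(sample)
```

With a model running locally, `generate("Hello")` would hit the JSON endpoint, while a streaming client would read the SSE response line by line and feed it through `parse_sse`-style handling to receive tokens as they are produced.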
Need a model built for your use case?
Make an Enquiry