How hoof works

Large language models are trained to do everything. Your task only needs a fraction of that capability. hoof finds and keeps exactly that fraction.

The problem with general-purpose models

A 7B-parameter model weighs 14 GB and requires a high-end GPU, or falls back to slow CPU inference. It knows how to write poetry, debug code, answer trivia, translate languages, and thousands of other things. If you only need it to translate English to French, you're carrying 13.5 GB of capability you'll never use.
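The 14 GB figure is just parameter count times bytes per parameter, assuming the common 16-bit (2-byte) weight format. A minimal sketch:

```python
def model_size_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Back-of-envelope checkpoint size: parameters x bytes per parameter."""
    return n_params * bytes_per_param / 1e9

print(model_size_gb(7e9))  # 14.0
```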

01

Task analysis

hoof parses your task description to understand exactly what the model needs to do — which languages, domains, and output types are required. This shapes every decision that follows.
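As an illustration only (the function and keyword table below are hypothetical, not hoof's actual analysis), a task spec can be thought of as a small structured record extracted from the description:

```python
def analyse_task(description: str) -> dict:
    """Naive keyword scan -- a toy stand-in for real task analysis."""
    d = description.lower()
    # hypothetical keyword table; a real analyser would be far richer
    languages = {code for word, code in
                 [("english", "en"), ("french", "fr"), ("japanese", "ja")]
                 if word in d}
    domains = {"code"} if ("debug" in d or "python" in d) else {"general"}
    return {"languages": languages, "domains": domains}

spec = analyse_task("Translate English to French")
print(spec)  # {'languages': {'en', 'fr'}, 'domains': {'general'}}
```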

02

Vocabulary pruning

Most tokens in a model's vocabulary are irrelevant to your task. A French translator doesn't need tokens for Python syntax, emoji, or Japanese characters. hoof removes them, shrinking the embedding tables by up to 70%.
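Mechanically, pruning the vocabulary means keeping only the embedding rows for tokens the task actually uses, plus a remapping from old to new token ids. A sketch under that assumption (the 25% keep-rate below is illustrative):

```python
import numpy as np

def prune_embeddings(emb: np.ndarray, task_token_ids):
    """Keep only embedding rows for tokens observed in task data."""
    keep = sorted(set(task_token_ids))
    remap = {old: new for new, old in enumerate(keep)}
    return emb[keep], remap

vocab, dim = 32_000, 8
emb = np.zeros((vocab, dim), dtype=np.float32)
task_ids = range(0, vocab, 4)              # pretend 25% of tokens occur in task text
small, remap = prune_embeddings(emb, task_ids)
print(small.shape)  # (8000, 8) -- a 75% row reduction
```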

03

Attention head pruning

Transformer models have many attention heads — parallel mechanisms that learn different relationships in text. Many are redundant or task-irrelevant. hoof uses activation analysis to identify and remove them.
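One simple form of activation analysis (a sketch, not necessarily hoof's exact criterion) scores each head by its mean activation magnitude on task examples and drops the quietest:

```python
import numpy as np

def head_importance(head_acts: np.ndarray) -> np.ndarray:
    """Mean |activation| per head; head_acts is (n_heads, n_tokens, d_head)."""
    return np.abs(head_acts).mean(axis=(1, 2))

def heads_to_prune(scores: np.ndarray, keep_ratio: float = 0.5) -> list:
    """Indices of the lowest-scoring heads, pruning down to keep_ratio."""
    n_drop = len(scores) - max(1, int(len(scores) * keep_ratio))
    return sorted(np.argsort(scores)[:n_drop].tolist())

# four heads with constant activation levels 0.1, 0.9, 0.2, 0.8
acts = np.stack([np.full((4, 2), h) for h in [0.1, 0.9, 0.2, 0.8]])
print(heads_to_prune(head_importance(acts)))  # [0, 2] -- the two quietest heads
```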

04

Layer reduction

Deep models have many layers. For focused tasks, layers in the middle of the network often contribute little. hoof measures each layer's contribution to your specific task and removes or merges the least important ones.
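One way to measure a layer's contribution (an illustrative metric, not necessarily the one hoof uses) is the cosine distance between its input and output activations: a layer whose output nearly equals its input barely changes the residual stream and is a pruning candidate.

```python
import numpy as np

def layer_contribution(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """1 - cosine similarity of a layer's input and output activations.
    Near zero means the layer is close to a pass-through."""
    cos = float((h_in * h_out).sum() /
                (np.linalg.norm(h_in) * np.linalg.norm(h_out)))
    return 1.0 - cos

h = np.array([1.0, 2.0, 3.0])
print(round(layer_contribution(h, h), 6))       # 0.0 -- a pass-through layer
print(layer_contribution(h, h + 5.0) > 0.0)     # True -- this layer does work
```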

05

Quantisation

The remaining weights are compressed from 32-bit floats to 8-bit or 4-bit integers — a further 4–8× size reduction with minimal quality loss on task-specific benchmarks.
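The core of int8 quantisation is a single scale per tensor: divide by the scale, round to integers, and multiply back at inference time. A minimal symmetric sketch:

```python
import numpy as np

def quantise_int8(w: np.ndarray):
    """Symmetric per-tensor quantisation: float weights -> int8 plus one scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([-0.5, 0.02, 0.49], dtype=np.float32)
q, s = quantise_int8(w)
print(w.nbytes, "->", q.nbytes, "bytes")              # 12 -> 3 bytes: 4x smaller
print(np.abs(dequantise(q, s) - w).max() <= s / 2)    # True: error <= half a step
```

Production quantisers add per-channel scales and calibration data, but the size arithmetic is the same: int8 is 4× smaller than float32, int4 is 8× smaller.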

06

Knowledge distillation

Surgery alone degrades quality: removing layers and heads causes the model to drift from its original output distribution. hoof recovers this quality by training small LoRA adapter layers that learn to match the original model's predictions on task-relevant examples. This step brings perplexity back close to the original — a 97 → 17 recovery on TinyLlama 1.1B — without changing the file size.
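The two ingredients here are standard: a low-rank LoRA update added to a frozen weight, and a KL-divergence loss pulling the student's output distribution toward the teacher's. A toy sketch (shapes and names are illustrative, not hoof's training loop):

```python
import numpy as np

def lora_apply(x: np.ndarray, W: np.ndarray,
               A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Student forward pass: frozen weight W plus low-rank update B @ A."""
    return x @ (W + B @ A).T

def kl_distill_loss(student_logits: np.ndarray,
                    teacher_logits: np.ndarray) -> float:
    """KL(teacher || student), the usual distillation objective."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    lp_t, lp_s = log_softmax(teacher_logits), log_softmax(student_logits)
    return float((np.exp(lp_t) * (lp_t - lp_s)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
print(kl_distill_loss(logits, logits))       # 0.0 -- matching the teacher exactly
print(kl_distill_loss(2 * logits, logits) > 0.0)  # True -- drift costs loss
```

Training adjusts only A and B to drive this loss down, which is why quality recovers without the file growing.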

07

Packaging

The reduced, fine-tuned model is embedded into a single native executable alongside a lightweight inference engine and local web UI. No Python runtime, no model files, no environment setup. Just one file.

The result

A model that runs on any laptop CPU, works fully offline, and retains ~95% accuracy on its target task. For 7B+ architectures with full vocabulary pruning, size reductions of 10–15× are typical. Smaller models see more modest reductions — the value is in the packaging, the speed, and the quality recovery from distillation.

See real benchmark results →