Experiments
Detailed records of our model reduction experiments — the engineering behind the results.
LLaMA 3.2 3B Movie Pitcher
May 2026
Our first experiment using ablation-guided surgery. Instead of guessing which layers to remove, we ran hoof ablate to measure KL divergence per layer and per attention head across 200 movie pitch examples — then used those scores to drive the compression. The result is a 1.86 GB movie pitch generator that passes 10/10 on our evaluation suite.
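The per-layer scoring idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation behind hoof ablate: it assumes you already have baseline logits and, for each layer, logits produced with that layer skipped. The function names and toy numbers are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def rank_layers_by_kl(baseline_logits, ablated_logits_by_layer):
    """Return (layer, mean KL) pairs sorted from least to most damaging.

    baseline_logits: list of logit vectors, one per evaluated position.
    ablated_logits_by_layer: dict mapping layer index to a list of logit
        vectors produced with that layer skipped, aligned with the baseline.
    """
    scores = {}
    for layer, ablated in ablated_logits_by_layer.items():
        kls = [kl_divergence(softmax(b), softmax(a))
               for b, a in zip(baseline_logits, ablated)]
        scores[layer] = sum(kls) / len(kls)
    return sorted(scores.items(), key=lambda kv: kv[1])

# Toy data (hypothetical, not measured): 2 positions, 3-token vocabulary.
baseline = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
ablated = {
    0: [[0.0, 2.0, 1.0], [1.5, 0.1, 0.3]],    # large shift in the output
    1: [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]],   # identical to the baseline
    2: [[1.9, 0.6, -1.0], [0.1, 1.4, 0.3]],   # small shift in the output
}
ranking = rank_layers_by_kl(baseline, ablated)
layers_to_drop = [layer for layer, _ in ranking[:2]]  # the 2 lowest-KL layers
print(layers_to_drop)
```

A layer whose removal barely changes the output distribution gets a KL near zero, which is what makes it a pruning candidate. The same scoring loop would apply to attention heads by ablating one head at a time.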
Key findings
- 71% size reduction: from ~6 GB (F32) to 1.86 GB with layer pruning, MLP neuron pruning, and Q4K quantization
- Ablation-guided layer selection: hoof ablate scored all 28 layers and 672 attention heads; the 2 lowest-KL layers were dropped
- MLP neuron pruning at threshold 0.005: removed 2.2% of neurons (the most dormant, by activation magnitude)
- SFT on 200 movie pitch examples: 800 steps, lr=5e-6, final cross-entropy (CE) loss 3.03
- 10/10 on our evaluation suite: coherent pitches across thriller, mystery, comedy, action, sci-fi, fantasy, romance, horror, historical, and psychological genres
- Available on request
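Threshold-based neuron pruning can be illustrated with a small sketch. It assumes "dormant" means a mean absolute activation below the threshold across the calibration examples, and that each pruned neuron is removed from both the input and output side of the MLP. The helper names and toy values are hypothetical and do not describe our tooling's internals.

```python
def mean_abs_activation(activations):
    """Mean absolute activation per neuron over all recorded examples."""
    n_examples = len(activations)
    n_neurons = len(activations[0])
    return [
        sum(abs(row[j]) for row in activations) / n_examples
        for j in range(n_neurons)
    ]

def select_dormant(activations, threshold):
    """Indices of neurons whose mean absolute activation is below threshold."""
    return [j for j, m in enumerate(mean_abs_activation(activations)) if m < threshold]

def drop_neurons(up_rows, down_cols, dormant):
    """Remove pruned neurons from both sides of the MLP.

    up_rows: one weight row per neuron (projection into the hidden layer).
    down_cols: one weight column per neuron (projection back out), stored
        as a list of columns so both sides are indexed by neuron.
    """
    dormant_set = set(dormant)
    keep = [j for j in range(len(up_rows)) if j not in dormant_set]
    return [up_rows[j] for j in keep], [down_cols[j] for j in keep]

# Toy activations (hypothetical): 3 examples x 4 neurons.
acts = [
    [0.80, 0.001, -0.40, 0.03],
    [1.10, -0.002, 0.60, 0.01],
    [-0.90, 0.004, 0.50, 0.02],
]
up = [[1, 2], [3, 4], [5, 6], [7, 8]]        # 4 neurons, 2 inputs each
down = [[1, 0], [0, 1], [1, 1], [0, 0]]      # 4 neurons, 2 outputs each

dormant = select_dormant(acts, threshold=0.005)
new_up, new_down = drop_neurons(up, down, dormant)
pruned_fraction = len(dormant) / len(acts[0])
print(dormant, len(new_up), pruned_fraction)
```

Removing a neuron from both projections keeps the matrix shapes consistent, so the pruned MLP still maps the same input width to the same output width.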
CodeLlama 7B Code Assistant
April 2026
A 14 GB general-purpose code model, fine-tuned into a focused coding assistant — one that writes, explains, and debugs code without the preamble, padding, and off-topic responses of the base model. The result passes 9/10 on our evaluation suite and handles Python, JavaScript, Rust, SQL, and Bash cleanly.
Key findings
- 73% size reduction: from ~14 GB (F32) to ~3.8 GB with layer pruning and Q4K quantization
- Supervised fine-tuning (SFT) on 367 curated prompt-response pairs, which gave cleaner output than KL distillation for this task
- 9/10 on our evaluation suite: correct code, correct language, no hallucinated syntax
- Style shift confirmed: base model wraps every response in "Sure! Here's an example..."; fine-tuned model outputs code directly
- Interesting finding: base model defaulted to Python for all code tasks regardless of the requested language; targeted JS calibration data fixed this
- 1200 training steps on an A100 GPU, completed in under 20 minutes
- ~3.8 GB executable available on request
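The size-reduction percentage follows directly from the two file sizes. A minimal check, using the approximate sizes quoted above (the function name is ours, for illustration only):

```python
def size_reduction_pct(original_gb, reduced_gb):
    """Percentage by which the reduced model is smaller than the original."""
    return 100.0 * (1.0 - reduced_gb / original_gb)

# Approximate sizes quoted for this experiment.
reduction = size_reduction_pct(14.0, 3.8)
print(round(reduction))
```

Because the starting sizes are approximate, the result should be read as a rounded figure rather than an exact measurement.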
Mistral 7B French Translator
April 2026
We exported a 14 GB general-purpose model and distilled it into a 4.09 GB French translator. Surgical layer pruning removed 2 of 32 layers, Q4K quantization compressed the weights, and LoRA distillation on an A100 taught the model to translate English to French — 800 steps in under 5 minutes.
Key findings
- 71% size reduction: from ~14 GB (F32) to 4.09 GB with layer pruning + Q4K quantization
- LoRA rank-32 distillation on A100: 800 steps, lr=3e-6, KL divergence 0.95 (stable throughout)
- 10/10 on our evaluation suite: accurate French output with correct grammar and idiom
- KL < 1.0 suggests the distilled model stays close to the original teacher's output distribution
- ~4 GB executable available on request
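To give a sense of why rank-32 LoRA trains quickly, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. It assumes the standard LoRA formulation (a rank x d_in matrix A and a d_out x rank matrix B) and uses a 4096 x 4096 projection as an illustrative shape. Which matrices were actually adapted in this experiment is not stated here, so treat the numbers as an example only.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)

d = 4096                      # illustrative width, an assumption
full = d * d                  # parameters in the frozen full matrix
lora = lora_param_count(d, d, rank=32)
share = 100.0 * lora / full   # trainable share relative to the full matrix
print(lora, full, f"{share:.2f}%")
```

Only the small A and B matrices receive gradients, so the optimizer state and the update cost scale with the rank rather than with the full matrix size.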
LLaMA 3.2 3B Joke Teller
April 2026
We downloaded a 6 GB general-purpose model and turned it into a 1.93 GB joke-telling executable. The model tells coherent jokes, handles multi-turn conversation, and answers general questions — packaged as a single file you can double-click.
Key findings
- 68% size reduction: from ~6 GB to 1.93 GB with Q4K quantization
- LoRA distillation on a single A100 GPU: 300 training steps in under 3 minutes
- 10/10 on our evaluation suite: jokes, puns, knock-knocks, and follow-up requests
- Multi-turn conversation works: the model remembers context across turns
- Packaged as a standalone Windows executable with built-in web UI
TinyLlama 1.1B Chat Assistant
February–March 2026
Our first end-to-end proof of concept. A 2.1 GB chat model surgically pruned to 18 layers, then quality-recovered with LoRA distillation — entirely on a laptop CPU, no GPU required.
Key findings
- Surgical pruning: 22 → 18 layers (18% reduction) while maintaining coherent output
- Quality recovery via LoRA distillation: pruning raised perplexity from 8.1 (original) to 97, and distillation brought it back down to 17
- Entire pipeline runs on a laptop CPU: no cloud, no GPU needed for small models
- Training completed in under an hour on consumer hardware
- Validated the full create → finetune → run → package workflow
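Perplexity and cross-entropy are two views of the same quantity, which helps when comparing the figures above. The sketch assumes perplexity is computed as the exponential of the mean per-token cross-entropy in nats. That is the common convention, but it is an assumption about how these numbers were measured.

```python
import math

def perplexity(mean_cross_entropy_nats):
    """Perplexity from mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy_nats)

def cross_entropy(ppl):
    """Mean per-token cross-entropy in nats implied by a perplexity value."""
    return math.log(ppl)

# Perplexities reported for the original, pruned, and recovered models.
implied = {ppl: round(cross_entropy(ppl), 2) for ppl in (8.1, 97, 17)}
for ppl, ce in implied.items():
    print(ppl, ce)
```

On this scale the recovery is easier to see: distillation closes most of the cross-entropy gap that pruning opened, though the recovered model still sits above the original.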