Experiments

Detailed records of our model reduction experiments — the engineering behind the results.

LLaMA 3.2 3B Movie Pitcher

May 2026

Complete

Our first experiment using ablation-guided surgery. Instead of guessing which layers to remove, we ran hoof ablate to measure KL divergence per layer and per attention head across 200 movie pitch examples — then used those scores to drive the compression. The result is a 1.86 GB movie pitch generator that passes 10/10 on our evaluation suite.
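
The ranking step can be sketched as follows. This is a minimal illustration, assuming each layer's score is the mean KL divergence between the full model's next-token distribution and the distribution produced with that layer skipped. The function names and toy distributions are hypothetical, and this is not the hoof ablate implementation, whose internals are not described here.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def rank_layers_by_kl(baseline, ablated_by_layer):
    """Return (layer, mean KL) pairs sorted from least to most damaging.

    baseline: list of next-token distributions from the unmodified model.
    ablated_by_layer: dict mapping layer index -> list of distributions
        produced with that layer skipped, aligned with `baseline`.
    """
    scores = {}
    for layer, dists in ablated_by_layer.items():
        kls = [kl_divergence(p, q) for p, q in zip(baseline, dists)]
        scores[layer] = sum(kls) / len(kls)
    return sorted(scores.items(), key=lambda item: item[1])

# Toy data (hypothetical): two token positions, vocabulary of three tokens.
baseline = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
ablated_by_layer = {
    0: [[0.3, 0.4, 0.3], [0.4, 0.3, 0.3]],        # large shift: important layer
    1: [[0.69, 0.21, 0.10], [0.1, 0.79, 0.11]],   # tiny shift: pruning candidate
    2: [[0.6, 0.25, 0.15], [0.15, 0.7, 0.15]],    # moderate shift
}

ranking = rank_layers_by_kl(baseline, ablated_by_layer)
layers_to_drop = [layer for layer, _ in ranking[:1]]  # keep the N lowest-KL layers
print(ranking)
print("drop:", layers_to_drop)
```

The same scoring loop would apply per attention head by replacing the layer index with a (layer, head) pair.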

Key findings

- 71% size reduction: from ~6 GB (16-bit weights) to 1.86 GB with layer pruning, MLP neuron pruning, and Q4K quantization
- Ablation-guided layer selection — hoof ablate scored all 28 layers and 672 attention heads; the 2 lowest-KL layers were dropped
- MLP neuron pruning at threshold 0.005 — removed 2.2% of neurons (the most dormant, by activation magnitude)
- Supervised fine-tuning (SFT) on 200 movie pitch examples: 800 steps, lr=5e-6, final cross-entropy loss 3.03
- 10/10 on our evaluation suite — coherent pitches across thriller, mystery, comedy, action, sci-fi, fantasy, romance, horror, historical, and psychological genres
- Available on request

Result: 1.86 GB executable, 10/10 eval, ablation-guided surgery, available on request
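
The threshold-based neuron pruning listed in the findings can be sketched as below. This assumes "activation magnitude" means the mean absolute activation of each MLP neuron over the calibration examples; the actual statistic may differ. The helper names and toy numbers are hypothetical.

```python
def dormant_neurons(activations, threshold=0.005):
    """Return indices of neurons whose mean |activation| falls below threshold.

    activations: list of rows, one per calibration token, each a list of
        per-neuron activation values from one MLP layer.
    """
    n_tokens = len(activations)
    n_neurons = len(activations[0])
    mean_abs = [
        sum(abs(row[j]) for row in activations) / n_tokens
        for j in range(n_neurons)
    ]
    return [j for j, m in enumerate(mean_abs) if m < threshold]

def prune_columns(rows, drop):
    """Remove the given neuron indices from every row of a matrix."""
    drop = set(drop)
    return [[v for j, v in enumerate(row) if j not in drop] for row in rows]

# Toy calibration activations (hypothetical): 3 tokens, 4 neurons.
acts = [
    [0.90, 0.001, -0.40, 0.000],
    [0.75, -0.002, 0.35, 0.004],
    [1.10, 0.003, -0.20, -0.001],
]
drop = dormant_neurons(acts, threshold=0.005)
pruned = prune_columns(acts, drop)
print(drop, f"{len(drop) / len(acts[0]):.0%} of neurons removed")
```

In a real transformer MLP, dropping neuron j means removing the matching row from the up and gate projections and the matching column from the down projection, so the layer's input and output widths stay unchanged.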

CodeLlama 7B Code Assistant

April 2026

Complete

A 14 GB general-purpose code model, fine-tuned into a focused coding assistant — one that writes, explains, and debugs code without the preamble, padding, and off-topic responses of the base model. The result passes 9/10 on our evaluation suite and handles Python, JavaScript, Rust, SQL, and Bash cleanly.

Key findings

- 73% size reduction: from ~14 GB (16-bit weights) to ~3.8 GB with layer pruning and Q4K quantization
- Supervised fine-tuning (SFT) on 367 curated prompt-response pairs — cleaner output than KL distillation for this task
- 9/10 on our evaluation suite — correct code, correct language, no hallucinated syntax
- Style shift confirmed: base model wraps every response in "Sure! Here's an example..."; fine-tuned model outputs code directly
- Interesting finding: base model defaulted to Python for all code tasks regardless of the requested language — targeted JS calibration data fixed this
- 1200 training steps on an A100 GPU, completed in under 20 minutes
- ~3.8 GB executable available on request

Result: ~3.8 GB executable, 9/10 eval, Python / JS / Rust / SQL / Bash, available on request
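
For reference, the size-reduction percentages in this log follow from simple arithmetic on the reported sizes. The sketch below recomputes two of them; the helper name is illustrative, and the original sizes are the approximate figures quoted in the entries.

```python
def size_reduction(original_gb, final_gb):
    """Fraction of the original size removed, e.g. 0.73 for a 73% reduction."""
    if original_gb <= 0:
        raise ValueError("original size must be positive")
    return 1.0 - final_gb / original_gb

# Sizes as reported in this log (GB); the originals are approximate.
experiments = {
    "CodeLlama 7B Code Assistant": (14.0, 3.8),
    "Mistral 7B French Translator": (14.0, 4.09),
}
for name, (before, after) in experiments.items():
    print(f"{name}: {size_reduction(before, after):.0%} smaller")
```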

Mistral 7B French Translator

April 2026

Complete

We exported a 14 GB general-purpose model and distilled it into a 4.09 GB French translator. Surgical layer pruning removed 2 of 32 layers, Q4K quantization compressed the weights, and LoRA distillation on an A100 taught the model to translate English to French — 800 steps in under 5 minutes.

Key findings

- 71% size reduction: from ~14 GB (16-bit weights) to 4.09 GB with layer pruning + Q4K quantization
- LoRA rank-32 distillation on A100 — 800 steps, lr=3e-6, KL divergence 0.95 (stable throughout)
- 10/10 on our evaluation suite — accurate French output with correct grammar and idiom
- KL < 1.0 suggests the distilled model stays close to the teacher's output distribution
- ~4 GB executable available on request

Result: ~4 GB executable, 10/10 eval, KL=0.95, available on request
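
To make the rank-32 figure concrete, here is a minimal sketch of the standard LoRA update, W' = W + (alpha / r) · B · A, together with a parameter count for one 4096 × 4096 projection (Mistral 7B's hidden size). Which projections were adapted and what alpha was used are not stated above, so those values are assumptions chosen for illustration.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the weight the adapted layer applies."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Parameter savings for a single 4096 x 4096 projection at rank 32.
full = 4096 * 4096
adapter = lora_param_count(4096, 4096, rank=32)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")

# Tiny rank-1 example (hypothetical values) showing the merged weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]      # rank x d_in
B = [[1.0], [2.0]]     # d_out x rank
merged = lora_effective_weight(W, A, B, alpha=2.0, rank=1)
print(merged)
```

Only A and B are trained, which is one reason a short run of a few hundred steps can be enough to steer a pruned model back toward its teacher.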

LLaMA 3.2 3B Joke Teller

April 2026

Complete

We downloaded a 6 GB general-purpose model and turned it into a 1.93 GB joke-telling executable. The model tells coherent jokes, handles multi-turn conversation, and answers general questions — packaged as a single file you can double-click.

Key findings

- 68% size reduction — from ~6 GB to 1.93 GB with Q4K quantization
- LoRA distillation on a single A100 GPU — 300 training steps in under 3 minutes
- 10/10 on our evaluation suite — jokes, puns, knock-knocks, and follow-up requests
- Multi-turn conversation works — the model remembers context across turns
- Packaged as a standalone Windows executable with built-in web UI

Result: 1.93 GB executable, 10/10 eval, multi-turn conversation, zero dependencies

TinyLlama 1.1B Chat Assistant

February–March 2026

Complete

Our first end-to-end proof of concept: a 2.1 GB chat model surgically pruned to 18 layers, then quality-recovered with LoRA distillation, entirely on a laptop CPU with no GPU required.

Key findings

- Surgical pruning: 22 → 18 layers (18% reduction) while maintaining coherent output
- Quality recovery via LoRA distillation: pruning raised perplexity from 8.1 (original) to 97, and distillation brought it back down to 17
- Entire pipeline runs on a laptop CPU — no cloud, no GPU needed for small models
- Training completed in under an hour on consumer hardware
- Validated the full create → finetune → run → package workflow

Result: 1.8 GB model, coherent chat output, runs on any laptop
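
For readers unfamiliar with the perplexity figures in the findings, the metric is the exponential of the mean negative log-likelihood per token. The sketch below shows that relationship and converts the reported perplexities to cross-entropy; it assumes natural-log probabilities and is not the evaluation code used in the experiment.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural-log probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Sanity check: a model that assigns probability 1/8 to every token
# is as uncertain as a uniform choice among 8 options.
uniform = [math.log(1 / 8)] * 5
print(perplexity(uniform))

# Perplexities reported in the findings, converted to mean cross-entropy (nats per token).
for label, ppl in [("original", 8.1), ("after pruning", 97.0), ("after distillation", 17.0)]:
    print(f"{label}: perplexity {ppl} -> cross-entropy {math.log(ppl):.2f} nats")
```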