Experiments
Detailed records of our model reduction experiments — the engineering behind the results.
LLaMA 3.2 3B Joke Teller
April 2026
We took a ~6 GB general-purpose model and turned it into a 1.93 GB joke-telling executable. The model tells coherent jokes, handles multi-turn conversation, and answers general questions — packaged as a single file you can double-click.
Key findings
- 68% size reduction — from ~6 GB to 1.93 GB with Q4K quantization
- LoRA distillation on a single A100 GPU — 300 training steps in under 3 minutes
- 10/10 on our evaluation suite — jokes, puns, knock-knocks, and follow-up requests
- Multi-turn conversation works — the model remembers context across turns
- Packaged as a standalone Windows executable with built-in web UI
Result: 1.93 GB executable, 10/10 eval, multi-turn conversation, zero dependencies
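The Q4K figure above refers to llama.cpp's blockwise 4-bit quantization family. As a rough sketch of the core idea only — the real Q4_K format additionally packs scales and minima into super-blocks — here is minimal blockwise 4-bit quantization with one scale per 32 weights:

```python
import numpy as np

def quantize_q4_blockwise(weights, block_size=32):
    """Toy blockwise 4-bit quantization: one float scale per block,
    weights stored as int4-range integers. Illustrative, not the
    actual Q4_K on-disk layout."""
    padded = np.pad(weights, (0, -len(weights) % block_size))
    blocks = padded.reshape(-1, block_size)
    # Scale maps each block's max magnitude onto the int4 range [-8, 7]
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q.astype(np.float32) * scales).ravel()

w = np.random.randn(64).astype(np.float32)
q, s = quantize_q4_blockwise(w)
w_hat = dequantize(q, s)
# 4 bits per weight plus one scale per 32 weights is ~4.5-5 bits/weight,
# versus 16 bits for fp16 -- the source of a size cut on this order.
```

Each weight's reconstruction error is bounded by half a block scale, which is why quantizing to 4 bits degrades quality gracefully rather than catastrophically.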
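LoRA keeps the base weights frozen and trains only small low-rank factors, which is what makes the distillation pass above so cheap. A minimal sketch of the LoRA update itself — the dimensions, rank, and alpha below are illustrative, not the experiment's actual configuration:

```python
import numpy as np

# Instead of finetuning the full matrix W (d_out x d_in), LoRA trains two
# low-rank factors B (d_out x r) and A (r x d_in), merged back afterwards
# as W + (alpha / r) * B @ A. Values here are hypothetical.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16

W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01  # small random init
B = np.zeros((d_out, r))                   # zero init: the update starts as a no-op

def merged(W, A, B, alpha, r):
    return W + (alpha / r) * B @ A

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out.
trainable, full = r * (d_in + d_out), d_in * d_out
```

Because B starts at zero, the merged weights initially equal the originals, and training only ever has to move the small factors — a few percent of the full parameter count at typical ranks.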
TinyLlama 1.1B Chat Assistant
February–March 2026
Our first end-to-end proof of concept. A 2.1 GB chat model surgically pruned to 18 layers, then recovered to usable quality with LoRA distillation — entirely on a laptop CPU, no GPU required.
Key findings
- Surgical pruning: 22 → 18 layers (18% reduction) while maintaining coherent output
- Quality recovery via LoRA distillation — perplexity dropped from 97 back to 17 (original: 8.1)
- Entire pipeline runs on a laptop CPU — no cloud, no GPU needed for small models
- Training completed in under an hour on consumer hardware
- Validated the full create → finetune → run → package workflow
Result: 1.8 GB model, coherent chat output, runs on any laptop
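The pruning step amounts to deleting a block of transformer layers from the model's layer stack. A toy sketch of the 22 → 18 cut — the indices dropped here are hypothetical; in practice the layers to remove are chosen by measuring each layer's contribution, e.g. how similar its output hidden states are to its inputs:

```python
# Stand-ins for the model's 22 transformer blocks.
layers = [f"layer_{i}" for i in range(22)]

# Hypothetical choice: drop 4 contiguous middle layers, which tend to be
# the most redundant in depth-pruning studies.
drop = set(range(14, 18))

pruned = [layer for i, layer in enumerate(layers) if i not in drop]
# 22 -> 18 layers, the ~18% depth reduction reported above.
```

In a real Hugging Face model this is the same operation applied to the module list holding the decoder blocks, followed by updating the layer count in the config.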
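The perplexity figures above (8.1 → 97 after pruning → 17 after recovery) are the exponential of the mean negative log-likelihood per token on held-out text. A minimal sketch of the metric itself, with made-up probabilities for illustration:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood over a token sequence.
    token_probs are the model's probabilities for the true next tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model assigning every true token probability 1/17 has perplexity 17
# (up to float rounding) -- roughly the recovered model's uncertainty,
# i.e. choosing among ~17 equally likely next tokens at each step.
print(perplexity([1 / 17] * 100))
```

Lower is better: the pruned-but-unrecovered model's 97 means it was effectively guessing among ~97 tokens per step before distillation pulled it back to 17.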