Experiments

Detailed records of our model reduction experiments — the engineering behind the results.

LLaMA 3.2 3B Joke Teller

April 2026

Complete

Took a 6 GB general-purpose model and turned it into a 1.93 GB joke-telling executable. The model tells coherent jokes, handles multi-turn conversation, and answers general questions — packaged as a single file you can double-click.

Key findings

- 68% size reduction — from ~6 GB to 1.93 GB with Q4K quantization
- LoRA distillation on a single A100 GPU — 300 training steps in under 3 minutes
- 10/10 on our evaluation suite — jokes, puns, knock-knocks, and follow-up requests
- Multi-turn conversation works — the model remembers context across turns
- Packaged as a standalone Windows executable with built-in web UI

Result: 1.93 GB executable, 10/10 eval, multi-turn conversation, zero dependencies
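The size reduction is consistent with back-of-the-envelope quantization math: a 3B-parameter model stored as 16-bit floats occupies roughly 6 GB, while Q4_K-family quantization averages a little under 5 bits per weight. A minimal sketch of that arithmetic — the 3.21B parameter count and 4.85 bits/weight average are assumptions for illustration, not measured values from this experiment:

```python
# Rough size estimate for quantizing a ~3B-parameter model.
# Assumptions: 3.21e9 parameters, FP16 baseline (16 bits/weight),
# ~4.85 bits/weight for a Q4_K_M-style mixed quantization
# (the average includes per-block scales and offsets).
PARAMS = 3.21e9
FP16_BITS = 16
Q4K_BITS = 4.85

def size_gb(params: float, bits_per_weight: float) -> float:
    """Model size in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

fp16 = size_gb(PARAMS, FP16_BITS)  # ~6.4 GB
q4k = size_gb(PARAMS, Q4K_BITS)    # ~1.9 GB
reduction = 1 - q4k / fp16
print(f"{fp16:.2f} GB -> {q4k:.2f} GB ({reduction:.0%} smaller)")
```

The headline reduction depends only on the bits-per-weight ratio, so the exact parameter count cancels out of the percentage.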

TinyLlama 1.1B Chat Assistant

February–March 2026

Complete

Our first end-to-end proof of concept. A 2.1 GB chat model surgically pruned to 18 layers, then quality-recovered with LoRA distillation — entirely on a laptop CPU, no GPU required.

Key findings

- Surgical pruning: 22 → 18 layers (18% reduction) while maintaining coherent output
- Quality recovery via LoRA distillation — perplexity dropped from 97 back to 17 (original: 8.1)
- Entire pipeline runs on a laptop CPU — no cloud, no GPU needed for small models
- Training completed in under an hour on consumer hardware
- Validated the full create → finetune → run → package workflow

Result: 1.8 GB model, coherent chat output, runs on any laptop
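The 22 → 18 surgical prune can be sketched in pure Python. Which four layers were actually removed isn't recorded here, so the sketch assumes a common heuristic — dropping a contiguous block of late-middle decoder layers — purely as a hypothetical choice:

```python
# Sketch of surgical layer pruning: drop a contiguous block of
# decoder layers and keep the rest in order. The choice of layers
# 14-17 is a hypothetical late-middle block, not the experiment's
# recorded selection.
def prune_layers(layers: list, drop_start: int, drop_count: int) -> list:
    """Return a new layer list with `drop_count` layers removed
    starting at index `drop_start`."""
    if drop_start + drop_count > len(layers):
        raise ValueError("drop range exceeds layer count")
    return layers[:drop_start] + layers[drop_start + drop_count:]

# TinyLlama 1.1B has 22 decoder layers (indices 0-21).
original = [f"layer_{i}" for i in range(22)]
pruned = prune_layers(original, drop_start=14, drop_count=4)

print(len(original), "->", len(pruned))  # 22 -> 18
```

In a real pipeline the same slice would be applied to the model's layer module list before saving, which is exactly what makes the subsequent LoRA distillation pass necessary: the remaining layers must learn to compensate for the activations the removed block used to produce.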