Benchmark Results
Your quantized model passes every benchmark.
Here's what it's hiding.
We quantized ViT-B/16 (86.6M params) at four precision levels and ran each through standard ML metrics and proprietary structural analysis on 5,000 samples. Standard metrics say int4 is “fine” (cosine 0.988, kNN -0.5%). Our analysis reveals per-class blind spots and ranking degradation that perplexity and cosine similarity miss entirely.
Cosine says “safe.” Your search results disagree.
Every class passes the cosine similarity test. But look at what happens to actual retrieval neighbors. The left side is what your monitoring dashboard shows. The right side is reality.
Your users got different search results
Pick a sample. See exactly which neighbors changed after int4 quantization. These are real embedding comparisons — not synthetic examples.
The damage averages hide
Average cosine is 0.988. But zoom into the worst 10% and you see which classes are disproportionately damaged. Cat and bird absorb most of the quantization damage while truck and automobile are barely touched.
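The worst-decile view above can be reproduced from raw embeddings. A minimal numpy sketch — `worst_decile_cosine` is a hypothetical helper name, and the array layout (one row per sample, one integer class id per sample) is an assumption, not Transfer Oracle's actual pipeline:

```python
import numpy as np

def worst_decile_cosine(ref, quant, labels):
    """Mean cosine similarity of the worst 10% of samples, per class.

    ref, quant: (n_samples, dim) embeddings (fp32 vs quantized variant).
    labels:     (n_samples,) integer class ids.
    """
    # per-sample cosine similarity between the two variants
    num = np.sum(ref * quant, axis=1)
    den = np.linalg.norm(ref, axis=1) * np.linalg.norm(quant, axis=1)
    sims = num / den

    out = {}
    for c in np.unique(labels):
        cls = np.sort(sims[labels == c])
        k = max(1, len(cls) // 10)   # worst 10% of this class
        out[int(c)] = float(cls[:k].mean())
    return out
```

Sorting ascending and averaging the bottom tenth is what surfaces classes like cat and bird: their tails sag even when the class mean looks healthy.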
Standard ML metrics
Industry-standard comparison metrics. All vs float32 baseline. ViT-B/16 on CIFAR-10, 5,000 samples.
| Variant | Cosine Sim | kNN Acc | Spearman ρ | SQNR (dB) |
|---|---|---|---|---|
| Float32 (baseline) | 1.000 | 94.4% | 1.000 | 120.0 |
| Float16 | 1.000 | 94.4% | 1.000 | 54.8 |
| Int8 (bitsandbytes) | 0.996 | 94.3% | 0.996 | 21.3 |
| Int4 NF4 | 0.988 | 93.9% | 0.989 | 16.1 |
Cosine similarity = per-sample directional similarity. kNN accuracy = classification from embedding neighbors (k=5). Spearman ρ = rank-order correlation of pairwise distances. SQNR = signal-to-quantization-noise ratio.
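All four standard metrics can be computed from the two embedding matrices alone. A self-contained numpy sketch, assuming leave-one-out kNN and a rank transform without tie handling (fine for continuous distances) — function names are illustrative, not a library API:

```python
import numpy as np

def _ranks(x):
    # rank transform; ties broken arbitrarily, acceptable for real-valued distances
    r = np.empty_like(x)
    r[np.argsort(x)] = np.arange(len(x), dtype=x.dtype)
    return r

def quantization_metrics(ref, quant, labels, k=5):
    """ref/quant: (n, d) embeddings; labels: (n,) int classes."""
    # 1) mean per-sample cosine similarity
    cos = np.mean(np.sum(ref * quant, axis=1) /
                  (np.linalg.norm(ref, axis=1) * np.linalg.norm(quant, axis=1)))

    # 2) leave-one-out kNN accuracy on the quantized embeddings
    d = np.linalg.norm(quant[:, None, :] - quant[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a sample can't be its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]
    pred = np.array([np.bincount(labels[row]).argmax() for row in nn])
    knn_acc = float(np.mean(pred == labels))

    # 3) Spearman rho over the upper-triangle pairwise-distance vectors
    iu = np.triu_indices(len(ref), k=1)
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)[iu]
    d_q = np.linalg.norm(quant[:, None] - quant[None, :], axis=-1)[iu]
    rho = float(np.corrcoef(_ranks(d_ref), _ranks(d_q))[0, 1])

    # 4) SQNR in dB: signal power over quantization-noise power
    sqnr = 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - quant) ** 2))

    return {"cosine": float(cos), "knn_acc": knn_acc,
            "spearman": rho, "sqnr_db": float(sqnr)}
```

Note the O(n²) distance matrices: fine at 5,000 samples, but swap in a batched or approximate-neighbor implementation for larger audits.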
What standard metrics miss
Transfer Oracle's proprietary structural analysis goes beyond standard metrics. Where cosine says “fine,” structural analysis reveals hidden damage.
| Variant | Structural Prediction | Distribution Coverage | Structural Integrity | Transfer Risk |
|---|---|---|---|---|
| Float32 (baseline) | 100.0% | 100.0% | 1.000 | 0.00 |
| Float16 | 100.0% | 100.0% | 1.000 | 0.00 |
| Int8 (bitsandbytes) | 98.3% | 98.0% | 0.991 | 0.02 |
| Int4 NF4 | 97.4% | 91.0% | 0.984 | 0.03 |
Structural Prediction
How accurately the quantized model's internal structure predicts correct behavior. 100% = structurally identical.
Distribution Coverage
Fraction of training distribution still reachable. Lower coverage means the model lost access to learned regions.
Structural Integrity
Composite score from multiple independent structural analyses. 1.0 = perfect preservation. Detects damage invisible to cosine similarity.
Transfer Risk
Overall deployment risk. 0 = safe to deploy, 1 = do not deploy. Combines all structural signals into a single go/no-go metric.
Feature importance spectrum
How information is distributed across representation dimensions. Healthy models spread information broadly. Collapsed models concentrate it.
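One common, generic proxy for this spread — not Transfer Oracle's proprietary method — is the explained-variance spectrum of the embedding covariance. A minimal sketch:

```python
import numpy as np

def variance_spectrum(emb):
    """Explained-variance spectrum of the embedding covariance.

    A flat spectrum means information is spread broadly across dimensions;
    a steep spectrum means it is concentrated (a collapse signature).
    emb: (n_samples, dim) embedding matrix.
    """
    centered = emb - emb.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    var = s ** 2
    return var / var.sum()                         # sums to 1
```

Comparing this spectrum between the float32 and quantized variants shows whether quantization steepened the curve, i.e. concentrated information into fewer dimensions.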
The quantization gradient
Float16
1.000
Cosine similarity
Effectively lossless. 2x smaller. Spearman ρ 0.99999. No reason not to use it.
Int8
0.996
Cosine similarity
Near-lossless for classification. Spearman ρ 0.996 — ranking well preserved.
Int4 NF4
0.988
Cosine similarity
Looks fine. But Spearman drops to 0.989 — ~1% of retrieval rankings shuffled. Per-class blind spots emerge.
More Formats
Transfer Oracle also supports ternary (BitNet), GGUF, GPTQ, AWQ, and other quantization formats. Any format that produces embeddings can be audited.
Working with ternary quantization or planning a BitNet deployment? Contact us for a pilot program.
Int4 has blind spots
Overall kNN accuracy drops just 0.5 points. But per-class analysis reveals cat (class 3) loses 1.4 points, falling to 85.9% kNN accuracy — the damage is non-uniform. Average metrics hide class-specific damage.
| Class | Float32 | Int4 NF4 | Delta | Impact |
|---|---|---|---|---|
| airplane | 95.0% | 95.0% | -- | No degradation |
| automobile | 93.2% | 93.2% | -- | No degradation |
| bird | 92.0% | 92.0% | -- | No degradation |
| cat | 87.3% | 85.9% | -1.4% | Degraded |
| deer | 93.5% | 93.5% | -- | No degradation |
| dog | 91.8% | 92.0% | +0.2% | No degradation |
| frog | 97.8% | 97.8% | -- | No degradation |
| horse | 93.5% | 93.5% | -- | No degradation |
| ship | 96.8% | 96.8% | -- | No degradation |
| truck | 98.2% | 98.2% | -- | No degradation |
Why these classes? Cat requires fine-grained feature discrimination against similar classes (dog, deer). Int4 precision loss degrades these subtle distinctions. Classes with highly distinctive shapes (frog, truck) survive quantization intact.
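The per-class breakdown in the table is straightforward to reproduce: run leave-one-out kNN per variant and diff the per-class accuracies. A numpy sketch with a hypothetical function name and signature:

```python
import numpy as np

def per_class_knn_delta(emb_fp32, emb_q, labels, k=5):
    """Per-class leave-one-out kNN accuracy for each variant, plus the delta.

    Returns {class_id: (fp32_acc, quant_acc, quant_acc - fp32_acc)}.
    """
    def knn_acc_per_class(emb):
        d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)              # exclude self-matches
        nn = np.argsort(d, axis=1)[:, :k]
        pred = np.array([np.bincount(labels[row]).argmax() for row in nn])
        return {int(c): float(np.mean(pred[labels == c] == c))
                for c in np.unique(labels)}

    base, quant = knn_acc_per_class(emb_fp32), knn_acc_per_class(emb_q)
    return {c: (base[c], quant[c], quant[c] - base[c]) for c in base}
```

A class-level delta table like the one above falls straight out of the returned dict; the aggregate 94.4% → 93.9% number is just the label-weighted mean of those per-class accuracies.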
Per-class similarity map
Cosine similarity vs float32, broken down by class. Green = preserved, red = destroyed. Notice how int4 degrades selectively across classes — some stay intact while others lose structure.
Distribution of per-sample similarity
The mean hides the spread. Int4 has a wider tail of damaged samples than int8. Int8 stays tightly correlated. Int4 shows subtle spread — the structural damage that accuracy alone misses.
Embedding space projection
Same 500 samples projected into 2D. Click each variant to see how class clusters deform. Float32 → Int8: clusters hold. Int4: subtle deformation visible in class boundaries.
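For a fair side-by-side projection, the 2D basis must be fit once on the float32 embeddings and then reused for every variant; refitting per variant would hide the deformation. A sketch of that shared-basis PCA (the function name is illustrative; the actual visualization above may use a different projection method):

```python
import numpy as np

def shared_pca_2d(ref, variants):
    """Fit a 2-component PCA basis on the fp32 embeddings, then project
    every variant through that same basis so cluster drift is comparable.

    ref:      (n, d) float32 embeddings.
    variants: {name: (n, d) embeddings} for each quantization level.
    """
    mean = ref.mean(axis=0)
    # top-2 right singular vectors of the centered reference = PCA basis
    _, _, vt = np.linalg.svd(ref - mean, full_matrices=False)
    basis = vt[:2].T                     # (d, 2)
    return {name: (emb - mean) @ basis for name, emb in variants.items()}
```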
Multi-metric profile
All 8 metrics on one chart. Float32 is a perfect octagon. Lower quantization levels show which axes degrade first. Int4 shows selective damage — geometry and kNN shrink while cosine stays high.
Do metrics agree on Int4?
Six metrics say “safe.” Two say “damaged.” This is why single-metric evaluation is dangerous.
Theoretical noise vs actual damage
SQNR (signal-to-quantization-noise ratio) from signal processing theory tracks the degradation gradient seen in the practical metrics: as SQNR falls from 54.8 dB (float16) to 16.1 dB (int4), cosine, kNN, and rank correlation degrade in step.
Four metrics that disagree
At Int4 NF4, different metrics tell different stories. Which one do you trust?
0.988
Cosine Similarity
“Embeddings are 98.8% similar. Ship it.”
-0.5%
kNN Accuracy
“Classification barely moved. Safe.”
0.989
Spearman ρ
“1.1% of pairwise rankings shuffled. Retrieval may be affected.”
16.1 dB
SQNR
“High quantization noise floor. Theoretical damage is real.”
A single metric is insufficient. Transfer Oracle runs multiple independent analyses spanning standard metrics (cosine, kNN, rank preservation, SQNR) and proprietary structural methods — including per-class breakdown, distribution coverage, and multi-dimensional integrity checks.
Can LoRA recover quantization damage?
We trained LoRA adapters on each quantized base to test if fine-tuning can compensate for precision loss. The answer is nuanced.
| Config | kNN Acc | Cosine vs FP32 | Structural Prediction | Integrity | Verdict |
|---|---|---|---|---|---|
| FP32 (no LoRA) | 89.0% | 1.000 | -- | -- | baseline |
| LoRA + FP32 | 89.4% | 0.486 | 90.6% | 25% | accuracy up, structure changed |
| LoRA + Int8 | 89.2% | 0.981 | 88.0% | 79.4% | best: preserves accuracy AND geometry |
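As a reminder of the mechanics under test: a LoRA adapter adds a low-rank, scaled update to each frozen weight. A minimal numpy sketch of the adapted forward pass, using the rank=8, alpha=16 setup from the methodology (shapes and the function name are illustrative; real training uses the peft library):

```python
import numpy as np

def lora_forward(x, W, A, B, r=8, alpha=16):
    """LoRA-adapted linear layer: y = x @ (W + (alpha/r) * B A)^T.

    W: (out, in) frozen base weight.
    A: (r, in) and B: (out, r) are the trainable low-rank factors.
    With r=8, alpha=16, the low-rank update is scaled by alpha/r = 2.
    """
    return x @ (W + (alpha / r) * (B @ A)).T
```

Because B is initialized to zero, the adapter starts as a no-op and training only ever moves the model within a rank-r subspace — which is why the adapter co-adapts so tightly to whichever base (fp32 or quantized) it was trained against.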
The LoRA paradox
LoRA + FP32 improves accuracy to 89.4% but breaks the embedding geometry — only 25% structural integrity, cosine 0.486 against the fp32 base. LoRA reorganized the entire representation. Accuracy recovered; structure didn't.
The recommendation
LoRA + Int8 is the sweet spot. 89.2% accuracy, 0.981 cosine, 79.4% structural integrity. The only configuration that preserves both accuracy AND embedding geometry.
LoRA adapters don't survive quantization
Can you train a LoRA on float32 and deploy it on an int8-quantized model? No. The adapter is base-specific.
| Config | kNN Acc | Structural Prediction | Integrity |
|---|---|---|---|
| Int8 + Int8-LoRA (native) | 89.2% | 88.0% | 79.4% |
| Int8 + FP32-LoRA (cross-base) | 63.4% | 75.2% | 21.4% |
Int8 cross-base: 63.4% vs 89.2% native, a 25.8-point drop. The fp32-trained LoRA fails outright on the int8 base. Even though raw int8 embeddings have 0.996 cosine similarity to fp32, the LoRA adaptation operates in a different subspace after quantization. You must retrain the adapter on the actual quantized base.
Methodology
Model: ViT-B/16 pretrained on ImageNet (86.6M params, 768-dim embeddings)
Dataset: CIFAR-10 (10 classes, 5,000 samples)
Quantization variants:
- Float32 — full precision baseline
- Float16 — half precision
- Int8 (bitsandbytes) — LLM.int8() applied to all Linear layers
- Int4 NF4 (bitsandbytes) — NormalFloat4 applied to all Linear layers
- Additional formats supported: ternary (BitNet), GGUF, GPTQ, AWQ — any format that produces embeddings. Contact us for a ternary pilot.
LoRA: rank=8, alpha=16, 5 epochs (peft library)
Standard metrics:
- Cosine similarity — per-sample embedding directional similarity
- kNN accuracy (k=5) — classification quality from embedding neighbors
- Spearman rank correlation — pairwise distance rank preservation
- SQNR — signal-to-quantization-noise ratio (dB)
- Per-class breakdown — class-specific degradation detection
Proprietary analysis (Transfer Oracle):
- Structural prediction — proprietary structural analysis
- Distribution coverage — training region reachability assessment
- Transfer risk — composite structural integrity score
- Anomaly scoring — per-sample vulnerability detection
- + additional proprietary analyses
Audit your quantized model
Don't deploy quantized models blind. Know exactly which classes survived, which collapsed, and whether your LoRA adapter will transfer.