Benchmark Results
Your quantized model passes every benchmark.
Here's what it's hiding.
We quantized ViT-B/16 (86.6M params) at four precision levels and ran each through standard ML metrics and proprietary structural analysis on 5,000 samples. Standard metrics say int4 is “fine” (cosine 0.988, kNN -0.5%). Our analysis reveals per-class blind spots and ranking degradation that perplexity and cosine similarity miss entirely.
Cosine says “safe.” Your search results disagree.
Every class passes the cosine similarity test. But look at what happens to actual retrieval neighbors. The left side is what your monitoring dashboard shows. The right side is reality.
Your users got different search results
Pick a sample. See exactly which neighbors changed after int4 quantization. These are real embedding comparisons — not synthetic examples.
The damage averages hide
Average cosine is 0.988. But zoom into the worst 10% and you see which classes are disproportionately damaged. Cat and bird absorb most of the quantization damage while truck and automobile are barely touched.
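The worst-decile view above can be reproduced from raw embeddings. A minimal numpy sketch — `worst_decile_cosine` is a hypothetical helper name, and the array layout (one row per sample, one integer class id per sample) is an assumption, not Transfer Oracle's actual pipeline:

```python
import numpy as np

def worst_decile_cosine(ref, quant, labels):
    """Mean cosine similarity of the worst 10% of samples, per class.

    ref, quant: (n_samples, dim) embeddings (fp32 vs quantized variant).
    labels:     (n_samples,) integer class ids.
    """
    # per-sample cosine similarity between the two variants
    num = np.sum(ref * quant, axis=1)
    den = np.linalg.norm(ref, axis=1) * np.linalg.norm(quant, axis=1)
    sims = num / den

    out = {}
    for c in np.unique(labels):
        cls = np.sort(sims[labels == c])
        k = max(1, len(cls) // 10)   # worst 10% of this class
        out[int(c)] = float(cls[:k].mean())
    return out
```

Sorting ascending and averaging the bottom tenth is what surfaces classes like cat and bird: their tails sag even when the class mean looks healthy.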
Standard ML metrics
Industry-standard comparison metrics. All vs float32 baseline. ViT-B/16 on CIFAR-10, 5,000 samples.
| Variant | Cosine Sim | kNN Acc | Spearman ρ | SQNR (dB) |
|---|---|---|---|---|
| Float32 (baseline) | 1.000 | 94.4% | 1.000 | 120.0 |
| Float16 | 1.000 | 94.4% | 1.000 | 54.8 |
| Int8 (bitsandbytes) | 0.996 | 94.3% | 0.996 | 21.3 |
| Int4 NF4 | 0.988 | 93.9% | 0.989 | 16.1 |
Cosine similarity = per-sample directional similarity. kNN accuracy = classification from embedding neighbors (k=5). Spearman ρ = rank-order correlation of pairwise distances. SQNR = signal-to-quantization-noise ratio.
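All four standard metrics can be computed from the two embedding matrices alone. A self-contained numpy sketch, assuming leave-one-out kNN and a rank transform without tie handling (fine for continuous distances) — function names are illustrative, not a library API:

```python
import numpy as np

def _ranks(x):
    # rank transform; ties broken arbitrarily, acceptable for real-valued distances
    r = np.empty_like(x)
    r[np.argsort(x)] = np.arange(len(x), dtype=x.dtype)
    return r

def quantization_metrics(ref, quant, labels, k=5):
    """ref/quant: (n, d) embeddings; labels: (n,) int classes."""
    # 1) mean per-sample cosine similarity
    cos = np.mean(np.sum(ref * quant, axis=1) /
                  (np.linalg.norm(ref, axis=1) * np.linalg.norm(quant, axis=1)))

    # 2) leave-one-out kNN accuracy on the quantized embeddings
    d = np.linalg.norm(quant[:, None, :] - quant[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a sample can't be its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]
    pred = np.array([np.bincount(labels[row]).argmax() for row in nn])
    knn_acc = float(np.mean(pred == labels))

    # 3) Spearman rho over the upper-triangle pairwise-distance vectors
    iu = np.triu_indices(len(ref), k=1)
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)[iu]
    d_q = np.linalg.norm(quant[:, None] - quant[None, :], axis=-1)[iu]
    rho = float(np.corrcoef(_ranks(d_ref), _ranks(d_q))[0, 1])

    # 4) SQNR in dB: signal power over quantization-noise power
    sqnr = 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - quant) ** 2))

    return {"cosine": float(cos), "knn_acc": knn_acc,
            "spearman": rho, "sqnr_db": float(sqnr)}
```

Note the O(n²) distance matrices: fine at 5,000 samples, but swap in a batched or approximate-neighbor implementation for larger audits.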
What standard metrics miss
Transfer Oracle's proprietary structural analysis goes beyond standard metrics. Where cosine says “fine,” structural analysis reveals hidden damage.
| Variant | Structural Prediction | Distribution Coverage | Structural Integrity | Transfer Risk |
|---|---|---|---|---|
| Float32 (baseline) | 100.0% | 100.0% | 1.000 | 0.00 |
| Float16 | 100.0% | 100.0% | 1.000 | 0.00 |
| Int8 (bitsandbytes) | 98.3% | 98.0% | 0.991 | 0.02 |
| Int4 NF4 | 97.4% | 91.0% | 0.984 | 0.03 |
Structural Prediction
How accurately the quantized model's internal structure predicts correct behavior. 100% = structurally identical.
Distribution Coverage
Fraction of training distribution still reachable. Lower coverage means the model lost access to learned regions.
Structural Integrity
Composite score from multiple independent structural analyses. 1.0 = perfect preservation. Detects damage invisible to cosine similarity.
Transfer Risk
Overall deployment risk. 0 = safe to deploy, 1 = do not deploy. Combines all structural signals into a single go/no-go metric.
Feature importance spectrum
How information is distributed across representation dimensions. Healthy models spread information broadly. Collapsed models concentrate it.
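One common, generic proxy for this spread — not Transfer Oracle's proprietary method — is the explained-variance spectrum of the embedding covariance. A minimal sketch:

```python
import numpy as np

def variance_spectrum(emb):
    """Explained-variance spectrum of the embedding covariance.

    A flat spectrum means information is spread broadly across dimensions;
    a steep spectrum means it is concentrated (a collapse signature).
    emb: (n_samples, dim) embedding matrix.
    """
    centered = emb - emb.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    var = s ** 2
    return var / var.sum()                         # sums to 1
```

Comparing this spectrum between the float32 and quantized variants shows whether quantization steepened the curve, i.e. concentrated information into fewer dimensions.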
The quantization gradient
Float16
1.000
Cosine similarity
Effectively lossless. 2x smaller. Spearman ρ 0.99999. No reason not to use it.
Int8
0.996
Cosine similarity
Near-lossless for classification. Spearman ρ 0.996 — ranking well preserved.
Int4 NF4
0.988
Cosine similarity
Looks fine. But Spearman drops to 0.989 — ~1% of retrieval rankings shuffled. Per-class blind spots emerge.
More Formats
Transfer Oracle also supports ternary (BitNet), GGUF, GPTQ, AWQ, and other quantization formats. Any format that produces embeddings can be audited.
Working with ternary quantization or planning a BitNet deployment? Contact us for a pilot program.
Int4 has blind spots
Overall kNN accuracy drops just 0.5 points. But per-class analysis reveals cat (class 3) loses 1.4 points, falling to 85.9% kNN accuracy — the damage is non-uniform. Average metrics hide class-specific damage.
| Class | Float32 | Int4 NF4 | Delta | Impact |
|---|---|---|---|---|
| airplane | 95.0% | 95.0% | -- | No degradation |
| automobile | 93.2% | 93.2% | -- | No degradation |
| bird | 92.0% | 92.0% | -- | No degradation |
| cat | 87.3% | 85.9% | -1.4% | Degraded |
| deer | 93.5% | 93.5% | -- | No degradation |
| dog | 91.8% | 92.0% | +0.2% | No degradation |
| frog | 97.8% | 97.8% | -- | No degradation |
| horse | 93.5% | 93.5% | -- | No degradation |
| ship | 96.8% | 96.8% | -- | No degradation |
| truck | 98.2% | 98.2% | -- | No degradation |
Why these classes? Cat requires fine-grained feature discrimination against similar classes (dog, deer). Int4 precision loss degrades these subtle distinctions. Classes with highly distinctive shapes (frog, truck) survive quantization intact.
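The per-class breakdown in the table is straightforward to reproduce: run leave-one-out kNN per variant and diff the per-class accuracies. A numpy sketch with a hypothetical function name and signature:

```python
import numpy as np

def per_class_knn_delta(emb_fp32, emb_q, labels, k=5):
    """Per-class leave-one-out kNN accuracy for each variant, plus the delta.

    Returns {class_id: (fp32_acc, quant_acc, quant_acc - fp32_acc)}.
    """
    def knn_acc_per_class(emb):
        d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)              # exclude self-matches
        nn = np.argsort(d, axis=1)[:, :k]
        pred = np.array([np.bincount(labels[row]).argmax() for row in nn])
        return {int(c): float(np.mean(pred[labels == c] == c))
                for c in np.unique(labels)}

    base, quant = knn_acc_per_class(emb_fp32), knn_acc_per_class(emb_q)
    return {c: (base[c], quant[c], quant[c] - base[c]) for c in base}
```

A class-level delta table like the one above falls straight out of the returned dict; the aggregate 94.4% → 93.9% number is just the label-weighted mean of those per-class accuracies.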
Per-class similarity map
Cosine similarity vs float32, broken down by class. Green = preserved, red = destroyed. Notice how int4 degrades selectively across classes — some stay intact while others lose structure.
Distribution of per-sample similarity
The mean hides the spread. Int4 has a wider tail of damaged samples than int8. Int8 stays tightly correlated. Int4 shows subtle spread — the structural damage that accuracy alone misses.
Embedding space projection
Same 500 samples projected into 2D. Click each variant to see how class clusters deform. Float32 → Int8: clusters hold. Int4: subtle deformation visible in class boundaries.
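For a fair side-by-side projection, the 2D basis must be fit once on the float32 embeddings and then reused for every variant; refitting per variant would hide the deformation. A sketch of that shared-basis PCA (the function name is illustrative; the actual visualization above may use a different projection method):

```python
import numpy as np

def shared_pca_2d(ref, variants):
    """Fit a 2-component PCA basis on the fp32 embeddings, then project
    every variant through that same basis so cluster drift is comparable.

    ref:      (n, d) float32 embeddings.
    variants: {name: (n, d) embeddings} for each quantization level.
    """
    mean = ref.mean(axis=0)
    # top-2 right singular vectors of the centered reference = PCA basis
    _, _, vt = np.linalg.svd(ref - mean, full_matrices=False)
    basis = vt[:2].T                     # (d, 2)
    return {name: (emb - mean) @ basis for name, emb in variants.items()}
```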
Multi-metric profile
All 8 metrics on one chart. Float32 is a perfect octagon. Lower quantization levels show which axes degrade first. Int4 shows selective damage — geometry and kNN shrink while cosine stays high.
Do metrics agree on Int4?
Six metrics say “safe.” Two say “damaged.” This is why single-metric evaluation is dangerous.
Theoretical noise vs actual damage
SQNR (signal-to-quantization-noise ratio) from signal processing theory tracks the degradation gradient seen in the practical metrics: as SQNR falls from 54.8 dB (float16) to 16.1 dB (int4), cosine, kNN, and rank correlation degrade in step.
Four metrics that disagree
At Int4 NF4, different metrics tell different stories. Which one do you trust?
0.988
Cosine Similarity
“Embeddings are 98.8% similar. Ship it.”
-0.5%
kNN Accuracy
“Classification barely moved. Safe.”
0.989
Spearman ρ
“1.1% of pairwise rankings shuffled. Retrieval may be affected.”
16.1 dB
SQNR
“High quantization noise floor. Theoretical damage is real.”
A single metric is insufficient. Transfer Oracle runs multiple independent analyses spanning standard metrics (cosine, kNN, rank preservation, SQNR) and proprietary structural methods — including per-class breakdown, distribution coverage, and multi-dimensional integrity checks.
Can LoRA recover quantization damage?
We trained LoRA adapters on each quantized base to test if fine-tuning can compensate for precision loss. The answer is nuanced.
| Config | kNN Acc | Cosine vs FP32 | Structural Prediction | Integrity | Verdict |
|---|---|---|---|---|---|
| FP32 (no LoRA) | 89.0% | 1.000 | -- | -- | baseline |
| LoRA + FP32 | 89.4% | 0.486 | 90.6% | 25% | accuracy up, structure changed |
| LoRA + Int8 | 89.2% | 0.981 | 88.0% | 79.4% | best: preserves accuracy AND geometry |
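As a reminder of the mechanics under test: a LoRA adapter adds a low-rank, scaled update to each frozen weight. A minimal numpy sketch of the adapted forward pass, using the rank=8, alpha=16 setup from the methodology (shapes and the function name are illustrative; real training uses the peft library):

```python
import numpy as np

def lora_forward(x, W, A, B, r=8, alpha=16):
    """LoRA-adapted linear layer: y = x @ (W + (alpha/r) * B A)^T.

    W: (out, in) frozen base weight.
    A: (r, in) and B: (out, r) are the trainable low-rank factors.
    With r=8, alpha=16, the low-rank update is scaled by alpha/r = 2.
    """
    return x @ (W + (alpha / r) * (B @ A)).T
```

Because B is initialized to zero, the adapter starts as a no-op and training only ever moves the model within a rank-r subspace — which is why the adapter co-adapts so tightly to whichever base (fp32 or quantized) it was trained against.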
The LoRA paradox
LoRA + FP32 improves accuracy to 89.4% but breaks the embedding geometry — only 25% structural integrity, cosine 0.486 against the fp32 base. LoRA reorganized the entire representation. Accuracy recovered; structure didn't.
The recommendation
LoRA + Int8 is the sweet spot. 89.2% accuracy, 0.981 cosine, 79.4% structural integrity. The only configuration that preserves both accuracy AND embedding geometry.
LoRA adapters don't survive quantization
Can you train a LoRA on float32 and deploy it on an int8-quantized model? No. The adapter is base-specific.
| Config | kNN Acc | Structural Prediction | Integrity |
|---|---|---|---|
| Int8 + Int8-LoRA (native) | 89.2% | 88.0% | 79.4% |
| Int8 + FP32-LoRA (cross-base) | 63.4% | 75.2% | 21.4% |
Int8 cross-base: 63.4% vs 89.2% native, a 25.8-point drop. The fp32-trained LoRA fails outright on the int8 base. Even though raw int8 embeddings have 0.996 cosine similarity to fp32, the LoRA adaptation operates in a different subspace after quantization. You must retrain the adapter on the actual quantized base.
Methodology
Model: ViT-B/16 pretrained on ImageNet (86.6M params, 768-dim embeddings)
Dataset: CIFAR-10 (10 classes, 5,000 samples)
Quantization variants:
- Float32 — full precision baseline
- Float16 — half precision
- Int8 (bitsandbytes) — LLM.int8() applied to all Linear layers
- Int4 NF4 (bitsandbytes) — NormalFloat4 applied to all Linear layers
- Additional formats supported: ternary (BitNet), GGUF, GPTQ, AWQ — any format that produces embeddings. Contact us for a ternary pilot.
LoRA: rank=8, alpha=16, 5 epochs (peft library)
Standard metrics:
- Cosine similarity — per-sample embedding directional similarity
- kNN accuracy (k=5) — classification quality from embedding neighbors
- Spearman rank correlation — pairwise distance rank preservation
- SQNR — signal-to-quantization-noise ratio (dB)
- Per-class breakdown — class-specific degradation detection
Proprietary analysis (Transfer Oracle):
- Structural prediction — proprietary structural analysis
- Distribution coverage — training region reachability assessment
- Transfer risk — composite structural integrity score
- Anomaly scoring — per-sample vulnerability detection
- + additional proprietary analyses
Audit your quantized model
Don't deploy quantized models blind. Know exactly which classes survived, which collapsed, and whether your LoRA adapter will transfer.