- ================================================================================
- COMPREHENSIVE T5 TEXT ENCODER EVALUATION
- FP16 Baseline vs FP16 Fast vs Q8 GGUF Quantization
- ================================================================================
- Loading tokenizer...
- You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
- ================================================================================
- BENCHMARK 1: FP16 BASELINE
- ================================================================================
- Loading FP16 baseline model...
- `torch_dtype` is deprecated! Use `dtype` instead!
- Encoding prompts...
- Benchmarking speed...
- ✓ Speed: 0.1296s ± 0.0045s
- ✓ VRAM: 10.76 GB
- ✓ Embedding shape: (6, 4096)
- ✓ Embedding dtype: float16
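The speed and VRAM figures reported in each benchmark section are the kind of numbers produced by a warmed-up, CUDA-synchronized timing loop. A minimal sketch of such a loop (the encode callable and run counts are assumptions, not the script's actual code):

    import time
    import torch

    def benchmark_encoder(encode, n_warmup=3, n_runs=20):
        """Time an encode() call on GPU with proper synchronization."""
        torch.cuda.reset_peak_memory_stats()
        for _ in range(n_warmup):          # warm-up runs trigger lazy init / kernel autotuning
            encode()
        torch.cuda.synchronize()

        times = []
        for _ in range(n_runs):
            start = time.perf_counter()
            encode()
            torch.cuda.synchronize()       # wait for the GPU before reading the clock
            times.append(time.perf_counter() - start)

        mean = sum(times) / len(times)
        std = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5
        vram_gb = torch.cuda.max_memory_allocated() / 1024**3
        return mean, std, vram_gb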
- ================================================================================
- BENCHMARK 2: FP16 WITH FAST ACCUMULATION (TF32)
- ================================================================================
- Loading FP16 fast model...
- Encoding prompts...
- Benchmarking speed...
- ✓ Speed: 0.1150s ± 0.0005s
- ✓ VRAM: 10.76 GB
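"Fast accumulation" here refers to PyTorch's reduced-precision matmul paths: TF32 tensor cores for FP32 matmuls and reduced-precision accumulation for FP16 matmuls. A minimal sketch of the flags involved, assuming a recent PyTorch build with CUDA:

    import torch

    # Allow TF32 tensor cores for float32 matmuls and cuDNN convolutions.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    # Equivalent high-level switch for float32 matmul precision.
    torch.set_float32_matmul_precision("high")

    # Allow FP16 matmuls to accumulate in reduced precision (the "FP16 fast" path).
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

These flags must be set before the encoder runs its first forward pass; they change numerics slightly, which is exactly what the accuracy comparison below quantifies.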
- ================================================================================
- BENCHMARK 3: Q8 GGUF QUANTIZATION (MIXED PRECISION)
- ================================================================================
- 📊 Analyzing GGUF file structure...
- 📁 Analyzing GGUF model: /home/local/Downloads/paw/model_cache/models--city96--t5-v1_1-xxl-encoder-gguf/snapshots/005a6ea51a7d0b84d677b3e633bb52a8c85a83d9/t5-v1_1-xxl-encoder-Q8_0.gguf
- Architecture: t5encoder (raw bytes: [116 53 101 110 99 111 100 101 114])
- Total tensors: 219
- Quantization breakdown:
- • type 0 (F32): 50 tensors (22.8%)
- • type 8 (Q8_0): 169 tensors (77.2%)
- Sample tensor types:
- • enc.blk.0.attn_k.weight: 8 [4096 4096]
- • enc.blk.0.attn_o.weight: 8 [4096 4096]
- • enc.blk.0.attn_q.weight: 8 [4096 4096]
- • enc.blk.0.attn_rel_b.weight: 0 [64 32]
- • enc.blk.0.attn_v.weight: 8 [4096 4096]
- • enc.blk.0.attn_norm.weight: 0 [4096]
- • enc.blk.0.ffn_gate.weight: 8 [ 4096 10240]
- • enc.blk.0.ffn_up.weight: 8 [ 4096 10240]
- • enc.blk.0.ffn_down.weight: 8 [10240 4096]
- • enc.blk.0.ffn_norm.weight: 0 [4096]
- ⚠️ CRITICAL FINDING:
- Q8_0 GGUF is MIXED PRECISION, not pure Q8!
- Contains: {'8': 169, '0': 50}
- This means some tensors are Q8_0 (quantized) and some are F32 (full precision)
- Even the 'quantized' tensors carry an FP16 scale per block of 32 values!
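The tensor-type breakdown above (GGML type 0 = F32, type 8 = Q8_0) can be reproduced with the gguf Python package that ships with the llama.cpp project. A sketch under that assumption; field names follow the GGUFReader API as I understand it:

    from collections import Counter
    import gguf  # pip install gguf

    reader = gguf.GGUFReader("t5-v1_1-xxl-encoder-Q8_0.gguf")

    # Count tensors per GGML quantization type (0 = F32, 8 = Q8_0).
    type_counts = Counter(int(t.tensor_type) for t in reader.tensors)
    total = sum(type_counts.values())
    for ggml_type, count in sorted(type_counts.items()):
        print(f"type {ggml_type}: {count} tensors ({100 * count / total:.1f}%)")

    # Inspect a few tensors: name, type, shape.
    for t in reader.tensors[:10]:
        print(t.name, int(t.tensor_type), list(t.shape))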
- 🔄 Loading Q8 GGUF model (simulating dequantization)...
- 🔄 Loading Q8 GGUF and dequantizing to FP16 (simulating ComfyUI-GGUF)
- Simulating Q8_0 quantization artifacts...
- (Q8_0 = 8-bit int + FP16 scale per block of 32 values)
- Quantized 170 weight tensors
- Encoding prompts...
- Benchmarking speed...
- ✓ Speed: 0.1059s ± 0.0009s
- ✓ VRAM: 11.50 GB
- ✓ Embedding shape: (6, 4096)
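The "simulating Q8_0 quantization artifacts" step above (8-bit integers plus one FP16 scale per block of 32 values) amounts to a quantize/dequantize round trip applied to each weight tensor. A minimal sketch of that round trip, not the script's actual implementation:

    import torch

    def simulate_q8_0(weight: torch.Tensor, block_size: int = 32) -> torch.Tensor:
        """Round-trip a weight tensor through Q8_0-style block quantization."""
        flat = weight.float().flatten()
        pad = (-flat.numel()) % block_size
        if pad:                                    # pad so the tensor splits into full blocks
            flat = torch.cat([flat, flat.new_zeros(pad)])
        blocks = flat.view(-1, block_size)

        # One scale per block, chosen so the max magnitude maps into int8 range.
        scale = blocks.abs().amax(dim=1, keepdim=True) / 127.0
        scale = scale.clamp(min=1e-12).half().float()      # scales stored in FP16
        q = torch.round(blocks / scale).clamp(-127, 127)   # 8-bit integer codes
        dq = q * scale                                      # dequantize back to float

        out = dq.flatten()[: weight.numel()].view_as(weight)
        return out.to(weight.dtype)

The rounding in this round trip is the source of the small cosine-similarity loss measured in the next section.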
- ================================================================================
- EMBEDDING ACCURACY COMPARISON
- ================================================================================
- [FP16 Fast vs FP16 Baseline]
- Cosine Similarity: 0.999999
- (std: 0.000000, min: 0.999999)
- MSE: 0.00e+00
- MAE: 3.43e-05
- L2 norm difference: 8.01e-04
- Max difference: 9.77e-04
- Perplexity metric: 2352149.492204
- [Q8 GGUF vs FP16 Baseline]
- Cosine Similarity: 0.999648
- (std: 0.000381, min: 0.998807)
- MSE: 1.49e-06
- MAE: 8.27e-04
- L2 norm difference: 7.02e-02
- Max difference: 2.29e-02
- Perplexity metric: 5173.351121
- [FP16 Fast vs Q8 GGUF] - THE CRITICAL COMPARISON
- Cosine Similarity: 0.999648
- (std: 0.000385)
- MSE: 1.49e-06
- MAE: 8.27e-04
- Per-prompt comparison (Cosine Similarity):
- Prompt FP16 Fast Q8 GGUF Winner
- ------------------------------------------------------- ------------ ------------ ------------
- a cat sitting on a chair 1.000000 0.999849 FP16 Fast
- cinematic shot of a futuristic cyberpunk city at n... 1.000000 0.999692 FP16 Fast
- close-up of delicate water droplets on a spider we... 1.000000 0.999865 FP16 Fast
- abstract concept of time dissolving into fractals 0.999999 0.998807 FP16 Fast
- professional product photography of a luxury watch... 1.000000 0.999872 FP16 Fast
- anime style illustration of a magical forest with ... 0.999999 0.999804 FP16 Fast
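The accuracy numbers above (cosine similarity with std/min, MSE, MAE, L2-norm difference, max difference) are standard elementwise comparisons between two embedding batches. A sketch of how such metrics could be computed from two (batch, 4096) embedding tensors; the function name is illustrative:

    import torch
    import torch.nn.functional as F

    def compare_embeddings(ref: torch.Tensor, test: torch.Tensor) -> dict:
        """Compare two (batch, dim) embedding tensors against a reference."""
        ref32, test32 = ref.float(), test.float()
        cos = F.cosine_similarity(ref32, test32, dim=-1)   # one value per prompt
        diff = test32 - ref32
        return {
            "cosine_mean": cos.mean().item(),
            "cosine_std": cos.std().item(),
            "cosine_min": cos.min().item(),
            "mse": diff.pow(2).mean().item(),
            "mae": diff.abs().mean().item(),
            "l2_norm_diff": diff.norm().item(),
            "max_diff": diff.abs().max().item(),
        }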
- ================================================================================
- PERFORMANCE SUMMARY
- ================================================================================
- Speed Comparison (lower is better):
- FP16 Baseline: 0.1296s ± 0.0045s
- FP16 Fast: 0.1150s ± 0.0005s
- Q8 GGUF: 0.1059s ± 0.0009s
- FP16 Fast speedup vs Baseline: 11.3%
- Q8 GGUF speedup vs FP16 Fast: 7.9%
- VRAM Usage (lower is better):
- FP16 Baseline: 10.76 GB
- FP16 Fast: 10.76 GB
- Q8 GGUF: 11.50 GB
- Q8 GGUF VRAM savings vs FP16: -6.9% (i.e., 6.9% more VRAM, since the Q8 weights were dequantized to FP16 at load)
- ================================================================================
- EMBEDDING ACCURACY SUMMARY (Higher cosine similarity = Better)
- ================================================================================
- FP16 Fast vs Baseline:
- ✓ Cosine Similarity: 0.99999946
- ✓ Quality Loss: 0.000054%
- ✓ Status: NEGLIGIBLE DIFFERENCE (>0.9999 threshold)
- Q8 GGUF vs Baseline:
- ⚠️ Cosine Similarity: 0.99964816
- ⚠️ Quality Loss: 0.035184%
- ❌ Q8 is WORSE than FP16 Fast by 0.035130% cosine similarity
- ================================================================================
- 🏆 FINAL VERDICT
- ================================================================================
- Quality Ranking (Cosine Similarity to FP16 Baseline):
- 1. FP16 Baseline: 1.00000000 (reference)
- 2. 🥇 FP16 Fast: 0.99999946 ✓ WINNER
- 3. Q8 GGUF: 0.99964816
- Speed Ranking (Time per batch):
- 1. Q8 GGUF 0.1059s 🥇 FASTEST
- 2. FP16 Fast 0.1150s
- 3. FP16 Baseline 0.1296s
- 🎯 RECOMMENDATION FOR TEXT-TO-IMAGE/VIDEO (Flux, HunyuanVideo):
- Use FP16 + Fast Accumulation (TF32/BF16)
- WHY:
- ✓ FP16 Fast has 0.035130% BETTER quality than Q8 GGUF
- ✓ FP16 Fast is 11.3% faster than baseline
- ✓ No quantization artifacts (Q8 has rounding errors)
- ✓ Native hardware support (no dequantization overhead)
- ⚠️ Q8 GGUF is MIXED PRECISION (Q8_0 + F32 blocks)
- ⚠️ Q8 requires dequantization which adds latency
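Putting the recommendation into practice for a Flux/HunyuanVideo-style text-encoding step would look roughly like this, assuming the standard transformers T5 classes and using the newer `dtype` argument that the deprecation warning above points to (the model ID and sequence length are illustrative):

    import torch
    from transformers import T5EncoderModel, T5TokenizerFast

    # Enable the "fast accumulation" paths before the first forward pass.
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True
    torch.set_float32_matmul_precision("high")

    tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")
    encoder = T5EncoderModel.from_pretrained(
        "google/t5-v1_1-xxl", dtype=torch.float16
    ).to("cuda").eval()

    with torch.no_grad():
        tokens = tokenizer(
            "cinematic shot of a futuristic cyberpunk city at night",
            return_tensors="pt", padding="max_length", max_length=256, truncation=True,
        ).to("cuda")
        embeddings = encoder(**tokens).last_hidden_state  # shape (1, 256, 4096)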
- ================================================================================