- ================================================================================
- COMPREHENSIVE T5 TEXT ENCODER EVALUATION
- FP16 Baseline vs FP16 Fast vs Q8 GGUF Quantization
- ================================================================================
- Loading tokenizer...
- You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
- ================================================================================
- BENCHMARK 1: FP16 BASELINE
- ================================================================================
- Loading FP16 baseline model...
- `torch_dtype` is deprecated! Use `dtype` instead!
- Encoding prompts...
- Benchmarking speed...
- ✓ Speed: 0.1296s ± 0.0045s
- ✓ VRAM: 10.76 GB
- ✓ Embedding shape: (6, 4096)
- ✓ Embedding dtype: float16
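The speed and VRAM figures reported in each benchmark section are the kind of numbers produced by a warmed-up, CUDA-synchronized timing loop. A minimal sketch of such a loop (the encode callable and run counts are assumptions, not the script's actual code):

    import time
    import torch

    def benchmark_encoder(encode, n_warmup=3, n_runs=20):
        """Time an encode() call on GPU with proper synchronization."""
        torch.cuda.reset_peak_memory_stats()
        for _ in range(n_warmup):          # warm-up runs trigger lazy init / kernel autotuning
            encode()
        torch.cuda.synchronize()

        times = []
        for _ in range(n_runs):
            start = time.perf_counter()
            encode()
            torch.cuda.synchronize()       # wait for the GPU before reading the clock
            times.append(time.perf_counter() - start)

        mean = sum(times) / len(times)
        std = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5
        vram_gb = torch.cuda.max_memory_allocated() / 1024**3
        return mean, std, vram_gb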
- ================================================================================
- BENCHMARK 2: FP16 WITH FAST ACCUMULATION (TF32)
- ================================================================================
- Loading FP16 fast model...
- Encoding prompts...
- Benchmarking speed...
- ✓ Speed: 0.1150s ± 0.0005s
- ✓ VRAM: 10.76 GB
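"Fast accumulation" here refers to PyTorch's reduced-precision matmul paths: TF32 tensor cores for FP32 matmuls and reduced-precision accumulation for FP16 matmuls. A minimal sketch of the flags involved, assuming a recent PyTorch build with CUDA:

    import torch

    # Allow TF32 tensor cores for float32 matmuls and cuDNN convolutions.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    # Equivalent high-level switch for float32 matmul precision.
    torch.set_float32_matmul_precision("high")

    # Allow FP16 matmuls to accumulate in reduced precision (the "FP16 fast" path).
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

These flags must be set before the encoder runs its first forward pass; they change numerics slightly, which is exactly what the accuracy comparison below quantifies.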
- ================================================================================
- BENCHMARK 3: Q8 GGUF QUANTIZATION (MIXED PRECISION)
- ================================================================================
- 📊 Analyzing GGUF file structure...
- 📁 Analyzing GGUF model: /home/local/Downloads/paw/model_cache/models--city96--t5-v1_1-xxl-encoder-gguf/snapshots/005a6ea51a7d0b84d677b3e633bb52a8c85a83d9/t5-v1_1-xxl-encoder-Q8_0.gguf
- Architecture: t5encoder (raw bytes: [116 53 101 110 99 111 100 101 114])
- Total tensors: 219
- Quantization breakdown:
- • type 0 (F32): 50 tensors (22.8%)
- • type 8 (Q8_0): 169 tensors (77.2%)
- Sample tensor types:
- • enc.blk.0.attn_k.weight: 8 [4096 4096]
- • enc.blk.0.attn_o.weight: 8 [4096 4096]
- • enc.blk.0.attn_q.weight: 8 [4096 4096]
- • enc.blk.0.attn_rel_b.weight: 0 [64 32]
- • enc.blk.0.attn_v.weight: 8 [4096 4096]
- • enc.blk.0.attn_norm.weight: 0 [4096]
- • enc.blk.0.ffn_gate.weight: 8 [ 4096 10240]
- • enc.blk.0.ffn_up.weight: 8 [ 4096 10240]
- • enc.blk.0.ffn_down.weight: 8 [10240 4096]
- • enc.blk.0.ffn_norm.weight: 0 [4096]
- ⚠️ CRITICAL FINDING:
- Q8_0 GGUF is MIXED PRECISION, not pure Q8!
- Contains: {'8': 169, '0': 50}
- This means some tensors are Q8_0 (quantized) and some are F32 (full precision)
- Even the 'quantized' tensors carry an FP16 scale per block of 32 values!
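The tensor-type breakdown above (GGML type 0 = F32, type 8 = Q8_0) can be reproduced with the gguf Python package that ships with the llama.cpp project. A sketch under that assumption; field names follow the GGUFReader API as I understand it:

    from collections import Counter
    import gguf  # pip install gguf

    reader = gguf.GGUFReader("t5-v1_1-xxl-encoder-Q8_0.gguf")

    # Count tensors per GGML quantization type (0 = F32, 8 = Q8_0).
    type_counts = Counter(int(t.tensor_type) for t in reader.tensors)
    total = sum(type_counts.values())
    for ggml_type, count in sorted(type_counts.items()):
        print(f"type {ggml_type}: {count} tensors ({100 * count / total:.1f}%)")

    # Inspect a few tensors: name, type, shape.
    for t in reader.tensors[:10]:
        print(t.name, int(t.tensor_type), list(t.shape))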
- 🔄 Loading Q8 GGUF model (simulating dequantization)...
- 🔄 Loading Q8 GGUF and dequantizing to FP16 (simulating ComfyUI-GGUF)
- Simulating Q8_0 quantization artifacts...
- (Q8_0 = 8-bit int + FP16 scale per block of 32 values)
- Quantized 170 weight tensors
- Encoding prompts...
- Benchmarking speed...
- ✓ Speed: 0.1059s ± 0.0009s
- ✓ VRAM: 11.50 GB
- ✓ Embedding shape: (6, 4096)
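The "simulating Q8_0 quantization artifacts" step above (8-bit integers plus one FP16 scale per block of 32 values) amounts to a quantize/dequantize round trip applied to each weight tensor. A minimal sketch of that round trip, not the script's actual implementation:

    import torch

    def simulate_q8_0(weight: torch.Tensor, block_size: int = 32) -> torch.Tensor:
        """Round-trip a weight tensor through Q8_0-style block quantization."""
        flat = weight.float().flatten()
        pad = (-flat.numel()) % block_size
        if pad:                                    # pad so the tensor splits into full blocks
            flat = torch.cat([flat, flat.new_zeros(pad)])
        blocks = flat.view(-1, block_size)

        # One scale per block, chosen so the max magnitude maps into int8 range.
        scale = blocks.abs().amax(dim=1, keepdim=True) / 127.0
        scale = scale.clamp(min=1e-12).half().float()      # scales stored in FP16
        q = torch.round(blocks / scale).clamp(-127, 127)   # 8-bit integer codes
        dq = q * scale                                      # dequantize back to float

        out = dq.flatten()[: weight.numel()].view_as(weight)
        return out.to(weight.dtype)

The rounding in this round trip is the source of the small cosine-similarity loss measured in the next section.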
- ================================================================================
- EMBEDDING ACCURACY COMPARISON
- ================================================================================
- [FP16 Fast vs FP16 Baseline]
- Cosine Similarity: 0.999999
- (std: 0.000000, min: 0.999999)
- MSE: 0.00e+00
- MAE: 3.43e-05
- L2 norm difference: 8.01e-04
- Max difference: 9.77e-04
- Perplexity metric: 2352149.492204
- [Q8 GGUF vs FP16 Baseline]
- Cosine Similarity: 0.999648
- (std: 0.000381, min: 0.998807)
- MSE: 1.49e-06
- MAE: 8.27e-04
- L2 norm difference: 7.02e-02
- Max difference: 2.29e-02
- Perplexity metric: 5173.351121
- [FP16 Fast vs Q8 GGUF] - THE CRITICAL COMPARISON
- Cosine Similarity: 0.999648
- (std: 0.000385)
- MSE: 1.49e-06
- MAE: 8.27e-04
- Per-prompt comparison (Cosine Similarity):
- Prompt FP16 Fast Q8 GGUF Winner
- ------------------------------------------------------- ------------ ------------ ------------
- a cat sitting on a chair 1.000000 0.999849 FP16 Fast
- cinematic shot of a futuristic cyberpunk city at n... 1.000000 0.999692 FP16 Fast
- close-up of delicate water droplets on a spider we... 1.000000 0.999865 FP16 Fast
- abstract concept of time dissolving into fractals 0.999999 0.998807 FP16 Fast
- professional product photography of a luxury watch... 1.000000 0.999872 FP16 Fast
- anime style illustration of a magical forest with ... 0.999999 0.999804 FP16 Fast
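The accuracy numbers above (cosine similarity with std/min, MSE, MAE, L2-norm difference, max difference) are standard elementwise comparisons between two embedding batches. A sketch of how such metrics could be computed from two (batch, 4096) embedding tensors; the function name is illustrative:

    import torch
    import torch.nn.functional as F

    def compare_embeddings(ref: torch.Tensor, test: torch.Tensor) -> dict:
        """Compare two (batch, dim) embedding tensors against a reference."""
        ref32, test32 = ref.float(), test.float()
        cos = F.cosine_similarity(ref32, test32, dim=-1)   # one value per prompt
        diff = test32 - ref32
        return {
            "cosine_mean": cos.mean().item(),
            "cosine_std": cos.std().item(),
            "cosine_min": cos.min().item(),
            "mse": diff.pow(2).mean().item(),
            "mae": diff.abs().mean().item(),
            "l2_norm_diff": diff.norm().item(),
            "max_diff": diff.abs().max().item(),
        }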
- ================================================================================
- PERFORMANCE SUMMARY
- ================================================================================
- Speed Comparison (lower is better):
- FP16 Baseline: 0.1296s ± 0.0045s
- FP16 Fast: 0.1150s ± 0.0005s
- Q8 GGUF: 0.1059s ± 0.0009s
- FP16 Fast speedup vs Baseline: 11.3%
- Q8 GGUF speedup vs FP16 Fast: 7.9%
- VRAM Usage (lower is better):
- FP16 Baseline: 10.76 GB
- FP16 Fast: 10.76 GB
- Q8 GGUF: 11.50 GB
- Q8 GGUF VRAM savings vs FP16: -6.9% (i.e., 6.9% more VRAM, since the Q8 weights were dequantized to FP16 at load)
- ================================================================================
- EMBEDDING ACCURACY SUMMARY (Higher cosine similarity = Better)
- ================================================================================
- FP16 Fast vs Baseline:
- ✓ Cosine Similarity: 0.99999946
- ✓ Quality Loss: 0.000054%
- ✓ Status: NEGLIGIBLE DIFFERENCE (>0.9999 threshold)
- Q8 GGUF vs Baseline:
- ⚠️ Cosine Similarity: 0.99964816
- ⚠️ Quality Loss: 0.035184%
- ❌ Q8 is WORSE than FP16 Fast by 0.035130% cosine similarity
- ================================================================================
- 🏆 FINAL VERDICT
- ================================================================================
- Quality Ranking (Cosine Similarity to FP16 Baseline):
- 1. FP16 Baseline: 1.00000000 (reference)
- 2. 🥇 FP16 Fast: 0.99999946 ✓ WINNER
- 3. Q8 GGUF: 0.99964816
- Speed Ranking (Time per batch):
- 1. Q8 GGUF 0.1059s 🥇 FASTEST
- 2. FP16 Fast 0.1150s
- 3. FP16 Baseline 0.1296s
- 🎯 RECOMMENDATION FOR TEXT-TO-IMAGE/VIDEO (Flux, HunyuanVideo):
- Use FP16 + Fast Accumulation (TF32/BF16)
- WHY:
- ✓ FP16 Fast has 0.035130% BETTER quality than Q8 GGUF
- ✓ FP16 Fast is 11.3% faster than baseline
- ✓ No quantization artifacts (Q8 has rounding errors)
- ✓ Native hardware support (no dequantization overhead)
- ⚠️ Q8 GGUF is MIXED PRECISION (Q8_0 + F32 blocks)
- ⚠️ Q8 requires dequantization which adds latency
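Putting the recommendation into practice for a Flux/HunyuanVideo-style text-encoding step would look roughly like this, assuming the standard transformers T5 classes and using the newer `dtype` argument that the deprecation warning above points to (the model ID and sequence length are illustrative):

    import torch
    from transformers import T5EncoderModel, T5TokenizerFast

    # Enable the "fast accumulation" paths before the first forward pass.
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True
    torch.set_float32_matmul_precision("high")

    tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")
    encoder = T5EncoderModel.from_pretrained(
        "google/t5-v1_1-xxl", dtype=torch.float16
    ).to("cuda").eval()

    with torch.no_grad():
        tokens = tokenizer(
            "cinematic shot of a futuristic cyberpunk city at night",
            return_tensors="pt", padding="max_length", max_length=256, truncation=True,
        ).to("cuda")
        embeddings = encoder(**tokens).last_hidden_state  # shape (1, 256, 4096)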
- ================================================================================