- Currently enabled native sdp backends: ['flash', 'math', 'mem_efficient', 'cudnn']
- Xformers is not installed!
- Flash Attn is not installed!
- Sage Attn is installed!
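
The four lines above report which attention implementations were detected at startup. A minimal sketch of such a probe, assuming the standard torch.backends.cuda getters and plain import checks (FramePack's actual detection code may differ):

import importlib.util
import torch

# Native SDP backends enabled in this PyTorch build. The cudnn getter is
# newer than the other three, hence the getattr guard.
native = {
    'flash': torch.backends.cuda.flash_sdp_enabled(),
    'math': torch.backends.cuda.math_sdp_enabled(),
    'mem_efficient': torch.backends.cuda.mem_efficient_sdp_enabled(),
}
cudnn_probe = getattr(torch.backends.cuda, 'cudnn_sdp_enabled', None)
if cudnn_probe is not None:
    native['cudnn'] = cudnn_probe()
print('Currently enabled native sdp backends:', [k for k, v in native.items() if v])

# Optional third-party attention packages, detected by import availability.
for pkg, label in [('xformers', 'Xformers'), ('flash_attn', 'Flash Attn'),
                   ('sageattention', 'Sage Attn')]:
    state = 'installed' if importlib.util.find_spec(pkg) else 'not installed'
    print(f'{label} is {state}!')
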
- Namespace(share=True, server='0.0.0.0', port=None, inbrowser=False)
- Free VRAM 14.6407470703125 GB
- High-VRAM Mode: False
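
The free-VRAM figure comes from CUDA's free-memory query, and the high-VRAM flag is a threshold on it. A sketch of the likely logic; the 60 GB cutoff is an assumption for illustration, not a value taken from this log:

import torch

# torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the device.
free_bytes, total_bytes = torch.cuda.mem_get_info()
free_mem_gb = free_bytes / (1024 ** 3)
print(f'Free VRAM {free_mem_gb} GB')

# Assumed cutoff: with ~14.6 GB free (a 16 GB card), the app stays in low-VRAM
# mode and swaps models in and out of the GPU instead of keeping them resident.
high_vram = free_mem_gb > 60
print(f'High-VRAM Mode: {high_vram}')
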
- Downloading shards: 100% 4/4 [00:00<00:00, 13127.71it/s]
- Loading checkpoint shards: 100% 4/4 [00:01<00:00, 3.79it/s]
- Fetching 3 files: 100% 3/3 [00:00<00:00, 8045.34it/s]
- Loading checkpoint shards: 100% 3/3 [00:01<00:00, 1.65it/s]
- transformer.high_quality_fp32_output_for_inference = True
- * Running on local URL: http://
- * Running on public URL: https://
- This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
- Unloaded DynamicSwap_LlamaModel as complete.
- Unloaded CLIPTextModel as complete.
- Unloaded SiglipVisionModel as complete.
- Unloaded AutoencoderKLHunyuanVideo as complete.
- Unloaded DynamicSwap_HunyuanVideoTransformer3DModelPacked as complete.
- Loaded CLIPTextModel to cuda:0 as complete.
- Unloaded CLIPTextModel as complete.
- Loaded AutoencoderKLHunyuanVideo to cuda:0 as complete.
- Unloaded AutoencoderKLHunyuanVideo as complete.
- Loaded SiglipVisionModel to cuda:0 as complete.
- latent_padding_size = 27, is_last_section = False
- Unloaded SiglipVisionModel as complete.
- Moving DynamicSwap_HunyuanVideoTransformer3DModelPacked to cuda:0 with preserved memory: 6 GB
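
In low-VRAM mode, the load/unload lines above show each model being brought to cuda:0 only for its step, and the transformer being moved while reserving about 6 GB of headroom for activations. A rough sketch of that idea, with entirely illustrative names (this is not FramePack's actual API):

import torch

def move_with_memory_preservation(model, device='cuda:0', preserved_gb=6.0):
    # Hypothetical helper: move submodules to the GPU one by one, stopping
    # while at least `preserved_gb` of VRAM remains free for activations.
    for module in model.children():
        free_gb = torch.cuda.mem_get_info()[0] / (1024 ** 3)
        if free_gb <= preserved_gb:
            break  # leave the remaining submodules on CPU; swap in on demand
        module.to(device)
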
- 0% 0/25 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/sageattention/attn_qk_int8_per_block.py:18:23: error: 'tt.fp_to_fp' op operand #0 must be floating-point or ranked tensor of floating-point values, but got 'tensor<128x128xi8, #ttg.dot_op<{opIdx = 0, parent = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [2, 16], warpsPerCTA = [8, 1], order = [1, 0]}>}>>'
- qk = tl.dot(q, k).to(tl.float32) * q_scale * k_scale
- ^
- /usr/local/lib/python3.10/dist-packages/sageattention/attn_qk_int8_per_block.py:78:55: note: called from
- 4 - STAGE, offs_m, offs_n
- ^
- module {
- tt.func public @_attn_fwd(%arg0: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg4: !tt.ptr<f32> {tt.divisibility = 16 : i32}, %arg5: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg6: i32 {tt.divisibility = 16 : i32}, %arg7: i32 {tt.divisibility = 16 : i32}, %arg8: i32 {tt.divisibility = 16 : i32}, %arg9: i32 {tt.divisibility = 16 : i32}, %arg10: i32 {tt.divisibility = 16 : i32}, %arg11: i32 {tt.divisibility = 16 : i32}, %arg12: i32 {tt.divisibility = 16 : i32}, %arg13: i32 {tt.divisibility = 16 : i32}, %arg14: i32 {tt.divisibility = 16 : i32}, %arg15: i32 {tt.divisibility = 16 : i32}, %arg16: i32 {tt.divisibility = 16 : i32}, %arg17: i32 {tt.divisibility = 16 : i32}, %arg18: i32, %arg19: i32) attributes {noinline = false} {
- %cst = arith.constant dense<1.000000e+00> : tensor<128xf32>
- %cst_0 = arith.constant dense<0xFF800000> : tensor<128xf32>
- %c0_i32 = arith.constant 0 : i32
- %cst_1 = arith.constant dense<0> : tensor<128x64xi32>
- %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf16>
- %c1_i32 = arith.constant 1 : i32
- %cst_3 = arith.constant dense<0.000000e+00> : tensor<128x128xf32>
- %c63_i32 = arith.constant 63 : i32
- %c64_i32 = arith.constant 64 : i32
- %c127_i32 = arith.constant 127 : i32
- %c128_i32 = arith.constant 128 : i32
- %c24_i64 = arith.constant 24 : i64
- %0 = tt.get_program_id x : i32
- %1 = tt.get_program_id z : i32
- %2 = arith.extsi %1 : i32 to i64
- %3 = tt.get_program_id y : i32
- %4 = arith.extsi %3 : i32 to i64
- %5 = arith.muli %2, %c24_i64 : i64
- %6 = arith.addi %5, %4 : i64
- %7 = arith.addi %arg18, %c127_i32 : i32
- %8 = arith.divsi %7, %c128_i32 : i32
- %9 = arith.extsi %8 : i32 to i64
- %10 = arith.muli %6, %9 : i64
- %11 = arith.addi %arg19, %c63_i32 : i32
- %12 = arith.divsi %11, %c64_i32 : i32
- %13 = arith.extsi %12 : i32 to i64
- %14 = arith.muli %6, %13 : i64
- %15 = arith.muli %0, %c128_i32 : i32
- %16 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
- %17 = tt.splat %15 : i32 -> tensor<128xi32>
- %18 = arith.addi %17, %16 : tensor<128xi32>
- %19 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
- %20 = arith.extsi %arg6 : i32 to i64
- %21 = arith.muli %2, %20 : i64
- %22 = arith.extsi %arg7 : i32 to i64
- %23 = arith.muli %4, %22 : i64
- %24 = arith.addi %21, %23 : i64
- %25 = tt.addptr %arg0, %24 : !tt.ptr<i8>, i64
- %26 = tt.expand_dims %18 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
- %27 = tt.splat %arg8 : i32 -> tensor<128x1xi32>
- %28 = arith.muli %26, %27 : tensor<128x1xi32>
- %29 = tt.splat %25 : !tt.ptr<i8> -> tensor<128x1x!tt.ptr<i8>>
- %30 = tt.addptr %29, %28 : tensor<128x1x!tt.ptr<i8>>, tensor<128x1xi32>
- %31 = tt.expand_dims %16 {axis = 0 : i32} : tensor<128xi32> -> tensor<1x128xi32>
- %32 = tt.broadcast %30 : tensor<128x1x!tt.ptr<i8>> -> tensor<128x128x!tt.ptr<i8>>
- %33 = tt.broadcast %31 : tensor<1x128xi32> -> tensor<128x128xi32>
- %34 = tt.addptr %32, %33 : tensor<128x128x!tt.ptr<i8>>, tensor<128x128xi32>
- %35 = tt.addptr %arg3, %10 : !tt.ptr<f32>, i64
- %36 = tt.addptr %35, %0 : !tt.ptr<f32>, i32
- %37 = arith.extsi %arg9 : i32 to i64
- %38 = arith.muli %2, %37 : i64
- %39 = arith.extsi %arg10 : i32 to i64
- %40 = arith.muli %4, %39 : i64
- %41 = arith.addi %38, %40 : i64
- %42 = tt.addptr %arg1, %41 : !tt.ptr<i8>, i64
- %43 = tt.expand_dims %19 {axis = 0 : i32} : tensor<64xi32> -> tensor<1x64xi32>
- %44 = tt.splat %arg11 : i32 -> tensor<1x64xi32>
- %45 = arith.muli %43, %44 : tensor<1x64xi32>
- %46 = tt.splat %42 : !tt.ptr<i8> -> tensor<1x64x!tt.ptr<i8>>
- %47 = tt.addptr %46, %45 : tensor<1x64x!tt.ptr<i8>>, tensor<1x64xi32>
- %48 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32> -> tensor<128x1xi32>
- %49 = tt.broadcast %47 : tensor<1x64x!tt.ptr<i8>> -> tensor<128x64x!tt.ptr<i8>>
- %50 = tt.broadcast %48 : tensor<128x1xi32> -> tensor<128x64xi32>
- %51 = tt.addptr %49, %50 : tensor<128x64x!tt.ptr<i8>>, tensor<128x64xi32>
- %52 = tt.addptr %arg4, %14 : !tt.ptr<f32>, i64
- %53 = arith.extsi %arg12 : i32 to i64
- %54 = arith.muli %2, %53 : i64
- %55 = arith.extsi %arg13 : i32 to i64
- %56 = arith.muli %4, %55 : i64
- %57 = arith.addi %54, %56 : i64
- %58 = tt.addptr %arg2, %57 : !tt.ptr<f16>, i64
- %59 = tt.expand_dims %19 {axis = 1 : i32} : tensor<64xi32> -> tensor<64x1xi32>
- %60 = tt.splat %arg14 : i32 -> tensor<64x1xi32>
- %61 = arith.muli %59, %60 : tensor<64x1xi32>
- %62 = tt.splat %58 : !tt.ptr<f16> -> tensor<64x1x!tt.ptr<f16>>
- %63 = tt.addptr %62, %61 : tensor<64x1x!tt.ptr<f16>>, tensor<64x1xi32>
- %64 = tt.broadcast %63 : tensor<64x1x!tt.ptr<f16>> -> tensor<64x128x!tt.ptr<f16>>
- %65 = tt.broadcast %31 : tensor<1x128xi32> -> tensor<64x128xi32>
- %66 = tt.addptr %64, %65 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32>
- %67 = arith.extsi %arg15 : i32 to i64
- %68 = arith.muli %2, %67 : i64
- %69 = arith.extsi %arg16 : i32 to i64
- %70 = arith.muli %4, %69 : i64
- %71 = arith.addi %68, %70 : i64
- %72 = tt.addptr %arg5, %71 : !tt.ptr<bf16>, i64
- %73 = tt.splat %arg17 : i32 -> tensor<128x1xi32>
- %74 = arith.muli %26, %73 : tensor<128x1xi32>
- %75 = tt.splat %72 : !tt.ptr<bf16> -> tensor<128x1x!tt.ptr<bf16>>
- %76 = tt.addptr %75, %74 : tensor<128x1x!tt.ptr<bf16>>, tensor<128x1xi32>
- %77 = tt.broadcast %76 : tensor<128x1x!tt.ptr<bf16>> -> tensor<128x128x!tt.ptr<bf16>>
- %78 = tt.addptr %77, %33 : tensor<128x128x!tt.ptr<bf16>>, tensor<128x128xi32>
- %79 = tt.splat %arg18 : i32 -> tensor<128x1xi32>
- %80 = arith.cmpi slt, %26, %79 : tensor<128x1xi32>
- %81 = tt.broadcast %80 : tensor<128x1xi1> -> tensor<128x128xi1>
- %82 = tt.load %34, %81 : tensor<128x128x!tt.ptr<i8>>
- %83 = tt.load %36 : !tt.ptr<f32>
- %84:6 = scf.for %arg20 = %c0_i32 to %arg19 step %c64_i32 iter_args(%arg21 = %cst, %arg22 = %cst_3, %arg23 = %cst_0, %arg24 = %51, %arg25 = %52, %arg26 = %66) -> (tensor<128xf32>, tensor<128x128xf32>, tensor<128xf32>, tensor<128x64x!tt.ptr<i8>>, !tt.ptr<f32>, tensor<64x128x!tt.ptr<f16>>) : i32 {
- %89 = arith.subi %arg19, %arg20 : i32
- %90 = tt.splat %89 : i32 -> tensor<1x64xi32>
- %91 = arith.cmpi slt, %43, %90 : tensor<1x64xi32>
- %92 = tt.broadcast %91 : tensor<1x64xi1> -> tensor<128x64xi1>
- %93 = tt.load %arg24, %92 : tensor<128x64x!tt.ptr<i8>>
- %94 = tt.load %arg25 : !tt.ptr<f32>
- %95 = tt.dot %82, %93, %cst_1, inputPrecision = tf32 : tensor<128x128xi8> * tensor<128x64xi8> -> tensor<128x64xi32>
- %96 = arith.sitofp %95 : tensor<128x64xi32> to tensor<128x64xf32>
- %97 = tt.splat %83 : f32 -> tensor<128x64xf32>
- %98 = arith.mulf %96, %97 : tensor<128x64xf32>
- %99 = tt.splat %94 : f32 -> tensor<128x64xf32>
- %100 = arith.mulf %98, %99 : tensor<128x64xf32>
- %101 = "tt.reduce"(%100) <{axis = 1 : i32}> ({
- ^bb0(%arg27: f32, %arg28: f32):
- %130 = arith.maxnumf %arg27, %arg28 : f32
- tt.reduce.return %130 : f32
- }) : (tensor<128x64xf32>) -> tensor<128xf32>
- %102 = arith.maxnumf %arg23, %101 : tensor<128xf32>
- %103 = tt.expand_dims %102 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
- %104 = tt.broadcast %103 : tensor<128x1xf32> -> tensor<128x64xf32>
- %105 = arith.subf %100, %104 : tensor<128x64xf32>
- %106 = math.exp2 %105 : tensor<128x64xf32>
- %107 = "tt.reduce"(%106) <{axis = 1 : i32}> ({
- ^bb0(%arg27: f32, %arg28: f32):
- %130 = arith.addf %arg27, %arg28 : f32
- tt.reduce.return %130 : f32
- }) : (tensor<128x64xf32>) -> tensor<128xf32>
- %108 = arith.subf %arg23, %102 : tensor<128xf32>
- %109 = math.exp2 %108 : tensor<128xf32>
- %110 = arith.mulf %arg21, %109 : tensor<128xf32>
- %111 = arith.addf %110, %107 : tensor<128xf32>
- %112 = tt.expand_dims %109 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
- %113 = tt.broadcast %112 : tensor<128x1xf32> -> tensor<128x128xf32>
- %114 = arith.mulf %arg22, %113 : tensor<128x128xf32>
- %115 = tt.splat %89 : i32 -> tensor<64x1xi32>
- %116 = arith.cmpi slt, %59, %115 : tensor<64x1xi32>
- %117 = tt.broadcast %116 : tensor<64x1xi1> -> tensor<64x128xi1>
- %118 = tt.load %arg26, %117 : tensor<64x128x!tt.ptr<f16>>
- %119 = arith.truncf %106 : tensor<128x64xf32> to tensor<128x64xf16>
- %120 = tt.dot %119, %118, %cst_2, inputPrecision = tf32 : tensor<128x64xf16> * tensor<64x128xf16> -> tensor<128x128xf16>
- %121 = arith.extf %120 : tensor<128x128xf16> to tensor<128x128xf32>
- %122 = arith.addf %114, %121 : tensor<128x128xf32>
- %123 = arith.muli %arg11, %c64_i32 : i32
- %124 = tt.splat %123 : i32 -> tensor<128x64xi32>
- %125 = tt.addptr %arg24, %124 : tensor<128x64x!tt.ptr<i8>>, tensor<128x64xi32>
- %126 = tt.addptr %arg25, %c1_i32 : !tt.ptr<f32>, i32
- %127 = arith.muli %arg14, %c64_i32 : i32
- %128 = tt.splat %127 : i32 -> tensor<64x128xi32>
- %129 = tt.addptr %arg26, %128 : tensor<64x128x!tt.ptr<f16>>, tensor<64x128xi32>
- scf.yield %111, %122, %102, %125, %126, %129 : tensor<128xf32>, tensor<128x128xf32>, tensor<128xf32>, tensor<128x64x!tt.ptr<i8>>, !tt.ptr<f32>, tensor<64x128x!tt.ptr<f16>>
- } {tt.divisibility_arg1 = dense<64> : tensor<1xi32>}
- %85 = tt.expand_dims %84#0 {axis = 1 : i32} : tensor<128xf32> -> tensor<128x1xf32>
- %86 = tt.broadcast %85 : tensor<128x1xf32> -> tensor<128x128xf32>
- %87 = arith.divf %84#1, %86 : tensor<128x128xf32>
- %88 = arith.truncf %87 : tensor<128x128xf32> to tensor<128x128xbf16>
- tt.store %78, %88, %81 : tensor<128x128x!tt.ptr<bf16>>
- tt.return
- }
- }
- {-#
- external_resources: {
- mlir_reproducer: {
- pipeline: "builtin.module(convert-triton-to-tritongpu{num-ctas=1 num-warps=8 target=cuda:75 threads-per-warp=32}, tritongpu-coalesce, triton-nvidia-gpu-plan-cta, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritongpu-accelerate-matmul, tritongpu-remove-layout-conversions, tritongpu-optimize-dot-operands{hoist-layout-conversion=false}, cse, loop-invariant-code-motion, tritongpu-prefetch, tritongpu-optimize-dot-operands{hoist-layout-conversion=false}, tritongpu-coalesce-async-copy, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tritongpu-reorder-instructions, cse, symbol-dce, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true})",
- disable_threading: false,
- verify_each: true
- }
- }
- #-}
- /usr/local/lib/python3.10/dist-packages/sageattention/attn_qk_int8_per_block.py:40:0: error: Failures have been detected while processing an MLIR pass pipeline
- /usr/local/lib/python3.10/dist-packages/sageattention/attn_qk_int8_per_block.py:40:0: note: Pipeline failed while executing [`TritonGPUAccelerateMatmul` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
- 0% 0/25 [00:06<?, ?it/s]
- Traceback (most recent call last):
- File "/content/FramePack/demo_gradio.py", line 241, in worker
- generated_latents = sample_hunyuan(
- File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
- return func(*args, **kwargs)
- File "/content/FramePack/diffusers_helper/pipelines/k_diffusion_hunyuan.py", line 116, in sample_hunyuan
- results = sample_unipc(k_model, latents, sigmas, extra_args=sampler_kwargs, disable=False, callback=callback)
- File "/content/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 141, in sample_unipc
- return FlowMatchUniPC(model, extra_args=extra_args, variant=variant).sample(noise, sigmas=sigmas, callback=callback, disable_pbar=disable)
- File "/content/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 118, in sample
- model_prev_list = [self.model_fn(x, vec_t)]
- File "/content/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 23, in model_fn
- return self.model(x, t, **self.extra_args)
- File "/content/FramePack/diffusers_helper/k_diffusion/wrapper.py", line 37, in k_model
- pred_positive = transformer(hidden_states=hidden_states, timestep=timestep, return_dict=False, **extra_args['positive'])[0].float()
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
- return self._call_impl(*args, **kwargs)
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
- return forward_call(*args, **kwargs)
- File "/content/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 973, in forward
- hidden_states, encoder_hidden_states = self.gradient_checkpointing_method(
- File "/content/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 832, in gradient_checkpointing_method
- result = block(*args)
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
- return self._call_impl(*args, **kwargs)
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
- return forward_call(*args, **kwargs)
- File "/content/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 652, in forward
- attn_output, context_attn_output = self.attn(
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
- return self._call_impl(*args, **kwargs)
- File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
- return forward_call(*args, **kwargs)
- File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 605, in forward
- return self.processor(
- File "/content/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 172, in __call__
- hidden_states = attn_varlen_func(query, key, value, cu_seqlens_q, cu_seqlens_kv, max_seqlen_q, max_seqlen_kv)
- File "/content/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 111, in attn_varlen_func
- x = sageattn(q, k, v, tensor_layout='NHD')
- File "/usr/local/lib/python3.10/dist-packages/sageattention/core.py", line 110, in sageattn
- o = attn_false(q_int8, k_int8, v, q_scale, k_scale, tensor_layout=tensor_layout, output_dtype=dtype)
- File "/usr/local/lib/python3.10/dist-packages/sageattention/attn_qk_int8_per_block.py", line 113, in forward
- _attn_fwd[grid](
- File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 347, in <lambda>
- return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
- File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 569, in run
- kernel = self.compile(src, target=target, options=options.__dict__)
- File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 284, in compile
- next_module = compile_ir(module, metadata)
- File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/compiler.py", line 449, in <lambda>
- stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, capability)
- File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/compiler.py", line 312, in make_ttgir
- pm.run(mod)
- RuntimeError: PassManager::run failed
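
The root cause is in Triton's TritonGPUAccelerateMatmul pass: the pipeline string above targets cuda:75 (compute capability 7.5, e.g. a Tesla T4), and SageAttention's per-block INT8 kernel feeds i8 tensors into tt.dot, which that pass rejects on this target. A hedged guard one could run before enabling SageAttention; the (8, 0) cutoff is an assumption drawn from this failure, not an official support matrix:

import torch

major, minor = torch.cuda.get_device_capability(0)
# This kernel fails to compile on compute capability 7.5, so conservatively
# require Ampere (8.0) or newer before routing attention through sageattn.
use_sage_attn = (major, minor) >= (8, 0)
if not use_sage_attn:
    print(f'cc {major}.{minor}: skipping SageAttention, using PyTorch SDPA.')
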
- Unloaded DynamicSwap_LlamaModel as complete.
- Unloaded CLIPTextModel as complete.
- Unloaded SiglipVisionModel as complete.
- Unloaded AutoencoderKLHunyuanVideo as complete.
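
After the failure the app unloads every model and the job simply dies. A more forgiving call site would catch the compile error and fall back to PyTorch's built-in attention. A sketch under stated assumptions: NHD layout means (batch, seq_len, num_heads, head_dim), scaled_dot_product_attention expects (batch, heads, seq, head_dim), and the wrapper below mirrors but does not reproduce FramePack's attn_varlen_func:

import torch
import torch.nn.functional as F

def attn_with_fallback(q, k, v):
    # q, k, v in NHD layout: (batch, seq_len, num_heads, head_dim).
    try:
        from sageattention import sageattn
        return sageattn(q, k, v, tensor_layout='NHD')
    except Exception:
        # SageAttention missing or, as in the log above, its Triton kernel
        # failed to compile on this GPU: fall back to native SDPA.
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # NHD -> (B, H, S, D)
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2)  # back to NHD

In practice one would probe once at startup and cache the choice rather than pay a failed Triton compile attempt on every call.
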