Hi,
I tried to port this to llama.cpp on a DGX Spark (GB10).
I seem to be hitting a fundamental shared-memory limit on this hardware.
I get an error in the prepare_h kernel:
tvm.error.InternalError: Failed to set the allowed dynamic shared
memory size to 196608 [192 KB]
GB10 specs (sm_121a): maxSharedMemoryPerBlockOptin = 99 KB (101376 bytes)
The kernel seems to be sized for sm_90 (Hopper?), which has 228 KB of shared memory per SM (up to 227 KB per block), so its 192 KB request cannot work on sm_121a at 99 KB/block.
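To make the gap concrete, here is a small back-of-the-envelope check in Python. It only uses the two numbers quoted above (the size from the error message and the GB10 opt-in limit); the variable names are mine and nothing here queries the device.

```python
# Numbers copied from the error message and the GB10 specs quoted above.
KIB = 1024
requested_bytes = 196608          # dynamic shared memory requested by prepare_h
gb10_optin_limit_bytes = 101376   # maxSharedMemoryPerBlockOptin on sm_121a

# Does the request fit under the per-block opt-in limit?
fits = requested_bytes <= gb10_optin_limit_bytes

# How far over the limit is it, in bytes and as a ratio?
overshoot_bytes = requested_bytes - gb10_optin_limit_bytes
ratio = requested_bytes / gb10_optin_limit_bytes

print(f"requested: {requested_bytes // KIB} KB")
print(f"limit:     {gb10_optin_limit_bytes // KIB} KB")
print(f"fits:      {fits}")
print(f"overshoot: {overshoot_bytes // KIB} KB ({ratio:.2f}x the limit)")
```

So the request is roughly twice what the device allows per block, which is why the launch configuration is rejected before the kernel ever runs.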
Is this a confirmed issue? Are you considering adding support?
I'm wondering whether aggressive re-tiling would still bring benefits on these devices.
Something like:
- num_stages = 2 → 1
- block_DV = 128 → 64 (-50% V tile)
- chunk_size = 64 → 32
This could eventually fit in 80-85 KB/block, but it would require some concrete CUDA engineering, and the biggest question is whether the expected outcome would be worth the effort.
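As a rough sanity check on that target, the sketch below works out how much the 192 KB footprint has to shrink. It is pure arithmetic on the numbers in this post; I don't know how the kernel's buffers actually scale with num_stages, block_DV, or chunk_size, so it says nothing about which of the three changes contributes what.

```python
KIB = 1024
baseline_bytes = 196608      # current request, from the error message
limit_bytes = 101376         # GB10 per-block opt-in limit
target_low_bytes = 80 * KIB  # lower end of the 80-85 KB estimate
target_high_bytes = 85 * KIB # upper end of the 80-85 KB estimate

def reduction_factor(baseline: int, target: int) -> float:
    """How many times smaller the footprint must get to reach `target`."""
    return baseline / target

min_factor = reduction_factor(baseline_bytes, limit_bytes)
factor_for_85 = reduction_factor(baseline_bytes, target_high_bytes)
factor_for_80 = reduction_factor(baseline_bytes, target_low_bytes)

# Halving the total footprint once would land just under the limit.
halved_bytes = baseline_bytes // 2
halved_fits = halved_bytes <= limit_bytes

print(f"minimum reduction to launch: {min_factor:.2f}x")
print(f"reduction to reach 85 KB:    {factor_for_85:.2f}x")
print(f"reduction to reach 80 KB:    {factor_for_80:.2f}x")
print(f"halved footprint: {halved_bytes // KIB} KB, fits: {halved_fits}")
```

In other words, anything that halves the total footprint would just barely launch (96 KB against a 99 KB limit), and the 80-85 KB estimate corresponds to a reduction of roughly 2.3x to 2.4x, which leaves some headroom.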
Is anyone looking into this?