A way to utilize tensor cores is needed. The design should draw from the family of VectorXXX intrinsics in .NET and/or the Vulkan Cooperative Matrix extension proposed by NVIDIA.
Related CUDA documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions
This is also mentioned in #923, but the latter is more about support for shorter floats in general.
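
For reference, here is roughly what the warp-level matrix operations look like from CUDA C++ through the `nvcuda::wmma` API in `mma.h`. This is only a minimal sketch to illustrate the kind of primitive being requested, not a proposal for the API shape here. The kernel name `wmma_tile`, the 16x16x16 tile size, and the all-ones inputs are chosen purely for illustration, and it assumes a GPU with compute capability 7.0 or newer (e.g. built with `nvcc -arch=sm_70`).

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Illustrative kernel: a single warp computes one 16x16 tile, C = A * B.
// A is row-major, B is column-major, both in half precision; C is float.
__global__ void wmma_tile(const half* a, const half* b, float* c) {
    // Fragments are opaque, warp-distributed register tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    // Start from a zero accumulator.
    wmma::fill_fragment(acc_frag, 0.0f);

    // Load both input tiles (leading dimension 16) cooperatively across the warp.
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);

    // acc = a * b + acc, executed on the tensor cores.
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    // Write the result tile back in row-major order.
    wmma::store_matrix_sync(c, acc_frag, 16, wmma::mem_row_major);
}

int main() {
    const int n = 16 * 16;
    half* a = nullptr;
    half* b = nullptr;
    float* c = nullptr;

    // Managed memory keeps the host-side setup short.
    cudaMallocManaged(&a, n * sizeof(half));
    cudaMallocManaged(&b, n * sizeof(half));
    cudaMallocManaged(&c, n * sizeof(float));

    for (int i = 0; i < n; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
    }

    // All WMMA calls are warp-wide, so launch exactly one full warp.
    wmma_tile<<<1, 32>>>(a, b, c);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print one entry of the result tile for inspection.
    printf("c[0] = %f\n", c[0]);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

The points that probably matter for the design are that fragments are opaque and owned by the whole warp rather than a single thread, and that load, multiply-accumulate, and store are all warp-synchronous operations.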