Implementing GPU FastMath Injection in NCNN to Accelerate Computation #6278

futz12 · 2025-08-23T05:39:14Z

Introduction

Anyone with a background in competitive programming (like OI/ICPC) knows that adding a long list of optimization headers to your code can make it run significantly faster. One of the most effective of these is fastmath.

According to the GCC documentation on Floating-Point Math, fastmath essentially allows the compiler to disregard strict IEEE 754 semantics when it optimizes floating-point operations. This gives it the freedom to reorder, fuse, and approximate calculations in whatever way is fastest on the target CPU. This can affect precision to some degree, but code freed from the rigid IEEE rules usually runs faster.
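
To make the precision trade-off concrete, here is a minimal sketch (my own illustration, not taken from the GCC documentation) showing that floating-point addition is not associative. Reordering a sum like this is one of the transformations fastmath permits. The helper names sum_left and sum_right are made up for this example.

```cpp
#include <cstdio>

// Sum three doubles left to right: (a + b) + c.
static double sum_left(double a, double b, double c)
{
    volatile double t = a + b; // volatile pins the evaluation order
    return t + c;
}

// Sum the same values right to left: a + (b + c).
static double sum_right(double a, double b, double c)
{
    volatile double t = b + c; // 1.0 is lost here when b is huge
    return a + t;
}

// Print both orderings for values where the difference is visible.
static void show_reassociation()
{
    const double a = 1e20, b = -1e20, c = 1.0;
    printf("(a + b) + c = %g\n", sum_left(a, b, c));
    printf("a + (b + c) = %g\n", sum_right(a, b, c));
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first ordering gives 1 and the second gives 0, because 1.0 is smaller than the spacing between doubles near 1e20.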

If enabling fastmath can speed up CPUs, can it do the same for GPUs?

Nihui gave me the answer:
SPV_KHR_float_controls2

This extension allows us to enable fastmath for SPIR-V, unlocking faster performance.

According to the Khronos Group, to use this extension, the following OpExtension must be present in the SPIR-V module:

OpExtension "SPV_KHR_float_controls2"

So, our first step is to insert this extension.
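
As a rough sketch of what inserting the extension means at the binary level, the snippet below encodes an OpExtension instruction as SPIR-V words. The helper name encode_op_extension is hypothetical, and the memcpy assumes a little-endian host so that the string bytes land in the order SPIR-V expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Encode OpExtension "<name>" as a sequence of 32-bit SPIR-V words.
// encode_op_extension is a hypothetical helper for this sketch.
static std::vector<uint32_t> encode_op_extension(const char* name)
{
    const uint32_t OP_EXTENSION = 10;

    // Literal strings are nul-terminated and zero-padded to a 4-byte boundary.
    const size_t bytes = strlen(name) + 1;
    const size_t string_words = (bytes + 3) / 4;
    const uint32_t total_words = (uint32_t)(1 + string_words);

    std::vector<uint32_t> words(total_words, 0);
    // First word: word count in the high 16 bits, opcode in the low 16 bits.
    words[0] = (total_words << 16) | OP_EXTENSION;
    // Assumes a little-endian host.
    memcpy(&words[1], name, bytes);
    return words;
}
```

For "SPV_KHR_float_controls2" (23 characters plus the terminator) this yields 6 string words, so the whole instruction is 7 words long.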

Next, we look at FPFastMathDefault:

FPFastMathDefault
Set the default fast math flags for instructions not themselves decorated with FPFastMathMode. This only affects instructions operating on or resulting in a type that is Target Type or an OpTypeMatrix or OpTypeVector derived from it. Target Type must be a scalar, floating-point type. Fast-Math Mode must be the <id> of a constant instruction of 32-bit integer type containing a valid FP Fast Math Mode bitmask. Fast-Math Mode must not be a specialization-constant instruction. May be applied at most once per Target Type to any execution mode.

Then, we need to set the execution mode for a specific data type. The spec requires that Fast-Math Mode be the <id> of a constant instruction of 32-bit integer type containing a valid FP Fast Math Mode bitmask, so we also need to create a constant that holds the flags. It's also important to note that the FloatControls2 capability must be declared for this execution mode to be accepted.

This means we need to modify the SPIR-V binary directly to insert these features.

To summarize, we will need the following Opcodes:

Op
`OpCapability`
`OpExtension`
`OpExecutionMode`
`OpConstant`

By consulting the SPIR-V Unified Specification, I found the specific parameters and IDs for these Opcodes:

Op	Opcode	Word Count	Operands	Description
`OpCapability`	17	2	Capability	Declare a capability used by this module.
`OpExtension`	10	2+	Name	Declare the use of a SPIR-V extension.
`OpExecutionMode`	16	3+	1. Entry Point 2. Execution Mode 3. Mode operands (for FPFastMathDefault: Target Type, Fast-Math Mode)	Declare an execution mode for an entry point.
`OpConstant`	43	4+	1. Result Type 2. Result ID 3. Value	Declare a new integer or floating-point scalar constant.
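
Every SPIR-V instruction starts with one word that combines the word count and the opcode. The sketch below packs and unpacks that first word. The names InstructionHeader, pack_header and unpack_header are hypothetical, chosen only for this example.

```cpp
#include <cstdint>

// The two fields stored in the first word of every SPIR-V instruction.
struct InstructionHeader
{
    uint16_t wordcount; // total words, including this first word
    uint16_t opcode;    // e.g. 17 for OpCapability
};

// Build the first word: word count in the high 16 bits, opcode in the low 16 bits.
static uint32_t pack_header(uint16_t wordcount, uint16_t opcode)
{
    return ((uint32_t)wordcount << 16) | opcode;
}

// Split the first word back into its two fields.
static InstructionHeader unpack_header(uint32_t word)
{
    InstructionHeader h;
    h.wordcount = (uint16_t)(word >> 16);
    h.opcode = (uint16_t)(word & 0xffff);
    return h;
}
```

For example, OpCapability with its single operand packs to 0x00020011, and a 4-word OpConstant packs to 0x0004002B.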

With this, we can outline the pseudo-code for our injection:

OpCapability FloatControls2
OpExtension "SPV_KHR_float_controls2"
OpExecutionMode EntryPointID FPFastMathDefault FloatTypeID ConstantID

\ === Constant Section (before the first function) === \
OpConstant UINT_TypeID ConstantID(= old Bound) AllowContract|AllowReassoc|...|OtherFlags

First function

This means we need to obtain the following information from the SPIR-V binary:

The Entry Point ID
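The ID bound from the header (one greater than the largest ID in use), which becomes the ID of the new constant
The ID for the UINT type
The ID for the 32-bit float type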

Let's examine a SPIR-V binary to see what it looks like. To keep things simple, let's start with the absval operator, which I've disassembled into human-readable text using spirv-dis from SPIRV-Tools.

; SPIR-V
; Version: 1.3
; Generator: Khronos Glslang Reference Front End; 11
; Bound: 54
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %main "main" %gl_GlobalInvocationID
               OpExecutionMode %main LocalSize 32 1 1
               OpSource GLSL 450
               OpSourceExtension "GL_EXT_shader_8bit_storage"
               OpSourceExtension "GL_EXT_shader_explicit_arithmetic_types_int64"
               OpName %main "main"
               OpName %gi "gi"
               OpName %gl_GlobalInvocationID "gl_GlobalInvocationID"
               OpName %n "n"
               OpName %parameter "parameter"
               OpMemberName %parameter 0 "n"
               OpName %p "p"
               OpName %v "v"
               OpName %bottom_top_blob "bottom_top_blob"
               OpMemberName %bottom_top_blob 0 "bottom_top_blob_data"
               OpName %_ ""
               OpDecorate %gl_GlobalInvocationID BuiltIn GlobalInvocationId
               OpDecorate %n SpecId 0
               OpDecorate %parameter Block
               OpMemberDecorate %parameter 0 Offset 0
               OpDecorate %_runtimearr_v4float ArrayStride 16
               OpDecorate %bottom_top_blob Block
               OpMemberDecorate %bottom_top_blob 0 Offset 0
               OpDecorate %_ Binding 0
               OpDecorate %_ DescriptorSet 0
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
%_ptr_Function_uint = OpTypePointer Function %uint
     %v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
%gl_GlobalInvocationID = OpVariable %_ptr_Input_v3uint Input
     %uint_0 = OpConstant %uint 0
%_ptr_Input_uint = OpTypePointer Input %uint
          %n = OpSpecConstant %uint 0
       %bool = OpTypeBool
         %19 = OpSpecConstantOp %bool IEqual %n %uint_0
  %parameter = OpTypeStruct %uint
%_ptr_PushConstant_parameter = OpTypePointer PushConstant %parameter
          %p = OpVariable %_ptr_PushConstant_parameter PushConstant
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
%_ptr_PushConstant_uint = OpTypePointer PushConstant %uint
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
%_ptr_Function_v4float = OpTypePointer Function %v4float
%_runtimearr_v4float = OpTypeRuntimeArray %v4float
%bottom_top_blob = OpTypeStruct %_runtimearr_v4float
%_ptr_StorageBuffer_bottom_top_blob = OpTypePointer StorageBuffer %bottom_top_blob
          %_ = OpVariable %_ptr_StorageBuffer_bottom_top_blob StorageBuffer
%_ptr_StorageBuffer_v4float = OpTypePointer StorageBuffer %v4float
       %main = OpFunction %void None %3
          %5 = OpLabel
         %gi = OpVariable %_ptr_Function_uint Function
         %20 = OpVariable %_ptr_Function_uint Function
          %v = OpVariable %_ptr_Function_v4float Function
         %14 = OpAccessChain %_ptr_Input_uint %gl_GlobalInvocationID %uint_0
         %15 = OpLoad %uint %14
               OpStore %gi %15
         %16 = OpLoad %uint %gi
               OpSelectionMerge %22 None
               OpBranchConditional %19 %21 %31
         %21 = OpLabel
         %29 = OpAccessChain %_ptr_PushConstant_uint %p %int_0
         %30 = OpLoad %uint %29
               OpStore %20 %30
               OpBranch %22
         %31 = OpLabel
               OpStore %20 %n
               OpBranch %22
         %22 = OpLabel
         %32 = OpLoad %uint %20
         %33 = OpUGreaterThanEqual %bool %16 %32
               OpSelectionMerge %35 None
               OpBranchConditional %33 %34 %35
         %34 = OpLabel
               OpReturn
         %35 = OpLabel
         %45 = OpLoad %uint %gi
         %47 = OpAccessChain %_ptr_StorageBuffer_v4float %_ %int_0 %45
         %48 = OpLoad %v4float %47
               OpStore %v %48
         %49 = OpLoad %v4float %v
         %50 = OpExtInst %v4float %1 FAbs %49
               OpStore %v %50
         %51 = OpLoad %uint %gi
         %52 = OpLoad %v4float %v
         %53 = OpAccessChain %_ptr_StorageBuffer_v4float %_ %int_0 %51
               OpStore %53 %52
               OpReturn
               OpFunctionEnd

From observing the code, we can find:

The abs calculation is here:

         %50 = OpExtInst %v4float %1 FAbs %49

This is the program's entry point:

               OpEntryPoint GLCompute %main "main"

And the ID bound, which is one greater than the largest ID in use (53 here), is:

; Bound: 54
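
Because Bound is always one past the largest ID, it can be handed out directly as a fresh ID. Here is a minimal sketch of reading the 5-word header and allocating an ID that way. SpirvHeader, read_header and allocate_id are hypothetical names for this example, not part of ncnn.

```cpp
#include <cstddef>
#include <cstdint>

// The five words at the start of every SPIR-V module.
struct SpirvHeader
{
    uint32_t magic;
    uint32_t version;
    uint32_t generator;
    uint32_t bound; // one greater than the largest ID in the module
    uint32_t schema;
};

// Copy the header out of a word stream; returns false if it is too short
// or does not start with the SPIR-V magic number.
static bool read_header(const uint32_t* code, size_t word_count, SpirvHeader& h)
{
    if (word_count < 5 || code[0] != 0x07230203) return false;
    h.magic = code[0];
    h.version = code[1];
    h.generator = code[2];
    h.bound = code[3];
    h.schema = code[4];
    return true;
}

// Hand out the old bound as a new ID and bump the bound by one.
static uint32_t allocate_id(SpirvHeader& h)
{
    return h.bound++;
}
```

Applied to the module above, the first allocated ID is 54 and the bound becomes 55.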

Drawing inspiration from ncnn's inject_local_xyz, we can implement the injection logic:

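#include <stdint.h>
#include <string.h>
#include <vector>

// Copy the SPIR-V module in code (size is in bytes) into dstcode, injecting
// FPFastMathDefault for 32-bit floats. The input is copied unchanged if it
// cannot be patched.
static void inject_fast_math(const uint32_t* code, size_t size, std::vector<uint32_t>& dstcode, uint32_t fast_math_flag)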
{
    // check spv magic number
    if (size < 20 || code[0] != 0x07230203)
    {
        dstcode.assign(code, code + size / sizeof(uint32_t));
        return;
    }

    // analyze spv
    uint32_t bound = code[3];
    uint32_t entry_point_id = 0;
    uint32_t float32_type_id = 0;
    uint32_t uint32_type_id = 0;
    bool has_float_controls2_capability = false;
    bool has_float_controls2_extension = false;

    const uint32_t* memory_model_ptr = nullptr;
    const uint32_t* first_function_ptr = nullptr;

    const uint32_t* p = code + 5;
    const uint32_t* end = code + (size / sizeof(uint32_t));

    while (p < end)
    {
        uint16_t wordcount = p[0] >> 16;
        if (wordcount == 0 || p + wordcount > end) break; // for safety
        uint16_t op = p[0] & 0xffff;

        switch (op)
        {
        case 14: // OpMemoryModel
            if (!memory_model_ptr) memory_model_ptr = p;
            break;
        case 15: // OpEntryPoint
            if (p[1] == 5 /* GLCompute */) entry_point_id = p[2];
            break;
        case 21: // OpTypeInt
            if (wordcount == 4 && p[2] == 32 && p[3] == 0) uint32_type_id = p[1];
            break;
        case 22: // OpTypeFloat
            if (wordcount == 3 && p[2] == 32) float32_type_id = p[1];
            break;
        case 54: // OpFunction
            if (!first_function_ptr) first_function_ptr = p;
            break;
        case 17: // OpCapability
            if (p[1] == 6029 /* FloatControls2 */) has_float_controls2_capability = true;
            break;
        case 10: // OpExtension
            if (strcmp((const char*)&p[1], "SPV_KHR_float_controls2") == 0) has_float_controls2_extension = true;
            break;
        }

        // stop at the first function; everything we need is declared before it
        if (first_function_ptr) break;

        p += wordcount;
    }

    // cannot find key elements
    if (entry_point_id == 0 || float32_type_id == 0 || uint32_type_id == 0 || !memory_model_ptr || !first_function_ptr)
    {
        dstcode.assign(code, code + size / sizeof(uint32_t));
        return;
    }

    // build spirv
    dstcode.clear();
    dstcode.reserve(size / sizeof(uint32_t) + 20);

    // prepare
    uint32_t fast_math_constant_id = bound;
    uint32_t new_bound = bound + 1; // for new OpConstant

    // header
    dstcode.insert(dstcode.end(), code, code + 5);
    dstcode[3] = new_bound;

    p = code + 5;
    while (p < end)
    {
        uint16_t wordcount = p[0] >> 16;
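        if (wordcount == 0 || p + wordcount > end) break; // for safety; the analysis pass never checked the instructions after the first function

        // the constant must be declared before the first function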
        if (p == first_function_ptr)
        {
            dstcode.push_back((4u << 16) | 43 /* OpConstant */);
            dstcode.push_back(uint32_type_id);
            dstcode.push_back(fast_math_constant_id);
            dstcode.push_back(fast_math_flag);
        }

        // copy the original instruction through unchanged
        dstcode.insert(dstcode.end(), p, p + wordcount);

        // inject new instructions
        if (p == memory_model_ptr)
        {
            if (!has_float_controls2_capability)
            {
                dstcode.push_back((2u << 16) | 17 /* OpCapability */);
                dstcode.push_back(6029 /* FloatControls2 */);
            }
            if (!has_float_controls2_extension)
            {
                const char ext_name[] = "SPV_KHR_float_controls2";
                size_t ext_word_count = (sizeof(ext_name) + 3) / 4;
                dstcode.push_back(((ext_word_count + 1) << 16) | 10 /* OpExtension */);
                std::vector<uint32_t> ext_words(ext_word_count, 0);
                memcpy(ext_words.data(), ext_name, sizeof(ext_name));
                dstcode.insert(dstcode.end(), ext_words.begin(), ext_words.end());
            }
        }
        else if ((p[0] & 0xffff) == 15 /* OpEntryPoint */ && p[2] == entry_point_id)
        {
            dstcode.push_back((5u << 16) | 16 /* OpExecutionMode */);
            dstcode.push_back(entry_point_id);
            dstcode.push_back(6028 /* FPFastMathDefault */);
            dstcode.push_back(float32_type_id);
            dstcode.push_back(fast_math_constant_id);
        }

        p += wordcount;
    }
}

Let's look at the result after injection:

; SPIR-V
; Version: 1.3
; Generator: Khronos Glslang Reference Front End; 11
; Bound: 55
; Schema: 0
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpCapability FloatControls2
               OpExtension "SPV_KHR_float_controls2"
               OpEntryPoint GLCompute %main "main" %gl_GlobalInvocationID
               OpExecutionMode %main FPFastMathDefault %float %uint_458752
               OpExecutionMode %main LocalSize 32 1 1
               OpSource GLSL 450
               OpSourceExtension "GL_EXT_shader_8bit_storage"
               OpSourceExtension "GL_EXT_shader_explicit_arithmetic_types_int64"
               OpName %main "main"
               OpName %gi "gi"
               OpName %gl_GlobalInvocationID "gl_GlobalInvocationID"
               OpName %n "n"
               OpName %parameter "parameter"
               OpMemberName %parameter 0 "n"
               OpName %p "p"
               OpName %v "v"
               OpName %bottom_top_blob "bottom_top_blob"
               OpMemberName %bottom_top_blob 0 "bottom_top_blob_data"
               OpName %_ ""
               OpDecorate %gl_GlobalInvocationID BuiltIn GlobalInvocationId
               OpDecorate %n SpecId 0
               OpDecorate %parameter Block
               OpMemberDecorate %parameter 0 Offset 0
               OpDecorate %_runtimearr_v4float ArrayStride 16
               OpDecorate %bottom_top_blob Block
               OpMemberDecorate %bottom_top_blob 0 Offset 0
               OpDecorate %_ Binding 0
               OpDecorate %_ DescriptorSet 0
       %void = OpTypeVoid
          %3 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
%_ptr_Function_uint = OpTypePointer Function %uint
     %v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
%gl_GlobalInvocationID = OpVariable %_ptr_Input_v3uint Input
     %uint_0 = OpConstant %uint 0
%_ptr_Input_uint = OpTypePointer Input %uint
          %n = OpSpecConstant %uint 0
       %bool = OpTypeBool
         %19 = OpSpecConstantOp %bool IEqual %n %uint_0
  %parameter = OpTypeStruct %uint
%_ptr_PushConstant_parameter = OpTypePointer PushConstant %parameter
          %p = OpVariable %_ptr_PushConstant_parameter PushConstant
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
%_ptr_PushConstant_uint = OpTypePointer PushConstant %uint
      %float = OpTypeFloat 32
    %v4float = OpTypeVector %float 4
%_ptr_Function_v4float = OpTypePointer Function %v4float
%_runtimearr_v4float = OpTypeRuntimeArray %v4float
%bottom_top_blob = OpTypeStruct %_runtimearr_v4float
%_ptr_StorageBuffer_bottom_top_blob = OpTypePointer StorageBuffer %bottom_top_blob
          %_ = OpVariable %_ptr_StorageBuffer_bottom_top_blob StorageBuffer
%_ptr_StorageBuffer_v4float = OpTypePointer StorageBuffer %v4float
%uint_458752 = OpConstant %uint 458752
       %main = OpFunction %void None %3
          %5 = OpLabel
         %gi = OpVariable %_ptr_Function_uint Function
         %20 = OpVariable %_ptr_Function_uint Function
          %v = OpVariable %_ptr_Function_v4float Function
         %14 = OpAccessChain %_ptr_Input_uint %gl_GlobalInvocationID %uint_0
         %15 = OpLoad %uint %14
               OpStore %gi %15
         %16 = OpLoad %uint %gi
               OpSelectionMerge %22 None
               OpBranchConditional %19 %21 %31
         %21 = OpLabel
         %29 = OpAccessChain %_ptr_PushConstant_uint %p %int_0
         %30 = OpLoad %uint %29
               OpStore %20 %30
               OpBranch %22
         %31 = OpLabel
               OpStore %20 %n
               OpBranch %22
         %22 = OpLabel
         %32 = OpLoad %uint %20
         %33 = OpUGreaterThanEqual %bool %16 %32
               OpSelectionMerge %35 None
               OpBranchConditional %33 %34 %35
         %34 = OpLabel
               OpReturn
         %35 = OpLabel
         %45 = OpLoad %uint %gi
         %47 = OpAccessChain %_ptr_StorageBuffer_v4float %_ %int_0 %45
         %48 = OpLoad %v4float %47
               OpStore %v %48
         %49 = OpLoad %v4float %v
         %50 = OpExtInst %v4float %1 FAbs %49
               OpStore %v %50
         %51 = OpLoad %uint %gi
         %52 = OpLoad %v4float %v
         %53 = OpAccessChain %_ptr_StorageBuffer_v4float %_ %int_0 %51
               OpStore %53 %52
               OpReturn
               OpFunctionEnd

And with that, we have successfully injected fast_math into the SPIR-V module.
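
For reference, the value 458752 in %uint_458752 is just a bitmask. The sketch below decodes it, assuming the bit values listed in the FP Fast Math Mode table of the SPIR-V specification. The names kDumpFlags and has_flag are made up for this example.

```cpp
#include <cstdint>

// FP Fast Math Mode bits, as listed in the SPIR-V specification.
enum FPFastMathMode : uint32_t
{
    NotNaN = 0x1,
    NotInf = 0x2,
    NSZ = 0x4,
    AllowRecip = 0x8,
    Fast = 0x10,
    AllowContract = 0x10000,
    AllowReassoc = 0x20000,
    AllowTransform = 0x40000,
};

// The mask seen in the injected module: 0x70000 == 458752.
static const uint32_t kDumpFlags = AllowContract | AllowReassoc | AllowTransform;

// Check whether a single flag is set in a mask.
static bool has_flag(uint32_t mask, FPFastMathMode flag)
{
    return (mask & flag) != 0;
}
```

So the module shown above allows contraction, reassociation and transformation, while leaving the NaN, infinity and signed-zero assumptions off.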

Feel free to check out my PR: #6223
