Skip to content

Conversation

inbelic
Copy link
Contributor

@inbelic inbelic commented Sep 17, 2025

Adds some basic testing of blocking functionality with the GroupMemoryBarrierWithGroupSync intrinsic.

It should be noted that the test cases are sized such that they are required to be evaluated over multiple waves. In this regard, each scenario fails using WARP without the barriers. Observing a failure is dependent on the scheduling of waves.

Resolves: #142.

Copy link
Collaborator

@spall spall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm but you should probably get a review from someone who knows more about this function.

Copy link
Collaborator

@bogner bogner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case looks reasonable, if a bit hard to follow. in terms of where the individual result values are coming from. A few comments on the pipeline definition.

- Name: ExpectedOut
Format: Int32
Stride: 16
Data: [ 9, 90, 900, 9000, 9, 90, 1800, 18000, 10, 20, 120, 1120, 1, 1000, 100, 10 ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's clearer if you split the lines so that the components of the vector each get their own line

Suggested change
Data: [ 9, 90, 900, 9000, 9, 90, 1800, 18000, 10, 20, 120, 1120, 1, 1000, 100, 10 ]
Data: [
9, 90, 900, 9000,
9, 90, 1800, 18000,
10, 20, 120, 1120,
1, 1000, 100, 10
]

DispatchSize: [1, 1, 1]
Buffers:
- Name: In
Format: Int32
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HLSL uses uint for each of the buffers, but here we say Int32 - we should either make the HLSL used signed ints or use UInt32 here.

Buffers:
- Name: In
Format: Int32
Stride: 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Stride" is in bytes, and since the elements here are uint4 vectors that means that this should be 16. In any case, Stride is really for structs, and you can just use "Channels: 4" here to say we have 4-element vectors.

Data: [ 1, 10, 100, 1000]
- Name: Out
Format: Int32
Stride: 16
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here (though this stride matches the data type), better to just say "Channels: 4" instead of specifying stride.

Copy link
Collaborator

@damyanp damyanp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been discussing offline, but I figured I'd get something into the record to see if @bogner or anyone else has opinions.

It looks like this test passes if you remove all of the barriers from the shader. This isn't surprising, because with only 4 threads per-thread group there's not really anything for the barrier to synchronize.

We should come up with something where it'll only pass if the barriers are actually in place.

Finn Plummer and others added 2 commits September 18, 2025 14:40
it was noted that if the group size is not larger than the wave size,
the barrier option is redundant

HLSL waves size can be at most 128. Hence, we initialize a group of 512
threads so that it is forced to evaluate over multiple waves.

We also remove the divergent test case as this is not applicable due to
it being undefined behaviour
Co-authored-by: Justin Bogner <[email protected]>
@inbelic
Copy link
Contributor Author

inbelic commented Sep 18, 2025

I have updated the test-cases such that if you remove the barriers it fails locally for me using WARP.

I had to increase the group size such that is larger than a wave size, and then it will depend on how the waves are scheduled to get the datarace. But it at least consistently fails for me.

Copy link
Collaborator

@damyanp damyanp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably doing more than we need to do to convince ourselves that the barrier works, but other than that lgtm.

Comment on lines 43 to 44
// Strided Read/Write:
Indices[ThreadID.x][ThreadID.y] = ThreadID.x + ThreadID.y * 128;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need any more tests than we have after Out[1]? Arguably Out[0] is enough too, but it does seem that the other one is a bit more interesting.

@inbelic inbelic added the test-all When applied to a PR this will opt-in to additional pre-merge test configurations.. label Sep 19, 2025
@inbelic inbelic removed the test-all When applied to a PR this will opt-in to additional pre-merge test configurations.. label Sep 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add test for GroupMemoryBarrierWithGroupSync
4 participants