Commit 7325cb1

Gordon authored and committed
Update exercises.
1 parent 600a81a

File tree: 5 files changed (+91 additions, -70 deletions)

docs/sycl_02_hello_world.md

Lines changed: 10 additions & 25 deletions
@@ -10,49 +10,34 @@ In this first exercise you will learn:
 ---
 
-Once you have a queue you can now submit work for the device to execute, and this is done via command groups, which are made up of commands and data dependencies.
+Once you have a queue you can submit work to be executed on the device; this is done via command groups, which are made up of commands and data dependencies.
 
 1.) Define a command group
 
-Define a lambda to represent your command group and pass it to the submit member function of the queue as follows:
-
-```
-myQueue.submit([&](cl::sycl::handler &cgh) {
-
-});
-```
+Define a lambda to represent your command group and pass it to the submit member function of the queue.
 
 Note that submitting a command group without any commands will result in an error.
 
 2.) Define a SYCL kernel function
 
-Define a SYCL kernel function via the single_task command within the command group as follows:
-
-```
-cgh.single_task<hello_world>([=](){
-
-});
-```
+Define a SYCL kernel function via the `single_task` command within the command group, which takes only a function object, which itself doesn't take any parameters.
 
 Remember to declare a class for your kernel name in the global namespace.
 
 3.) Stream "Hello World!" to stdout from the SYCL kernel function
 
-Construct a stream within the scope of the command group as follows:
+Create a `stream` object within the command group scope. The two parameters to the constructor of the `stream` class are the total buffer size and the statement size respectively.
 
-```
-auto os = cl::sycl::stream{128, 128};
-```
+Then use the stream you constructed within the SYCL kernel function to print "Hello world!" using the `<<` operator.
 
-Then use the stream you constructed within the SYCL kernel function to print "Hello world!" as follows:
+4.) Try another command
 
-```
-os << "Hello world!" << cl::sycl::endl;
-```
+Instead of `single_task` try another command for defining a SYCL kernel function (see [SYCL 1.2.1 specification][sycl-specification], sec 4.8.5).
 
-4.) Try another command
+Remember the function object for the `parallel_for` which takes a `range` can take an `id` or an `item`, and the function object for the `parallel_for` which takes an `nd_range` must take an `nd_item`.
 
-Instead of single_task try another command for defining a SYCL kernel function (see [SYCL 1.2.1 specification][sycl-specification], sec 4.8.5).
+5.) Try a different dimensionality
 
+Instead of a 1-dimensional range for your SYCL kernel function, try a 2 or 3-dimensional range.
 
 [sycl-specification]: https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf
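Putting the steps above together, a completed exercise might look roughly like the sketch below. This is an illustrative reconstruction, not the repository's solution, and it is not compiled here: it assumes a SYCL 1.2.1 implementation providing `CL/sycl.hpp`, and the kernel name `hello_world` is a placeholder. Note that in SYCL 1.2.1 the `stream` constructor also takes the `handler` in addition to the two sizes described above.

```cpp
#include <CL/sycl.hpp>

// Kernel name declared in the global namespace, as step 2 requires.
class hello_world;

int main() {
  // A queue targeting the default device.
  auto myQueue = cl::sycl::queue{};

  // Step 1: submit a command group.
  myQueue.submit([&](cl::sycl::handler &cgh) {
    // Step 3: a stream with a 128-byte buffer and 128-byte statement size.
    auto os = cl::sycl::stream{128, 128, cgh};

    // Step 2: a single_task kernel; its function object takes no parameters.
    cgh.single_task<hello_world>([=]() {
      os << "Hello world!" << cl::sycl::endl;
    });
  });

  myQueue.wait();
}
```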

docs/sycl_03_vector_add.md

Lines changed: 11 additions & 31 deletions
@@ -14,46 +14,26 @@ In SYCL buffers are used to manage data across the host and device(s), and acces
 1.) Allocate your input and output vectors
 
-Allocate memory on the host for your input and output data using std::vectors and initialise the input with values.
-
-```
-auto input = std::vector<float>{};
-auto output = std::vector<float>{};
-
-input.reserve(size);
-output.reserve(size);
-
-std::iota(begin(input), end(output), 0.0f);
-std::fill(begin(input), end(output), 0.0f);
-```
+Allocate memory on the host for your input and output data using `std::vector`s and initialize the input with values.
 
 2.) Construct buffers
 
-Construct a buffer to manage your input and output data.
+Construct a buffer to manage your input and output data. The template parameters for the `buffer` class are the element type and then the dimensionality. The parameters to construct a buffer are a pointer to the host data and a `range`.
 
-```
-auto inputBuf = cl::sycl::buffer<float, 1>(input.data(),
-                                           cl::sycl::range<1>(intput.size());
-auto outputBuf = cl::sycl::buffer<float, 1>(input.data(),
-                                            cl::sycl::range<1>(intput.size());
-```
+Remember the dimensionality of the `range` must match the dimensionality of the `buffer`.
 
 3.) Construct accessors
 
-Construct an accessor for your input and output buffers.
+Construct an accessor for your input and output buffers. The template parameter to `get_access` is the access mode that specifies how you wish to use the data managed by the buffer.
 
-```
-auto inputAcc = inputBuf.get_access<cl:sycl::access::mode::read>(cgh);
-auto outputAcc = outputBuf.get_access<cl:sycl::access::mode::write>(cgh);
-```
+Remember to pass the `handler` to `get_access`; if you don't, this will construct a host accessor, which behaves differently from a regular accessor.
 
 4.) Declare your kernel
 
-Declare a SYCL kernel function using the parallel_for command that takes ...
+Declare a SYCL kernel function using the `parallel_for` command with a range matching the size of the `std::vector`s. The kernel function should use the `operator[]` of the `accessor` objects to read from the inputs and write the sum to the output.
+
+Remember the `accessor`'s `operator[]` can take either a `size_t` (when the dimensionality is 1) or an `id`.
+
+5.) Try a temporary buffer
 
-```
-cgh.parallel_for<vector_add>(range<1>(input.size()),
-                             [=](cl::sycl::id<1> id) {
-  outputAcc[id] = inputAAcc[id] + inputBAcc[id];
-});
-```
+You can construct a temporary `buffer` that doesn't copy back on destruction by initializing it with just a `range` and no host pointer.
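Combining steps 1 through 4 above, a hedged sketch of a vector add might look like the following. This is an illustration, not the repository's solution, and is not compiled here: it assumes a SYCL 1.2.1 implementation providing `CL/sycl.hpp`, and the names `vector_add`, `inputABuf`, and so on are placeholders.

```cpp
#include <CL/sycl.hpp>
#include <numeric>
#include <vector>

class vector_add;

int main() {
  const size_t size = 1024;

  // Step 1: host allocations; inputs initialized with values.
  std::vector<float> inputA(size), inputB(size), output(size);
  std::iota(begin(inputA), end(inputA), 0.0f);
  std::iota(begin(inputB), end(inputB), 0.0f);

  auto myQueue = cl::sycl::queue{};
  {
    // Step 2: buffers; the range dimensionality matches the buffer's.
    auto inputABuf = cl::sycl::buffer<float, 1>(inputA.data(), cl::sycl::range<1>(size));
    auto inputBBuf = cl::sycl::buffer<float, 1>(inputB.data(), cl::sycl::range<1>(size));
    auto outputBuf = cl::sycl::buffer<float, 1>(output.data(), cl::sycl::range<1>(size));

    myQueue.submit([&](cl::sycl::handler &cgh) {
      // Step 3: accessors; passing the handler makes them device accessors.
      auto inputAAcc = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
      auto inputBAcc = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
      auto outputAcc = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

      // Step 4: parallel_for over the full range; operator[] takes the id.
      cgh.parallel_for<vector_add>(cl::sycl::range<1>(size),
                                   [=](cl::sycl::id<1> id) {
        outputAcc[id] = inputAAcc[id] + inputBAcc[id];
      });
    });
  }  // Buffers copy back to the host vectors on destruction here.
}
```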

docs/sycl_04_image_grayscale.md

Lines changed: 11 additions & 9 deletions
@@ -16,6 +16,8 @@ An image can be grayscaled using the following algorithm:
 Y = (R * 0.229) + (G * 0.587) + (B * 0.114)
 ```
 
+Where R, G and B are the red, green and blue channels of an RGBA four-channel image format.
+
 For the purposes of this exercise the STB image loading and writing library has been made available and the source for this exercise already contains the appropriate API calls to get you started.
 
 1.) Write a SYCL kernel function for performing a grayscaling
@@ -24,27 +26,27 @@ The source for this example provides a stub which loads and write an image using
 
 The source also contains a call to a benchmarking utility that will print the time taken to execute the SYCL code, the SYCL code should go inside the lambda that is passed to the `benchmark` function.
 
-+ Change the path to the image, feel free to use your own image, but be wary of the size.
-+ Image loaded has four channels (RGBA)
-+ It's recommended that you use a 2 dimensional range for your kernel, but a 1 dimensional range for you buffer.
+It's recommended that you use a 2-dimensional `range` for `parallel_for` when working with images.
+
+Note you will have to update the path to an image. There is an image in the repository but feel free to use any image you choose. Though it's recommended that you use a png image whose dimensions are multiples of 2 (for example 512x512) and has four channels (RGBA).
 
 2.) Evaluate global memory access
 
 Now that you have a working grayscaling kernel you should evaluate whether the global memory access patterns in your kernel are coalesced.
 
-Consider two alternative ways to linearise the global id:
+Consider two alternative ways to linearize the global id:
 
 ```
 auto rowMajorLinearId = (idx[1] * width) + idx[0]; // row-major
 auto columnMajorLinearId = (idx[0] * height) + idx[1]; // column-major
 ```
 
-Try using both of these and compare the execution time of each.
+Try using both of these and compare the execution time of each. Though note that the benchmark facility provided measures whole application time, which is less accurate than measuring the actual kernel times.
 
-3.) Use vectorisation
+3.) Use vectorization
 
-Now that global memory access is coalesced another optimization you could do here would be to use SYCL vectors to present the pixels in the image.
+Now that global memory access is coalesced another optimization you could do here would be to use the SYCL `vec` class to represent the pixels in the image.
 
-You can reinterpret a buffer to be represented as a different type using the `buffer` class' `reinterpret` member function template. When calling this function you must specify the new type as a template parameter and a new `range` that will represent elements of the new type within the same space in memory as a function parameter.
+You can reinterpret a `buffer` to be represented as a different type using the `reinterpret` member function template of the `buffer` class. When calling this function you must specify the new type as a template parameter and a new `range` that will represent elements of the new type within the same space in memory as a function parameter.
 
 Try reinterpreting your buffer to use `cl::sycl::float4` instead of `float`.

docs/sycl_05_transpose.md

Lines changed: 34 additions & 4 deletions
@@ -11,10 +11,40 @@ In this first exercise you will learn:
 ---
 
-TODO
+Matrix transpose is a very useful operation when working in linear algebra applications and can be performed efficiently on a GPU.
 
-1.) Write a SYCL kernel for transposing matrices.
+A matrix transposition switches the dimensions of a matrix, for example:
 
-2.) Use local memory to improve global memory coalescing.
+```
+A = [1, 2, 3] => A' = [1, 4, 7]
+    [4, 5, 6]         [2, 5, 8]
+    [7, 8, 9]         [3, 6, 9]
+```
 
-3.) Try different work-group sizes.
+1.) Write a SYCL kernel for transposing matrices
+
+For the purposes of this exercise the source file provides a stub that defines a simple matrix class, whose data can be retrieved using the `data` member function and printed for evaluating the results using the `print` member function. Note for representation purposes `print` will display in row-major linearization.
+
+Define a SYCL kernel function that takes an input matrix and an output matrix, and assigns the elements of the input to the transposed position in the output. As a hint, try calculating the row-major and column-major linearizations of the `id`.
+
+It's recommended that you use a 2-dimensional `range` for the `parallel_for`.
+
+Observe that no matter how you change the linearization the performance will be largely unaffected.
+
+2.) Use local memory to improve global memory coalescing
+
+Create a local `accessor` (an `accessor` with the `access::target::local` access target); remember a local `accessor` must have the `access::mode::read_write` access mode. The constructor of the local `accessor` just takes a `range` specifying the number of elements to allocate per work-group and the `handler`.
+
+Once you've created the accessor, pass it to the SYCL kernel function as you did the buffer `accessor`s. You can then copy the elements of global memory from the buffer `accessor` to local memory in the local `accessor`.
+
+Make sure to coalesce the reads from global memory and then assign into local memory already transposed, so that the writes to global memory can also be coalesced.
+
+You should be able to observe a performance gain from doing this.
+
+3.) Try different work-group sizes
+
+Try using different work-group sizes for your SYCL kernel function. Remember you will have to specify an `nd_range` in order to specify the local range.
+
+Work-group sizes you could try are 8x8, 16x16, 16x32. Note that some of these may not work if your GPU does not support work-groups that large.
+
+Remember you can query the maximum work-group size using the `device` class' `get_info` member function.

solutions/sycl_03_vector_add.cpp

Lines changed: 25 additions & 1 deletion
@@ -50,7 +50,7 @@ void parallel_add(std::vector<T> &inputA, std::vector<T> &inputB,
   });
 }
 
-TEST_CASE("sycl_03_vector_add", "add_floats") {
+TEST_CASE("add_floats", "sycl_03_vector_add") {
   const int size = 1024;
 
   std::vector<float> inputA(size);
@@ -67,3 +67,27 @@ TEST_CASE("sycl_03_vector_add", "add_floats") {
     REQUIRE(output[i] == static_cast<float>(i * 2.0f));
   }
 }
+
+TEST_CASE("intermediate_buffer", "sycl_03_vector_add") {
+  const int size = 1024;
+
+  std::vector<float> inputA(size);
+  std::vector<float> inputB(size);
+  std::vector<float> inputC(size);
+  std::vector<float> temp(size);
+  std::vector<float> output(size);
+
+  std::iota(begin(inputA), end(inputA), 0.0f);
+  std::iota(begin(inputB), end(inputB), 0.0f);
+  std::iota(begin(inputC), end(inputC), 0.0f);
+  std::fill(begin(temp), end(temp), 0.0f);
+  std::fill(begin(output), end(output), 0.0f);
+
+  parallel_add(inputA, inputB, temp);
+
+  parallel_add(temp, inputC, output);
+
+  for (int i = 0; i < size; i++) {
+    REQUIRE(output[i] == static_cast<float>(i * 3.0f));
+  }
+}
