Commit 7325cb1

Gordon authored and committed
Update exercises.
1 parent 600a81a

File tree: 5 files changed (+91 additions, -70 deletions)

docs/sycl_02_hello_world.md

Lines changed: 10 additions & 25 deletions
@@ -10,49 +10,34 @@ In this first exercise you will learn:
 ---
 
-Once you have a queue you can now submit work for the device to execute, and this is done via command groups, which are made up of commands and data dependencies.
+Once you have a queue you can submit work to be executed on the device; this is done via command groups, which are made up of commands and data dependencies.
 
 1.) Define a command group
 
-Define a lambda to represent your command group and pass it to the submit member function of the queue as follows:
-
-```
-myQueue.submit([&](cl::sycl::handler &cgh) {
-
-});
-```
+Define a lambda to represent your command group and pass it to the submit member function of the queue.
 
 Note that submitting a command group without any commands will result in an error.
 
 2.) Define a SYCL kernel function
 
-Define a SYCL kernel function via the single_task command within the command group as follows:
-
-```
-cgh.single_task<hello_world>([=](){
-
-});
-```
+Define a SYCL kernel function via the `single_task` command within the command group, which takes only a function object, which itself doesn't take any parameters.
 
 Remember to declare a class for your kernel name in the global namespace.
 
 3.) Stream "Hello World!" to stdout from the SYCL kernel function
 
-Construct a stream within the scope of the command group as follows:
+Create a `stream` object within the command group scope. The two parameters to the constructor of the `stream` class are the total buffer size and the statement size respectively.
 
-```
-auto os = cl::sycl::stream{128, 128};
-```
+Then use the stream you constructed within the SYCL kernel function to print "Hello world!" using the `<<` operator.
 
-Then use the stream you constructed within the SYCL kernel function to print "Hello world!" as follows:
+4.) Try another command
 
-```
-os << "Hello world!" << cl::sycl::endl;
-```
+Instead of `single_task` try another command for defining a SYCL kernel function (see [SYCL 1.2.1 specification][sycl-specification], sec 4.8.5).
 
-4.) Try another command
+Remember the function object for the `parallel_for` which takes a `range` can take an `id` or an `item`, and the function object for the `parallel_for` which takes an `nd_range` must take an `nd_item`.
 
-Instead of single_task try another command for defining a SYCL kernel function (see [SYCL 1.2.1 specification][sycl-specification], sec 4.8.5).
+5.) Try a different dimensionality
 
+Instead of a 1-dimensional range for your SYCL kernel function, try a 2 or 3-dimensional range.
 
 [sycl-specification]: https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf
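Putting the steps above together, a completed exercise might look roughly like the sketch below. This is an illustrative reconstruction, not the repository's solution, and it is not compiled here: it assumes a SYCL 1.2.1 implementation providing `CL/sycl.hpp`, and the kernel name `hello_world` is a placeholder. Note that in SYCL 1.2.1 the `stream` constructor also takes the `handler` in addition to the two sizes described above.

```cpp
#include <CL/sycl.hpp>

// Kernel name declared in the global namespace, as step 2 requires.
class hello_world;

int main() {
  // A queue targeting the default device.
  auto myQueue = cl::sycl::queue{};

  // Step 1: submit a command group.
  myQueue.submit([&](cl::sycl::handler &cgh) {
    // Step 3: a stream with a 128-byte buffer and 128-byte statement size.
    auto os = cl::sycl::stream{128, 128, cgh};

    // Step 2: a single_task kernel; its function object takes no parameters.
    cgh.single_task<hello_world>([=]() {
      os << "Hello world!" << cl::sycl::endl;
    });
  });

  myQueue.wait();
}
```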

docs/sycl_03_vector_add.md

Lines changed: 11 additions & 31 deletions
@@ -14,46 +14,26 @@ In SYCL buffers are used to manage data across the host and device(s), and acces
 1.) Allocate your input and output vectors
 
-Allocate memory on the host for your input and output data using std::vectors and initialise the input with values.
-
-```
-auto input = std::vector<float>{};
-auto output = std::vector<float>{};
-
-input.reserve(size);
-output.reserve(size);
-
-std::iota(begin(input), end(output), 0.0f);
-std::fill(begin(input), end(output), 0.0f);
-```
+Allocate memory on the host for your input and output data using `std::vector`s and initialize the input with values.
 
 2.) Construct buffers
 
-Construct a buffer to manage your input and output data.
+Construct a buffer to manage your input and output data. The template parameters for the `buffer` class are the element type and then the dimensionality. The parameters to construct a buffer are a pointer to the host data and a `range`.
 
-```
-auto inputBuf = cl::sycl::buffer<float, 1>(input.data(),
-                                           cl::sycl::range<1>(intput.size());
-auto outputBuf = cl::sycl::buffer<float, 1>(input.data(),
-                                            cl::sycl::range<1>(intput.size());
-```
+Remember the dimensionality of the `range` must match the dimensionality of the `buffer`.
 
 3.) Construct accessors
 
-Construct an accessor for your input and output buffers.
+Construct an accessor for your input and output buffers. The template parameter to `get_access` is the access mode that specifies how you wish to use the data managed by the buffer.
 
-```
-auto inputAcc = inputBuf.get_access<cl:sycl::access::mode::read>(cgh);
-auto outputAcc = outputBuf.get_access<cl:sycl::access::mode::write>(cgh);
-```
+Remember to pass the `handler` to `get_access`; if you don't, this will construct a host accessor, which behaves differently from a regular accessor.
 
 4.) Declare your kernel
 
-Declare a SYCL kernel function using the parallel_for command that takes ...
+Declare a SYCL kernel function using the `parallel_for` command with a range matching the size of the `std::vector`s. The kernel function should use the `operator[]` of the `accessor` objects to read from the inputs and write the sum to the output.
+
+Remember the `accessor`'s `operator[]` can take either a `size_t` (when the dimensionality is 1) or an `id`.
+
+5.) Try a temporary buffer
 
-```
-cgh.parallel_for<vector_add>(range<1>(input.size()),
-                             [=](cl::sycl::id<1> id) {
-  outputAcc[id] = inputAAcc[id] + inputBAcc[id];
-});
-```
+You can construct a temporary `buffer` that doesn't copy back on destruction by initializing it with just a `range` and no host pointer.
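Combining steps 1 through 4 above, a hedged sketch of a vector add might look like the following. This is an illustration, not the repository's solution, and is not compiled here: it assumes a SYCL 1.2.1 implementation providing `CL/sycl.hpp`, and the names `vector_add`, `inputABuf`, and so on are placeholders.

```cpp
#include <CL/sycl.hpp>
#include <numeric>
#include <vector>

class vector_add;

int main() {
  const size_t size = 1024;

  // Step 1: host allocations; inputs initialized with values.
  std::vector<float> inputA(size), inputB(size), output(size);
  std::iota(begin(inputA), end(inputA), 0.0f);
  std::iota(begin(inputB), end(inputB), 0.0f);

  auto myQueue = cl::sycl::queue{};
  {
    // Step 2: buffers; the range dimensionality matches the buffer's.
    auto inputABuf = cl::sycl::buffer<float, 1>(inputA.data(), cl::sycl::range<1>(size));
    auto inputBBuf = cl::sycl::buffer<float, 1>(inputB.data(), cl::sycl::range<1>(size));
    auto outputBuf = cl::sycl::buffer<float, 1>(output.data(), cl::sycl::range<1>(size));

    myQueue.submit([&](cl::sycl::handler &cgh) {
      // Step 3: accessors; passing the handler makes them device accessors.
      auto inputAAcc = inputABuf.get_access<cl::sycl::access::mode::read>(cgh);
      auto inputBAcc = inputBBuf.get_access<cl::sycl::access::mode::read>(cgh);
      auto outputAcc = outputBuf.get_access<cl::sycl::access::mode::write>(cgh);

      // Step 4: parallel_for over the full range; operator[] takes the id.
      cgh.parallel_for<vector_add>(cl::sycl::range<1>(size),
                                   [=](cl::sycl::id<1> id) {
        outputAcc[id] = inputAAcc[id] + inputBAcc[id];
      });
    });
  }  // Buffers copy back to the host vectors on destruction here.
}
```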

docs/sycl_04_image_grayscale.md

Lines changed: 11 additions & 9 deletions
@@ -16,6 +16,8 @@ An image can be grayscaled using the following algorithm:
 Y = (R * 0.229) + (G * 0.587) + (B * 0.114)
 ```
 
+Where R, G and B are the red, green and blue channels of an RGBA four-channel image format.
+
 For the purposes of this exercise the STB image loading and writing library has been made available and the source for this exercise already contains the appropriate API calls to get you started.
 
 1.) Write a SYCL kernel function for performing a grayscaling
@@ -24,27 +26,27 @@ The source for this example provides a stub which loads and write an image using
 
 The source also contains a call to a benchmarking utility that will print the time taken to execute the SYCL code, the SYCL code should go inside the lambda that is passed to the `benchmark` function.
 
-+ Change the path to the image, feel free to use your own image, but be wary of the size.
-+ Image loaded has four channels (RGBA)
-+ It's recommended that you use a 2 dimensional range for your kernel, but a 1 dimensional range for you buffer.
+It's recommended that you use a 2-dimensional `range` for `parallel_for` when working with images.
+
+Note you will have to update the path to an image. There is an image in the repository but feel free to use any image you choose. Though it's recommended that you use a png image whose dimensions are multiples of 2 (for example 512x512) and has four channels (RGBA).
 
 2.) Evaluate global memory access
 
 Now that you have a working grayscaling kernel you should evaluate whether the global memory access patterns in your kernel are coalesced.
 
-Consider two alternative ways to linearise the global id:
+Consider two alternative ways to linearize the global id:
 
 ```
 auto rowMajorLinearId = (idx[1] * width) + idx[0]; // row-major
 auto columnMajorLinearId = (idx[0] * height) + idx[1]; // column-major
 ```
 
-Try using both of these and compare the execution time of each.
+Try using both of these and compare the execution time of each. Though note that the benchmark facility provided measures whole application time, which is less accurate than measuring the actual kernel times.
 
-3.) Use vectorisation
+3.) Use vectorization
 
-Now that global memory access is coalesced another optimization you could do here would be to use SYCL vectors to present the pixels in the image.
+Now that global memory access is coalesced another optimization you could do here would be to use the SYCL `vec` class to represent the pixels in the image.
 
-You can reinterpret a buffer to be represented as a different type using the `buffer` class' `reinterpret` member function template. When calling this function you must specify the new type as a template parameter and a new `range` that will represent elements of the new type within the same space in memory as a function parameter.
+You can reinterpret a `buffer` to be represented as a different type using the `reinterpret` member function template of the `buffer` class. When calling this function you must specify the new type as a template parameter and a new `range` that will represent elements of the new type within the same space in memory as a function parameter.
 
 Try reinterpreting your buffer to use `cl::sycl::float4` instead of `float`.

docs/sycl_05_transpose.md

Lines changed: 34 additions & 4 deletions
@@ -11,10 +11,40 @@ In this first exercise you will learn:
 ---
 
-TODO
+Matrix transpose is a very useful operation when working in linear algebra applications and can be performed efficiently on a GPU.
 
-1.) Write a SYCL kernel for transposing matrices.
+A matrix transposition switches the dimensions of a matrix, for example:
 
-2.) Use local memory to improve global memory coalescing.
+```
+A = [1, 2, 3] => A' = [1, 4, 7]
+    [4, 5, 6]         [2, 5, 8]
+    [7, 8, 9]         [3, 6, 9]
+```
 
-3.) Try different work-group sizes.
+1.) Write a SYCL kernel for transposing matrices
+
+For the purposes of this exercise the source file provides a stub that defines a simple matrix class, whose data can be retrieved using the `data` member function and printed for evaluating the results using the `print` member function. Note for representation purposes `print` will display in row-major linearization.
+
+Define a SYCL kernel function that takes an input matrix and an output matrix, and assigns the elements of the input to the transposed position in the output. As a hint, try calculating the row-major and column-major linearizations of the `id`.
+
+It's recommended that you use a 2-dimensional `range` for the `parallel_for`.
+
+Observe that no matter how you change the linearization the performance will be largely unaffected.
+
+2.) Use local memory to improve global memory coalescing
+
+Create a local `accessor` (an `accessor` with the `access::target::local` access target); remember a local `accessor` must have the `access::mode::read_write` access mode. The constructor of the local `accessor` just takes a `range` specifying the number of elements to allocate per work-group and the `handler`.
+
+Once you've created the accessor, pass it to the SYCL kernel function as you did the buffer `accessor`s. You can then copy the elements of global memory from the buffer `accessor` to local memory in the local `accessor`.
+
+Make sure to coalesce the reads from global memory and then assign into local memory already transposed, so that the writes to global memory can also be coalesced.
+
+You should be able to observe a performance gain from doing this.
+
+3.) Try different work-group sizes
+
+Try using different work-group sizes for your SYCL kernel function. Remember you will have to specify an `nd_range` in order to specify the local range.
+
+Work-group sizes you could try are 8x8, 16x16, 16x32. Note that some of these may not work if your GPU does not support work-groups that large.
+
+Remember you can query the maximum work-group size using the `device` class' `get_info` member function.

solutions/sycl_03_vector_add.cpp

Lines changed: 25 additions & 1 deletion
@@ -50,7 +50,7 @@ void parallel_add(std::vector<T> &inputA, std::vector<T> &inputB,
   });
 }
 
-TEST_CASE("sycl_03_vector_add", "add_floats") {
+TEST_CASE("add_floats", "sycl_03_vector_add") {
   const int size = 1024;
 
   std::vector<float> inputA(size);
@@ -67,3 +67,27 @@ TEST_CASE("sycl_03_vector_add", "add_floats") {
     REQUIRE(output[i] == static_cast<float>(i * 2.0f));
   }
 }
+
+TEST_CASE("intermediate_buffer", "sycl_03_vector_add") {
+  const int size = 1024;
+
+  std::vector<float> inputA(size);
+  std::vector<float> inputB(size);
+  std::vector<float> inputC(size);
+  std::vector<float> temp(size);
+  std::vector<float> output(size);
+
+  std::iota(begin(inputA), end(inputA), 0.0f);
+  std::iota(begin(inputB), end(inputB), 0.0f);
+  std::iota(begin(inputC), end(inputC), 0.0f);
+  std::fill(begin(temp), end(temp), 0.0f);
+  std::fill(begin(output), end(output), 0.0f);
+
+  parallel_add(inputA, inputB, temp);
+
+  parallel_add(temp, inputC, output);
+
+  for (int i = 0; i < size; i++) {
+    REQUIRE(output[i] == static_cast<float>(i * 3.0f));
+  }
+}
