Index:
- Setup
- Task_1
  - 2.1 Task in serial
  - 2.2 Task in parallel
  - 2.3 Performance comparisons
- Task_2
  - 3.1 Task in serial
  - 3.2 Task in parallel
- Task_3
  - 4.1 Task in serial
  - 4.2 Task in parallel
  - 4.3 Performance comparisons
Note: The code has been tested on Windows 11 only. I am not sure whether it will run without issues on Linux.
You can follow this installation guide by NVIDIA to get your CUDA toolkit set up.
Follow section 2.2 of the following document: https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html
Once your CUDA toolkit is installed, start a new project in Visual Studio and select the CUDA template. You'll be presented with a kernel.cu file containing boilerplate code. Select everything (Ctrl+A), delete it, and paste in the code from the kernel.cu file on GitHub for the task you want to run. For example, to run Task_1, go to the GitHub folder for Task_1 and copy the code from Task_1/Kernel.cu into the local kernel.cu that's open in Visual Studio. Once that's done, just hit Run to compile and execute the program.
Write a parallel program to find the sum of all the elements in a given array.
This task in serial is pretty simple: just take two arrays of the same size, initialize them to some values less than one hundred, and add them element by element.
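A minimal serial sketch of that idea (my own illustration, not the exact code in Task_1/Kernel.cu; names and sizes are assumptions):

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 15;               // array size, e.g. 2^15
    std::vector<int> a(N), b(N), c(N);

    // Fill both inputs with small values (< 100), as described above.
    for (int i = 0; i < N; ++i) {
        a[i] = i % 100;
        b[i] = i % 100;
    }

    // Serial element-wise addition.
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    printf("c[0] = %d  c[N-1] = %d\n", c[0], c[N - 1]);
    return 0;
}
```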
For the parallel version of this task, I initialized three arrays on the GPU: one empty and the other two holding values equal to their index. I then assigned each thread to add the values at its index. The program runs fine, but because of the way I've defined the arrays, any array size over 2^15 gives a stack overflow error. I didn't know C++ when I wrote this task; if I had to do it again I'd fix it, but I didn't have the time to go back and change it.
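For illustration, here is a minimal sketch of a one-thread-per-index addition kernel. This is my own reconstruction, not the repository's code; in particular, declaring the host arrays on the stack (as below) is one way the stack overflow described above can happen once N grows much past 2^15:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addKernel(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard threads that fall past the end of the array
        c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 15;
    const int bytes = N * sizeof(int);

    // Host arrays on the stack; large N would overflow it (assumed cause, see above).
    int a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = i; }

    int *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    // One thread per element.
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    addKernel<<<blocks, threads>>>(dA, dB, dC, N);

    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[N-1] = %d\n", c[N - 1]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```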
For array sizes around 2^15, the serial program is faster than the parallel one, because a CUDA program has to copy the data to the GPU, and due to the relatively simple nature of the task, memory bandwidth rather than computational speed is the bottleneck. For this task I have therefore included both the time taken by the whole parallel program and the time taken excluding the memory allocation and copy step.
Array size | Serial time (ms) | Parallel total time (ms) | Parallel time excluding memory allocation/transfer (ms) |
---|---|---|---|
2^1 | 0.004300 | 1609.94 | 0.5857 |
2^3 | 0.002200 | 1029.4 | 0.5421 |
2^7 | 0.007600 | 155.454 | 0.5416 |
2^11 | 0.160900 | 1621.57 | 0.6242 |
2^15 | 1.627700 | 1709.51 | 0.659 |
You can see that the parallel compute (excluding memory transfers) starts being faster at 2^15. Unfortunately my program crashes for sizes larger than that, but for very large sizes, around 2^25 or so, even the total parallel time would be shorter than the serial time.
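For reference, the two parallel timings in the table can be separated by recording CUDA events around different parts of the program. The sketch below is my own illustration of that idea, not the measurement code actually used for these numbers:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Helper: elapsed milliseconds between two recorded events.
static float elapsedMs(cudaEvent_t start, cudaEvent_t stop) {
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main() {
    cudaEvent_t totalStart, kernelStart, kernelStop, totalStop;
    cudaEventCreate(&totalStart);
    cudaEventCreate(&kernelStart);
    cudaEventCreate(&kernelStop);
    cudaEventCreate(&totalStop);

    cudaEventRecord(totalStart);
    // ... cudaMalloc + cudaMemcpy host -> device would go here ...

    cudaEventRecord(kernelStart);
    // ... kernel launch would go here ...
    cudaEventRecord(kernelStop);

    // ... cudaMemcpy device -> host + cudaFree would go here ...
    cudaEventRecord(totalStop);
    cudaEventSynchronize(totalStop);

    printf("total: %.4f ms, kernel only: %.4f ms\n",
           elapsedMs(totalStart, totalStop),
           elapsedMs(kernelStart, kernelStop));
    return 0;
}
```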
Given a graph, write a parallel program to find the number of connected components in the graph, and find which component a given vertex belongs to.

Note: this task is incomplete.
Running DFS from the unvisited nodes of the graph gives you the number of components: the number of times DFS is called at the top level is the number of components, and the nodes found in each DFS traversal are the members of that component.
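A minimal serial sketch of that idea on an adjacency matrix (my own illustration, not code from this repo):

```cpp
#include <cstdio>
#include <vector>

// Mark every node reachable from v with the given component id.
void dfs(int v, int id, const std::vector<std::vector<int>> &adj,
         std::vector<int> &component) {
    component[v] = id;
    for (int u = 0; u < (int)adj.size(); ++u)
        if (adj[v][u] && component[u] == -1)
            dfs(u, id, adj, component);
}

int main() {
    // Adjacency matrix of a small graph with two components: {0,1,2} and {3,4}.
    std::vector<std::vector<int>> adj = {
        {0, 1, 1, 0, 0},
        {1, 0, 0, 0, 0},
        {1, 0, 0, 0, 0},
        {0, 0, 0, 0, 1},
        {0, 0, 0, 1, 0},
    };

    std::vector<int> component(adj.size(), -1);   // -1 means unvisited
    int count = 0;
    for (int v = 0; v < (int)adj.size(); ++v)
        if (component[v] == -1)                   // each top-level DFS = one new component
            dfs(v, count++, adj, component);

    printf("components: %d, vertex 4 is in component %d\n", count, component[4]);
    return 0;
}
```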
For the parallel case, the implementation would require feeding each row of the adjacency matrix to one thread. Each thread would then perform the serial algorithm on the part of the graph it can see, and union-find would be needed to merge the components found by the individual threads. However, being new to DSA (I learned C and C++ for this assignment; I had never used them before), I wasn't able to work out how to implement union-find in the kernel. That is why I could not finish the parallel implementation of this task.
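For completeness, here is a minimal host-side sketch of the union-find structure that the merge step refers to. This is a generic illustration of the technique, not a working kernel and not a completion of this task:

```cpp
#include <cstdio>
#include <vector>

// Union-find (disjoint set union) with path compression.
struct UnionFind {
    std::vector<int> parent;
    UnionFind(int n) : parent(n) {
        for (int i = 0; i < n; ++i) parent[i] = i;   // every node starts as its own set
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];           // path compression
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

int main() {
    UnionFind uf(5);
    uf.unite(0, 1);      // merge partial components found by different threads
    uf.unite(1, 2);
    uf.unite(3, 4);
    printf("0 and 2 in same component: %d\n", uf.find(0) == uf.find(2));
    printf("0 and 3 in same component: %d\n", uf.find(0) == uf.find(3));
    return 0;
}
```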
Write a parallel program to run a Monte Carlo simulation to find the probability of getting a total of 3n when n 6-sided dice are rolled simultaneously. Make sure that your PRNG is working as expected in the parallel algorithm.
For the serial case, I just rolled a random number from 1 to 6 an arbitrarily large number of times and counted up the number of times the total came to 3*N. Once I had that count, I divided it by the total number of runs to get the probability of getting that sum.
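A minimal serial sketch of that estimate (my own illustration; the constants and variable names are assumptions, not the repo's code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <ctime>

int main() {
    const int  N       = 3;            // number of dice
    const long samples = 1 << 20;      // number of simulated rolls
    srand((unsigned)time(nullptr));

    long hits = 0;
    for (long s = 0; s < samples; ++s) {
        int total = 0;
        for (int d = 0; d < N; ++d)
            total += rand() % 6 + 1;   // roll one 6-sided die
        if (total == 3 * N)
            ++hits;
    }

    printf("P(total == %d) ~= %f\n", 3 * N, (double)hits / samples);
    return 0;
}
```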
After assigning a cache array to each block, each thread rolls a random number three times and saves the sum total to its index in the cache. Then I perform a reduction on the cache to collapse it into an array of size 1, and finally I add up cache[0] across all the blocks using atomicAdd. The values I'm getting from the simulation are a bit off from the actual probabilities. I do think my PRNG is working fine, since I used a different seed and still got similar results. There might be some logical error in the code, but you can see from the next section that running it in parallel is much faster than running it in serial. While performance testing, I did encounter an error at a sample size of 2^30, but I didn't have the time to fix it.
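For illustration, here is a minimal sketch of a block-cache reduction plus atomicAdd, using cuRAND for the per-thread rolls. This is not the repository's kernel: it stores a 0/1 hit flag in the cache rather than the raw dice total, which differs slightly from the description above but keeps the reduction a plain sum of hits.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

const int N_DICE  = 3;
const int THREADS = 256;

// Each thread simulates one roll of N_DICE dice, writes 1 to the shared cache
// if the total equals 3 * N_DICE, then the block reduces its cache and one
// atomicAdd per block accumulates the global hit count.
__global__ void rollKernel(unsigned long long seed, long samples,
                           unsigned long long *hits) {
    __shared__ int cache[THREADS];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    int hit = 0;
    if (tid < samples) {
        curandState state;
        curand_init(seed, tid, 0, &state);      // independent sequence per thread
        int total = 0;
        for (int d = 0; d < N_DICE; ++d)
            total += curand(&state) % 6 + 1;    // roll one die
        hit = (total == 3 * N_DICE);
    }
    cache[threadIdx.x] = hit;
    __syncthreads();

    // Standard shared-memory tree reduction down to cache[0].
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(hits, (unsigned long long)cache[0]);
}

int main() {
    const long samples = 1L << 20;
    unsigned long long *dHits;
    cudaMalloc(&dHits, sizeof(unsigned long long));
    cudaMemset(dHits, 0, sizeof(unsigned long long));

    int blocks = (int)((samples + THREADS - 1) / THREADS);
    rollKernel<<<blocks, THREADS>>>(1234ULL, samples, dHits);

    unsigned long long hits = 0;
    cudaMemcpy(&hits, dHits, sizeof(hits), cudaMemcpyDeviceToHost);
    printf("P(total == %d) ~= %f\n", 3 * N_DICE, (double)hits / samples);

    cudaFree(dHits);
    return 0;
}
```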
Number of dice = 1
Sample size | Serial time (ms) | Parallel simulation time (ms) |
---|---|---|
2^10 | 0.0618 | 6.0239 |
2^15 | 1.8051 | 6.4478 |
2^20 | 22.1784 | 6.7859 |
2^25 | 726.391 | 51.1241 |
2^30 | 21592 | - |
Number of dice = 3
Sample size | Serial time (ms) | Parallel simulation time (ms) |
---|---|---|
2^10 | 0.2017 | 6.0266 |
2^15 | 1.9766 | 6.384 |
2^20 | 51.4456 | 5.9969 |
2^25 | 1897.94 | 50.9596 |
2^30 | Forever (t > 1 minute) | - |