-
Generally speaking, yes. But AFAIR you still need to allocate the buffer through OpenCL so that it can be shared with the host (there are some paging restrictions), and there is also some syncing involved. I tried to do it at the beginning but put it aside, since the gain wasn't significant compared to other performance issues with the Intel iGPU. It may make more sense on an AMD APU, which is usually more powerful. Anyway, data transfer between device and host is a relatively small part of the overall resource use in typical GPU training. And the Intel iGPU is so weak that there is no real advantage in using one over a highly optimised CPU (in practice, CPU training is virtually always faster than on an Intel iGPU).
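For reference, the OpenCL-side allocation and mapping described above looks roughly like this (a minimal pyopencl sketch, not the backend's actual code; the buffer size and flags are illustrative):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Let the OpenCL runtime allocate host-visible memory; on an iGPU/APU with
# unified memory this typically enables zero-copy access from both sides.
nbytes = 4096 * 4096 * 4  # one float32 4096x4096 tensor
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.ALLOC_HOST_PTR,
                size=nbytes)

# Map the buffer instead of copying: the returned array aliases the device
# buffer, and map/unmap provide the syncing mentioned above.
host_view, _ = cl.enqueue_map_buffer(queue, buf, cl.map_flags.WRITE,
                                     0, (4096, 4096), np.float32)
host_view[:] = 1.0              # write directly, no explicit host->device copy
host_view.base.release(queue)   # unmap, handing the data back to the device
```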
-
If you have an AMD APU to run some tests and benchmarks on, I'd like to see the results.
I agree on that - while their hardware is generally good, their software is kinda poor - especially with dropping ROCm support for older hardware. (BTW, officially only professional HW is supported. For example, my RX 6600 XT requires an environment variable telling ROCm that it is a different, but compatible, GPU type.)
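The override in question is presumably ROCm's HSA_OVERRIDE_GFX_VERSION (my assumption; the RX 6600 XT is gfx1032, which is commonly spoofed as the supported gfx1030), e.g.:

```python
import os

# Must be set before the ROCm/HIP runtime initializes, i.e. before importing torch.
# gfx1032 (RX 6600 XT) pretends to be the officially supported gfx1030.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

import torch
print(torch.cuda.is_available())  # ROCm builds expose HIP devices via the cuda API
```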
-
Just did some operator performance tests on my laptop with a Ryzen 6800H (680M iGPU).

elementwise (add / multiply by constant) on a 4096x4096 tensor
cpu: 0.005085025897249579

4096x4096 * 4096x4096 matmul
Since matmul is not supported yet, I used Linear instead.

4096x4096 * 4096x1 matmul
cpu: 0.0027509444990428166

4096x4096 * 4096x4096 element-wise mul
cpu: 0.010118292996194213

It seems like AMD's APU behaves more like a 'typical' GPU than Intel's. Its advantage over the CPU is much more noticeable for large tensors and compute-intensive tasks; for tasks like gemv, the CPU can still hold its own. I also attempted to test some larger tensors, but it just shut down my computer (due to OOM maybe - it reached the limit of dedicated memory in my task manager). I guess tensors are only allocated in the dedicated frame buffer and can't utilize shared memory yet. FP16 also fails, though I'd infer its throughput would have a simple 2:1 ratio.
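For anyone reproducing these numbers, a minimal timing harness along these lines should do (a sketch; warmup iterations matter because the first call includes allocation and, on the OpenCL backend, kernel compilation):

```python
import time
import torch

def bench(fn, iters=20, warmup=3):
    # Warmup first: the initial calls include allocation and (on the OpenCL
    # backend) kernel compilation, which is why the first frame is slower.
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    # NOTE: for a device backend you'd also need to synchronize the queue
    # before reading the clock; this plain version is only exact for CPU.
    return (time.perf_counter() - t0) / iters

a = torch.randn(4096, 4096)
v = torch.randn(4096, 1)
print("cpu 4096x4096 @ 4096x1 matmul:", bench(lambda: a @ v))
print("cpu 4096x4096 element-wise mul:", bench(lambda: a * a))
```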
-
Really strange:
For sure the time shouldn't go up from the warmup frame to the next one. Since memory is limited, I suggest testing with smaller tensors.
If you hit the memory limit on a normal GPU it would continue to work, just slower. Not sure how it works on an APU.
-
Firstly, thanks for the great work! 😄
I just got my hands on this repository, and I ran something like this:
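(A minimal sketch of the kind of test I mean; the tensor size and the backend import name are assumptions:)

```python
import time
import torch
import pytorch_ocl  # assumed import name for the out-of-tree OpenCL backend

x = torch.randn(4096, 4096)
print(hex(x.data_ptr()))      # host-side address

t0 = time.time()
y = x.to('ocl:0')             # "transfer" to the iGPU
print('to() took', time.time() - t0, 'seconds')
print(hex(y.data_ptr()))      # a different address, even on shared DRAM
```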
In my case, 'ocl:0' refers to an Intel HD Graphics iGPU.
I noticed that the address of the tensor changed after .to('ocl:0'). From the hardware's perspective the iGPU shares DRAM with the CPU, so physically this 'transfer' shouldn't incur a big overhead in software. But I have observed that .to() takes a lot of time. So I wonder: is it possible to take advantage of the unified memory architecture and eliminate this overhead? (i.e. change the tensor's backend without doing a copy and memory allocate/free)
I may be misunderstanding something; if so, please let me know! 😃