-
Generally speaking, yes. But AFAIR you still need to allocate the buffer through OpenCL so that it can be shared with the host (there are some paging restrictions), and there is also some syncing involved. I tried to do it at the beginning but put it aside, since the gain wasn't significant compared to other performance issues with the Intel iGPU. It may make more sense on an AMD APU, which is usually more powerful. Anyway, data transfer between device and host is a relatively small part of the overall resource use in typical GPU training. And the Intel iGPU is so weak that there is no real advantage in using one over a highly optimised CPU (in practice, CPU training is virtually always faster than on an Intel iGPU).
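For reference, the OpenCL-side allocation and mapping described above looks roughly like this (a minimal pyopencl sketch, not the backend's actual code; the buffer size and flags are illustrative):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Let the OpenCL runtime allocate host-visible memory; on an iGPU/APU with
# unified memory this typically enables zero-copy access from both sides.
nbytes = 4096 * 4096 * 4  # one float32 4096x4096 tensor
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.ALLOC_HOST_PTR,
                size=nbytes)

# Map the buffer instead of copying: the returned array aliases the device
# buffer, and map/unmap provide the syncing mentioned above.
host_view, _ = cl.enqueue_map_buffer(queue, buf, cl.map_flags.WRITE,
                                     0, (4096, 4096), np.float32)
host_view[:] = 1.0              # write directly, no explicit host->device copy
host_view.base.release(queue)   # unmap, handing the data back to the device
```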
-
If you have an AMD APU to run some tests and benchmarks on, I'd like to see the results.
I agree on that - while their hardware is generally good, their software is kinda poor - especially with dropping ROCm support for older hardware. (BTW, officially only professional HW is supported. For example, my RX 6600 XT requires an environment variable telling ROCm that it is a different, but compatible, GPU type.)
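The override in question is presumably ROCm's HSA_OVERRIDE_GFX_VERSION (my assumption; the RX 6600 XT is gfx1032, which is commonly spoofed as the supported gfx1030), e.g.:

```python
import os

# Must be set before the ROCm/HIP runtime initializes, i.e. before importing torch.
# gfx1032 (RX 6600 XT) pretends to be the officially supported gfx1030.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

import torch
print(torch.cuda.is_available())  # ROCm builds expose HIP devices via the cuda API
```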
-
Just did some operator performance tests on my laptop with a Ryzen 6800H (680M iGPU).

elementwise (add / multiply by constant) on a 4096x4096 tensor
cpu: 0.005085025897249579

4096x4096 * 4096x4096 matmul
Since matmul is not supported yet, I used Linear instead.

4096x4096 * 4096x1 matmul
cpu: 0.0027509444990428166

4096x4096 * 4096x4096 element-wise mul
cpu: 0.010118292996194213

It seems like AMD's APU behaves more like a 'typical' GPU than Intel's. Its advantage over the CPU is much more noticeable for large tensors and compute-intensive tasks; for tasks like gemv, the CPU can still hold its own. I also attempted to test some larger tensors, but it just shut down my computer (due to OOM maybe - it reached the limit of dedicated memory in my task manager). I guess tensors are only allocated in the dedicated frame buffer and can't utilize shared memory yet. FP16 also fails, though I'd infer its throughput would have a simple 2:1 ratio.
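For anyone reproducing these numbers, a minimal timing harness along these lines should do (a sketch; warmup iterations matter because the first call includes allocation and, on the OpenCL backend, kernel compilation):

```python
import time
import torch

def bench(fn, iters=20, warmup=3):
    # Warmup first: the initial calls include allocation and (on the OpenCL
    # backend) kernel compilation, which is why the first frame is slower.
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    # NOTE: for a device backend you'd also need to synchronize the queue
    # before reading the clock; this plain version is only exact for CPU.
    return (time.perf_counter() - t0) / iters

a = torch.randn(4096, 4096)
v = torch.randn(4096, 1)
print("cpu 4096x4096 @ 4096x1 matmul:", bench(lambda: a @ v))
print("cpu 4096x4096 element-wise mul:", bench(lambda: a * a))
```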
-
Really strange:
For sure the time shouldn't go up from the warmup frame to the next one. Since memory is limited, I suggest testing with smaller tensors.
If you hit the memory limit on a normal GPU it would continue to work, just slower. Not sure how it works on an APU.
-
Firstly, thanks for the great work! 😄
I just got my hands on this repository, and I ran something like this:
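(A minimal sketch of the kind of test I mean; the tensor size and the backend import name are assumptions:)

```python
import time
import torch
import pytorch_ocl  # assumed import name for the out-of-tree OpenCL backend

x = torch.randn(4096, 4096)
print(hex(x.data_ptr()))      # host-side address

t0 = time.time()
y = x.to('ocl:0')             # "transfer" to the iGPU
print('to() took', time.time() - t0, 'seconds')
print(hex(y.data_ptr()))      # a different address, even on shared DRAM
```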
In my case, 'ocl:0' refers to an Intel HD Graphics iGPU.
I noticed that the address of the tensor changed after .to('ocl:0'). From the hardware's perspective the iGPU shares DRAM with the CPU, so physically this 'transfer' shouldn't incur a big overhead in software. But I have observed that .to() takes a lot of time. So I wonder: is it possible to take advantage of the unified memory architecture and eliminate this overhead? (i.e. change the tensor's backend without doing a copy and memory allocate/free)
I may be misunderstanding something; if so, please let me know! 😃