Replace or emplace outer dimensions for GPU schedules
In the nested parallelism scheduling algorithm, whenever a dimension
is marked for GPU acceleration, e.g. split as `y -> y_i, y_o`, replace the
corresponding variable `y` with `y_o` in `outer_dims`.
This ensures the internal assertion `dims.size() >= outer_dims.size()`
always holds for GPU schedules.
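
A minimal sketch of this replace-or-emplace step (the helper name and container type are hypothetical, not the autoscheduler's actual internals):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: when a dimension is split for GPU acceleration
// (e.g. y -> y_i, y_o), the outer loop variable takes the place of the
// original one in outer_dims, so dims.size() >= outer_dims.size()
// keeps holding.
void replace_or_emplace_outer_dim(std::vector<std::string> &outer_dims,
                                  const std::string &old_var,    // e.g. "y"
                                  const std::string &outer_var)  // e.g. "y_o"
{
    auto it = std::find(outer_dims.begin(), outer_dims.end(), old_var);
    if (it != outer_dims.end()) {
        *it = outer_var;                  // replace "y" with "y_o"
    } else {
        outer_dims.push_back(outer_var);  // emplace when "y" was not tracked yet
    }
}
```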
The immediate effect is that, for a downstream stage with a GPU
schedule such as `g.gpu_tile(x, xi, xo, ...)`, the upstream stage now correctly
specifies the dimension `xo` via `f.compute_at(g, xo)`. This is in
accordance with the original design intent of the Mullapudi2016 paper.
As a result, the GPU IR correctly synthesizes GPU shared memory to cache
the intermediate results of stage `f`.
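
For illustration, a hand-written Halide schedule of the shape the autoscheduler is now expected to emit might look like this (function names, variable names, and tile sizes are illustrative, not the autoscheduler's literal output):

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    Func f("f"), g("g");
    f(x, y) = x + y;
    g(x, y) = f(x, y) + f(x + 1, y);

    // Downstream stage: tiled onto GPU blocks (xo, yo) and threads (xi, yi).
    g.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);

    // Upstream stage: computed at the outer (block) dimension xo, so the
    // region of f needed by one block is staged in GPU shared memory.
    f.compute_at(g, xo);

    Target target = get_host_target().with_feature(Target::CUDA);
    // g.realize({256, 256}, target);  // uncomment on a CUDA-capable machine
    return 0;
}
```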
---
Also, for all stages that are scheduled with `compute_at`, mark all
vectorizable inner dimensions as `gpu_threads`.
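
Continuing the hypothetical schedule above, the `compute_at`'ed stage would then also have its inner (vectorizable) dimensions mapped to GPU threads, roughly:

```cpp
// Instead of leaving f's inner loops serial (or vectorized, as on the
// CPU path), map them to GPU threads within the enclosing block.
f.compute_at(g, xo)
 .gpu_threads(x, y);
```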
---
In the correctness tests at `test/autoscheduler/mullapudi/*.cpp` and the
performance regression tests at `apps/*`, lower the estimated GPU
shared memory limit by specifying `autoscheduler.last_level_cache_size
<= 10000`. Except for the `conv_layer` pipeline, all pipelines should show
improved caching.
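
One way to set this limit from C++ (a sketch assuming the Mullapudi2016 plugin is available and that `last_level_cache_size` is passed through `AutoschedulerParams::extra`; the exact mechanism used in the tests may differ):

```cpp
#include "Halide.h"
using namespace Halide;

// Cap the estimated GPU shared-memory budget seen by the autoscheduler.
void autoschedule_with_small_cache(Pipeline &p, const Target &gpu_target) {
    load_plugin("autoschedule_mullapudi2016");  // Mullapudi2016 autoscheduler plugin

    AutoschedulerParams params("Mullapudi2016");
    params.extra["last_level_cache_size"] = "10000";  // illustrative value, per the description

    p.apply_autoscheduler(gpu_target, params);
}
```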