Skip to content

Conversation

@kpu
Copy link
Member

@kpu kpu commented Jul 26, 2020

The existing implementation of Select for CPU is very slow.

Reimplemented the Select function for CPU using std::copy for the case when stuff to copy after the axis is contiguous, which it always is in practice.

Fixes #684

Measured:
enes.student.tiny11
xzcat sources.shuf.xz |head -n 10000
var (Cascade Lake) single core
based on intgemm_reintegrated_computestats branch

Before Total time: 66.69077s wall
After Total time: 61.20206s wall

Testing

Ran Marian unit tests (which already includes a tesf tor Select)
Ran enes student on Linux with cmake .. -DCMAKE_BUILD_TYPE=Debug -DCOMPILE_CUDA=OFF -DUSE_FBGEMM=ON -DUSE_SENTENCEPIECE=ON -DCOMPILE_TESTS=ON

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

kpu added 4 commits July 26, 2020 19:26
Fixes #684

Measured:
enes.student.tiny11
xzcat sources.shuf.xz |head -n 10000
var (Cascade Lake) single core
based on intgemm_reintegrated_computestats branch

Before Total time: 66.69077s wall
After Total time: 61.20206s wall
@emjotde emjotde self-assigned this Jul 27, 2020
@emjotde
Copy link
Member

emjotde commented Jul 27, 2020

@snukky Cool, the checks are already doing their job.

idxBase += idxStride;
}
});
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do an if {} else {} here instead of the empty return. Could you also mention that the code below is execute if the above conditions are not met and that this code is generic but slow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let's also add a positive and a negative example where each branch is used? There are a couple of them in tests_operators, I think at least one will still go to the default version?

const functional::Shape &inShape,
const functional::Shape &idxShape,
int axisCPU,
Backend backend) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a different name than Backend. We use backend for CPU/GPU device things, had me confused for a sec.

@emjotde
Copy link
Member

emjotde commented Aug 1, 2020

Nice change. You mentioned in the issue that this improves things for your tiny models, but not for the WNGT models? Can you briefly say here what the difference is?

Copy link
Contributor

@frankseide frankseide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change indeed. LGTM!

@snukky
Copy link
Member

snukky commented Aug 26, 2020

Ping @kpu

@emjotde
Copy link
Member

emjotde commented Aug 31, 2020

@kpu if you address the review comments this is ready to merge.

@kpu
Copy link
Member Author

kpu commented Aug 31, 2020

I know, will fix it this week.

@emjotde
Copy link
Member

emjotde commented Oct 26, 2020

Ping :P

@emjotde
Copy link
Member

emjotde commented Mar 23, 2021

@kpu Beep bop boop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Select function on CPU takes 10% of time on tiny students, can be optimized

5 participants