Description
I ran GCC's missed-optimization reporter on the arK example in stat_comp_benchmarks. The gist below contains all of the Stan Math related optimization misses:
https://gist.github.com/SteveBronder/eeb870bd25e9cb5c10d6e7bff44d24bb
In particular, optimization misses like the ones below were interesting to me:
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: not suitable for gather load _6 = _5->adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: bad data references.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: not enough data-refs in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:73:33: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _5 = *_4;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _10 = *_9;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _6 = _5->adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _5->adj_ = _12;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _7 = this_16(D)->D.202934.adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:75:3: note: not vectorized: not enough data-refs in basic block.
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _8 = _1->adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _1->adj_ = _10;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _5 = _2->adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _2->adj_ = _7;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _6 = this_13(D)->D.211540.D.211444.adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _9 = this_13(D)->D.211540.D.211444.adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:21:18: note: not consecutive access _1->adj_ = Nan;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:21:18: note: not consecutive access _2->adj_ = Nan;
The operators tend to be defined pretty similarly; below is the op for addition:
namespace internal {
class add_vv_vari : public op_vv_vari {
 public:
  add_vv_vari(vari* avi, vari* bvi)
      : op_vv_vari(avi->val_ + bvi->val_, avi, bvi) {}
  void chain() {
    if (unlikely(is_any_nan(avi_->val_, bvi_->val_))) {
      avi_->adj_ = std::numeric_limits<double>::quiet_NaN();
      bvi_->adj_ = std::numeric_limits<double>::quiet_NaN();
    } else {
      avi_->adj_ += adj_;
      bvi_->adj_ += adj_;
    }
  }
};

class add_vd_vari : public op_vd_vari {
 public:
  add_vd_vari(vari* avi, double b) : op_vd_vari(avi->val_ + b, avi, b) {}
  void chain() {
    if (unlikely(is_any_nan(avi_->val_, bd_))) {
      avi_->adj_ = std::numeric_limits<double>::quiet_NaN();
    } else {
      avi_->adj_ += adj_;
    }
  }
};
}  // namespace internal
The main miss we have in the above is that vari stores its adj_ and val_ next to each other on the stack allocator, when we really only need access to the adjoint values. This is a known issue (Bob has a very nice post related to this on Discourse), but I wonder if there is a dumber approach that could resolve this particular issue.
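For context, the layout in question looks roughly like this (a trimmed-down sketch of vari from stan/math/rev/core/vari.hpp, not the full class): every node keeps val_ and adj_ side by side, and the reverse pass only reaches adj_ through a vari* indirection, which is exactly the non-consecutive access the notes above complain about.

// Trimmed-down sketch of vari; the real class also registers itself on the
// AD stack allocator, so val_/adj_ pairs end up interleaved in memory.
class vari {
 public:
  const double val_;  // value, fixed at construction
  double adj_;        // adjoint, accumulated during chain()
  explicit vari(double x) : val_(x), adj_(0.0) {}
  virtual void chain() {}
};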
What if we added a template parameter to stack_alloc (or maybe vari?) that defined the storage scheme we wanted? Say we defined enums joint, val, and adj, where joint places val_ and adj_ together on the same stack while val/adj places val_ on one stack and adj_ on another. Would that help for chain() methods like the one in precomputed_gradients_vari? It feels like it would give us better memory locality, and if we documented it well I don't think this style would be that bad. We could just default the template value to joint, so that
void chain() {
  for (size_t i = 0; i < size_; ++i) {
    varis_[i]->adj_ += adj_ * gradients_[i];
  }
}
would get better memory locality and hit these SIMD optimizations.
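To make that concrete, here is a rough sketch of the idea; none of these names exist in Stan Math today (storage, stack_alloc_t, and chain_split are all hypothetical), it just shows the enum and what the precomputed-gradients style loop could look like once the adjoints live in their own contiguous block:

#include <cstddef>

// Hypothetical sketch only: storage, stack_alloc_t, and chain_split are
// made-up names, not existing Stan Math API.
enum class storage { joint, val, adj };

// joint keeps today's layout (val_ and adj_ interleaved on one stack);
// val/adj would put values on one stack and adjoints on another.
template <storage S = storage::joint>
class stack_alloc_t {
  // ... allocate val_ / adj_ on one or two stacks depending on S ...
};

// With the split layout, the chain() above could walk a plain double* of
// adjoints instead of chasing vari* pointers, a stride-1 access pattern
// the vectorizer can handle.
inline void chain_split(double* adj, const double* gradients,
                        double this_adj, std::size_t size) {
  for (std::size_t i = 0; i < size; ++i) {
    adj[i] += this_adj * gradients[i];  // contiguous loads/stores
  }
}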
Expected Output
Better memory locality and SIMD-friendly chain() methods.
Current Version:
v3.0.0