Skip to content

Missed optimizations in precomputed gradients and operators for vari #1448

Open
@SteveBronder

Description

@SteveBronder

Description

I ran gcc's missed optimizer reporter on the arK example in stat_comp_benchmarks. The below contains all the stan math related optimization misses

https://gist.github.com/SteveBronder/eeb870bd25e9cb5c10d6e7bff44d24bb

In particular the optimization misses like below were interesting to me

stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: not suitable for gather load _6 = _5->adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: bad data references.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: not enough data-refs in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:73:33: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _5 = *_4;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _10 = *_9;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _6 = _5->adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _5->adj_ = _12;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _7 = this_16(D)->D.202934.adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:75:3: note: not vectorized: not enough data-refs in basic block.
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _8 = _1->adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _1->adj_ = _10;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _5 = _2->adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _2->adj_ = _7;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _6 = this_13(D)->D.211540.D.211444.adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _9 = this_13(D)->D.211540.D.211444.adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:21:18: note: not consecutive access _1->adj_ =  Nan;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:21:18: note: not consecutive access _2->adj_ =  Nan;

For operators tend to be defined pretty similarly, below is the op for addition

namespace internal {
class add_vv_vari : public op_vv_vari {
 public:
  add_vv_vari(vari* avi, vari* bvi)
      : op_vv_vari(avi->val_ + bvi->val_, avi, bvi) {}
  void chain() {
    if (unlikely(is_any_nan(avi_->val_, bvi_->val_))) {
      avi_->adj_ = std::numeric_limits<double>::quiet_NaN();
      bvi_->adj_ = std::numeric_limits<double>::quiet_NaN();
    } else {
      avi_->adj_ += adj_;
      bvi_->adj_ += adj_;
    }
  }
};

class add_vd_vari : public op_vd_vari {
 public:
  add_vd_vari(vari* avi, double b) : op_vd_vari(avi->val_ + b, avi, b) {}
  void chain() {
    if (unlikely(is_any_nan(avi_->val_, bd_))) {
      avi_->adj_ = std::numeric_limits<double>::quiet_NaN();
    } else {
      avi_->adj_ += adj_;
    }
  }
};
}  // namespace internal

The main miss we have in the above is that vari stores it's adj_ and val_ next to each other on the stack allocator when we really just need access to the adj values. This is a known issue, Bob has a v nice post related to this on discourse, but I wonder if there is a dumber approach which could resolve this particular issue.

What if we added a template to stack_alloc (or maybe vari?) that defined the storage method we wanted. So say we defined enums joint, val, and adj where joint places the val_ and adj_ together on the same stack and val/adj places the val_ on one stack and adj_ on another stack. Would that help for chain methods like in precomputed_vari? It feels like it would give us better memory locality, if we doc'd it well idt this style would be that bad, and we could just default the template value to joint so that

  void chain() {
    for (size_t i = 0; i < size_; ++i) {
      varis_[i]->adj_ += adj_ * gradients_[i];
    }
  }

would get better memory locality and hit these optimizations for SIMD instructions

Expected Output

better memory locality

Current Version:

v3.0.0

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions