Description
I ran GCC's missed-optimization reporter on the arK example in stat_comp_benchmarks. The gist below contains all of the Stan Math related optimization misses:
https://gist.github.com/SteveBronder/eeb870bd25e9cb5c10d6e7bff44d24bb
In particular, optimization misses like the ones below were interesting to me:
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: not suitable for gather load _6 = _5->adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: bad data references.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: not enough data-refs in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:73:33: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _5 = *_4;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _10 = *_9;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _6 = _5->adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _5->adj_ = _12;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _7 = this_16(D)->D.202934.adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:75:3: note: not vectorized: not enough data-refs in basic block.
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _8 = _1->adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _1->adj_ = _10;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _5 = _2->adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _2->adj_ = _7;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _6 = this_13(D)->D.211540.D.211444.adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _9 = this_13(D)->D.211540.D.211444.adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:21:18: note: not consecutive access _1->adj_ = Nan;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:21:18: note: not consecutive access _2->adj_ = Nan;
The operators tend to be defined pretty similarly; below is the op for addition:
namespace internal {
class add_vv_vari : public op_vv_vari {
 public:
  add_vv_vari(vari* avi, vari* bvi)
      : op_vv_vari(avi->val_ + bvi->val_, avi, bvi) {}
  void chain() {
    if (unlikely(is_any_nan(avi_->val_, bvi_->val_))) {
      avi_->adj_ = std::numeric_limits<double>::quiet_NaN();
      bvi_->adj_ = std::numeric_limits<double>::quiet_NaN();
    } else {
      avi_->adj_ += adj_;
      bvi_->adj_ += adj_;
    }
  }
};

class add_vd_vari : public op_vd_vari {
 public:
  add_vd_vari(vari* avi, double b) : op_vd_vari(avi->val_ + b, avi, b) {}
  void chain() {
    if (unlikely(is_any_nan(avi_->val_, bd_))) {
      avi_->adj_ = std::numeric_limits<double>::quiet_NaN();
    } else {
      avi_->adj_ += adj_;
    }
  }
};
}  // namespace internal
The main miss we have in the above is that vari stores its adj_ and val_ next to each other on the stack allocator, when we really only need access to the adjoint values. This is a known issue (Bob has a very nice post related to this on Discourse), but I wonder if there is a dumber approach that could resolve this particular issue.
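For context, the layout in question looks roughly like this (a trimmed-down sketch of vari from stan/math/rev/core/vari.hpp, not the full class): every node keeps val_ and adj_ side by side, and the reverse pass only reaches adj_ through a vari* indirection, which is exactly the non-consecutive access the notes above complain about.

// Trimmed-down sketch of vari; the real class also registers itself on the
// AD stack allocator, so val_/adj_ pairs end up interleaved in memory.
class vari {
 public:
  const double val_;  // value, fixed at construction
  double adj_;        // adjoint, accumulated during chain()
  explicit vari(double x) : val_(x), adj_(0.0) {}
  virtual void chain() {}
};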
What if we added a template parameter to stack_alloc (or maybe vari?) that defined the storage scheme we wanted? Say we defined enums joint, val, and adj, where joint places val_ and adj_ together on the same stack while val/adj places val_ on one stack and adj_ on another. Would that help for chain() methods like the one in precomputed_gradients_vari? It feels like it would give us better memory locality, and if we documented it well I don't think this style would be that bad. We could just default the template value to joint, so that
void chain() {
  for (size_t i = 0; i < size_; ++i) {
    varis_[i]->adj_ += adj_ * gradients_[i];
  }
}
would get better memory locality and hit these SIMD optimizations.
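To make that concrete, here is a rough sketch of the idea; none of these names exist in Stan Math today (storage, stack_alloc_t, and chain_split are all hypothetical), it just shows the enum and what the precomputed-gradients style loop could look like once the adjoints live in their own contiguous block:

#include <cstddef>

// Hypothetical sketch only: storage, stack_alloc_t, and chain_split are
// made-up names, not existing Stan Math API.
enum class storage { joint, val, adj };

// joint keeps today's layout (val_ and adj_ interleaved on one stack);
// val/adj would put values on one stack and adjoints on another.
template <storage S = storage::joint>
class stack_alloc_t {
  // ... allocate val_ / adj_ on one or two stacks depending on S ...
};

// With the split layout, the chain() above could walk a plain double* of
// adjoints instead of chasing vari* pointers, a stride-1 access pattern
// the vectorizer can handle.
inline void chain_split(double* adj, const double* gradients,
                        double this_adj, std::size_t size) {
  for (std::size_t i = 0; i < size; ++i) {
    adj[i] += this_adj * gradients[i];  // contiguous loads/stores
  }
}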
Expected Output
Better memory locality and SIMD-friendly chain() methods.
Current Version:
v3.0.0