Calling .item() to store results inside the training step forces the GPU to synchronize so that the lazily evaluated value can be materialized as a Python number. This is suboptimal: kernel scheduling (CPU work) and kernel execution (GPU work) should run as parallel pipelines as much as possible, and every synchronization point stalls that pipeline.
On the other hand, we need _all_epoch_results at the end of an epoch for visualization purposes.
As @obilaniu has noted elsewhere, it's better to use .detach() to store results within a training step, and then process the stored results (losses included) internally to get the Python/NumPy values at the moment they are actually needed - the end of an epoch.
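A minimal sketch of that pattern (the model, loader, and optimizer here are illustrative placeholders, not the repository's actual objects):

```python
import torch
from torch import nn

# Dummy model and data just to make the sketch runnable.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
loader = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(10)]

all_epoch_results = []  # stands in for _all_epoch_results

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # NOT loss.item() -- that would force a CPU-GPU sync every step.
    # A detached tensor stays on-device; no synchronization happens yet.
    all_epoch_results.append(loss.detach())

# End of epoch: a single sync point, at the moment the numbers
# are actually needed (e.g. for visualization).
epoch_losses = torch.stack(all_epoch_results).cpu().numpy()
print(epoch_losses)
```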