When you run a training loop with the current cycle of dataloaders, the model learns to predict "forward" for every action in FourRooms, regardless of what the robot sees. My guess is that "forward" comes up so often in the data that always predicting it already scores well, so the model settles on that shortcut. It might still work with enough training and the right hyperparameters and optimizer/scheduler, but it would be nice to have a demo to verify.
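
One quick way to test the "forward is just too common" guess is to compare the model against a baseline that always predicts the most frequent action. The sketch below is only illustrative: the helper names `majority_baseline` and `prediction_share` are hypothetical, and the action lists are made up rather than taken from real FourRooms data.

```python
from collections import Counter

def majority_baseline(actions):
    """Return (most frequent action, accuracy of always predicting it)."""
    counts = Counter(actions)
    action, count = counts.most_common(1)[0]
    return action, count / len(actions)

def prediction_share(predictions, action):
    """Fraction of model predictions that equal the given action."""
    return sum(p == action for p in predictions) / len(predictions)

# Made-up labels and predictions, for illustration only.
targets = ["forward"] * 7 + ["left"] * 2 + ["right"] * 1
predictions = ["forward"] * 10  # a model that has collapsed to one action

label, baseline_acc = majority_baseline(targets)
collapsed = prediction_share(predictions, label)
model_acc = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```

If `model_acc` is about equal to `baseline_acc` and `collapsed` is close to 1.0, the model is doing no better than the constant predictor, which would be consistent with the imbalance explanation.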