How to implement the joint training?

Dear authors, thanks for the amazing work.
I am confused about the implementation of video-image joint training. Do you alternate between image steps and video steps, or do you have some way to train on both jointly within the same step? Joint training in a single step seems like it would make reducing the optimizer state a bit difficult.
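
To make the two options concrete, here is a rough, framework-free sketch of what I mean. The function names, the rotation pattern, and the loss weights are my own placeholders for illustration, not anything taken from your code or paper.

```python
from itertools import cycle, islice


def alternating_schedule(num_steps, pattern=("image", "video")):
    """Option A (hypothetical): each optimizer step sees a single modality,
    chosen by rotating through `pattern`."""
    return list(islice(cycle(pattern), num_steps))


def mixed_step_loss(image_losses, video_losses, image_weight=1.0, video_weight=1.0):
    """Option B (hypothetical): a single optimizer step combines the mean
    per-sample loss of an image batch and a video batch into one scalar."""
    image_mean = sum(image_losses) / len(image_losses)
    video_mean = sum(video_losses) / len(video_losses)
    return image_weight * image_mean + video_weight * video_mean


# Option A: modality per step for the first four steps.
print(alternating_schedule(4))

# Option B: one combined loss from both batches in the same step.
print(mixed_step_loss([1.0, 3.0], [2.0]))
```

In option A every step backpropagates through one modality only, while in option B both batches contribute gradients before a single optimizer update. I would like to know which of these (if either) matches your setup.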