I believe there are some major discrepancies between this repo's implementation and the original paper. You may want to look at the PixelCNN and Gated PixelCNN papers and implement the context predictor with masked convolutions and gated units. There is no need to split the encoded feature map into 3x3 pieces and average-pool them: a single stack of PixelCNN layers already aggregates the features from the rows above. In the current implementation, the aggregated information is limited to the size of each sliced square piece (3 x 3). The context should instead draw on long-range information so that it can capture "slow changing" structure (borrowing the wording from the CPC paper).
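To make the receptive-field point concrete, here is a rough, framework-free sketch in plain Python. It is not code from this repo or from either paper: `causal_mask` and `receptive_field` are hypothetical helper names, and the sketch assumes one type-A masked layer followed by type-B layers with plain masked convolutions. It leaves out the gating and the separate vertical/horizontal stacks that Gated PixelCNN uses to remove the blind spot. It only tracks which feature-map positions can influence a given output position.

```python
def causal_mask(kernel_size, mask_type):
    """Return a k x k 0/1 mask for a PixelCNN-style masked convolution.

    Rows above the centre are fully visible, the centre row is visible only
    to the left of the centre, and the centre itself is visible only for
    mask type "B".
    """
    assert kernel_size % 2 == 1 and mask_type in ("A", "B")
    c = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for i in range(kernel_size):
        for j in range(kernel_size):
            above = i < c
            left_in_centre_row = i == c and j < c
            centre = i == c and j == c
            if above or left_in_centre_row or (centre and mask_type == "B"):
                mask[i][j] = 1
    return mask


def receptive_field(height, width, row, col, n_layers, kernel_size=3):
    """Input positions that can influence output (row, col).

    Assumes layer 0 uses mask "A" and every later layer uses mask "B".
    The dependency set is traced backwards from the output to the input.
    """
    c = kernel_size // 2
    frontier = {(row, col)}
    for layer in reversed(range(n_layers)):
        mask = causal_mask(kernel_size, "A" if layer == 0 else "B")
        reached = set()
        for r, q in frontier:
            for i in range(kernel_size):
                for j in range(kernel_size):
                    if mask[i][j]:
                        rr, qq = r + i - c, q + j - c
                        # Positions outside the map correspond to zero padding.
                        if 0 <= rr < height and 0 <= qq < width:
                            reached.add((rr, qq))
        frontier = reached
    return frontier


if __name__ == "__main__":
    # Compare how many positions feed the bottom-centre cell of a 7 x 7 map.
    for depth in (1, 3, 5):
        print(depth, len(receptive_field(7, 7, 6, 3, depth)))
```

Under these assumptions, each extra masked layer lets the context reach one more row upward, so the amount of aggregated information grows with depth instead of being capped by a fixed 3 x 3 patch.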