Description
While the idea of adversarial training is straightforward (generate adversarial examples during training and train on them until the model learns to classify them correctly), in practice it is difficult to get right. The basic idea has been independently developed at least twice and was the focus of several papers before all of the right ideas were combined by Madry et al. to form the strongest defense to date. A cursory analysis of the re-implementation of this defense reveals at least three flaws:
- Incorrect loss function. The loss function used in the original paper is a loss on the adversarial examples only, whereas this paper mixes adversarial examples and original examples to form the loss function (see the sketch after this list).
- Incorrect model architectures. In the original paper, the authors make three claims for the novelty of their method. One of these claims states: “To reliably withstand strong adversarial attacks, networks require a significantly larger capacity than for correctly classifying benign examples only.” The code that re-implements this defense does not follow this advice and instead uses a substantially smaller model than recommended.
- Incorrect hyperparameter settings. The original paper trains its MNIST model for 83 epochs; in contrast, the re-implementation here trains for only 20 epochs (roughly 4x fewer iterations).
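
To make the loss-function point concrete, here is a minimal sketch of one PGD adversarial-training step in PyTorch. The function names (`pgd_attack`, `train_step`) and the MNIST-style attack parameters (eps=0.3, step size 0.01, 40 steps) are illustrative assumptions, not code taken from either the original release or the DeepSec re-implementation.

```python
# Minimal sketch of PGD adversarial training under the l_inf threat model.
# Hyperparameters follow the MNIST settings reported by Madry et al.;
# everything else here is an illustrative assumption.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """Generate l_inf-bounded adversarial examples with projected gradient descent."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()             # ascent step on the loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)    # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)                                # stay in valid pixel range
    return x_adv.detach()

def train_step(model, optimizer, x, y):
    model.eval()                        # attack the current model
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    # Madry et al.: the training loss is computed on the adversarial examples only.
    loss = F.cross_entropy(model(x_adv), y)
    # The re-implementation instead mixes clean and adversarial losses, roughly:
    #   loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The commented-out line marks the difference at issue: mixing in a clean-example term changes the objective being minimized relative to the adversarial-only loss of the original defense.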
Possibly because of these implementation differences, the DeepSec report finds (incorrectly) that a more basic form of adversarial training performs better than PGD adversarial training.
I didn't examine the re-implementations of any of the other defenses; the fact that I'm not raising other issues is not because there are none, just that I didn't look for any others.