Self-training with Noisy Student improves ImageNet classification

Self-training is a form of semi-supervised learning [10] which attempts to leverage unlabeled data to improve classification performance in the limited data regime. Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. Secondly, to enable the student to learn a more powerful model, we also make the student model larger than the teacher model. For more information about the large architectures, please refer to Table 7 in Appendix A.1. When data augmentation noise is used, the student must ensure that a translated image, for example, has the same category as a non-translated image. We start with the 130M unlabeled images and gradually reduce the number of images. The model with Noisy Student can successfully predict the correct labels of these highly difficult images. Note that these adversarial robustness results are not directly comparable to prior works since we use a large input resolution of 800x800 and adversarial vulnerability can scale with the input dimension [17, 20, 19, 61].
The algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and training a new student. Algorithm 1 gives an overview of self-training with Noisy Student (or Noisy Student for short). To achieve strong results on ImageNet, the student model also needs to be large, typically larger than common vision models, so that it can leverage a large number of unlabeled images. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. Our method not only improves standard ImageNet accuracy but also improves classification robustness on much harder test sets by large margins: ImageNet-A [25] top-1 accuracy from 16.6% to 74.2%, ImageNet-C [24] mean corruption error (mCE) from 45.7 to 31.2, and ImageNet-P [24] mean flip rate (mFR) from 27.8 to 16.1. mFR (mean flip rate) is the weighted average of flip probability on different perturbations, with AlexNet's flip probability as a baseline. The accuracy is improved by about 10% in most settings. Probably due to the same reason, at ε=16, EfficientNet-L2 achieves an accuracy of 1.1% under a stronger attack, PGD with 10 iterations [43], which is far from the SOTA results. On SVHN, Noisy Student Training boosts the performance of a supervised model from 97.9% accuracy to 98.6% accuracy. Original paper: https://arxiv.org/pdf/1911.04252.pdf. Authors: Qizhe Xie, Eduard Hovy, Minh-Thang Luong, Quoc V. Le.
As can be seen, our model with Noisy Student makes correct and consistent predictions as images undergo different perturbations, while the model without Noisy Student flips predictions frequently. In contrast, changing architectures or training with weakly labeled data gives modest gains in accuracy from 4.7% to 16.6%. This way, we can isolate the influence of noising on unlabeled images from the influence of preventing overfitting for labeled images. We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. The method, named self-training with Noisy Student, also benefits from the large capacity of the EfficientNet family. In our experiments, we also further scale up EfficientNet-B7 and obtain EfficientNet-L0, L1 and L2. Specifically, as all classes in ImageNet have a similar number of labeled images, we also need to balance the number of unlabeled images for each class. The pseudo labels can be soft (a continuous distribution) or hard (a one-hot distribution).
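To make the distinction between soft and hard pseudo labels concrete, here is a minimal sketch in plain Python. The function names and the example probabilities are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
def soft_pseudo_label(probs):
    """Soft label: the teacher's distribution, renormalized to sum to 1."""
    total = sum(probs)
    return [p / total for p in probs]


def hard_pseudo_label(probs):
    """Hard label: a one-hot vector at the teacher's most likely class."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == top else 0.0 for i in range(len(probs))]


# Hypothetical teacher output for one unlabeled image over three classes.
teacher_probs = [0.7, 0.2, 0.1]
print(soft_pseudo_label(teacher_probs))
print(hard_pseudo_label(teacher_probs))
```

A soft label keeps the teacher's uncertainty across classes, while a hard label discards everything except the top prediction.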
In all previous experiments, the student's capacity is as large as or larger than the capacity of the teacher model. The main difference between our work and prior works is that we identify the importance of noise, and aggressively inject noise to make the student better. Prior works on weakly-supervised learning require billions of weakly labeled data to improve state-of-the-art ImageNet models. Our experiments showed that self-training with Noisy Student and EfficientNet can achieve an accuracy of 87.4%, which is 1.9% higher than without Noisy Student. We first report the validation set accuracy on the ImageNet 2012 ILSVRC challenge prediction task as commonly done in literature [35, 66, 23, 69] (see also [55]). We use the same architecture for the teacher and the student and do not perform iterative training. Next, with the EfficientNet-L0 as the teacher, we trained a student model EfficientNet-L1, a wider model than L0. Lastly, we follow the idea of compound scaling [69] and scale all dimensions to obtain EfficientNet-L2. Self-training with Noisy Student improves ImageNet classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687-10698 (2020). We then perform data filtering and balancing on this corpus. First, we run an EfficientNet-B0 trained on ImageNet [69] over the JFT dataset to predict a label for each image. For each class, we select at most 130K images that have the highest confidence. Due to duplications, there are only 81M unique images among these 130M images.
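The filtering and balancing procedure can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `min_confidence` parameter, and the choice to pad small classes by random duplication are hypothetical, and the per-class cap is a parameter rather than the 130K used in the text.

```python
import random
from collections import defaultdict


def filter_and_balance(predictions, per_class, min_confidence=0.0, seed=0):
    """Balance pseudo-labeled images so every class has `per_class` entries.

    predictions: iterable of (image_id, predicted_class, confidence).
    min_confidence is a hypothetical knob, not a value from the text.
    """
    by_class = defaultdict(list)
    for image_id, cls, conf in predictions:
        if conf >= min_confidence:
            by_class[cls].append((conf, image_id))

    rng = random.Random(seed)
    balanced = {}
    for cls, items in by_class.items():
        # Keep the most confident images, at most per_class of them.
        items.sort(key=lambda pair: pair[0], reverse=True)
        chosen = [image_id for _, image_id in items[:per_class]]
        # Pad classes that have too few images by duplicating at random.
        while len(chosen) < per_class:
            chosen.append(rng.choice(chosen))
        balanced[cls] = chosen
    return balanced


preds = [("a", 0, 0.9), ("b", 0, 0.8), ("c", 0, 0.5), ("d", 1, 0.6)]
print(filter_and_balance(preds, per_class=2))
```

Padding by duplication is why the balanced corpus can contain fewer unique images than its nominal size.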
The algorithm is basically self-training, a method in semi-supervised learning. As stated earlier, we hypothesize that noising the student is needed so that it does not merely learn the teacher's knowledge. Our experiments show that an important element for this simple method to work well at scale is that the student model should be noised during its training while the teacher should not be noised during the generation of pseudo labels. We use EfficientNets [69] as our baseline models because they provide better capacity for more data. Noisy Student leads to significant improvements across all model sizes for EfficientNet. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. In the above experiments, iterative training was used to optimize the accuracy of EfficientNet-L2 but here we skip it as it is difficult to use iterative training for many experiments. We use EfficientNet-B0 as both the teacher model and the student model and compare using Noisy Student with soft pseudo labels and hard pseudo labels. We sample 1.3M images in confidence intervals. Summary of key results compared to previous state-of-the-art models.
Deep learning has shown remarkable successes in image recognition in recent years [35, 66, 62, 23, 69]. Then, that teacher is used to label the unlabeled data. On ImageNet-C, it reduces mean corruption error (mCE) from 45.7 to 31.2. With Noisy Student, the model correctly predicts dragonfly for the image. In particular, we set the survival probability in stochastic depth to 0.8 for the final layer and follow the linear decay rule for other layers.
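The linear decay rule for stochastic depth can be sketched as below, assuming the usual formulation in which layer l of L survives with probability 1 - (l / L) * (1 - p_L), and p_L = 0.8 for the final layer as stated in the text. How layers are indexed in the actual EfficientNet implementation is an assumption here, and the function name is hypothetical.

```python
def survival_probabilities(num_layers, final_survival=0.8):
    """Per-layer survival probability under the linear decay rule.

    Layer l (1-indexed) survives with probability
    1 - (l / num_layers) * (1 - final_survival),
    so early layers are almost always kept and the last layer is kept
    with probability final_survival.
    """
    return [
        1.0 - (l / num_layers) * (1.0 - final_survival)
        for l in range(1, num_layers + 1)
    ]


for layer, p in enumerate(survival_probabilities(4), start=1):
    print(layer, round(p, 3))
```

Under this rule, deeper layers are dropped more often, which concentrates the regularization where the network has the most capacity.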
Noisy Student Training is based on the self-training framework and trained with 4 simple steps: (1) train a classifier on labeled data (teacher); (2) use the teacher to infer pseudo labels on unlabeled data; (3) train a larger classifier on the combined set, adding noise (noisy student); (4) go back to step 2 with the student as the teacher. Since a teacher model's confidence on an image can be a good indicator of whether it is an out-of-domain image, we consider the high-confidence images as in-domain images and the low-confidence images as out-of-domain images. EfficientNet-L0 is wider and deeper than EfficientNet-B7 but uses a lower resolution, which gives it more parameters to fit a large number of unlabeled images with similar training speed. This result is also a new state-of-the-art and 1% better than the previous best method that used an order of magnitude more weakly labeled data [44, 71]. Because noise injection methods are not used in the student model, and the student model was also small, it is more difficult to make the student better than the teacher. The top-1 accuracy of prior methods is computed from their reported corruption error on each corruption. The score is normalized by AlexNet's error rate so that corruptions with different difficulties lead to scores of a similar scale.
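The AlexNet normalization can be illustrated with a small sketch. It assumes the common formulation of the corruption error, in which a model's errors on one corruption are summed over severity levels and divided by AlexNet's summed errors, and the mCE is the mean of these ratios scaled to a percentage. All error rates below are made-up numbers, and the function names are hypothetical.

```python
def corruption_error(model_errors, alexnet_errors):
    """Errors over severity levels, normalized by AlexNet's errors."""
    return sum(model_errors) / sum(alexnet_errors)


def mean_corruption_error(model, alexnet):
    """Average the normalized error over corruption types, as a percentage.

    model, alexnet: dict mapping corruption name -> list of error rates,
    one entry per severity level.
    """
    ratios = [corruption_error(model[c], alexnet[c]) for c in model]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical error rates for two corruptions at two severity levels.
model = {"noise": [0.2, 0.4], "blur": [0.1, 0.3]}
alexnet = {"noise": [0.4, 0.8], "blur": [0.4, 0.4]}
print(mean_corruption_error(model, alexnet))
```

Because each corruption is divided by the baseline's error on that same corruption, an easy corruption and a hard one contribute on a comparable scale.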
First, it makes the student larger than, or at least equal to, the teacher so the student can better learn from a larger dataset. In the following, we will first describe experiment details to achieve our results. We vary the model size from EfficientNet-B0 to EfficientNet-B7 [69] and use the same model as both the teacher and the student. These test sets are considered robustness benchmarks because the test images are either much harder, for ImageNet-A, or different from the training images, for ImageNet-C and P. For ImageNet-C and ImageNet-P, we evaluate our models on two released versions with resolution 224x224 and 299x299 and resize images to the resolution EfficientNet is trained on. The repository provides instructions on running prediction on unlabeled data, filtering and balancing data, and training using the stored predictions. The algorithm has three main steps: (1) train a teacher model on labeled images; (2) use the teacher to generate pseudo labels on unlabeled images; (3) train a student model on the combination of labeled images and pseudo labeled images.
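The teacher-student loop can be sketched on a toy one-dimensional problem. Everything here is a stand-in chosen for illustration: the "model" is a single threshold, Gaussian input jitter replaces dropout, stochastic depth and RandAugment, and the student is not larger than the teacher. All function names are hypothetical.

```python
import random


def fit_threshold(examples):
    """Toy model: the midpoint between the mean input of class 0 and class 1."""
    zeros = [x for x, y in examples if y == 0]
    ones = [x for x, y in examples if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0


def predict_clean(threshold, x):
    """Pseudo-labeling step: no noise is applied at inference time."""
    return 1 if x > threshold else 0


def fit_noisy(examples, rng, jitter=0.1):
    """Student training: Gaussian input jitter stands in for the real noise."""
    noised = [(x + rng.gauss(0.0, jitter), y) for x, y in examples]
    return fit_threshold(noised)


def noisy_student(labeled, unlabeled, iterations=3, seed=0):
    rng = random.Random(seed)
    teacher = fit_threshold(labeled)  # train a teacher on labeled data
    for _ in range(iterations):
        # The un-noised teacher generates pseudo labels.
        pseudo = [(x, predict_clean(teacher, x)) for x in unlabeled]
        # The noised student trains on labeled plus pseudo-labeled data.
        student = fit_noisy(labeled + pseudo, rng)
        teacher = student  # the student becomes the next teacher
    return teacher


labeled = [(0.0, 0), (1.0, 1)]
unlabeled = [0.1, 0.2, 0.8, 0.9]
model = noisy_student(labeled, unlabeled)
print(predict_clean(model, 0.05), predict_clean(model, 0.95))
```

The structure to note is the asymmetry: noise is applied only while the student trains, never while the teacher labels.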
We use the standard augmentation instead of RandAugment in this experiment. Similar to [71], we fix the shallow layers during finetuning. Here we show an implementation of Noisy Student Training on SVHN. For smaller models, we set the batch size of unlabeled images to be the same as the batch size of labeled images. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment to the student so that the student generalizes better than the teacher. Compared to consistency training [45, 5, 74], the self-training / teacher-student framework is better suited for ImageNet because we can train a good teacher on ImageNet using labeled data. We first improved the accuracy of EfficientNet-B7 using EfficientNet-B7 as both the teacher and the student. In both cases, we gradually remove augmentation, stochastic depth and dropout for unlabeled images, while keeping them for labeled images.
mCE (mean corruption error) is the weighted average of error rate on different corruptions, with AlexNet's error rate as a baseline. Although consistency regularization methods have produced promising results, in our preliminary experiments they work less well on ImageNet because consistency regularization in the early phase of ImageNet training regularizes the model towards high-entropy predictions and prevents it from achieving good accuracy. We have also observed that using hard pseudo labels can achieve as good results or slightly better results when a larger teacher is used. We duplicate images in classes where there are not enough images. The scripts used for our ImageNet experiments are similar: they run predictions on unlabeled data, filter and balance data, and train using the filtered data. In our implementation, labeled images and unlabeled images are concatenated together and we compute the average cross entropy loss.
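The combined loss over concatenated labeled and unlabeled images can be sketched as follows, assuming each example is represented as a pair of target and predicted distributions. The function names and the example values are hypothetical, and a real implementation would operate on batched tensors rather than Python lists.

```python
import math


def cross_entropy(target, predicted):
    """Cross entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


def combined_loss(labeled_batch, unlabeled_batch):
    """Average cross entropy over the concatenated batch.

    Each batch is a list of (target_distribution, predicted_distribution).
    Labeled targets are one-hot; unlabeled targets are the teacher's
    pseudo labels, either soft or hard.
    """
    batch = labeled_batch + unlabeled_batch
    return sum(cross_entropy(t, p) for t, p in batch) / len(batch)


labeled = [([1.0, 0.0], [0.5, 0.5])]    # one-hot ground truth
unlabeled = [([0.5, 0.5], [0.5, 0.5])]  # soft pseudo label from a teacher
print(combined_loss(labeled, unlabeled))
```

Because the two batches are simply concatenated, their relative sizes determine how much weight the pseudo-labeled images carry in the average.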