{"id":667,"date":"2019-04-23T10:00:00","date_gmt":"2019-04-23T10:00:00","guid":{"rendered":"https:\/\/thehive.ai\/blog\/?p=667"},"modified":"2024-07-04T17:00:34","modified_gmt":"2024-07-04T17:00:34","slug":"effect-of-dirty-data-on-deep-learning","status":"publish","type":"post","link":"https:\/\/thehive.ai\/blog\/effect-of-dirty-data-on-deep-learning","title":{"rendered":"The Effect of Dirty Data on Deep Learning Systems"},"content":{"rendered":"\n<h2>Introduction<\/h2>\n\n\n\n<p>Better training data can significantly boost the performance of a deep learning model, especially when deployed in production. In this blog post, we will illustrate the impact of dirty data, and why correct labeling is important for increasing the model accuracy.<\/p>\n\n\n\n<h2>Background<\/h2>\n\n\n\n<p>An adversarial attack fools an image classifier by adding an imperceptible amount of noise to an image. One possible way to defend against this is to simply train machine learning models on adversarial examples. We can collect various hard mining examples and add them to the dataset. Another interesting model architecture to explore is generative adversarial network, which generally consist of two parts: a generator to generate fake examples in order to fool the discriminator, and a discriminator to discriminate between clean\/fake examples.<\/p>\n\n\n\n<p>Another possible type of attack, data poisoning, can happen during training time. The attacker can identify the weak parts of a machine learning architecture, and potentially modify the training data to confuse the model. Even slight perturbations to the training data and label can result in worse performance. There are several methods to defend against such data poisoning attacks. 
For example, it is possible to separate clean training examples from poisoned ones, so that the outliers are deleted from the dataset.<\/p>\n\n\n\n<p>In this blog post, we investigate the impact of data poisoning (dirty data) using a simulation method: random labeling noise. We will show that with the same model architecture and dataset size, we are able to achieve a large accuracy increase through better data labeling.<\/p>\n\n\n\n<h2>Data<\/h2>\n\n\n\n<p>We experiment with the <a href=\"https:\/\/www.cs.toronto.edu\/~kriz\/cifar.html\" target=\"_blank\" rel=\"noreferrer noopener\">CIFAR-100 dataset<\/a>, which has 100 classes and 600 32&#215;32 coloured images per class.<\/p>\n\n\n\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"652\" height=\"524\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2b6f6ee.jpg\" alt=\"\" class=\"wp-image-755\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2b6f6ee.jpg 652w, https:\/\/staticblog.thehive.ai\/uploads\/2024\/07\/2b6f6ee-300x241.jpg 300w\" sizes=\"(max-width: 652px) 100vw, 652px\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column\">\n<p class=\"has-text-align-left\">We use the following steps to preprocess the images in the dataset:<\/p>\n\n\n\n<ul><li>Pad each image to 36&#215;36, then randomly crop a 32&#215;32 patch<\/li><li>Apply a random horizontal flip<\/li><li>Randomly distort image brightness and contrast<\/li><\/ul>\n<\/div>\n<\/div>\n\n\n\n<p>The dataset is randomly split into 50k training images and 10k evaluation images. <strong>Random labeling is the substitution of training data labels with random labels drawn from the marginal distribution of the data labels.<\/strong> Different amounts of random labeling noise are added to the training data: we simply shuffle a certain fraction of the labels for each class. The images to be shuffled are chosen randomly from each class. 
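This shuffling step can be sketched in NumPy; the sketch below is a minimal illustration under our own assumptions (function names and a 500-images-per-class training split are ours, not the exact experiment code):

```python
import numpy as np

def add_label_noise(labels, noise_frac, num_classes, seed=0):
    """Replace a fraction of each class's labels with random labels.

    Selecting the same fraction per class keeps the corrupted
    dataset balanced, mirroring the shuffling scheme described here.
    """
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)            # indices of images in class c
        n_noisy = int(round(noise_frac * len(idx)))  # how many labels to corrupt
        chosen = rng.choice(idx, size=n_noisy, replace=False)
        # Draw replacements uniformly; a draw may coincide with the true label,
        # so the effective flip rate is slightly below noise_frac.
        noisy[chosen] = rng.integers(0, num_classes, size=n_noisy)
    return noisy

# 100 classes with a balanced (hypothetical) 500-image training split per class
labels = np.repeat(np.arange(100), 500)
noisy = add_label_noise(labels, noise_frac=0.2, num_classes=100)
```

Because the replacement labels are drawn independently, about 1% of the corrupted labels land back on their true class, so a nominal 20% noise level flips roughly 19.8% of the labels.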
Because of the randomness, the generated dataset is still balanced. Note that the evaluation labels are not changed.<\/p>\n\n\n\n<p>We test the model with 4 different datasets: 1 clean and 3 noisy ones.<\/p>\n\n\n\n<ul><li>Clean: no random noise. We assume that all labels in the CIFAR-100 dataset are correct. Named \u2018no_noise\u2019.<br><\/li><li>Noisy: 20% random labeling noise. Named \u2018noise_20\u2019.<br><\/li><li>Noisy: 40% random labeling noise. Named \u2018noise_40\u2019.<br><\/li><li>Noisy: 60% random labeling noise. Named \u2018noise_60\u2019.<\/li><\/ul>\n\n\n\n<p>We choose aggressive data poisoning because the production models we build are robust to small amounts of random noise. The random labeling scheme allows us to simulate the effect of dirty data (data poisoning) in real-world scenarios.<\/p>\n\n\n\n<h2>Model<\/h2>\n\n\n\n<p>We investigate the impact of dirty data on a popular model architecture, ResNet-152. Normally it is a good idea to fine-tune pre-trained checkpoints to get better accuracy with fewer training steps. 
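As a side note on the training setup, the cosine learning rate decay we pair with SGD can be sketched as follows (the base rate and step count here are hypothetical placeholders, not the actual training configuration):

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Cosine learning rate decay: smoothly anneal from base_lr to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# The rate starts at base_lr, passes through the midpoint of the range
# halfway through training, and reaches min_lr at the final step.
print(cosine_decay_lr(0, 1000))     # 0.1
print(cosine_decay_lr(500, 1000))   # ~0.05
print(cosine_decay_lr(1000, 1000))  # 0.0
```

The smooth annealing avoids the abrupt loss spikes that stepwise schedules can cause, which is one common reason to prefer cosine decay.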
In this blog post, the model is trained from scratch, because we want a general idea of how noisy data affects training and the final results without any prior knowledge gained from pretraining.<\/p>\n\n\n\n<p>We optimize the model with the SGD (stochastic gradient descent) optimizer and cosine learning rate decay.<\/p>\n\n\n\n<h2>Results<\/h2>\n\n\n\n<p><strong>Quantitative results:<\/strong><\/p>\n\n\n\n<p>Accuracy<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"273\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/1-1024x273.jpg\" alt=\"\" class=\"wp-image-763\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/1-1024x273.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/1-300x80.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/1-768x205.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/1-1536x410.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/1.jpg 1880w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Cleaner datasets consistently perform better on the validation set. The model trained on the original CIFAR-100 dataset gives us 0.65 accuracy; using the top-5 predictions boosts the accuracy to 0.87. Testing accuracy decreases as more noise is added. <strong>Each time we add 20% more random noise to the training data, testing accuracy drops by about 10%.<\/strong> Note that even if we add 60% random labeling noise, our model still manages to reach 0.24 accuracy on the validation set. The variance of the training data, the preprocessing methods, and the regularization terms all help increase the robustness of the model. 
So even if it is learning from a very noisy dataset, the model is still able to learn certain useful features, although the overall performance significantly degrades.<\/p>\n\n\n\n<p><strong>Qualitative results:<\/strong><\/p>\n\n\n\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"898\" height=\"532\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/2.jpg\" alt=\"Learning curve\" class=\"wp-image-764\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/2.jpg 898w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/2-300x178.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/2-768x455.jpg 768w\" sizes=\"(max-width: 898px) 100vw, 898px\" \/><figcaption>Learning curve<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"896\" height=\"532\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/3.jpg\" alt=\"Losses\" class=\"wp-image-765\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/3.jpg 896w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/3-300x178.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/3-768x456.jpg 768w\" sizes=\"(max-width: 896px) 100vw, 896px\" \/><figcaption>Losses<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"230\" src=\"https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/4-1024x230.jpg\" alt=\"Precision recall curves\" class=\"wp-image-766\" srcset=\"https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/4-1024x230.jpg 1024w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/4-300x67.jpg 300w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/4-768x172.jpg 768w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/4-1536x345.jpg 1536w, https:\/\/staticblog.thehive.ai\/uploads\/2019\/04\/4.jpg 1880w\" 
sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Precision recall curves<\/figcaption><\/figure>\n\n\n\n<h2>Conclusion<\/h2>\n\n\n\n<p>In this post we investigate the impact of data poisoning attacks on performance, using image classification as an example task and random labeling as the simulation method. We show that a popular model (ResNet-152) is somewhat robust to data poisoning, but its performance still significantly degrades after poisoning. 
High-quality labeling is thus crucial to modern deep learning systems.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post will illustrate the impact of dirty data and why correct labeling is important for increasing the model accuracy.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"kia_subtitle":""},"categories":[8,4],"tags":[],"_links":{"self":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts\/667"}],"collection":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/comments?post=667"}],"version-history":[{"count":4,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts\/667\/revisions"}],"predecessor-version":[{"id":769,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts\/667\/revisions\/769"}],"wp:attachment":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/media?parent=667"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/categories?post=667"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/tags?post=667"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}