Machine learning has been responsible for just about every big breakthrough in artificial intelligence over the past decade — from self-driving cars and speech recognition systems to cybersecurity systems able to sniff out bad actors online and stop them before they do any harm.
Machine learning tools don’t just follow predefined rulesets. As their name suggests, they are able to “learn” by changing over time to reflect the data they are shown. A machine learning-based spam filter, for example, can learn to pick out spam emails by being shown multiple examples of both spam and non-spam emails. Show it enough of these examples and it will soon be able to pick out spam on its own with little in the way of human intervention.
Provided that the examples are good ones, of course — and not subject to nefarious practices such as data skewing.
Training machine learning models
The process of teaching a machine learning model is referred to as training. The examples mentioned above are what is known as “training data.” There are different approaches that can be employed to train machine learning models.
In a supervised learning model, an algorithm is trained by being given a labeled dataset. That means that a machine learning algorithm taught to distinguish between dogs and non-dogs will be given pictures as input along with the desired output value (“this is a dog” or “this is not a dog”). The mapping it learns can then be applied to new, unseen examples.
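A minimal sketch of the supervised idea, using a simple nearest-neighbour rule: each training example is a hypothetical two-value feature vector with a label, and a new sample is assigned the label of its closest labelled neighbour. The features and values here are invented purely for illustration; real image classifiers work on far richer representations.

```python
# Supervised learning sketch: 1-nearest-neighbour classification.
# Each training example is (features, label); features are hypothetical
# measurements (e.g. ear length, snout length), chosen for illustration only.

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(training_data, sample):
    """Return the label of the closest labelled example."""
    nearest = min(training_data, key=lambda ex: distance(ex[0], sample))
    return nearest[1]

# The labeled dataset: the "desired output value" is attached to each input.
training_data = [
    ((8.0, 9.0), "dog"),
    ((7.5, 8.5), "dog"),
    ((2.0, 1.0), "not a dog"),
    ((1.5, 2.5), "not a dog"),
]

print(predict(training_data, (7.8, 8.8)))  # prints "dog"
```

Because every training example carries its answer, the algorithm never has to guess what the categories are; it only has to generalize from them.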
An unsupervised learning model, meanwhile, provides the algorithm with unlabeled data so that the algorithm can learn to extract patterns and features on its own. An unsupervised learning model is used in situations where you have input data but no output variables. In the dog example, that might be lots of photos of dogs and non-dogs that no one has had the time to sort through and label. Unsupervised learning is useful for discovering more about data by letting the computer uncover the underlying distribution or structure.
Machine learning, as noted, is increasingly being used in cybersecurity systems. However, as with every other aspect of cybersecurity, it is not immune to attackers trying to find vulnerabilities in it that will allow them to exploit it to their advantage. Data skewing attacks are designed to cause an organization’s systems to make wrong decisions in the attacker’s favor by feeding them incorrect information that distorts the conclusions they draw.
The perils of data skewing
A web analytics skewing attack works by modifying analytics data from the likes of Google Analytics or Adobe Analytics to make it seem like web visitors are carrying out particular actions far more often than they actually do. This is done by performing large numbers of automated queries using bots.
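A toy simulation of the effect: here `record_event` stands in for a hypothetical analytics endpoint, and the bot simply fires the same event far more often than genuine visitors do, making one action look wildly more popular than it is. All traffic numbers are invented.

```python
# Toy illustration of how automated bot traffic can skew analytics counts.
# record_event() is a stand-in for a hypothetical analytics collection endpoint.

from collections import Counter

events = Counter()

def record_event(name):
    events[name] += 1

# Genuine visitors: a plausible mix of actions (illustrative numbers).
for _ in range(100):
    record_event("page_view")
for _ in range(5):
    record_event("add_to_cart")

# Bot traffic: thousands of automated queries inflating one action.
for _ in range(1000):
    record_event("add_to_cart")

# The skewed data now suggests visitors add to cart ~10x per page view.
print(events["add_to_cart"] / events["page_view"])
```

Anyone making decisions from these numbers — stocking, pricing, ad spend — would be acting on a picture of demand the attacker manufactured.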
Meanwhile, a machine learning data poisoning attack works by modifying the training data used to teach a machine-learning algorithm, causing it to make the wrong decisions. For example, a spam filter that uses machine learning will learn from every email that is received. After a while, most of these emails will be correctly categorized by the machine learning tool and cause no change to the way that it operates. However, occasionally a new email will be categorized incorrectly and will cause the system to reevaluate what it considers to be spam or non-spam. A machine learning data poisoning attack will send millions of emails to create fake data points intended to skew the algorithm. As a result, an attacker could then send malicious emails that will not be detected as malicious. Similar approaches could be used to fool security systems into thinking that abnormal, malicious bot behavior is completely innocent.
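The poisoning dynamic can be shown with a deliberately simple word-frequency spam filter (a stand-in for the real statistical filters in use; the words and counts are invented). The filter learns from every message it sees, so flooding it with crafted “non-spam” containing spammy words shifts its judgment until genuine spam slips through.

```python
# Toy illustration of training-data poisoning against a word-frequency
# spam filter. This is a sketch, not a real filter's algorithm.

from collections import Counter

spam_words, ham_words = Counter(), Counter()

def train(text, is_spam):
    """Update word counts from a newly received, categorized email."""
    (spam_words if is_spam else ham_words).update(text.split())

def looks_like_spam(text):
    """Score a message by how spam-associated its words are."""
    score = sum(spam_words[w] - ham_words[w] for w in text.split())
    return score > 0

# Legitimate training: "free prize" appears only in spam.
for _ in range(10):
    train("free prize click now", is_spam=True)
    train("meeting agenda attached", is_spam=False)

print(looks_like_spam("free prize inside"))  # prints True: caught

# Poisoning: the attacker floods the filter with fake "non-spam"
# containing the same words, creating fake data points.
for _ in range(50):
    train("free prize newsletter", is_spam=False)

print(looks_like_spam("free prize inside"))  # prints False: spam slips through
```

The filter’s logic never changed; only its training data did. That is what makes poisoning attacks hard to spot from the outside.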
Just like it’s important that school textbooks contain accurate information, so too is it important that machine learning algorithms have good data to learn from. There are multiple measures that can be put into place to stop learning models from being tainted by bad data points. Blocking outdated browsers or user agents can stop some of the lower-level attackers who use bots based on outdated browsers. Protecting exposed APIs, mobile apps, and other public-facing endpoints can also help stop bots before they strike. Evaluating traffic sources and, particularly, spikes in usage can also help reveal when a sudden surge of interest likely comes from bots. Once you’ve discovered them, you can then set about filtering them using firewalls and other protective measures.
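Two of the measures above can be sketched in a few lines: rejecting requests whose user-agent string matches a known-outdated browser, and flagging an hour whose traffic far exceeds the recent baseline. The blocklist entries and the spike threshold are illustrative assumptions, not recommended production values.

```python
# Sketch of two anti-bot measures: user-agent blocking and spike detection.
# Blocklist and threshold are illustrative placeholders only.

OUTDATED_AGENTS = ("MSIE 6.0", "MSIE 7.0", "PhantomJS")

def allow_request(user_agent):
    """Reject clients whose user-agent matches a known-outdated browser."""
    return not any(old in user_agent for old in OUTDATED_AGENTS)

def spike_detected(hourly_counts, factor=5.0):
    """Flag the latest hour if traffic exceeds `factor` x the prior average."""
    *history, latest = hourly_counts
    baseline = sum(history) / len(history)
    return latest > factor * baseline

print(allow_request("Mozilla/5.0 (compatible; MSIE 6.0)"))  # prints False
print(spike_detected([100, 110, 95, 105, 2000]))            # prints True
```

Neither check stops a sophisticated attacker on its own, which is why they are typically layered with endpoint protection and traffic-source analysis.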
Bring in the experts
Of course, not every business or organization has the time or know-how to stop these potential skewing attacks in their tracks. This is where cybersecurity experts can help. They will be able to introduce advanced protection measures such as device fingerprinting or machine learning behavioral analysis to identify potential bad bots as they surface — and, most importantly, before they do anything that could harm you.
Machine learning has been a game-changer in many ways for businesses and organizations. It is an incredibly useful tool but, at the end of the day, it’s just a tool. Good data in means good conclusions out, and junk data in means junk conclusions out. A machine learning system that’s poisoned with bad data will be more of a hindrance than a help to you.
Training datasets must be protected against modification. Fortunately, today the tools exist to help you do exactly that.