Daily CyberSecurity

The Importance of Good Data in Machine Learning

Ddos September 9, 2020 5 minutes read
Machine learning has been responsible for just about every big breakthrough in artificial intelligence over the past decade — from self-driving cars and speech recognition systems to cybersecurity systems able to sniff out bad actors online and stop them before they do any harm.

Machine learning tools don’t just follow prescribed rule sets. As their name suggests, they “learn” by adjusting over time to reflect the data they are shown. A machine learning-based spam filter, for example, can learn to pick out spam by being shown many labeled examples of both spam and non-spam emails. Show it enough of these examples and it can flag new spam on its own with little human intervention.
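To make the spam filter example concrete, here is a minimal sketch of a word-counting (naive Bayes) classifier. It is an illustration under stated assumptions, not how any particular product works: the four training emails, the "spam"/"ham" label names, and the function names are all hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def train(examples):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the higher log-probability for the text."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_examples = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_examples)
        for word in tokenize(text):
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: a real filter would need far more examples.
EXAMPLES = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

word_counts, label_counts = train(EXAMPLES)
print(classify("free prize money", word_counts, label_counts))             # spam
print(classify("agenda for the team meeting", word_counts, label_counts))  # ham
```

The filter never receives a rule such as "block the word free"; its behavior comes entirely from the counts in the training examples, which is why the quality of those examples matters so much.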

Provided that the examples are good ones, of course — and not subject to nefarious practices such as data skewing.

Training machine learning models

Teaching a machine learning model is referred to as training, and the examples mentioned above are known as “training data.” Several different approaches to training can be employed.

In a supervised learning model, an algorithm is trained on a labeled dataset. That means that a machine learning algorithm taught to distinguish between dogs and non-dogs will be given pictures as input together with the desired output value (“this is a dog” or “this is not a dog”). The trained model can then be used to classify new examples.
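As a rough sketch of supervised learning, the snippet below uses a one-nearest-neighbor rule: a new example receives the label of the closest labeled training example. Real image classifiers work on pixels and are far more elaborate. The two numeric features (weight and height) and every data point here are hypothetical stand-ins for the pictures in the dog example.

```python
import math

# Hypothetical labeled dataset: (weight in kg, height in cm) -> label.
LABELED = [
    ((30.0, 60.0), "dog"),
    ((25.0, 55.0), "dog"),
    ((8.0, 30.0), "dog"),
    ((4.0, 25.0), "not dog"),
    ((0.5, 10.0), "not dog"),
    ((500.0, 160.0), "not dog"),
]

def predict(features, labeled):
    """Return the label of the closest training example (1-nearest neighbor)."""
    nearest = min(labeled, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]

print(predict((28.0, 58.0), LABELED))  # dog
print(predict((0.4, 9.0), LABELED))    # not dog
```

Because each prediction is copied from a labeled example, a mislabeled training point directly produces wrong answers for anything that lands near it.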

An unsupervised learning model, meanwhile, provides the algorithm with unlabeled data so that the algorithm can learn to extract patterns and features on its own. An unsupervised learning model is used in situations where you have input data but no output variables. In the dog example, that might be lots of photos showing dogs and non-dogs, but which no one has had the time to sort through and label. Unsupervised learning is useful for discovering more about data by letting the computer uncover the underlying distribution or structure.
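One common unsupervised technique is clustering. The sketch below runs a tiny two-cluster k-means on one-dimensional data to show how structure can be found without labels. The animal weights are hypothetical, and starting the two centers at the minimum and maximum value is a simplification chosen to keep the result deterministic.

```python
def kmeans_1d(values, iterations=10):
    """Split values into two clusters; centers start at the min and max."""
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to the nearer of the two centers.
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[idx].append(v)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical unlabeled measurements (animal weights in kg).
weights = [4.0, 5.0, 4.5, 30.0, 28.0, 32.0]
centers, clusters = kmeans_1d(weights)
print(centers)   # [4.5, 30.0]
print(clusters)  # [[4.0, 5.0, 4.5], [30.0, 28.0, 32.0]]
```

The algorithm discovers two groups without being told what they are; deciding that one group is "small animals" and the other "large animals" is still left to a person.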

Machine learning, as noted, is increasingly being used in cybersecurity systems. However, as with every other aspect of cybersecurity, it is not immune to attackers looking for vulnerabilities they can exploit to their advantage. Data skewing attacks are designed to cause an organization to make a wrong decision that favors the attacker. They do this by feeding the organization’s systems incorrect information to affect the conclusions those systems draw.

The perils of data skewing

A web analytics skewing attack works by manipulating the data collected by tools such as Google Analytics or Adobe Analytics to make it seem like web visitors are carrying out particular actions more often than they really do. This is done by performing large numbers of automated requests using bots.
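A small worked example shows how little bot traffic it takes to distort a metric. All numbers below are hypothetical: 1,000 human sessions, of which 20 perform the tracked action, plus 500 bot sessions that always perform it.

```python
def conversion_rate(sessions):
    """Fraction of sessions in which the tracked action occurred."""
    converted = sum(1 for s in sessions if s["converted"])
    return converted / len(sessions)

# Hypothetical traffic: 20 of 1,000 human sessions convert; every bot does.
human = [{"source": "human", "converted": i < 20} for i in range(1000)]
bots = [{"source": "bot", "converted": True} for _ in range(500)]

print(f"{conversion_rate(human):.1%}")         # 2.0%
print(f"{conversion_rate(human + bots):.1%}")  # 34.7%
```

In this sketch the true rate is 2.0%, but the reported rate jumps to 34.7% once the bot sessions are mixed in, which could easily lead to a wrong business decision.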

Meanwhile, a machine learning data poisoning attack works by modifying the training data used to teach a machine learning algorithm. This can cause it to make the wrong decision. For example, a spam filter that uses machine learning may learn from every email it receives. After a while, most of these emails will be categorized correctly and cause no change to the way the filter operates. Occasionally, however, a new email will be categorized incorrectly and will cause the system to reevaluate what it considers to be spam or non-spam. In a data poisoning attack, the attacker sends very large volumes of emails to create fake data points intended to skew the algorithm. As a result, the attacker could then send malicious emails that are not detected as malicious. Similar approaches could be used to fool security systems into treating abnormal, malicious bot behavior as completely innocent.
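The effect of poisoning can be sketched with a deliberately simple scoring filter. This is an illustration only: the scoring rule, the messages, and the 0.5 threshold are hypothetical, and the sketch assumes the attacker has some way to get crafted messages into the training set labeled as legitimate ("ham"), for example through manipulated feedback.

```python
from collections import Counter

def train(examples):
    """Count, for each word, how many spam and ham messages contain it."""
    spam, ham = Counter(), Counter()
    for text, label in examples:
        words = set(text.lower().split())
        (spam if label == "spam" else ham).update(words)
    return spam, ham

def spam_score(text, spam, ham):
    """Average per-word spam fraction; unseen words count as neutral (0.5)."""
    fractions = []
    for word in text.lower().split():
        seen = spam[word] + ham[word]
        fractions.append(spam[word] / seen if seen else 0.5)
    return sum(fractions) / len(fractions)

CLEAN = [
    ("claim your free prize", "spam"),
    ("free prize inside", "spam"),
    ("project status update", "ham"),
    ("status meeting moved", "ham"),
]
# Assumption: 20 attacker-crafted messages end up labeled as ham.
POISON = [("free prize status update", "ham")] * 20

target = "free prize today"
clean_model = train(CLEAN)
poisoned_model = train(CLEAN + POISON)
print(spam_score(target, *clean_model) > 0.5)     # True: flagged as spam
print(spam_score(target, *poisoned_model) > 0.5)  # False: slips through
```

The code of the filter is identical in both runs. Only the training data changed, yet the same message moves from blocked to allowed.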

Just as it’s important that school textbooks contain accurate information, so too is it important that machine learning algorithms have good data to learn from. There are multiple measures that can be put into place to stop learning models from being tainted by bad data points. Blocking outdated browsers or user agents can stop some of the lower-level attackers that use bots based on outdated browsers. Protecting exposed APIs, mobile apps, and other public-facing endpoints can also help stop bots before they strike. Evaluating traffic sources and, particularly, spikes in usage can help reveal when a sudden surge of interest likely comes from bots. Once you’ve discovered them, you can then set about filtering them using firewalls and other protective measures.
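Two of these measures, flagging usage spikes and rejecting outdated user agents, can be sketched in a few lines. Treat this as a starting point under stated assumptions rather than a production defense: the "three times the median" rule, the Chrome version cutoff, and the sample data are all hypothetical, and user-agent strings are easy for a determined attacker to forge.

```python
import re
import statistics

MIN_CHROME_MAJOR = 100  # hypothetical cutoff; choose one that fits your traffic

def find_spikes(hourly_counts, factor=3.0):
    """Return indices of hours whose count exceeds factor x the median."""
    baseline = statistics.median(hourly_counts)
    return [i for i, c in enumerate(hourly_counts) if c > factor * baseline]

def is_outdated_chrome(user_agent):
    """True if the user agent reports a Chrome major version below the cutoff."""
    match = re.search(r"Chrome/(\d+)\.", user_agent)
    return bool(match) and int(match.group(1)) < MIN_CHROME_MAJOR

# Hypothetical requests per hour; hours 4 and 5 show a sudden surge.
counts = [120, 130, 110, 125, 900, 950, 115, 128]
print(find_spikes(counts))  # [4, 5]

old_ua = "Mozilla/5.0 (Windows NT 6.1) Chrome/49.0.2623.112 Safari/537.36"
new_ua = "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0.0.0 Safari/537.36"
print(is_outdated_chrome(old_ua))  # True
print(is_outdated_chrome(new_ua))  # False
```

The median is used as the baseline because, unlike the mean, it is barely moved by the spike itself. Flagged hours are candidates for review, not proof of bot activity.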

Bring in the experts

Of course, not every business or organization has the time or know-how to stop these potential skewing attacks in their tracks. This is where cybersecurity experts can help. They will be able to introduce advanced protection measures such as device fingerprinting or machine learning behavioral analysis to identify potential bad bots as they surface — and, most importantly, before they do anything that could harm you.

Machine learning has been a game-changer in many ways for businesses and organizations. It is an incredibly useful tool but, at the end of the day, it’s just a tool. Good data in means good conclusions out, and junk data in means junk conclusions out. A machine learning system that’s poisoned with bad data will be more of a hindrance than a help to you.

Training datasets must be protected against modification. Fortunately, today the tools exist to help you do exactly that.


Tags: Data Machine Learning
