Daily CyberSecurity

The Importance of Good Data in Machine Learning

Do Son September 9, 2020 5 minutes read

Machine learning has been responsible for just about every big breakthrough in artificial intelligence over the past decade — from self-driving cars and speech recognition systems to cybersecurity systems able to sniff out bad actors online and stop them before they do any harm.

Machine learning tools don’t just follow predefined rule sets. As their name suggests, they “learn”: their behavior changes over time to reflect the data they are shown. A machine learning-based spam filter, for example, can learn to pick out spam emails by being shown many examples of both spam and non-spam emails. Show it enough of these examples and it will soon be able to identify spam on its own with little human intervention.

Provided that the examples are good ones, of course — and not subject to nefarious practices such as data skewing.

Training machine learning models

The process of teaching a machine learning model to learn is referred to as training. The examples mentioned above are what is known as “training data.” There are different approaches to training that can be employed to teach machine learning models.

In supervised learning, an algorithm is trained on a labeled dataset. An algorithm taught to distinguish between dogs and non-dogs, for example, is given pictures as input together with the desired output value for each one (“this is a dog” or “this is not a dog”). The trained model can then be used to label new examples it has not seen before.
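As a rough illustration of learning from labeled examples, the sketch below trains a nearest-centroid classifier, one of the simplest supervised methods. It is not tied to any particular product. The feature values (ear and snout length) and the names `TRAINING_DATA`, `train`, and `predict` are invented for this example; a real image classifier would work on pixel data with a far more capable model.

```python
from statistics import mean

# Hypothetical labeled training set: (features, label).
# The features are made-up measurements: (ear_length_cm, snout_length_cm).
TRAINING_DATA = [
    ((10.0, 9.0), "dog"),
    ((12.0, 11.0), "dog"),
    ((11.0, 10.0), "dog"),
    ((4.0, 2.0), "not dog"),
    ((5.0, 3.0), "not dog"),
    ((3.0, 1.0), "not dog"),
]


def train(examples):
    """Compute one centroid (mean feature vector) per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the input."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(centroids[label], features))


centroids = train(TRAINING_DATA)
print(predict(centroids, (11.5, 9.5)))  # closest to the "dog" centroid
print(predict(centroids, (4.5, 2.5)))   # closest to the "not dog" centroid
```

The model here is nothing more than the averages of the labeled examples, so whatever goes into the training set directly determines what it predicts.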

Unsupervised learning, meanwhile, gives the algorithm unlabeled data so that it can learn to extract patterns and features on its own. It is used in situations where you have input data but no corresponding output values. In the dog example, that might be lots of photos showing dogs and non-dogs that no one has had the time to sort through and label. Unsupervised learning is useful for discovering more about data by letting the computer uncover its underlying distribution or structure.
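To make the idea of finding structure without labels concrete, here is a minimal k-means clustering sketch. K-means is only one of many unsupervised techniques and is used here purely as an illustration. The points and the function names are invented, and the initial centers are simply the first `k` points so that the run is deterministic (real implementations usually choose them more carefully).

```python
from statistics import mean


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Group unlabeled points into k clusters; returns (centers, clusters)."""
    # Assumption for this sketch: start from the first k points.
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            tuple(mean(column) for column in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled data: nobody has said which group a point belongs to.
UNLABELED = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]

centers, clusters = kmeans(UNLABELED, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

The algorithm separates the points into two groups without ever being told what the groups mean; interpreting them is still left to a person.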

Machine learning, as noted, is increasingly being used in cybersecurity systems. However, as with every other aspect of cybersecurity, it is not immune to attackers looking for vulnerabilities they can exploit to their advantage. Data skewing attacks are designed to push an organization, or the systems it relies on, into making a wrong decision that favors the attacker. They do this by feeding in incorrect information that distorts the conclusions drawn from the data.

The perils of data skewing

A web analytics skewing attack works by distorting the data collected by tools such as Google Analytics or Adobe Analytics so that web visitors appear to carry out particular actions more often than they really do. This is done by using bots to perform large numbers of automated requests.
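The effect of such an attack on a reported metric can be shown with simple arithmetic. The sketch below uses entirely invented numbers (1,000 human sessions, 500 bot sessions) and a hypothetical `conversion_rate` helper; it does not call any real analytics API.

```python
def conversion_rate(sessions):
    """Fraction of sessions in which the tracked action occurred."""
    return sum(1 for s in sessions if s["clicked"]) / len(sessions)


# Hypothetical traffic: 20 of 1,000 human visitors perform the tracked action.
human_sessions = [{"source": "human", "clicked": i < 20} for i in range(1000)]
# Hypothetical bots: every automated session performs the action.
bot_sessions = [{"source": "bot", "clicked": True} for _ in range(500)]

genuine_rate = conversion_rate(human_sessions)
skewed_rate = conversion_rate(human_sessions + bot_sessions)
print(f"genuine: {genuine_rate:.1%}, reported: {skewed_rate:.1%}")
```

With these made-up figures the reported rate climbs from 2% to roughly 35%, even though real visitor behavior has not changed at all.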

Meanwhile, a machine learning data poisoning attack works by modifying the training data used to teach a machine learning algorithm, which can cause it to make the wrong decision. For example, a spam filter that uses machine learning may learn from every email it receives. Most of these emails will be categorized correctly and cause no change to the way the filter operates. Occasionally, however, a new email will be categorized incorrectly, and once that is corrected the system will reevaluate what it considers to be spam or non-spam. A data poisoning attack exploits this by sending huge numbers of emails, potentially millions, to create fake data points intended to skew the algorithm. As a result, an attacker could then send malicious emails that are no longer detected as malicious. Similar approaches could be used to fool security systems into treating abnormal, malicious bot behavior as completely innocent.
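The following toy example sketches how poisoned training data can flip a filter's verdict. It uses a tiny naive Bayes classifier with add-one smoothing, which is an assumption made for illustration and not a description of how any particular spam filter works. The emails, the `train` and `classify` helpers, and the premise that the attacker manages to get 50 crafted messages recorded as non-spam ("ham") are all hypothetical.

```python
import math
from collections import Counter


def train(examples):
    """Count word occurrences and emails per label ('spam' or 'ham')."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    email_counts = Counter()
    for text, label in examples:
        word_counts[label].update(text.lower().split())
        email_counts[label] += 1
    return word_counts, email_counts


def classify(model, text):
    """Naive Bayes with add-one smoothing; returns 'spam' or 'ham'."""
    word_counts, email_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_emails = sum(email_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(email_counts[label] / total_emails)
        for word in text.lower().split():
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical clean training data.
CLEAN = [
    ("win free prize now", "spam"),
    ("free prize claim now", "spam"),
    ("claim your free money", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team", "ham"),
    ("project report attached", "ham"),
]
TARGET = "claim free prize"  # the message the attacker wants delivered

# Assumption: 50 attacker-crafted messages end up labeled as ham.
POISON = [(TARGET, "ham")] * 50

clean_model = train(CLEAN)
poisoned_model = train(CLEAN + POISON)
print("clean model:   ", classify(clean_model, TARGET))
print("poisoned model:", classify(poisoned_model, TARGET))
```

The classifier code is identical in both runs; only the training data differs, which is exactly why the integrity of that data matters.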

Just as it’s important that school textbooks contain accurate information, it is important that machine learning algorithms have good data to learn from. There are multiple measures that you can put in place to stop learning models from being tainted by bad data points. Blocking outdated browsers or user agents can stop some of the less sophisticated attackers, who use bots based on outdated browsers. Protecting exposed APIs, mobile apps, and other public-facing endpoints can also help stop bots before they strike. Evaluating traffic sources and, in particular, spikes in usage can help reveal when a sudden surge of interest likely comes from bots. Once you’ve discovered them, you can then set about filtering them out using firewalls and other protective measures.
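One of the measures above, watching for spikes in usage, can be sketched with a simple rolling baseline. The function name `find_spikes`, the window size, the threshold of three standard deviations, and the request counts are all assumptions chosen for illustration; production systems typically use more robust statistics and several signals at once.

```python
from statistics import mean, stdev


def find_spikes(counts, window=5, threshold=3.0):
    """Return indexes where a count far exceeds the preceding window's baseline."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        # Floor the deviation at 1.0 so a perfectly flat baseline
        # does not flag every tiny fluctuation.
        sigma = max(stdev(baseline), 1.0)
        if counts[i] > mu + threshold * sigma:
            spikes.append(i)
    return spikes


# Hypothetical requests per minute, with one sudden surge.
requests_per_minute = [100, 104, 98, 101, 97, 103, 99, 950, 102, 100]
print(find_spikes(requests_per_minute))
```

A flagged minute is only a hint that bots may be involved; the traffic source still needs to be examined before anything is filtered out.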

Bring in the experts

Of course, not every business or organization has the time or know-how to stop these potential skewing attacks in their tracks. This is where cybersecurity experts can help. They will be able to introduce advanced protection measures such as device fingerprinting or machine learning behavioral analysis to identify potential bad bots as they surface — and, most importantly, before they do anything that could harm you.

Machine learning has been a game-changer in many ways for businesses and organizations. It is an incredibly useful tool but, at the end of the day, it’s just a tool. Good data in means good conclusions out, and junk data in means junk conclusions out. A machine learning system that has been poisoned with bad data will be more of a hindrance than a help to you.

Training datasets must be protected against modification. Fortunately, today the tools exist to help you do exactly that.
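One basic building block for this is an integrity check: record a cryptographic fingerprint of a trusted dataset and compare against it before each training run. The sketch below uses SHA-256 from Python's standard `hashlib` module. The `fingerprint` helper and the sample records are hypothetical, and a check like this only detects modification after the fact; it does not prevent it.

```python
import hashlib


def fingerprint(records):
    """SHA-256 digest over all records, in order, one per line."""
    digest = hashlib.sha256()
    for record in records:
        # Assumes records contain no newline characters themselves.
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical labeled records, stored as "label<TAB>text".
trusted = ["spam\twin free prize now", "ham\tmeeting agenda for monday"]
trusted_digest = fingerprint(trusted)  # store this somewhere the attacker cannot write

# Later: one label has been silently flipped.
tampered = ["ham\twin free prize now", "ham\tmeeting agenda for monday"]

print(fingerprint(trusted) == trusted_digest)   # unchanged data matches
print(fingerprint(tampered) == trusted_digest)  # modified data does not
```

The stored digest is only useful if it is kept separately from the dataset it protects.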
