The Development of Security Analysis Technology
As enterprises grow and deploy more security equipment, the volume of information security analysis data is increasing exponentially. Information security analysis is therefore becoming more and more important: data sources are richer, data types more varied, analysis dimensions wider, and data is generated faster, so the response capability demanded of information security analysis rises accordingly.
Traditional information security analysis is based mainly on two categories of data, traffic and logs, correlated with assets, business behavior, external intelligence, and other sources. Traffic-based security analysis applications include malicious code detection, botnet detection, abnormal traffic detection, and Web security analysis. Log-based security analysis applications include security auditing, host intrusion detection, and so on.
Introducing big data analysis technology into information security analysis integrates scattered security data. Through efficient collection, storage, retrieval, and analysis, multi-stage, multi-level correlation analysis and abnormal-behavior classification and prediction models can uncover APT attacks, data leakage, DDoS attacks, harassment and fraud, spam, and the like, making security defense more proactive.
Moreover, big data analysis involves more comprehensive data, including data generated by application scenarios, data "created" through an activity or piece of content, related background data, and contextual data. How to process and analyze these data efficiently and effectively is a problem that security big data technology must address.
Big data, with its 4V characteristics (Volume, Variety, Velocity, and Value), can deliver high-capacity, low-cost, high-efficiency security analysis capabilities that meet security data processing and analysis requirements. Applied to the field of information security, big data can effectively identify a variety of attacks and security incidents, and thus has significant research value.
The Rise of Big Data Technology
The core idea of security big data analysis is to analyze abnormal network behavior: through massive data processing and learned models, abnormal behaviors and their associated characteristics are identified from the data; targeted correlation analysis methods are designed for different security scenarios; and the storage and computing advantages of big data are exploited to mine rich data sources deeply and so uncover security issues. Security big data analysis mainly comprises big data collection, storage, retrieval, intelligent security analysis, and visualization.
- Security data collection, storage, and retrieval: big data collection, storage, and retrieval technologies can fundamentally improve the efficiency of security data analysis. Many types of data are collected, such as business data, traffic data, security device logs, and public opinion data, and using a collection method suited to each data type improves collection efficiency. Log information can be gathered with tools such as Chukwa, Flume, and Scribe; traffic data can be collected with traffic-capture methods and stored and analyzed with Storm and Spark; and fixed-format business data can be stored with HBase or GBase and analyzed with MapReduce and Hive, enabling real-time data retrieval and greatly improving the efficiency of data processing.
- Intelligent analysis of security data: parallel storage and NoSQL databases improve the efficiency of data analysis and querying, but accurately uncovering security issues in massive data also requires intelligent analysis tools, including ETL tools (for pre-processing), statistical modeling tools (such as regression analysis, time-series prediction, and multivariate statistical analysis), machine learning tools (such as Bayesian networks, logistic regression, decision trees, and random forests), and social network analysis tools (such as correlation analysis, hidden Markov models, and conditional random fields). Commonly used big data analysis approaches include a priori (association-rule) analysis, classification and prediction analysis, probabilistic graphical models, and correlation analysis; analysis tools such as Mahout and MLlib can be used for data mining, as sketched after this list. To sum up, a complete security data analysis platform should be divided into a data acquisition layer, a big data storage layer, a data mining and analysis layer, and a visual display layer. By fusing and analyzing multi-source heterogeneous data such as traffic, logs, business data, and intelligence information in a distributed manner, and building analysis models for different scenarios, information security can be made controllable and the overall security situation can be presented.
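As a minimal sketch of such mining (the HDFS path and the per-connection feature format are assumptions for illustration), the following Scala program clusters collected traffic features with MLlib's KMeans and flags points far from every cluster center as candidate anomalies:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object AnomalyClusteringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AnomalyClustering").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical CSV of per-connection features already collected into HDFS,
    // e.g. "bytesSent,bytesReceived,durationSeconds".
    val features = sc.textFile("hdfs:///security/features/connections.csv")
      .map(_.split(",").map(_.toDouble))
      .map(Vectors.dense)
      .cache()

    // Cluster the traffic into 5 behavior groups, 20 iterations.
    val model = KMeans.train(features, 5, 20)

    // Points far from every cluster center are candidate anomalies.
    val threshold = 1000.0 // assumed cut-off; would be tuned on real data
    val anomalies = features.filter { v =>
      val center = model.clusterCenters(model.predict(v))
      math.sqrt(Vectors.sqdist(v, center)) > threshold
    }
    println(s"candidate anomalies: ${anomalies.count()}")

    spark.stop()
  }
}
```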
The Emergence of Spark Technology
To remedy the computational shortcomings of the distributed Hadoop platform and improve the execution of its algorithms, Spark was born at the University of California, Berkeley in 2009 as a next-generation computing platform technology. This new generation of distributed big data processing framework makes up for Hadoop's deficiencies in many respects, making platform computation and batch processing more efficient and lower in latency.
Spark builds on the Hadoop platform's HDFS distributed file system and uses a distributed master-slave node model of Driver and Worker, providing working-set services through a distributed in-memory computing abstraction. It offers good programming interfaces and supports multiple languages such as Java, Scala, and Python, while greatly reducing code volume: parallel code that originally ran to hundreds of lines can be compressed into dozens. Its rich set of components also makes application development much easier. The classic word count sketched below illustrates this conciseness.
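A minimal sketch of a complete distributed word count in Scala (the input and output paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()

    // Read a text file, split into words, and count occurrences in parallel.
    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/output") // hypothetical output path
    spark.stop()
  }
}
```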
Spark Streaming
Spark Streaming processes real-time streaming data using a micro-batch computation model. It operates on DStream data, which is simply a series of resilient distributed datasets (RDDs) used to handle real-time data.
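As a minimal sketch (the collector host, port, and log format are assumptions), the following job counts events per source IP in five-second micro-batches; under the hood each batch is an ordinary RDD:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogEventCount")
    // Each micro-batch covers 5 seconds of incoming data.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical TCP source emitting one space-delimited log event per line,
    // with the source IP as the first field.
    val lines = ssc.socketTextStream("collector.example.com", 9999)

    // Count events per source IP within each micro-batch.
    lines.map(line => (line.split(" ")(0), 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```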
Spark SQL
Spark SQL can expose Spark datasets through the JDBC API, and traditional BI and visualization tools can be used to run SQL-like queries over Spark data. Users can also use Spark SQL to perform ETL operations on data in different formats (such as JSON, Parquet, and databases), transform it, and then make it available for specific queries.
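A minimal sketch of such a query (the JSON path and the column names severity and source_ip are assumptions): the file is loaded as a DataFrame, registered as a temporary view, and queried with standard SQL.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AlertQuery").getOrCreate()

    // Hypothetical JSON file of security alerts, one object per line.
    val alerts = spark.read.json("hdfs:///security/alerts.json")
    alerts.createOrReplaceTempView("alerts")

    // SQL over the Spark dataset, grouping critical alerts by source.
    val critical = spark.sql(
      """SELECT source_ip, COUNT(*) AS hits
        |FROM alerts
        |WHERE severity = 'critical'
        |GROUP BY source_ip
        |ORDER BY hits DESC""".stripMargin)

    critical.show()
    spark.stop()
  }
}
```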
Spark MLlib
MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including binary classification, linear regression, clustering, collaborative filtering, gradient descent, and underlying optimization primitives.
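As a minimal sketch of the binary classification capability (the input path, feature format, and 0/1 benign/malicious labels are assumptions), a logistic regression model can be trained and evaluated on labelled flow records:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

object MaliciousFlowClassifier {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlowClassifier").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical labelled data: "label,feature1,feature2,..." where
    // label is 0.0 (benign) or 1.0 (malicious).
    val data = sc.textFile("hdfs:///security/flows/labelled.csv").map { line =>
      val parts = line.split(",").map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(train)

    // Simple accuracy estimate on the held-out split.
    val accuracy = test
      .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
      .mean()
    println(s"test accuracy: $accuracy")

    spark.stop()
  }
}
```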
Spark GraphX
Spark GraphX is a distributed graph processing framework built on the Spark platform. It provides a simple, easy-to-use, and rich interface, which greatly eases the task of distributed graph processing.
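A minimal sketch of the interface (the hosts, connections, and use of PageRank as a centrality measure are illustrative assumptions): a graph of hosts and observed connections is built, and PageRank highlights the hosts at the center of the traffic.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object HostGraphSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HostGraph").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical vertices: host IDs and addresses.
    val hosts = sc.parallelize(Seq(
      (1L, "10.0.0.1"), (2L, "10.0.0.2"), (3L, "10.0.0.3"), (4L, "10.0.0.4")))

    // Hypothetical edges: observed connections, byte counts as attributes.
    val connections = sc.parallelize(Seq(
      Edge(1L, 2L, 1024L), Edge(2L, 3L, 512L), Edge(1L, 3L, 2048L)))

    val graph = Graph(hosts, connections)

    // PageRank (tolerance 0.001) as a rough measure of host centrality.
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(hosts)
      .sortBy(_._2._1, ascending = false)
      .collect()
      .foreach { case (_, (rank, ip)) => println(f"$ip%-12s $rank%.3f") }

    spark.stop()
  }
}
```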
Rapid Flow of Data
A problem often encountered in big data systems is that the overall system is composed of various subsystems, and data must flow continuously among these subsystems with high performance and low latency. Traditional enterprise messaging systems are not well suited to processing at this scale. Kafka emerged to meet the demands of both online real-time messaging and offline consumption of data files and logs.
Kafka is a distributed publish-subscribe messaging system. It was originally developed by LinkedIn and later became part of the Apache project. Kafka is a distributed, partitionable persistent log service with redundant backups, used mainly to process active streaming data.
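A minimal publish-side sketch in Scala (the broker address, topic name, and event payload are assumptions): a producer writes one security event to a topic, from which downstream consumers such as a Spark Streaming job can subscribe independently.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Hypothetical broker address; keys and values are plain strings.
    props.put("bootstrap.servers", "kafka1.example.com:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Publish one event to the hypothetical "security-logs" topic,
      // keyed by the device that produced it.
      val record = new ProducerRecord[String, String](
        "security-logs", "firewall-01", """{"event":"deny","src":"10.0.0.5"}""")
      producer.send(record)
    } finally {
      producer.close()
    }
  }
}
```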
Quick Retrieval of Data
When we build a website or application and want to add search, we are struck by the fact that search is hard. We want our search solution to be fast; we want a zero-configuration, completely free search mode; we want to be able to index data simply with JSON over HTTP; we want a search server that is always available; we want to start with one machine and scale to hundreds; we want real-time search; we want simple multi-tenancy; and we want to build a solution for the cloud. Elasticsearch was designed to solve all of these problems and more.
Elasticsearch is a search engine based on Lucene technology. It provides a distributed, multi-user full-text search engine behind a well-designed RESTful web interface, allowing us to quickly and easily build our own enterprise real-time information search engine.
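As a minimal sketch of the "JSON over HTTP" workflow (the local node address, index name, and document fields are assumptions), a document can be indexed with a single PUT request:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object EsIndexSketch {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()

    // Hypothetical security event to index as document 1 of "security-events".
    val doc = """{"timestamp":"2017-05-01T12:00:00Z","src_ip":"10.0.0.5","action":"deny"}"""
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:9200/security-events/_doc/1"))
      .header("Content-Type", "application/json")
      .PUT(HttpRequest.BodyPublishers.ofString(doc))
      .build()

    // Elasticsearch acknowledges the write with a JSON result body.
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }
}
```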