ClickHouse: a free analytic DBMS for big data

ClickHouse is an open-source column-oriented database management system that allows generating analytical data reports in real time.

ClickHouse’s performance exceeds that of comparable column-oriented DBMSs currently available on the market. It processes hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second.

ClickHouse uses all available hardware to its full potential to process each query as fast as possible. The peak processing performance for a single query (after decompression, only used columns) stands at more than 2 terabytes per second.

ClickHouse allows companies to add servers to their clusters when necessary without investing time or money in any additional DBMS modification. The system has been successfully serving Yandex.Metrica while the number of servers in its main production cluster grew from 60 to 394 in two years. These servers are located in six geographically distributed datacenters.

ClickHouse scales well both vertically and horizontally. It adapts easily to a cluster with hundreds of nodes, to a single server, or even to a tiny virtual machine. Currently, there are installations with more than two trillion rows per single node, as well as installations with 100 TB of storage per single node.

ClickHouse processes typical analytical queries two to three orders of magnitude faster than traditional row-oriented systems with the same available I/O throughput. The system’s columnar storage format allows fitting more hot data in RAM, which leads to shorter response times.
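The I/O argument can be illustrated with a minimal Python sketch. It is not ClickHouse code: the table, its column names, and the helper functions are hypothetical, and compression is ignored. It only counts how many bytes a query that aggregates one column would have to read under each layout.

```python
from array import array

NUM_ROWS = 1000

# Hypothetical table stored column by column (column names are made up).
columns = {
    "user_id": array("q", range(NUM_ROWS)),
    "timestamp": array("q", range(NUM_ROWS)),
    "duration": array("q", range(NUM_ROWS)),
    "revenue": array("q", range(NUM_ROWS)),
}

def bytes_read_columnar(table, used_columns):
    """A column store reads only the columns the query references."""
    return sum(len(table[name]) * table[name].itemsize for name in used_columns)

def bytes_read_row_oriented(table):
    """A row store reads whole rows, so every column is touched."""
    return sum(len(col) * col.itemsize for col in table.values())

# An aggregation over "duration" touches one column out of four.
columnar = bytes_read_columnar(columns, ["duration"])
row_oriented = bytes_read_row_oriented(columns)
total_duration = sum(columns["duration"])
```

With four equally wide columns the columnar layout reads a quarter of the bytes. Real analytical tables often have far more columns, and compression widens the gap further, but this sketch makes no attempt to reproduce the speedups quoted above.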

ClickHouse minimizes the number of seeks for range queries, which makes rotational disk drives more efficient to use, because it maintains locality of reference for contiguously stored data.
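A minimal sketch of why sorted, contiguous storage helps range queries, assuming data is kept ordered by a key. The function name `range_scan` and the in-memory lists are illustrative assumptions, not ClickHouse internals: the point is that two binary searches locate the boundaries, and everything in between is read as a single contiguous run.

```python
from bisect import bisect_left, bisect_right

def range_scan(sorted_keys, values, low, high):
    """Return the values whose key lies in [low, high], plus the slice bounds.

    Because keys are sorted, the matching rows form one contiguous slice,
    which corresponds to one seek followed by a sequential read.
    """
    start = bisect_left(sorted_keys, low)    # first position with key >= low
    end = bisect_right(sorted_keys, high)    # first position with key > high
    return values[start:end], (start, end)

keys = [1, 3, 3, 5, 8, 9]
payload = ["a", "b", "c", "d", "e", "f"]
matched, bounds = range_scan(keys, payload, 3, 8)
```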

ClickHouse is CPU-efficient because of its vectorized query execution, which uses relevant processor instructions and runtime code generation.
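The idea behind vectorized execution can be sketched in Python, with the caveat that Python has no direct access to SIMD instructions. Here the built-in `sum` over a block stands in for a tight, processor-friendly loop. The function names and the block size of 4096 are assumptions chosen for illustration.

```python
from array import array

def sum_row_at_a_time(column):
    """Interpret one value per step: per-value overhead dominates."""
    total = 0
    for value in column:
        total += value
    return total

def sum_vectorized(column, block_size=4096):
    """Process the column in blocks: one operation call per block of values."""
    total = 0
    for offset in range(0, len(column), block_size):
        block = column[offset:offset + block_size]
        total += sum(block)  # a single call handles the whole block
    return total

column = array("q", range(10000))
slow = sum_row_at_a_time(column)
fast = sum_vectorized(column)
```

Both functions return the same result. The second amortizes dispatch overhead across a whole block of values, which is the property vectorized engines exploit.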

By minimizing data transfers for most types of queries, ClickHouse enables companies to manage their data and create reports without using specialized networks that are aimed at high-performance computing.

ClickHouse supports multi-master asynchronous replication and can be deployed across multiple data centers. Downtime of a single node or the whole data center won’t affect the system’s availability for both reads and writes. Distributed reads are automatically balanced to live replicas to avoid increasing latency. Replicated data are synchronized automatically or semi-automatically after server downtime.
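As a rough illustration of balancing reads across live replicas, the sketch below picks a random replica among those that pass a health check. This is an assumed, simplified policy, not a description of how ClickHouse actually selects replicas. `pick_replica`, the replica names, and the `is_alive` callback are all hypothetical.

```python
import random

def pick_replica(replicas, is_alive, rng=random):
    """Choose one replica uniformly at random among those reported alive."""
    live = [replica for replica in replicas if is_alive(replica)]
    if not live:
        raise RuntimeError("no live replicas available")
    return rng.choice(live)

# Hypothetical cluster in which one replica is currently down.
replicas = ["replica-1", "replica-2", "replica-3"]
alive = {"replica-1", "replica-3"}
chosen = pick_replica(replicas, lambda name: name in alive)
```

Because failed replicas are filtered out before the choice, a single node going down only shrinks the candidate set and reads continue.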

Features

True column-oriented storage
Vectorized query execution
Data compression
Parallel and distributed query execution
Real-time query processing
Real-time data ingestion
On-disk locality of reference
Cross-datacenter replication
High availability
SQL support

Local and distributed joins
Pluggable external dimension tables
Arrays and nested data types
Approximate query processing
Probabilistic data structures
Full support of IPv6
Features for web analytics
State-of-the-art algorithms
Detailed documentation
Clean documented code

Install

Tutorial

Source: https://github.com/yandex/

Tags: ClickHouse