ClickHouse: a free analytic DBMS for big data
ClickHouse is an open-source column-oriented database management system that allows generating analytical data reports in real time.
ClickHouse’s performance exceeds that of comparable column-oriented DBMSs currently available on the market. A single server processes hundreds of millions to more than a billion rows, and tens of gigabytes of data, per second.
ClickHouse uses all available hardware to its full potential to process each query as fast as possible. The peak processing performance for a single query (after decompression, only used columns) stands at more than 2 terabytes per second.
ClickHouse allows companies to add servers to their clusters when necessary without investing time or money into any additional DBMS modification. The system has been successfully serving Yandex.Metrica while its main production cluster grew from 60 to 394 servers in two years. Those servers are located in six geographically distributed datacenters.
ClickHouse scales well both vertically and horizontally. It adapts easily to a cluster with hundreds of nodes, a single server, or even a tiny virtual machine. Currently, there are installations with more than two trillion rows per single node, as well as installations with 100 TB of storage per single node.
ClickHouse processes typical analytical queries two to three orders of magnitude faster than traditional row-oriented systems with the same available I/O throughput. The system’s columnar storage format allows fitting more hot data in RAM, which leads to shorter response times.
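To illustrate why a columnar layout helps with the same I/O throughput, here is a minimal Python sketch. It is not ClickHouse code: the table, its three column names, and both helper functions are hypothetical. It only compares how many bytes a single-column aggregate has to scan under each layout.

```python
import struct
from array import array

# Hypothetical table of (user_id, region, duration); the names are illustrative only.
ROWS = [(i, i % 7, i * 2) for i in range(1000)]

# Row-oriented layout: all fields of a row are stored next to each other.
row_store = b"".join(struct.pack("<qqq", *r) for r in ROWS)

# Column-oriented layout: each column is its own contiguous array.
column_store = {
    "user_id": array("q", (r[0] for r in ROWS)),
    "region": array("q", (r[1] for r in ROWS)),
    "duration": array("q", (r[2] for r in ROWS)),
}


def sum_duration_rows(buf):
    """Sum the third field; every byte of every row has to be scanned."""
    total = 0
    for _user_id, _region, duration in struct.iter_unpack("<qqq", buf):
        total += duration
    return total, len(buf)


def sum_duration_columns(cols):
    """Sum the same field; only the one column that the query uses is scanned."""
    col = cols["duration"]
    return sum(col), len(col) * col.itemsize


row_total, row_bytes = sum_duration_rows(row_store)
col_total, col_bytes = sum_duration_columns(column_store)
print(row_total == col_total, row_bytes, col_bytes)
```

With three equally sized columns, the columnar scan touches a third of the bytes. The gap widens as tables get wider, and the same effect is what lets more of the frequently used data stay in RAM.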
ClickHouse minimizes the number of seeks for range queries by maintaining locality of reference for contiguously stored data, which increases the efficiency of rotational disk drives.
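The effect of locality on range queries can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not ClickHouse's storage engine: the rows are assumed to be kept sorted by a key, and `range_scan` is a hypothetical helper. It shows that a range then maps to one contiguous slice, which means one seek followed by a sequential read.

```python
from bisect import bisect_left, bisect_right

# Hypothetical data: (key, value) pairs stored in key order.
rows = [(k, k * 10) for k in range(100)]
keys = [k for k, _ in rows]


def range_scan(lo, hi):
    """Return the rows with lo <= key <= hi and the slice bounds that hold them."""
    start = bisect_left(keys, lo)   # position of the first matching row
    stop = bisect_right(keys, hi)   # position just past the last matching row
    return rows[start:stop], (start, stop)


result, (start, stop) = range_scan(20, 29)
print(start, stop, len(result))
```

If the same rows were stored in arbitrary order, each match could sit at a different position and cost its own seek.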
ClickHouse is CPU efficient because of its vectorized query execution, which involves relevant processor instructions and runtime code generation.
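The general shape of block-at-a-time (vectorized) processing can be sketched as follows. This is an illustration only: `BLOCK_SIZE`, `blocks`, and `sum_where_greater` are hypothetical names, and pure Python cannot show the processor-level gains. The point is that the operator is dispatched once per block of column values rather than once per row.

```python
from array import array

BLOCK_SIZE = 4  # hypothetical; chosen small so the example is easy to follow


def blocks(column, size=BLOCK_SIZE):
    """Yield consecutive fixed-size slices of a column."""
    for i in range(0, len(column), size):
        yield column[i:i + size]


def sum_where_greater(column, threshold):
    """Sum the values above the threshold, one block at a time."""
    total = 0
    for block in blocks(column):
        # One operator call per block; the inner loop runs over a dense array.
        total += sum(v for v in block if v > threshold)
    return total


values = array("q", range(10))
total = sum_where_greater(values, 5)
print(total)
```

In a compiled engine, the inner loop over a dense block is the part that can be mapped onto wide processor instructions.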
By minimizing data transfers for most types of queries, ClickHouse enables companies to manage their data and create reports without using specialized networks that are aimed at high-performance computing.
ClickHouse supports multi-master asynchronous replication and can be deployed across multiple data centers. Downtime of a single node or the whole data center won’t affect the system’s availability for both reads and writes. Distributed reads are automatically balanced to live replicas to avoid increasing latency. Replicated data are synchronized automatically or semi-automatically after server downtime.
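As a rough illustration of balancing reads across live replicas, the sketch below rotates requests over the replicas that are currently marked alive. It is an assumption-laden toy, not ClickHouse's balancing policy: the `ReplicaPool` class, the round-robin strategy, and the replica names are all hypothetical.

```python
from itertools import count


class ReplicaPool:
    """Toy read balancer that skips replicas marked as down."""

    def __init__(self, replicas):
        self.alive = {name: True for name in replicas}
        self._counter = count()

    def mark(self, name, is_alive):
        """Record the result of a health check for one replica."""
        self.alive[name] = is_alive

    def pick(self):
        """Return the next live replica in round-robin order."""
        live = [name for name, ok in self.alive.items() if ok]
        if not live:
            raise RuntimeError("no live replicas")
        return live[next(self._counter) % len(live)]


pool = ReplicaPool(["dc1-a", "dc2-a", "dc3-a"])
pool.mark("dc2-a", False)  # simulate a node or datacenter outage
picks = [pool.pick() for _ in range(4)]
print(picks)
```

Reads keep flowing to the remaining replicas while one is down. Once it is marked alive again, it rejoins the rotation.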
- True column-oriented storage
- Vectorized query execution
- Data compression
- Parallel and distributed query execution
- Real-time query processing
- Real-time data ingestion
- On-disk locality of reference
- Cross-datacenter replication
- High availability
- SQL support
- Local and distributed joins
- Pluggable external dimension tables
- Arrays and nested data types
- Approximate query processing
- Probabilistic data structures
- Full support of IPv6
- Features for web analytics
- State-of-the-art algorithms
- Detailed documentation
- Clean documented code