Google Reveals Effingo: The Tech Behind Moving 1.2 Exabytes Daily
Google has unveiled the technical details of its internal data transfer tool called Effingo, which moves an average of 1.2 exabytes of information daily.
A paper presented at the SIGCOMM 2024 conference in Sydney explains that limited bandwidth and the fixed speed of light compel Google to duplicate data so that it sits closer to where it is processed or served. By keeping copies nearby, Effingo reduces network latency from hundreds of milliseconds to tens of milliseconds within a continent.
Conventional data transfer tools either optimize transfer time or handle point-to-point data streams, but they cannot cope with the volume Effingo moves: 1.2 exabytes a day works out to roughly 14 terabytes per second. Effingo prioritizes tasks, allocating resources so that critical operations, such as disaster recovery, take precedence over routine data migrations.
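The per-second figure can be checked from the daily total with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1 EB = 10^18 bytes, 1 TB = 10^12 bytes) and a perfectly even transfer rate across the day, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
EXABYTE = 10**18                 # decimal exabyte, in bytes (assumption)
TERABYTE = 10**12                # decimal terabyte, in bytes (assumption)


def daily_volume_to_tb_per_s(exabytes_per_day: float) -> float:
    """Convert a daily volume in exabytes to an average rate in TB/s."""
    bytes_per_second = exabytes_per_day * EXABYTE / SECONDS_PER_DAY
    return bytes_per_second / TERABYTE


rate = daily_volume_to_tb_per_s(1.2)
print(f"{rate:.1f} TB/s")  # prints "13.9 TB/s", i.e. about 14 TB/s
```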
Effingo is optimized for use with Google’s Colossus file system, deployed in clusters comprising thousands of machines. Each cluster runs the Effingo software, which consists of a control plane and a data plane. The control plane manages the copy lifecycle, while the data plane transfers data and monitors status. The data plane accounts for 99% of CPU usage but less than 7% of the lines of code.
Clusters are connected to one another via low-latency, high-bandwidth networks or via WAN connections that use Google and third-party infrastructure. The Bandwidth Enforcer (BWe) tool, also developed by Google, allocates bandwidth based on service priorities and the marginal value of additional bandwidth.
When a user initiates a data transfer, Effingo requests a traffic allocation from BWe and begins the transfer as quickly as possible. The allocation can be based on predefined quotas, on bandwidth metrics, and on the Effingo resources available. Those resources execute data movement tasks as jobs on Borg, Google’s cluster management platform, from which Kubernetes was derived.
Effingo can use best-effort resources for less critical tasks and request quotas for tasks requiring specific network performance. Quotas are allocated months in advance, and Effingo is one of many resources in the central scheduling system. Unused quotas are reallocated but can be quickly reclaimed if necessary.
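The interplay between quotas and best-effort traffic can be illustrated with a toy allocator. This is a minimal sketch under stated assumptions, not Effingo's or BWe's actual algorithm: the `Transfer` class, the `allocate` function, and the max-min fair split among best-effort transfers are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    name: str
    demand: float        # bandwidth the transfer could use right now
    quota: float = 0.0   # reserved bandwidth; 0 means best-effort


def allocate(transfers, capacity):
    """Hypothetical two-pass allocation over one link of given capacity."""
    grants = {}
    remaining = capacity

    # Pass 1: quota holders get what they need, up to their quota.
    for t in transfers:
        if t.quota > 0:
            grant = min(t.demand, t.quota, remaining)
            grants[t.name] = grant
            remaining -= grant

    # Pass 2: whatever is left (including unused quota) is shared
    # max-min fairly among best-effort transfers, smallest demand first.
    best_effort = sorted((t for t in transfers if t.quota == 0),
                         key=lambda t: t.demand)
    for i, t in enumerate(best_effort):
        fair_share = remaining / (len(best_effort) - i)
        grant = min(t.demand, fair_share)
        grants[t.name] = grant
        remaining -= grant
    return grants


# Quota holder is mostly idle: its unused 40 units go to best-effort work.
idle = allocate([Transfer("recovery", demand=20, quota=60),
                 Transfer("migration", demand=200)], capacity=100)

# Quota holder ramps up: the lent bandwidth is reclaimed on the next pass.
busy = allocate([Transfer("recovery", demand=60, quota=60),
                 Transfer("migration", demand=200)], capacity=100)
print(idle, busy)
```

Re-running the allocation whenever demand changes is what models the "quick reclaim" here: the best-effort migration drops from 80 to 40 units as soon as the quota holder asks for its full reservation.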
Despite all efforts to distribute resources, Effingo’s average global queue size is 12 million files, equivalent to about eight petabytes. At peak times, queues increase by 12 petabytes and nine million files when the top 10 users initiate new transfers.
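These queue figures also imply a rough average file size. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes) and simply divides bytes by file count; the helper name is hypothetical, and the real size distribution is not given in the source.

```python
PETABYTE = 10**15  # decimal petabyte, in bytes (assumption)
MEGABYTE = 10**6   # decimal megabyte, in bytes (assumption)


def average_file_size_mb(petabytes: float, files: float) -> float:
    """Mean file size in MB for a queue of the given size and file count."""
    return petabytes * PETABYTE / files / MEGABYTE


# Average queue: about 8 PB across 12 million files.
typical = average_file_size_mb(8, 12e6)

# Peak growth: about 12 PB across 9 million files.
peak = average_file_size_mb(12, 9e6)

print(f"typical: {typical:.0f} MB per file, peak additions: {peak:.0f} MB per file")
```

On these numbers, files added at peak are on average about twice as large as those in the typical queue.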
Google plans to improve Effingo’s integration with resource management systems and optimize CPU usage during inter-cluster transfers. Enhancements are also planned to scale data transfers more rapidly.