
Google Cloud recently announced the launch of its Cloud Run GPU service, which lets users attach NVIDIA L4 GPUs to workloads in a serverless, cloud-native environment. Aimed primarily at AI workloads such as inference and training, the service offers seamless autoscaling and flexible deployment.
Notably, users are not required to provision GPU capacity in advance. The service scales GPU instances up and down with demand, including down to zero when traffic stops, so idle resources are neither wasted nor billed. This flexibility enhances deployment agility and simplifies operational management through automation.
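As a minimal sketch of how this autoscaling is typically controlled, the command below caps the instance count on an already-deployed service; the service name and region are placeholders rather than names from the announcement:

```shell
# Hypothetical example: cap autoscaling on an existing Cloud Run service.
# "my-gpu-service" and the region are placeholders.
gcloud run services update my-gpu-service \
  --region us-central1 \
  --max-instances 4
# Minimum instances default to 0, which is what allows the service
# to scale to zero (and stop billing) when there is no traffic.
```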
The service uses a per-second billing model, and charges stop when the GPUs are not in use. Furthermore, the GPUs and their drivers can initialize from a cold start in approximately five seconds. For instance, when running inference with the 4-billion-parameter Gemma 3 model, the time from cold start to the first generated token is roughly 19 seconds, underscoring the platform's rapid startup.
Cloud Run GPU also integrates easily with existing applications: GPU acceleration can be enabled either through command-line flags at deployment time or via a toggle in the service settings in the Google Cloud console.
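As a sketch of the command-line path, the deployment below attaches one L4 GPU to a service; the service name, image, and region are placeholder values, and the resource minimums reflect Cloud Run's documented GPU requirements at launch:

```shell
# Deploy a container with one NVIDIA L4 attached (all names are placeholders).
# GPU services require always-allocated CPU and at least 4 vCPUs / 16 GiB memory.
gcloud run deploy my-gpu-service \
  --image us-docker.pkg.dev/my-project/my-repo/inference:latest \
  --region us-central1 \
  --cpu 4 \
  --memory 16Gi \
  --no-cpu-throttling \
  --gpu 1 \
  --gpu-type nvidia-l4
```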
Alongside its elastic architecture, Google Cloud emphasizes the service's operational reliability: users and enterprises can deploy across multiple regions as business needs dictate, and can optionally disable zonal redundancy to fine-tune overall compute resource allocation and cost.
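Multi-region deployment amounts to repeating the deployment per region, while opting out of zonal redundancy is exposed as a service setting. The flag below follows Google's announcement but, like the placeholder names, should be treated as an assumption:

```shell
# Hypothetical example: turn off zonal redundancy on a GPU service to trade
# failover capacity for lower cost (flag name per Google's announcement).
gcloud run services update my-gpu-service \
  --region us-central1 \
  --no-gpu-zonal-redundancy
```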
The Cloud Run GPU service is now live across several Google Cloud regions in the United States, Europe, and Asia.