

The surprising and not-so-surprising benefits of generations in the Z Garbage Collector.
By Danny Thomas, JVM Ecosystem Team
The latest long-term support release of the JDK delivers generational support for the Z Garbage Collector.
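Generational mode ships with JDK 21 but is not enabled by default; for reference, a minimal launch command looks like the following sketch, where app.jar is a placeholder:

# JDK 21: select ZGC, then opt into the generational mode.
java -XX:+UseZGC -XX:+ZGenerational -jar app.jar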
More than half of our key streaming video services now run on JDK 21 with Generational ZGC, so now is a good time to talk about our experience and the benefits we’re seeing. If you’re interested in how we use Java at Netflix, Paul Bakker’s talk How Netflix Really Uses Java is a great place to start.
In our GRPC and DGS Framework services, GC pauses are a significant source of tail latency. That's particularly true of our GRPC clients and servers, where request cancellations due to timeouts interact with reliability features such as retries, hedging, and fallbacks. Each of these errors is a canceled request resulting in a retry, so this reduction in errors further reduces overall service traffic by the same rate.
Removing the noise of pauses also allows us to identify actual sources of end-to-end latency that would otherwise be hidden in it, as maximum pause time outliers can be significant.
Even though we saw very promising results in our evaluation, we still expected adopting ZGC to be a trade-off: application throughput would be slightly reduced by store and load barriers, by work performed in thread-local handshakes, and by the GC competing with the application for resources. We considered this an acceptable trade-off, as the benefits of avoiding pauses would outweigh the overhead.
In fact, we found that for our services and architecture there is no such trade-off. For a given CPU utilization target, ZGC improves both average and P99 latencies with equal or better CPU utilization when compared to G1.
The consistency in request rates, request patterns, response times, and allocation rates we see across many services certainly helps ZGC, but we've found it's equally capable of handling less consistent workloads (with exceptions, of course; more on that below).
Service owners often reach out to us about pause outliers and ask for tuning help. We have several frameworks that periodically refresh large amounts of on-heap data to avoid external service calls, for efficiency. These periodic refreshes of on-heap data are great at taking G1 by surprise, causing pause time outliers well beyond the default pause time goal.
This long-lived on-heap data was the main reason we had not adopted non-generational ZGC previously. In the worst case we evaluated, non-generational ZGC caused 36% higher CPU utilization than G1 for the same workload; with Generational ZGC, that became a nearly 10% improvement.
Half of all services required to stream video use our Hollow library for on-heap metadata. Removing pauses as a concern allowed us to remove array-pooling mitigations, freeing hundreds of megabytes of memory for allocations.
Operational simplicity also comes from ZGC’s heuristics and default settings. No explicit tuning is required to achieve these results. Allocation stalls are rare, usually coincident with unusual spikes in allocation rates, and are shorter than the average pauses we saw in G1.
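If you want to confirm how rare allocation stalls are for your own workload, ZGC reports them in the unified GC log; a quick way to watch for them (app.jar is again a placeholder):

# Allocation stalls show up as "Allocation Stall" events in the gc log.
java -Xlog:gc -XX:+UseZGC -XX:+ZGenerational -jar app.jar | grep "Allocation Stall"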
We anticipated that the loss of compressed references on heaps smaller than 32 GB, owing to colored pointers requiring 64-bit object pointers, would be a major factor in the choice of garbage collector.
We found that while this is an important consideration for stop-the-world collectors, it isn't for ZGC: even on small heaps, the increase in allocation rate is amortized by the efficiency and operational improvements. We thank Erik Österlund of Oracle for explaining the less intuitive benefits of colored pointers for concurrent garbage collectors, which allowed us to evaluate ZGC more extensively than originally planned.
In the majority of cases, ZGC is also able to consistently make more memory available to the application.
ZGC has a fixed overhead of 3% of the heap size, requiring more native memory than G1. Except in a couple of cases, there has been no need to lower the maximum heap size to allow for more headroom, and those were services with greater-than-average native memory needs.
Reference processing is also only performed in major collections with ZGC. We paid particular attention to the deallocation of direct byte buffers, but so far we haven't seen any impact. This difference in reference processing did cause a performance problem with JSON thread dump support, but that was a peculiar situation: a framework was accidentally creating an unused ExecutorService instance for each request.
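If you want to watch for unexpected native memory growth yourself, for example from direct buffers being reclaimed later than expected, the JDK's Native Memory Tracking is one option; a sketch, where <pid> stands in for the running JVM's process id:

# Start the JVM with native memory tracking enabled (adds some overhead).
java -XX:NativeMemoryTracking=summary -XX:+UseZGC -XX:+ZGenerational -jar app.jar
# Then inspect native memory by category from another shell.
jcmd <pid> VM.native_memory summary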
Even if you don’t use ZGC, you should probably use hugepages, and transparent hugepages are the most convenient way to use them.
ZGC uses shared memory for the heap, and many Linux distributions configure shmem_enabled to never, which silently prevents ZGC from using huge pages with -XX:+UseTransparentHugePages.
Here we deployed a service with no other change than shmem_enabled going from never to advise, significantly reducing CPU utilization.
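To check whether your hosts are affected, the active shmem_enabled value can be read back from sysfs; the selected option is shown in brackets:

cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
# e.g. always within_size advise [never] deny force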
Our default configuration (see the combined example after this list):
- Set the heap minimum and maximum to equal sizes
- Configure -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
- Use the following transparent_hugepage configuration:
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo advise | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
echo 1 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
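Putting these defaults together, a launch command might look like the following sketch; the 16 GB heap and app.jar are illustrative, not a recommendation:

# Equal min/max heap plus the transparent huge page flags above.
java -XX:+UseZGC -XX:+ZGenerational \
     -Xms16g -Xmx16g \
     -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch \
     -jar app.jar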
There is no best garbage collector. Each trades off collection throughput, application latency, and resource utilization based on the garbage collector’s goals.
For the workloads that have performed better with G1 than ZGC, we've found they tend to be more throughput-oriented, with very spiky allocation rates, and to have long-running tasks holding on to objects for unpredictable periods.
One notable example was a service with a very spiky allocation rate and a large number of long-lived objects, which happened to be a particularly good fit for G1's pause time goal and old-region collection heuristics. That combination allows G1 to avoid unproductive work in GC cycles that ZGC cannot.
The switch to ZGC by default has given application owners an excellent opportunity to reconsider their choice of garbage collector. Several batch/precompute cases had been using G1 by default, where they would see better throughput from the parallel collector. In one large precompute workload, we saw a 6-8% improvement in application throughput, shaving an hour off the batch time, versus G1.
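For throughput-oriented batch jobs like these, switching collectors is a single flag; a sketch, with batch-job.jar as a placeholder:

# The parallel collector trades pause times for collection throughput,
# which suits batch/precompute workloads that don't serve interactive traffic.
java -XX:+UseParallelGC -jar batch-job.jar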
Left unquestioned, assumptions and expectations could have caused us to miss one of the most impactful changes we've made to our operational defaults in a decade. We encourage you to try Generational ZGC for yourself. It might surprise you, like it surprised us.