What is garbage collection? Automated memory management for your programs

Most modern programming languages use garbage collection, but developers have options for how it is implemented and tuned. Get an overview of how garbage collection works in languages such as Java, Python, and C#.

Collection, allocation, moving, memory, GC.
Ljupco Smokovski/Shutterstock

This article introduces garbage collection, including an overview of garbage collection algorithms and how garbage collection is implemented in some popular programming languages including Java and Python. Before we get into that, let's consider the pros and cons of garbage collection itself. Why is it such a common solution for memory allocation errors? We'll start with the perils of memory management in languages like C and C++, which do not use garbage collection.

The perils of memory management in C/C++

Memory allocation issues are only a subset of the problems common to C/C++ that cause potential bugs and vulnerabilities, but they are a large subset and are very annoying to track down and fix. Memory allocation bugs include the following scenarios:

  • Failing to release memory that you’ve allocated, which can eventually use up all the RAM in the system and crash not only the program but the whole computer.
  • Attempting to read or write a buffer through a pointer after the memory has been freed with potentially random results (aka the dangling pointer).
  • Double-freeing a memory block, which can crash the memory manager and eventually the program or even the whole system.

Other common C/C++ vulnerabilities include buffer overruns and string manipulation, which can overwrite code with data. The “fun” part is when an attacker crafts the data so that it will be malicious executable code and then manages to run the code.

Wait, wait, you’re saying: that can’t happen anymore because of separate code and data segments in a protected mode system. Unfortunately, it still can and does happen in some circumstances. An example is a program that constructs an SQL statement in a string and then sends it to a database for execution, often creating an SQL injection vulnerability. Sure, there are well-documented best practices for avoiding SQL injection vulnerabilities, but new bugs in this category keep cropping up in database clients, so it's clear that not every programmer follows the best practices.

Garbage collection: The flawed cure

Using garbage collection can completely eliminate the major memory allocation and deallocation issues, but it comes at a cost. The biggest issues are the overhead of the garbage collector; unpredictable stalls when the garbage collector decides to run; and increased lag when a server process stalls. The latter issue often shows up in Java-based server programs.

Garbage collection's overhead can be substantial, and involves a tradeoff between memory and performance. According to a 2005 paper by Matthew Hertz and Emery D. Berger:

[W]ith five times as much memory, an Appel-style generational collector with a non-copying mature space matches the performance of reachability-based explicit memory management. With only three times as much memory, the collector runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.

Appel-style generational collectors are conservative garbage collectors; more aggressive GCs can sometimes perform better with less memory.

Stalls and lag mean that GC-based languages can be suboptimal for real-time programs and high-throughput servers that need to minimize lag. For example, there have been several attempts at real-time Lisp and real-time Java, all of which have modified or eliminated the garbage collector.

Recently, several Java and Scala servers were rewritten in non-GC languages, for example Scylla, which is a rewrite of Cassandra in C++, and Redpanda, which is a Kafka plug-in replacement written primarily in C++. In both Scylla and Redpanda, lag has drastically reduced compared to the original servers. Both also require much smaller clusters for the same load.

Garbage collection algorithms

There are dozens of algorithms for garbage collection. Let's look at some of the most important algorithms and their salient characteristics.

Reference counting

In reference counting, the program stores the number of references, pointers, or handles to a resource as part of the allocated resource, and increments or decrements the count as references are added or removed. When the reference count reaches zero, the resource can be freed. Memory garbage collection is only one of the applications of reference counting; it is also used for deallocation control of system objects, Windows COM objects, and file system blocks or files.

There are two major downsides to reference counting: excessively frequent updates, and circular references. One way to control the frequency of updates is to allow the compiler to batch related objects. One way to handle circular references, which keep the counts from ever reaching zero, is to run an occasional tracing GC to remove the unreachable cycles.

Tracing garbage collection

Tracing GC is the major alternative to reference counting and includes all the following algorithms as well as quite a few more. The general idea of tracing garbage collection is that the tracing process starts with some root objects (such as the current local variables, global variables, and current function parameters) and follows the references to determine which objects are reachable. All of the unreachable objects are then garbage collected. Tracing garbage collection is so common that it is sometimes simply called garbage collection.

Mark and sweep

The “naïve” mark and sweep algorithm, which was published in 1960 and goes back to John McCarthy and Lisp, works by first freezing the system, then marking all the objects reachable from the root set as “in-use.” The third step is to sweep all of memory and free any block not marked “in-use." Finally, the “in-use” bit is cleared in all remaining  memory blocks, to prepare for the next collection, and the system is allowed to continue. Obviously this is inappropriate for real-time systems.

A variant on mark and sweep uses three “colors” of memory: white blocks are unreachable, and are condemned to be freed if they are still in the white set when the algorithm ends; black blocks are reachable from the roots and have no outgoing references to objects in the white set; and gray blocks are reachable from the roots but are yet to be scanned for references to "white" objects. After the algorithm completes, the gray blocks all wind up in the black set. Typically, the initial marking puts all blocks referenced by the roots into the gray set and all other blocks in the white set.

The tri-color variant algorithm has three steps:

  1. Pick an object from the grey set and move it to the black set.
  2. Move each white object it references to the grey set. This ensures that neither this object nor any object it references can be garbage-collected.
  3. Repeat the last two steps until the grey set is empty.

When the grey set is empty all white blocks can be freed. The tri-color algorithm can be performed in the background while the program runs; there is still overhead, but it doesn’t “stop the world.”

Copying collection

The general idea of copying collection, aka semi-space GC, is that you divide memory into two two equal-size regions called “from-space” and “to-space." You allocate blocks sequentially in to-space until it fills up, and then perform a collection. That swaps the roles of the regions and copies the live objects from from-space to, to-space, leaving a block of free space (corresponding to the memory used by all unreachable objects) at the end of the to-space.

There are complications with copying collection. The biggest one is that when you copy blocks their address changes; one solution to that is to maintain a table of forwarding addresses. Another big issue is that you need twice as much memory for copying collection as you do for mark and sweep. Copying collection is faster than mark and sweep if most memory is garbage, but slower if most memory is reachable.

Mark and compact

Mark and compact collection is essentially copying collection that runs inside of a single memory space. The mark-compact collector scans for all reachable objects and compacts them at the bottom of the heap, which leaves the top of the heap available for consumption. The biggest drawback of mark and compact collection is the time it takes.

Generational collection

Generational collection divides the heap into multiple spaces (usually either two or three) based on the age of the objects, i.e., generations. In general, recent objects are more likely to be garbage than older objects, so it makes sense to scan the newer objects for garbage and leave the older objects alone most of the time. Some generational collectors use different scan frequencies and/or collection algorithms on different generations.

What languages use garbage collection?

Lisp has used garbage collection since John McCarthy invented it in 1958. Java, Scala, Python, and .NET/C# are all popular GC languages. Additional garbage collection languages include the relatively young Go, Ruby, D, OCaml, and Swift, as well the older languages Eiffel, Haskell, ML, Modula-3, Perl, Prolog, Scheme, and Smalltalk.

Java, Python, and .NET/C# are some of the more popular programming languages that implement garbage collection. The Java virtual machine (JVM) actually provides four different garbage collectors: serial, parallel, concurrent mark and sweep, and G1GC, the garbage first garbage collector. G1GC is now the default in Java; it is a regionalized and generational parallel compacting collector that achieves soft real-time goals.

Python, specifically the standard CPython implementation, combines reference counting with three-level generational collection that only focuses on cleaning container objects. The .NET CLR (common language runtime) uses a three-level generational mark and compact collection algorithm. The CLR also segregates memory objects into two heaps, one for large objects (85,000 bytes or higher) and one for small objects; the large object heap usually isn’t compacted, just marked and swept, but can be compacted if necessary.

Conclusion

As you’ve seen, there are quite a few ways to handle garbage collection, and most of them have their uses. The more mature garbage collection implementations combine multiple algorithms and have been heavily tuned over the years to minimize lag.

Copyright © 2023 IDG Communications, Inc.