Tracing garbage collection

This guide will go through the fundamentals of garbage collection traces.

By the end of this guide, you'll be able to:

Enable traces in your Node.js application
Interpret traces
Identify potential memory issues in your Node.js application

There's a lot to learn about how the garbage collector works, but if you learn one thing it's that when GC is running, your code is not.

You may want to know how often and long the garbage collection runs, and what is the outcome.

Setup

For the proposal of this guide, we'll use this script:

// script.mjs

import  from 'node:os';

let  = 1_000_000;
const  = new ();

function () {
  const  = {
    : .(),
    : .(),
    : .(),
    : .(),
  };

  .();
}

function () {
  .(`Total: ${.} entries`);
}

// execution
(() => {
  while ( > 0) {
    ();
    ..(`~~> ${} entries to record\r`);
    --;
  }

  ();
})();

Even if the leak is evident here, finding the source of a leak could be cumbersome in the context of a real-world application.

Running with garbage collection traces

You can see traces for garbage collection in console output of your process using the --trace-gc flag.

$ node --trace-gc script.mjs

Note: you can find the source code of this exercise in the Node.js Diagnostics repository.

It should output something like:

[39067:0x158008000]     2297 ms: Scavenge 117.5 (135.8) -> 102.2 (135.8) MB, 0.8 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2375 ms: Scavenge 120.0 (138.3) -> 104.7 (138.3) MB, 0.9 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2453 ms: Scavenge 122.4 (140.8) -> 107.1 (140.8) MB, 0.7 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2531 ms: Scavenge 124.9 (143.3) -> 109.6 (143.3) MB, 0.7 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2610 ms: Scavenge 127.1 (145.5) -> 111.8 (145.5) MB, 0.7 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2688 ms: Scavenge 129.6 (148.0) -> 114.2 (148.0) MB, 0.8 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2766 ms: Scavenge 132.0 (150.5) -> 116.7 (150.5) MB, 1.1 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
Total: 1000000 entries

Hard to read? Maybe we should pass in review a few concepts and explain the outputs of the --trace-gc flag.

Examining a trace with `--trace-gc`

The --trace-gc (or --trace_gc, either is fine) flag outputs all garbage collection events in the console. The composition of each line can be described as:

[13973:0x110008000]       44 ms: Scavenge 2.4 (3.2) -> 2.0 (4.2) MB, 0.5 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure

Token value	Interpretation
13973	PID of the running process
0x110008000	Isolate (JS heap instance)
44 ms	The time since the process started in ms
Scavenge	Type / Phase of GC
2.4	Heap used before GC in MB
(3.2)	Total heap before GC in MB
2.0	Heap used after GC in MB
(4.2)	Total heap after GC in MB
0.5 / 0.0 ms (average mu = 1.000, current mu = 1.000)	Time spent in GC in ms
allocation failure	Reason for GC

We'll only focus on two events here:

Scavenge
Mark-sweep

The heap is divided into spaces. Amongst these, we have a space called the "new" space and another one called the "old" space.

👉 In reality, the heap structure is a bit different, but we'll stick to a simpler version for this article. If you want more details we encourage you to look at this talk of Peter Marshall about Orinoco.

Scavenge is the name of an algorithm that will perform garbage collection into new space. The new space is where objects are created. The new space is designed to be small and fast for garbage collection.

Let's imagine a Scavenge scenario:

we allocated A, B, C & D.
```
| A | B | C | D | <unallocated> |
```
Shell
we want to allocate E
not enough space, the memory is exhausted
then, a (garbage) collection is triggered
dead objects are collected
living object will stay
assuming B and D were dead
```
| A | C | <unallocated> |
```
Shell
now we can allocate E
```
| A | C | E | <unallocated> |
```
Shell

v8 will promote objects, not garbage collected after two Scavenge operations to the old space.

👉 Full Scavenge scenario

Mark-sweep

Mark-sweep is used to collect objects from old space. The old space is where objects that survived the new space are living.

This algorithm is composed of two phases:

Mark: Will mark still alive objects as black and others as white.
Sweep: Scans for white objects and converts them to free spaces.

👉 In fact, the Mark and Sweep steps are a bit more elaborate. Please read this document for more details.

`--trace-gc` in action

Memory leak

Now, if you return quickly to the previous terminal window: you will see many Mark-sweep events in the console. We also see that the amount of memory collected after the event is insignificant.

Now that we are experts in garbage collection! What could we deduce?

We probably have a memory leak! But how could we be sure of that? (Reminder: it is pretty apparent in this example, but what about a real-world application?)

But how could we spot the context?

How to get the context of bad allocations

Suppose we observe that the old space is continuously increasing.
Reduce --max-old-space-size such that the total heap is closer to the limit
Run the program until you hit the out of memory.
The produced log shows the failing context.
If it hits OOM, increment the heap size by ~10% and repeat a few times. If the same pattern is observed, it indicates a memory leak.
If there is no OOM, then freeze the heap size to that value - A packed heap reduces memory footprint and computation latency.

For example, try to run script.mjs with the following command:

node --trace-gc --max-old-space-size=50 script.mjs

You should experience an OOM:

[...]
<--- Last few GCs --->
[40928:0x148008000]      509 ms: Mark-sweep 46.8 (65.8) -> 40.6 (77.3) MB, 6.4 / 0.0 ms  (+ 1.4 ms in 11 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 24 ms) (average mu = 0.977, current mu = 0.977) finalize incrementa[40928:0x148008000]      768 ms: Mark-sweep 56.3 (77.3) -> 47.1 (83.0) MB, 35.9 / 0.0 ms  (average mu = 0.927, current mu = 0.861) allocation failure scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory [...]

Now, try to it for 100mb:

node --trace-gc --max-old-space-size=100 script.mjs

You should experience something similar, the only difference should be that the last GC trace will contain a bigger heap size.

<--- Last few GCs --->
[40977:0x128008000]     2066 ms: Mark-sweep (reduce) 99.6 (102.5) -> 99.6 (102.5) MB, 46.7 / 0.0 ms  (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 47 ms) (average mu = 0.154, current mu = 0.155) allocati[40977:0x128008000]     2123 ms: Mark-sweep (reduce) 99.6 (102.5) -> 99.6 (102.5) MB, 47.7 / 0.0 ms  (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 48 ms) (average mu = 0.165, current mu = 0.175) allocati

Note: In the context of real application, it could be cumbersome to find the leaked object in the code. Heap snapshot could help you to find it. Visit the guide dedicated to heap snapshot

Slowness

How do you assert whether too many garbage collections are happening or causing an overhead?

Review the trace data, precisely the time between consecutive collections.
Review the trace data, specifically around time spent in GC.
If the time between two GC is less than the time spent in GC, the application is severely starving.
If the time between two GCS and the time spent in GC are very high, probably the application can use a smaller heap.
If the time between two GCS is much greater than the time spent in GC, the application is relatively healthy.

Fix the leak

Now let's fix the leak. Instead of using an object to store our entries, we could use a file.

Let's modify our script a bit:

// script-fix.mjs
import  from 'node:fs/promises';
import  from 'node:os';

let  = 1_000_000;
const  = `entries-${.()}`;

async function () {
  const  = {
    : .(),
    : .(),
    : .(),
    : .(),
  };
  await .(, .() + '\n');
}

async function () {
  const  = await .();
  .(`File size ${.} bytes`);
}

// execution
(async () => {
  await .(, '----START---\n');
  while ( > 0) {
    await ();
    ..(`~~> ${} entries to record\r`);
    --;
  }

  await ();
})();

Using a Set to store data is not a bad practice at all; you should just care about the memory footprint of your program.

Note: you can find the source code of this exercise in the Node.js Diagnostics repository.

Now, let's execute this script.

node --trace-gc script-fix.mjs

You should observe two things:

Mark-sweep events appear less frequently
the memory footprint doesn't exceed 25MB versus more than 130MB with the first script.

It makes a lot of sense as the new version puts less pressure on the memory than the first one.

Takeaway: What do you think about improving this script? You probably see that the new version of the script is slow. What if we use a Set again and write its content into a file only when the memory reaches a specific size?

getheapstatistics API could help you.

Bonus: Trace garbage collection programmatically

Using `v8` module

You might want to avoid getting traces from the entire lifetime of your process. In that case, set the flag from within the process. The v8 module exposes an API to put flags on the fly.

import  from 'v8';

// enabling trace-gc
.('--trace-gc');

// disabling trace-gc
.('--notrace-gc');

Using performance hooks

In Node.js, you can use performance hooks to trace garbage collection.

const {  } = ('node:perf_hooks');

// Create a performance observer
const  = new ( => {
  const  = .()[0];
  /*
  The entry is an instance of PerformanceEntry containing
  metrics of a single garbage collection event.
  For example:
  PerformanceEntry {
    name: 'gc',
    entryType: 'gc',
    startTime: 2820.567669,
    duration: 1.315709,
    kind: 1
  }
  */
});

// Subscribe to notifications of GCs
.({ : ['gc'] });

// Stop subscription
.();

Examining a trace with performance hooks

You can get GC statistics as PerformanceEntry from the callback in PerformanceObserver.

For example:

{
  "name": "gc",
  "entryType": "gc",
  "startTime": 2820.567669,
  "duration": 1.315709,
  "kind": 1
}

Property	Interpretation
name	The name of the performance entry.
entryType	The type of the performance entry.
startTime	The high-resolution millisecond timestamp is marking the starting time of the Performance Entry.
duration	The total number of milliseconds elapsed for this entry.
kind	The type of garbage collection operation that occurred.
flags	The additional information about GC.

For more information, you can refer to the documentation about performance hooks.

Introduction to Node.js