Use Tracing Channel for observability #4629

@logaretm

I came across #3133 and since we (Sentry and other APM providers) are driving an ecosystem effort to adopt tracing channels into the server runtimes, I think this is a good time to discuss this again, especially with the concerns noted there no longer being valid.

I'd like to propose adding first-class TracingChannel support to graphql-js, following the pattern established by undici in Node.js core.

Motivation

graphql-js is deliberately minimal: it provides the execution engine and nothing else. There are no built-in hooks, middleware, plugins, or tracing APIs. This is a strength for a reference implementation, but it means every APM tool must resort to monkey-patching to observe what the engine is doing.

Today, @opentelemetry/instrumentation-graphql patches parse, validate, and execute using import-in-the-middle for ESM and require-in-the-middle for CJS, then recursively traverses the entire schema at instrumentation time to wrap every field resolver's resolve function. Datadog's dd-trace does the same. Sentry does the same. Every APM vendor independently patches the same three functions and walks the same schema tree, an approach that is fragile, duplicative, and version-coupled.

Not to mention the broader ecosystem concerns:

  • Runtime lock-in: RITM and IITM rely on Node.js-specific module loader internals (Module._resolveFilename, module.register()). They don't work on Bun or Deno, which implement the Node.js API surface but not the module loader internals.
  • ESM fragility: IITM is built on Node.js's module customization hooks, which are still evolving and have been a persistent source of breakage in the OTEL JS ecosystem.
  • Initialization ordering: Both RITM and IITM require instrumentation to be set up before graphql is first require()'d / import'd.
  • Bundling: Users must ensure instrumented modules are externalized, which is increasingly difficult as frameworks bundle server-side code into single executables or deployment files.
  • Schema walking cost: Every APM tool recursively wraps every resolver on every type in the schema at setup time. This is O(fields) work that runs once per instrumented schema, and if the schema is rebuilt (e.g., in a gateway), it runs again. Native emission eliminates this entirely.

TracingChannel solves all of these. It provides structured lifecycle events (start, end, asyncStart, asyncEnd, error) with built-in async context propagation, zero-cost when no subscribers are attached, and a standardized subscription model that requires no monkey-patching.
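To make the lifecycle concrete, here is a minimal sketch of those events firing around a synchronous call. The channel name `example:operation` and the `name` context field are made up for illustration; only `tracingChannel`, `subscribe`, and `traceSync` come from node:diagnostics_channel.

```javascript
const dc = require('node:diagnostics_channel');

// 'example:operation' is a hypothetical channel name used only in this sketch.
const channel = dc.tracingChannel('example:operation');

// Record every lifecycle event in the order it is published.
const events = [];
channel.subscribe({
  start(ctx) { events.push(`start ${ctx.name}`); },
  end(ctx) { events.push(`end ${ctx.name}`); },
  asyncStart(ctx) { events.push(`asyncStart ${ctx.name}`); },
  asyncEnd(ctx) { events.push(`asyncEnd ${ctx.name}`); },
  error(ctx) { events.push(`error ${ctx.name}: ${ctx.error.message}`); },
});

// Successful synchronous work: start, then end.
const result = channel.traceSync(() => 42, { name: 'sync' });

// Failing synchronous work: start, error, end, and the error is rethrown.
try {
  channel.traceSync(() => { throw new Error('boom'); }, { name: 'failing' });
} catch {
  // Swallowed here only so the sketch keeps running.
}

console.log(result, events);
```

For promise-returning work, tracePromise publishes the same start and end events and additionally publishes asyncStart and asyncEnd once the promise settles.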

Tracing channels already solve many issues with standard event emitters:

  • No overhead when no subscribers are listening.
  • Tracing channels can be acquired from anywhere, anytime in the code, so timing and load order won't be an issue.
  • Compatible with all server runtimes we have today (more on that below).
  • Automatically handles correlation and context propagation, so no need to track executions with requestIds and no need to do the span relationship dance.
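The load-order and context-propagation points can be sketched as follows. This is an illustration under assumptions: the channel name `example:late-bound`, the `libraryOperation` helper, and the store shape are hypothetical, and `bindStore` is still marked experimental in Node.js.

```javascript
const dc = require('node:diagnostics_channel');
const { AsyncLocalStorage } = require('node:async_hooks');

// 'example:late-bound' is a hypothetical channel name for this sketch.
const CHANNEL_NAME = 'example:late-bound';

// "Library" side: defined first, and it looks the channel up by name when called.
function libraryOperation(label, work) {
  return dc.tracingChannel(CHANNEL_NAME).traceSync(work, { label });
}

// "Subscriber" side: attaches afterwards, without access to the library's channel object.
const requestContext = new AsyncLocalStorage();
const subscriberChannel = dc.tracingChannel(CHANNEL_NAME);
const started = [];
subscriberChannel.subscribe({
  start(ctx) { started.push(ctx.label); },
});

// bindStore makes the transform's return value the active store for the traced call.
subscriberChannel.start.bindStore(requestContext, (ctx) => ({ operation: ctx.label }));

// Code inside the traced call reads the store without any id being passed around.
const observed = libraryOperation('lookup', () => requestContext.getStore());
const outside = requestContext.getStore(); // undefined outside the traced call

console.log(started, observed, outside);
```

Because channels are resolved by name, neither side needs a reference to the other, and the subscriber does not have to be loaded before the library.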

Cross-platform compatibility

A previous discussion (#3133) raised the concern that diagnostics_channel is Node.js-specific and graphql-js targets multiple platforms. This concern was valid in 2021 but is no longer accurate:

  • Bun supports node:diagnostics_channel including TracingChannel (Bun docs)
  • Deno supports node:diagnostics_channel via its Node.js compatibility layer (Deno docs), and also supports the TracingChannel API.
  • Cloudflare Workers also have the same level of compatibility.

This means that every server-side JavaScript runtime that runs graphql-js in production now supports diagnostics_channel. Browser environments don't need APM tracing at the execution engine level, since browsers usually run GraphQL clients, not servers. Even then, the API can be loaded conditionally and used only when available, so isomorphic code remains possible.

The standard compatibility pattern used across the ecosystem handles this cleanly:

let dc;
try {
  dc = ('getBuiltinModule' in process)
    ? process.getBuiltinModule('node:diagnostics_channel')
    : require('node:diagnostics_channel');
} catch {
  // No diagnostics_channel available — all tracing is a no-op
}

This is zero-cost when diagnostics_channel is unavailable: no import, no overhead, no behavior change.
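As a rough sketch of how a library could build on that lookup, the helper below wraps a function call in a tracing channel when the module is present and calls the function directly otherwise. `createTracer` and the channel name `example:parse` are hypothetical names for illustration, not part of graphql-js or Node.js.

```javascript
// Resolve diagnostics_channel defensively, mirroring the pattern above.
let dc;
try {
  dc = ('getBuiltinModule' in process)
    ? process.getBuiltinModule('node:diagnostics_channel')
    : require('node:diagnostics_channel');
} catch {
  dc = undefined; // tracing becomes a no-op
}

// Hypothetical helper: returns a function that traces `fn` when possible.
function createTracer(dcModule, name) {
  const channel = dcModule && typeof dcModule.tracingChannel === 'function'
    ? dcModule.tracingChannel(name)
    : undefined;
  return function trace(fn, context) {
    // Without a channel, the wrapped function simply runs untraced.
    return channel ? channel.traceSync(fn, context) : fn();
  };
}

// With the real module (when available), calls go through the channel.
const traceParse = createTracer(dc, 'example:parse');
const tokens = traceParse(() => '{ hello }'.split(' '), { source: '{ hello }' });

// With no module at all, the same call site still works.
const untraced = createTracer(undefined, 'example:parse');
const fallback = untraced(() => 1, {});

console.log(tokens, fallback);
```

The call site is identical in both cases, so the tracing path never needs its own branch in library code.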

Runtime Overhead

Tracing channels are designed to handle far more events than EventEmitter, and they can be optimized to have zero overhead when there are no listeners. At runtime, hasSubscribers can be checked before constructing any context objects.
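A minimal sketch of that guard, assuming a hypothetical channel name (`example:guarded`) and a stand-in operation. Since hasSubscribers on TracingChannel only exists in newer Node.js releases, the sketch falls back to checking the five underlying channels.

```javascript
const dc = require('node:diagnostics_channel');

// 'example:guarded' is a hypothetical channel name for this sketch.
const channel = dc.tracingChannel('example:guarded');

// Use TracingChannel#hasSubscribers when present, else check each sub-channel.
function hasSubscribers(tc) {
  if (typeof tc.hasSubscribers === 'boolean') return tc.hasSubscribers;
  return [tc.start, tc.end, tc.asyncStart, tc.asyncEnd, tc.error]
    .some((ch) => ch.hasSubscribers);
}

let contextsBuilt = 0;
function buildContext(source) {
  contextsBuilt += 1; // stands in for any costly context construction
  return { source };
}

// Hypothetical traced operation: counts whitespace-separated tokens.
function countTokens(source) {
  const work = () => source.trim().split(/\s+/).length;
  if (!hasSubscribers(channel)) return work(); // fast path: no context object
  return channel.traceSync(work, buildContext(source));
}

const before = countTokens('query { hello }');
const builtBefore = contextsBuilt; // still 0: nobody is listening

channel.subscribe({ start() {} });
const after = countTokens('query { hello }');
const builtAfter = contextsBuilt; // 1: a context was built for the subscriber

console.log({ before, builtBefore, after, builtAfter });
```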

Implementation

We can use OTEL's instrumentation as the baseline for what we need, since it is the most widely used GraphQL instrumentation among APM providers.

All channels use the Node.js TracingChannel API, which provides start, end, asyncStart, asyncEnd, and error sub-channels automatically.

| TracingChannel | Tracks | Context fields |
| --- | --- | --- |
| graphql:execute | execute(): full operation execution lifecycle | operationType, operationName, document, schema, variableValues |
| graphql:parse | parse(): query string to AST | source |
| graphql:validate | validate(): AST validation against schema | document, schema |
| graphql:resolve | Individual field resolver execution | fieldName, fieldPath, fieldType, parentType, args |
| graphql:subscribe | subscribe(): subscription setup | operationType, operationName, document, schema, variableValues |

We can spec out the types and exact fields in each channel's context in a PR.

Usage in the Ecosystem

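const dc = require('node:diagnostics_channel');
const { print } = require('graphql');
const { trace, SpanStatusCode } = require('@opentelemetry/api');

// Assumption: any OpenTelemetry tracer works here; the tracer name is illustrative.
const tracer = trace.getTracer('graphql-tracing-channel-example');

// Subscribe to operation execution, which produces the primary span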
dc.tracingChannel('graphql:execute').subscribe({
  start(ctx) {
    ctx.span = tracer.startSpan(`${ctx.operationType} ${ctx.operationName}`, {
      attributes: {
        'graphql.operation.type': ctx.operationType,
        'graphql.operation.name': ctx.operationName,
        'graphql.source': print(ctx.document),
      },
    });
  },
  asyncEnd(ctx) {
    ctx.span?.end();
  },
  error(ctx) {
    ctx.span?.setStatus({ code: SpanStatusCode.ERROR, message: ctx.error?.message });
    ctx.span?.recordException(ctx.error);
  },
});

That's it. All interested parties, including the user, can subscribe to the other channels in the same way and run their own telemetry and logic.
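As a further illustration, a subscriber for the proposed graphql:resolve channel might collect per-field timings like this. graphql-js does not emit this channel today, so the traceSync call below stands in for the emission the library would perform. The context values, in particular plain strings for fieldType and parentType, are assumptions, since the exact shapes are still to be specified.

```javascript
const dc = require('node:diagnostics_channel');

// Collect one timing record per resolved field.
const resolved = [];
dc.tracingChannel('graphql:resolve').subscribe({
  start(ctx) {
    ctx.startedAt = performance.now();
  },
  end(ctx) {
    resolved.push({
      field: `${ctx.parentType}.${ctx.fieldName}`,
      ms: performance.now() - ctx.startedAt,
      failed: 'error' in ctx, // traceSync sets ctx.error before publishing end
    });
  },
});

// Stand-in for graphql-js: emulate the resolver call the engine would trace.
// All context values here are hypothetical.
const value = dc.tracingChannel('graphql:resolve').traceSync(() => 'Ada', {
  fieldName: 'name',
  fieldPath: 'user.name',
  fieldType: 'String',
  parentType: 'User',
  args: {},
});

console.log(value, resolved);
```

For resolvers that return promises, a subscriber would record the duration in asyncEnd instead, since end fires as soon as the synchronous part returns.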

Prior Art

This approach follows the same pattern already adopted or in progress by other major libraries:

Full disclosure: I'm leading the effort on most of these implementations.


I would love to hear whether there's appetite for this. I'm happy to spec it in a PR and fully own it until it ships.

I think this would be a great milestone for v17. The code changes are minimal, and as some of those PRs show, it takes a few minutes, or a few hours at most, to make a library or framework observability-friendly with tracing channels.
