Description
I came across #3133 and since we (Sentry and other APM providers) are driving an ecosystem effort to adopt tracing channels into the server runtimes, I think this is a good time to discuss this again, especially with the concerns noted there no longer being valid.
I'd like to propose adding first-class TracingChannel support to graphql-js, following the pattern established by undici in Node.js core.
Motivation
graphql-js is deliberately minimal: it provides the execution engine and nothing else. There are no built-in hooks, middleware, plugins, or tracing APIs. This is a strength for a reference implementation, but it means every APM tool must resort to monkey-patching to observe what the engine is doing.
Today, @opentelemetry/instrumentation-graphql patches parse, validate, and execute using import-in-the-middle for ESM and require-in-the-middle for CJS, then recursively traverses the entire schema at instrumentation time to wrap every field resolver's resolve function. Datadog's dd-trace does the same. Sentry does the same. Every APM vendor independently patches the same three functions and walks the same schema tree, which is fragile, duplicative, and version-coupled.
Not to mention the broader ecosystem concerns:
- Runtime lock-in: RITM and IITM rely on Node.js-specific module loader internals (`Module._resolveFilename`, `module.register()`). They don't work on Bun or Deno, which implement the Node.js API surface but not the module loader internals.
- ESM fragility: IITM is built on Node.js's module customization hooks, which are still evolving and have been a persistent source of breakage in the OTEL JS ecosystem.
- Initialization ordering: Both require instrumentation to be set up before `graphql` is first `require()`'d / `import`'d.
- Bundling: Users must ensure instrumented modules are externalized, which is increasingly difficult as frameworks bundle server-side code into single executables or deployment files.
- Schema walking is expensive. Every APM tool recursively wraps every resolver on every type in the schema at setup time. This is O(fields) work that runs once per instrumented schema — and if the schema is rebuilt (e.g., in a gateway), it runs again. Native emission eliminates this entirely.
TracingChannel solves all of these. It provides structured lifecycle events (start, end, asyncStart, asyncEnd, error) with built-in async context propagation, zero-cost when no subscribers are attached, and a standardized subscription model that requires no monkey-patching.
Tracing channels already solve many issues with standard event emitters:
- No overhead when no subscribers are listening.
- Tracing channels can be acquired from anywhere, anytime in the code, so timing and load order won't be an issue.
- Compatible with all server runtimes we have today (more on that below).
- Automatically handles correlation and context propagation, so no need to track executions with requestIds and no need to do the span relationship dance.
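As a minimal sketch of that lifecycle (the channel name `example:work` and the `doWork` function are hypothetical and exist only for illustration), the following uses `tracePromise` from `node:diagnostics_channel` and records which events fire. The same context object is handed to every handler, which is what lets a subscriber attach a span in `start` and close it in `asyncEnd` without any request-ID bookkeeping.

```javascript
const dc = require('node:diagnostics_channel');

// Hypothetical channel name, used only for this illustration.
const channel = dc.tracingChannel('example:work');

const events = [];
const contexts = new Set();

// Build a handler that records the event name and the context it received.
const record = (name) => (ctx) => {
  events.push(name);
  contexts.add(ctx);
};

channel.subscribe({
  start: record('start'),
  end: record('end'),
  asyncStart: record('asyncStart'),
  asyncEnd: record('asyncEnd'),
  error: record('error'),
});

// Hypothetical unit of async work being traced.
async function doWork() {
  return 42;
}

async function main() {
  // tracePromise publishes the lifecycle events around doWork.
  const result = await channel.tracePromise(doWork, { label: 'demo' });
  return { result, events, sharedContext: contexts.size === 1 };
}
```

Calling `main()` resolves once the traced promise has settled; for a successful call the `error` handler is never invoked, and `sharedContext` reports whether every handler saw the same context object.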
Cross-platform compatibility
A previous discussion (#3133) raised the concern that diagnostics_channel is Node.js-specific and graphql-js targets multiple platforms. This concern was valid in 2021 but is no longer accurate:
- Bun supports `node:diagnostics_channel` including `TracingChannel` (Bun docs)
- Deno supports `node:diagnostics_channel` via its Node.js compatibility layer (Deno docs), and also supports the `TracingChannel` API.
- Cloudflare Workers also have the same level of compatibility.
This means that every server-side JavaScript runtime that runs graphql-js in production now supports diagnostics_channel. Browser environments don't need APM tracing at the execution engine level, since browsers usually run GraphQL clients, not servers. Even then, the API can be loaded conditionally and used only when it is available, so isomorphic code remains possible.
The standard compatibility pattern used across the ecosystem handles this cleanly:
```javascript
let dc;
try {
  dc = ('getBuiltinModule' in process)
    ? process.getBuiltinModule('node:diagnostics_channel')
    : require('node:diagnostics_channel');
} catch {
  // No diagnostics_channel available: all tracing is a no-op
}
```

This is zero-cost when diagnostics_channel is unavailable: no import, no overhead, no behavior change.
Runtime Overhead
Tracing channels are specifically designed to handle far more events than EventEmitter, and they can be optimized to have zero overhead when there are no listeners. At runtime we have access to hasSubscribers, which can be checked before constructing any context objects.
Implementation
We can use OTEL's instrumentation as the bare minimum, since it is the most widely used GraphQL instrumentation among APM providers.
All channels use the Node.js TracingChannel API, which provides start, end, asyncStart, asyncEnd, and error sub-channels automatically.
| TracingChannel | Tracks | Context fields |
|---|---|---|
| `graphql:execute` | `execute()`: full operation execution lifecycle | `operationType`, `operationName`, `document`, `schema`, `variableValues` |
| `graphql:parse` | `parse()`: query string to AST | `source` |
| `graphql:validate` | `validate()`: AST validation against schema | `document`, `schema` |
| `graphql:resolve` | Individual field resolver execution | `fieldName`, `fieldPath`, `fieldType`, `parentType`, `args` |
| `graphql:subscribe` | `subscribe()`: subscription setup | `operationType`, `operationName`, `document`, `schema`, `variableValues` |
We can spec out the types and exact fields in each channel's context in a PR.
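To make the shape of one of these channels concrete, here is a rough sketch of how emission for `graphql:parse` might look. Only the channel name and the `source` context field come from the table; `parseImpl` is a hypothetical stand-in for the real parser (so the snippet runs without graphql installed), and `tracedParse` is a hypothetical wrapper, not a proposed API.

```javascript
const dc = require('node:diagnostics_channel');

// Channel name taken from the table; everything else here is illustrative.
const parseChannel = dc.tracingChannel('graphql:parse');

// Hypothetical stand-in for the real parser.
function parseImpl(source) {
  if (typeof source !== 'string' || !source.trim().startsWith('{')) {
    throw new SyntaxError('Unexpected token');
  }
  return { kind: 'Document', body: source };
}

// Hypothetical traced wrapper: publishes start/end (and error on throw) around parseImpl.
function tracedParse(source) {
  return parseChannel.traceSync(parseImpl, { source }, undefined, source);
}

const log = [];
parseChannel.subscribe({
  start(ctx) {
    log.push(['start', ctx.source]);
  },
  end() {
    log.push(['end']);
  },
  error(ctx) {
    log.push(['error', ctx.error.message]);
  },
});

// Successful parse: the wrapper returns whatever parseImpl returns.
const ast = tracedParse('{ hello }');

// Failing parse: the error is published to subscribers and then rethrown.
let thrown;
try {
  tracedParse('not a query');
} catch (err) {
  thrown = err;
}
```

The wrapper does not swallow or alter errors, so callers see the same behavior whether or not anyone is subscribed.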
Usage in the Ecosystem
```javascript
const dc = require('node:diagnostics_channel');
const { print } = require('graphql');
const { trace, SpanStatusCode } = require('@opentelemetry/api');

// Assumes an OpenTelemetry tracer; the tracer name is illustrative
const tracer = trace.getTracer('graphql');

// Subscribe to operation execution, the primary span
dc.tracingChannel('graphql:execute').subscribe({
  start(ctx) {
    ctx.span = tracer.startSpan(`${ctx.operationType} ${ctx.operationName}`, {
      attributes: {
        'graphql.operation.type': ctx.operationType,
        'graphql.operation.name': ctx.operationName,
        'graphql.source': print(ctx.document),
      },
    });
  },
  asyncEnd(ctx) {
    ctx.span?.end();
  },
  error(ctx) {
    ctx.span?.setStatus({ code: SpanStatusCode.ERROR, message: ctx.error?.message });
    ctx.span?.recordException(ctx.error);
  },
});
```

That's it. All interested parties, including the user, can subscribe to the other channels in the same way and run their own telemetry and logic.
Prior Art
This approach follows the same pattern already adopted or in progress by other major libraries:
- `undici` (Node.js core): ships `TracingChannel` support since Node 20.12 (`undici:request`)
- `fastify`: ships `TracingChannel` support natively (`tracing:fastify.request.handler`)
- `node-redis`: redis/node-redis#3195 (`node-redis:command`, `node-redis:connect`)
- `ioredis`: redis/ioredis#2089 (`ioredis:command`, `ioredis:connect`)
- `pg` / `pg-pool`: brianc/node-postgres#3624 (`pg:query`, `pg:connection`, `pg:pool:connect`)
- `mysql2`: sidorares/node-mysql2#4178 (`mysql2:query`, `mysql2:execute`, `mysql2:connect`, `mysql2:pool:connect`)

Full disclosure: I'm leading the effort on most of these implementations.
I would love to hear whether there's appetite for this. I'm happy to spec this in a PR and fully own it until it ships.
I think this would be a great milestone for v17. The code changes are minimal, and as some of those PRs show, it takes a few minutes, or a few hours at most, to make a library or framework observable with tracing channels.