Stream debug_traceBlock* responses directly to avoid OOM on large blocks#9848

Draft
daniellehrner wants to merge 3 commits into hyperledger:main from daniellehrner:feat/trace-streaming

Conversation

@daniellehrner daniellehrner commented Feb 19, 2026

PR description

Converts debug_traceBlockByNumber, debug_traceBlockByHash, and debug_traceBlock from accumulate-then-serialize to stream-as-you-go. Previously, these methods built the entire JSON response in memory (via TransactionTrace + DebugTraceTransactionResult), which OOMs on blocks with many transactions or complex traces. Now, structLogs are written directly to the HTTP/WebSocket output stream during EVM execution.
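The difference between the two strategies can be sketched in miniature (helper names here are illustrative, not the PR's actual code; the real implementation streams structLogs to the socket during EVM execution rather than building strings):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamVsAccumulate {

  // Accumulate-then-serialize: every entry stays in memory until the end,
  // so peak memory grows with the size of the trace.
  static String traceAccumulated(int steps) {
    StringBuilder all = new StringBuilder("[");
    for (int pc = 0; pc < steps; pc++) {
      if (pc > 0) all.append(',');
      all.append("{\"pc\":").append(pc).append('}');
    }
    return all.append(']').toString();
  }

  // Stream-as-you-go: each entry is written to the output stream immediately,
  // so memory use is independent of the number of steps.
  static void writeTrace(int steps, OutputStream out) throws IOException {
    out.write('[');
    for (int pc = 0; pc < steps; pc++) {
      if (pc > 0) out.write(',');
      out.write(("{\"pc\":" + pc + "}").getBytes(StandardCharsets.UTF_8));
    }
    out.write(']');
  }

  // Convenience wrapper for the demo; a real server passes the socket stream.
  static String traceStreamed(int steps) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
      writeTrace(steps, out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Both strategies produce the same bytes; only peak memory differs.
    System.out.println(traceStreamed(3));    // [{"pc":0},{"pc":1},{"pc":2}]
    System.out.println(traceAccumulated(3)); // [{"pc":0},{"pc":1},{"pc":2}]
  }
}
```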

Infrastructure changes:

  • New StreamingJsonRpcMethod interface: streaming methods implement streamResponse(request, outputStream, mapper) instead of response(request)
  • JsonRpcMethod.isStreaming() marker allows the HTTP/WS handlers to dispatch to the streaming path
  • streamProcess() added to the JsonRpcProcessor chain (Base, Timed, Traced, Authenticated) so streaming requests get the same metrics, tracing, and auth as regular requests
  • JsonRpcObjectExecutor and WebSocketMessageHandler detect streaming methods and write directly to the response stream
  • JsonRpcExecutor shared preamble extracted into prepareExecution() to eliminate duplication between execute() and executeStreaming()
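The dispatch mechanism might look roughly like this. These are simplified stand-ins, not Besu's actual interfaces (which take request/response and mapper types); only the isStreaming() marker and the two entry points mirror the description above:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Simplified stand-in for JsonRpcMethod.
interface RpcMethod {
  default boolean isStreaming() { return false; }
  String response(String request);
}

// Simplified stand-in for StreamingJsonRpcMethod: overrides the marker and
// writes its payload directly to the output stream.
interface StreamingRpcMethod extends RpcMethod {
  @Override default boolean isStreaming() { return true; }

  @Override default String response(String request) {
    throw new UnsupportedOperationException("streaming method");
  }

  void streamResponse(String request, OutputStream out) throws IOException;
}

public class RpcDispatchSketch {

  // The handler checks the marker and takes the streaming path if it is set.
  static String execute(RpcMethod method, String request) {
    if (method.isStreaming()) {
      ByteArrayOutputStream out = new ByteArrayOutputStream(); // stands in for the HTTP/WS stream
      try {
        ((StreamingRpcMethod) method).streamResponse(request, out);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      return out.toString(StandardCharsets.UTF_8);
    }
    return method.response(request);
  }

  public static void main(String[] args) {
    StreamingRpcMethod trace =
        (req, out) -> out.write("{\"structLogs\":[]}".getBytes(StandardCharsets.UTF_8));
    System.out.println(execute(trace, "debug_traceBlockByNumber")); // {"structLogs":[]}
  }
}
```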

Trace-specific changes:

  • AbstractDebugTraceBlock base class with shared option parsing and response writing
  • DebugTraceBlockStreamer — the core streaming engine. For OPCODE_TRACER, writes structLogs during execution via a frame callback. For other tracers (callTracer, etc.), accumulates per-transaction then serializes
  • DebugOperationTracer.StreamingFrameWriter — zero-allocation callback that passes raw trace data + live MessageFrame reference, bypassing TraceFrame/Builder/Optional construction entirely
  • Reusable StringBuilder for hex conversion (StructLog.toCompactHex overload), memory read via readMutableMemory() (view, no copy), storage read directly from getUpdatedStorage() — eliminates ~130+ transient objects per opcode step
  • EthScheduler dependency removed from debug trace method constructors (no longer needed without async dispatch)
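The reusable-builder idea can be illustrated with a self-contained sketch. This is a guess at the semantics of a compact-hex conversion (leading zeros stripped), not Besu's actual StructLog.toCompactHex implementation:

```java
public class CompactHexSketch {
  private static final char[] HEX = "0123456789abcdef".toCharArray();

  // Appends the compact hex form ("0x2a", not "0x002a"; "0x0" for zero) of a
  // word into a caller-supplied StringBuilder. Reusing one builder across
  // opcode steps avoids allocating a fresh String/char[] per stack item.
  static StringBuilder toCompactHex(StringBuilder sb, byte[] word) {
    sb.setLength(0); // reset instead of reallocating
    sb.append("0x");
    boolean significant = false;
    for (byte b : word) {
      int hi = (b >> 4) & 0xf;
      int lo = b & 0xf;
      if (significant || hi != 0) { sb.append(HEX[hi]); significant = true; }
      if (significant || lo != 0) { sb.append(HEX[lo]); significant = true; }
    }
    if (!significant) sb.append('0');
    return sb;
  }

  public static void main(String[] args) {
    StringBuilder scratch = new StringBuilder(66); // one builder, reused every step
    System.out.println(toCompactHex(scratch, new byte[] {0x00, 0x2a})); // 0x2a
    System.out.println(toCompactHex(scratch, new byte[] {0x00}));       // 0x0
  }
}
```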

Breaking changes

JSON field ordering changed for debug_traceBlock* with OPCODE_TRACER (default tracer):

Before:

  {"txHash": "0x...", "result": {"gas": 21000, "failed": false, "returnValue": "", "structLogs": [...]}}

After:

  {"txHash": "0x...", "result": {"structLogs": [...], "gas": 21000, "failed": false, "returnValue": ""}}

gas, failed, and returnValue now appear after structLogs because they are only known once execution completes. JSON-RPC clients that parse by key name (the standard approach) are unaffected; clients that depend on field ordering will break.

Batch requests containing streaming methods now return {"error": {"code": -32600, "message": "Invalid request"}} for those methods instead of crashing the batch. This is new behavior, but it is doubtful that tracing more than one block in parallel was possible without OOM anyway.

Performance tests

I ran the following script, which traces 10 consecutive recent blocks, to compare the current implementation on main against this PR:

#!/bin/bash
RPC_URL="http://localhost:8545"

if [ -z "$1" ]; then
  echo "Usage: $0 <start_block>" >&2
  exit 1
fi

START="$1"

echo "Testing debug_traceBlockByNumber streaming (10 blocks starting at $START)"
echo "RPC: $RPC_URL"
echo "---"
printf "%-12s %10s %10s %12s\n" "Block" "Total(s)" "TTFB(s)" "Size(MB)"
echo "---"

total_time=0
total_size=0

for i in $(seq 0 9); do
  block=$(printf "0x%X" $(( START + i )))
  result=$(curl -o /dev/null -w "%{time_total} %{time_starttransfer} %{size_download}" \
    -s -X POST "$RPC_URL" \
    -H 'Content-Type: application/json' \
    -d "{\"jsonrpc\":\"2.0\",\"method\":\"debug_traceBlockByNumber\",\"params\":[\"$block\"],\"id\":1}")
  time=$(echo "$result" | awk '{print $1}')
  ttfb=$(echo "$result" | awk '{print $2}')
  size=$(echo "$result" | awk '{print $3}')
  mb=$(echo "scale=2; $size/1048576" | bc)
  printf "%-12s %10.3f %10.3f %10.2fMB\n" "$block" "$time" "$ttfb" "$mb"
  total_time=$(echo "$total_time + $time" | bc)
  total_size=$(echo "$total_size + $size" | bc)
done

avg_time=$(echo "scale=3; $total_time/10" | bc)
avg_mb=$(echo "scale=2; $total_size/10/1048576" | bc)
echo "---"
printf "%-12s %10.3f %10s %10.2fMB\n" "Average" "$avg_time" "" "$avg_mb"
printf "%-12s %10.3f %10s %10.2fMB\n" "Total" "$total_time" "" "$(echo "scale=2; $total_size/1048576" | bc)"

On the feature node, which includes this PR, we got:

./test_trace_block.sh 24496058
Testing debug_traceBlockByNumber streaming (10 blocks starting at 24496058)
RPC: http://localhost:8545
---
Block          Total(s)    TTFB(s)     Size(MB)
---
0x175C7BA         6.602      0.020     942.23MB
0x175C7BB        12.770      0.004    1860.28MB
0x175C7BC         4.399      0.003     524.80MB
0x175C7BD         5.358      0.003     643.87MB
0x175C7BE        13.290      0.003    2283.97MB
0x175C7BF        15.382      0.003    2586.28MB
0x175C7C0         2.897      0.002     349.54MB
0x175C7C1         4.722      0.002     615.65MB
0x175C7C2         4.555      0.002     604.91MB
0x175C7C3        16.937      0.004    2753.11MB
---
Average           8.691               1316.46MB
Total            86.913              13164.68MB

On the control node, which runs main, we got:

./test_trace_block.sh 24496058
Testing debug_traceBlockByNumber streaming (10 blocks starting at 24496058)
RPC: http://localhost:8545
---
Block          Total(s)    TTFB(s)     Size(MB)
---
0x175C7BA         9.488      6.427     942.23MB
0x175C7BB        11.434      4.777    1860.28MB
0x175C7BC         4.623      2.952     524.80MB
0x175C7BD         5.753      3.696     643.87MB
0x175C7BE        13.923      5.766    2283.97MB
0x175C7BF        30.170     30.170       0.00MB
0x175C7C0        39.813     39.813       0.00MB
0x175C7C1        38.678     38.678       0.00MB
^C

The control node crashed during execution, as can be seen from the 0 bytes returned for the last three blocks.

TTFB (time to first byte) is only a few milliseconds on the feature node but several seconds on the control node, showing that streaming works correctly and starts sending data almost immediately.

The total response time varies between the two without a clear winner.

During the tests we saw the following memory consumption:

[Screenshot: memory consumption and GC time during the test runs, 2026-02-20]

We see the expected spikes in GC time and overall memory consumption on the control node. As memory consumption grows by several GB, the node eventually hits an OOM error and crashes.

On the feature node, GC time increases slightly, but overall memory consumption stays relatively flat, as expected: streaming keeps only a small amount of data in memory before writing it to the socket and discarding it.

Fixed Issue(s)

Thanks for sending a pull request! Have you done the following?

  • Checked out our contribution guidelines?
  • Considered documentation and added the doc-change-required label to this PR if updates are required.
  • Considered the changelog and included an update if required.
  • For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

  • spotless: ./gradlew spotlessApply
  • unit tests: ./gradlew build
  • acceptance tests: ./gradlew acceptanceTest
  • integration tests: ./gradlew integrationTest
  • reference tests: ./gradlew ethereum:referenceTests:referenceTests
  • hive tests: Engine or other RPCs modified?

Signed-off-by: daniellehrner <daniel.lehrner@consensys.net>