
Fix memory corruption/JVM crash in RediSearch by copying pooled ByteBuffers#3664

Open
YoHanKi wants to merge 5 commits into redis:main from
YoHanKi:fix/redisearch-encoded-complex-output-buffer-corruption

Conversation


@YoHanKi YoHanKi commented Feb 10, 2026

Make sure that:

  • You have read the contribution guidelines.
  • You have created a feature request first to discuss your contribution intent. Please reference the feature request ticket number in the pull request.
  • You applied code formatting rules using the mvn formatter:format target. Don’t submit any formatting-related changes.
  • You submit test cases (unit or integration tests) that back your changes.

NOTE: This PR was split out from #3624 to keep concerns isolated and reviews focused.
cc @a-TODO-rov

Description

This PR fixes a memory-safety issue in RediSearch response parsing that can lead to data corruption or JVM crashes (SIGSEGV) under Netty pooled buffer reuse scenarios.

Problem

EncodedComplexOutput stored read-only views of Netty pooled ByteBuffers.
Because pooled buffers can be reclaimed/reused before parsing completes, the stored views may later point to mutated/freed memory, causing:

  • corrupted parsed results
  • intermittent SIGSEGV / JVM crash (reported especially on Linux/Ubuntu with Netty 4.2+)
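The aliasing hazard can be sketched with plain JDK ByteBuffers; this is a simplified stand-in for Netty's pooled buffers, not the actual Lettuce code:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ReadOnlyViewHazard {

    public static void main(String[] args) {
        // Simulates a pooled buffer: the backing array is owned by the pool.
        byte[] pooled = "hello".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer source = ByteBuffer.wrap(pooled);

        // A read-only view shares the same backing memory; it is not a copy.
        ByteBuffer view = source.asReadOnlyBuffer();

        // The pool "reuses" the buffer for unrelated data before parsing finishes.
        pooled[0] = (byte) 'X';

        byte[] seen = new byte[view.remaining()];
        view.get(seen);
        // The stored view silently observes the mutation.
        System.out.println("view sees: " + new String(seen, StandardCharsets.US_ASCII));
        // prints "view sees: Xello"
    }
}
```

With real Netty pooled direct buffers, the reused memory may even have been unmapped, which is where the SIGSEGV rather than mere corruption comes from.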

Solution

  • Update EncodedComplexOutput to copy the ByteBuffer contents into a heap-allocated buffer before storing it.
  • This ensures the data remains valid for the duration of parsing regardless of pooled buffer lifecycle.
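A minimal sketch of the copying approach (the `deepCopy` helper is illustrative, not the exact EncodedComplexOutput code): copy the remaining bytes into a fresh heap buffer, leaving the source's position untouched:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SafeCopy {

    /** Copies the remaining bytes of {@code src} into an independent heap buffer. */
    static ByteBuffer deepCopy(ByteBuffer src) {
        ByteBuffer copy = ByteBuffer.allocate(src.remaining());
        // duplicate() so the source's position/limit are left untouched
        copy.put(src.duplicate());
        copy.flip();
        return copy;
    }

    public static void main(String[] args) {
        byte[] pooled = "payload".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer source = ByteBuffer.wrap(pooled);

        ByteBuffer copy = deepCopy(source);

        // Simulate the pool reclaiming/reusing the memory.
        Arrays.fill(pooled, (byte) '#');

        byte[] seen = new byte[copy.remaining()];
        copy.get(seen);
        // The copy is unaffected by the pool's reuse.
        System.out.println("copy sees: " + new String(seen, StandardCharsets.US_ASCII));
        // prints "copy sees: payload"
    }
}
```

The trade-off is one extra allocation and copy per stored value, which is the usual price for decoupling parsed output from the transport buffer lifecycle.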

Verification

  • Added testSearchWithLargeJsonPayloads to RediSearchIntegrationTests.
  • Verified in an Ubuntu-based Docker environment using:
    • Netty 4.2.4.Final
    • leak detection: PARANOID
  • The stress test reliably reproduces the risk scenario and validates stability after the fix.

Changes

  • EncodedComplexOutput: implemented ByteBuffer copying for persistence.
  • RediSearchIntegrationTests: added stress test for large JSON payloads.

Split from #3624

…uffer

- Update EncodedComplexOutput to store a copy of ByteBuffer instead of a read-only view
- Add integration test for FT.SEARCH with large JSON payloads to verify stability
@a-TODO-rov
Contributor

Thank you @YoHanKi
Will review it soon.

@a-TODO-rov a-TODO-rov linked an issue Feb 11, 2026 that may be closed by this pull request
Collaborator

@atakavci atakavci left a comment


Hey @YoHanKi, thanks for your contribution!
Looks good to me.
cc: @a-TODO-rov

Contributor

@a-TODO-rov a-TODO-rov left a comment


Well, I ran my reproducer and it seems the memory corruption still exists ... I didn't have time to analyze further.

Reproducer

// Imports reconstructed for completeness; the io.lettuce.core.search.arguments
// package names are my best guess for Lettuce 6.8+ and may differ in your tree.
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.opentest4j.AssertionFailedError;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.json.JsonPath;
import io.lettuce.core.protocol.DecodeBufferPolicies;
import io.lettuce.core.protocol.ProtocolVersion;
import io.lettuce.core.search.arguments.CreateArgs;
import io.lettuce.core.search.arguments.NumericFieldArgs;
import io.lettuce.core.search.arguments.SearchArgs;
import io.lettuce.core.search.arguments.SortByArgs;

public class RediSearchBugProver {

    private static final Logger log = LoggerFactory.getLogger(RediSearchBugProver.class);

    public static void main(String[] args) {
        // System.setProperty("org.slf4j.simpleLogger.log.io.lettuce.core.protocol", "trace");
        RedisURI uri = RedisURI.Builder.redis("localhost", 6379)
                .withTimeout(Duration.ofSeconds(30))
                .build();
        try (var client = RedisClient.create(uri)) {
            client.setOptions(ClientOptions.builder()
                    // increase it drastically just to be sure discardBytesRatio is not the cause
                    .decodeBufferPolicy(DecodeBufferPolicies.ratio(Integer.MAX_VALUE / 2.0f))
                    // ft.search result on RESP2 is an array, on RESP3 a map, so the error differs
                    .protocolVersion(ProtocolVersion.RESP3)
                    .build());
            try (var connection = client.connect()) {
                runTest(connection, String.valueOf(System.currentTimeMillis()));
            }
        }
    }

    private static void runTest(StatefulRedisConnection<String, String> con, String space) {
        var prefix = "test-" + space + ":";
        var index = "idx-" + space;

        log.info("===> running with {}, {}", prefix, index);
        con.sync().ftCreate(
                index,
                CreateArgs.<String, String>builder()
                        .on(CreateArgs.TargetType.JSON)
                        .withPrefix(prefix)
                        .build(),
                List.of(NumericFieldArgs.<String>builder().name("pos").build())
        );
        var searchArgs = SearchArgs.<String, String>builder()
                .sortBy(SortByArgs.<String>builder().attribute("pos").build())
                .limit(0, 10_000)
                .build();

        var expected = new ArrayList<String>();
        for (int i = 1; i <= 1000; i++) {
            var latest = """
                    {"pos":%d,"ts":%d,"large":"just here to make the response larger to some great extend and overflow the buffers"}
                    """.formatted(i, System.currentTimeMillis()).trim();
            con.sync().jsonSet(prefix + i, JsonPath.ROOT_PATH, latest);
            expected.add(latest);

            if (i >= 924) {
                log.info("=== search {}", i);
                var searchReply = con.sync().ftSearch(index, "*", searchArgs);
                // with RESP3 this simply returns 0 as the map keys in Resp3SearchResultsParser are not found
                assertEquals(expected.size(), searchReply.getCount());
                for (int t = 1; t <= expected.size(); t++) {
                    var fields = searchReply.getResults().get(t - 1).getFields();
                    try {
                        assertEquals(expected.get(t - 1), fields.get("$"));
                    } catch (AssertionFailedError e) {
                        // with RESP2 this shows strange fields instead of the expected '$={"pos":..."'
                        log.info("Fields at pos {}: {}", t - 1, fields);
                        throw new AssertionFailedError("On loop " + i + ": " + e.getMessage(), e);
                    }
                }
            }
        }
    }
}

Error
[main] WARN io.netty.resolver.dns.DnsServerAddressStreamProviders - Can not find io.netty.resolver.dns.macos.MacOSDnsServerAddressStreamProvider in the classpath, fallback to system defaults. This may result in incorrect DNS resolutions on MacOS. Check whether you have a dependency on 'io.netty:netty-resolver-dns-native-macos'
[main] INFO RediSearchBugProver - ===> running with test-1771235215479:, idx-1771235215479
[main] INFO RediSearchBugProver - === search 924
[main] INFO RediSearchBugProver - Fields at pos 0: {$={"pos":12,"ts":1771235215499,"large":"just here to make the response larger to some great extend and overflow the buffers"}}
Exception in thread "main" org.opentest4j.AssertionFailedError: On loop 924: expected: <{"pos":1,"ts":1771235215484,"large":"just here to make the response larger to some great extend and overflow the buffers"}> but was: <{"pos":12,"ts":1771235215499,"large":"just here to make the response larger to some great extend and overflow the buffers"}>
	at RediSearchBugProver.runTest(RediSearchBugProver.java:83)
	at RediSearchBugProver.main(RediSearchBugProver.java:40)
Caused by: org.opentest4j.AssertionFailedError: expected: <{"pos":1,"ts":1771235215484,"large":"just here to make the response larger to some great extend and overflow the buffers"}> but was: <{"pos":12,"ts":1771235215499,"large":"just here to make the response larger to some great extend and overflow the buffers"}>
	at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:158)
	at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:139)
	at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:201)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:184)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:179)
	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1188)
	at RediSearchBugProver.runTest(RediSearchBugProver.java:79)
	... 1 more

@a-TODO-rov
Contributor

@YoHanKi I also tried tightening your integration test, similar to the reproducer, but it seems the issue cannot be reproduced in an integration test. Some environmental factors are probably involved, such as the Java version. I didn't have time to investigate further.

@YoHanKi
Author

YoHanKi commented Feb 18, 2026

@a-TODO-rov — thank you again for taking the time to test and for sharing the crash log.

After applying the copy() change, I haven’t been able to reproduce the remaining corruption on my side.

I ran the full regression suite, including a test that mirrors your reproducer as closely as I could (small ~95-byte payloads, sortBy("pos"), incremental insert + search), across the following matrix on Linux (x86-64, Docker):

| JDK | redis/redis-stack-server:latest | redis/redis-stack-server:7.2.0-v10 |
| --- | --- | --- |
| Eclipse Temurin 21 | ✅ 4/4 PASS | ✅ 4/4 PASS |
| Amazon Corretto 21 | ✅ 4/4 PASS | ✅ 4/4 PASS |
| Eclipse Temurin 17 | ✅ 4/4 PASS | ✅ 4/4 PASS |

Tests covered:

  • RESP3 growing document set (1,000 docs × 200 iterations, 4 KB payload)
  • RESP3 high-frequency loop (100 docs × 500 iterations, 4 KB payload)
  • RESP2 growing document set (1,000 docs × 200 iterations, 4 KB payload)
  • RESP3 incremental insert + sortBy("pos") + ~95-byte payload — intended to match your reproducer conditions

For reference, the unfixed version (using asReadOnlyBuffer()) triggers a JVM SIGSEGV crash across all combinations above.

Update — additional Linux environment verification:
I also tested the fix on Ubuntu 24.04 with the Ubuntu-packaged OpenJDK 21 (openjdk-21-jdk from apt) — the same OS/JVM flavour shown in your crash log — as well as Alpine Linux (musl libc), Amazon Linux 2023 (Corretto 21), and JDK 17. All combinations pass cleanly:

| Environment | redis-stack-server:latest | redis-stack-server:7.2.0-v10 |
| --- | --- | --- |
| Temurin 21 / Debian | ✅ PASS | ✅ PASS |
| Corretto 21 / Amazon Linux | ✅ PASS | ✅ PASS |
| Temurin 17 / Debian | ✅ PASS | ✅ PASS |
| Temurin 21 / Alpine (musl libc) | ✅ PASS | ✅ PASS |
| Ubuntu 24.04 / openjdk-21 from apt | ✅ PASS | ✅ PASS |

On the other hand, the unfixed version (asReadOnlyBuffer()) crashes with SIGSEGV in jbyte_disjoint_arraycopy_avx3 on all five environments above — consistent with what you reported.

Since I can’t reproduce a failure with the copy() fix across any of these combinations, I’m concerned I might be missing a condition that triggers a different code path.

If you don’t mind, could you share a bit more about your environment and how you ran the reproducer? Even small details may help narrow this down:

  • CPU model / virtualization (bare metal vs VM)
  • libc / distro details (e.g., glibc version), container vs non-container
  • JDK vendor + java -version
  • Netty version on the classpath (mvn dependency:tree | grep netty)
  • Active transport (EpollEventLoopGroup / IOUringEventLoopGroup in logs)
  • Build/run method (Maven/Gradle/IDE/fat-jar) and any custom ClientResources/ChannelOptions
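For the JVM-side items above, a quick throwaway snippet like this (my own sketch, not part of the PR) prints the relevant details in one go; the Netty probe only reports a version if the jar's manifest carries one:

```java
public class EnvReport {

    public static void main(String[] args) {
        // JDK vendor/version and OS details relevant to the SIGSEGV report
        System.out.println("java.vendor  = " + System.getProperty("java.vendor"));
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("os.name      = " + System.getProperty("os.name"));
        System.out.println("os.arch      = " + System.getProperty("os.arch"));

        // Netty version, if netty-buffer is on the classpath
        try {
            Package netty = Class.forName("io.netty.buffer.ByteBuf").getPackage();
            System.out.println("netty        = "
                    + (netty != null ? netty.getImplementationVersion() : "unknown"));
        } catch (ClassNotFoundException e) {
            System.out.println("netty        = not on classpath");
        }
    }
}
```

`mvn dependency:tree | grep netty` remains the authoritative answer for the classpath question, since the manifest-based probe above can report null.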

One more thing: I noticed you mentioned near the end of PR #3664 that a very similar approach didn’t resolve the issue. I re-tested with the same code structure and observed:

  • asReadOnlyBuffer(): reproducible SIGSEGV
  • updated copy(): no crash across the environments above

If there’s any subtle difference in your local patch or the exact reproduction steps (even something that seems minor), I’d really appreciate pointers — I want to make sure I’m not overlooking anything.

Thanks again — any additional details you can share would be very helpful for validating the fix and ensuring we cover the real root cause.



Development

Successfully merging this pull request may close these issues.

Memory corruption on large responses for ft.search
