Possible integer overflow in NetworkPublication #1967

@ivanbogo

We believe we've encountered an integer overflow bug in NetworkPublication#sendData, which can cause the publication to get stuck in the DRAINING state.

Details on our setup below.

We run an AeronCluster-based service that publishes identical output from several replicas. The output goes to an exclusive publication on a multicast
channel, where each replica is configured to use the same session-id, stream-id and initial position.
The stream is also recorded by a local archive so that late joiners can catch up using ReplayMerge.

We recently encountered an issue where a replica that is more than 2GB behind the others (which can happen on restart) gets stuck and cannot publish
to the live stream. Additionally, even after we stop the replica process, the media driver keeps the publication alive and does not
allow another instance to create the exclusive publication, because the conflicting publication is never reclaimed.

We believe the culprit is an integer overflow in NetworkPublication#sendData.
When the replica is far behind, the following code can cause an overflow:

    final int availableWindow = (int)(senderLimit.get() - senderPosition);
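To illustrate the arithmetic, here is a minimal standalone sketch of the narrowing cast. It is not Aeron code: the class name, method names and the 3 GiB gap are hypothetical, and the clamped variant is only one possible way to avoid the wrap-around, not a proposed patch.

```java
public final class AvailableWindowSketch
{
    // Mirrors the cast quoted above: the long difference is narrowed to int.
    static int truncatedWindow(final long senderLimit, final long senderPosition)
    {
        return (int)(senderLimit - senderPosition);
    }

    // Hypothetical alternative: clamp in long arithmetic before narrowing.
    static int clampedWindow(final long senderLimit, final long senderPosition)
    {
        final long window = senderLimit - senderPosition;
        return (int)Math.max(0L, Math.min(window, Integer.MAX_VALUE));
    }

    public static void main(final String[] args)
    {
        final long senderPosition = 0L;
        final long senderLimit = 3L * 1024 * 1024 * 1024; // illustrative: limit is 3 GiB ahead

        System.out.println(truncatedWindow(senderLimit, senderPosition)); // -1073741824
        System.out.println(clampedWindow(senderLimit, senderPosition));   // 2147483647
    }
}
```

Any gap between 2 GiB and 4 GiB wraps to a negative int in the first method, while larger gaps can wrap to small non-negative values.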

As a workaround, we start by publishing to a recorded IPC publication until we are close to the live stream position, and then switch to the multicast publication.
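The switch decision in the workaround can be sketched roughly as follows. This is an assumption-laden illustration, not our production code: the class, the method and the idea of passing in a maximum allowed gap are hypothetical, and the only constraint taken from the analysis above is that the gap must stay within the range of an int.

```java
public final class SwitchToLiveSketch
{
    // Hypothetical helper: returns true once the local position is close enough to the
    // live stream position that the gap fits comfortably in an int.
    static boolean canSwitchToLive(final long localPosition, final long livePosition, final long maxGap)
    {
        if (maxGap < 0 || maxGap > Integer.MAX_VALUE)
        {
            throw new IllegalArgumentException("maxGap must be in [0, Integer.MAX_VALUE]: " + maxGap);
        }

        final long gap = livePosition - localPosition;
        return gap <= maxGap;
    }

    public static void main(final String[] args)
    {
        final long maxGap = 64L * 1024 * 1024; // illustrative threshold, not a recommended value
        final long livePosition = 3L * 1024 * 1024 * 1024;

        System.out.println(canSwitchToLive(0L, livePosition, maxGap));                  // false
        System.out.println(canSwitchToLive(livePosition - 1024, livePosition, maxGap)); // true
    }
}
```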

As far as I can tell, our setup is supported, but we would appreciate guidance if there is something wrong with it. The Transport Protocol Specification
(https://github.com/aeron-io/aeron/wiki/Transport-Protocol-Specification) states:

"Note: Multiple Senders sending redundant data is supported. Can be as simple as having each use the same Session ID, Stream ID, Term ID, and term offset."

We have reproduced this on both Linux and Windows, using Aeron 1.48.6.
