Make MetricsITBase tests serial instead of parallelizable (CASSJAVA-19)#2048
Open
Bouncheck wants to merge 1 commit intoapache:4.xfrom
Open
Make MetricsITBase tests serial instead of parallelizable (CASSJAVA-19)#2048Bouncheck wants to merge 1 commit intoapache:4.xfrom
Bouncheck wants to merge 1 commit intoapache:4.xfrom
Conversation
This change fixes random failures by removing `MicrometerMetricsIT`, `MicroProfileMetricsIT` and `DropwizardMetricsIT` from the `ParallelizableTests` category. It is not the only solution but seems to be the simplest workaround. The reasoning is as follows: `should_evict_down_node_metrics_when_timeout_fires()` has two places where it manipulates `AbstractMetricUpdater.MIN_EXPIRE_AFTER` in order to set `DefaultDriverOption.METRICS_NODE_EXPIRE_AFTER` to 1 second during session initialization. Otherwise `MIN_EXPIRE_AFTER` would not allow for that and warning log would be printed about using higher value than what user wanted to set. It lowers it before session initialization: https://github.com/apache/cassandra-java-driver/blob/17ebe6092e2877d8c524e07489c4c3d005cfeea5/integration-tests/src/test/java/com/datastax/oss/driver/core/metrics/MetricsITBase.java#L157 And sets it back to 5 mintues at the end of the test: https://github.com/apache/cassandra-java-driver/blob/17ebe6092e2877d8c524e07489c4c3d005cfeea5/integration-tests/src/test/java/com/datastax/oss/driver/core/metrics/MetricsITBase.java#L186 The code comment in AbstractMetricUpdater also mentions that this variable is intentionally not made final for testing purposes. I believe this is what causes the intermittent failures. When those three tests that extend this class run in parallel it is possible that test B sets the minimum back to 5 minutes after the test A lowers it to 1 second but before test A initializes its session. The test waits around 10 seconds for the change to happen, but in such case it would need to wait for 5 minutes. Warnings with this pattern should be visible when the test fails on the CI: ``` c.d.o.d.i.c.m.AbstractMetricUpdater - [s6] Value too low for advanced.metrics.node.expire-after: PT1S. Forcing to PT5M instead. ``` I can confirm that this is what was happening when I was testing this locally, but I did not try to search for the logs on cassandra-java-driver's CI.
771cc50 to
441389f
Compare
Author
|
Here's also a handy script for reproducing this issue. I think I needed around 20 runs on version before the change to trigger the bug. (note: exit code 0 does not mean no test failed in general, but here it's sufficient) #!/bin/bash
mvn clean install -P fast
echo "Running tests in a loop until failure detection..."
counter=1
while true; do
echo "=============================================="
echo "Attempt #$counter"
echo "=============================================="
# Run the test command and show output in real-time while also capturing it
mvn -pl integration-tests integration-test -Dit.test='MicrometerMetricsIT, MicroProfileMetricsIT, DropwizardMetricsIT' | tee /tmp/maven_output.log
# Capture the exit status of the tee command (which will be the exit status of the Maven command)
maven_status=${PIPESTATUS[0]}
# Check for explicit error containing the method name in the captured output
if grep -q "\[ERROR\].*should_evict_down_node_metrics_when_timeout_fires" /tmp/maven_output.log; then
echo ""
echo "Found test failure on attempt #$counter"
echo "Failing test output:"
grep -A 5 -B 2 "should_evict_down_node_metrics_when_timeout_fires" /tmp/maven_output.log
break
fi
# Also check standard Maven error exit code
if [ $maven_status -ne 0 ]; then
echo ""
echo "Test command failed on attempt #$counter, but the specific error wasn't found."
break
fi
echo ""
echo "Test passed. Continuing..."
((counter++))
# Optional: small delay between runs
sleep 1
done
echo ""
echo "Tests failed after $counter attempts." |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This change fixes random failures by removing
MicrometerMetricsIT,MicroProfileMetricsITandDropwizardMetricsITfrom theParallelizableTestscategory.It is not the only solution but seems to be the simplest workaround.
The reasoning is as follows:
should_evict_down_node_metrics_when_timeout_fires()has two places where it manipulatesAbstractMetricUpdater.MIN_EXPIRE_AFTERin order to setDefaultDriverOption.METRICS_NODE_EXPIRE_AFTERto 1 second during session initialization. OtherwiseMIN_EXPIRE_AFTERwould not allow for that and warning log would be printed about using higher value than what user wanted to set.It lowers it before session initialization:
cassandra-java-driver/integration-tests/src/test/java/com/datastax/oss/driver/core/metrics/MetricsITBase.java
Line 157 in 17ebe60
cassandra-java-driver/integration-tests/src/test/java/com/datastax/oss/driver/core/metrics/MetricsITBase.java
Line 186 in 17ebe60
The test waits around 10 seconds for the change to happen, but in such case it would need to wait for 5 minutes.
Warnings with this pattern should be visible when the test fails on the CI:
I can confirm that this is what was happening when I was testing this locally, but I did not try to search for the logs on cassandra-java-driver's CI.