Skip to content

Add INFO COMMANDSTATS tracking with per-command success rates#1702

Open
Mathos1432 wants to merge 2 commits intomainfrom
users/matrembl/commandstats
Open

Add INFO COMMANDSTATS tracking with per-command success rates#1702
Mathos1432 wants to merge 2 commits intomainfrom
users/matrembl/commandstats

Conversation

@Mathos1432
Copy link
Copy Markdown
Contributor

Why is this change being made?

Garnet currently has no way to track per-command usage statistics. Redis provides INFO COMMANDSTATS which reports how many times each command was called, how many failed, and how many were rejected. This is critical for production observability — operators need to understand command usage patterns, identify failing commands, and monitor success rates.

What changed?

Adds Redis-compatible INFO COMMANDSTATS support, gated behind --commandstats-monitor (disabled by default, zero overhead when off). When enabled, tracks per-command calls, failed_calls, and rejected_calls.

Key design decisions:

  • Counter-only tracking — No latency (usec) on the hot path. Reports usec=0,usec_per_call=0.00 for Redis format compatibility. Latency can be added later via sampling.

  • Per-session single-writer — Each session owns its own CommandStats array indexed by RespCommand enum (O(1) access). No locking on the hot path; aggregation happens on-demand when INFO COMMANDSTATS is requested.

  • Error detection — Dual strategy: commandErrorWritten flag set by WriteError/AbortWithErrorMessage, plus RESP - prefix check guarded by a bufferFlushedDuringCommand flag.

  • Excluded from default INFO — Only appears when explicitly requested via INFO COMMANDSTATS.

  • Bug fix: GarnetServerMonitor.Dispose() deadlocked when Start() was never called (the ManualResetEvent was never signaled). This affected EmbeddedRespServer used by BDN benchmarks. Fixed by tracking a started flag.

How was this validated?

Unit tests (17 passing)

  • 8 new RespCommandStatsTests: DisabledByDefault, CallsTracking, FailedCalls, UsecFieldsZero, MultipleCommands, SuccessRate, InfoServerShowsMonitorStatus, FormatMatchesRedisConvention
  • 9 existing metrics/info tests verified unbroken

Performance benchmarks (BenchmarkDotNet)

  • ACL — Server started with an ACL file (authentication enabled)
  • AOF — Server started with Append-Only File (persistence enabled)
  • None — Bare server, no ACL or AOF

PING (100 commands/op, .NET 10, Server GC)

Params Disabled Enabled Overhead
ACL 1.687 μs 1.820 μs +7.9%
AOF 1.680 μs 1.794 μs +6.8%
None 1.690 μs 1.816 μs +7.5%

PING is ~17ns/command, so ~1.3ns tracking overhead is proportionally large. This is the worst case.

SET (100 commands/op, .NET 10, Server GC)

Params Disabled Enabled Overhead
ACL 13.10 μs 13.18 μs +0.6%
AOF 21.13 μs 20.62 μs noise
None 12.62 μs 13.03 μs +3.2%

GET (100 commands/op, .NET 10, Server GC)

Params Disabled Enabled Overhead
ACL 14.93 μs 14.65 μs noise
AOF 14.70 μs 14.68 μs noise
None 14.24 μs 14.82 μs +4.1%

Summary: For real-world commands (SET/GET), overhead is 0-4% and within measurement noise for most configurations. Disabled by default = zero overhead unless opted in.

@Mathos1432 Mathos1432 force-pushed the users/matrembl/commandstats branch 5 times, most recently from 0460835 to 60dc7f0 Compare April 15, 2026 13:35
Implements Redis-compatible INFO COMMANDSTATS support, gated behind
--commandstats-monitor config flag. Tracks per-command calls, failed_calls,
and rejected_calls using lightweight counter increments (no Stopwatch).

Key design decisions:
- Array-indexed by RespCommand enum for O(1) access
- Per-session counters (single-writer, no locking on hot path)
- On-demand aggregation from active sessions + disposed session history
- No latency tracking (usec=0) to avoid Stopwatch overhead (~4.4x perf hit)
- Monitor dispose handles case where Start() was never called (EmbeddedServer)

Benchmark results (100 PINGs, .NET 10):
  Disabled: 1.687 us | Enabled: 1.820 us | Overhead: ~7.8%
  SET/GET overhead: 0-4% (within noise for real workloads)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Mathos1432 Mathos1432 force-pushed the users/matrembl/commandstats branch from 60dc7f0 to f26aea3 Compare April 15, 2026 15:14
…latency references

When MetricsSamplingFrequency > 0, globalCommandStats already includes
history from disposed sessions. PopulateCommandStatsInfo was adding
historyCommandStats on top, double-counting disposed session stats.

Also removed stale 'latency' references from doc comments and help text
since per-command latency tracking was removed to avoid Stopwatch overhead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Mathos1432 Mathos1432 marked this pull request as ready for review April 15, 2026 19:57
Copilot AI review requested due to automatic review settings April 15, 2026 19:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +659 to +663
if (!failed && !bufferFlushedDuringCommand &&
responseStartBefore != null && dcurr > responseStartBefore)
failed = *responseStartBefore == '-';
if (failed)
commandStats.IncrementFailed(cmd);
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure-detection logic dereferences responseStartBefore assuming the output buffer was never flushed (!bufferFlushedDuringCommand). However bufferFlushedDuringCommand is only set in SendAndReset(); other flush paths (e.g., WriteDirectLarge and SendAndReset(IMemoryOwner<byte>, ...)) also call Send(...)/GetResponseObject() without setting this flag. If a command triggers one of those flushes, responseStartBefore may point to an old response buffer, and *responseStartBefore can read invalid memory / crash. Consider tracking flushes in all buffer-rotation paths, or avoid raw-pointer inspection by relying solely on an explicit commandErrorWritten signal (ensure every error write sets it), or store an offset within the current response object rather than a raw pointer.

Copilot uses AI. Check for mistakes.
if (entry.Calls == 0 && entry.RejectedCalls == 0)
continue;

string cmdName = cmd.ToString().ToLowerInvariant().Replace('_', '|');
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Command names for INFO output should use the canonical Redis command name mapping rather than cmd.ToString(). Using the enum name risks mismatches for commands/subcommands/aliases and duplicates the naming logic already centralized in RespCommandsInfo.GetRespCommandName(cmd) (which already encodes the '|' subcommand separator). Suggest building cmdName from RespCommandsInfo.GetRespCommandName(cmd) and then applying the cmdstat_/lowercasing formatting.

Suggested change
string cmdName = cmd.ToString().ToLowerInvariant().Replace('_', '|');
string cmdName = RespCommandsInfo.GetRespCommandName(cmd).ToLowerInvariant();

Copilot uses AI. Check for mistakes.
Comment on lines +48 to +55
TimeSpan metricsUpdateDelay = TimeSpan.FromSeconds(1.5);

using ConnectionMultiplexer redis = ConnectionMultiplexer.Connect(TestUtils.GetConfig());
IDatabase db = redis.GetDatabase(0);

db.StringSet("key1", "value1");
Thread.Sleep(metricsUpdateDelay);

Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests rely on fixed Thread.Sleep delays to wait for metrics sampling, which can make the suite slow and flaky under load/CI variance. Since COMMANDSTATS can be aggregated on-demand when sampling is disabled, consider starting the server with metricsSamplingFreq <= 0 and polling INFO until the expected entry appears (with a timeout) instead of sleeping a fixed duration.

Copilot uses AI. Check for mistakes.
Comment on lines +20 to +26
private void StartServer(bool commandStatsMonitor = true, int metricsSamplingFreq = 1)
{
server = TestUtils.CreateGarnetServer(
TestUtils.MethodTestDir,
metricsSamplingFreq: metricsSamplingFreq,
commandStatsMonitor: commandStatsMonitor);
server.Start();
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StartServer defaults metricsSamplingFreq to 1, which forces these tests to depend on periodic sampling (hence repeated sleeps). Since COMMANDSTATS supports on-demand aggregation when sampling is disabled, consider defaulting metricsSamplingFreq to a non-positive value for this test class (or adding a helper that waits/polls), to avoid timing-based assertions throughout the suite.

Copilot uses AI. Check for mistakes.
this.loggerFactory = loggerFactory;
this.databaseManager = databaseManager ?? DatabaseManagerFactory.CreateDatabaseManager(serverOptions, createDatabaseDelegate, this);
this.monitor = serverOptions.MetricsSamplingFrequency > 0
this.monitor = serverOptions.MetricsSamplingFrequency > 0 || serverOptions.CommandStatsMonitor
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GarnetLatencyMetricsSession requires a non-null GarnetServerMonitor (it dereferences monitor.monitor_iterations), but StoreWrapper still does not create monitor when only LatencyMonitor is enabled and MetricsSamplingFrequency <= 0. With the current condition, --latency-monitor can still lead to a NullReferenceException at session creation unless metrics sampling (or now commandstats) also happens to be enabled. Consider including serverOptions.LatencyMonitor in the monitor-creation condition, or adding explicit validation that latency monitoring requires a positive MetricsSamplingFrequency.

Suggested change
this.monitor = serverOptions.MetricsSamplingFrequency > 0 || serverOptions.CommandStatsMonitor
this.monitor = serverOptions.MetricsSamplingFrequency > 0 || serverOptions.CommandStatsMonitor || serverOptions.LatencyMonitor

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants