Add INFO COMMANDSTATS tracking with per-command success rates#1702
Add INFO COMMANDSTATS tracking with per-command success rates#1702Mathos1432 wants to merge 2 commits intomainfrom
Conversation
0460835 to
60dc7f0
Compare
Implements Redis-compatible INFO COMMANDSTATS support, gated behind --commandstats-monitor config flag. Tracks per-command calls, failed_calls, and rejected_calls using lightweight counter increments (no Stopwatch). Key design decisions: - Array-indexed by RespCommand enum for O(1) access - Per-session counters (single-writer, no locking on hot path) - On-demand aggregation from active sessions + disposed session history - No latency tracking (usec=0) to avoid Stopwatch overhead (~4.4x perf hit) - Monitor dispose handles case where Start() was never called (EmbeddedServer) Benchmark results (100 PINGs, .NET 10): Disabled: 1.687 us | Enabled: 1.820 us | Overhead: ~7.8% SET/GET overhead: 0-4% (within noise for real workloads) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
60dc7f0 to
f26aea3
Compare
…latency references When MetricsSamplingFrequency > 0, globalCommandStats already includes history from disposed sessions. PopulateCommandStatsInfo was adding historyCommandStats on top, double-counting disposed session stats. Also removed stale 'latency' references from doc comments and help text since per-command latency tracking was removed to avoid Stopwatch overhead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (!failed && !bufferFlushedDuringCommand && | ||
| responseStartBefore != null && dcurr > responseStartBefore) | ||
| failed = *responseStartBefore == '-'; | ||
| if (failed) | ||
| commandStats.IncrementFailed(cmd); |
There was a problem hiding this comment.
The failure-detection logic dereferences responseStartBefore assuming the output buffer was never flushed (!bufferFlushedDuringCommand). However bufferFlushedDuringCommand is only set in SendAndReset(); other flush paths (e.g., WriteDirectLarge and SendAndReset(IMemoryOwner<byte>, ...)) also call Send(...)/GetResponseObject() without setting this flag. If a command triggers one of those flushes, responseStartBefore may point to an old response buffer, and *responseStartBefore can read invalid memory / crash. Consider tracking flushes in all buffer-rotation paths, or avoid raw-pointer inspection by relying solely on an explicit commandErrorWritten signal (ensure every error write sets it), or store an offset within the current response object rather than a raw pointer.
| if (entry.Calls == 0 && entry.RejectedCalls == 0) | ||
| continue; | ||
|
|
||
| string cmdName = cmd.ToString().ToLowerInvariant().Replace('_', '|'); |
There was a problem hiding this comment.
Command names for INFO output should use the canonical Redis command name mapping rather than cmd.ToString(). Using the enum name risks mismatches for commands/subcommands/aliases and duplicates the naming logic already centralized in RespCommandsInfo.GetRespCommandName(cmd) (which already encodes the '|' subcommand separator). Suggest building cmdName from RespCommandsInfo.GetRespCommandName(cmd) and then applying the cmdstat_/lowercasing formatting.
| string cmdName = cmd.ToString().ToLowerInvariant().Replace('_', '|'); | |
| string cmdName = RespCommandsInfo.GetRespCommandName(cmd).ToLowerInvariant(); |
| TimeSpan metricsUpdateDelay = TimeSpan.FromSeconds(1.5); | ||
|
|
||
| using ConnectionMultiplexer redis = ConnectionMultiplexer.Connect(TestUtils.GetConfig()); | ||
| IDatabase db = redis.GetDatabase(0); | ||
|
|
||
| db.StringSet("key1", "value1"); | ||
| Thread.Sleep(metricsUpdateDelay); | ||
|
|
There was a problem hiding this comment.
These tests rely on fixed Thread.Sleep delays to wait for metrics sampling, which can make the suite slow and flaky under load/CI variance. Since COMMANDSTATS can be aggregated on-demand when sampling is disabled, consider starting the server with metricsSamplingFreq <= 0 and polling INFO until the expected entry appears (with a timeout) instead of sleeping a fixed duration.
| private void StartServer(bool commandStatsMonitor = true, int metricsSamplingFreq = 1) | ||
| { | ||
| server = TestUtils.CreateGarnetServer( | ||
| TestUtils.MethodTestDir, | ||
| metricsSamplingFreq: metricsSamplingFreq, | ||
| commandStatsMonitor: commandStatsMonitor); | ||
| server.Start(); |
There was a problem hiding this comment.
StartServer defaults metricsSamplingFreq to 1, which forces these tests to depend on periodic sampling (hence repeated sleeps). Since COMMANDSTATS supports on-demand aggregation when sampling is disabled, consider defaulting metricsSamplingFreq to a non-positive value for this test class (or adding a helper that waits/polls), to avoid timing-based assertions throughout the suite.
| this.loggerFactory = loggerFactory; | ||
| this.databaseManager = databaseManager ?? DatabaseManagerFactory.CreateDatabaseManager(serverOptions, createDatabaseDelegate, this); | ||
| this.monitor = serverOptions.MetricsSamplingFrequency > 0 | ||
| this.monitor = serverOptions.MetricsSamplingFrequency > 0 || serverOptions.CommandStatsMonitor |
There was a problem hiding this comment.
GarnetLatencyMetricsSession requires a non-null GarnetServerMonitor (it dereferences monitor.monitor_iterations), but StoreWrapper still does not create monitor when only LatencyMonitor is enabled and MetricsSamplingFrequency <= 0. With the current condition, --latency-monitor can still lead to a NullReferenceException at session creation unless metrics sampling (or now commandstats) also happens to be enabled. Consider including serverOptions.LatencyMonitor in the monitor-creation condition, or adding explicit validation that latency monitoring requires a positive MetricsSamplingFrequency.
| this.monitor = serverOptions.MetricsSamplingFrequency > 0 || serverOptions.CommandStatsMonitor | |
| this.monitor = serverOptions.MetricsSamplingFrequency > 0 || serverOptions.CommandStatsMonitor || serverOptions.LatencyMonitor |
Why is this change being made?
Garnet currently has no way to track per-command usage statistics. Redis provides INFO COMMANDSTATS which reports how many times each command was called, how many failed, and how many were rejected. This is critical for production observability — operators need to understand command usage patterns, identify failing commands, and monitor success rates.
What changed?
Adds Redis-compatible INFO COMMANDSTATS support, gated behind --commandstats-monitor (disabled by default, zero overhead when off). When enabled, tracks per-command calls, failed_calls, and rejected_calls.
Key design decisions:
Counter-only tracking — No latency (usec) on the hot path. Reports usec=0,usec_per_call=0.00 for Redis format compatibility. Latency can be added later via sampling.
Per-session single-writer — Each session owns its own CommandStats array indexed by RespCommand enum (O(1) access). No locking on the hot path; aggregation happens on-demand when INFO COMMANDSTATS is requested.
Error detection — Dual strategy: commandErrorWritten flag set by WriteError/AbortWithErrorMessage, plus RESP - prefix check guarded by a bufferFlushedDuringCommand flag.
Excluded from default INFO — Only appears when explicitly requested via INFO COMMANDSTATS.
Bug fix: GarnetServerMonitor.Dispose() deadlocked when Start() was never called (the ManualResetEvent was never signaled). This affected EmbeddedRespServer used by BDN benchmarks. Fixed by tracking a started flag.
How was this validated?
Unit tests (17 passing)
Performance benchmarks (BenchmarkDotNet)
PING (100 commands/op, .NET 10, Server GC)
PING is ~17ns/command, so ~1.3ns tracking overhead is proportionally large. This is the worst case.
SET (100 commands/op, .NET 10, Server GC)
GET (100 commands/op, .NET 10, Server GC)
Summary: For real-world commands (SET/GET), overhead is 0-4% and within measurement noise for most configurations. Disabled by default = zero overhead unless opted in.