Implement firmware upgrade in scout by rahmonov · Pull Request #484 · NVIDIA/ncx-infra-controller-core

rahmonov · 2026-03-09T15:48:55Z

Description

Adds a handler in scout for firmware upgrades: it downloads the necessary files and performs the upgrade. The behaviour of the rest of the system is currently unaffected. I will follow up with carbide-api changes that initiate this new process and handle the response.

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

We discussed whether to keep this new handler synchronous or make it async. The main points were:

Nothing else should happen while the upgrade is ongoing, so making it sync is fine.
What about healthchecks? They only happen for machines in the ready state.

That's why I am keeping it sync, which is also consistent with the other handlers.

Copilot

Pull request overview

Adds a new Scout stream handler to execute host-based firmware upgrades, including new Forge RPC messages and a Scout implementation that downloads artifacts and runs an upgrade script.

Changes:

Extend Scout stream routing to handle ScoutFirmwareUpgradeRequest and return ScoutFirmwareUpgradeResponse.
Add crates/scout/src/firmware_upgrade.rs implementing download + script execution flow (with unit tests).
Update Forge protobuf definitions to include the new request/response messages and wire them into stream message oneofs.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

File	Description
crates/scout/src/stream.rs	Routes the new stream request type to the firmware upgrade handler (now async).
crates/scout/src/main.rs	Registers the new `firmware_upgrade` module.
crates/scout/src/firmware_upgrade.rs	Implements firmware upgrade flow (download + execute) and adds unit tests.
crates/scout/Cargo.toml	Adds dependencies needed for firmware upgrade implementation and tests.
crates/rpc/proto/forge.proto	Adds `ScoutFirmwareUpgrade{Request,Response}` and wires them into stream messages; also includes formatting cleanups.
Cargo.lock	Locks new dependencies (axum/tempfile/tokio-test).

kensimon

Hmm, this is a little scary since it's basically a "download this URL and run whatever it says" command. It's called "firmware upgrade" right now but it could be re-used for anything else in the future.

I think I'd prefer that if we wanted to go this route, we'd rename the gRPC message to "RunCommandFromUrl" or similar, since that's literally what it does. Calling it "ScoutFirmwareUpgradeRequest" implies that it's doing something specific to firmware upgrades, and it gives us a false sense of security/restriction when there is none.

(Not to mention that there may be better/more restrictive ways to only install firmware if that's our only goal. Not saying I can think of any at the moment, but it'd be great if we could think of a more restricted way to do this.)

rahmonov · 2026-03-09T16:22:49Z

My editor seems to have added lots of formatting changes that I didn't want to add. Will remove them when handling the feedback.

ddejong-spec · 2026-03-09T22:20:03Z

As mentioned in the reply to the Copilot comment, a checksum of the files is something you might want to consider. It matters more when transferring over the Internet, where a bad link can cause enough errors that one eventually slips past the TCP/IP checksum; if we're expecting downloads from a local site only, it may be of lower importance. (Until we end up with That One Site that causes issues, at least.)

rahmonov · 2026-03-10T16:44:17Z

@kensimon @ddejong-spec I have made some changes:

Added checksum verification.
Renamed things to "remote execution".
Smaller changes based on the comments on the previous version.

Let me know what you think.

Matthias247 · 2026-03-10T17:11:33Z

crates/rpc/proto/forge.proto

+  string script_url = 3;
+  uint32 timeout_seconds = 4;
+  // Files to download before running the script.
+  // Keys are download URLs, values are expected SHA-256 hex checksums.


I'd make this a repeated FileArtifact file_artifact (or something like this), and define FileArtifact as required. Then it becomes more obvious what the fields are compared to the map, and we can extend it if further fields are required in the future.

rahmonov · 2026-03-11

I like this, will add.
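
To make the suggestion concrete, here is a minimal Rust-side sketch of a dedicated artifact type replacing the URL-to-checksum map. The struct, its field names (`url`, `sha256`), and the helper `artifacts_from_map` are assumptions for illustration only; they are not the generated types or the final proto definition from this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical artifact description: one entry per file to download.
/// A named struct documents each field and leaves room for new ones later.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileArtifact {
    /// Download URL of the file.
    pub url: String,
    /// Expected SHA-256 checksum as a hex string.
    pub sha256: String,
}

/// Converts the map form (URL -> checksum) into an explicit list.
/// BTreeMap iterates in key order, so the output is sorted by URL.
pub fn artifacts_from_map(files: &BTreeMap<String, String>) -> Vec<FileArtifact> {
    files
        .iter()
        .map(|(url, sha256)| FileArtifact {
            url: url.clone(),
            sha256: sha256.clone(),
        })
        .collect()
}
```

With a named type, adding a field later (for example a target file name) would not change how existing fields are read, which is harder to do with a plain map.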

Matthias247 · 2026-03-10T17:15:05Z

crates/scout/src/remote_exec.rs

+    );
+
+    let work_dir = tempfile::tempdir()?;
+


It looks like we have a timeout for executing the script, but none for downloading the script or the artifacts. Can we please add one?

You can decide whether:

the specified timeout applies to each step (meaning the total execution time could be a multiple of it), or

the timeout covers everything together. In that case each step would need to subtract the time already elapsed in previous steps when calculating its timeout. Alternatively, calculate a deadline once upfront.

rahmonov · 2026-03-11

Right, only the script execution phase uses that timeout param. I am thinking we can hardcode the timeout for the script download phase, because the script is usually a couple of lines of bash. But artifact download time can vary by component, so that might have to be another parameter. What do you think?
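
The "deadline once upfront" option from the comment above could look roughly like the following. This is a sketch using only the standard library; the `Deadline` type and its methods are hypothetical names, not code from this PR.

```rust
use std::time::{Duration, Instant};

/// A single overall deadline, computed once before the first step.
#[derive(Debug, Clone, Copy)]
pub struct Deadline {
    end: Instant,
}

impl Deadline {
    /// Starts the clock now and allows `total` for all steps combined.
    pub fn after(total: Duration) -> Self {
        Self { end: Instant::now() + total }
    }

    /// Time left for the next step, or `None` once the budget is used up.
    pub fn remaining(&self) -> Option<Duration> {
        // Saturates to zero instead of panicking if the deadline has passed.
        let left = self.end.saturating_duration_since(Instant::now());
        if left.is_zero() { None } else { Some(left) }
    }
}
```

Each phase (script download, artifact downloads, script execution) would then ask for `remaining()` and use the result as its own timeout, so no step has to track how long the earlier ones took.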

Matthias247 · 2026-03-10T17:15:32Z

crates/scout/src/remote_exec.rs

+    // Download files and verify checksums.
+    let download_dir = work_dir.path().join("downloads");
+    std::fs::create_dir_all(&download_dir)?;
+    for (url, expected_sha256) in &request.download_files {


We might want to consider downloading everything in parallel, but that could be a future optimization.
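
As an illustration of the idea only, the sketch below runs one fetch per URL concurrently and collects the results in input order. It uses scoped threads from the standard library to stay self-contained; the handler in this PR is async, so a real version would more likely join futures or tasks instead. `fetch_all` and the `fetch` callback are hypothetical names.

```rust
use std::thread;

/// Runs `fetch` for every URL on its own scoped thread and returns the
/// results in the same order as `urls`.
pub fn fetch_all<F, T, E>(urls: &[String], fetch: F) -> Vec<Result<T, E>>
where
    F: Fn(&str) -> Result<T, E> + Sync,
    T: Send,
    E: Send,
{
    // Share the callback by reference so every thread can call it.
    let fetch = &fetch;
    thread::scope(|scope| {
        // Spawn all threads first so the fetches overlap in time.
        let handles: Vec<_> = urls
            .iter()
            .map(|url| scope.spawn(move || fetch(url.as_str())))
            .collect();
        // Join in spawn order to keep results aligned with `urls`.
        handles
            .into_iter()
            .map(|h| h.join().expect("fetch thread panicked"))
            .collect()
    })
}
```

Returning one `Result` per URL lets the caller decide whether a single failed download should abort the whole run.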

Matthias247 · 2026-03-10T17:18:29Z

crates/rpc/proto/forge.proto

    mlx_device.MlxDeviceConfigSyncRequest mlx_device_config_sync_request = 13;
    mlx_device.MlxDeviceConfigCompareRequest mlx_device_config_compare_request = 14;
    ScoutStreamAgentPingRequest scout_stream_agent_ping_request = 15;
+    ScoutRemoteExecRequest scout_remote_exec_request = 16;


It's a very long-running request. I am not sure it fits the "scout stream" model best.

My understanding was roughly:

If we do anything in the main state machine, then ForgeAgentControl would be the mechanism to tell scout what to do.

If things are outside of the state machine and relatively short-lived, then scout stream could be used.

Maybe @chet, who introduced scout stream, can help figure out where it fits best.

rahmonov · 2026-03-11

Hm, I thought we were moving towards the stream model because it seemed much cleaner than ForgeAgentControl. Let's see what @chet has to say.

rahmonov · 2026-03-17 (edited)

Hey Matthias, I have created a PR with the polling approach: #590. If we go with the polling approach, I will close this PR. It also includes your other suggestions (file_artifacts, timeouts). Would appreciate your input there.

Copilot AI review requested due to automatic review settings March 9, 2026 15:48

rahmonov requested a review from a team as a code owner March 9, 2026 15:48

Copilot started reviewing on behalf of rahmonov March 9, 2026 15:49

Copilot AI reviewed Mar 9, 2026

kensimon reviewed Mar 9, 2026

rahmonov force-pushed the jrakhmonov/scout-firmware-proto branch 3 times, most recently from 8bb40b6 to c8a7579 March 10, 2026 16:40

Implement remote execution in scout (e.g. for firmware upgrades)

791b621

rahmonov force-pushed the jrakhmonov/scout-firmware-proto branch from c8a7579 to 791b621 March 10, 2026 17:07

Matthias247 reviewed Mar 10, 2026

Matthias247 requested a review from chet March 10, 2026 17:19

ddejong-spec approved these changes Mar 10, 2026

rahmonov mentioned this pull request Mar 17, 2026

Implement firmware upgrade in scout #590

Open

9 tasks
