Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new Scout stream handler to execute host-based firmware upgrades, including new Forge RPC messages and a Scout implementation that downloads artifacts and runs an upgrade script.
Changes:
- Extend Scout stream routing to handle
ScoutFirmwareUpgradeRequestand returnScoutFirmwareUpgradeResponse. - Add
crates/scout/src/firmware_upgrade.rsimplementing download + script execution flow (with unit tests). - Update Forge protobuf definitions to include the new request/response messages and wire them into stream message oneofs.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/scout/src/stream.rs | Routes the new stream request type to the firmware upgrade handler (now async). |
| crates/scout/src/main.rs | Registers the new firmware_upgrade module. |
| crates/scout/src/firmware_upgrade.rs | Implements firmware upgrade flow (download + execute) and adds unit tests. |
| crates/scout/Cargo.toml | Adds dependencies needed for firmware upgrade implementation and tests. |
| crates/rpc/proto/forge.proto | Adds ScoutFirmwareUpgrade{Request,Response} and wires them into stream messages; also includes formatting cleanups. |
| Cargo.lock | Locks new dependencies (axum/tempfile/tokio-test). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
kensimon
left a comment
There was a problem hiding this comment.
Hmm, this is a little scary since it's basically a "download this URL and run whatever it says" command. It's called "firmware upgrade" right now but it could be re-used for anything else in the future.
I think I'd prefer that if we wanted to go this route, we'd rename the gRPC message to "RunCommandFromUrl" or similar, since that's literally what it does. Calling it "ScoutFirmwareUpgradeRequest" implies that it's doing something specific to firmware upgrades, and it gives us a false sense of security/restriction when there is none.
(Not to mention that there may be better/more restrictive ways to only install firmware if that's our only goal. Not saying I can think of any at the moment, but it'd be great if we could think of a more restricted way to do this.)
|
My editor seems to have added lots of formatting changes that I didn't want to add. Will remove them when handling the feedback. |
|
As mentioned in the reply to the copilot comment, having a checksum of the files is something you might want to consider. It's more of a deal when transferring over the Internet, when a bad link can cause enough errors to eventually get a TCP/IP checksum false negative; if we're expecting downloads from a local site only, it may be of lower importance. (Until we end up with some That One Site that causes issues, at least.) |
8bb40b6 to
c8a7579
Compare
|
@kensimon @ddejong-spec I have made some changes:
Let me know what you think. |
c8a7579 to
791b621
Compare
| string script_url = 3; | ||
| uint32 timeout_seconds = 4; | ||
| // Files to download before running the script. | ||
| // Keys are download URLs, values are expected SHA-256 hex checksums. |
There was a problem hiding this comment.
I'd make this a repeated FileArtifcat file_artifact (or something like this), and define FileArtifact as required. Then it becomes more obvious what the fields are compared to the map, and we can extend it if further fields are required in the future.
There was a problem hiding this comment.
I like this, will add.
| ); | ||
|
|
||
| let work_dir = tempfile::tempdir()?; | ||
|
|
There was a problem hiding this comment.
It looks like we have a timeout for executing the script, but neither for downloading the script nor downloading the artifacts? Can we please add?
You can decide whether
- the timeout that is specified is for each step (meaning the total time of execution could be a multiple of it)
- the timeout is for everything together. In that case each step would need to subtract the already elapsed time for previous steps for calculating the timeout. Or you calculate a deadline once upfront.
There was a problem hiding this comment.
Right, only the script execution phase is using that timeout param. I am thinking we can hardcode the timeout for the script download phase because it is usually a couple of lines of bash script. But the artifact downloading can be different depending on the component so that might have to be another parameter. What do you think?
| // Download files and verify checksums. | ||
| let download_dir = work_dir.path().join("downloads"); | ||
| std::fs::create_dir_all(&download_dir)?; | ||
| for (url, expected_sha256) in &request.download_files { |
There was a problem hiding this comment.
we might want to consider downloading everything in parallel. But that could be a future optimization
| mlx_device.MlxDeviceConfigSyncRequest mlx_device_config_sync_request = 13; | ||
| mlx_device.MlxDeviceConfigCompareRequest mlx_device_config_compare_request = 14; | ||
| ScoutStreamAgentPingRequest scout_stream_agent_ping_request = 15; | ||
| ScoutRemoteExecRequest scout_remote_exec_request = 16; |
There was a problem hiding this comment.
It's a very long running request. I am not sure if it fits the "scout stream" model best.
My understanding was a bit of:
- If we do anything in the main state machine, then
ForgeAgentControlwould be the mechanism to tell scout what to do - If things are outside of the state machine and relatively short lived, then scout stream could be used.
Maybe @chet who introduced scout stream can help figuring out where it fits best.
There was a problem hiding this comment.
Hm, I thought we were moving towards the stream model because it seemed much cleaner than ForgeAgentControl. Let's see what @chet has to say.
There was a problem hiding this comment.
Hey Matthias, I have created this PR with the polling approach: #590. If we go with the polling approach, I will close this PR. It also includes your other suggestions (file_artifacts, timeouts). Would appreciate your input there.
Description
Adds a handler in scout to handle firmware upgrades. Downloads the necessary files and performs the upgrade. Currently the behaviour of the whole system is not affected. I will follow up with carbide-api changes that initiates this new process and handles the response.
Type of Change
Testing
Additional Notes
Discussed whether to keep this new handler synchronous or make it async. The main points are the following:
That's why keeping it sync which is consistent with the other handlers.