[cells] PID not found errors when stopping running executables #534

Description

@mccormickt

Stopping a running executable appears to behave as follows on my system (x86_64, Ubuntu 24.04):

  • A sh -c <executable> process is created, and its PID is tracked in the Executables cache.
  • The child <executable> process is not tracked.
  • When the executable stop command is issued, auraed errors with "PID not found".
  • The parent shell process is killed/missing, while the child executable process is left behind, still running as an orphan (state S in the ps output below rather than a true zombie).
$ ps aux | grep "tail -f"
root     4095196  0.0  0.0   8320  1792 ?        S    09:34   0:00 tail -f /dev/null
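
One generic way to avoid leaving the child behind is to start the sh -c wrapper in its own process group and signal the whole group on stop. The following is a minimal sketch of that idea, not auraed's implementation: the names spawn_in_group and stop_group are hypothetical, it assumes a Linux target and Rust 1.82 or newer, and it calls libc's kill(2) directly through an extern declaration.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command, ExitStatus};

// libc's kill(2); std already links libc on Unix, so no extra crate is needed.
unsafe extern "C" {
    fn kill(pid: i32, sig: i32) -> i32;
}

const SIGKILL: i32 = 9; // value on Linux

/// Hypothetical helper: spawn `sh -c <command>` as the leader of a new process group.
fn spawn_in_group(command: &str) -> io::Result<Child> {
    Command::new("sh")
        .arg("-c")
        .arg(command)
        .process_group(0) // pgid becomes the shell's own pid
        .spawn()
}

/// Hypothetical helper: signal the shell and everything in its group, then reap the shell.
fn stop_group(mut child: Child) -> io::Result<ExitStatus> {
    let pgid = child.id() as i32;
    // A negative pid targets every process in that process group.
    if unsafe { kill(-pgid, SIGKILL) } != 0 {
        return Err(io::Error::last_os_error());
    }
    // Only the shell is our direct child; its descendants are reaped by their new parent.
    child.wait()
}
```

For example, stop_group(spawn_in_group("tail -f /dev/null")?) would signal both the shell and the tail process, provided neither has moved itself to another process group.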

Testing

Rust

Rust test, configuring a new remote client for the nested auraed:

#[test_helpers_macros::shared_runtime_test]
async fn cells_start_stop_delete() {
    skip_if_not_root!("cells_start_stop_delete");
    skip_if_seccomp!("cells_start_stop_delete");

    let client = common::auraed_client().await;

    // Allocate a cell
    let cell_name = retry!(
        client
            .allocate(
                common::cells::CellServiceAllocateRequestBuilder::new().build()
            )
            .await
    )
    .unwrap()
    .into_inner()
    .cell_name;

    // Start the executable
    let req = common::cells::CellServiceStartRequestBuilder::new()
        .cell_name(cell_name.clone())
        .executable_name("aurae-exe".to_string())
        .build();
    let _ = retry!(client.start(req.clone()).await).unwrap().into_inner();

    // Stop the executable
    let _ = retry!(
        client
            .stop(proto::cells::CellServiceStopRequest {
                cell_name: Some(cell_name.clone()),
                executable_name: "aurae-exe".to_string(),
            })
            .await
    )
    .unwrap();

    // Delete the cell
    let _ = retry!(
        client
            .free(proto::cells::CellServiceFreeRequest {
                cell_name: cell_name.clone()
            })
            .await
    )
    .unwrap();
}
sudo -E cargo test -p auraed --test cell_start_stop_delete -- --include-ignored
[...snip...]
2024-11-07T01:30:08.068934Z  INFO start: auraed::cells::cell_service::cell_service: CellService: start() executable=ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" } request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069353Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stdout request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069445Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stderr request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.103119Z  INFO stop: auraed::cells::cell_service::cell_service: CellService: stop() executable_name=ExecutableName("aurae-exe") request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
2024-11-07T01:30:08.103377Z ERROR stop: auraed::cells::cell_service::error: executable 'aurae-exe' failed to stop: No child processes (os error 10) request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
thread 'cells_start_stop_delete' panicked at auraed/tests/cell_list_must_list_allocated_cells_recursively.rs:172:6:
called `Result::unwrap()` on an `Err` value: Status { code: Internal, message: "executable 'aurae-exe' failed to stop: No child processes (os error 10)", metadata: MetadataMap { headers: {"content-type": "application/grpc", "content-length": "0", "date": "Thu, 07 Nov 2024 01:30:08 GMT"} }, source: None }
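
For context on the error text: "os error 10" is how Rust's std::io::Error renders errno 10, which is ECHILD on Linux. waitpid(2) returns it when the pid being waited on is not a child of the calling process. Whether auraed reaches the error through exactly this call is an assumption on my part; the sketch below only reproduces the same errno in isolation. The helper names are hypothetical, and it assumes a Linux target and Rust 1.82 or newer.

```rust
use std::io;

// libc's waitpid(2); std already links libc on Unix.
unsafe extern "C" {
    fn waitpid(pid: i32, status: *mut i32, options: i32) -> i32;
}

const WNOHANG: i32 = 1; // value on Linux; guarantees the call never blocks

/// Hypothetical helper: wait on `pid` and turn a failure into an `io::Error`.
fn wait_for(pid: i32) -> io::Result<i32> {
    let mut status = 0;
    let ret = unsafe { waitpid(pid, &mut status, WNOHANG) };
    if ret == -1 {
        Err(io::Error::last_os_error())
    } else {
        Ok(status)
    }
}

/// A process is never its own child, so waiting on our own pid fails with ECHILD.
fn wait_on_non_child() -> io::Error {
    wait_for(std::process::id() as i32).unwrap_err()
}
```

On a glibc system, printing the value returned by wait_on_non_child() should give the same "No child processes (os error 10)" text seen in the log above.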

Manually with aer and cloud-hypervisor

Install cloud-hypervisor and build guest image/kernel

sudo make /opt/aurae/cloud-hypervisor/cloud-hypervisor
sudo make build-guest-kernel
sudo make prepare-image

Run cloud-hypervisor with the auraed pid1 image

sudo cloud-hypervisor --kernel /var/lib/aurae/vm/kernel/vmlinux.bin \
--disk path=/var/lib/aurae/vm/image/disk.raw \
--cmdline "console=hvc0 root=/dev/vda1 rw" \
--cpus boot=4 \
--memory size=4096M \
--net "tap=tap0,mac=aa:ae:00:00:00:01,id=eth0"

Retrieve zone ID from tap0 (13 in my case):

ip link show tap0                                               
13: tap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 06:66:42:a8:3f:e1 brd ff:ff:ff:ff:ff:ff

Configure aurae client config in ~/.aurae/config:

[system]
socket = "[fe80::2%13]:8080"
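
The zone ID in the socket value is the interface index that ip link prints before the device name. As a rough sketch of how the two fit together, the snippet below pulls the index out of the first line of ip link output and formats the scoped link-local address. The function names are hypothetical, and the fe80::2 guest address and port 8080 are simply taken from the config above.

```rust
use std::net::{Ipv6Addr, SocketAddrV6};

/// Hypothetical helper: extract the interface index from the first line of
/// `ip link show <dev>`, e.g. "13: tap0: <BROADCAST,...>" yields 13.
fn parse_ifindex(first_line: &str) -> Option<u32> {
    first_line.split(':').next()?.trim().parse().ok()
}

/// Hypothetical helper: build the scoped socket string for the client config.
fn scoped_socket(ifindex: u32, port: u16) -> String {
    // fe80::2 is the link-local address used in the config above.
    let ip = Ipv6Addr::new(0xfe80, 0, 0, 0, 0, 0, 0, 2);
    // The scope id (fourth argument) is rendered as the "%<zone>" suffix.
    SocketAddrV6::new(ip, port, 0, ifindex).to_string()
}
```

With the ip link output shown earlier, parse_ifindex returns Some(13), and scoped_socket(13, 8080) produces the socket value used in the config.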

Verify cells run:

aer cell allocate sleeper
aer cell start --executable-command "sleep 9000" sleeper sleep-forever
aer cell list
aer cell stop sleeper sleep-forever
aer cell free sleeper

Labels: Auraed (The Aurae Daemon (gRPC Server)), Bug
