[AI Studio] Frame Project backend returns errorCode:500 for 13+ hours: project never becomes usable #78427

## Summary

The AI Studio Frame Project backend API (`/studio/project/running`) returns `errorCode:500, "System Error"` for extended periods after project provisioning. We polled project 10105279 over a **~25-hour** window (with one ~7-hour polling gap) and logged **1,897 API responses**. Of these, 1,411 were errorCode:500, covering **13h16m** of observed time: 1,182 consecutive responses over the first 11h16m, then 229 more over 2h00m after the gap. The errors persisted through a full stop/start cycle, and the project never became usable.

This is a backend reliability issue — separate from the SSH routing bug reported in #78425.

cc @tizhou86 (container/K8s infrastructure), @sunzhongkai588

## Forensic Evidence

Background polling against `/studio/project/running` with `projectId=10105279` at ~30-second intervals. 1,897 entries logged, each with a unique `logId` verifiable against server-side logs.

### Timeline (all times CST, 2026-03-21 → 2026-03-22)

| Time | Duration | API Response | Notes |
|------|----------|-------------|-------|
| 2026-03-21 ~16:35 | — | — | Frame Project 10105279 created |
| ~16:35 → 18:23 | ~1h48m | `code: 201, sshAddress: null` | Provisioning stuck — UI shows "running" but API never returns SSH endpoint |
| **18:24:08 → 05:40:02** | **11h16m** | `errorCode:500, "System Error"` | **1,182 consecutive 500 responses** through entire night |
| 05:40 → 12:46 | ~7h | — | Polling gap (machine sleep) |
| **12:46:34 → 14:46:44** | **2h00m** | `errorCode:500, "System Error"` | **229 more 500 responses** after gap — still broken |
| 14:47:15 → 19:13:48 | 4h27m | `errorCode:8403, "Not login yet"` | Auth token expired. Backend was broken *before* expiry. |

**Total documented errorCode:500 duration**: 13h16m (11h16m + 2h00m)
**Total API polls logged**: 1,897 (1,411 errorCode:500 + 486 errorCode:8403)
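
The durations in the table can be re-derived from the start and end timestamps. The following is a minimal sketch using a hypothetical `elapsed` helper (not part of any AI Studio tooling); it assumes each interval is shorter than 24 hours, so a single midnight rollover is enough.

```bash
# Elapsed time between two HH:MM:SS clock readings.
# Assumption: the end is less than 24 hours after the start.
elapsed() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, s, ":"); split(b, e, ":")
    d = (e[1]*3600 + e[2]*60 + e[3]) - (s[1]*3600 + s[2]*60 + s[3])
    if (d < 0) d += 86400   # interval crosses midnight
    printf "%dh%02dm%02ds\n", int(d/3600), int((d%3600)/60), d%60
  }'
}

elapsed 18:24:08 05:40:02   # 11h15m54s
elapsed 12:46:34 14:46:44   # 2h00m10s
elapsed 14:47:15 19:13:48   # 4h26m33s
```

Rounded to the nearest minute these give 11h16m, 2h00m and 4h27m, and the two errorCode:500 intervals sum to 13h16m, matching the table.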

### Stop/Start Cycle Did Not Fix It

During the errorCode:500 period:
1. Stopped the project via UI
2. Waited for confirmation
3. Restarted the project
4. Polled again — **still errorCode:500**

### Sample API Response

```json
{"logId":"cbcd96d605510f69cd862450ae571317","errorCode":500,"errorMsg":"System Error","timestamp":0,"result":{}}
```

Each response has a unique `logId`. Full log of 1,411 `logId` values available for server-side correlation.
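
For anyone correlating on the server side, a tally like the one above can be reproduced from a raw log. This is a rough sketch that assumes the log holds one JSON response per line in the shape of the sample; the function names and the log layout are illustrative assumptions, not a description of how the original log was produced.

```bash
# Print "<errorCode> <count>" for each distinct errorCode in the log.
# Assumption: one JSON response per line, formatted like the sample above.
count_error_codes() {
  grep -o '"errorCode":[0-9]*' "$1" | cut -d: -f2 | sort -n | uniq -c \
    | awk '{print $2, $1}'
}

# Print the logId of every errorCode:500 response, one per line.
list_500_log_ids() {
  grep '"errorCode":500[,}]' "$1" | grep -o '"logId":"[^"]*"' | cut -d'"' -f4
}
```

Calling `count_error_codes poll.log` (where `poll.log` is a hypothetical file name) prints one line per error code, and `list_500_log_ids poll.log` yields the identifiers to look up in server-side logs.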

## Reproduction

1. Create a Frame Project (框架开发任务) with a GPU frame (e.g., `paddle dev gpu-cuda13.0-cudnn9.13`)
2. Click "启动" (Start)
3. Poll the readiness endpoint:
   ```bash
   curl -s 'https://aistudio.baidu.com/studio/project/running' \
     -X POST \
     -H "x-studio-token: $TOKEN" \
     -H 'Content-Type: application/x-www-form-urlencoded' \
     --data 'projectId=<PROJECT_ID>'
   ```
4. Observe: `errorCode:500, "System Error"` — indefinitely
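
To watch the failure over time rather than with a single request, the curl call from step 3 can be wrapped in a loop. The sketch below is one possible way to do that, not the exact script used for this report: the function names, the log format, and the 20-second request timeout are assumptions, and `$TOKEN` must be set as in step 3.

```bash
# Extract the numeric errorCode from one JSON response.
# Prints nothing when the field is absent (e.g. a `code: 201` response).
extract_error_code() {
  printf '%s\n' "$1" | sed -n 's/.*"errorCode":\([0-9][0-9]*\).*/\1/p'
}

# Poll once, append "<timestamp> <raw response>" to the log file,
# and print the errorCode (if any) on stdout.
poll_once() {
  local project_id=$1 log_file=$2 response
  response=$(curl -s --max-time 20 'https://aistudio.baidu.com/studio/project/running' \
    -X POST \
    -H "x-studio-token: $TOKEN" \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    --data "projectId=$project_id")
  printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$response" >> "$log_file"
  extract_error_code "$response"
}

# Poll every 30 seconds until interrupted (Ctrl-C).
poll_forever() {
  while :; do
    poll_once "$1" "$2"
    sleep 30
  done
}
```

Running `poll_forever <PROJECT_ID> poll.log` (with `poll.log` as a hypothetical log file name) prints one error code per poll and keeps the raw responses, including each `logId`, for later correlation.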

**Note**: A second project (10107437), created on 2026-03-22, started near-instantly without errorCode:500. The failure therefore appears to be intermittent, which makes it harder to diagnose but does not reduce the impact when it occurs.

## Environment

- **Product**: AI Studio Frame Project (框架开发任务)
- **Project**: https://aistudio.baidu.com/projectdetailforpaddle/10105279
- **Frame**: `paddle dev gpu-cuda13.0-cudnn9.13` (V100)
- **API endpoint**: `POST /studio/project/running`
- **Monitoring duration**: 2026-03-21 18:24 → 2026-03-22 19:13 (~25 hours, including a ~7-hour polling gap)
- **Polling interval**: ~30 seconds

## Impact

- Projects entering this state stay **stuck indefinitely** (13+ hours of errorCode:500 in our observation): they appear "running" in the UI, but the backend never returns a usable session
- A stop/start cycle does not recover the project
- Users are **billed for compute cards** (2.0 pts/hr V100) during the entire error period on a non-functional project
- Blocks productive use even if SSH routing (#78425) were fixed — backend must first reliably provision

**For comparison**: Google Cloud provisions equivalent GPU environments in under 60 seconds with 100% reliability across our testing. AI Studio's Frame Project spent 13+ hours returning System Error.

## Related

- #78425 — Frame Project SSH routing failure (`default-service-name.idehub` K8s template placeholder)
- #78426 — GitHub unreachable across all AI Studio environment types (GFW)

