[AI Studio] Frame Project backend returns errorCode:500 for 13+ hours — project never becomes usable #78427
Description
Summary
The AI Studio Frame Project backend API (/studio/project/running) returns errorCode:500, "System Error" for extended periods after project provisioning. We monitored project 10105279 for ~25 hours and logged 1,897 API responses. Every response captured in roughly the first 20 hours of monitoring was errorCode:500 (13h16m of active polling, split by a ~7-hour polling gap), including through a full stop/start cycle. The project never became usable.
This is a backend reliability issue — separate from the SSH routing bug reported in #78425.
cc @tizhou86 (container/K8s infrastructure), @sunzhongkai588
Forensic Evidence
Background polling against /studio/project/running with projectId=10105279 at ~30-second intervals. 1,897 entries logged, each with a unique logId verifiable against server-side logs.
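One way to reduce a poll log of this shape to timeline rows is to collapse consecutive entries that share the same response code. The sketch below is illustrative only: the `(timestamp, code)` tuple format and the sample values are hypothetical, not the actual log format used for this report.

```python
from itertools import groupby

def collapse_runs(entries):
    """Group consecutive polls with the same code.

    entries: iterable of (timestamp, code) tuples in time order
             (hypothetical log shape, chosen for illustration).
    Returns a list of (code, first_timestamp, last_timestamp, count).
    """
    runs = []
    for code, group in groupby(entries, key=lambda entry: entry[1]):
        group = list(group)
        runs.append((code, group[0][0], group[-1][0], len(group)))
    return runs

# Synthetic sample data, not taken from the real log.
sample = [
    ("00:00:00", 500),
    ("00:00:30", 500),
    ("00:01:00", 500),
    ("00:01:30", 8403),
    ("00:02:00", 8403),
]

for run in collapse_runs(sample):
    print(run)
```

Each resulting tuple corresponds to one candidate timeline row; gaps in polling would still have to be detected separately by comparing adjacent timestamps.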
Timeline (all times CST, 2026-03-21 → 2026-03-22)
| Time | Duration | API Response | Notes |
|---|---|---|---|
| 2026-03-21 ~16:35 | n/a | n/a | Frame Project 10105279 created |
| ~16:35 → 18:23 | ~1h48m | code: 201, sshAddress: null | Provisioning stuck: UI shows "running" but API never returns SSH endpoint |
| 18:24:08 → 05:40:02 | 11h16m | errorCode:500, "System Error" | 1,182 consecutive 500 responses through the entire night |
| 05:40 → 12:46 | ~7h | n/a | Polling gap (machine sleep) |
| 12:46:34 → 14:46:44 | 2h00m | errorCode:500, "System Error" | 229 more 500 responses after the gap; still broken |
| 14:47:15 → 19:13:48 | 4h27m | errorCode:8403, "Not login yet" | Auth token expired. Backend was broken before expiry. |
Total documented errorCode:500 duration: 13h16m (11h16m + 2h00m)
Total API polls logged: 1,897 (1,411 errorCode:500 + 486 errorCode:8403)
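The totals above can be re-derived from the timeline timestamps and counts. The following is a small arithmetic check, assuming the overnight window crosses from 2026-03-21 into 2026-03-22 as the timeline header indicates; the helper names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def span(start, end):
    """Elapsed time between two timestamps given in FMT."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)

def hm(delta):
    """Format a timedelta as XhYYm, rounded to the nearest minute."""
    minutes = round(delta.total_seconds() / 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"

# The two errorCode:500 windows from the timeline.
overnight = span("2026-03-21 18:24:08", "2026-03-22 05:40:02")
afternoon = span("2026-03-22 12:46:34", "2026-03-22 14:46:44")

# Response counts from the timeline and totals lines.
count_500 = 1182 + 229
count_all = count_500 + 486

print(hm(overnight), hm(afternoon), hm(overnight + afternoon))
print(count_500, count_all)
```

The durations come out to 11h16m, 2h00m, and 13h16m, and the counts to 1,411 and 1,897, matching the figures reported above.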
Stop/Start Cycle Did Not Fix It
During the errorCode:500 period:
- Stopped the project via UI
- Waited for confirmation
- Restarted the project
- Polled again — still errorCode:500
Sample API Response
{"logId":"cbcd96d605510f69cd862450ae571317","errorCode":500,"errorMsg":"System Error","timestamp":0,"result":{}}

Each response has a unique logId. The full log of 1,411 logId values is available for server-side correlation.
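For anyone triaging a log of such responses, a minimal sketch of extracting the logId and bucketing by errorCode might look like this. The status labels (`backend_error`, `auth_expired`, `other`) are names invented for this example; only the two error codes and the field names come from the responses shown in this report.

```python
import json

# Sample response body quoted in this report.
RAW = ('{"logId":"cbcd96d605510f69cd862450ae571317","errorCode":500,'
       '"errorMsg":"System Error","timestamp":0,"result":{}}')

def classify(raw):
    """Parse one response body and bucket it by errorCode.

    The bucket names are illustrative labels, not API terminology.
    """
    body = json.loads(raw)
    code = body.get("errorCode")
    if code == 500:
        status = "backend_error"
    elif code == 8403:
        status = "auth_expired"
    else:
        # Anything else, including successful responses, whose exact
        # shape is not documented here.
        status = "other"
    return {
        "logId": body.get("logId"),
        "status": status,
        "errorMsg": body.get("errorMsg"),
    }

print(classify(RAW))
```

Keeping the logId alongside the bucket makes it straightforward to hand a list of failing request IDs to whoever can search the server-side logs.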
Reproduction
- Create a Frame Project (框架开发任务, "framework development task") with a GPU frame (e.g., `paddle dev gpu-cuda13.0-cudnn9.13`)
- Click "启动" (Start)
- Poll the readiness endpoint: `curl -s 'https://aistudio.baidu.com/studio/project/running' -X POST -H "x-studio-token: $TOKEN" -H 'Content-Type: application/x-www-form-urlencoded' --data 'projectId=<PROJECT_ID>'`
- Observe: errorCode:500, "System Error", indefinitely
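The polling step can also be scripted. Below is a rough Python equivalent of the curl command, using only the standard library. The URL, header names, and form field mirror the curl call above; the function names, the 30-second timeout, and the injectable `send`/`sleep` parameters are choices made for this sketch and are not part of any AI Studio client.

```python
import time
import urllib.parse
import urllib.request

URL = "https://aistudio.baidu.com/studio/project/running"

def build_request(project_id, token):
    """Build the same POST that the curl command sends."""
    data = urllib.parse.urlencode({"projectId": project_id}).encode("ascii")
    return urllib.request.Request(
        URL,
        data=data,
        method="POST",
        headers={
            "x-studio-token": token,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )

def send(request):
    """Send the request and return the raw body.

    Assumes the error JSON arrives in a normal HTTP response body;
    urlopen raises HTTPError if the HTTP status itself is an error.
    """
    with urllib.request.urlopen(request, timeout=30) as resp:
        return resp.read().decode("utf-8")

def poll(project_id, token, attempts, interval=30.0, send=send, sleep=time.sleep):
    """Poll `attempts` times, `interval` seconds apart, and collect raw bodies."""
    bodies = []
    for i in range(attempts):
        bodies.append(send(build_request(project_id, token)))
        if i + 1 < attempts:
            sleep(interval)
    return bodies

# Example (performs real network calls, so it is left commented out):
# for body in poll("<PROJECT_ID>", "<TOKEN>", attempts=3):
#     print(body)
```

Passing `send` and `sleep` as parameters keeps the loop testable without network access or real delays.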
Note: A second project (10107437) created 2026-03-22 started near-instantly without errorCode:500. This may be intermittent — which makes it harder to diagnose but does not reduce the impact when it occurs.
Environment
- Product: AI Studio Frame Project (框架开发任务)
- Project: https://aistudio.baidu.com/projectdetailforpaddle/10105279
- Frame: `paddle dev gpu-cuda13.0-cudnn9.13` (V100)
- API endpoint: `POST /studio/project/running`
- Monitoring duration: 2026-03-21 18:24 → 2026-03-22 19:13 (~25 hours)
- Polling interval: ~30 seconds
Impact
- Projects entering this state are permanently stuck: each appears "running" in the UI, but the backend never returns a usable session
- Stop/start cycle does not recover
- Users are billed for compute cards (2.0 pts/hr V100) during the entire error period on a non-functional project
- Blocks productive use even if the SSH routing bug (#78425) were fixed, because the backend must first provision reliably
For comparison: Google Cloud provisions equivalent GPU environments in under 60 seconds with 100% reliability across our testing. AI Studio's Frame Project spent 13+ hours returning System Error.
Related
- #78425: [AI Studio] Frame Project SSH: 6 infrastructure failure modes (non-deterministic auth, host key rotation, non-functional sessions). Frame Project SSH routing failure (`default-service-name.idehub` K8s template placeholder)
- #78426: AI Studio: GitHub unreachable across all environment types, impacts Hackathon contributor workflows. GitHub unreachable across all AI Studio environment types (GFW)