[AI Studio] Frame Project backend returns errorCode:500 for 13+ hours — project never becomes usable #78427

@cloudforge1

Summary

The AI Studio Frame Project backend API (POST /studio/project/running) returned errorCode:500, "System Error", for an extended period after project provisioning. We polled project 10105279 for ~25 hours and logged 1,897 API responses. The first 1,411 responses, covering 13h16m of active polling (interrupted only by a ~7-hour gap while the polling machine slept), were all errorCode:500, including across a full stop/start cycle. The project never became usable.

This is a backend reliability issue — separate from the SSH routing bug reported in #78425.

cc @tizhou86 (container/K8s infrastructure), @sunzhongkai588

Forensic Evidence

We ran background polling against /studio/project/running with projectId=10105279 at ~30-second intervals. 1,897 entries were logged, each with a unique logId that can be verified against server-side logs.

Timeline (all times CST, 2026-03-21 → 2026-03-22)

| Time | Duration | API Response | Notes |
|---|---|---|---|
| 2026-03-21 ~16:35 | | | Frame Project 10105279 created |
| ~16:35 → 18:23 | ~1h48m | code: 201, sshAddress: null | Provisioning stuck: UI shows "running" but API never returns SSH endpoint |
| 18:24:08 → 05:40:02 | 11h16m | errorCode:500, "System Error" | 1,182 consecutive 500 responses through entire night |
| 05:40 → 12:46 | ~7h | | Polling gap (machine sleep) |
| 12:46:34 → 14:46:44 | 2h00m | errorCode:500, "System Error" | 229 more 500 responses after gap; still broken |
| 14:47:15 → 19:13:48 | 4h27m | errorCode:8403, "Not login yet" | Auth token expired. Backend was broken before expiry. |

Total documented errorCode:500 duration: 13h16m (11h16m + 2h00m)
Total API polls logged: 1,897 (1,411 errorCode:500 + 486 errorCode:8403)
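
The totals above can be cross-checked directly from the timeline timestamps. The following is a minimal Python sketch, assuming the 18:24:08 window starts on 2026-03-21 and every later timestamp falls on 2026-03-22 (consistent with the monitoring window listed under Environment). The variable names are illustrative only.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# (start, end, number of responses) for each errorCode:500 window in the timeline.
# The calendar dates are an assumption inferred from the monitoring window.
WINDOWS_500 = [
    ("2026-03-21 18:24:08", "2026-03-22 05:40:02", 1182),
    ("2026-03-22 12:46:34", "2026-03-22 14:46:44", 229),
]
COUNT_8403 = 486  # errorCode:8403 responses after the token expired


def span(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


total_500_duration = sum((span(s, e) for s, e, _ in WINDOWS_500), timedelta())
total_500_count = sum(n for _, _, n in WINDOWS_500)
total_polls = total_500_count + COUNT_8403

print(total_500_duration)  # 13:16:04
print(total_500_count)     # 1411
print(total_polls)         # 1897
```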

Stop/Start Cycle Did Not Fix It

During the errorCode:500 period:

  1. Stopped the project via UI
  2. Waited for confirmation
  3. Restarted the project
  4. Polled again — still errorCode:500

Sample API Response

{"logId":"cbcd96d605510f69cd862450ae571317","errorCode":500,"errorMsg":"System Error","timestamp":0,"result":{}}

Each response has a unique logId. The full log of 1,411 logId values is available for server-side correlation.
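
For server-side correlation, the logged responses can be tallied by errorCode and the logId values of the 500 responses extracted. The sketch below assumes the log is stored as one JSON response per line; that storage format and the function name `summarize` are hypothetical, while the field names are taken from the sample response above.

```python
import json
from collections import Counter


def summarize(lines):
    """Count responses per errorCode and collect the logIds of errorCode:500 entries."""
    counts = Counter()
    log_ids_500 = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            counts["unparseable"] += 1
            continue
        if not isinstance(entry, dict):
            counts["unparseable"] += 1
            continue
        code = entry.get("errorCode")
        counts[code] += 1
        if code == 500:
            log_ids_500.append(entry.get("logId"))
    return counts, log_ids_500


# The sample response from this report, used as a one-line log.
SAMPLE = (
    '{"logId":"cbcd96d605510f69cd862450ae571317","errorCode":500,'
    '"errorMsg":"System Error","timestamp":0,"result":{}}'
)

counts, log_ids = summarize([SAMPLE])
print(counts[500], log_ids)
```

Run against the full log, the same function should report 1,411 entries for 500 and 486 for 8403 if the counts in this report are accurate.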

Reproduction

  1. Create a Frame Project (框架开发任务) with a GPU frame (e.g., paddle dev gpu-cuda13.0-cudnn9.13)
  2. Click "启动" (Start)
  3. Poll the readiness endpoint:
    curl -s 'https://aistudio.baidu.com/studio/project/running' \
      -X POST \
      -H "x-studio-token: $TOKEN" \
      -H 'Content-Type: application/x-www-form-urlencoded' \
      --data 'projectId=<PROJECT_ID>'
  4. Observe: errorCode:500, "System Error" — indefinitely
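
The polling in step 3 can also be scripted. The following is a rough Python equivalent of the curl command, using only the standard library. The URL, headers, and form field come from the command above; the function names, the 10-second timeout, and the default `max_polls` are assumptions made for illustration, and this is not necessarily the script used to collect the log in this report.

```python
import json
import time
import urllib.parse
import urllib.request

URL = "https://aistudio.baidu.com/studio/project/running"


def build_request(project_id, token):
    """Build the same POST request that the curl command sends."""
    body = urllib.parse.urlencode({"projectId": project_id}).encode()
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={
            "x-studio-token": token,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def fetch(project_id, token, timeout=10):
    """Send one request and decode the JSON body (timeout value is an assumption)."""
    with urllib.request.urlopen(build_request(project_id, token), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def poll(project_id, token, interval=30, max_polls=10,
         fetch_fn=fetch, sleep_fn=time.sleep):
    """Poll max_polls times, returning (logId, errorCode) for each response."""
    results = []
    for i in range(max_polls):
        payload = fetch_fn(project_id, token)
        results.append((payload.get("logId"), payload.get("errorCode")))
        if i < max_polls - 1:
            sleep_fn(interval)  # ~30 seconds between polls, as in this report
    return results
```

Calling `poll("<PROJECT_ID>", token)` with a valid token would return one `(logId, errorCode)` pair per poll. `fetch_fn` and `sleep_fn` are injectable so the loop can be exercised without network access.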

Note: A second project (10107437), created 2026-03-22, started near-instantly without errorCode:500. The failure may therefore be intermittent, which makes it harder to diagnose but does not reduce the impact when it occurs.

Environment

  • Product: AI Studio Frame Project (框架开发任务)
  • Project: https://aistudio.baidu.com/projectdetailforpaddle/10105279
  • Frame: paddle dev gpu-cuda13.0-cudnn9.13 (V100)
  • API endpoint: POST /studio/project/running
  • Monitoring duration: 2026-03-21 18:24 → 2026-03-22 19:13 (~25 hours)
  • Polling interval: ~30 seconds

Impact

For comparison: Google Cloud provisions equivalent GPU environments in under 60 seconds with 100% reliability across our testing. AI Studio's Frame Project spent 13+ hours returning System Error.

Related
