
openclaw-opd/4b LoRA training error #101

@Mathking1

Description


This is the run script:
run_qwen3_4b_openclaw_opd_topk_lora.sh
Running it directly fails to start the main inference service:
(SGLangEngine pid=3080582) [2026-04-23 11:02:56] INFO: Started server process [3080920]
(SGLangEngine pid=3080582) [2026-04-23 11:02:56] INFO: Waiting for application startup.
(SGLangEngine pid=3080582) [2026-04-23 11:02:56] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}

(SGLangEngine pid=3080582) thread '' (3080920) panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/rayon-core-1.13.0/src/registry.rs:171:10:
(SGLangEngine pid=3080582) The global thread pool has not been initialized. ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
(SGLangEngine pid=3080582) note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
(SGLangEngine pid=3080582) [2026-04-23 11:02:56] ERROR: Traceback (most recent call last):
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/qwenpaw-rl-dev/lib/python3.12/site-packages/starlette/routing.py", line 694, in lifespan
(SGLangEngine pid=3080582) async with self.lifespan_context(app) as maybe_state:
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/python3.12/lib/python3.12/contextlib.py", line 210, in __aenter__
(SGLangEngine pid=3080582) return await anext(self.gen)
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/qwenpaw-rl-dev/lib/python3.12/site-packages/fastapi/routing.py", line 201, in merged_lifespan
(SGLangEngine pid=3080582) async with original_context(app) as maybe_original_state:
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/python3.12/lib/python3.12/contextlib.py", line 210, in __aenter__
(SGLangEngine pid=3080582) return await anext(self.gen)
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/zhongkaipeng/qwenpaw_train/own_train_packages/sglang-d566816d838ce92d3ae044209f7d67eaa58ce74a/python/sglang/srt/entrypoints/http_server.py", line 308, in lifespan
(SGLangEngine pid=3080582) fast_api_app.state.openai_serving_rerank = OpenAIServingRerank(
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/zhongkaipeng/qwenpaw_train/own_train_packages/sglang-d566816d838ce92d3ae044209f7d67eaa58ce74a/python/sglang/srt/entrypoints/openai/serving_rerank.py", line 212, in __init__
(SGLangEngine pid=3080582) self._yes_token_id, self._no_token_id = _get_yes_no_token_ids(
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/zhongkaipeng/qwenpaw_train/own_train_packages/sglang-d566816d838ce92d3ae044209f7d67eaa58ce74a/python/sglang/srt/entrypoints/openai/serving_rerank.py", line 30, in _get_yes_no_token_ids
(SGLangEngine pid=3080582) yes_tokens = tokenizer.encode("yes", add_special_tokens=False)
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/qwenpaw-rl-dev/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2732, in encode
(SGLangEngine pid=3080582) encoded_inputs = self.encode_plus(
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/qwenpaw-rl-dev/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 3123, in encode_plus
(SGLangEngine pid=3080582) return self._encode_plus(
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/qwenpaw-rl-dev/lib/python3.12/site-packages/transformers/tokenization_utils_fast.py", line 627, in _encode_plus
(SGLangEngine pid=3080582) batched_output = self._batch_encode_plus(
(SGLangEngine pid=3080582) File "/var/ai-cloud/project/qwenpaw-rl-dev/lib/python3.12/site-packages/transformers/tokenization_utils_fast.py", line 553, in _batch_encode_plus
(SGLangEngine pid=3080582) encodings = self._tokenizer.encode_batch(
(SGLangEngine pid=3080582) pyo3_runtime.PanicException: The global thread pool has not been initialized. ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
(SGLangEngine pid=3080582) [2026-04-23 11:02:56] ERROR: Application startup failed. Exiting.
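The `ThreadPoolBuildError` with `EAGAIN` ("Resource temporarily unavailable") generally means the process could not spawn another OS thread, often because of a per-user process/thread limit inside the container. A quick diagnostic sketch, assuming a Linux shell on the affected node (the specific limits worth checking are an assumption, not something confirmed in this issue):

```shell
# EAGAIN from rayon's thread-pool builder usually means thread creation
# failed at the OS level rather than in Python. Inspect the limits that
# most commonly cause it:
ulimit -u                              # per-user max processes/threads
cat /proc/sys/kernel/threads-max       # system-wide thread cap
cat /proc/sys/vm/max_map_count         # mmap region limit (each thread's stack uses one)
```

If `ulimit -u` is small relative to the number of Ray workers times their thread pools, raising it (or the container's pids cgroup limit) may address the root cause instead of only masking it.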
If the following line is added:
export TOKENIZERS_PARALLELISM=false
then the script starts normally, but every time qwenpaw invokes a tool, the PRM reports a CUDA OOM error and the PRM inference service on one GPU crashes.
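For reference, the workaround above can be sketched as a one-line addition to the launch script (the placement near the top of run_qwen3_4b_openclaw_opd_topk_lora.sh is an assumption; the variable only needs to be exported before any Python worker process starts, so every SGLang/Ray child inherits it):

```shell
# Disable the HF tokenizers Rust-side thread pool so tokenizer.encode()
# does not try to build a rayon global pool inside forked workers.
# NOTE: this avoids the startup panic but, per this issue, does not fix
# the downstream PRM CUDA OOM.
export TOKENIZERS_PARALLELISM=false
```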
