
LoRA adapters are not properly syncing when using LocalBackend - stale adapters are being used in rollouts #661


Description

@gmanlan

When using LocalBackend (colocate mode), I noticed that the LoRA adapter was not being synced properly. If I killed the run after several steps and resumed from a checkpoint, the reward would climb abruptly, because resuming forced ART to reload the last checkpoint as the first/base model. Upon inspection, the adapter is correctly making its way to vLLM, but the OpenAI serving layer is not "seeing" the new adapter. As a result, rollouts (which obtain the current model using model.get_inference_name()) just keep seeing/using the initial/base adapter (step 0), and the whole training process silently falls apart due to the stale inference.
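For anyone who wants to confirm this locally, here is a rough diagnostic sketch (not ART code): compare the names the OpenAI-compatible server exposes against the name the trainer expects for the current step. The base_url and api_key are placeholders for whatever your LocalBackend server reports, `model` is the ART trainable model from the training script, and it assumes vLLM lists registered LoRA adapters in GET /v1/models.

# Rough diagnostic sketch: list which adapter names the OpenAI serving layer
# actually knows about, and compare with the name the trainer expects.
# base_url / api_key are placeholders; `model` is the ART model object
# from the training script.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

served = [m.id for m in client.models.list().data]
print("Names exposed by the OpenAI layer:", served)

expected = model.get_inference_name()  # e.g. "my-model@7" after step 7
if expected not in served:
    print(f"{expected!r} is not registered with the serving layer; "
          "rollouts are likely hitting a stale adapter.")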

This seems to be a regression that may affect multiple versions, because I do remember this working properly in last year's releases.

While I don't have the means to properly submit a PR at the moment, I wanted to share one possible solution here, in case it helps maintainers:

The problem arises in unsloth/service.py: at the end of _train_shared(), the service adds the new adapter to vLLM but forgets to also register it with _openai_serving_models. A quick fix would be to insert the marked patch below right after the existing llm.add_lora() code block:

lora_request = LoRARequest(
    lora_name=f"{self.model_name}@{new_step}",
    lora_int_id=self._next_lora_id(),
    lora_path=checkpoint_dir,
)
added = await llm.add_lora(lora_request)
if not added:
    raise RuntimeError(
        f"Failed to add LoRA adapter for step {new_step} at {checkpoint_dir}"
    )

# -- Patch here: also register the adapter with the OpenAI serving layer
import art.vllm.server as _vllm_server_mod

serving_models = _vllm_server_mod._openai_serving_models
if serving_models is not None:
    serving_models.lora_requests[lora_request.lora_name] = lora_request
    logger.info(
        "Registered '%s' in OpenAI serving models registry",
        lora_request.lora_name,
    )
else:
    logger.warning(
        "_openai_serving_models is None; LoRA loaded into vLLM "
        "but NOT registered in the OpenAI serving layer. Inference requests "
        "may still use the previous adapter."
    )
# --

self._latest_step = new_step

Notes:

  • Affected versions: I believe multiple versions are affected, but I personally discovered this on 0.5.16.
  • I acknowledge this may not be the best place to fix it, but I've verified that ART works correctly after applying this patch.
  • I don't know whether train_sft is sensitive to this, but if so, I believe it could be patched in exactly the same way (see the sketch after this list).
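If train_sft does need the same fix, one option (purely a sketch, with a hypothetical helper name) would be to factor the registration into a small function next to _train_shared() so both call sites stay in sync:

import logging

import art.vllm.server as _vllm_server_mod
from vllm.lora.request import LoRARequest

logger = logging.getLogger(__name__)


def _register_lora_for_serving(lora_request: LoRARequest) -> None:
    # Hypothetical helper: mirror an adapter that was just added to the vLLM
    # engine into the OpenAI serving layer's registry, so inference requests
    # can resolve the new "<model>@<step>" name.
    serving_models = _vllm_server_mod._openai_serving_models
    if serving_models is None:
        # Same caveat as in the patch above: the adapter is loaded in vLLM,
        # but the OpenAI layer cannot route requests to it by name.
        logger.warning("_openai_serving_models is None; skipping registration")
        return
    serving_models.lora_requests[lora_request.lora_name] = lora_request

Both _train_shared() and (if needed) the SFT path would then call _register_lora_for_serving(lora_request) right after a successful llm.add_lora(lora_request).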

@bradhilton this is what we briefly discussed on Discord.

I hope it helps.
