wip(neuron): add neuron integration for SFT#5125

Draft
michaelbenayoun wants to merge 7 commits into huggingface:main from michaelbenayoun:neuron_integration

Conversation

michaelbenayoun (Member) commented Feb 18, 2026

What does this PR do?

Reference script

MODEL_NAME=Qwen/Qwen3-0.6B
NUM_PROC=4
export TP_SIZE=4

echo "Running SFT with the following configuration:"
echo "Model Name: $MODEL_NAME"
echo "Number of Processes: $NUM_PROC"
echo "Tensor Parallel Size: $TP_SIZE"

export TORCH_NEURONX_ENABLE_STABLEHLO=0
export ON_NEURON_EAGER=1
export TORCH_NEURONX_MLIR_ATEN_OPS=1
export TORCH_NEURONX_NEFF_CACHE_DIR="/home/ubuntu/neff_cache"
export TORCH_NEURONX_NEFF_LOCAL_CACHE_DIR="/home/ubuntu/neff_local_cache/"
export ON_NEURON=1
export TENSOR_DUMPER_OUTPUT_DIR="./tensor_dumps_neuron"

export TORCH_NEURONX_FALLBACK_ONLY_FOR_UNIMPLEMENTED_OPS=1

export OMP_NUM_THREADS=128

# Disable asynchronous loading to avoid hanging; investigate why it does not work in async mode.
export HF_DEACTIVATE_ASYNC_LOAD=1

uv run torchrun --nproc_per_node=$NUM_PROC examples/scripts/sft_neuron.py \
    --model_name_or_path $MODEL_NAME \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --packing \
    --bf16 \
    --fp16 false \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --eos_token '<|im_end|>' \
    --eval_strategy no \
    --logging_steps 1 \
    --use_peft false \
    --lora_r 32 \
    --lora_alpha 16 \
    --report_to wandb \
    --output_dir $MODEL_NAME-SFT
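As a quick sanity check on the hyperparameters above, the effective global batch size can be computed from the per-device batch size, gradient accumulation steps, and process count. This is a small illustrative snippet using the values from the reference script; it is not part of the PR itself:

```shell
# Values taken from the reference script above.
PER_DEVICE_BATCH=1   # --per_device_train_batch_size
GRAD_ACCUM=16        # --gradient_accumulation_steps
NUM_PROC=4           # torchrun --nproc_per_node
# Effective global batch size seen by the optimizer per step.
EFFECTIVE_BATCH=$((PER_DEVICE_BATCH * GRAD_ACCUM * NUM_PROC))
echo "Effective global batch size: $EFFECTIVE_BATCH"  # 64
```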

Changes needed to integrate torch_neuron well with the HF ecosystem:

- Transformers
- Accelerate
- Kernels

Kernels integration to be tested once huggingface/kernels#285 is merged.
