[bug] smdistributed is not included in HuggingFace training image #3989

@dbpprt

Checklist

Concise Description:
The smdistributed package is not installed in the image listed below. Importing it fails with:

ModuleNotFoundError: No module named 'smdistributed'
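
To reproduce the check inside the container, a minimal sketch like the following can be run with the image's Python interpreter. The helper name `module_available` is hypothetical and not part of any SageMaker or Hugging Face API; it only uses the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the module `name` can be located on this interpreter.

    For a dotted name, find_spec imports the parent package first, so a
    missing parent surfaces as ModuleNotFoundError, which is caught here.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    # Names taken from the error message and the wheel mentioned in this issue.
    for name in ("smdistributed", "smdistributed.dataparallel"):
        status = "found" if module_available(name) else "missing"
        print(f"{name}: {status}")
```

On an image without the package, both names should be reported as missing.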

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04

Current behavior:

Expected behavior:

Additional context:
Installing the smdistributed_dataparallel wheel manually gives the following error:

ErrorMessage "ImportError: libsmddpcpp.so: cannot open shared object file: No such file or directory

The wheel was installed from: https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl
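
The ImportError above means the dynamic loader could not find libsmddpcpp.so. One way to narrow this down is to check whether the file exists anywhere under the interpreter's import paths after installing the wheel. The sketch below is only a diagnostic aid under that assumption; `find_files` is a hypothetical helper, and whether the wheel is supposed to ship this library is not confirmed by this issue.

```python
import sys
from pathlib import Path


def find_files(filename: str, roots=None) -> list[Path]:
    """Recursively search `roots` (default: sys.path) for files named `filename`."""
    roots = sys.path if roots is None else roots
    hits = set()
    for root in roots:
        if not root:
            # Skip the empty entry that stands for the current directory.
            continue
        path = Path(root)
        if path.is_dir():
            hits.update(p for p in path.rglob(filename) if p.is_file())
    return sorted(hits)


if __name__ == "__main__":
    matches = find_files("libsmddpcpp.so")
    if matches:
        for match in matches:
            print(match)
    else:
        print("libsmddpcpp.so not found under sys.path")
```

If the file is found, the problem is more likely the loader search path; if it is not found, the library was probably never installed into the image.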
