
Problems of distributed computing in federated learning #36

@rG223

Description

When running in distributed mode, I have four GPUs, each hosting one client. During training, memory usage differs hugely across the GPUs; two of them even ran out of memory. I also noticed that training on the GPUs that overflowed was extremely slow, with GPU utilization close to zero.
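For reference, a minimal sketch of the setup described above, assuming PyTorch (the issue does not state the framework) and a `torch.multiprocessing` launcher; `run_client` is a hypothetical placeholder for the actual client training loop. Pinning each client process to its own GPU and printing per-device memory makes any imbalance visible:

```python
import torch
import torch.multiprocessing as mp


def run_client(rank: int, world_size: int):
    # Pin this client process to its own GPU so tensors created with
    # device="cuda" do not all land on GPU 0.
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    # ... build this client's model and data on `device`, then train ...

    # Report per-GPU memory so uneven usage across clients is visible.
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    reserved = torch.cuda.memory_reserved(device) / 1024**2
    print(f"[client {rank}] allocated={allocated:.1f} MiB, "
          f"reserved={reserved:.1f} MiB")


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # four GPUs in this report
    mp.spawn(run_client, args=(world_size,), nprocs=world_size)
```

If the clients are instead all defaulting to `cuda:0` (e.g. via bare `.cuda()` calls with no device pinning), one GPU would accumulate every client's memory while the others sit idle, which would match the out-of-memory and near-zero-utilization symptoms described.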
