support branch parallel for evoformer #14
GuoxiaWang wants to merge 2 commits into dptech-corp:main
Conversation
Thank you, I will review this over the weekend.
), "Must specify batch size either with --batch-size"
metrics.reset()
args.seed += args.dp_rank
When using a hybrid distributed parallel strategy such as DP-BP, the parameters and data within the same BP group need to be identical, so the seeds within a BP group must be the same.
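A minimal sketch of this seeding rule. It assumes, hypothetically, that consecutive global ranks form one BP group (so `dp_rank = global_rank // bp_degree`); `compute_seed` is an illustrative helper, not the actual Uni-Core code:

```python
def compute_seed(base_seed: int, global_rank: int, bp_degree: int) -> int:
    """Offset the seed by dp_rank only, so every rank inside one BP group
    (all sharing the same dp_rank) draws identical parameters and data."""
    dp_rank = global_rank // bp_degree  # hypothetical rank layout
    return base_seed + dp_rank

# With bp_degree=2: ranks 0 and 1 share a seed; ranks 2 and 3 share another.
print([compute_seed(42, r, 2) for r in range(4)])  # → [42, 42, 43, 43]
```

This is why the diff adds only `args.dp_rank` to the seed: ranks that differ only in their BP index keep the same seed, while different DP replicas diverge.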
if torch.cuda.is_available():
    dist.all_reduce(torch.zeros(1).cuda())
scg.init_group(bp_degree=args.bp_degree, dap_degree=1)
Will this affect the normal c10d and no_c10d modes?
Can we make "bp" a selectable option, like the current c10d and no_c10d choices?
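For context, `scg.init_group` presumably partitions the world into branch-parallel process groups. A hedged pure-Python sketch of one plausible partitioning (consecutive ranks per group; `build_bp_groups` is a hypothetical helper, not the actual scg API):

```python
def build_bp_groups(world_size: int, bp_degree: int):
    """Partition global ranks into branch-parallel groups of size bp_degree.
    Each inner list would back one torch.distributed process group."""
    assert world_size % bp_degree == 0, "world size must be divisible by bp_degree"
    return [list(range(start, start + bp_degree))
            for start in range(0, world_size, bp_degree)]

# 4 GPUs, bp_degree=2 → two BP groups:
print(build_bp_groups(4, 2))  # → [[0, 1], [2, 3]]
```

With `dap_degree=1` as in the diff, only the BP dimension is active, so this grouping alone determines the communicator layout.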
I'm not quite sure about this question. This PR is just meant to show how to use BP, not to be merged into UniCore.
Sorry, I may be missing some context.
return outer_grad.clone(), msa_grad.clone(), pair_grad.clone()
def sync_evoformer_results(outer, msa, pair, training):
I feel like the functions in this file would be better placed in Uni-Fold.
Same issue as above. The code needs to be designed together and then merged into UniFold and UniCore respectively.
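To illustrate what `sync_evoformer_results` might do under Branch Parallelism (each BP rank computes one branch, then partial results are combined across the BP group), here is a hedged single-process simulation of a sum all-reduce. The summation semantics and the `allreduce_sum` helper are assumptions for illustration, not the actual Uni-Fold implementation:

```python
def allreduce_sum(per_rank_values):
    """Simulate dist.all_reduce(op=SUM) across a BP group: every rank ends
    up holding the element-wise sum of all ranks' partial results."""
    total = [sum(col) for col in zip(*per_rank_values)]
    return [list(total) for _ in per_rank_values]

# Two BP ranks each hold a partial (outer, msa, pair) contribution;
# after the simulated all-reduce both ranks see the combined result.
rank0 = [1.0, 0.0, 2.0]   # hypothetical partial outputs on rank 0
rank1 = [0.5, 3.0, 0.0]   # hypothetical partial outputs on rank 1
print(allreduce_sum([rank0, rank1]))
# → [[1.5, 3.0, 2.0], [1.5, 3.0, 2.0]]
```

The matching backward pass would then return cloned gradients per branch, as the `return outer_grad.clone(), msa_grad.clone(), pair_grad.clone()` line in the diff suggests.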
Supports Branch Parallelism as described in the paper "Efficient AlphaFold2 Training using Parallel Evoformer and Branch Parallelism".