Could you please describe the details of rhythm-only conversion?

I don't understand how alignment is obtained when the input utterance to the rhythm encoder is different from the input utterances to the pitch and content encoders. P.S. I also don't understand the implementation details of the variant in Appendix B.3. Thank you.