ManuMatic: Strategy injection for robust automatic hybrid parallelism in distributed DNN training

Wang, Ruiwen; Li, Chong; Wang, Hongxing; Appuswamy, Raja; Yujie, Yuan
NPC 2025, 22nd IFIP International Conference on Network and Parallel Computing, 14-16 November 2025, Nha Trang, Vietnam / Also on Lecture Notes in Computer Science, Vol.16306

Training modern deep neural networks (DNNs) requires hybrid parallelism. Automatic planners search data, tensor/model, and pipeline shardings with cost models, but decisions can drift from runtime optima due to framework/planner decoupling and overlap mis-modeling. We present MANUMATIC, a light-touch planner that lets users pin a few critical operator shardings while automatically deriving globally consistent strategies for the rest. Inside a binary recursive partitioner, MANUMATIC prioritizes pins via an infinite compromise price and decomposes multi-dimensional hints into two-way refinements; when hard constraints are infeasible, a soft-penalty variant applies. The design is profiling-free, preserves D-Rec’s short compilation time, and degenerates to D-Rec when no pins are given. Built atop D-Rec, MANUMATIC delivers consistent speedups without cost-model reengineering: on Mixtral-8×7B, an expert-parallel-aware BMM pin achieves 2.24× over D-Rec; on Llama3-8B, a sequence-parallel-aware MatMul pin reaches 2.04×; on Qwen2.5-72B, a sequence-parallel-aware MatMul pin combined with BMPipe yields 1.45× over D-Rec and 1.30× over an expert plan. These results show that minimal guidance can robustify automatic parallelism while largely preserving automation.
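As a rough illustration of the two mechanisms named in the abstract (decomposing a multi-dimensional hint into two-way refinements, and pricing deviations from a pin at infinity or at a finite soft penalty), the toy sketch below may help. It is not the paper's algorithm: the function names, the power-of-two assumption on shard counts, and the per-dimension cost dictionary are all hypothetical and chosen only for illustration.

```python
import math


def decompose_hint(shard_counts):
    """Split a multi-dimensional sharding hint into two-way steps.

    shard_counts[d] is the desired number of shards along dimension d.
    Assumption (not from the paper): every count is a power of two.
    Returns a list of dimension indices, one entry per two-way split.
    """
    steps = []
    for dim, count in enumerate(shard_counts):
        if count < 1 or count & (count - 1):
            raise ValueError(f"shard count {count} is not a power of two")
        # A count of 2**k needs k successive halvings of this dimension.
        steps.extend([dim] * (count.bit_length() - 1))
    return steps


def choose_split(base_cost, pinned_dim=None, penalty=math.inf):
    """Pick the dimension to halve at one recursion step.

    base_cost maps each candidate dimension to a hypothetical modeled cost.
    If the operator is pinned, every candidate other than pinned_dim is
    charged `penalty`: math.inf acts as a hard pin, a finite value as a
    soft one that a large enough cost gap can still override.
    """
    def priced(dim):
        deviates = pinned_dim is not None and dim != pinned_dim
        return base_cost[dim] + (penalty if deviates else 0.0)

    return min(base_cost, key=priced)


hint_steps = decompose_hint((4, 2))            # two halvings of dim 0, one of dim 1
costs = {0: 1.0, 1: 5.0}                       # made-up costs for two candidates
unpinned = choose_split(costs)                 # the cost model alone decides
hard = choose_split(costs, pinned_dim=1)       # infinite price forces the pin
soft = choose_split(costs, pinned_dim=1, penalty=2.0)  # small penalty is outweighed
```

With these made-up costs, the unpinned choice follows the cost model, the hard pin always wins, and the soft pin wins only when its penalty exceeds the cost gap between candidates.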


DOI: https://doi.org/10.1007/978-3-032-10466-3_14
Type: Conference
City: Nha Trang
Date: 2025-11-14
Department: Data Science
Eurecom Ref: 8487
Copyright: © Springer. Personal use of this material is permitted. The definitive version of this paper was published in NPC 2025, 22nd IFIP International Conference on Network and Parallel Computing, 14-16 November 2025, Nha Trang, Vietnam / Also on Lecture Notes in Computer Science, Vol.16306 and is available at: https://doi.org/10.1007/978-3-032-10466-3_14

PERMALINK : https://www.eurecom.fr/publication/8487