Training today's hundred-billion-parameter neural networks is limited less by model architecture than by how efficiently accelerator clusters are utilized. Existing auto-parallel planners optimize at most three of the four major parallelism axes, namely data (DP), tensor (TP), pipeline (PP), and optimizer (OP) parallelism, and rarely co-tune micro-batch size or activation recomputation, leaving significant speedups untapped. We present H2O, a two-level, holistic hyper-parameter optimizer that jointly explores DP–TP–PP–OP degrees together with micro-batch size, stage assignment, and selective recomputation under a unified analytic compute-communication-memory model. A coarse Level 1 search prunes the vast design space, while a fine-grained Level 2 search balances pipeline stages and recomputation to minimize overall iteration time under device-memory constraints. On a 128-device cluster training a 141-billion-parameter DeepSeek model, H2O is 22.38% faster than an expert hand-tuned baseline and 36.72% faster than the state-of-the-art auto-parallel planner. These results demonstrate that cross-axis optimization is already critical on contemporary hardware and will be indispensable in the trillion-parameter era.
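The two-level search described in the abstract can be illustrated with a small sketch. The Python below is not the paper's implementation: the cost and memory formulas, the candidate grids, the 64 GB per-device budget, and all function names are illustrative assumptions; only the overall structure (a coarse Level 1 prune over DP/TP/PP/OP degrees and micro-batch size, followed by a Level 2 refinement of recomputation under a device-memory constraint) follows the abstract, and Level 2 stage assignment is omitted for brevity.

```python
# Minimal sketch of a two-level hyper-parameter search in the spirit of H2O.
# All cost/memory formulas and constants below are hypothetical stand-ins for
# the paper's analytic compute-communication-memory model.
from dataclasses import dataclass
from itertools import product

DEVICES = 128           # cluster size used in the paper's evaluation
MEM_PER_DEVICE_GB = 64  # assumed device-memory budget (illustrative)

@dataclass(frozen=True)
class Plan:
    dp: int
    tp: int
    pp: int
    op: int
    micro_batch: int
    recompute_ratio: float = 0.0  # fraction of activations recomputed (Level 2)

def coarse_cost(p: Plan) -> float:
    """Illustrative iteration-time proxy: compute shrinks with TP/PP and
    micro-batch, communication grows with TP and DP, bubble grows with PP."""
    compute = 1000.0 / (p.tp * p.pp * p.micro_batch)
    comm = 5.0 * p.tp + 2.0 * p.dp / max(p.op, 1)
    bubble = 0.5 * (p.pp - 1) / max(p.micro_batch, 1)
    return compute + comm + bubble

def coarse_memory_gb(p: Plan) -> float:
    """Illustrative per-device memory proxy: weights and optimizer state shrink
    with TP/PP/OP sharding; activations grow with micro-batch and shrink with
    recomputation."""
    weights = 141e9 * 2 / (p.tp * p.pp) / 1e9                    # fp16 weights
    opt_state = 141e9 * 12 / (p.tp * p.pp * max(p.op, 1)) / 1e9  # optimizer state
    acts = 4.0 * p.micro_batch * (1.0 - p.recompute_ratio)
    return weights + opt_state + acts

def level1(devices: int = DEVICES, keep: int = 8) -> list[Plan]:
    """Coarse search: enumerate parallelism degrees that factor the cluster and
    keep the cheapest memory-feasible candidates."""
    candidates = []
    for dp, tp, pp in product([1, 2, 4, 8, 16], repeat=3):
        if dp * tp * pp != devices:
            continue
        for op in (1, dp):                 # optimizer sharding off / across DP
            for mb in (1, 2, 4, 8):
                plan = Plan(dp, tp, pp, op, mb)
                if coarse_memory_gb(plan) <= MEM_PER_DEVICE_GB:
                    candidates.append((coarse_cost(plan), plan))
    candidates.sort(key=lambda c: c[0])
    return [p for _, p in candidates[:keep]]

def level2(candidates: list[Plan]) -> Plan:
    """Fine-grained search: for each survivor, pick the smallest recomputation
    ratio that fits memory, then return the fastest resulting plan."""
    best, best_cost = None, float("inf")
    for base in candidates:
        for r in (0.0, 0.25, 0.5, 1.0):
            plan = Plan(base.dp, base.tp, base.pp, base.op, base.micro_batch, r)
            if coarse_memory_gb(plan) > MEM_PER_DEVICE_GB:
                continue
            cost = coarse_cost(plan) * (1.0 + 0.3 * r)  # recomputation adds compute
            if cost < best_cost:
                best, best_cost = plan, cost
            break  # smallest feasible ratio is best for this candidate
    return best

if __name__ == "__main__":
    print(level2(level1()))
```

The point of the sketch is the pruning structure: Level 1 evaluates only a cheap analytic proxy over the full cross-product of parallelism degrees and micro-batch sizes, so the expensive Level 2 refinement runs on a handful of survivors rather than the whole design space.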
H2O: Holistic Hyper-Parameter Optimization for Large-Scale Deep Neural Network Training
EURO-PAR 2025, 31st International European Conference on Parallel and Distributed Computing, 25-29 August 2025, Dresden, Germany / Also published in "Lecture Notes in Computer Science"
Type:
Conference
City:
Dresden
Date:
2025-08-25
Department:
Data Science
Eurecom Ref:
8334
Copyright:
© Springer. Personal use of this material is permitted. The definitive version of this paper was published in EURO-PAR 2025, 31st International European Conference on Parallel and Distributed Computing, 25-29 August 2025, Dresden, Germany / Also published in "Lecture Notes in Computer Science" and is available at:
See also:
PERMALINK: https://www.eurecom.fr/publication/8334