Training today's hundred-billion-parameter neural networks is limited less by model architecture than by how efficiently accelerator clusters are utilized. Existing auto-parallel planners optimize at most three of the four major parallelism axes, namely data (DP), tensor (TP), pipeline (PP), and optimizer (OP) parallelism, and rarely co-tune micro-batch size or activation recomputation, leaving significant speedups untapped. We present H2O, a two-level, holistic hyper-parameter optimizer that jointly explores DP–TP–PP–OP degrees together with micro-batch size, stage assignment, and selective recomputation under a unified analytic compute-communication-memory model. A coarse Level 1 search prunes the vast design space, while a fine-grained Level 2 search balances pipeline stages and recomputation to minimize overall iteration time under device-memory constraints. On a 128-device cluster training a 141-billion-parameter DeepSeek model, H2O is 22.38% faster than an expert hand-tuned baseline and 36.72% faster than the state-of-the-art auto-parallel planner. These results demonstrate that cross-axis optimization is already critical on contemporary hardware and will be indispensable in the trillion-parameter era.
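The two-level search described in the abstract can be illustrated with a small sketch. The Python below is not the paper's implementation: the cost and memory formulas, the candidate grids, the 64 GB per-device budget, and all function names are illustrative assumptions; only the overall structure (a coarse Level 1 prune over DP/TP/PP/OP degrees and micro-batch size, followed by a Level 2 refinement of recomputation under a device-memory constraint) follows the abstract, and Level 2 stage assignment is omitted for brevity.

```python
# Minimal sketch of a two-level hyper-parameter search in the spirit of H2O.
# All cost/memory formulas and constants below are hypothetical stand-ins for
# the paper's analytic compute-communication-memory model.
from dataclasses import dataclass
from itertools import product

DEVICES = 128           # cluster size used in the paper's evaluation
MEM_PER_DEVICE_GB = 64  # assumed device-memory budget (illustrative)

@dataclass(frozen=True)
class Plan:
    dp: int
    tp: int
    pp: int
    op: int
    micro_batch: int
    recompute_ratio: float = 0.0  # fraction of activations recomputed (Level 2)

def coarse_cost(p: Plan) -> float:
    """Illustrative iteration-time proxy: compute shrinks with TP/PP and
    micro-batch, communication grows with TP and DP, bubble grows with PP."""
    compute = 1000.0 / (p.tp * p.pp * p.micro_batch)
    comm = 5.0 * p.tp + 2.0 * p.dp / max(p.op, 1)
    bubble = 0.5 * (p.pp - 1) / max(p.micro_batch, 1)
    return compute + comm + bubble

def coarse_memory_gb(p: Plan) -> float:
    """Illustrative per-device memory proxy: weights and optimizer state shrink
    with TP/PP/OP sharding; activations grow with micro-batch and shrink with
    recomputation."""
    weights = 141e9 * 2 / (p.tp * p.pp) / 1e9                    # fp16 weights
    opt_state = 141e9 * 12 / (p.tp * p.pp * max(p.op, 1)) / 1e9  # optimizer state
    acts = 4.0 * p.micro_batch * (1.0 - p.recompute_ratio)
    return weights + opt_state + acts

def level1(devices: int = DEVICES, keep: int = 8) -> list[Plan]:
    """Coarse search: enumerate parallelism degrees that factor the cluster and
    keep the cheapest memory-feasible candidates."""
    candidates = []
    for dp, tp, pp in product([1, 2, 4, 8, 16], repeat=3):
        if dp * tp * pp != devices:
            continue
        for op in (1, dp):                 # optimizer sharding off / across DP
            for mb in (1, 2, 4, 8):
                plan = Plan(dp, tp, pp, op, mb)
                if coarse_memory_gb(plan) <= MEM_PER_DEVICE_GB:
                    candidates.append((coarse_cost(plan), plan))
    candidates.sort(key=lambda c: c[0])
    return [p for _, p in candidates[:keep]]

def level2(candidates: list[Plan]) -> Plan:
    """Fine-grained search: for each survivor, pick the smallest recomputation
    ratio that fits memory, then return the fastest resulting plan."""
    best, best_cost = None, float("inf")
    for base in candidates:
        for r in (0.0, 0.25, 0.5, 1.0):
            plan = Plan(base.dp, base.tp, base.pp, base.op, base.micro_batch, r)
            if coarse_memory_gb(plan) > MEM_PER_DEVICE_GB:
                continue
            cost = coarse_cost(plan) * (1.0 + 0.3 * r)  # recomputation adds compute
            if cost < best_cost:
                best, best_cost = plan, cost
            break  # smallest feasible ratio is best for this candidate
    return best

if __name__ == "__main__":
    print(level2(level1()))
```

The point of the sketch is the pruning structure: Level 1 evaluates only a cheap analytic proxy over the full cross-product of parallelism degrees and micro-batch sizes, so the expensive Level 2 refinement runs on a handful of survivors rather than the whole design space.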
H2O: Holistic Hyper-Parameter Optimization for Large-Scale Deep Neural Network Training
EURO-PAR 2025, 31st International European Conference on Parallel and Distributed Computing, 25-29 August 2025, Dresden, Germany / Also published in "Lecture Notes in Computer Science"
Type:
Conference
City:
Dresden
Date:
2025-08-25
Department:
Data Science
Eurecom Ref:
8334
Copyright:
© Springer. Personal use of this material is permitted. The definitive version of this paper was published in EURO-PAR 2025, 31st International European Conference on Parallel and Distributed Computing, 25-29 August 2025, Dresden, Germany / Also published in "Lecture Notes in Computer Science" and is available at:
See also:
PERMALINK: https://www.eurecom.fr/publication/8334