A Walkthrough of the SKILL.md Files in Megatron-LM


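Agent Skills are reusable, specialized skill modules. Each Skill packages domain-specific knowledge and operating procedures so that an AI agent can handle certain kinds of tasks more efficiently. For more background, see the related post on what Agent Skills are and why they have recently become so popular in the AI community.

Below we walk through several SKILL.md files from the Megatron-LM repository.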

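build-and-dependency (build and dependency management)#

This SKILL.md is the skill document for setting up the Megatron-LM development environment.

Core principle#

Build and develop inside a container: the CI container ships with the correct CUDA toolkit, PyTorch build, and precompiled native extensions, which are hard to reproduce on bare metal.

Main content#

Why use a container#

  • Megatron-LM depends on CUDA, NCCL, a GPU build of PyTorch, TransformerEngine, and other components.
  • The container keeps every developer and the CI environment consistent (same CUDA/NCCL/cuDNN versions).
  • GPU operations work out of the box.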

Getting an image (two options)#

Option A, NVIDIA internal users: pull the prebuilt image from the internal GitLab CI.

⚠️ Requires access to the internal GitLab instance. See tools/trigger_internal_ci.md for setup (adding the git remote, obtaining a token).

The internal GitLab CI publishes images to its container registry. Derive the registry host from your configured gitlab remote — the same host you use for trigger_internal_ci.py:

Terminal window
# Derive host from your 'gitlab' remote:
GITLAB_HOST=$(git remote get-url gitlab | sed 's/.*@\(.*\):.*/\1/')
docker pull ${GITLAB_HOST}/adlr/megatron-lm/mcore_ci_dev:main
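
For readers who want to see what the sed expression extracts, here is a minimal Python sketch of the same substitution. The function name gitlab_host and the example remote URL are hypothetical; only the regular expression mirrors the command above.

```python
import re


def gitlab_host(remote_url: str) -> str:
    """Keep the text between '@' and the last ':' of an SSH-style remote URL."""
    # Like sed, re.sub leaves the input unchanged when the pattern does not match.
    return re.sub(r".*@(.*):.*", r"\1", remote_url)


# Hypothetical SSH-style remote; the host is everything between '@' and ':'.
print(gitlab_host("git@gitlab.example.com:adlr/megatron-lm.git"))
```

Because the capture group stops at the last colon, a port that follows the host is dropped as well, which matches the "no port" advice in the pitfalls table.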

Option B, external users: build the image from the Dockerfile, passing --target main so the build stops at the public stage (the jet stage needs an internal secret).

⚠️ Dockerfile.ci.dev has two stages: main and jet. The jet stage requires an internal build secret and will fail without it. Always pass --target main to stop at the public stage.

Terminal window
# dev image (default)
docker build \
--target main \
--build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
--build-arg IMAGE_TYPE=dev \
-f docker/Dockerfile.ci.dev \
-t megatron-lm:local .
# lts image
docker build \
--target main \
--build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
--build-arg IMAGE_TYPE=lts \
-f docker/Dockerfile.ci.dev \
-t megatron-lm:local-lts .

Which image variant is used is controlled by the PR label container::lts; absent that label, dev is used.
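
The two build commands differ only in the variant name, so the label rule above can be expressed as a small helper. This is a sketch for illustration, not part of the repository: variant_from_labels and docker_build_args are hypothetical names, and the base image is passed in as a string where the shell commands read it from docker/.ngc_version.dev or docker/.ngc_version.lts.

```python
from typing import Iterable, List


def variant_from_labels(labels: Iterable[str]) -> str:
    """Return 'lts' when the PR carries container::lts, otherwise 'dev'."""
    return "lts" if "container::lts" in set(labels) else "dev"


def docker_build_args(variant: str, base_image: str) -> List[str]:
    """Assemble the docker build argv shown above for one image variant."""
    if variant not in ("dev", "lts"):
        raise ValueError(f"unknown image variant: {variant!r}")
    # Tags follow the commands above: megatron-lm:local and megatron-lm:local-lts.
    tag = "megatron-lm:local" if variant == "dev" else "megatron-lm:local-lts"
    return [
        "docker", "build",
        "--target", "main",  # stop before the internal-only jet stage
        "--build-arg", f"FROM_IMAGE_NAME={base_image}",
        "--build-arg", f"IMAGE_TYPE={variant}",
        "-f", "docker/Dockerfile.ci.dev",
        "-t", tag,
        ".",
    ]
```

The returned list can be handed to subprocess.run as-is, which avoids shell quoting issues around the build arguments.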

Launching the container (two options)#

Option A: local Docker

Terminal window
docker run --rm --gpus all \
-v $(pwd):/workspace \
-w /workspace \
megatron-lm:local \
bash -c "<your command>"

Option B: Slurm cluster

NVIDIA clusters typically use Pyxis + enroot. Request an interactive session:

Terminal window
srun \
--nodes=1 --gpus-per-node=8 \
--container-image megatron-lm:local \
--container-mounts $(pwd):/workspace \
--container-workdir /workspace \
--pty bash

For clusters that require a .sqsh archive first:

Terminal window
enroot import -o megatron-lm.sqsh dockerd://megatron-lm:local
srun \
--nodes=1 --gpus-per-node=8 \
--container-image $(pwd)/megatron-lm.sqsh \
--container-mounts $(pwd):/workspace \
--container-workdir /workspace \
--pty bash

Dependency management (with uv)#

Dependencies are declared in pyproject.toml and pinned in the uv.lock lock file. The venv lives at /opt/venv inside the container (already on PATH).

All uv operations must be run inside the container. Never run uv sync / uv pip install on the host.

| Group | Purpose |
| --- | --- |
| training | Runtime training extras |
| dev | Full dev environment (TransformerEngine, ModelOpt, …) |
| lts | LTS-safe subset (no ModelOpt) |
| test | pytest, coverage, nemo-run |
| linting | ruff, black, isort, pylint |
| build | Cython, pybind11, nvidia-mathdx |

Install commands (inside the container):

Terminal window
# Full dev + test environment
uv sync --locked --group dev --group test
# Linting only
uv sync --locked --only-group linting
# LTS environment
uv sync --locked --group lts --group test

Several dependencies are sourced directly from git (TransformerEngine, nemo-run, FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). The locked uv.lock file pins exact revisions; update it with uv lock when changing pyproject.toml.

Adding a dependency: inside the container, run uv add <package>, then uv lock, and then commit the changed files.

Follow this three-step workflow:

  1. Acquire a container image — see Step 1 above.

  2. Launch the container interactively — see Step 2 above.

  3. Update the lock file inside the container, then commit it:

    Terminal window
    # Inside the container:
    uv add <package> # adds to pyproject.toml and resolves
    uv lock # regenerates uv.lock
    # Exit the container, then on the host:
    git add pyproject.toml uv.lock
    git commit -S -s -m "build: add <package> dependency"

Resolving uv.lock merge conflicts: take main's version as the base, then re-run uv lock.

uv.lock is machine-generated; never resolve conflicts manually. Instead:

Terminal window
git checkout origin/main -- uv.lock # take main's version as the base
# then inside the container:
uv lock # re-resolve on top of your pyproject.toml changes

Linting#

Run tools/autoformat.sh to lint the code.

After changing imports, you must run uv run isort.

Run before opening a PR:

Terminal window
# Check mode (no changes applied)
BASE_REF=main CHECK_ONLY=true SKIP_DOCS=false bash tools/autoformat.sh
# Fix mode
BASE_REF=main CHECK_ONLY=false bash tools/autoformat.sh

Tools invoked: black, isort, pylint, ruff, mypy.

After editing imports in any Python files, always run uv run isort on those files before committing (repo CLAUDE.md requirement).

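Common pitfalls#

The SKILL.md closes with a list of common problems and their fixes, such as dependency conflicts, running out of disk space, and permission issues.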

| Problem | Cause | Fix |
| --- | --- | --- |
| uv sync --locked fails | Dependency conflict or stale uv.lock | Re-run uv lock inside the container and commit updated lock |
| ModuleNotFoundError after pip install | pip installed outside the uv-managed venv | Use uv add and uv sync, never bare pip install |
| uv: command not found inside container | Wrong container image | Use the megatron-lm image built from Dockerfile.ci.dev |
| No space left on device during uv ops | Cache fills container's /root/.cache/ | Mount a host cache dir via -v $HOME/.cache/uv:/root/.cache/uv |
| Pre-commit fails with linting errors | Code style violations | Run BASE_REF=main CHECK_ONLY=false bash tools/autoformat.sh |
| docker build fails with secret-related error | Dockerfile.ci.dev has a jet stage that requires an internal secret | Add --target main to stop before the jet stage |
| access forbidden when pulling | Registry URL includes an explicit port (e.g. :5005) | Use ${GITLAB_HOST}/adlr/... with no port; the sed extracts the hostname only |

ci-test-system (test system and CI guide)#

This SKILL.md is the skill document for Megatron-LM's test system and CI workflow.

Test directory layout#

tests/
├── unit_tests/                  # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/            # end-to-end shell + training scripts
│   └── test_cases/
│       └── {model}/{test_case}/
│           ├── model_config.yaml    # training args
│           └── golden_values_{env}_{platform}.json
└── test_utils/
    ├── recipes/
    │   ├── h100/                # YAML recipes for H100 jobs
    │   └── gb200/               # YAML recipes for GB200 jobs
    └── python_scripts/          # helpers (recipe_parser, golden-value download, …)

How tests are executed#

The GitHub Actions runner invokes launch_nemo_run_workload.py, which uses nemo-run to launch a DockerExecutor container. The repo is bind-mounted at /opt/megatron-lm; training data is mounted at /mnt/artifacts.

Unit tests are dispatched through torch.distributed.run:

  • Ranks 0 and 3 are tee-d to stdout; all other ranks write only to log files.
  • Per-rank log files land at {assets_dir}/logs/1/ and are uploaded as a GitHub artifact after the run.

Functional tests are driven by tests/functional_tests/shell_test_utils/run_ci_test.sh. Only rank 0 runs the pytest validation step; training output from all ranks is uploaded as an artifact.

Flaky-failure auto-retry: launch_nemo_run_workload.py retries up to 3 times for known transient patterns (NCCL timeout, ECC error, segfault, HuggingFace connectivity, …) before declaring a genuine failure.
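
The retry behaviour can be sketched as a loop that reruns a job only when its log matches a known transient pattern. This is not the code in launch_nemo_run_workload.py: the pattern strings, the run_once callback, and the function names are all assumptions made for illustration, and max_attempts=3 is one reading of "up to 3 times".

```python
import re
from typing import Callable, Tuple

# Illustrative patterns only; the real script's list is not reproduced here.
TRANSIENT_PATTERNS = [
    re.compile(r"NCCL.*timeout", re.IGNORECASE),
    re.compile(r"ECC error", re.IGNORECASE),
    re.compile(r"segmentation fault|segfault", re.IGNORECASE),
]


def is_transient(log_text: str) -> bool:
    """True when the log matches any of the transient failure patterns."""
    return any(p.search(log_text) for p in TRANSIENT_PATTERNS)


def run_with_retry(
    run_once: Callable[[], Tuple[int, str]], max_attempts: int = 3
) -> Tuple[int, int]:
    """Run a job up to max_attempts times; return (exit_code, attempts_used).

    run_once is a hypothetical callback returning (exit_code, log_text).
    """
    exit_code = 1
    for attempt in range(1, max_attempts + 1):
        exit_code, log_text = run_once()
        if exit_code == 0:
            return 0, attempt
        if not is_transient(log_text):
            # A failure without a transient signature is treated as genuine.
            return exit_code, attempt
    return exit_code, max_attempts
```

Keeping the pattern check separate from the loop makes it easy to extend the list without touching the retry logic.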

Recipe YAML structure#

A recipe file defines the test specification; its products block expands into multiple workloads:

type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt # maps to tests/functional_tests/test_cases/{model}/
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 8
  n_repeat: 5
  platforms: dgx_h100
  time_limit: 1800
  script_setup: |
    ...
  script: |-
    bash tests/functional_tests/shell_test_utils/run_ci_test.sh ...
products:
  - test_case: [my_test]
    products:
      - environment: [dev, lts]
        scope: [mr-github]
        platforms: [dgx_h100]
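
To make the expansion concrete, here is a minimal sketch that turns a nested products block into a flat list of workloads by taking the Cartesian product of each list. It is an illustration under assumptions, not the repository's recipe_parser: the recipe is given as a plain dict (so no YAML library is needed), and expand_products is a hypothetical name.

```python
from itertools import product
from typing import Dict, List, Optional


def expand_products(
    products: List[dict], inherited: Optional[Dict[str, str]] = None
) -> List[Dict[str, str]]:
    """Flatten a nested products block into one dict per workload."""
    inherited = inherited or {}
    workloads = []
    for entry in products:
        nested = entry.get("products")
        # Every key except the nested block holds a list of alternatives.
        axes = {k: v for k, v in entry.items() if k != "products"}
        for combo in product(*axes.values()):
            params = {**inherited, **dict(zip(axes.keys(), combo))}
            if nested:
                workloads.extend(expand_products(nested, params))
            else:
                workloads.append(params)
    return workloads


recipe_products = [
    {
        "test_case": ["my_test"],
        "products": [
            {"environment": ["dev", "lts"], "scope": ["mr-github"], "platforms": ["dgx_h100"]}
        ],
    }
]
name_template = "{test_case}_{environment}_{platforms}"
names = [name_template.format(**w) for w in expand_products(recipe_products)]
print(names)  # ['my_test_dev_dgx_h100', 'my_test_lts_dgx_h100']
```

With the example recipe, the single test case crossed with two environments yields two workloads, one per container variant.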

CI test scope labels#

PR labels determine the test scope, the repeat count, and the container image:

Decision tree (first match wins):

| Condition | scope | n_repeat | lightweight | Notes |
| --- | --- | --- | --- | --- |
| Merge group | mr-github | 1 | false | Automatic, no label needed |
| Label: Run tests | mr-github | 1 | true | Trains 4 steps, no golden-value compare |
| Label: Run functional tests | mr-github | 5 | false | Trains 100 steps, golden-value compare |
| (no label) | mr-github-slim | 5 | false | Slim subset only |

Orthogonal image label:

| Label | Effect |
| --- | --- |
| container::lts | Use the LTS base image instead of dev (combinable with any scope label) |
| Run MBridge tests | Also triggers the MBridge L1 test suite |
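
The decision tree and the orthogonal labels can be read as a first-match-wins lookup. The sketch below encodes only the rows of the tables above; resolve_ci_config and the shape of its return value are hypothetical, and the real workflow logic may differ in its details.

```python
from typing import Dict, Iterable, Union


def resolve_ci_config(
    labels: Iterable[str], is_merge_group: bool = False
) -> Dict[str, Union[str, int, bool]]:
    """Map PR labels to scope, n_repeat, lightweight, image, and MBridge flag."""
    labels = set(labels)
    # First match wins, in the same order as the decision tree.
    if is_merge_group:
        config = {"scope": "mr-github", "n_repeat": 1, "lightweight": False}
    elif "Run tests" in labels:
        config = {"scope": "mr-github", "n_repeat": 1, "lightweight": True}
    elif "Run functional tests" in labels:
        config = {"scope": "mr-github", "n_repeat": 5, "lightweight": False}
    else:
        config = {"scope": "mr-github-slim", "n_repeat": 5, "lightweight": False}
    # Orthogonal labels apply on top of whichever scope row matched.
    config["image"] = "lts" if "container::lts" in labels else "dev"
    config["mbridge"] = "Run MBridge tests" in labels
    return config
```

For example, a PR labelled both Run tests and Run functional tests resolves to the lightweight row, because Run tests is checked first.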

Adding tests#

Unit tests:

  1. Create tests/unit_tests/<category>/test_<name>.py.
  2. Use the fixtures from conftest.py.
  3. Add markers (@pytest.mark.internal / flaky / experimental).
  4. Verify inside the container.

Functional tests:

  1. Create tests/functional_tests/test_cases/<model>/<test_name>/.
  2. Write model_config.yaml.
  3. Add a YAML recipe.
  4. Push a PR and add the Run functional tests label.
  5. After a successful run, download the golden values and commit them.

Investigating CI failures#

Terminal window
# Extract PR number from the current branch
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
# Fetch the PR metadata (title, labels, author, base branch)
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM
# Show the changeset for that PR
gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM
# List the files changed in the PR
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM --json files --jq '.files[].path'

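Reading the logs:

  • The runner's stdout shows only ranks 0 and 3.
  • The full logs are uploaded as an artifact (logs-<test_case>-<run_id>-<uuid>).
  • After downloading the artifact, use grep -rlE "ERROR|Traceback" to locate the errors (the -E flag is needed for | to act as alternation).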
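
The same search can be done in Python, which is handy when the artifact is unpacked on a machine without GNU grep. files_with_errors is a hypothetical helper written for this post; it lists every file under a directory that contains either marker.

```python
import re
from pathlib import Path
from typing import List

ERROR_MARKER = re.compile(r"ERROR|Traceback")


def files_with_errors(log_root: str) -> List[str]:
    """Return sorted paths of files under log_root that mention ERROR or Traceback."""
    hits = []
    for path in Path(log_root).rglob("*"):
        if not path.is_file():
            continue
        # Rank logs can contain stray bytes, so decode leniently.
        text = path.read_text(encoding="utf-8", errors="replace")
        if ERROR_MARKER.search(text):
            hits.append(str(path))
    return sorted(hits)
```

Calling files_with_errors on the unpacked artifact directory returns the per-rank log files worth opening first.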

Common errors#

| Problem | Cause | Fix |
| --- | --- | --- |
| Port collision on multi-GPU runs | torchrun binding conflicts | Use torch.distributed.run via the container entry point |
| Test passes locally but fails in CI | Different environment or data path | Check DATA_PATH, DATA_CACHE_PATH, and the environment tag (dev vs lts) |
| Golden value mismatch after a code change | Numerical regression | Download new golden values via download_golden_values.py after a clean run |
| cicd-integration-tests-gb200 not triggered | GB200 jobs require maintainer status | Ask a maintainer to trigger, or add the Run functional tests label |

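run-on-slurm (guide to running Megatron-LM on SLURM clusters)#

This SKILL.md is the skill document for launching distributed Megatron-LM training jobs on a SLURM cluster.

Prerequisites#

  • You can log in to a SLURM cluster and have permission to submit jobs to a GPU compute partition.
  • The Megatron-LM repository is checked out on a file system (NFS, Lustre, or similar) that every allocated node can reach. All nodes must see the code, data, checkpoints, and outputs at the same paths.
  • uv is installed. Before submitting a job, run uv sync --extra training --extra dev (or --extra lts) once in the working directory to materialize the .venv virtual environment, so that every node can see and use it.

Minimal sbatch script template#

Save the following as run_megatron.slurm in the working directory: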

#!/bin/bash
#SBATCH --job-name=megatron
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --nodes=<NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<GPUS_PER_NODE>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
set -euo pipefail
cd <MEGATRON_WORKTREE>
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=${MASTER_PORT:-29500}
export NNODES=${SLURM_NNODES}
export GPUS_PER_NODE=<GPUS_PER_NODE>
export WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
# Set CUDA_DEVICE_MAX_CONNECTIONS only when your configuration requires it
# (see the section below). Example for pre-Blackwell with TP>1 or CP>1
# (non-FSDP):
# export CUDA_DEVICE_MAX_CONNECTIONS=1
srun --ntasks=${NNODES} --ntasks-per-node=1 bash -c '
# NODE_RANK comes from SLURM_NODEID with one task per node.
NODE_RANK=${SLURM_NODEID}
uv run python -m torch.distributed.run \
--nnodes='"${NNODES}"' \
--nproc-per-node='"${GPUS_PER_NODE}"' \
--node-rank=${NODE_RANK} \
--master-addr='"${MASTER_ADDR}"' \
--master-port='"${MASTER_PORT}"' \
pretrain_gpt.py \
<MEGATRON_ARGS>
'

Submit the job:

Terminal window
mkdir -p logs && JOB_ID=$(sbatch --parsable run_megatron.slurm)
echo "Submitted ${JOB_ID}"

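Multi-node rules#

  • All nodes must reach the shared file system at the same paths.
  • Use one torchrun worker group that spans all nodes; do not launch independent single-node jobs.
  • --nproc-per-node should equal the number of GPUs visible on each node.
  • Write checkpoints and TensorBoard data to shared storage.

CUDA_DEVICE_MAX_CONNECTIONS#

Do not set it blindly; decide based on the hardware and the parallelism mode:

  • Pre-Blackwell (Hopper, Ampere) with TP>1 or CP>1, non-FSDP: set it to 1. The code asserts on this value, so any other value fails with an assertion error rather than a silent deadlock.
  • Blackwell: no configuration needed; setting the variable has no effect.
  • Torch-FSDP2 or Megatron-FSDP: never set it to 1. Leave the variable unset or set it to a value greater than 1.
  • With overlap_moe_expert_parallel_comm enabled: set it to 32.

When your configuration requires it, set the variable explicitly in the sbatch script.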
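
The rules above can be condensed into a small lookup. This is only a sketch: recommended_max_connections is a hypothetical helper, and the order in which the checks are applied when several rules match at once (for example FSDP together with MoE overlap) is an assumption that the SKILL.md excerpt does not spell out.

```python
from typing import Optional


def recommended_max_connections(
    blackwell: bool,
    tp: int = 1,
    cp: int = 1,
    fsdp: bool = False,
    overlap_moe_ep_comm: bool = False,
) -> Optional[int]:
    """Return a value for CUDA_DEVICE_MAX_CONNECTIONS, or None to leave it unset."""
    if overlap_moe_ep_comm:
        # overlap_moe_expert_parallel_comm asks for 32 (precedence is assumed).
        return 32
    if fsdp:
        # Must not be 1 with Torch-FSDP2 or Megatron-FSDP; unset is allowed.
        return None
    if blackwell:
        # Setting the variable has no effect on Blackwell.
        return None
    if tp > 1 or cp > 1:
        # Pre-Blackwell, non-FSDP, with tensor or context parallelism.
        return 1
    return None
```

A launcher script could call this once and export the variable only when the result is not None.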

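Using containers#

Many platforms run Megatron-LM inside a container (enroot/pyxis or singularity). In that case, the uv-managed .venv must live at a path that is accessible inside the container, and the image must provide the CUDA, NCCL, and PyTorch versions the repository expects (see docker/.ngc_version.dev and docker/.ngc_version.lts). The skeleton script above stays the same; just add the container flags (such as --container-image=… and --container-mounts=…) to the srun call.

Monitoring and collecting results#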

Terminal window
squeue -j "$JOB_ID" -o "%.10i %.8T %.10M %.6D %R"
sacct -j "$JOB_ID" --format=JobID,State,ExitCode,Elapsed
scancel "$JOB_ID"

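If your training script emits a result artifact (for example a JSON metrics file written by rank 0, or a final checkpoint), poll for that artifact instead of waiting only on the squeue job state. Useful output often appears before SLURM marks the job as finished, so polling lets you cancel the job as soon as the artifact exists instead of holding cluster resources until the time limit.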
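
A polling loop for such an artifact might look like the sketch below. wait_for_artifact is a hypothetical helper, and treating "exists and is non-empty" as "ready" is an assumption: a script that writes its output incrementally would need a stronger completion signal, such as a rename into place.

```python
import time
from pathlib import Path


def wait_for_artifact(path: str, timeout_s: float, poll_interval_s: float = 30.0) -> bool:
    """Poll until path exists and is non-empty; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        artifact = Path(path)
        if artifact.is_file() and artifact.stat().st_size > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

Once the function returns True, the caller can collect the file and run scancel on the job ID instead of waiting for the allocation to expire.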

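Failure diagnosis#

Scan stderr from every rank, not just rank 0. The earliest Python traceback that is not NCCL-related is usually the root cause; NCCL timeouts that show up later on other ranks are knock-on effects of that first crash.

Quick triage:

  • Out of memory (OOM): record the rank, phase (forward/backward/optimizer), batch size, sequence length, parallel configuration (TP/DP/CP/PP), and peak memory.
  • Shape or divisibility errors: verify WORLD_SIZE = TP × DP × CP × PP and the attention-head divisibility rule (num_attention_heads % TP == 0).
  • Import errors: check the worktree path, whether uv sync has been run, and whether PYTHONPATH is stale. Always confirm that you have changed into <MEGATRON_WORKTREE> before launching.
  • NCCL failure with no Python traceback: check the resource allocation, port reachability, whether MASTER_ADDR resolves, and whether every rank runs the same command.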
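
The two arithmetic checks from the shape/divisibility item are easy to run before submitting a job. The helper below is a sketch with a hypothetical name; it covers only the two rules quoted in this list, not every constraint Megatron-LM enforces.

```python
from typing import List


def check_parallel_config(
    world_size: int, tp: int, dp: int, cp: int, pp: int, num_attention_heads: int
) -> List[str]:
    """Return a list of human-readable problems; an empty list means both checks pass."""
    problems = []
    expected = tp * dp * cp * pp
    if world_size != expected:
        problems.append(f"WORLD_SIZE={world_size} but TP*DP*CP*PP={expected}")
    if num_attention_heads % tp != 0:
        problems.append(
            f"num_attention_heads={num_attention_heads} is not divisible by TP={tp}"
        )
    return problems


# Example: 2 nodes x 8 GPUs with TP=4, DP=2, CP=1, PP=2 and 32 attention heads.
print(check_parallel_config(16, tp=4, dp=2, cp=1, pp=2, num_attention_heads=32))  # []
```

Running the check on the login node costs nothing and catches a mismatched layout before it burns an allocation.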

Common mistakes#

  • Forgetting uv sync before the first submission. If the venv is missing, every job rebuilds it from inside srun, costing minutes per job.
  • Writing logs to a node-local path that disappears at job exit. Always write to the shared filesystem.
  • Setting CUDA_DEVICE_MAX_CONNECTIONS=1 blindly. The right value depends on hardware and parallelism mode (see the dedicated section above). Setting it to 1 with FSDP causes a different problem; on Blackwell it has no effect; on pre-Blackwell with TP>1 or CP>1 (non-FSDP) the code asserts, it does not deadlock.
  • Running bare torchrun instead of uv run python -m torch.distributed.run. Bare torchrun may dispatch through a python interpreter that does not see venv packages, depending on how the venv is set up.

Author: Ming
Published: 2026-04-30
License: CC BY-NC-SA 4.0
Source: https://llm-tech.com.cn/posts/megatron-skills/