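Reading the SKILL.md Files in Megatron-LM

Agent Skills are reusable modules of specialized expertise. Each Skill packages the knowledge and operating procedures for a particular domain, so that an AI agent can handle certain kinds of tasks more efficiently. For more background on Agent Skills, see the companion article "What Are Agent Skills, the Recent Hit in the AI World?"

Below we walk through several of the SKILL.md files in the Megatron code repository.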

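build-and-dependency (build and dependency management)#

This SKILL.md is the development-environment setup skill for the Megatron-LM project.

Core principle#

Build and develop inside the container. The CI container comes with the correct CUDA toolkit, PyTorch build, and precompiled native extensions preinstalled, all of which are hard to reproduce on bare metal.

Main content#

Why use a container#

Megatron-LM depends on CUDA, NCCL, a GPU build of PyTorch, TransformerEngine, and other components.
A container keeps every developer and the CI environment consistent (same CUDA/NCCL/cuDNN versions).
GPU operations work out of the box.

Getting the image (two options)#

Option A, for NVIDIA internal users: pull the prebuilt image from the internal GitLab CI.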

⚠️ Requires access to the internal GitLab instance. See tools/trigger_internal_ci.md for setup (adding the git remote, obtaining a token).

The internal GitLab CI publishes images to its container registry. Derive the registry host from your configured gitlab remote — the same host you use for trigger_internal_ci.py:

# Derive host from your 'gitlab' remote:
GITLAB_HOST=$(git remote get-url gitlab | sed 's/.*@\(.*\):.*/\1/')

docker pull ${GITLAB_HOST}/adlr/megatron-lm/mcore_ci_dev:main
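
To see what the sed expression keeps, here is a small Python sketch of the same substitution. The function name extract_host and the example.com remote URLs are illustrative assumptions, not values from the repository.

```python
import re


def extract_host(remote_url: str) -> str:
    # Same pattern as the sed call: greedy match up to the last "@",
    # capture everything up to the last ":", discard the rest.
    match = re.fullmatch(r".*@(.*):.*", remote_url)
    # sed leaves the line untouched when the pattern does not match.
    return match.group(1) if match else remote_url


# Hypothetical remotes, for illustration only.
print(extract_host("git@gitlab.example.com:adlr/megatron-lm.git"))
print(extract_host("ssh://git@gitlab.example.com:5005/adlr/megatron-lm.git"))
```

Both calls print gitlab.example.com. The port in the second URL is dropped, which is the port-free registry host the pull command expects.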

Option B, for external users: build from the Dockerfile. You must pass --target main so the build stops at the public stage (the jet stage needs an internal secret).

⚠️ Dockerfile.ci.dev has two stages: main and jet. The jet stage requires an internal build secret and will fail without it. Always pass --target main to stop at the public stage.

# dev image (default)
docker build \
  --target main \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
  --build-arg IMAGE_TYPE=dev \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local .

# lts image
docker build \
  --target main \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
  --build-arg IMAGE_TYPE=lts \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local-lts .

Which image variant is used is controlled by the PR label container::lts; absent that label, dev is used.

Launching the container (two options)#

Option A: local Docker

docker run --rm --gpus all \
  -v $(pwd):/workspace \
  -w /workspace \
  megatron-lm:local \
  bash -c "<your command>"

Option B: Slurm cluster

NVIDIA clusters typically use Pyxis + enroot. Request an interactive session:

srun \
  --nodes=1 --gpus-per-node=8 \
  --container-image megatron-lm:local \
  --container-mounts $(pwd):/workspace \
  --container-workdir /workspace \
  --pty bash

For clusters that require a .sqsh archive first:

enroot import -o megatron-lm.sqsh dockerd://megatron-lm:local
srun \
  --nodes=1 --gpus-per-node=8 \
  --container-image $(pwd)/megatron-lm.sqsh \
  --container-mounts $(pwd):/workspace \
  --container-workdir /workspace \
  --pty bash

Dependency management (with uv)#

Every uv operation has to run inside the container.

Dependencies are declared in pyproject.toml. The venv lives at /opt/venv inside the container (already on PATH).

All uv operations must be run inside the container. Never run uv sync / uv pip install on the host.

Dependencies are defined in pyproject.toml, and the lock file is uv.lock. The dependency groups are:

| Group | Purpose |
| --- | --- |
| `training` | Runtime training extras |
| `dev` | Full dev environment (TransformerEngine, ModelOpt, …) |
| `lts` | LTS-safe subset (no ModelOpt) |
| `test` | pytest, coverage, nemo-run |
| `linting` | ruff, black, isort, pylint |
| `build` | Cython, pybind11, nvidia-mathdx |

Install commands (inside the container):

# Full dev + test environment
uv sync --locked --group dev --group test

# Linting only
uv sync --locked --only-group linting

# LTS environment
uv sync --locked --group lts --group test

Several dependencies are sourced directly from git (TransformerEngine, nemo-run, FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). The locked uv.lock file pins exact revisions; update it with uv lock when changing pyproject.toml.

To add a dependency: inside the container, run uv add <package>, then uv lock, then commit the changed files.

Follow this three-step workflow:

Acquire a container image, as described in the image section above.
Launch the container interactively, as described in the container launch section above.

Update the lock file inside the container, then commit it:

# Inside the container:
uv add <package>          # adds to pyproject.toml and resolves
uv lock                   # regenerates uv.lock
# Exit the container, then on the host:
git add pyproject.toml uv.lock
git commit -S -s -m "build: add <package> dependency"

Resolving uv.lock merge conflicts: take the version from main as the base, then re-run uv lock.

uv.lock is machine-generated; never resolve conflicts manually. Instead:

git checkout origin/main -- uv.lock   # take main's version as the base
# then inside the container:
uv lock                               # re-resolve on top of your pyproject.toml changes

Code checks#

Run tools/autoformat.sh for linting.

After changing imports, you must run uv run isort.

Run before opening a PR:

# Check mode (no changes applied)
BASE_REF=main CHECK_ONLY=true SKIP_DOCS=false bash tools/autoformat.sh

# Fix mode
BASE_REF=main CHECK_ONLY=false bash tools/autoformat.sh

Tools invoked: black, isort, pylint, ruff, mypy.

After editing imports in any Python files, always run uv run isort on those files before committing (repo CLAUDE.md requirement).

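Common pitfalls#

The SKILL.md ends with a list of common problems and their fixes, covering dependency conflicts, running out of disk space, permission issues, and more.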

| Problem | Cause | Fix |
| --- | --- | --- |
| `uv sync --locked` fails | Dependency conflict or stale `uv.lock` | Re-run `uv lock` inside the container and commit updated lock |
| `ModuleNotFoundError` after pip install | pip installed outside the uv-managed venv | Use `uv add` and `uv sync`, never bare `pip install` |
| `uv: command not found` inside container | Wrong container image | Use the `megatron-lm` image built from `Dockerfile.ci.dev` |
| `No space left on device` during uv ops | Cache fills container's `/root/.cache/` | Mount a host cache dir via `-v $HOME/.cache/uv:/root/.cache/uv` |
| Pre-commit fails with linting errors | Code style violations | Run `BASE_REF=main CHECK_ONLY=false bash tools/autoformat.sh` |
| `docker build` fails with secret-related error | `Dockerfile.ci.dev` has a `jet` stage that requires an internal secret | Add `--target main` to stop before the `jet` stage |
| `access forbidden` when pulling | Registry URL includes an explicit port (e.g. `:5005`) | Use `${GITLAB_HOST}/adlr/...` with no port; the sed extracts the hostname only |

ci-test-system (test system and CI guide)#

This SKILL.md is the skill document for Megatron-LM's test system and CI workflow.

Test directory layout#

tests/
├── unit_tests/          # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/    # end-to-end shell + training scripts
│   └── test_cases/
│       └── {model}/{test_case}/
│           ├── model_config.yaml          # training args
│           └── golden_values_{env}_{platform}.json
└── test_utils/
    ├── recipes/
    │   ├── h100/        # YAML recipes for H100 jobs
    │   └── gb200/       # YAML recipes for GB200 jobs
    └── python_scripts/  # helpers (recipe_parser, golden-value download, …)

How tests are executed#

GitHub Actions calls launch_nemo_run_workload.py, which starts a Docker container through nemo-run.

The GitHub Actions runner invokes launch_nemo_run_workload.py, which uses nemo-run to launch a DockerExecutor container. The repo is bind-mounted at /opt/megatron-lm; training data is mounted at /mnt/artifacts.

Unit tests: run through torch.distributed.run. Ranks 0 and 3 go to stdout, and the other ranks write to log files.

Unit tests are dispatched through torch.distributed.run:

Ranks 0 and 3 are tee-d to stdout; all other ranks write only to log files.

Per-rank log files land at {assets_dir}/logs/1/ and are uploaded as a GitHub artifact after the run.

Functional tests: driven by run_ci_test.sh, with only rank 0 running the pytest validation.

Functional tests are driven by tests/functional_tests/shell_test_utils/run_ci_test.sh. Only rank 0 runs the pytest validation step; training output from all ranks is uploaded as an artifact.

Automatic retry: up to 3 times, for known transient failure patterns (NCCL timeout, ECC error, segfault, and so on).

Flaky-failure auto-retry: launch_nemo_run_workload.py retries up to 3 times for known transient patterns (NCCL timeout, ECC error, segfault, HuggingFace connectivity, …) before declaring a genuine failure.
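
The retry behavior can be pictured with the following sketch. It is not the code in launch_nemo_run_workload.py: the regular expressions, the run_once callable, and the reading of "up to 3 times" as three attempts in total are all assumptions made for illustration.

```python
import re
from typing import Callable

# Hypothetical patterns; the real list lives in launch_nemo_run_workload.py.
TRANSIENT_PATTERNS = [
    re.compile(r"NCCL.*timeout", re.IGNORECASE),
    re.compile(r"ECC error", re.IGNORECASE),
    re.compile(r"segmentation fault|segfault", re.IGNORECASE),
]


def is_transient(log_text: str) -> bool:
    """True if the log matches one of the known transient patterns."""
    return any(p.search(log_text) for p in TRANSIENT_PATTERNS)


def run_with_retry(
    run_once: Callable[[], tuple[int, str]], max_attempts: int = 3
) -> tuple[int, int]:
    """Run a workload, retrying only on transient failures.

    run_once returns (exit_code, log_text). The result is
    (final_exit_code, attempts_used).
    """
    exit_code = 1
    for attempt in range(1, max_attempts + 1):
        exit_code, log_text = run_once()
        if exit_code == 0:
            return 0, attempt
        if not is_transient(log_text):
            # Genuine failure: retrying would only waste GPU time.
            return exit_code, attempt
    return exit_code, max_attempts
```

The point of classifying first is that a real regression fails on the first attempt and is reported immediately, while an infrastructure hiccup gets another chance.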

Recipe YAML structure#

A recipe file defines the test specification, and its products block expands it into multiple workloads:

type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt              # maps to tests/functional_tests/test_cases/{model}/
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 8
  n_repeat: 5
  platforms: dgx_h100
  time_limit: 1800
  script_setup: |
    ...
  script: |-
    bash tests/functional_tests/shell_test_utils/run_ci_test.sh ...
products:
  - test_case: [my_test]
    products:
      - environment: [dev, lts]
        scope: [mr-github]
        platforms: [dgx_h100]
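
One way to read the products block is as a nested Cartesian product: each entry contributes lists of values, and a nested products list is combined with its parent's values. The sketch below implements that reading. It is an assumption about the semantics, and the actual recipe_parser helper may behave differently; the function name expand is hypothetical.

```python
from itertools import product
from typing import Optional


def expand(products: list, inherited: Optional[dict] = None) -> list:
    """Expand nested products entries into a flat list of workloads."""
    inherited = inherited or {}
    workloads = []
    for entry in products:
        # Every key except the nested "products" is an axis of values.
        axes = {k: v for k, v in entry.items() if k != "products"}
        keys = list(axes)
        for combo in product(*(axes[k] for k in keys)):
            merged = {**inherited, **dict(zip(keys, combo))}
            if "products" in entry:
                workloads.extend(expand(entry["products"], merged))
            else:
                workloads.append(merged)
    return workloads


# The products block from the recipe above, written as Python data.
recipe_products = [
    {
        "test_case": ["my_test"],
        "products": [
            {
                "environment": ["dev", "lts"],
                "scope": ["mr-github"],
                "platforms": ["dgx_h100"],
            }
        ],
    }
]

for workload in expand(recipe_products):
    print("{test_case}_{environment}_{platforms}".format(**workload))
```

Under this reading the recipe yields two workloads, my_test_dev_dgx_h100 and my_test_lts_dgx_h100, one per environment.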

CI test scope labels#

PR labels determine the test scope, the number of repeats, and the container image:

Decision tree (first match wins):

| Condition | `scope` | `n_repeat` | `lightweight` | Notes |
| --- | --- | --- | --- | --- |
| Merge group | `mr-github` | 1 | false | Automatic, no label needed |
| Label: `Run tests` | `mr-github` | 1 | true | Trains 4 steps, no golden-value compare |
| Label: `Run functional tests` | `mr-github` | 5 | false | Trains 100 steps, golden-value compare |
| (no label) | `mr-github-slim` | 5 | false | Slim subset only |

Orthogonal image label:

| Label | Effect |
| --- | --- |
| `container::lts` | Use the LTS base image instead of `dev` (combinable with any scope label) |
| `Run MBridge tests` | Also triggers the MBridge L1 test suite |
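
The "first match wins" rule is easy to misread when a PR carries several labels, so here is the decision tree written out as a function. This is a sketch of the two tables, not the CI workflow's code. The function name, the returned dictionary keys, and the choice to apply container::lts to merge groups as well are assumptions.

```python
def select_ci_config(labels: set, is_merge_group: bool = False) -> dict:
    """Map PR labels to a test configuration; the first matching rule wins."""
    if is_merge_group:
        scope, n_repeat, lightweight = "mr-github", 1, False
    elif "Run tests" in labels:
        scope, n_repeat, lightweight = "mr-github", 1, True
    elif "Run functional tests" in labels:
        scope, n_repeat, lightweight = "mr-github", 5, False
    else:
        scope, n_repeat, lightweight = "mr-github-slim", 5, False

    return {
        "scope": scope,
        "n_repeat": n_repeat,
        "lightweight": lightweight,
        # Orthogonal labels are evaluated independently of the scope.
        "image": "lts" if "container::lts" in labels else "dev",
        "mbridge": "Run MBridge tests" in labels,
    }


print(select_ci_config({"Run tests", "Run functional tests"}))
```

With both scope labels present, Run tests is matched first, so the run is lightweight with a single repeat and no golden-value comparison.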

Adding tests#

Unit tests:

Create tests/unit_tests/<category>/test_<name>.py
Use the fixtures in conftest.py
Add markers (@pytest.mark.internal / flaky / experimental)
Verify inside the container

Functional tests:

Create tests/functional_tests/test_cases/<model>/<test_name>/
Write model_config.yaml
Add a YAML recipe
Push the PR and add the Run functional tests label
After a successful run, download the golden values and commit them

Investigating CI failures#

# Extract PR number from the current branch
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')

# Fetch the PR metadata (title, labels, author, base branch)
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM

# Show the changeset for that PR
gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM

# List the files changed in the PR
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM --json files --jq '.files[].path'

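Reading the logs:

Runner stdout shows only ranks 0 and 3
Full logs are uploaded as an artifact (logs-<test_case>-<run_id>-<uuid>)
After downloading the artifact, use grep -r -l -E "ERROR|Traceback" to find the files that contain errors (the -E flag is needed so that | is treated as alternation)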
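
When the artifact is large, the same search can be scripted. The sketch below walks a downloaded log directory and lists the files that mention ERROR or Traceback. The function name and the directory layout are assumptions; it only reads files and makes no claim about how the artifact is structured.

```python
import re
from pathlib import Path

ERROR_RE = re.compile(r"ERROR|Traceback")


def files_with_errors(log_dir: str) -> list:
    """Return paths (relative to log_dir) of files mentioning an error."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going on non-UTF-8 bytes.
        text = path.read_text(errors="replace")
        if ERROR_RE.search(text):
            hits.append(str(path.relative_to(log_dir)))
    return hits
```

Calling files_with_errors on the unpacked artifact directory gives a sorted list, which makes it easy to spot which ranks failed.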

Common errors#

| Problem | Cause | Fix |
| --- | --- | --- |
| Port collision on multi-GPU runs | torchrun binding conflicts | Use `torch.distributed.run` via the container entry point |
| Test passes locally but fails in CI | Different environment or data path | Check `DATA_PATH`, `DATA_CACHE_PATH`, and the `environment` tag (`dev` vs `lts`) |
| Golden value mismatch after a code change | Numerical regression | Download new golden values via `download_golden_values.py` after a clean run |
| `cicd-integration-tests-gb200` not triggered | GB200 jobs require maintainer status | Ask a maintainer to trigger, or add the `Run functional tests` label |

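run-on-slurm (guide to running Megatron-LM on a SLURM cluster)#

This SKILL.md is the skill document for launching distributed Megatron-LM training jobs on a SLURM cluster.

Prerequisites#

Login access to a SLURM cluster, with permission to submit jobs to a GPU compute partition.
A checkout of the Megatron-LM repository on a filesystem that every allocated node can reach (NFS, Lustre, or similar). All nodes must see the code, data, checkpoints, and outputs at the same paths.
uv installed. Before submitting a job, run uv sync --extra training --extra dev (or --extra lts) once in the working directory to materialize the .venv virtual environment, so that every node can find and use it.

Minimal sbatch script template#

Save the following as run_megatron.slurm in the working directory: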

#!/bin/bash
#SBATCH --job-name=megatron
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --nodes=<NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<GPUS_PER_NODE>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

set -euo pipefail
cd <MEGATRON_WORKTREE>

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=${MASTER_PORT:-29500}
export NNODES=${SLURM_NNODES}
export GPUS_PER_NODE=<GPUS_PER_NODE>
export WORLD_SIZE=$((NNODES * GPUS_PER_NODE))

# Set CUDA_DEVICE_MAX_CONNECTIONS only when your configuration requires it
# (see the section below). Example for pre-Blackwell with TP>1 or CP>1
# (non-FSDP):
#   export CUDA_DEVICE_MAX_CONNECTIONS=1

srun --ntasks=${NNODES} --ntasks-per-node=1 bash -c '
  # NODE_RANK comes from SLURM_NODEID with one task per node.
  NODE_RANK=${SLURM_NODEID}
  uv run python -m torch.distributed.run \
    --nnodes='"${NNODES}"' \
    --nproc-per-node='"${GPUS_PER_NODE}"' \
    --node-rank=${NODE_RANK} \
    --master-addr='"${MASTER_ADDR}"' \
    --master-port='"${MASTER_PORT}"' \
    pretrain_gpt.py \
      <MEGATRON_ARGS>
'

Submit the job:

mkdir -p logs && JOB_ID=$(sbatch --parsable run_megatron.slurm)
echo "Submitted ${JOB_ID}"

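Multi-node rules#

All nodes must see the shared filesystem at the same paths.
Use a single torchrun worker group spanning all nodes; do not launch independent single-node jobs.
--nproc-per-node should equal the number of GPUs visible on each node.
Write checkpoints and TensorBoard data to shared storage.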

`CUDA_DEVICE_MAX_CONNECTIONS`#

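Do not set it blindly; the right value depends on hardware and parallelism mode:

Pre-Blackwell (Hopper, Ampere) with TP>1 or CP>1, non-FSDP: set it to 1. The code asserts on this value, so anything other than 1 triggers an assertion failure right away rather than a silent deadlock.
Blackwell: no setting needed; the variable has no effect.
Torch-FSDP2 or Megatron-FSDP: never set it to 1. Leave the variable unset, or set it to a value greater than 1.
With overlap_moe_expert_parallel_comm enabled: set it to 32.

When your configuration calls for a specific value, set it explicitly in the sbatch script.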
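
The four rules can be folded into one helper, shown here as a sketch. The function and parameter names are hypothetical, and the order in which the rules are checked is an assumption: the source lists the cases but does not say which one takes precedence when several apply. A return value of None means "leave the variable unset".

```python
from typing import Optional


def recommend_max_connections(
    is_blackwell: bool,
    tp: int,
    cp: int,
    uses_fsdp: bool,
    overlap_moe_ep_comm: bool = False,
) -> Optional[int]:
    """Suggest a CUDA_DEVICE_MAX_CONNECTIONS value, or None to leave it unset."""
    # Assumed precedence: the MoE overlap rule is checked first.
    if overlap_moe_ep_comm:
        return 32
    # Torch-FSDP2 / Megatron-FSDP: never 1, so leave it unset.
    if uses_fsdp:
        return None
    # Blackwell: the variable has no effect.
    if is_blackwell:
        return None
    # Pre-Blackwell with tensor or context parallelism, non-FSDP.
    if tp > 1 or cp > 1:
        return 1
    return None


value = recommend_max_connections(is_blackwell=False, tp=2, cp=1, uses_fsdp=False)
print("unset" if value is None else f"export CUDA_DEVICE_MAX_CONNECTIONS={value}")
```

For a Hopper job with TP=2 and no FSDP this prints the export line with the value 1, matching the commented example in the sbatch template.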

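Using containers#

Many platforms run Megatron-LM inside a container (enroot/pyxis, or singularity). In that case the uv-managed .venv must live at a path that is visible inside the container, and the image must provide the CUDA, NCCL, and PyTorch versions the repository requires (see docker/.ngc_version.dev and .ngc_version.lts). The script skeleton above stays unchanged; just add the container arguments (--container-image=…, --container-mounts=…, and so on) to the srun invocation.

Monitoring and collecting results#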

squeue -j "$JOB_ID" -o "%.10i %.8T %.10M %.6D %R"
sacct -j "$JOB_ID" --format=JobID,State,ExitCode,Elapsed
scancel "$JOB_ID"

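If your training script emits a result artifact (for example a JSON metrics file written by rank 0, or a final checkpoint), poll for that artifact instead of relying only on squeue job state. Useful output is usually produced before SLURM marks the job as finished, so by polling you can cancel the job as soon as the artifact appears rather than holding cluster resources until the time limit.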
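
A polling loop of this kind might look like the sketch below. The function name, the default interval, and the "non-empty file" readiness check are assumptions. A file can be non-empty while still being written, so a real script may prefer to wait for a marker the training job writes last.

```python
import time
from pathlib import Path


def wait_for_artifact(path: str, timeout_s: float, poll_s: float = 30.0) -> bool:
    """Poll until path exists and is non-empty; False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        candidate = Path(path)
        if candidate.is_file() and candidate.stat().st_size > 0:
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_s, remaining))
```

Once the function returns True, the job can be cancelled with scancel "$JOB_ID" and the artifact collected from shared storage.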

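Failure diagnosis#

Scan stderr from every rank, not just rank 0. The earliest Python traceback that is not NCCL-related is usually the root cause; NCCL timeouts that show up later on other ranks are knock-on effects of that first crash.

Quick triage:

Out of memory (OOM): record the rank, the phase (forward/backward/optimizer), batch size, sequence length, parallel configuration (TP/DP/CP/PP), and peak memory.
Shape or divisibility errors: check that WORLD_SIZE = TP × DP × CP × PP, and check the attention-head divisibility rule (num_attention_heads % TP == 0).
Import errors: check the worktree path, whether uv sync was run, and whether PYTHONPATH is stale or wrong. Always confirm that you have changed into <MEGATRON_WORKTREE> before launching.
NCCL failure with no Python traceback: check the resource allocation, port reachability, whether MASTER_ADDR resolves, and whether every rank is running the same command.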
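
The two arithmetic checks in the triage list are cheap to run before submitting a job. The sketch below encodes exactly those two rules. The function name and message wording are hypothetical, and it does not cover any other constraints Megatron-LM itself enforces.

```python
def check_parallel_config(
    world_size: int, tp: int, dp: int, cp: int, pp: int, num_attention_heads: int
) -> list:
    """Return human-readable problems; an empty list means both checks pass."""
    problems = []
    expected = tp * dp * cp * pp
    if expected != world_size:
        problems.append(
            f"WORLD_SIZE={world_size} but TP*DP*CP*PP={expected}"
        )
    if num_attention_heads % tp != 0:
        problems.append(
            f"num_attention_heads={num_attention_heads} is not divisible by TP={tp}"
        )
    return problems


# Example: 2 nodes x 8 GPUs with TP=2, DP=2, CP=2, PP=2 and 32 heads.
print(check_parallel_config(16, tp=2, dp=2, cp=2, pp=2, num_attention_heads=32))
```

The example prints an empty list, since 2 × 2 × 2 × 2 = 16 and 32 is divisible by 2.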

Common errors#

Forgetting uv sync before the first submission. If the venv is missing, every job rebuilds it from inside srun, costing minutes per job.
Writing logs to a node-local path that disappears at job exit. Always write to the shared filesystem.
Setting CUDA_DEVICE_MAX_CONNECTIONS=1 blindly. The right value depends on hardware and parallelism mode (see the dedicated section above). Setting it to 1 with FSDP causes a different problem; on Blackwell it has no effect; on pre-Blackwell with TP>1 or CP>1 (non-FSDP) the code asserts, it does not deadlock.
Running bare torchrun instead of uv run python -m torch.distributed.run. Bare torchrun may dispatch through a python interpreter that does not see venv packages, depending on how the venv is set up.
