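Reading the SKILL.md Files in Megatron-LM
Agent Skills are reusable modules of specialized expertise. Each Skill packages the knowledge and operating procedures of a specific domain so that an AI agent can handle certain kinds of tasks more efficiently. For more background on Agent Skills, see the article "What Are Agent Skills, the Latest Craze in AI?"
Below we walk through several of the SKILL.md files in the Megatron code repository.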
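build-and-dependency (build and dependency management)
This SKILL.md documents how to set up the Megatron-LM development environment.
Core principle
Build and develop inside a container: the CI container ships with the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions, which are hard to reproduce on bare metal.
Main contents
Why use a container
- Megatron-LM depends on CUDA, NCCL, a GPU build of PyTorch, TransformerEngine, and other components
- A container keeps every developer and the CI environment identical (same CUDA/NCCL/cuDNN versions)
- GPU-related operations work out of the box
Getting an image (two options)
Option A (NVIDIA internal users): pull a pre-built image from the internal GitLab CI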
⚠️ Requires access to the internal GitLab instance. See tools/trigger_internal_ci.md for setup (adding the git remote, obtaining a token).
The internal GitLab CI publishes images to its container registry.
Derive the registry host from your configured gitlab remote — the same
host you use for trigger_internal_ci.py:
    # Derive host from your 'gitlab' remote:
    GITLAB_HOST=$(git remote get-url gitlab | sed 's/.*@\(.*\):.*/\1/')
    docker pull ${GITLAB_HOST}/adlr/megatron-lm/mcore_ci_dev:main
Option B (external users): build from the Dockerfile. You must pass --target main to stop at the public stage, because the jet stage needs an internal secret.
⚠️ Dockerfile.ci.dev has two stages: main and jet. The jet stage requires an internal build secret and will fail without it. Always pass --target main to stop at the public stage.
    # dev image (default)
    docker build \
      --target main \
      --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
      --build-arg IMAGE_TYPE=dev \
      -f docker/Dockerfile.ci.dev \
      -t megatron-lm:local .

    # lts image
    docker build \
      --target main \
      --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
      --build-arg IMAGE_TYPE=lts \
      -f docker/Dockerfile.ci.dev \
      -t megatron-lm:local-lts .
Which image variant is used is controlled by the PR label container::lts; absent that label, dev is used.
Launching the container (two options)
Option A (local Docker)
    docker run --rm --gpus all \
      -v $(pwd):/workspace \
      -w /workspace \
      megatron-lm:local \
      bash -c "<your command>"
Option B (Slurm cluster)
NVIDIA clusters typically use Pyxis + enroot. Request an interactive session:
    srun \
      --nodes=1 --gpus-per-node=8 \
      --container-image megatron-lm:local \
      --container-mounts $(pwd):/workspace \
      --container-workdir /workspace \
      --pty bash
For clusters that require a .sqsh archive first:
    enroot import -o megatron-lm.sqsh dockerd://megatron-lm:local
    srun \
      --nodes=1 --gpus-per-node=8 \
      --container-image $(pwd)/megatron-lm.sqsh \
      --container-mounts $(pwd):/workspace \
      --container-workdir /workspace \
      --pty bash
Dependency management (with uv)
Dependencies are declared in pyproject.toml, with uv.lock as the lock file. The venv lives at /opt/venv inside the container (already on PATH).
All uv operations must be run inside the container. Never run uv sync / uv pip install on the host.
| Group | Purpose |
|---|---|
| training | Runtime training extras |
| dev | Full dev environment (TransformerEngine, ModelOpt, …) |
| lts | LTS-safe subset (no ModelOpt) |
| test | pytest, coverage, nemo-run |
| linting | ruff, black, isort, pylint |
| build | Cython, pybind11, nvidia-mathdx |
Install commands (inside the container):
    # Full dev + test environment
    uv sync --locked --group dev --group test

    # Linting only
    uv sync --locked --only-group linting

    # LTS environment
    uv sync --locked --group lts --group test
Several dependencies are sourced directly from git (TransformerEngine, nemo-run, FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). The locked uv.lock file pins exact revisions; update it with uv lock when changing pyproject.toml.
Adding a dependency: inside the container, run uv add <package>, then uv lock, then commit the changed files.
Follow this three-step workflow:
- Acquire a container image (see "Getting an image" above).
- Launch the container interactively (see "Launching the container" above).
- Update the lock file inside the container, then commit it:
    # Inside the container:
    uv add <package>   # adds to pyproject.toml and resolves
    uv lock            # regenerates uv.lock
    # Exit the container, then on the host:
    git add pyproject.toml uv.lock
    git commit -S -s -m "build: add <package> dependency"
Resolving uv.lock merge conflicts: take main's version as the base, then re-run uv lock.
uv.lock is machine-generated; never resolve conflicts manually. Instead:
    git checkout origin/main -- uv.lock   # take main's version as the base
    # then inside the container:
    uv lock                               # re-resolve on top of your pyproject.toml changes
Linting and formatting
Run tools/autoformat.sh for linting, and run uv run isort whenever you change imports.
Run before opening a PR:
    # Check mode (no changes applied)
    BASE_REF=main CHECK_ONLY=true SKIP_DOCS=false bash tools/autoformat.sh

    # Fix mode
    BASE_REF=main CHECK_ONLY=false bash tools/autoformat.sh
Tools invoked: black, isort, pylint, ruff, mypy.
After editing imports in any Python files, always run uv run isort on those
files before committing (repo CLAUDE.md requirement).
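Common pitfalls
The SKILL.md closes with a table of common problems and their fixes, covering dependency conflicts, running out of disk space, permission errors, and so on.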
| Problem | Cause | Fix |
|---|---|---|
| uv sync --locked fails | Dependency conflict or stale uv.lock | Re-run uv lock inside the container and commit updated lock |
| ModuleNotFoundError after pip install | pip installed outside the uv-managed venv | Use uv add and uv sync, never bare pip install |
| uv: command not found inside container | Wrong container image | Use the megatron-lm image built from Dockerfile.ci.dev |
| No space left on device during uv ops | Cache fills container's /root/.cache/ | Mount a host cache dir via -v $HOME/.cache/uv:/root/.cache/uv |
| Pre-commit fails with linting errors | Code style violations | Run BASE_REF=main CHECK_ONLY=false bash tools/autoformat.sh |
| docker build fails with secret-related error | Dockerfile.ci.dev has a jet stage that requires an internal secret | Add --target main to stop before the jet stage |
| access forbidden when pulling | Registry URL includes an explicit port (e.g. :5005) | Use ${GITLAB_HOST}/adlr/... with no port; the sed extracts the hostname only |
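The last row depends on what the sed expression from Option A actually captures. As a rough illustration, the Python sketch below mimics that expression with an equivalent greedy regex. The function name and the gitlab.example.com URLs are hypothetical and used only for illustration.

```python
import re


def registry_host_from_remote(remote_url: str) -> str:
    # Mimics: sed 's/.*@\(.*\):.*/\1/'
    # Both groups are greedy, so the capture is the text between the
    # last '@' and the last ':' of the URL.
    match = re.fullmatch(r".*@(.*):.*", remote_url)
    # sed prints the input unchanged when the pattern does not match.
    return match.group(1) if match else remote_url


# Hypothetical remotes, for illustration only.
print(registry_host_from_remote("git@gitlab.example.com:adlr/megatron-lm.git"))
# gitlab.example.com
print(registry_host_from_remote("ssh://git@gitlab.example.com:5005/adlr/megatron-lm.git"))
# gitlab.example.com
```

Because the capture stops at the last colon, a port in an ssh:// style remote is dropped, which matches the "no port" advice in the table. A remote without an "@" (for example a plain HTTPS URL) passes through unchanged and would need separate handling.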
ci-test-system (test system and CI guide)
This SKILL.md documents Megatron-LM's test system and CI workflow.
Test directory layout
    tests/
    ├── unit_tests/            # pytest, 1 node × 8 GPUs, torch.distributed runner
    ├── functional_tests/      # end-to-end shell + training scripts
    │   └── test_cases/
    │       └── {model}/{test_case}/
    │           ├── model_config.yaml       # training args
    │           └── golden_values_{env}_{platform}.json
    └── test_utils/
        ├── recipes/
        │   ├── h100/          # YAML recipes for H100 jobs
        │   └── gb200/         # YAML recipes for GB200 jobs
        └── python_scripts/    # helpers (recipe_parser, golden-value download, …)
How tests are executed
The GitHub Actions runner invokes launch_nemo_run_workload.py, which uses nemo-run to launch a DockerExecutor container. The repo is bind-mounted at /opt/megatron-lm; training data is mounted at /mnt/artifacts.
Unit tests are dispatched through torch.distributed.run:
- Ranks 0 and 3 are tee-d to stdout; all other ranks write only to log files.
- Per-rank log files land at {assets_dir}/logs/1/ and are uploaded as a GitHub artifact after the run.
Functional tests are driven by tests/functional_tests/shell_test_utils/run_ci_test.sh. Only rank 0 runs the pytest validation step; training output from all ranks is uploaded as an artifact.
Flaky-failure auto-retry: launch_nemo_run_workload.py retries up to 3 times for known transient patterns (NCCL timeout, ECC error, segfault, HuggingFace connectivity, …) before declaring a genuine failure.
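The retry policy can be pictured roughly as follows. This is a minimal sketch and not the code in launch_nemo_run_workload.py: the pattern list, the function names, and the reading of "up to 3 times" as three attempts in total are all assumptions made for illustration.

```python
import re
from typing import Callable, Tuple

# Hypothetical patterns; the real list lives in launch_nemo_run_workload.py.
TRANSIENT_PATTERNS = [r"NCCL.*timeout", r"ECC error", r"Segmentation fault"]
MAX_ATTEMPTS = 3  # assumption: three attempts in total


def is_transient(log_text: str) -> bool:
    """True if the log matches any known transient failure pattern."""
    return any(re.search(p, log_text, re.IGNORECASE) for p in TRANSIENT_PATTERNS)


def run_with_retry(
    run_once: Callable[[], Tuple[int, str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[int, int]:
    """Run a workload, retrying only transient failures.

    run_once returns (exit_code, log_text).
    Returns (final_exit_code, attempts_used).
    """
    exit_code = 1
    for attempt in range(1, max_attempts + 1):
        exit_code, log_text = run_once()
        if exit_code == 0:
            return 0, attempt
        if not is_transient(log_text):
            # Genuine failure: report it immediately instead of retrying.
            return exit_code, attempt
    return exit_code, max_attempts
```

The point of matching on log patterns is that a real regression fails on the first attempt and is reported at once, while infrastructure noise gets another chance.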
Recipe YAML structure
Recipe files define the test specification; the products block expands it into multiple workloads:
    type: basic
    format_version: 1
    maintainers: [mcore]
    loggers: [stdout]
    spec:
      name: "{test_case}_{environment}_{platforms}"
      model: gpt            # maps to tests/functional_tests/test_cases/{model}/
      build: mcore-pyt-{environment}
      nodes: 1
      gpus: 8
      n_repeat: 5
      platforms: dgx_h100
      time_limit: 1800
      script_setup: |
        ...
      script: |-
        bash tests/functional_tests/shell_test_utils/run_ci_test.sh ...
    products:
      - test_case: [my_test]
        products:
          - environment: [dev, lts]
            scope: [mr-github]
            platforms: [dgx_h100]
CI test scope labels
PR labels determine the test scope, the number of repeats, and the container image:
Decision tree (first match wins):
| Condition | scope | n_repeat | lightweight | Notes |
|---|---|---|---|---|
| Merge group | mr-github | 1 | false | Automatic, no label needed |
| Label: Run tests | mr-github | 1 | true | Trains 4 steps, no golden-value compare |
| Label: Run functional tests | mr-github | 5 | false | Trains 100 steps, golden-value compare |
| (no label) | mr-github-slim | 5 | false | Slim subset only |
Orthogonal image label:
| Label | Effect |
|---|---|
| container::lts | Use the LTS base image instead of dev (combinable with any scope label) |
| Run MBridge tests | Also triggers the MBridge L1 test suite |
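Read together, the two tables amount to a small first-match-wins function plus two independent flags. The sketch below restates them in Python; the function name and the returned dictionary keys are hypothetical, and the real CI workflow may compute these values differently.

```python
from typing import Dict, Iterable


def resolve_ci_config(labels: Iterable[str], is_merge_group: bool = False) -> Dict[str, object]:
    """Restate the label tables above as code (illustrative only)."""
    label_set = set(labels)

    # Scope decision tree: the first matching condition wins.
    if is_merge_group:
        config = {"scope": "mr-github", "n_repeat": 1, "lightweight": False}
    elif "Run tests" in label_set:
        config = {"scope": "mr-github", "n_repeat": 1, "lightweight": True}
    elif "Run functional tests" in label_set:
        config = {"scope": "mr-github", "n_repeat": 5, "lightweight": False}
    else:
        config = {"scope": "mr-github-slim", "n_repeat": 5, "lightweight": False}

    # Orthogonal labels: independent of the scope decision.
    config["environment"] = "lts" if "container::lts" in label_set else "dev"
    config["run_mbridge"] = "Run MBridge tests" in label_set
    return config


print(resolve_ci_config(["Run functional tests", "container::lts"]))
```

Note that with both Run tests and Run functional tests present, the earlier row wins, so the run is the lightweight one.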
Adding tests
Unit tests:
- Create tests/unit_tests/<category>/test_<name>.py
- Use the fixtures from conftest.py
- Add markers (@pytest.mark.internal / flaky / experimental)
- Verify inside the container
Functional tests:
- Create tests/functional_tests/test_cases/<model>/<test_name>/
- Write model_config.yaml
- Add a YAML recipe
- Push the PR and add the Run functional tests label
- After a successful run, download the golden values and commit them
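When writing the recipe for a new functional test, it helps to estimate how many workloads its products block will generate. The sketch below assumes the expansion is a Cartesian product over the list-valued keys, with nested products entries multiplied into their parent. That semantics is an assumption for illustration, not a description of the repository's recipe_parser, and expand_products is a hypothetical helper.

```python
from itertools import product
from typing import Dict, List


def expand_products(entries: List[dict]) -> List[Dict[str, str]]:
    """Expand nested products entries into flat workload dicts (assumed semantics)."""
    workloads: List[Dict[str, str]] = []
    for entry in entries:
        # Every key except the nested 'products' is treated as an axis of values.
        axes = {key: values for key, values in entry.items() if key != "products"}
        children = expand_products(entry["products"]) if "products" in entry else [{}]
        for combo in product(*axes.values()):
            base = dict(zip(axes.keys(), combo))
            workloads.extend({**base, **child} for child in children)
    return workloads


# The products block from the recipe shown in the Recipe YAML structure section.
recipe_products = [
    {
        "test_case": ["my_test"],
        "products": [
            {"environment": ["dev", "lts"], "scope": ["mr-github"], "platforms": ["dgx_h100"]}
        ],
    }
]

for workload in expand_products(recipe_products):
    # Fill the spec.name template from the recipe.
    print("{test_case}_{environment}_{platforms}".format(**workload))
```

Under this assumption the example recipe yields two workloads, one per environment.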
Investigating CI failures
    # Extract PR number from the current branch
    PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')

    # Fetch the PR metadata (title, labels, author, base branch)
    gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM

    # Show the changeset for that PR
    gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM

    # List the files changed in the PR
    gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM --json files --jq '.files[].path'
Viewing logs:
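- Runner stdout shows only ranks 0 and 3
- Full logs are uploaded as an artifact (logs-<test_case>-<run_id>-<uuid>)
- After downloading the artifact, locate the failing files with grep -rlE "ERROR|Traceback" (the -E flag is needed for the | alternation to work)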
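If grep is not available where the artifact was downloaded, the same search is easy to reproduce in Python. The sketch below is a rough equivalent of a recursive, list-files-only grep for ERROR or Traceback; files_with_errors is a hypothetical helper name and the directory layout of the artifact is not assumed.

```python
import re
from pathlib import Path
from typing import List

ERROR_PATTERN = re.compile(r"ERROR|Traceback")


def files_with_errors(log_dir: str) -> List[Path]:
    """Return every file under log_dir whose content matches ERROR_PATTERN."""
    hits: List[Path] = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going on logs with stray binary bytes.
        text = path.read_text(errors="replace")
        if ERROR_PATTERN.search(text):
            hits.append(path)
    return hits
```

Calling files_with_errors on the unpacked artifact directory returns the per-rank log files worth opening first.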
Common errors
| Problem | Cause | Fix |
|---|---|---|
| Port collision on multi-GPU runs | torchrun binding conflicts | Use torch.distributed.run via the container entry point |
| Test passes locally but fails in CI | Different environment or data path | Check DATA_PATH, DATA_CACHE_PATH, and the environment tag (dev vs lts) |
| Golden value mismatch after a code change | Numerical regression | Download new golden values via download_golden_values.py after a clean run |
| cicd-integration-tests-gb200 not triggered | GB200 jobs require maintainer status | Ask a maintainer to trigger, or add the Run functional tests label |
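run-on-slurm (guide to running Megatron-LM on SLURM clusters)
This SKILL.md covers launching distributed Megatron-LM training jobs on a SLURM cluster.
Prerequisites
- Login access to a SLURM cluster, with permission to submit jobs to a GPU compute partition.
- A checkout of the Megatron-LM repository on a file system (NFS, Lustre, or similar) that every allocated node can reach. All nodes must see the code, data, checkpoints, and outputs at the same paths.
- uv installed. Before submitting a job, run uv sync --extra training --extra dev (or --extra lts) once in the working directory to materialize the .venv, so that every node can find and use it.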
Minimal sbatch script template
Save the following as run_megatron.slurm in the working directory:
    #!/bin/bash
    #SBATCH --job-name=megatron
    #SBATCH --account=<SLURM_ACCOUNT>
    #SBATCH --partition=<SLURM_PARTITION>
    #SBATCH --nodes=<NODES>
    #SBATCH --ntasks-per-node=1
    #SBATCH --gpus-per-node=<GPUS_PER_NODE>
    #SBATCH --time=<HH:MM:SS>
    #SBATCH --output=logs/%x-%j.out
    #SBATCH --error=logs/%x-%j.err

    set -euo pipefail
    cd <MEGATRON_WORKTREE>

    export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
    export MASTER_PORT=${MASTER_PORT:-29500}
    export NNODES=${SLURM_NNODES}
    export GPUS_PER_NODE=<GPUS_PER_NODE>
    export WORLD_SIZE=$((NNODES * GPUS_PER_NODE))

    # Set CUDA_DEVICE_MAX_CONNECTIONS only when your configuration requires it
    # (see the section below). Example for pre-Blackwell with TP>1 or CP>1
    # (non-FSDP):
    # export CUDA_DEVICE_MAX_CONNECTIONS=1

    srun --ntasks=${NNODES} --ntasks-per-node=1 bash -c '
      # NODE_RANK comes from SLURM_NODEID with one task per node.
      NODE_RANK=${SLURM_NODEID}
      uv run python -m torch.distributed.run \
        --nnodes='"${NNODES}"' \
        --nproc-per-node='"${GPUS_PER_NODE}"' \
        --node-rank=${NODE_RANK} \
        --master-addr='"${MASTER_ADDR}"' \
        --master-port='"${MASTER_PORT}"' \
        pretrain_gpt.py \
          <MEGATRON_ARGS>
    '
Submit the job:
    mkdir -p logs && JOB_ID=$(sbatch --parsable run_megatron.slurm)
    echo "Submitted ${JOB_ID}"
Multi-node rules
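- All nodes must see the shared file system at the same paths
- Use a single torchrun worker group spanning all nodes; do not launch independent single-node jobs
- --nproc-per-node should equal the number of GPUs visible on each node
- Write checkpoints and TensorBoard data to shared storage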
CUDA_DEVICE_MAX_CONNECTIONS
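Do not set it blindly; the right value depends on the hardware and the parallelism mode:
- Pre-Blackwell architectures (Hopper, Ampere) with TP>1 or CP>1, non-FSDP: set it to 1. The code asserts on this value, so anything other than 1 fails immediately with an assertion error rather than deadlocking silently.
- Blackwell: no configuration needed; setting the variable has no effect.
- Torch-FSDP2 or Megatron-FSDP: never set it to 1. Leave the variable unset, or set it to a value greater than 1.
- With overlap_moe_expert_parallel_comm enabled: set it to 32.
When your configuration calls for a specific value, set it explicitly in the sbatch script.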
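The rules above can be condensed into a small helper. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and the order in which the rules are checked when several apply at once (for example FSDP together with overlap_moe_expert_parallel_comm) is a guess, since the text lists the cases separately.

```python
from typing import Optional


def recommended_max_connections(
    is_blackwell: bool,
    tp: int = 1,
    cp: int = 1,
    uses_fsdp: bool = False,
    overlap_moe_expert_parallel_comm: bool = False,
) -> Optional[int]:
    """Return the value to export, or None to leave the variable unset."""
    if overlap_moe_expert_parallel_comm:
        return 32  # assumption: this rule takes priority over the others
    if uses_fsdp:
        return None  # must not be 1; unset is the simplest safe choice
    if is_blackwell:
        return None  # setting the variable has no effect
    if tp > 1 or cp > 1:
        return 1  # pre-Blackwell, non-FSDP: the code asserts on this value
    return None


value = recommended_max_connections(is_blackwell=False, tp=4)
if value is not None:
    print(f"export CUDA_DEVICE_MAX_CONNECTIONS={value}")
```

Returning None rather than a default number keeps "leave it unset" distinct from "set it to something", which is the distinction the rules care about.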
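Using containers
Many platforms run Megatron-LM inside containers (enroot/pyxis, or singularity). In that case the uv-managed .venv must live at a path reachable from inside the container, and the image must provide the CUDA, NCCL, and PyTorch versions the repository expects (see docker/.ngc_version.dev and .ngc_version.lts). The script skeleton above needs no changes; just add the container flags (--container-image=…, --container-mounts=…, and so on) to the srun invocation.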
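Monitoring and collection
    squeue -j "$JOB_ID" -o "%.10i %.8T %.10M %.6D %R"
    sacct -j "$JOB_ID" --format=JobID,State,ExitCode,Elapsed
    scancel "$JOB_ID"
If your training script emits a result artifact (for example a JSON metrics file written by rank 0, or a final checkpoint), poll for that artifact instead of only waiting on squeue status. Useful output often appears before SLURM marks the job complete, so polling lets you cancel the job as soon as the artifact exists rather than holding cluster resources until the time limit.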
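A polling loop for this can be very small. The sketch below waits for a file to appear on the shared file system and gives up after a timeout; wait_for_artifact, its parameters, and the default interval are hypothetical choices, not part of the skill document.

```python
import time
from pathlib import Path


def wait_for_artifact(path: str, timeout_s: float, poll_interval_s: float = 30.0) -> bool:
    """Poll until path exists. Returns True if found, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if Path(path).exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

Once it returns True, the job can be cancelled with scancel "$JOB_ID" as in the commands above. A file that exists may still be mid-write, so in practice it is safer to poll for a marker the training script creates only after it has finished writing.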
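Failure diagnosis
Scan stderr from every rank, not just rank 0. The earliest non-NCCL Python traceback is usually the root cause; NCCL timeouts that show up later on other ranks are knock-on effects of that first crash.
Quick triage:
- Out of memory (OOM): record the rank, the phase (forward/backward/optimizer), batch size, sequence length, parallel configuration (TP/DP/CP/PP), and peak memory.
- Shape or divisibility errors: verify WORLD_SIZE = TP × DP × CP × PP and the attention-head divisibility rule (num_attention_heads % TP == 0).
- Import errors: check the worktree path, whether uv sync was run, and whether PYTHONPATH is stale. Always confirm you have changed into <MEGATRON_WORKTREE> before launching.
- NCCL failure with no Python traceback: check the resource allocation, port reachability, whether MASTER_ADDR resolves, and whether every rank runs the same command.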
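The two arithmetic checks in the shape and divisibility item are cheap to run before submitting a job. A minimal sketch follows; check_parallel_config is a hypothetical helper, and it covers only the two rules named above, not every constraint Megatron-LM enforces.

```python
from typing import List


def check_parallel_config(
    world_size: int, tp: int, dp: int, cp: int, pp: int, num_attention_heads: int
) -> List[str]:
    """Return a list of human-readable problems; an empty list means both checks pass."""
    problems: List[str] = []
    expected = tp * dp * cp * pp
    if expected != world_size:
        problems.append(
            f"WORLD_SIZE={world_size} but TP*DP*CP*PP={tp}*{dp}*{cp}*{pp}={expected}"
        )
    if num_attention_heads % tp != 0:
        problems.append(
            f"num_attention_heads={num_attention_heads} is not divisible by TP={tp}"
        )
    return problems


# Example: 2 nodes x 8 GPUs, TP=4, DP=2, CP=1, PP=2, 32 heads.
print(check_parallel_config(16, tp=4, dp=2, cp=1, pp=2, num_attention_heads=32))
# []
```

Running this on the login node catches a mismatched configuration before the job spends time in the queue.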
Common mistakes
- Forgetting uv sync before the first submission. If the venv is missing, every job rebuilds it from inside srun, costing minutes per job.
- Writing logs to a node-local path that disappears at job exit. Always write to the shared filesystem.
- Setting CUDA_DEVICE_MAX_CONNECTIONS=1 blindly. The right value depends on hardware and parallelism mode (see the dedicated section above). Setting it to 1 with FSDP causes a different problem; on Blackwell it has no effect; on pre-Blackwell with TP>1 or CP>1 (non-FSDP) the code asserts, it does not deadlock.
- Running bare torchrun instead of uv run python -m torch.distributed.run. Bare torchrun may dispatch through a python interpreter that does not see venv packages, depending on how the venv is set up.