Troubleshooting multi-node GPU serving on KubeRay#
This guide helps you diagnose and resolve common issues that come up when deploying multi-node GPU workloads with KubeRay, particularly when serving large language models (LLMs) with vLLM.
Debugging strategy#
When you run into multi-node GPU serving issues, use this systematic approach to isolate the problem:
- Test on different platforms: compare behavior across environments:
  - Single node without KubeRay
  - Standalone vLLM server on KubeRay
  - Ray Serve LLM deployment on KubeRay
- Change the hardware configuration: test different GPU types, for example A100 versus H100, to identify hardware-specific issues
- Use a minimal reproducer: create simplified test cases that isolate a specific component, such as NCCL or model loading (see the sketch after this list)
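As an example of a minimal reproducer, the sketch below loads a small model with vLLM directly on a single node and generates one completion; if this fails outside KubeRay, the problem isn't Kubernetes-specific. The facebook/opt-125m model and the parallelism setting are placeholder choices for illustration.

# Single-node vLLM smoke test: isolates engine and model-loading issues.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",  # Placeholder: substitute your target model.
    tensor_parallel_size=1,     # Raise this to exercise NCCL across local GPUs.
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)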
Common issues and solutions#
1. Head pod scheduled onto a GPU node#
Symptoms:
- ray status reports duplicate GPU resources, for example, the cluster has only 16 GPUs but reports 24
- Model serving hangs when using pipeline parallelism (PP > 1)
- Resource allocation conflicts
Root cause: the Ray head pod is scheduled onto a GPU worker node by mistake, which leads to resource counting problems.
Solution: configure the head pod to use zero GPUs in your RayCluster spec:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: my-cluster
spec:
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
      num-gpus: "0"  # Ensure head pod doesn't claim GPU resources.
    # ... other head group configuration
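After applying this change, you can confirm from a pod in the cluster that the GPU total Ray reports matches the hardware. The sketch below uses ray.cluster_resources(); the expected count of 16 is an example value to adjust for your cluster.

import ray

# Connect to the running cluster from inside a cluster pod.
ray.init(address="auto")

expected_gpus = 16  # Example value: set this to the number of physical GPUs.
reported_gpus = ray.cluster_resources().get("GPU", 0)
print(f"Reported GPUs: {reported_gpus}, expected: {expected_gpus}")
if reported_gpus != expected_gpus:
    print("GPU count mismatch: check whether the head pod is claiming GPUs.")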
2. AWS OFI plugin version issue (H100-specific)#
Symptoms:
- NCCL initialization fails on H100 instances
- The same configuration works on A100 but fails on H100
- Topology file format errors
Root cause: an outdated aws-ofi-plugin in the container image causes NCCL topology detection to fail on H100 instances.
Solution:
- Update to a newer container image that includes an updated aws-ofi-plugin
- Use the NCCL debug script below to verify that NCCL works as expected (a version-check sketch also follows this list)
- Consider hardware-specific configuration adjustments
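To see which NCCL build PyTorch is using and to surface topology or transport errors during initialization, you can enable NCCL debug logging before starting the workload. The sketch below uses torch.cuda.nccl.version() and the standard NCCL_DEBUG variables; it doesn't inspect the aws-ofi-plugin itself.

import os
import torch

# Report the NCCL version bundled with this PyTorch build.
print("PyTorch:", torch.__version__)
print("NCCL version:", torch.cuda.nccl.version())

# Ask NCCL to log topology detection and transport selection at startup.
# Set these before any NCCL process group is initialized.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"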
Further troubleshooting#
If you still run into problems after following this guide:
- Gather diagnostics: run the NCCL debug script below and save its output
- Check compatibility: verify that your Ray, vLLM, PyTorch, and CUDA versions are compatible with each other (see the sketch after this list)
- Review logs: check the Ray cluster logs and worker pod logs for additional error details
- Validate hardware: test with a different GPU type if possible
- Community support: share your findings with the Ray and vLLM communities for further help
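As a starting point for the compatibility check, the sketch below prints the Ray, vLLM, PyTorch, and CUDA versions in one place so you can compare them against each project's release notes.

import ray
import torch
import vllm

# Collect version information for a compatibility report.
print("Ray:    ", ray.__version__)
print("vLLM:   ", vllm.__version__)
print("PyTorch:", torch.__version__)
print("CUDA (PyTorch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))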
Additional resources#
NCCL debug script#
Use this diagnostic script to identify NCCL-related problems in a multi-node GPU setup:
#!/usr/bin/env python3
"""
NCCL Diagnostic Script for Multi-Node GPU Serving

This script helps identify NCCL configuration issues that can cause
multi-node GPU serving failures. Run this script on each node to verify
NCCL functionality before deploying distributed workloads.

Usage: python3 multi-node-nccl-check.py
"""
import os
import sys
import socket
import torch
from datetime import datetime


def log(msg):
    """Log messages with timestamp for better debugging."""
    timestamp = datetime.now().strftime("%H:%M:%S")
    print(f"[{timestamp}] {msg}", flush=True)


def print_environment_info():
    """Print relevant environment information for debugging."""
    log("=== Environment Information ===")
    log(f"Hostname: {socket.gethostname()}")
    log(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', 'not set')}")

    # Print all NCCL-related environment variables.
    nccl_vars = [var for var in os.environ.keys() if var.startswith('NCCL_')]
    if nccl_vars:
        log("NCCL Environment Variables:")
        for var in sorted(nccl_vars):
            log(f" {var}: {os.environ[var]}")
    else:
        log("No NCCL environment variables set")


def check_cuda_availability():
    """Verify CUDA is available and functional."""
    log("\n=== CUDA Availability Check ===")
    if not torch.cuda.is_available():
        log("ERROR: CUDA not available")
        return False

    device_count = torch.cuda.device_count()
    log(f"CUDA device count: {device_count}")
    log(f"PyTorch version: {torch.__version__}")

    # Check NCCL availability in PyTorch.
    try:
        import torch.distributed as dist
        if hasattr(torch.distributed, 'nccl'):
            log(f"PyTorch NCCL available: {torch.distributed.is_nccl_available()}")
    except Exception as e:
        log(f"Error checking NCCL availability: {e}")

    return True


def test_individual_gpus():
    """Test that each GPU is working individually."""
    log("\n=== Individual GPU Tests ===")
    for gpu_id in range(torch.cuda.device_count()):
        log(f"\n--- Testing GPU {gpu_id} ---")
        try:
            torch.cuda.set_device(gpu_id)
            device = torch.cuda.current_device()
            log(f"Device {device}: {torch.cuda.get_device_name(device)}")

            # Print device properties.
            props = torch.cuda.get_device_properties(device)
            log(f" Compute capability: {props.major}.{props.minor}")
            log(f" Total memory: {props.total_memory / 1024**3:.2f} GB")

            # Test basic CUDA operations.
            log(" Testing basic CUDA operations...")
            tensor = torch.ones(1000, device=f'cuda:{gpu_id}')
            result = tensor.sum()
            log(f" Basic CUDA test passed: sum = {result.item()}")

            # Test cross-GPU operations if multiple GPUs are available.
            if torch.cuda.device_count() > 1:
                log(" Testing cross-GPU operations...")
                try:
                    other_gpu = (gpu_id + 1) % torch.cuda.device_count()
                    test_tensor = torch.randn(10, 10, device=f'cuda:{gpu_id}')
                    tensor_copy = test_tensor.to(f'cuda:{other_gpu}')
                    log(f" Cross-GPU copy successful: GPU {gpu_id} -> GPU {other_gpu}")
                except Exception as e:
                    log(f" Cross-GPU copy failed: {e}")

            # Test memory allocation.
            log(" Testing large memory allocations...")
            try:
                large_tensor = torch.zeros(1000, 1000, device=f'cuda:{gpu_id}')
                log(" Large memory allocation successful")
                del large_tensor
            except Exception as e:
                log(f" Large memory allocation failed: {e}")

        except Exception as e:
            log(f"ERROR testing GPU {gpu_id}: {e}")
            import traceback
            log(f"Traceback:\n{traceback.format_exc()}")


def test_nccl_initialization():
    """Test NCCL initialization and basic operations."""
    log("\n=== NCCL Initialization Test ===")
    try:
        import torch.distributed as dist

        # Set up single-process NCCL environment.
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '29500'
        os.environ['RANK'] = '0'
        os.environ['WORLD_SIZE'] = '1'

        log("Attempting single-process NCCL initialization...")
        dist.init_process_group(
            backend='nccl',
            rank=0,
            world_size=1
        )
        log("Single-process NCCL initialization successful!")

        # Test basic NCCL operation.
        if torch.cuda.is_available():
            device = torch.cuda.current_device()
            tensor = torch.ones(10, device=device)
            # This is a no-op with world_size=1 but exercises NCCL.
            dist.all_reduce(tensor)
            log("NCCL all_reduce test successful!")

        dist.destroy_process_group()
        log("NCCL cleanup successful!")

    except Exception as e:
        log(f"NCCL initialization failed: {e}")
        import traceback
        log(f"Full traceback:\n{traceback.format_exc()}")


def main():
    """Main diagnostic routine."""
    log("Starting NCCL Diagnostic Script")
    log("=" * 50)

    print_environment_info()

    if not check_cuda_availability():
        sys.exit(1)

    test_individual_gpus()
    test_nccl_initialization()

    log("\n" + "=" * 50)
    log("NCCL diagnostic script completed")
    log("If you encountered errors, check the specific error messages above")
    log("and refer to the troubleshooting guide for solutions.")


if __name__ == "__main__":
    main()