Troubleshooting multi-node GPU serving on KubeRay#

This guide helps you diagnose and resolve common issues when deploying multi-node GPU workloads with KubeRay, particularly when serving large language models (LLMs) with vLLM.

Debugging strategy#

When you encounter multi-node GPU serving issues, use this systematic approach to isolate the problem:

  1. Test on different platforms: Compare behavior across environments:

    • Single node without KubeRay

    • Standalone vLLM server on KubeRay

    • Ray Serve LLM deployment on KubeRay

  2. Change hardware configurations: Test different GPU types, for example A100 versus H100, to identify hardware-specific issues.

  3. Use minimal reproducers: Create simplified test cases that isolate specific components, such as NCCL or model loading (see the sketch after this list).
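
For example, the following minimal model-loading reproducer, given here only as a sketch, starts a vLLM engine directly and so isolates model loading and tensor parallelism from Ray Serve and KubeRay specifics. The model name and parallelism value are placeholders, and argument names can vary across vLLM versions:

# Minimal vLLM smoke test, independent of Ray Serve and KubeRay.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",  # Small placeholder model; swap in the model you actually serve.
    tensor_parallel_size=2,     # Match the TP degree you intend to use in production.
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)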

Common issues and solutions#

1. Head pod scheduled onto a GPU node#

Symptoms

  • ray status shows duplicated GPU resources, for example, 24 GPUs reported when the cluster has only 16

  • Model serving hangs when using pipeline parallelism (PP > 1)

  • Resource allocation conflicts

Root cause: The Ray head pod is incorrectly scheduled onto a GPU worker node, which throws off the resource counts.

Solution: Configure the head pod to use zero GPUs in your RayCluster specification:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: my-cluster
spec:
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
      num-gpus: "0"  # Ensure head pod doesn't claim GPU resources.
    # ... other head group configuration
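
After applying the configuration, you can confirm that the reported GPU count matches your hardware. The following sketch assumes it runs inside the cluster, for example as a Ray job, and EXPECTED_GPUS is a placeholder you should set to your actual worker GPU total:

# Verify that the head pod no longer claims GPU resources.
import ray

ray.init(address="auto")

EXPECTED_GPUS = 16  # Placeholder: total GPUs across the worker nodes in your cluster.
reported = ray.cluster_resources().get("GPU", 0)
print(f"Reported GPUs: {reported}, expected: {EXPECTED_GPUS}")
assert reported == EXPECTED_GPUS, "GPU count mismatch: the head pod may still be claiming GPU resources."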

2. AWS OFI plugin version issues (H100-specific)#

Symptoms

  • NCCL initialization failures on H100 instances

  • Works on A100 but fails on H100 with the same configuration

  • Topology file format errors

Root cause: An outdated aws-ofi-plugin in the container image causes NCCL topology detection to fail on H100 instances.

Solution

  • Update to a newer container image that includes an updated aws-ofi-plugin

  • Use the NCCL debugging script below to verify that NCCL works as expected

  • Consider hardware-specific configuration adjustments, such as enabling verbose NCCL logging (see the sketch below)
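
One such adjustment is enabling verbose NCCL logging so that topology detection failures surface in the worker logs. The sketch below sets two standard NCCL environment variables; in practice you would set them for every worker process, for example through the container spec or a Ray runtime_env, and the chosen subsystems are a starting point rather than a definitive configuration:

# Enable verbose NCCL logging before any distributed initialization.
import os

os.environ["NCCL_DEBUG"] = "INFO"               # Print NCCL initialization and topology details.
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # Focus the logs on initialization and topology detection.

# ... initialize torch.distributed or start vLLM as usual after this point.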

Further troubleshooting#

If you still encounter issues after following this guide:

  1. Gather diagnostics: Run the NCCL debugging script below and save the output.

  2. Check compatibility: Verify that your Ray, vLLM, PyTorch, and CUDA versions are compatible with each other (see the version-check sketch after this list).

  3. Review logs: Check the Ray cluster logs and worker pod logs for more error details.

  4. Validate hardware: Test with different GPU types if possible.

  5. Ask the community: Share your findings with the Ray and vLLM communities for further assistance.
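
For step 2, the following sketch collects the relevant versions in one place so you can compare them against each project's compatibility notes:

# Print the versions of the components involved in multi-node GPU serving.
import torch
print("PyTorch:", torch.__version__)
print("CUDA (PyTorch build):", torch.version.cuda)

import ray
print("Ray:", ray.__version__)

try:
    import vllm
    print("vLLM:", vllm.__version__)
except ImportError:
    print("vLLM: not installed")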

Additional resources#

NCCL debugging script#

Use this diagnostic script to identify NCCL-related issues in multi-node GPU setups:

#!/usr/bin/env python3
"""
NCCL Diagnostic Script for Multi-Node GPU Serving

This script helps identify NCCL configuration issues that can cause
multi-node GPU serving failures. Run this script on each node to verify
NCCL function before deploying distributed workloads.

Usage: python3 multi-node-nccl-check.py
"""
import os
import sys
import socket
import torch
from datetime import datetime

def log(msg):
    """Log messages with timestamp for better debugging."""
    timestamp = datetime.now().strftime("%H:%M:%S")
    print(f"[{timestamp}] {msg}", flush=True)

def print_environment_info():
    """Print relevant environment information for debugging."""
    log("=== Environment Information ===")
    log(f"Hostname: {socket.gethostname()}")
    log(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', 'not set')}")
    
    # Print all NCCL-related environment variables.
    nccl_vars = [var for var in os.environ.keys() if var.startswith('NCCL_')]
    if nccl_vars:
        log("NCCL Environment Variables:")
        for var in sorted(nccl_vars):
            log(f"  {var}: {os.environ[var]}")
    else:
        log("No NCCL environment variables set")

def check_cuda_availability():
    """Verify CUDA is available and functional."""
    log("\n=== CUDA Availability Check ===")
    
    if not torch.cuda.is_available():
        log("ERROR: CUDA not available")
        return False
    
    device_count = torch.cuda.device_count()
    log(f"CUDA device count: {device_count}")
    log(f"PyTorch version: {torch.__version__}")
    
    # Check NCCL availability in PyTorch.
    try:
        import torch.distributed as dist
        log(f"PyTorch NCCL available: {dist.is_nccl_available()}")
    except Exception as e:
        log(f"Error checking NCCL availability: {e}")
    
    return True

def test_individual_gpus():
    """Test that each GPU is working individually."""
    log("\n=== Individual GPU Tests ===")
    
    for gpu_id in range(torch.cuda.device_count()):
        log(f"\n--- Testing GPU {gpu_id} ---")
        
        try:
            torch.cuda.set_device(gpu_id)
            device = torch.cuda.current_device()
            
            log(f"Device {device}: {torch.cuda.get_device_name(device)}")
            
            # Print device properties.
            props = torch.cuda.get_device_properties(device)
            log(f"  Compute capability: {props.major}.{props.minor}")
            log(f"  Total memory: {props.total_memory / 1024**3:.2f} GB")
            
            # Test basic CUDA operations.
            log("  Testing basic CUDA operations...")
            tensor = torch.ones(1000, device=f'cuda:{gpu_id}')
            result = tensor.sum()
            log(f"  Basic CUDA test passed: sum = {result.item()}")
            
            # Test cross-GPU operations if multiple GPUs are available.
            if torch.cuda.device_count() > 1:
                log("  Testing cross-GPU operations...")
                try:
                    other_gpu = (gpu_id + 1) % torch.cuda.device_count()
                    test_tensor = torch.randn(10, 10, device=f'cuda:{gpu_id}')
                    tensor_copy = test_tensor.to(f'cuda:{other_gpu}')
                    log(f"  Cross-GPU copy successful: GPU {gpu_id} -> GPU {other_gpu}")
                except Exception as e:
                    log(f"  Cross-GPU copy failed: {e}")
            
            # Test memory allocation.
            log("  Testing large memory allocations...")
            try:
                large_tensor = torch.zeros(1000, 1000, device=f'cuda:{gpu_id}')
                log("  Large memory allocation successful")
                del large_tensor
            except Exception as e:
                log(f"  Large memory allocation failed: {e}")
                
        except Exception as e:
            log(f"ERROR testing GPU {gpu_id}: {e}")
            import traceback
            log(f"Traceback:\n{traceback.format_exc()}")

def test_nccl_initialization():
    """Test NCCL initialization and basic operations."""
    log("\n=== NCCL Initialization Test ===")
    
    try:
        import torch.distributed as dist
        
        # Set up single-process NCCL environment.
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '29500'
        os.environ['RANK'] = '0'
        os.environ['WORLD_SIZE'] = '1'
        
        log("Attempting single-process NCCL initialization...")
        dist.init_process_group(
            backend='nccl',
            rank=0,
            world_size=1
        )
        
        log("Single-process NCCL initialization successful!")
        
        # Test basic NCCL operation.
        if torch.cuda.is_available():
            device = torch.cuda.current_device()
            tensor = torch.ones(10, device=device)
            
            # This is a no-op with world_size=1 but exercises NCCL
            dist.all_reduce(tensor)
            log("NCCL all_reduce test successful!")
        
        dist.destroy_process_group()
        log("NCCL cleanup successful!")
        
    except Exception as e:
        log(f"NCCL initialization failed: {e}")
        import traceback
        log(f"Full traceback:\n{traceback.format_exc()}")

def main():
    """Main diagnostic routine."""
    log("Starting NCCL Diagnostic Script")
    log("=" * 50)
    
    print_environment_info()
    
    if not check_cuda_availability():
        sys.exit(1)
    
    test_individual_gpus()
    test_nccl_initialization()
    
    log("\n" + "=" * 50)
    log("NCCL diagnostic script completed")
    log("If you encountered errors, check the specific error messages above")
    log("and refer to the troubleshooting guide for solutions.")

if __name__ == "__main__":
    main()
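
Instead of logging in to each node individually, you can also run a lightweight per-node check through Ray itself. The following sketch is a hypothetical helper, not part of the script above: it schedules one task per GPU in the cluster and reports basic CUDA and NCCL availability from wherever the tasks land:

# Per-GPU diagnostic tasks scheduled through Ray.
import ray

ray.init(address="auto")  # Connect to the running cluster.

@ray.remote(num_gpus=1)
def node_gpu_report():
    import socket
    import torch
    import torch.distributed as dist
    return {
        "hostname": socket.gethostname(),
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "nccl_available": dist.is_nccl_available(),
    }

# One task per GPU reported by the cluster; each task reserves a single GPU.
num_gpus = int(ray.cluster_resources().get("GPU", 0))
for report in ray.get([node_gpu_report.remote() for _ in range(num_gpus)]):
    print(report)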