KubeRay integration with Volcano#

Volcano is a batch scheduling system built on Kubernetes. It provides a suite of mechanisms (gang scheduling, job queues, fair-share scheduling policies) currently missing from Kubernetes that many classes of batch and elastic workloads require. KubeRay's Volcano integration enables more efficient scheduling of Ray pods in multi-tenant Kubernetes environments.

Setup#

Step 1: Create a Kubernetes cluster with KinD#

Run the following command in a terminal:

kind create cluster

Step 2: Install Volcano#

You need to successfully install Volcano on your Kubernetes cluster before enabling the KubeRay integration with Volcano. See the quick start guide for Volcano installation instructions.
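For reference, the Volcano quick start applies a manifest from the Volcano repository; a minimal sketch (verify the exact URL and version against the guide before running):

# Install Volcano from its development manifest; pin a release branch for production use
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml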

Step 3: Install the KubeRay operator with batch scheduling support#

Deploy the KubeRay operator with the --batch-scheduler=volcano flag to enable Volcano batch scheduling support.

When installing the KubeRay operator with Helm, use one of the following two options:

  • Set batchScheduler.name to volcano in your values.yaml file:

# values.yaml file
batchScheduler:
    name: volcano
  • Pass the --set batchScheduler.name=volcano flag on the command line:

# Install the Helm chart, setting batchScheduler.name to volcano
helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 --set batchScheduler.name=volcano
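To confirm the operator came up, you can check the deployment that the Helm release above creates; the deployment name follows the release name, kuberay-operator:

kubectl get deployment kuberay-operator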

Step 4: Install a RayCluster with the Volcano scheduler#

The RayCluster custom resource must include the ray.io/scheduler-name: volcano label to submit the cluster's pods to Volcano for scheduling.

# Path: kuberay/ray-operator/config/samples
# Includes label `ray.io/scheduler-name: volcano` in the metadata.labels
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.volcano-scheduler.yaml
kubectl apply -f ray-cluster.volcano-scheduler.yaml

# Check the RayCluster
kubectl get pod -l ray.io/cluster=test-cluster-0
# NAME                                 READY   STATUS    RESTARTS   AGE
# test-cluster-0-head-jj9bg            1/1     Running   0          36s

You can also provide the following labels in the metadata of the RayCluster:

  • ray.io/priority-class-name: the cluster priority class defined by Kubernetes.

    • This label only takes effect after you create a PriorityClass resource; see the sketch after this list.

    • labels:
        ray.io/scheduler-name: volcano
        ray.io/priority-class-name: <replace with correct PriorityClass resource name>
      
  • volcano.sh/queue-name: the Volcano queue name that the cluster submits to.

    • This label only takes effect after you create a Queue resource, such as the one in the gang scheduling example below.

    • labels:
        ray.io/scheduler-name: volcano
        volcano.sh/queue-name: <replace with correct Queue resource name>
      
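For the ray.io/priority-class-name label above, a minimal PriorityClass sketch (the name ray-high-priority is hypothetical; substitute your own):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ray-high-priority  # hypothetical; reference this name in ray.io/priority-class-name
value: 1000                # pods with higher values schedule ahead of lower-priority pods
globalDefault: false
description: "Priority class for Ray cluster pods"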

If you enable autoscaling, minReplicas is used for gang scheduling; otherwise, the desired replicas is used, as the sketch below illustrates.
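A hypothetical workerGroupSpecs fragment, for illustration only, showing which fields count toward the Volcano PodGroup's minMember:

workerGroupSpecs:
- groupName: small-group
  replicas: 2      # used for gang scheduling when autoscaling is disabled
  minReplicas: 1   # used for gang scheduling when autoscaling is enabled
  maxReplicas: 5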

Step 5: Use Volcano for batch scheduling#

See the example below for guidance.

Example#

Before going through the example, remove any running Ray clusters to ensure that the example below runs successfully.

kubectl delete raycluster --all

Gang scheduling#

This example walks through how gang scheduling works with Volcano and KubeRay.

First, create a queue with a capacity of 4 CPUs and 6Gi of RAM.

kubectl create -f - <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi
EOF

The **weight** in the definition above indicates the relative weight of a queue in cluster resource division. Use it when the total **capability** of all the queues in your cluster exceeds the total available resources, forcing the queues to share among themselves. Queues with higher weight receive a proportionally larger share of the total resources.

The **capability** is a hard constraint on the maximum resources the queue supports at any given time. You can update it as needed to allow more or fewer workloads to run at a time. The sketch below illustrates how weight divides resources between queues.
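An illustrative sketch with hypothetical queue names: if the cluster has 8 CPUs in total and both queues below are saturated, Volcano divides the CPUs roughly in proportion to weight, about 2 for queue-a and 6 for queue-b.

kubectl create -f - <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue-a  # hypothetical queue
spec:
  weight: 1
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue-b  # hypothetical queue
spec:
  weight: 3
EOF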

Next, create a RayCluster with a head node (1 CPU + 2Gi of RAM) and two workers (1 CPU + 1Gi of RAM each), for a total of 3 CPUs and 4Gi of RAM.

# Path: kuberay/ray-operator/config/samples
# Includes the `ray.io/scheduler-name: volcano` and `volcano.sh/queue-name: kuberay-test-queue` labels in the metadata.labels
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.volcano-scheduler-queue.yaml
kubectl apply -f ray-cluster.volcano-scheduler-queue.yaml

Because the queue has a capacity of 4 CPUs and 6Gi of RAM, this resource should schedule successfully without any issues. You can verify this by checking the status of the cluster's Volcano PodGroup to see that the phase is Running and the last status is Scheduled.

kubectl get podgroup ray-test-cluster-0-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2022-12-01T04:43:30Z"
#   generation: 2
#   name: ray-test-cluster-0-pg
#   namespace: test
#   ownerReferences:
#   - apiVersion: ray.io/v1alpha1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayCluster
#     name: test-cluster-0
#     uid: 7979b169-f0b0-42b7-8031-daef522d25cf
#   resourceVersion: "4427347"
#   uid: 78902d3d-b490-47eb-ba12-d6f8b721a579
# spec:
#   minMember: 3
#   minResources:
#     cpu: "3"
#     memory: 4Gi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2022-12-01T04:43:31Z"
#     reason: tasks in the gang are ready to be scheduled
#     status: "True"
#     transitionID: f89f3062-ebd7-486b-8763-18ccdba1d585
#     type: Scheduled
#   phase: Running

Check the status of the queue to see the allocated resources.

kubectl get queue kuberay-test-queue -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: Queue
# metadata:
#   creationTimestamp: "2022-12-01T04:43:21Z"
#   generation: 1
#   name: kuberay-test-queue
#   resourceVersion: "4427348"
#   uid: a6c4f9df-d58c-4da8-8a58-e01c93eca45a
# spec:
#   capability:
#     cpu: 4
#     memory: 6Gi
#   reclaimable: true
#   weight: 1
# status:
#   allocated:
#     cpu: "3"
#     memory: 4Gi 
#     pods: "3"
#   reservation: {}
#   state: Open

Next, add an additional RayCluster with the same configuration of head and worker nodes, but with a different name.

# Path: kuberay/ray-operator/config/samples
# Includes the `ray.io/scheduler-name: volcano` and `volcano.sh/queue-name: kuberay-test-queue` labels in the metadata.labels
# Replaces the name to test-cluster-1
sed 's/test-cluster-0/test-cluster-1/' ray-cluster.volcano-scheduler-queue.yaml | kubectl apply -f-

Check the status of its PodGroup to see that its phase is Pending and its last status is Unschedulable.

kubectl get podgroup ray-test-cluster-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2022-12-01T04:48:18Z"
#   generation: 2
#   name: ray-test-cluster-1-pg
#   namespace: test
#   ownerReferences:
#   - apiVersion: ray.io/v1alpha1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayCluster
#     name: test-cluster-1
#     uid: b3cf83dc-ef3a-4bb1-9c42-7d2a39c53358
#   resourceVersion: "4427976"
#   uid: 9087dd08-8f48-4592-a62e-21e9345b0872
# spec:
#   minMember: 3
#   minResources:
#     cpu: "3"
#     memory: 4Gi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2022-12-01T04:48:19Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Undetermined'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: 3956b64f-fc52-4779-831e-d379648eecfc
#     type: Unschedulable
#   phase: Pending

Because the new cluster requires more CPU and RAM than the queue allows, even though one of the pods would fit in the remaining 1 CPU and 2Gi of RAM, none of the cluster's pods are placed until there is enough room for all of them. Without using Volcano for gang scheduling in this fashion, one of the pods would ordinarily be placed, leading to the cluster being partially allocated and some jobs (like Horovod training) getting stuck waiting for resources to become available.
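To see both gangs side by side, you can list the PodGroups. The output below is an illustrative sketch; the exact columns depend on your Volcano version.

kubectl get podgroup

# NAME                    STATUS    MINMEMBER   RUNNINGS   AGE
# ray-test-cluster-0-pg   Running   3           3          7m
# ray-test-cluster-1-pg   Pending   3                      2m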

See the effect this has on scheduling the pods of the new RayCluster, which are listed as Pending.

kubectl get pods

# NAME                                            READY   STATUS         RESTARTS   AGE
# test-cluster-0-worker-worker-ddfbz              1/1     Running        0          7m
# test-cluster-0-head                             1/1     Running        0          7m
# test-cluster-0-worker-worker-57pc7              1/1     Running        0          6m59s
# test-cluster-1-worker-worker-6tzf7              0/1     Pending        0          2m12s
# test-cluster-1-head                             0/1     Pending        0          2m12s
# test-cluster-1-worker-worker-n5g8k              0/1     Pending        0          2m12s

Look at the details of one of the pods to see why Volcano can't schedule the gang.

kubectl describe pod test-cluster-1-head-6668q | tail -n 3

# Type     Reason            Age   From     Message
# ----     ------            ----  ----     -------
# Warning  FailedScheduling  4m5s  volcano  3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending, 3 minAvailable; Pending: 3 Undetermined

Delete the first RayCluster to clear space in the queue.

kubectl delete raycluster test-cluster-0

The PodGroup of the second cluster changes to the Running state, because enough resources are now available to schedule the entire set of pods.

kubectl get podgroup ray-test-cluster-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2022-12-01T04:48:18Z"
#   generation: 9
#   name: ray-test-cluster-1-pg
#   namespace: test
#   ownerReferences:
#   - apiVersion: ray.io/v1alpha1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayCluster
#     name: test-cluster-1
#     uid: b3cf83dc-ef3a-4bb1-9c42-7d2a39c53358
#   resourceVersion: "4428864"
#   uid: 9087dd08-8f48-4592-a62e-21e9345b0872
# spec:
#   minMember: 3
#   minResources:
#     cpu: "3"
#     memory: 4Gi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2022-12-01T04:54:04Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Undetermined'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: db90bbf0-6845-441b-8992-d0e85f78db77
#     type: Unschedulable
#   - lastTransitionTime: "2022-12-01T04:55:10Z"
#     reason: tasks in the gang are ready to be scheduled
#     status: "True"
#     transitionID: 72bbf1b3-d501-4528-a59d-479504f3eaf5
#     type: Scheduled
#   phase: Running
#   running: 3

Check the pods again to see that the second cluster is now up and running.

kubectl get pods

# NAME                                            READY   STATUS         RESTARTS   AGE
# test-cluster-1-worker-worker-n5g8k              1/1     Running        0          9m4s
# test-cluster-1-head                             1/1     Running        0          9m4s
# test-cluster-1-worker-worker-6tzf7              1/1     Running        0          9m4s

Finally, clean up the remaining cluster and queue.

kubectl delete raycluster test-cluster-1
kubectl delete queue kuberay-test-queue

Gang scheduling for RayJob with Volcano#

Starting with KubeRay 1.5.1, KubeRay supports gang scheduling for the RayJob custom resource.

First, create a queue with a capacity of 4 CPUs and 6Gi of RAM, and a RayJob with a head node (1 CPU + 2Gi of RAM), two workers (1 CPU + 1Gi of RAM each), and a submitter pod (0.5 CPU + 200Mi of RAM), for a total of 3500m of CPU and 4296Mi of RAM (4Gi = 4096Mi for the Ray pods, plus 200Mi for the submitter).
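As with RayCluster, the scheduler labels live in the RayJob metadata. A minimal sketch of the relevant fields, which the sample manifest below already carries:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-0
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue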

curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml
kubectl apply -f ray-job.volcano-scheduler-queue.yaml

Wait until all pods reach the Running state; the submitter pod shows Completed once the job finishes.

kubectl get pod

# NAME                                             READY   STATUS      RESTARTS   AGE
# rayjob-sample-0-k449j-head-rlgxj                 1/1     Running     0          93s
# rayjob-sample-0-k449j-small-group-worker-c6dt8   1/1     Running     0          93s
# rayjob-sample-0-k449j-small-group-worker-cq6xn   1/1     Running     0          93s
# rayjob-sample-0-qmm8s                            0/1     Completed   0          32s

Add another RayJob with the same configuration but a different name.

sed 's/rayjob-sample-0/rayjob-sample-1/' ray-job.volcano-scheduler-queue.yaml | kubectl apply -f-

All of the new RayJob's pods are stuck in the Pending state.

kubectl get pod

# NAME                                             READY   STATUS      RESTARTS   AGE
# rayjob-sample-0-k449j-head-rlgxj                 1/1     Running     0          3m27s
# rayjob-sample-0-k449j-small-group-worker-c6dt8   1/1     Running     0          3m27s
# rayjob-sample-0-k449j-small-group-worker-cq6xn   1/1     Running     0          3m27s
# rayjob-sample-0-qmm8s                            0/1     Completed   0          2m26s
# rayjob-sample-1-mvgqf-head-qb7wm                 0/1     Pending     0          21s
# rayjob-sample-1-mvgqf-small-group-worker-jfzt5   0/1     Pending     0          21s
# rayjob-sample-1-mvgqf-small-group-worker-ng765   0/1     Pending     0          21s

Check the status of its PodGroup to see that its phase is Pending and its last status is Unschedulable.

kubectl get podgroup ray-rayjob-sample-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2025-10-30T17:10:18Z"
#   generation: 2
#   name: ray-rayjob-sample-1-pg
#   namespace: default
#   ownerReferences:
#   - apiVersion: ray.io/v1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayJob
#     name: rayjob-sample-1
#     uid: 5835c896-c75d-4692-b10a-2871a79f141a
#   resourceVersion: "3226"
#   uid: 9fd55cbd-ba69-456d-b305-f61ffd6d935d
# spec:
#   minMember: 3
#   minResources:
#     cpu: 3500m
#     memory: 4296Mi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2025-10-30T17:10:18Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Unschedulable'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: 7866f533-6590-4a4d-83cf-8f1db0214609
#     type: Unschedulable
#   phase: Pending

Delete the first RayJob to clear space in the queue.

kubectl delete rayjob rayjob-sample-0

The PodGroup of the second RayJob changes to the Running state, because enough resources are now available to schedule the entire set of pods.

kubectl get podgroup ray-rayjob-sample-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2025-10-30T17:10:18Z"
#   generation: 7
#   name: ray-rayjob-sample-1-pg
#   namespace: default
#   ownerReferences:
#   - apiVersion: ray.io/v1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayJob
#     name: rayjob-sample-1
#     uid: 5835c896-c75d-4692-b10a-2871a79f141a
#   resourceVersion: "3724"
#   uid: 9fd55cbd-ba69-456d-b305-f61ffd6d935d
# spec:
#   minMember: 3
#   minResources:
#     cpu: 3500m
#     memory: 4296Mi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2025-10-30T17:10:18Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Unschedulable'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: 7866f533-6590-4a4d-83cf-8f1db0214609
#     type: Unschedulable
#   - lastTransitionTime: "2025-10-30T17:14:44Z"
#     reason: tasks in gang are ready to be scheduled
#     status: "True"
#     transitionID: 36e0222d-eee3-444a-9889-5b9c255f41af
#     type: Scheduled
#   phase: Running
#   running: 4

Check the pods again to see that the second RayJob is now up and running.

kubectl get pod
# NAME                                             READY   STATUS      RESTARTS   AGE
# rayjob-sample-1-mvgqf-head-qb7wm                 1/1     Running     0          5m47s
# rayjob-sample-1-mvgqf-small-group-worker-jfzt5   1/1     Running     0          5m47s
# rayjob-sample-1-mvgqf-small-group-worker-ng765   1/1     Running     0          5m47s
# rayjob-sample-1-tcd4m                            0/1     Completed   0          84s

Finally, clean up the remaining RayJob, queue, and ConfigMap.

kubectl delete rayjob rayjob-sample-1
kubectl delete queue kuberay-test-queue
kubectl delete configmap ray-job-code-sample