# KubeRay integration with Volcano

Volcano is a batch scheduling system built on Kubernetes. It provides a suite of mechanisms (gang scheduling, job queueing, fair-share scheduling policies, and so on) currently missing from Kubernetes that many classes of batch and elastic workloads require. KubeRay's Volcano integration enables more efficient scheduling of Ray pods in multi-tenant Kubernetes environments.
## Setup

### Step 1: Create a Kubernetes cluster with KinD

Run the following command in a terminal:

```bash
kind create cluster
```
### Step 2: Install Volcano

You need to successfully install Volcano on your Kubernetes cluster before enabling the KubeRay integration with Volcano. Refer to the Volcano Quick Start Guide for installation instructions.
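For convenience, one common installation path from the Volcano quick start is to apply its manifest directly; treat the exact URL below as an assumption and prefer a pinned release that matches your Volcano version:

```bash
# Install Volcano from its development manifest (path assumed from the
# Volcano quick start; pin a release for production clusters)
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

# Wait for the Volcano components to come up
kubectl get pods -n volcano-system
```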
### Step 3: Install the KubeRay operator with batch scheduling support

Deploy the KubeRay operator with the `--batch-scheduler=volcano` flag to enable Volcano batch scheduling support.

When installing the KubeRay operator with Helm, use one of the following two options:

- Set `batchScheduler.name` to `volcano` in your `values.yaml` file:

  ```yaml
  # values.yaml file
  batchScheduler:
    name: volcano
  ```

- Pass the `--set batchScheduler.name=volcano` flag on the command line:

  ```bash
  # Install the Helm chart, setting batchScheduler.name to volcano
  helm install kuberay-operator kuberay/kuberay-operator --version 1.5.1 --set batchScheduler.name=volcano
  ```
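Either way, it's worth confirming the operator is up before moving on. The label selector below is an assumption based on the Helm chart's conventions, so adjust it if your install differs:

```bash
# Check that the operator pod is running (label selector assumed from
# the kuberay-operator Helm chart's defaults)
kubectl get pods -l app.kubernetes.io/name=kuberay-operator
```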
### Step 4: Install a RayCluster with the Volcano scheduler

The RayCluster custom resource must include the `ray.io/scheduler-name: volcano` label to submit the cluster's pods to Volcano for scheduling.
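Concretely, the label goes under `metadata.labels`. A minimal sketch (the cluster name matches the sample used below; the group specs are elided):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-cluster-0
  labels:
    ray.io/scheduler-name: volcano
spec:
  # ... headGroupSpec and workerGroupSpecs as usual ...
```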
```bash
# Path: kuberay/ray-operator/config/samples
# Includes the `ray.io/scheduler-name: volcano` label in metadata.labels
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.volcano-scheduler.yaml
kubectl apply -f ray-cluster.volcano-scheduler.yaml

# Check the RayCluster
kubectl get pod -l ray.io/cluster=test-cluster-0

# NAME                        READY   STATUS    RESTARTS   AGE
# test-cluster-0-head-jj9bg   1/1     Running   0          36s
```
You can also provide the following labels in the metadata of your RayCluster:

- `ray.io/priority-class-name`: the cluster priority class as defined by Kubernetes. This label only takes effect after you create the `PriorityClass` resource (see the sketch after this list):

  ```yaml
  labels:
    ray.io/scheduler-name: volcano
    ray.io/priority-class-name: <replace with correct PriorityClass resource name>
  ```

- `volcano.sh/queue-name`: the name of the Volcano queue the cluster submits to. This label only takes effect after you create the `Queue` resource; the gang scheduling example below walks through creating one:

  ```yaml
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: <replace with correct Queue resource name>
  ```
If autoscaling is enabled, gang scheduling uses `minReplicas`; otherwise, it uses the desired `replicas`.
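For instance, in a worker group like the following sketch (field values illustrative), the gang is sized from `minReplicas: 1` when autoscaling is on, and from `replicas: 2` when it's off:

```yaml
workerGroupSpecs:
- groupName: small-group
  replicas: 2        # used for gang scheduling when autoscaling is disabled
  minReplicas: 1     # used for gang scheduling when autoscaling is enabled
  maxReplicas: 5
  # ... pod template as usual ...
```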
### Step 5: Use Volcano for batch scheduling

For guidance, see the examples below.
## Examples

Before going through the examples, remove any running Ray clusters to ensure that the examples below run successfully:

```bash
kubectl delete raycluster --all
```
### Gang scheduling

This example walks through how gang scheduling works with Volcano and KubeRay.

First, create a queue with a capacity of 4 CPUs and 6Gi of RAM:
```bash
kubectl create -f - <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi
EOF
```
The **weight** in the definition above is an indicator of the relative weight of the queue in cluster resource division. Use it when the total **capability** of all the queues in your cluster exceeds the total available resources, forcing the queues to share among themselves. Queues with higher weight receive a proportionally larger share of the total available resources.

The **capability** is a hard constraint on the maximum resources the queue supports at any given time. You can update it as needed to allow more or fewer workloads to run at a time.
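To make **weight** concrete, here's a hypothetical pair of queues; under contention, `queue-b` is entitled to roughly twice the share of `queue-a`. The names are illustrative and aren't used elsewhere in this guide:

```bash
kubectl create -f - <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue-a        # illustrative
spec:
  weight: 1            # entitled to ~1/3 of contended resources
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue-b        # illustrative
spec:
  weight: 2            # entitled to ~2/3 of contended resources
EOF
```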
Next, create a RayCluster with a head node (1 CPU + 2Gi of RAM) and two workers (1 CPU + 1Gi of RAM each), for a total of 3 CPUs and 4Gi of RAM:
```bash
# Path: kuberay/ray-operator/config/samples
# Includes the `ray.io/scheduler-name: volcano` and `volcano.sh/queue-name: kuberay-test-queue` labels in metadata.labels
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-cluster.volcano-scheduler-queue.yaml
kubectl apply -f ray-cluster.volcano-scheduler-queue.yaml
```
Because the queue has a capacity of 4 CPUs and 6Gi of RAM, this resource should schedule successfully without any issues. You can verify this by checking the status of the cluster's Volcano PodGroup, to see that the phase is `Running` and the last status is `Scheduled`:
```bash
kubectl get podgroup ray-test-cluster-0-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2022-12-01T04:43:30Z"
#   generation: 2
#   name: ray-test-cluster-0-pg
#   namespace: test
#   ownerReferences:
#   - apiVersion: ray.io/v1alpha1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayCluster
#     name: test-cluster-0
#     uid: 7979b169-f0b0-42b7-8031-daef522d25cf
#   resourceVersion: "4427347"
#   uid: 78902d3d-b490-47eb-ba12-d6f8b721a579
# spec:
#   minMember: 3
#   minResources:
#     cpu: "3"
#     memory: 4Gi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2022-12-01T04:43:31Z"
#     reason: tasks in the gang are ready to be scheduled
#     status: "True"
#     transitionID: f89f3062-ebd7-486b-8763-18ccdba1d585
#     type: Scheduled
#   phase: Running
```
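If you only need the phase rather than the whole object, a `jsonpath` query keeps the check scriptable:

```bash
# Print just the PodGroup phase; expect "Running" here
kubectl get podgroup ray-test-cluster-0-pg -o jsonpath='{.status.phase}'
```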
Check the status of the queue to see the allocated resources:
```bash
kubectl get queue kuberay-test-queue -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: Queue
# metadata:
#   creationTimestamp: "2022-12-01T04:43:21Z"
#   generation: 1
#   name: kuberay-test-queue
#   resourceVersion: "4427348"
#   uid: a6c4f9df-d58c-4da8-8a58-e01c93eca45a
# spec:
#   capability:
#     cpu: 4
#     memory: 6Gi
#   reclaimable: true
#   weight: 1
# status:
#   allocated:
#     cpu: "3"
#     memory: 4Gi
#     pods: "3"
#   reservation: {}
#   state: Open
```
Next, add an additional RayCluster with the same configuration of head and worker nodes, but a different name:
```bash
# Path: kuberay/ray-operator/config/samples
# Includes the `ray.io/scheduler-name: volcano` and `volcano.sh/queue-name: kuberay-test-queue` labels in metadata.labels
# Replaces the name with test-cluster-1
sed 's/test-cluster-0/test-cluster-1/' ray-cluster.volcano-scheduler-queue.yaml | kubectl apply -f-
```
Check the status of its PodGroup to see that its phase is `Pending` and the last status is `Unschedulable`:
```bash
kubectl get podgroup ray-test-cluster-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2022-12-01T04:48:18Z"
#   generation: 2
#   name: ray-test-cluster-1-pg
#   namespace: test
#   ownerReferences:
#   - apiVersion: ray.io/v1alpha1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayCluster
#     name: test-cluster-1
#     uid: b3cf83dc-ef3a-4bb1-9c42-7d2a39c53358
#   resourceVersion: "4427976"
#   uid: 9087dd08-8f48-4592-a62e-21e9345b0872
# spec:
#   minMember: 3
#   minResources:
#     cpu: "3"
#     memory: 4Gi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2022-12-01T04:48:19Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Undetermined'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: 3956b64f-fc52-4779-831e-d379648eecfc
#     type: Unschedulable
#   phase: Pending
```
Because the new cluster requires more CPU and RAM than the queue allows, even though one of its pods would fit in the remaining 1 CPU and 2Gi of RAM, none of the cluster's pods are placed until there is enough room for all of them. Without Volcano performing gang scheduling this way, one of the pods would ordinarily be placed, leading to a partially allocated cluster and some jobs (such as Horovod training) getting stuck waiting for resources to become available.

See the effect this has on the scheduling of the new RayCluster's pods, which are listed as `Pending`:
```bash
kubectl get pods

# NAME                                 READY   STATUS    RESTARTS   AGE
# test-cluster-0-worker-worker-ddfbz   1/1     Running   0          7m
# test-cluster-0-head                  1/1     Running   0          7m
# test-cluster-0-worker-worker-57pc7   1/1     Running   0          6m59s
# test-cluster-1-worker-worker-6tzf7   0/1     Pending   0          2m12s
# test-cluster-1-head                  0/1     Pending   0          2m12s
# test-cluster-1-worker-worker-n5g8k   0/1     Pending   0          2m12s
```
Look at the details of one of the pending pods to see why Volcano can't schedule the gang:
```bash
kubectl describe pod test-cluster-1-head-6668q | tail -n 3

# Type     Reason            Age   From     Message
# ----     ------            ----  ----     -------
# Warning  FailedScheduling  4m5s  volcano  3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending, 3 minAvailable; Pending: 3 Undetermined
```
Delete the first RayCluster to free up space in the queue:
```bash
kubectl delete raycluster test-cluster-0
```
The PodGroup of the second cluster moves to the `Running` state, because enough resources are now available to schedule the entire set of pods:
```bash
kubectl get podgroup ray-test-cluster-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2022-12-01T04:48:18Z"
#   generation: 9
#   name: ray-test-cluster-1-pg
#   namespace: test
#   ownerReferences:
#   - apiVersion: ray.io/v1alpha1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayCluster
#     name: test-cluster-1
#     uid: b3cf83dc-ef3a-4bb1-9c42-7d2a39c53358
#   resourceVersion: "4428864"
#   uid: 9087dd08-8f48-4592-a62e-21e9345b0872
# spec:
#   minMember: 3
#   minResources:
#     cpu: "3"
#     memory: 4Gi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2022-12-01T04:54:04Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Undetermined'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: db90bbf0-6845-441b-8992-d0e85f78db77
#     type: Unschedulable
#   - lastTransitionTime: "2022-12-01T04:55:10Z"
#     reason: tasks in the gang are ready to be scheduled
#     status: "True"
#     transitionID: 72bbf1b3-d501-4528-a59d-479504f3eaf5
#     type: Scheduled
#   phase: Running
#   running: 3
```
Check the pods again to see that the second cluster is now up and running:
```bash
kubectl get pods

# NAME                                 READY   STATUS    RESTARTS   AGE
# test-cluster-1-worker-worker-n5g8k   1/1     Running   0          9m4s
# test-cluster-1-head                  1/1     Running   0          9m4s
# test-cluster-1-worker-worker-6tzf7   1/1     Running   0          9m4s
```
Finally, clean up the remaining cluster and queue:
```bash
kubectl delete raycluster test-cluster-1
kubectl delete queue kuberay-test-queue
```
### Gang scheduling with Volcano for RayJob

Starting with KubeRay 1.5.1, KubeRay supports gang scheduling for the RayJob custom resource.
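As with RayCluster, the scheduler and queue labels go under the RayJob's `metadata.labels`. A minimal sketch, assuming the `kuberay-test-queue` used in this example (the job spec is elided):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-0
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  # ... entrypoint and rayClusterSpec as usual ...
```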
First, create a queue with a capacity of 4 CPUs and 6Gi of RAM, along with a RayJob consisting of a head node (1 CPU + 2Gi of RAM), two workers (1 CPU + 1Gi of RAM each), and a submitter pod (0.5 CPU + 200Mi of RAM), for a total of 3500m CPU and 4296Mi of RAM:
```bash
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.5.1/ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml
kubectl apply -f ray-job.volcano-scheduler-queue.yaml
```
Wait for all the pods to reach the `Running` state. Note that the submitter pod shows `Completed` once it finishes submitting the job:
```bash
kubectl get pod

# NAME                                             READY   STATUS      RESTARTS   AGE
# rayjob-sample-0-k449j-head-rlgxj                 1/1     Running     0          93s
# rayjob-sample-0-k449j-small-group-worker-c6dt8   1/1     Running     0          93s
# rayjob-sample-0-k449j-small-group-worker-cq6xn   1/1     Running     0          93s
# rayjob-sample-0-qmm8s                            0/1     Completed   0          32s
```
Add another RayJob with the same configuration but a different name:
```bash
sed 's/rayjob-sample-0/rayjob-sample-1/' ray-job.volcano-scheduler-queue.yaml | kubectl apply -f-
```
All of the new RayJob's pods are stuck in the `Pending` state:

```bash
kubectl get pod

# NAME                                             READY   STATUS      RESTARTS   AGE
# rayjob-sample-0-k449j-head-rlgxj                 1/1     Running     0          3m27s
# rayjob-sample-0-k449j-small-group-worker-c6dt8   1/1     Running     0          3m27s
# rayjob-sample-0-k449j-small-group-worker-cq6xn   1/1     Running     0          3m27s
# rayjob-sample-0-qmm8s                            0/1     Completed   0          2m26s
# rayjob-sample-1-mvgqf-head-qb7wm                 0/1     Pending     0          21s
# rayjob-sample-1-mvgqf-small-group-worker-jfzt5   0/1     Pending     0          21s
# rayjob-sample-1-mvgqf-small-group-worker-ng765   0/1     Pending     0          21s
```
Check the status of its PodGroup to see that its phase is `Pending` and the last status is `Unschedulable`:
```bash
kubectl get podgroup ray-rayjob-sample-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2025-10-30T17:10:18Z"
#   generation: 2
#   name: ray-rayjob-sample-1-pg
#   namespace: default
#   ownerReferences:
#   - apiVersion: ray.io/v1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayJob
#     name: rayjob-sample-1
#     uid: 5835c896-c75d-4692-b10a-2871a79f141a
#   resourceVersion: "3226"
#   uid: 9fd55cbd-ba69-456d-b305-f61ffd6d935d
# spec:
#   minMember: 3
#   minResources:
#     cpu: 3500m
#     memory: 4296Mi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2025-10-30T17:10:18Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Unschedulable'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: 7866f533-6590-4a4d-83cf-8f1db0214609
#     type: Unschedulable
#   phase: Pending
```
Delete the first RayJob to free up space in the queue:
```bash
kubectl delete rayjob rayjob-sample-0
```
The PodGroup of the second RayJob moves to the `Running` state, because enough resources are now available to schedule the entire set of pods:
```bash
kubectl get podgroup ray-rayjob-sample-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2025-10-30T17:10:18Z"
#   generation: 7
#   name: ray-rayjob-sample-1-pg
#   namespace: default
#   ownerReferences:
#   - apiVersion: ray.io/v1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayJob
#     name: rayjob-sample-1
#     uid: 5835c896-c75d-4692-b10a-2871a79f141a
#   resourceVersion: "3724"
#   uid: 9fd55cbd-ba69-456d-b305-f61ffd6d935d
# spec:
#   minMember: 3
#   minResources:
#     cpu: 3500m
#     memory: 4296Mi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2025-10-30T17:10:18Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Unschedulable'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: 7866f533-6590-4a4d-83cf-8f1db0214609
#     type: Unschedulable
#   - lastTransitionTime: "2025-10-30T17:14:44Z"
#     reason: tasks in gang are ready to be scheduled
#     status: "True"
#     transitionID: 36e0222d-eee3-444a-9889-5b9c255f41af
#     type: Scheduled
#   phase: Running
#   running: 4
```
Check the pods again to see that the second RayJob is now up and running:
```bash
kubectl get pod

# NAME                                             READY   STATUS      RESTARTS   AGE
# rayjob-sample-1-mvgqf-head-qb7wm                 1/1     Running     0          5m47s
# rayjob-sample-1-mvgqf-small-group-worker-jfzt5   1/1     Running     0          5m47s
# rayjob-sample-1-mvgqf-small-group-worker-ng765   1/1     Running     0          5m47s
# rayjob-sample-1-tcd4m                            0/1     Completed   0          84s
```
Finally, clean up the remaining RayJob, queue, and ConfigMap:

```bash
kubectl delete rayjob rayjob-sample-1
kubectl delete queue kuberay-test-queue
kubectl delete configmap ray-job-code-sample
```