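KubeRay integration with Volcano

Volcano is a batch scheduling system built on Kubernetes. It provides a suite of mechanisms that Kubernetes currently lacks (gang scheduling, job queues, fair scheduling policies) and that many classes of batch and elastic workloads commonly require. KubeRay's Volcano integration enables more efficient scheduling of Ray Pods in multi-tenant Kubernetes environments.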

Setup

Step 1: Create a Kubernetes cluster with KinD

Run the following command in a terminal:

kind create cluster

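Step 2: Install Volcano

Volcano must be successfully installed on your Kubernetes cluster before you enable the KubeRay integration with Volcano. For installation instructions, see the Volcano quick start guide.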

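Step 3: Install the KubeRay Operator with batch scheduling support

Deploy the KubeRay Operator with the --enable-batch-scheduler flag to enable Volcano batch scheduling support.

When installing the KubeRay Operator with Helm, use one of the following two options:

  • Set batchScheduler.enabled to true in your values.yaml file: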

# values.yaml file
batchScheduler:
    enabled: true
  • Pass the --set batchScheduler.enabled=true flag on the command line:

# Install the Helm chart with --enable-batch-scheduler flag set to true
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0 --set batchScheduler.enabled=true

Step 4: Install a RayCluster that uses the Volcano scheduler

The RayCluster custom resource must include the ray.io/scheduler-name: volcano label so that the cluster's Pods are submitted to Volcano for scheduling.

# Path: kuberay/ray-operator/config/samples
# Includes label `ray.io/scheduler-name: volcano` in the metadata.labels
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-cluster.volcano-scheduler.yaml
kubectl apply -f ray-cluster.volcano-scheduler.yaml

# Check the RayCluster
kubectl get pod -l ray.io/cluster=test-cluster-0
# NAME                                 READY   STATUS    RESTARTS   AGE
# test-cluster-0-head-jj9bg            1/1     Running   0          36s
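
Before applying a manifest, it can help to confirm that it actually carries the required label. The following is a minimal sketch, not part of KubeRay: has_volcano_label is a hypothetical helper that does a plain text match with grep rather than parsing YAML, so it assumes the label sits on its own line in the manifest.

```shell
#!/bin/sh
# Hypothetical helper (not a KubeRay tool): succeed if the manifest file
# contains the line `ray.io/scheduler-name: volcano`.
# This is a text match only; it does not verify that the line is under
# metadata.labels of a RayCluster.
has_volcano_label() {
  grep -Eq '^[[:space:]]*ray\.io/scheduler-name:[[:space:]]*"?volcano"?[[:space:]]*$' "$1"
}

# Check the sample manifest from this step, if it was downloaded.
manifest="ray-cluster.volcano-scheduler.yaml"
if [ -f "$manifest" ]; then
  if has_volcano_label "$manifest"; then
    echo "label present"
  else
    echo "label missing"
  fi
fi
```

A text match like this only catches a missing or misspelled label; it does not replace checking the created RayCluster with kubectl.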

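You can also provide the following labels in the RayCluster metadata:

  • ray.io/priority-class-name: the cluster priority class, as defined by Kubernetes.

    • This label takes effect only after you create the PriorityClass resource.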

    • labels:
        ray.io/scheduler-name: volcano
        ray.io/priority-class-name: <replace with correct PriorityClass resource name>
      
  • volcano.sh/queue-name: the name of the Volcano queue that the cluster is submitted to.

    • This label takes effect only after you create the Queue resource.

    • labels:
        ray.io/scheduler-name: volcano
        volcano.sh/queue-name: <replace with correct Queue resource name>
      

If autoscaling is enabled, minReplicas is used for gang scheduling; otherwise, the desired replicas value is used.
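
The rule above can be sketched as a small calculation. This is an illustration only: gang_min_member is a hypothetical helper, and it assumes a single worker group and that the head Pod always counts toward the gang (which matches the minMember: 3 shown for one head plus two workers in the example later on this page).

```shell
#!/bin/sh
# Hypothetical helper: estimate the gang size (PodGroup minMember) for a
# RayCluster with one head Pod and one worker group.
# Usage: gang_min_member <autoscaling: true|false> <replicas> <minReplicas>
gang_min_member() {
  autoscaling="$1"
  replicas="$2"
  min_replicas="$3"
  if [ "$autoscaling" = "true" ]; then
    # With autoscaling, only the minimum number of workers must start together.
    workers="$min_replicas"
  else
    # Without autoscaling, all desired workers are part of the gang.
    workers="$replicas"
  fi
  # Assumption: the head Pod is always a gang member.
  echo $((1 + workers))
}

gang_min_member false 2 1   # prints 3
gang_min_member true 2 1    # prints 2
```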

Step 5: Use Volcano for batch scheduling

For guidance, see the examples.

Examples

Before going through the examples, delete any running Ray clusters so that the following examples run successfully:

kubectl delete raycluster --all

Gang scheduling

This example walks through how gang scheduling works with Volcano and KubeRay.

First, create a queue with a capacity of 4 CPUs and 6Gi of RAM:

kubectl create -f - <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi
EOF

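The weight in the definition above indicates the queue's relative weight when cluster resources are divided. When the total capability of all queues in the cluster exceeds the total available resources, this parameter forces the queues to share resources among themselves: queues with a higher weight are allocated a proportionally larger share of the total resources.

The capability is a hard limit on the maximum resources the queue supports at any given time. You can update it as needed to allow more or fewer workloads to run at a particular time.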
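
To make the interaction between weight and capability concrete, here is a deliberately simplified model, not Volcano's actual algorithm: each queue gets a share of the total proportional to its weight, capped at its capability. The queue_share helper, the 8-CPU cluster, and the second queue with weight 3 are all assumptions made up for illustration; values are in millicores so that integer arithmetic suffices.

```shell
#!/bin/sh
# Simplified, hypothetical model of weighted sharing (not Volcano's real logic).
# Usage: queue_share <total_milli_cpu> <queue_weight> <sum_of_weights> <capability_milli_cpu>
queue_share() {
  total="$1"
  weight="$2"
  weight_sum="$3"
  cap="$4"
  # Proportional share by weight.
  share=$((total * weight / weight_sum))
  # capability acts as a hard upper bound.
  if [ "$share" -gt "$cap" ]; then
    share="$cap"
  fi
  echo "$share"
}

# Assumed 8-CPU cluster shared by two queues with weights 1 and 3,
# each with a 4-CPU capability.
queue_share 8000 1 4 4000   # prints 2000
queue_share 8000 3 4 4000   # prints 4000 (6000 capped by capability)
```

In this model the cap frees up 2000 millicores that a real scheduler could hand to other queues; the sketch does not redistribute them.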

Next, create a RayCluster with a head node (1 CPU + 2Gi of RAM) and two workers (1 CPU + 1Gi of RAM each), for a total of 3 CPUs and 4Gi of RAM:

# Path: kuberay/ray-operator/config/samples
# Includes the `ray.io/scheduler-name: volcano` and `volcano.sh/queue-name: kuberay-test-queue` labels in the metadata.labels
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-cluster.volcano-scheduler-queue.yaml
kubectl apply -f ray-cluster.volcano-scheduler-queue.yaml

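Because the queue has a capacity of 4 CPUs and 6Gi of RAM, this resource should schedule successfully without any issues. To verify this, check the status of the cluster's Volcano PodGroup and confirm that its phase is Running and its last status is Scheduled: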

kubectl get podgroup ray-test-cluster-0-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2022-12-01T04:43:30Z"
#   generation: 2
#   name: ray-test-cluster-0-pg
#   namespace: test
#   ownerReferences:
#   - apiVersion: ray.io/v1alpha1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayCluster
#     name: test-cluster-0
#     uid: 7979b169-f0b0-42b7-8031-daef522d25cf
#   resourceVersion: "4427347"
#   uid: 78902d3d-b490-47eb-ba12-d6f8b721a579
# spec:
#   minMember: 3
#   minResources:
#     cpu: "3"
#     memory: 4Gi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2022-12-01T04:43:31Z"
#     reason: tasks in the gang are ready to be scheduled
#     status: "True"
#     transitionID: f89f3062-ebd7-486b-8763-18ccdba1d585
#     type: Scheduled
#   phase: Running
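
Instead of reading the full YAML, the phase can be extracted and polled from a script. The sketch below uses kubectl's -o jsonpath output option to read .status.phase; podgroup_phase and wait_for_phase are hypothetical helper names, and the one-second polling interval is an arbitrary choice.

```shell
#!/bin/sh
# Print the .status.phase of a Volcano PodGroup in the current namespace.
# Usage: podgroup_phase <podgroup-name>
podgroup_phase() {
  kubectl get podgroup "$1" -o jsonpath='{.status.phase}'
}

# Poll until the PodGroup reaches the wanted phase or the attempts run out.
# Usage: wait_for_phase <podgroup-name> <phase> <max-attempts>
# Returns 0 once the phase matches, 1 otherwise.
wait_for_phase() {
  attempt=0
  while [ "$attempt" -lt "$3" ]; do
    if [ "$(podgroup_phase "$1")" = "$2" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
  return 1
}
```

For example, wait_for_phase ray-test-cluster-0-pg Running 30 would wait up to roughly 30 seconds for the first cluster's PodGroup to be scheduled.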

Check the status of the queue to see that it has 1 running job:

kubectl get queue kuberay-test-queue -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: Queue
# metadata:
#   creationTimestamp: "2022-12-01T04:43:21Z"
#   generation: 1
#   name: kuberay-test-queue
#   resourceVersion: "4427348"
#   uid: a6c4f9df-d58c-4da8-8a58-e01c93eca45a
# spec:
#   capability:
#     cpu: 4
#     memory: 6Gi
#   reclaimable: true
#   weight: 1
# status:
#   reservation: {}
#   running: 1
#   state: Open

Next, add an additional RayCluster with the same configuration as the previous one, but a different name:

# Path: kuberay/ray-operator/config/samples
# Includes the `ray.io/scheduler-name: volcano` and `volcano.sh/queue-name: kuberay-test-queue` labels in the metadata.labels
# Replaces the name with test-cluster-1
sed 's/test-cluster-0/test-cluster-1/' ray-cluster.volcano-scheduler-queue.yaml | kubectl apply -f-

Check the status of its PodGroup and confirm that its phase is Pending and its last status is Unschedulable:

kubectl get podgroup ray-test-cluster-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2022-12-01T04:48:18Z"
#   generation: 2
#   name: ray-test-cluster-1-pg
#   namespace: test
#   ownerReferences:
#   - apiVersion: ray.io/v1alpha1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayCluster
#     name: test-cluster-1
#     uid: b3cf83dc-ef3a-4bb1-9c42-7d2a39c53358
#   resourceVersion: "4427976"
#   uid: 9087dd08-8f48-4592-a62e-21e9345b0872
# spec:
#   minMember: 3
#   minResources:
#     cpu: "3"
#     memory: 4Gi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2022-12-01T04:48:19Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Undetermined'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: 3956b64f-fc52-4779-831e-d379648eecfc
#     type: Unschedulable
#   phase: Pending

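Because the new cluster requires more CPU and RAM than the queue allows, none of the cluster's Pods are placed until there is enough room for all of them, even though one of the Pods would fit in the remaining 1 CPU and 2Gi of RAM. Without Volcano gang scheduling, one of the Pods would typically be placed, leaving the cluster partially allocated and causing some jobs (such as Horovod training) to get stuck waiting for resources to become available.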
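
The all-or-nothing decision can be sketched with the numbers from this example. This is only an illustration of the arithmetic, not Volcano's scheduling code: gang_fits is a hypothetical helper that admits a gang only if its whole request fits into what the queue has left, in every resource dimension.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the whole gang fits in the queue's
# remaining capacity. CPU is in whole cores, memory in Gi.
# Usage: gang_fits <cap_cpu> <cap_mem> <used_cpu> <used_mem> <req_cpu> <req_mem>
gang_fits() {
  free_cpu=$(($1 - $3))
  free_mem=$(($2 - $4))
  # Every dimension must fit; a partial fit is not enough.
  [ "$5" -le "$free_cpu" ] && [ "$6" -le "$free_mem" ]
}

# Queue: 4 CPUs / 6Gi. test-cluster-0 uses 3 CPUs / 4Gi, so 1 CPU / 2Gi remain.
# test-cluster-1 requests 3 CPUs / 4Gi as a gang.
if gang_fits 4 6 3 4 3 4; then echo "schedulable"; else echo "unschedulable"; fi
# prints: unschedulable

# After test-cluster-0 is deleted, the queue is empty and the gang fits.
if gang_fits 4 6 0 0 3 4; then echo "schedulable"; else echo "unschedulable"; fi
# prints: schedulable
```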

See the effect this has on scheduling the Pods for the new RayCluster, which are listed as Pending:

kubectl get pods

# NAME                                            READY   STATUS         RESTARTS   AGE
# test-cluster-0-worker-worker-ddfbz              1/1     Running        0          7m
# test-cluster-0-head-vst5j                       1/1     Running        0          7m
# test-cluster-0-worker-worker-57pc7              1/1     Running        0          6m59s
# test-cluster-1-worker-worker-6tzf7              0/1     Pending        0          2m12s
# test-cluster-1-head-6668q                       0/1     Pending        0          2m12s
# test-cluster-1-worker-worker-n5g8k              0/1     Pending        0          2m12s

Look at the Pod details to see that Volcano can't schedule the gang:

kubectl describe pod test-cluster-1-head-6668q | tail -n 3

# Type     Reason            Age   From     Message
# ----     ------            ----  ----     -------
# Warning  FailedScheduling  4m5s  volcano  3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending, 3 minAvailable; Pending: 3 Undetermined

Delete the first RayCluster to make space in the queue:

kubectl delete raycluster test-cluster-0

The PodGroup for the second cluster changes to the Running state, because enough resources are now available to schedule the entire set of Pods:

kubectl get podgroup ray-test-cluster-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
#   creationTimestamp: "2022-12-01T04:48:18Z"
#   generation: 9
#   name: ray-test-cluster-1-pg
#   namespace: test
#   ownerReferences:
#   - apiVersion: ray.io/v1alpha1
#     blockOwnerDeletion: true
#     controller: true
#     kind: RayCluster
#     name: test-cluster-1
#     uid: b3cf83dc-ef3a-4bb1-9c42-7d2a39c53358
#   resourceVersion: "4428864"
#   uid: 9087dd08-8f48-4592-a62e-21e9345b0872
# spec:
#   minMember: 3
#   minResources:
#     cpu: "3"
#     memory: 4Gi
#   queue: kuberay-test-queue
# status:
#   conditions:
#   - lastTransitionTime: "2022-12-01T04:54:04Z"
#     message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
#       3 minAvailable; Pending: 3 Undetermined'
#     reason: NotEnoughResources
#     status: "True"
#     transitionID: db90bbf0-6845-441b-8992-d0e85f78db77
#     type: Unschedulable
#   - lastTransitionTime: "2022-12-01T04:55:10Z"
#     reason: tasks in the gang are ready to be scheduled
#     status: "True"
#     transitionID: 72bbf1b3-d501-4528-a59d-479504f3eaf5
#     type: Scheduled
#   phase: Running
#   running: 3

Check the Pods again to see that the second cluster is now up and running:

kubectl get pods

# NAME                                            READY   STATUS         RESTARTS   AGE
# test-cluster-1-worker-worker-n5g8k              1/1     Running        0          9m4s
# test-cluster-1-head-6668q                       1/1     Running        0          9m4s
# test-cluster-1-worker-worker-6tzf7              1/1     Running        0          9m4s

Finally, clean up the remaining cluster and queue:

kubectl delete raycluster test-cluster-1
kubectl delete queue kuberay-test-queue