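KubeRay integration with Apache YuniKorn

Apache YuniKorn is a lightweight, universal resource scheduler for container orchestration systems. It performs fine-grained resource sharing efficiently for a variety of workloads in large-scale, multi-tenant, cloud-native environments. YuniKorn provides a unified, cross-platform scheduling experience for mixed workloads that combine stateless batch jobs and stateful services.

KubeRay's Apache YuniKorn integration enables more efficient scheduling of Ray Pods in multi-tenant Kubernetes environments.

Note

This feature requires KubeRay version 1.2.2 or later and is currently in alpha.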

Step 1: Create a Kubernetes cluster with KinD

Run the following command in a terminal:

kind create cluster

Step 2: Install Apache YuniKorn

Before enabling the KubeRay integration, Apache YuniKorn must be installed successfully on the Kubernetes cluster. See the Apache YuniKorn Get Started guide for installation instructions.
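For orientation, a Helm-based install might look like the sketch below. The chart repository URL, the release name `yunikorn`, and the wrapper function `install_yunikorn` are assumptions based on the Apache YuniKorn Get Started guide rather than anything stated on this page, so check them against the release you install. The `yunikorn` namespace matches the one the later steps use.

```shell
# Hypothetical helper: installs Apache YuniKorn with Helm.
# Assumption: the repo URL and chart name follow the YuniKorn Get Started guide.
install_yunikorn() {
  helm repo add yunikorn https://apache.github.io/yunikorn-release &&
  helm repo update &&
  # Later steps expect YuniKorn to live in the `yunikorn` namespace.
  helm install yunikorn yunikorn/yunikorn --namespace yunikorn --create-namespace
}
```

Calling `install_yunikorn` once against the KinD cluster from Step 1 should be enough for the rest of this walkthrough.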

Step 3: Install the KubeRay Operator with Apache YuniKorn support

When installing the KubeRay Operator with Helm, pass the --set batchScheduler.name=yunikorn flag on the command line:

helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0 --set batchScheduler.name=yunikorn
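The install command above refers to the chart as `kuberay/kuberay-operator`, which presumes a Helm repository named `kuberay` is already registered. If it is not, something like the following sketch can be run first. The repository URL is the one commonly used for the KubeRay Helm charts and is an assumption here, as is the helper name `add_kuberay_repo`.

```shell
# Hypothetical helper: registers the KubeRay Helm repository.
# Assumption: the charts are served from the URL below.
add_kuberay_repo() {
  helm repo add kuberay https://ray-project.github.io/kuberay-helm/ &&
  helm repo update
}
```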

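Step 4: Use Apache YuniKorn for gang scheduling

This example demonstrates gang scheduling with Apache YuniKorn and KubeRay.

First, create a queue with a capacity of 4 CPUs and 6G of memory by editing the ConfigMap.

Run kubectl edit configmap -n yunikorn yunikorn-defaults

Helm creates this ConfigMap when it installs the Apache YuniKorn Helm chart.

Add a queues.yaml entry under the data key. The ConfigMap should look like the following: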

apiVersion: v1
kind: ConfigMap
metadata:
  # Metadata for the ConfigMap, skip for brevity.
data:
  queues.yaml: |
    partitions:
      - name: default
        queues:
          - name: root
            queues:
              - name: test
                submitacl: "*"
                parent: false
                resources:
                  guaranteed:
                    memory: 6G
                    vcore: 4
                  max:
                    memory: 6G
                    vcore: 4

Save the changes and exit the editor. This configuration creates a queue named root.test with a capacity of 4 CPUs and 6G of memory.
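To confirm that the edit was saved, the stored queue configuration can be read back from the ConfigMap. This is a minimal sketch; the helper name `show_yunikorn_queues` is made up for illustration, while the ConfigMap name and namespace are the ones used above.

```shell
# Hypothetical helper: prints the queues.yaml stored in the YuniKorn ConfigMap.
# The dot in the key name is escaped so jsonpath treats "queues.yaml" as one key.
show_yunikorn_queues() {
  kubectl get configmap yunikorn-defaults -n yunikorn \
    -o jsonpath='{.data.queues\.yaml}'
}
```

If the output is empty, the queues.yaml key is likely missing or was added outside the data section.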

Next, create a RayCluster with a head node that has 1 CPU and 2GiB of memory and two worker nodes that each have 1 CPU and 1GiB of memory, for a total request of 3 CPUs and 4GiB of memory.

# Path: kuberay/ray-operator/config/samples
# Configure the necessary labels on the RayCluster custom resource for Apache YuniKorn scheduler's gang scheduling:
# - `ray.io/gang-scheduling-enabled`: Set to `true` to enable gang scheduling.
# - `yunikorn.apache.org/app-id`: Set to a unique identifier for the application in Kubernetes, even across different namespaces.
# - `yunikorn.apache.org/queue`: Set to the name of one of the queues in Apache YuniKorn.
wget https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.yunikorn-scheduler.yaml
kubectl apply -f ray-cluster.yunikorn-scheduler.yaml

Check the RayCluster that the KubeRay Operator created:

$ kubectl describe raycluster test-yunikorn-0

Name:         test-yunikorn-0
Namespace:    default
Labels:       ray.io/gang-scheduling-enabled=true
              yunikorn.apache.org/app-id=test-yunikorn-0
              yunikorn.apache.org/queue=root.test
Annotations:  <none>
API Version:  ray.io/v1
Kind:         RayCluster
Metadata:
  Creation Timestamp:  2024-09-29T09:52:30Z
  Generation:          1
  Resource Version:    951
  UID:                 cae1dbc9-5a67-4b43-b0d9-be595f21ab85
# Other fields are skipped for brevity

Note the labels on the RayCluster: ray.io/gang-scheduling-enabled=true, yunikorn.apache.org/app-id=test-yunikorn-0, and yunikorn.apache.org/queue=root.test.

Note

The ray.io/gang-scheduling-enabled label is required only when you need gang scheduling. If you don't set this label, YuniKorn schedules the Ray cluster without enforcing gang scheduling.
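For a more compact view than kubectl describe, the three labels can be shown as columns next to the RayCluster. The sketch below relies on the `-L` (label columns) flag of kubectl get; the helper name `show_yunikorn_labels` is hypothetical.

```shell
# Hypothetical helper: lists a RayCluster with its scheduling labels as columns.
# Usage: show_yunikorn_labels <raycluster-name>
show_yunikorn_labels() {
  kubectl get raycluster "$1" \
    -L ray.io/gang-scheduling-enabled \
    -L yunikorn.apache.org/app-id \
    -L yunikorn.apache.org/queue
}
```

For example, `show_yunikorn_labels test-yunikorn-0` prints one row for the cluster. An empty gang-scheduling column would suggest that the cluster is being scheduled without gang scheduling.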

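Because the queue has a capacity of 4 CPUs and 6G of memory, and the cluster requests only 3 CPUs and 4GiB, this resource should schedule successfully.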

$ kubectl get pods

NAME                                  READY   STATUS    RESTARTS   AGE
test-yunikorn-0-head-98fmp            1/1     Running   0          67s
test-yunikorn-0-worker-worker-42tgg   1/1     Running   0          67s
test-yunikorn-0-worker-worker-467mn   1/1     Running   0          67s

Verify the scheduling by checking the Apache YuniKorn dashboard:

kubectl port-forward svc/yunikorn-service 9889:9889 -n yunikorn

Go to http://localhost:9889/#/applications to see the running applications.

Apache YuniKorn dashboard

Next, add a second RayCluster with the same head and worker node configuration but a different name:

# Replace the name with `test-yunikorn-1`.
sed 's/test-yunikorn-0/test-yunikorn-1/' ray-cluster.yunikorn-scheduler.yaml | kubectl apply -f-

All of the Pods for test-yunikorn-1 are now in the Pending state:

$ kubectl get pods

NAME                                      READY   STATUS    RESTARTS   AGE
test-yunikorn-0-head-98fmp                1/1     Running   0          4m22s
test-yunikorn-0-worker-worker-42tgg       1/1     Running   0          4m22s
test-yunikorn-0-worker-worker-467mn       1/1     Running   0          4m22s
test-yunikorn-1-head-xl2r5                0/1     Pending   0          71s
test-yunikorn-1-worker-worker-l6ttz       0/1     Pending   0          71s
test-yunikorn-1-worker-worker-vjsts       0/1     Pending   0          71s
tg-test-yunikorn-1-headgroup-vgzvoot0dh   0/1     Pending   0          69s
tg-test-yunikorn-1-worker-eyti2bn2jv      1/1     Running   0          69s
tg-test-yunikorn-1-worker-k8it0x6s73      0/1     Pending   0          69s

The Pods with the tg- prefix are placeholder Pods that Apache YuniKorn creates for gang scheduling.

Go to http://localhost:9889/#/applications to see that test-yunikorn-1 is in the Accepted state but not yet running:

Apache YuniKorn dashboard

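The new cluster requires more CPU and memory than the queue has left. Even though one of its Pods would fit in the remaining 1 CPU and roughly 2G of memory, Apache YuniKorn doesn't place any of the cluster's Pods until there is enough room for all of them. Without gang scheduling, one of the Pods would be placed and the cluster would be only partially allocated.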
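The all-or-nothing decision can be illustrated with a small arithmetic model. This is only a sketch of the idea, not how Apache YuniKorn is implemented: the function name `gang_fits` and the `CPU:MEM` argument format are made up for illustration, and memory is counted in whole gigabytes, treating G and GiB as the same unit for simplicity.

```shell
# Illustrative model of gang admission (not YuniKorn's actual algorithm).
# Usage: gang_fits FREE_CPU FREE_MEM CPU:MEM [CPU:MEM ...]
# Prints "schedule" only if the summed request of every Pod fits in the free
# queue capacity; otherwise prints "wait".
gang_fits() {
  free_cpu=$1; free_mem=$2; shift 2
  need_cpu=0; need_mem=0
  for pod in "$@"; do
    need_cpu=$((need_cpu + ${pod%%:*}))   # part before the colon: CPUs
    need_mem=$((need_mem + ${pod##*:}))   # part after the colon: memory
  done
  if [ "$need_cpu" -le "$free_cpu" ] && [ "$need_mem" -le "$free_mem" ]; then
    echo schedule
  else
    echo wait
  fi
}
```

With the empty queue, `gang_fits 4 6 1:2 1:1 1:1` prints `schedule`, because the first cluster's 3 CPUs and 4 units of memory fit. With 1 CPU and 2 units left, `gang_fits 1 2 1:2 1:1 1:1` prints `wait`, even though `gang_fits 1 2 1:2` (the head Pod alone) prints `schedule`.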

Delete the first RayCluster to free up resources in the queue:

kubectl delete raycluster test-yunikorn-0

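All of the Pods for the second cluster now change to the Running state, because the queue has enough free resources to schedule the entire set of Pods.

Check the Pods again to see that the second cluster is up and running: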

$ kubectl get pods

NAME                                  READY   STATUS    RESTARTS   AGE
test-yunikorn-1-head-xl2r5            1/1     Running   0          3m34s
test-yunikorn-1-worker-worker-l6ttz   1/1     Running   0          3m34s
test-yunikorn-1-worker-worker-vjsts   1/1     Running   0          3m34s
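Instead of polling kubectl get pods by hand, a script can block until the cluster's Pods are Ready. The sketch below assumes that KubeRay labels the Pods of a RayCluster with `ray.io/cluster=<cluster name>`; that label, the helper name `wait_for_raycluster_pods`, and the default timeout are assumptions to verify in your environment.

```shell
# Hypothetical helper: waits until all Pods of a RayCluster report Ready.
# Usage: wait_for_raycluster_pods <raycluster-name> [timeout]
# Assumption: KubeRay sets the ray.io/cluster label on the cluster's Pods.
wait_for_raycluster_pods() {
  kubectl wait --for=condition=Ready pod \
    -l "ray.io/cluster=$1" --timeout="${2:-300s}"
}
```

For example, `wait_for_raycluster_pods test-yunikorn-1` returns once the head and both workers are Ready, or fails after the timeout.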

Clean up the resources:

kubectl delete raycluster test-yunikorn-1