KubeRay label-based scheduling#

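This guide explains how to use label-based scheduling for Ray clusters on Kubernetes. This feature lets you direct Ray workloads (tasks, actors, or placement groups) to specific Ray nodes, running in Pods, that carry specific labels. Label selectors give you fine-grained control over where workloads run in a heterogeneous cluster, which helps optimize both performance and cost.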

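Label-based scheduling is an essential tool for heterogeneous clusters, where your RayCluster might contain different types of nodes for different purposes, such as:

Nodes with different accelerator types, like A100 GPUs or Trillium TPUs.
Nodes with different CPU families, like Intel or AMD.
Nodes with different instance types related to cost and availability, such as spot or on-demand instances.
Nodes in different failure domains or with region or zone requirements.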

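The Ray scheduler uses the label_selector specified in the @ray.remote decorator to filter on the labels defined on Ray nodes. In KubeRay, you set Ray node labels through the labels defined in the RayCluster custom resource.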

Label selectors are an experimental feature in Ray 2.49.1.

Full autoscaling support for tasks, actors, and placement groups with label selectors is available in Ray 2.51.0 and KubeRay v1.5.1.
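The version requirements above can be checked before relying on autoscaling with label selectors. The following is a minimal sketch, not part of Ray or KubeRay: the helper names `parse_version` and `meets_minimum` are hypothetical, and the parser assumes plain numeric versions (no pre-release suffixes).

```python
def parse_version(text):
    """Turn a version string such as 'v1.5.1' or '2.51.0' into a tuple of ints."""
    parts = text.lstrip("v").split(".")
    return tuple(int(p) for p in parts[:3])


def meets_minimum(installed, minimum):
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


# Minimum versions quoted in this guide for full autoscaling support.
MIN_RAY = "2.51.0"
MIN_KUBERAY = "v1.5.1"

print(meets_minimum("2.49.1", MIN_RAY))      # experimental release, below the minimum
print(meets_minimum("2.51.0", MIN_RAY))
print(meets_minimum("v1.5.1", MIN_KUBERAY))
```

In practice the installed Ray version string would come from `ray.__version__`, and the KubeRay version from whatever chart or image tag you deployed.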

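Overview#

There are three scheduling steps to understand when using KubeRay with label-based scheduling:

Ray workload: A Ray application requests resources with a label_selector, specifying that the workload should be scheduled on nodes that carry those labels. For example: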

@ray.remote(num_gpus=1, label_selector={"ray.io/accelerator-type": "A100"})
def gpu_task():
    pass

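RayCluster CR: The RayCluster CRD defines, through HeadGroupSpec and WorkerGroupSpecs, the types of nodes available for scheduling (or for scaling up through autoscaling). To set Ray node labels for a given group, specify them under the top-level Labels field. When KubeRay creates a Pod for that group, it sets these labels in the Ray runtime environment. For RayClusters with autoscaling enabled, KubeRay also adds these labels to the autoscaling configuration used for scheduling Ray workloads. For example: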

headGroupSpec:
  labels:
    ray.io/region: us-central2
...
workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    groupName: intel-cpu-group
    labels:
      cpu-family: intel
      ray.io/market-type: on-demand
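When several worker groups differ only in their labels and replica bounds, it can help to generate the group entries programmatically. The sketch below is an illustration only: `worker_group` is a hypothetical helper, and it emits just the fields shown in the fragment above (a complete RayCluster also needs a Pod template and other fields omitted here).

```python
import json


def worker_group(group_name, labels, min_replicas=1, max_replicas=10, replicas=1):
    """Build a dict that mirrors the workerGroupSpecs entry shown above."""
    if not min_replicas <= replicas <= max_replicas:
        raise ValueError("replicas must lie between minReplicas and maxReplicas")
    return {
        "replicas": replicas,
        "minReplicas": min_replicas,
        "maxReplicas": max_replicas,
        "groupName": group_name,
        # Top-level labels become Ray node labels for every Pod in the group.
        "labels": dict(labels),
    }


intel_group = worker_group(
    "intel-cpu-group",
    {"cpu-family": "intel", "ray.io/market-type": "on-demand"},
)

# JSON is a subset of YAML, so this output can be pasted into a manifest.
print(json.dumps({"workerGroupSpecs": [intel_group]}, indent=2))
```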

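Kubernetes scheduler: To ensure that Ray Pods land on the correct physical hardware, add standard Kubernetes scheduling features such as nodeSelector or podAffinity to the Pod template. Similar to how Ray treats label selectors, the Kubernetes scheduler filters the underlying nodes in the Kubernetes cluster based on these labels when it schedules the Pod. For example, you could add the following nodeSelector to the intel-cpu-group above so that both Ray and Kubernetes constrain scheduling: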

nodeSelector:
    cloud.google.com/machine-family: "N4"
    cloud.google.com/gke-spot: "false"
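On the Kubernetes side, a nodeSelector is an exact-match filter: a node is eligible only if it carries every listed key with the listed value. The sketch below models that rule in plain Python; it is not Kubernetes code, and the node names and their labels are invented for illustration.

```python
def node_selector_matches(node_selector, node_labels):
    """A nodeSelector matches only if every key/value pair is present on the node."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


# The nodeSelector from the example above.
selector = {
    "cloud.google.com/machine-family": "N4",
    "cloud.google.com/gke-spot": "false",
}

# Hypothetical Kubernetes nodes, invented for illustration.
k8s_nodes = {
    "n4-on-demand": {"cloud.google.com/machine-family": "N4", "cloud.google.com/gke-spot": "false"},
    "n4-spot": {"cloud.google.com/machine-family": "N4", "cloud.google.com/gke-spot": "true"},
    "unlabeled": {},
}

eligible = [name for name, labels in k8s_nodes.items() if node_selector_matches(selector, labels)]
print(eligible)  # ['n4-on-demand']
```

If no Kubernetes node satisfies the selector, the Pod stays Pending, so the labels on the underlying nodes need to agree with what the Pod template requests.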

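This quickstart demonstrates all three steps working together.

Quickstart#

Step 1: [Optional] Create a Kubernetes cluster with Kind#

If you don't already have a Kubernetes cluster, create a new one with Kind for testing. If you already use a cloud provider's Kubernetes service such as GKE, skip this step.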

kind create cluster --image=kindest/node:v1.26.0

# Mock underlying nodes with GKE-related labels. This is necessary for the `nodeSelector` to be able to schedule Pods.
kubectl label node kind-control-plane \
  cloud.google.com/machine-family="N4" \
  cloud.google.com/gke-spot="true" \
  cloud.google.com/gke-accelerator="nvidia-tesla-a100"

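This quickstart uses Kind for simplicity.

In a real-world scenario, you would use a cloud provider's Kubernetes service (such as GKE or EKS) that offers different machine types, for example GPU nodes and spot instances.

Step 2: Install the KubeRay operator#

Follow the KubeRay operator installation guide to install the latest stable KubeRay operator from the Helm repository. The minimum KubeRay version for this guide is v1.5.1.

Step 3: Create a RayCluster CR with autoscaling enabled and labels specified#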

kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster-label-selector.yaml

Step 4: Verify the Kubernetes cluster status#

# Step 4.1: List all Ray Pods in the `default` namespace.
kubectl get pods -l=ray.io/is-ray-node=yes

# [Example output]
# NAME                                             READY   STATUS     RESTARTS   AGE
# ray-label-cluster-head-5tkn2                     2/2     Running    0          3s
# ray-label-cluster-large-cpu-group-worker-dhqmt   1/1     Running    0          3s

# Step 4.2: Check the ConfigMap in the `default` namespace.
kubectl get configmaps

# [Example output]
# NAME                  DATA   AGE
# ray-example           3      21s
# ...

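The RayCluster has one head Pod and one worker Pod that is already scaled up. The head Pod has two containers: a Ray head container and a Ray autoscaler sidecar container. Additionally, ray-cluster-label-selector.yaml includes a ConfigMap named ray-example that contains three Python scripts, example_task.py, example_actor.py, and example_placement_group.py, each of which demonstrates label-based scheduling.

example_task.py is a Python script that creates a simple task requiring a node with the ray.io/market-type: on-demand and cpu-family: in(intel,amd) labels. The in operator means that the cpu-family value can be either intel or amd.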

import ray
@ray.remote(num_cpus=1, label_selector={"ray.io/market-type": "on-demand", "cpu-family": "in(intel,amd)"})
def test_task():
  pass
ray.init()
ray.get(test_task.remote())
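To make the selector semantics concrete, here is a small pure-Python model of how a selector could be evaluated against node labels. It is a sketch, not Ray's scheduler: `value_matches` and `node_matches` are hypothetical names, only the three expression forms that appear in this guide are handled (exact value, `in(...)`, and `!value`), and the node names and labels are invented.

```python
def value_matches(expr, actual):
    """Check one selector expression against a node's label value (None if unset)."""
    if expr.startswith("in(") and expr.endswith(")"):
        allowed = [v.strip() for v in expr[3:-1].split(",")]
        return actual in allowed
    if expr.startswith("!"):
        # Assumption of this sketch: a node without the label also passes a negation.
        return actual != expr[1:]
    return actual == expr


def node_matches(label_selector, node_labels):
    """A node qualifies only when every key in the selector is satisfied."""
    return all(value_matches(expr, node_labels.get(key)) for key, expr in label_selector.items())


selector = {"ray.io/market-type": "on-demand", "cpu-family": "in(intel,amd)"}

# Hypothetical Ray nodes, invented for illustration.
nodes = {
    "intel-on-demand": {"ray.io/market-type": "on-demand", "cpu-family": "intel"},
    "amd-spot": {"ray.io/market-type": "spot", "cpu-family": "amd"},
    "arm-on-demand": {"ray.io/market-type": "on-demand", "cpu-family": "arm"},
}

matching = [name for name, labels in nodes.items() if node_matches(selector, labels)]
print(matching)  # ['intel-on-demand']
```

All keys in the selector must be satisfied at once, which is why the AMD spot node is rejected even though its CPU family is allowed.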

example_actor.py is a Python script that creates a simple actor requiring a node with the ray.io/accelerator-type: A100 label. Ray sets the ray.io/accelerator-type label by default when it can detect the underlying compute.

import ray
@ray.remote(num_gpus=1, label_selector={"ray.io/accelerator-type": "A100"})
class Actor:
  def ready(self):
    return True
ray.init()
my_actor = Actor.remote()
ray.get(my_actor.ready.remote())

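example_placement_group.py is a Python script that creates a placement group requiring two 1-CPU bundles on nodes that have the ray.io/market-type: spot label but do **not** have ray.io/region: us-central2. Because the strategy is "SPREAD", we expect two separate Ray nodes with the desired labels to scale up, one node for each placement group bundle.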

import ray
from ray.util.placement_group import placement_group
ray.init()
pg = placement_group(
  [{"CPU": 1}] * 2,
  bundle_label_selector=[{"ray.io/market-type": "spot", "ray.io/region": "!us-central2"}] * 2,
  strategy="SPREAD",
)
ray.get(pg.ready())
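The expected outcome, one matching node per bundle, can be illustrated with a toy placement routine. This is a simplified model and not Ray's placement group scheduler: it ignores CPU capacity, handles only exact-match and `!value` selectors, and treats spreading as best effort by preferring nodes that hold no bundle yet. The function names, node names, and the `us-west4` region value are all invented for illustration.

```python
def bundle_fits(selector, labels):
    """Support only the two forms used above: exact match and '!value' negation."""
    for key, expr in selector.items():
        actual = labels.get(key)
        if expr.startswith("!"):
            if actual == expr[1:]:
                return False
        elif actual != expr:
            return False
    return True


def spread(bundle_selectors, nodes):
    """Greedy, best-effort spread: prefer a node that holds no bundle yet."""
    placement, used = [], set()
    for selector in bundle_selectors:
        candidates = [name for name, labels in nodes.items() if bundle_fits(selector, labels)]
        if not candidates:
            placement.append(None)  # nothing fits; an autoscaler would need to add a node
            continue
        fresh = [name for name in candidates if name not in used]
        choice = (fresh or candidates)[0]
        used.add(choice)
        placement.append(choice)
    return placement


selectors = [{"ray.io/market-type": "spot", "ray.io/region": "!us-central2"}] * 2

# Hypothetical Ray nodes, invented for illustration.
nodes = {
    "spot-a": {"ray.io/market-type": "spot", "ray.io/region": "us-west4"},
    "spot-b": {"ray.io/market-type": "spot", "ray.io/region": "us-west4"},
    "spot-central": {"ray.io/market-type": "spot", "ray.io/region": "us-central2"},
    "on-demand": {"ray.io/market-type": "on-demand", "ray.io/region": "us-west4"},
}

print(spread(selectors, nodes))  # ['spot-a', 'spot-b']
```

With only one eligible node the routine would place both bundles on it, and with none it returns `None` for each bundle, which corresponds to the case where the autoscaler has to add nodes.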

Step 5: Trigger RayCluster label-based scheduling#

# Step 5.1: Get the head pod name
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)

# Step 5.2: Run the task. The task should target the existing large-cpu-group and not require autoscaling.
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/example_task.py

# Step 5.3: Run the actor. This should cause the Ray autoscaler to scale a GPU node in accelerator-group. The Pod may not 
#           schedule unless you have GPU resources in your cluster.
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/example_actor.py

# Step 5.4: Create the placement group. This should cause the Ray autoscaler to scale two nodes in spot-group.
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/example_placement_group.py

# Step 5.5: List all nodes in the Ray cluster. The nodes scaled for the task, actor, and placement group should be annotated with
#           the expected Ray node labels.
kubectl exec -it $HEAD_POD -- ray list nodes
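Once the node list is available, a quick tally per label value shows whether the expected groups scaled up. The sketch below works on an assumed, simplified record shape (a `labels` dict per node); it does not parse the real output of `ray list nodes`, which has more fields, and the records shown are invented.

```python
from collections import Counter


def count_by_label(node_records, label_key):
    """Count nodes per value of `label_key`; nodes without the label count as 'unset'."""
    return Counter(rec["labels"].get(label_key, "unset") for rec in node_records)


# Hypothetical records; the real output of `ray list nodes` has more fields.
records = [
    {"node": "head", "labels": {"ray.io/region": "us-central2"}},
    {"node": "large-cpu-1", "labels": {"ray.io/market-type": "on-demand", "cpu-family": "intel"}},
    {"node": "spot-1", "labels": {"ray.io/market-type": "spot"}},
    {"node": "spot-2", "labels": {"ray.io/market-type": "spot"}},
]

counts = count_by_label(records, "ray.io/market-type")
print(dict(counts))
```

In this invented data set, two nodes labeled `spot` would be consistent with the placement group in Step 5.4 having scaled two nodes.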

Step 6: Clean up the Kubernetes cluster#

# Delete RayCluster and ConfigMap
kubectl delete -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster-label-selector.yaml

# Uninstall the KubeRay operator
helm uninstall kuberay-operator

# Delete the kind cluster
kind delete cluster

Next steps#

See Use labels to control scheduling for more details on label selectors in Ray.
See KubeRay Autoscaling for instructions on how to configure the Ray autoscaler with KubeRay.