RayService high availability

RayService provides high availability so that the service keeps handling requests when the Ray head Pod fails.

Prerequisites

  • Use RayService with KubeRay 1.3.0 or later.

  • Enable GCS fault tolerance in the RayService.

Quickstart

Step 1: Create a Kubernetes cluster with Kind

kind create cluster --image=kindest/node:v1.26.0

Step 2: Install the KubeRay operator

Follow this document to install the latest stable KubeRay operator from the Helm repository.

Step 3: Install a RayService with GCS fault tolerance

kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.high-availability.yaml

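The ray-service.high-availability.yaml file contains multiple Kubernetes objects:

  • Redis: Redis is required to make GCS fault tolerant. See GCS fault tolerance for more details.

  • RayService: This RayService custom resource includes a 3-node RayCluster and a simple Ray Serve application.

  • Ray Pod: This Pod sends requests to the RayService.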

Step 4: Verify the Kubernetes Serve service

Check the output of the following commands to verify that you have successfully started the Kubernetes Serve service.

# Step 4.1: Wait until the RayService is ready to serve requests.
kubectl describe rayservices.ray.io rayservice-ha

# [Example output]
#   Conditions:
#     Last Transition Time:  2025-02-13T21:36:18Z
#     Message:               Number of serve endpoints is greater than 0
#     Observed Generation:   1
#     Reason:                NonZeroServeEndpoints
#     Status:                True
#     Type:                  Ready 

# Step 4.2: `rayservice-ha-serve-svc` should have 3 endpoints, including the Ray head and two Ray workers.
kubectl describe svc rayservice-ha-serve-svc

# [Example output]
# Endpoints:         10.244.0.29:8000,10.244.0.30:8000,10.244.0.32:8000
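The readiness check in Step 4.1 can also be scripted. The following is a minimal sketch, assuming that `kubectl get ... -o json` exposes the same condition shown in the describe output under `status.conditions`, with `type` and `status` keys. The helper names `is_rayservice_ready` and `fetch_rayservice` are hypothetical and not part of KubeRay.

```python
import json
import subprocess


def is_rayservice_ready(rayservice: dict) -> bool:
    """Return True if a condition with type "Ready" has status "True".

    Assumes the JSON layout mirrors the `kubectl describe` output above.
    """
    conditions = rayservice.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def fetch_rayservice(name: str = "rayservice-ha") -> dict:
    """Fetch the RayService as JSON through kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "rayservices.ray.io", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

With a running cluster, `is_rayservice_ready(fetch_rayservice())` could be polled in a loop until it returns True before moving on to Step 4.2.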

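Step 5: Verify the Serve applications

In the ray-service.high-availability.yaml file, the serveConfigV2 parameter specifies num_replicas: 2 and max_replicas_per_node: 1 for each Ray Serve deployment. In addition, the YAML sets the rayStartParams parameter to num-cpus: "0" so that the system doesn't schedule any Ray Serve replicas on the Ray head Pod.

In summary, each Ray Serve deployment has two replicas, and each Ray node can hold at most one of those two replicas. Ray Serve replicas also can't be scheduled on the Ray head Pod. As a result, each worker node should have one Ray Serve replica for each Ray Serve deployment.

For Ray Serve, the Ray head always has an HTTPProxyActor, whether or not it has a Ray Serve replica. A Ray worker node has an HTTPProxyActor only when it has Ray Serve replicas. That is why the rayservice-ha-serve-svc service in the previous step has 3 endpoints.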
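The arithmetic behind the 3 endpoints can be written out as a toy model. This sketch is not Ray's scheduler: `place_replicas` and `expected_serve_endpoints` are hypothetical helpers that only encode the rules stated above (at most `max_replicas_per_node` replicas per worker, none on the head, one proxy on the head, and one proxy on each worker that holds a replica).

```python
def place_replicas(num_replicas, max_replicas_per_node, workers):
    """Greedily spread replicas of one deployment across worker nodes only."""
    placement = {worker: 0 for worker in workers}
    remaining = num_replicas
    for worker in workers:
        take = min(max_replicas_per_node, remaining)
        placement[worker] = take
        remaining -= take
    if remaining > 0:
        raise ValueError("not enough worker capacity for all replicas")
    return placement


def expected_serve_endpoints(placement):
    """Count the nodes that run an HTTPProxyActor under the rules above."""
    head_proxies = 1  # the head always runs an HTTPProxyActor
    worker_proxies = sum(1 for count in placement.values() if count > 0)
    return head_proxies + worker_proxies


placement = place_replicas(2, 1, ["worker-1", "worker-2"])
print(placement)                            # {'worker-1': 1, 'worker-2': 1}
print(expected_serve_endpoints(placement))  # 3
```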

# Port forward the Ray Dashboard.
kubectl port-forward svc/rayservice-ha-head-svc 8265:8265
# Visit ${YOUR_IP}:8265 in your browser for the Dashboard (e.g. 127.0.0.1:8265)
# Check:
# (1) Both head and worker nodes have HTTPProxyActors.
# (2) Only worker nodes have Ray Serve replicas.
# (3) Each worker node has one Ray Serve replica for each Ray Serve deployment.

Step 6: Send requests to the RayService

# Log into the separate Ray Pod.
kubectl exec -it ray-pod -- bash

# Send requests to the RayService.
python3 samples/query.py

# This script sends the same request to the RayService consecutively, ensuring at most one in-flight request at a time.
# The request is equivalent to `curl -X POST -H 'Content-Type: application/json' localhost:8000/fruit/ -d '["PEAR", 12]'`.

# [Example output]
# req_index : 2197, num_fail: 0
# response: 12
# req_index : 2198, num_fail: 0
# response: 12
# req_index : 2199, num_fail: 0
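For readers who want to see the shape of such a client, here is a minimal sketch of a sequential request loop. It is not the contents of samples/query.py. The function names are hypothetical, and the default URL is taken from the equivalent curl command above, so replace it with the address that is reachable from your Pod.

```python
import json
import urllib.request


def post_fruit_order(url="http://localhost:8000/fruit/", payload=("PEAR", 12)):
    """POST the JSON payload and return the response body as text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(list(payload)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def run_queries(send, num_requests):
    """Call send() strictly one at a time and return the failure count."""
    num_fail = 0
    for req_index in range(num_requests):
        print(f"req_index : {req_index}, num_fail: {num_fail}")
        try:
            response = send()
        except Exception:
            # Any error (connection refused, timeout, HTTP error) counts once.
            num_fail += 1
            continue
        print(f"response: {response}")
    return num_fail
```

Calling `run_queries(post_fruit_order, 100)` sends 100 requests back to back. Because each call blocks until the previous response arrives, at most one request is in flight at a time.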

Step 7: Delete the Ray head Pod

# Step 7.1: Delete the Ray head Pod.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl delete pod $HEAD_POD

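In this example, query.py ensures that at most one request is in flight at any given time. Furthermore, the Ray head Pod has no Ray Serve replicas. A request can fail only if it is inside the HTTPProxyActor on the Ray head Pod when the Pod is deleted. Failures are therefore highly unlikely while the Ray head Pod is deleted and recovers. You can implement retry logic in your Ray script to handle failures.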

# [Expected output]: The `num_fail` is highly likely to be 0.
req_index : 32503, num_fail: 0
response: 12
req_index : 32504, num_fail: 0
response: 12
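The retry logic mentioned in this step could look like the following sketch. It is one possible approach (a bounded number of attempts with exponential backoff), not something the sample script is known to do. `call_with_retries` and its parameters are hypothetical names, and `send` stands for any zero-argument function that issues one request.

```python
import time


def call_with_retries(send, max_attempts=5, base_delay_s=0.5, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is exhausted.

    Waits base_delay_s, then 2x, 4x, ... between attempts and re-raises
    the last error if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception as err:
            last_error = err
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    raise last_error
```

The `sleep` parameter is injectable only so that the helper can be exercised without real waiting. A short total backoff is usually enough here, since a request fails only in the narrow window where it sits in the head Pod's proxy.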

Step 8: Cleanup

kind delete cluster