RayService high availability#

RayService provides high availability to ensure that the service can continue serving requests when the Ray head Pod fails.

Prerequisites#

  • Use RayService with KubeRay 1.3.0 or later.

  • Enable GCS fault tolerance in the RayService.
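
As a sketch, GCS fault tolerance is enabled by pointing the RayCluster at an external Redis instance. The excerpt below is a hypothetical fragment, not the full sample; the field names follow KubeRay's `gcsFaultToleranceOptions` API, and the Redis address is a placeholder:

```yaml
# Hypothetical excerpt of a RayService spec. The sample YAML applied in
# step 3 already contains a complete, working configuration.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-ha
spec:
  rayClusterConfig:
    gcsFaultToleranceOptions:
      redisAddress: "redis:6379"  # address of your Redis service (placeholder)
```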

Quickstart#

Step 1: Create a Kubernetes cluster with Kind#

kind create cluster --image=kindest/node:v1.26.0

Step 2: Install the KubeRay operator#

Follow this document to install the latest stable KubeRay operator from the Helm repository.
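
The usual Helm steps look roughly like the following; the exact version pin is an example, so check the linked document for the current release:

```shell
# Add the KubeRay Helm repository and install the operator.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Pin a version >= 1.3.0 per the prerequisites above (example version).
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0
```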

Step 3: Install a RayService with GCS fault tolerance#

kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.high-availability.yaml

The ray-service.high-availability.yaml file contains several Kubernetes objects:

  • Redis: Redis is required to enable GCS fault tolerance. See GCS fault tolerance for more details.

  • RayService: This RayService custom resource includes a 3-node RayCluster and a simple Ray Serve application.

  • Ray Pod: This Pod sends requests to the RayService.

Step 4: Verify the Kubernetes Serve service#

Check the output of the following commands to verify that you successfully launched the Kubernetes Serve service:

# Step 4.1: Wait until the RayService is ready to serve requests.
kubectl describe rayservices.ray.io rayservice-ha

# [Example output]
#   Conditions:
#     Last Transition Time:  2025-02-13T21:36:18Z
#     Message:               Number of serve endpoints is greater than 0
#     Observed Generation:   1
#     Reason:                NonZeroServeEndpoints
#     Status:                True
#     Type:                  Ready 

# Step 4.2: `rayservice-ha-serve-svc` should have 3 endpoints, including the Ray head and two Ray workers.
kubectl describe svc rayservice-ha-serve-svc

# [Example output]
# Endpoints:         10.244.0.29:8000,10.244.0.30:8000,10.244.0.32:8000
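
Instead of repeatedly running `kubectl describe`, you can block until the `Ready` condition shown in step 4.1 becomes true. This relies on `kubectl wait` supporting condition checks on custom resources:

```shell
# Block until the RayService reports the Ready condition (example timeout).
kubectl wait --for=condition=Ready rayservices.ray.io/rayservice-ha --timeout=300s
```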

Step 5: Verify the Serve applications#

In the ray-service.high-availability.yaml file, the serveConfigV2 parameter specifies num_replicas: 2 and max_replicas_per_node: 1 for each Ray Serve deployment. In addition, the YAML sets the rayStartParams parameter to num-cpus: "0" to ensure that the system doesn't schedule any Ray Serve replicas on the Ray head Pod.

To sum up, each Ray Serve deployment has two replicas, and each Ray node can hold at most one of those two replicas. Furthermore, Ray Serve replicas can't be scheduled onto the Ray head Pod. Therefore, each worker node should have exactly one Ray Serve replica for each Ray Serve deployment.

For Ray Serve, the Ray head node always has an HTTPProxyActor, whether or not it hosts Ray Serve replicas. Ray worker nodes only have HTTPProxyActors when they host Ray Serve replicas. That's why the rayservice-ha-serve-svc service from the previous step has 3 endpoints.
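
As a cross-check, assuming the Ray CLI is available inside the head Pod, `serve status` prints the proxies and deployment replicas, which should line up with the three endpoints above:

```shell
# Inspect proxies and replicas from inside the head Pod (assumes the
# `serve` CLI from the Ray installation in the container).
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- serve status
```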

# Port forward the Ray Dashboard.
kubectl port-forward svc/rayservice-ha-head-svc 8265:8265
# Visit ${YOUR_IP}:8265 in your browser for the Dashboard (e.g. 127.0.0.1:8265)
# Check:
# (1) Both head and worker nodes have HTTPProxyActors.
# (2) Only worker nodes have Ray Serve replicas.
# (3) Each worker node has one Ray Serve replica for each Ray Serve deployment.

Step 6: Send requests to the RayService#

# Log into the separate Ray Pod.
kubectl exec -it ray-pod -- bash

# Send requests to the RayService.
python3 samples/query.py

# This script sends the same request to the RayService consecutively, ensuring at most one in-flight request at a time.
# The request is equivalent to `curl -X POST -H 'Content-Type: application/json' localhost:8000/fruit/ -d '["PEAR", 12]'`.

# [Example output]
# req_index : 2197, num_fail: 0
# response: 12
# req_index : 2198, num_fail: 0
# response: 12
# req_index : 2199, num_fail: 0
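
The script's behavior can be sketched as follows. This is a hypothetical reimplementation, not the shipped samples/query.py; the URL and payload come from the curl equivalent above:

```python
import json
import urllib.request


def query_once(url="http://localhost:8000/fruit/", payload=("PEAR", 12)):
    """Send one POST request, equivalent to the curl command above."""
    req = urllib.request.Request(
        url,
        data=json.dumps(list(payload)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()


def run_loop(send=query_once, num_requests=10):
    """Send requests back to back, so at most one is in flight at a time."""
    num_fail = 0
    for req_index in range(num_requests):
        try:
            response = send()
            print(f"req_index : {req_index}, num_fail: {num_fail}")
            print(f"response: {response}")
        except Exception:
            num_fail += 1
    return num_fail
```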

Step 7: Delete the Ray head Pod#

# Step 7.1: Delete the Ray head Pod.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl delete pod $HEAD_POD
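
To watch the KubeRay operator recreate the head Pod, you can reuse the selector from step 7.1:

```shell
# Step 7.2 (optional): Watch the head Pod get recreated.
kubectl get pods --selector=ray.io/node-type=head --watch
```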

In this example, query.py ensures that at most one request is in flight at any given time. In addition, the Ray head Pod hosts no Ray Serve replicas, so a request can only fail while it's in the HTTPProxyActor on the Ray head Pod. Hence, the chance of a failure during the Ray head Pod's deletion and recovery is very low. You can implement retry logic in the Ray script to handle failures.

# [Expected output]: The `num_fail` is highly likely to be 0.
req_index : 32503, num_fail: 0
response: 12
req_index : 32504, num_fail: 0
response: 12
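
A minimal sketch of the retry logic that the previous paragraph suggests; `send_with_retry` and its parameters are hypothetical names, not part of the sample:

```python
import time


def send_with_retry(send, max_attempts=3, backoff_s=0.5):
    """Call `send()` and retry on failure, e.g. while the head Pod recovers."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted retries; surface the error to the caller
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
```

Wrapping each request from the loop in step 6 with this helper would absorb the brief window where the head Pod's proxy is unavailable.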

Step 8: Clean up#

kind delete cluster