RayService worker Pods are not ready#

This guide explores a specific scenario in KubeRay's RayService API where Ray worker Pods remain unready because they lack a Ray Serve replica.

To better understand this section, you should be familiar with the following Ray Serve components: the Ray Serve replica and the ProxyActor.

The ProxyActor is responsible for forwarding incoming requests to the corresponding Ray Serve replicas. Hence, if a Ray Pod without a running ProxyActor receives requests, those requests fail. KubeRay's readiness probe fails on such a Pod, marking it as unready and preventing traffic from being routed to it.
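Conceptually, the readiness check boils down to probing the proxy's health endpoint. The following is a minimal sketch of that logic; the endpoint path `/-/healthz` on port 8000 and the `success` response body are assumptions used for illustration, not details taken from this guide:

```python
import urllib.request


def proxy_ready(host: str = "localhost", port: int = 8000) -> bool:
    """Return True if a Ray Serve proxy answers its health endpoint.

    The ``/-/healthz`` path and the ``success`` body are assumptions about
    the proxy's health check, used here only for illustration.
    """
    try:
        with urllib.request.urlopen(
            f"http://{host}:{port}/-/healthz", timeout=2
        ) as resp:
            return resp.read().strip() == b"success"
    except OSError:
        # Nothing is listening on the proxy port: the Pod stays unready.
        return False
```

A Pod with no ProxyActor has nothing listening on the proxy port, so the check fails and the Pod remains unready.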

The default behavior of Ray Serve is to create the ProxyActor on Ray Pods that run Ray Serve replicas. To illustrate this, the following example uses a RayService to serve a simple Ray Serve application.

Step 1: Create a Kubernetes cluster with Kind#

kind create cluster --image=kindest/node:v1.26.0

Step 2: Install the KubeRay operator#

Follow this document to install the latest stable KubeRay operator from the Helm repository.

Step 3: Install a RayService#

curl -O https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-service.no-ray-serve-replica.yaml
kubectl apply -f ray-service.no-ray-serve-replica.yaml

Check the Ray Serve configuration serveConfigV2 embedded in the RayService YAML. Notice that the application simple_app has only one deployment in deployments:

  • num_replicas: Controls the number of replicas that handle requests to this deployment. Initialized to 1 so that the total number of Ray Serve replicas is 1.

  • max_replicas_per_node: Controls the maximum number of replicas on a single Pod.

See the Ray Serve documentation for more details.
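Together, these two settings determine how many worker Pods are needed to host all Serve replicas. A minimal sketch of that arithmetic (the helper name is made up for illustration):

```python
import math


def min_worker_pods(num_replicas: int, max_replicas_per_node: int) -> int:
    """Minimum number of worker Pods needed to place all Serve replicas."""
    return math.ceil(num_replicas / max_replicas_per_node)


# With the config in this guide (num_replicas=1, max_replicas_per_node=1),
# a single worker Pod hosts the only replica; a second worker Pod stays
# empty and therefore unready.
print(min_worker_pods(1, 1))  # 1
print(min_worker_pods(2, 1))  # 2 (the situation after Step 7's update)
```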

serveConfigV2: |
  applications:
    - name: simple_app
      import_path: ray-operator.config.samples.ray-serve.single_deployment_dag:DagNode
      route_prefix: /basic
      runtime_env:
        working_dir: "https://github.com/ray-project/kuberay/archive/master.zip"
      deployments:
        - name: BaseService
          num_replicas: 1
          max_replicas_per_node: 1
          ray_actor_options:
            num_cpus: 0.1

Check the head Pod configuration rayClusterConfig:headGroupSpec embedded in the RayService YAML.
The configuration sets the head Pod's CPU resources to 0 by passing the option num-cpus: "0" to rayStartParams. This setting avoids running any Ray Serve replicas on the head Pod. See rayStartParams for more details.

headGroupSpec:
  rayStartParams:
    num-cpus: "0"
  template: ...

Step 4: Why is one worker Pod not ready?#

# Step 4.1: Wait until the RayService is ready to serve requests.
kubectl describe rayservices.ray.io rayservice-no-ray-serve-replica

# [Example output]
#  Conditions:
#    Last Transition Time:  2025-03-18T14:14:43Z
#    Message:               Number of serve endpoints is greater than 0
#    Observed Generation:   1
#    Reason:                NonZeroServeEndpoints
#    Status:                True
#    Type:                  Ready
#    Last Transition Time:  2025-03-18T14:12:03Z
#    Message:               Active Ray cluster exists and no pending Ray cluster
#    Observed Generation:   1
#    Reason:                NoPendingCluster
#    Status:                False
#    Type:                  UpgradeInProgress

# Step 4.2: List all Ray Pods in the `default` namespace.
kubectl get pods -l=ray.io/is-ray-node=yes

# [Example output]
# NAME                                                              READY   STATUS    RESTARTS   AGE
# rayservice-no-ray-serve-replica-raycluster-dnm28-head-9h2qt       1/1     Running   0          2m21s
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-46t7l   1/1     Running   0          2m21s
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk   0/1     Running   0          2m20s

# Step 4.3: Check unready worker pod events
kubectl describe pods {YOUR_UNREADY_WORKER_POD_NAME}

# [Example output]
# Events:
#   Type     Reason     Age                   From               Message
#   ----     ------     ----                  ----               -------
#   Normal   Scheduled  3m4s                  default-scheduler  Successfully assigned default/rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk to kind-control-plane
#   Normal   Pulled     3m3s                  kubelet            Container image "rayproject/ray:2.46.0" already present on machine
#   Normal   Created    3m3s                  kubelet            Created container wait-gcs-ready
#   Normal   Started    3m3s                  kubelet            Started container wait-gcs-ready
#   Normal   Pulled     2m57s                 kubelet            Container image "rayproject/ray:2.46.0" already present on machine
#   Normal   Created    2m57s                 kubelet            Created container ray-worker
#   Normal   Started    2m57s                 kubelet            Started container ray-worker
#   Warning  Unhealthy  78s (x19 over 2m43s)  kubelet            Readiness probe failed: success

Look at the output of Step 4.2. One worker Pod is running and ready, while the other is running but not ready.
Starting from Ray 2.8, a Ray worker Pod that doesn't run any Ray Serve replica doesn't have a Proxy actor.
Starting from KubeRay v1.1.0, KubeRay adds a readiness probe to every worker Pod's Ray container to check whether the worker Pod has a Proxy actor.
If the worker Pod lacks a Proxy actor, the readiness probe fails, which marks the worker Pod as unready, and it therefore receives no traffic.

With spec.serveConfigV2, KubeRay creates only one Ray Serve replica and schedules it on one of the worker Pods. KubeRay marks the worker Pod that has the Ray Serve replica and the Proxy actor as ready, and marks the other worker Pod, which has neither, as unready.
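You can also detect the unready worker programmatically from `kubectl get pods -o json` output. A sketch, run here against a hard-coded sample that mirrors the output of Step 4.2 (in practice, feed real kubectl output into the function; the fields follow the standard Kubernetes Pod schema):

```python
import json

# Sample mirroring `kubectl get pods -l=ray.io/is-ray-node=yes -o json`,
# trimmed to the fields the function below reads.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "rayservice-no-ray-serve-replica-raycluster-dnm28-head-9h2qt"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-46t7l"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk"},
   "status": {"conditions": [{"type": "Ready", "status": "False"}]}}
]}
""")


def unready_pods(pod_list: dict) -> list[str]:
    """Names of Pods whose Ready condition isn't True."""
    names = []
    for pod in pod_list["items"]:
        ready = any(
            c["type"] == "Ready" and c["status"] == "True"
            for c in pod["status"].get("conditions", [])
        )
        if not ready:
            names.append(pod["metadata"]["name"])
    return names


print(unready_pods(sample))
# ['rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk']
```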

Step 5: Verify the status of the Serve applications#

kubectl port-forward svc/rayservice-no-ray-serve-replica-head-svc 8265:8265

See rayservice-troubleshooting.md for more details on RayService observability.

Below is an example screenshot of the Serve page in the Ray dashboard.
Note that one worker Pod runs ray::ServeReplica::simple_app::BaseService and a ray::ProxyActor, while the other worker Pod runs neither a Ray Serve replica nor a Proxy actor. KubeRay marks the former as ready and the latter as unready.

(Screenshot: Ray Serve dashboard)
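Besides the dashboard UI, the application status can also be read from the dashboard's Serve REST API once the port-forward above is running (e.g. `curl localhost:8265/api/serve/applications/`). The sketch below parses a hard-coded sample instead of a live response; the endpoint and response shape are assumptions based on Serve's REST API, and only the fields actually used are shown:

```python
import json

# Illustrative response from GET /api/serve/applications/ (assumed shape,
# trimmed to the fields read below).
sample = json.loads("""
{"applications": {"simple_app": {"status": "RUNNING", "route_prefix": "/basic"}}}
""")


def app_statuses(details: dict) -> dict[str, str]:
    """Map each Serve application name to its reported status."""
    return {name: app["status"] for name, app in details["applications"].items()}


print(app_statuses(sample))  # {'simple_app': 'RUNNING'}
```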

Step 6: Send requests to the Serve applications via the Kubernetes serve service#

rayservice-no-ray-serve-replica-serve-svc routes traffic among all the workers that have Ray Serve replicas. Even though one worker Pod is unready, Ray Serve can still route traffic to the ready worker Pod that runs the Ray Serve replica. Hence, users can still send requests to the application and receive responses.

# Step 6.1: Run a curl Pod.
# If you already have a curl Pod, you can use `kubectl exec -it <curl-pod> -- sh` to access the Pod.
kubectl run curl --image=radial/busyboxplus:curl -i --tty

# Step 6.2: Send a request to the simple_app.
curl -X POST -H 'Content-Type: application/json' rayservice-no-ray-serve-replica-serve-svc:8000/basic
# [Expected output]: hello world

Step 7: In-place update for Ray Serve applications#

Update num_replicas of the application from 1 to 2 in ray-service.no-ray-serve-replica.yaml. This change reconfigures the existing RayCluster.

# Step 7.1: Update the num_replicas of the app from 1 to 2.
# [ray-service.no-ray-serve-replica.yaml]
# deployments:
#   - name: BaseService
#     num_replicas: 2
#     max_replicas_per_node: 1
#     ray_actor_options:
#       num_cpus: 0.1

# Step 7.2: Apply the updated RayService config.
kubectl apply -f ray-service.no-ray-serve-replica.yaml

# Step 7.3: List all Ray Pods in the `default` namespace.
kubectl get pods -l=ray.io/is-ray-node=yes

# [Example output]
# NAME                                                              READY   STATUS    RESTARTS   AGE
# rayservice-no-ray-serve-replica-raycluster-dnm28-head-9h2qt       1/1     Running   0          46m
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-46t7l   1/1     Running   0          46m
# rayservice-no-ray-serve-replica-raycluster-dnm28-s-worker-77rzk   1/1     Running   0          46m

After the reconfiguration, KubeRay asks the head Pod to create one additional Ray Serve replica to match the num_replicas setting. Because max_replicas_per_node is 1, the new Ray Serve replica runs on the worker Pod that previously had no replica. After that, KubeRay marks that worker Pod as ready.

Step 8: Clean up the Kubernetes cluster#

# Delete the RayService.
kubectl delete -f ray-service.no-ray-serve-replica.yaml

# Uninstall the KubeRay operator.
helm uninstall kuberay-operator

# Delete the curl Pod.
kubectl delete pod curl

Next steps#