KubeRay Observability#

KubeRay / Kubernetes Observability#

Check the KubeRay operator's logs for errors#

# Typically, the operator's Pod name is kuberay-operator-xxxxxxxxxx-yyyyy.
kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log

This command prints the operator's logs and also writes them to a file named operator-log. Then search the file for errors.
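The search step can be sketched as a small helper. This is a minimal example, not part of KubeRay: the function name `find_operator_errors` and the pattern `error|fail|panic` are assumptions, so adjust the pattern to whatever your operator logs actually contain.

```shell
# Hypothetical helper: list lines in a saved operator log that look like errors.
# The pattern is an assumption; KubeRay doesn't prescribe one.
find_operator_errors() {
  # -n prints line numbers, -i ignores case, -E enables the alternation.
  # Defaults to the operator-log file written by the tee command.
  grep -niE 'error|fail|panic' "${1:-operator-log}"
}
```

Run `find_operator_errors operator-log` after saving the log. The function returns a non-zero exit code when nothing matches, which makes it usable in scripts.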

Check the status and events of custom resources#

kubectl describe [raycluster|rayjob|rayservice] $CUSTOM_RESOURCE_NAME -n $YOUR_NAMESPACE

After running this command, check the events, and the state and conditions in the status of the custom resource, for any errors and progress.
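When the `describe` output is long, a compact view of the conditions can help. The sketch below assumes the resource exposes `status.conditions` (as RayCluster and RayService do in the versions described on this page); `show_conditions` is a hypothetical name, and the output format is just one possible choice.

```shell
# Hypothetical helper: print each status condition as "TYPE=STATUS (REASON)".
# Usage: show_conditions <raycluster|rayservice> <name> <namespace>
show_conditions() {
  # The JSONPath template iterates over .status.conditions and emits one line each.
  kubectl get "$1" "$2" -n "$3" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
}
```

For example, `show_conditions raycluster "$CUSTOM_RESOURCE_NAME" "$YOUR_NAMESPACE"` prints nothing if the resource has no conditions yet.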

RayCluster `.Status.State`#

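The .Status.State field represents the cluster's situation, but its limited representation restricts its usefulness. Use the new Status.Conditions field instead.

State	Description
Ready	KubeRay sets the state to `Ready` once all the Pods in the cluster are ready. The `State` remains `Ready` until KubeRay suspends the cluster.
Suspended	KubeRay sets the state to `Suspended` when `Spec.Suspend` is set to true and KubeRay has deleted all Pods in the cluster.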

RayCluster `.Status.Conditions`#

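Although Status.State can represent the cluster's situation, it is still only a single field. By enabling the feature gate RayClusterStatusConditions on KubeRay v1.2.1, you can access the new Status.Conditions for more detailed cluster history and states.

Warning

RayClusterStatusConditions is still an alpha feature and may change in the future.

If you deployed KubeRay with Helm, enable the RayClusterStatusConditions gate in the featureGates of your Helm values.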

helm upgrade kuberay-operator kuberay/kuberay-operator --version 1.2.2 \
  --set featureGates\[0\].name=RayClusterStatusConditions \
  --set featureGates\[0\].enabled=true

Alternatively, run your KubeRay operator executable with the argument --feature-gates=RayClusterStatusConditions=true.

Type	Status	Reason	Description
RayClusterProvisioned	True	AllPodRunningAndReadyFirstTime	When all Pods in the cluster become ready, the system marks the condition as `True`. Even if some Pods fail later, the system maintains this `True` state.
	False	RayClusterPodsProvisioning
RayClusterReplicaFailure	True	FailedDeleteAllPods	KubeRay sets this condition to `True` when a reconciliation error occurs; otherwise KubeRay clears the condition. See the `Reason` and `Message` of the condition for more detailed debugging information.
	True	FailedDeleteHeadPod
	True	FailedCreateHeadPod
	True	FailedDeleteWorkerPod
	True	FailedCreateWorkerPod
HeadPodReady	True	HeadPodRunningAndReady	This condition is `True` only if the HeadPod is currently ready; otherwise, it is `False`.
	False	HeadPodNotFound
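Because these are standard Kubernetes conditions, a script can block on one of them with `kubectl wait`. The following is a sketch under the assumption that the RayClusterStatusConditions feature gate is enabled. The function name and the default timeout of 300s are illustrative choices, not KubeRay defaults.

```shell
# Hypothetical helper: block until a RayCluster reports a condition as True.
# Usage: wait_for_raycluster_condition <name> <namespace> [condition] [timeout]
wait_for_raycluster_condition() {
  # kubectl wait treats --for=condition=X as "X has status True".
  kubectl wait "raycluster/$1" -n "$2" \
    --for="condition=${3:-RayClusterProvisioned}" \
    --timeout="${4:-300s}"
}
```

For example, `wait_for_raycluster_condition raycluster-kuberay default HeadPodReady 60s` returns a non-zero exit code if the head Pod isn't ready within 60 seconds.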

RayService `.Status.Conditions`#

From KubeRay v1.3.0, RayService also supports the Status.Conditions field.

Ready: If Ready is true, the RayService is ready to serve requests.
UpgradeInProgress: If UpgradeInProgress is true, the RayService is currently being upgraded, and both an active and a pending RayCluster exist.

kubectl describe rayservices.ray.io rayservice-sample

# [Example output]
# Conditions:
#   Last Transition Time:  2025-02-08T06:45:20Z
#   Message:               Number of serve endpoints is greater than 0
#   Observed Generation:   1
#   Reason:                NonZeroServeEndpoints
#   Status:                True
#   Type:                  Ready
#   Last Transition Time:  2025-02-08T06:44:28Z
#   Message:               Active Ray cluster exists and no pending Ray cluster
#   Observed Generation:   1
#   Reason:                NoPendingCluster
#   Status:                False
#   Type:                  UpgradeInProgress
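To use the Ready condition in a script rather than reading the `describe` output, one option is a JSONPath filter on the condition type. This is a minimal sketch; `rayservice_is_ready` is a hypothetical name, and it assumes KubeRay v1.3.0 or later so that the field is populated.

```shell
# Hypothetical helper: succeed only if the RayService's Ready condition is "True".
# Usage: rayservice_is_ready <name> <namespace>
rayservice_is_ready() {
  # Select the condition whose type is Ready and read its status string.
  status=$(kubectl get rayservices.ray.io "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}
```

For example: `rayservice_is_ready rayservice-sample default && echo "ready to serve"`.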

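Kubernetes Events#

KubeRay creates Kubernetes events for every interaction between the KubeRay operator and the Kubernetes API server, such as creating a Kubernetes service, updating a RayCluster, and deleting a RayCluster. In addition, if validation of the custom resource fails, KubeRay also creates a Kubernetes event.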

# Example:
kubectl describe rayclusters.ray.io raycluster-kuberay

# Events:
#   Type    Reason            Age   From                   Message
#   ----    ------            ----  ----                   -------
#   Normal  CreatedService    37m   raycluster-controller  Created service default/raycluster-kuberay-head-svc
#   Normal  CreatedHeadPod    37m   raycluster-controller  Created head Pod default/raycluster-kuberay-head-l7v7q
#   Normal  CreatedWorkerPod  ...
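If you only want the events and not the full `describe` output, a field selector on the event's involved object is one way to get them. The helper name `events_for` is an assumption for illustration; the selector matches by resource name only, so resources of different kinds that share a name in the same namespace would all appear.

```shell
# Hypothetical helper: list events for one custom resource, oldest first.
# Usage: events_for <resource-name> <namespace>
events_for() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

For example, `events_for raycluster-kuberay default`.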

Ray Observability#

Ray Dashboard#

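To view the Ray Dashboard running on the head Pod, follow these instructions.
To integrate the Ray Dashboard with Prometheus and Grafana, see Using Prometheus and Grafana for more details.
To enable the "CPU flame graph" and "stack trace" features, see Profiling with py-spy.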

Check the logs of Ray Pods#

Check the Ray logs directly by accessing the log files on the Pods. See Ray Logging for more details.

kubectl exec -it $RAY_POD -n $YOUR_NAMESPACE -- bash
# Check the logs under /tmp/ray/session_latest/logs/
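If you'd rather not open an interactive shell, the same directory can be read with one-off `kubectl exec` calls. The helpers below are a sketch with hypothetical names. The log file name is passed in by the caller because the set of files depends on the Ray version and the node type.

```shell
# Hypothetical helpers for reading Ray logs without an interactive shell.
RAY_LOG_DIR=/tmp/ray/session_latest/logs

# Usage: list_ray_logs <pod> <namespace>
list_ray_logs() {
  kubectl exec "$1" -n "$2" -- ls "$RAY_LOG_DIR"
}

# Usage: tail_ray_log <pod> <namespace> <file> [lines]
tail_ray_log() {
  # Prints the last N lines (default 100) of one log file in the Pod.
  kubectl exec "$1" -n "$2" -- tail -n "${4:-100}" "$RAY_LOG_DIR/$3"
}
```

Run `list_ray_logs "$RAY_POD" "$YOUR_NAMESPACE"` first to see which files exist, then pass one of them to `tail_ray_log`.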

Check the Dashboard#

export HEAD_POD=$(kubectl get pods -n $YOUR_NAMESPACE --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl port-forward $HEAD_POD -n $YOUR_NAMESPACE 8265:8265
# Check $YOUR_IP:8265 in your browser to access the dashboard.
# For most cases, 127.0.0.1:8265 or localhost:8265 should work.

Ray State CLI#

You can use the Ray State CLI on the head Pod to check the status of Ray Serve applications.

# Run the Ray State CLI on the head Pod
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- ray summary actors

# [Example output]:
# ======== Actors Summary: 2023-07-11 17:58:24.625032 ========
# Stats:
# ------------------------------------
# total_actors: 14


# Table (group by class):
# ------------------------------------
#     CLASS_NAME                          STATE_COUNTS
# 0   ...                                 ALIVE: 1
# 1   ...                                 ALIVE: 1
# 2   ...                                 ALIVE: 3
# 3   ...                                 ALIVE: 1
# 4   ...                                 ALIVE: 1
# 5   ...                                 ALIVE: 1
# 6   ...                                 ALIVE: 1
# 7   ...                                 ALIVE: 1
# 8   ...                                 ALIVE: 1
# 9   ...                                 ALIVE: 1
# 10  ...                                 ALIVE: 1
# 11  ...                                 ALIVE: 1
