Monitoring with the CLI or SDK

Ray's monitoring and debugging capabilities are available through a CLI or SDK.

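CLI command `ray status`

You can monitor node status and resource usage by running the CLI command ray status on the head node. It displays:

Node status: nodes that are running and autoscaling up or down, the addresses of running nodes, and information about pending and failed nodes.
Resource usage: the Ray resource usage of the cluster. For example, the CPUs requested by all Ray tasks and actors, and the number of GPUs in use.

Here is an example output: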

$ ray status
======== Autoscaler status: 2021-10-12 13:10:21.035674 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
 2 ray.worker.cpu
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/10.0 CPU
 0.00/70.437 GiB memory
 0.00/10.306 GiB object_store_memory

Demands:
 (no resource demands)

When you need more detail about each node, run ray status -v. This is useful for investigating why a particular node doesn't autoscale down.
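If you want to use these numbers in a script, one option is to parse the text that ray status prints. The following is a minimal sketch that assumes the "Usage:" section has exactly the format of the sample output above (used/total, an optional unit, then the resource name); other Ray versions may format these lines differently. The names `parse_usage`, `USAGE_LINE`, and `SAMPLE_STATUS` are hypothetical helpers, not part of Ray.

```python
import re

# Sample text copied from the "Usage:" section of the output above.
SAMPLE_STATUS = """\
Usage:
 0.0/10.0 CPU
 0.00/70.437 GiB memory
 0.00/10.306 GiB object_store_memory

Demands:
 (no resource demands)
"""

# Matches lines such as " 0.0/10.0 CPU" or " 0.00/70.437 GiB memory".
# Assumption: the format matches the sample output shown in this section.
USAGE_LINE = re.compile(r"^\s*([\d.]+)/([\d.]+)\s+(?:(\w+)\s+)?(\S+)\s*$")


def parse_usage(status_text):
    """Return {resource: (used, total, unit)} from the Usage section."""
    usage = {}
    in_usage = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped == "Usage:":
            in_usage = True
            continue
        if in_usage:
            if not stripped:
                # A blank line ends the Usage section.
                break
            match = USAGE_LINE.match(line)
            if match:
                used, total, unit, name = match.groups()
                usage[name] = (float(used), float(total), unit)
    return usage


for name, (used, total, unit) in parse_usage(SAMPLE_STATUS).items():
    print(name, used, total, unit or "")
```

On a real cluster you would pass the captured output of ray status to `parse_usage` instead of the sample string.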

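Ray State CLI and SDK

Tip

Provide feedback on using the Ray State API through the feedback form!

Use the Ray State API to access the current state (snapshot) of Ray through the CLI or the Python SDK (developer API).

Note

This feature requires a full installation of Ray using pip install "ray[default]". It also requires the dashboard component to be available. The dashboard component must be included when starting the Ray cluster, which is the default behavior for ray start and ray.init().

Note

State API CLI commands are stable, while the Python SDK is a developer API. Prefer the CLI where possible.

Getting started

This example uses the following script, which runs two tasks and creates two actors.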

import ray
import time

ray.init(num_cpus=4)

@ray.remote
def task_running_300_seconds():
    time.sleep(300)

@ray.remote
class Actor:
    def __init__(self):
        # Printed so that the actor log shown later in this section has content.
        print("Actor created")

# Create 2 tasks
tasks = [task_running_300_seconds.remote() for _ in range(2)]

# Create 2 actors
actors = [Actor.remote() for _ in range(2)]

View the summarized state of tasks. If the command doesn't return output immediately, retry it.

CLI (Recommended)

ray summary tasks

======== Tasks Summary: 2022-07-22 08:54:38.332537 ========
Stats:
------------------------------------
total_actor_scheduled: 2
total_actor_tasks: 0
total_tasks: 2


Table (group by func_name):
------------------------------------
    FUNC_OR_CLASS_NAME        STATE_COUNTS    TYPE
0   task_running_300_seconds  RUNNING: 2      NORMAL_TASK
1   Actor.__init__            FINISHED: 2     ACTOR_CREATION_TASK

Python SDK (Internal Developer API)

from ray.util.state import summarize_tasks
print(summarize_tasks())

{'cluster': {'summary': {'task_running_300_seconds': {'func_or_class_name': 'task_running_300_seconds', 'type': 'NORMAL_TASK', 'state_counts': {'RUNNING': 2}}, 'Actor.__init__': {'func_or_class_name': 'Actor.__init__', 'type': 'ACTOR_CREATION_TASK', 'state_counts': {'FINISHED': 2}}}, 'total_tasks': 2, 'total_actor_tasks': 0, 'total_actor_scheduled': 2, 'summary_by': 'func_name'}}
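The returned summary is a nested dictionary, so it can be post-processed with plain Python. As a sketch, the following adds up the state counts across all functions and classes. It assumes the result has the same shape as the output printed above; `count_by_state` and `SAMPLE_SUMMARY` are hypothetical names, and the sample data is copied from that output rather than fetched from a cluster.

```python
# Sample result copied from the summarize_tasks() output shown above.
SAMPLE_SUMMARY = {
    "cluster": {
        "summary": {
            "task_running_300_seconds": {
                "func_or_class_name": "task_running_300_seconds",
                "type": "NORMAL_TASK",
                "state_counts": {"RUNNING": 2},
            },
            "Actor.__init__": {
                "func_or_class_name": "Actor.__init__",
                "type": "ACTOR_CREATION_TASK",
                "state_counts": {"FINISHED": 2},
            },
        },
        "total_tasks": 2,
        "total_actor_tasks": 0,
        "total_actor_scheduled": 2,
        "summary_by": "func_name",
    }
}


def count_by_state(summary):
    """Add up state_counts across every function or class in the summary."""
    totals = {}
    for entry in summary["cluster"]["summary"].values():
        for state, count in entry["state_counts"].items():
            totals[state] = totals.get(state, 0) + count
    return totals


print(count_by_state(SAMPLE_SUMMARY))
```

With a running cluster, `count_by_state(summarize_tasks())` should work the same way, provided the result keeps this shape.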

List all actors.

CLI (Recommended)

ray list actors

======== List: 2022-07-23 21:29:39.323925 ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
    ACTOR_ID                          CLASS_NAME    NAME      PID  STATE
0  31405554844820381c2f0f8501000000  Actor                 96956  ALIVE
1  f36758a9f8871a9ca993b1d201000000  Actor                 96955  ALIVE

Python SDK (Internal Developer API)

from ray.util.state import list_actors
print(list_actors())

[ActorState(actor_id='...', class_name='Actor', state='ALIVE', job_id='01000000', name='', node_id='...', pid=..., ray_namespace='...', serialized_runtime_env=None, required_resources=None, death_cause=None, is_detached=None, placement_group_id=None, repr_name=None), ActorState(actor_id='...', class_name='Actor', state='ALIVE', job_id='01000000', name='', node_id='...', pid=..., ray_namespace='...', serialized_runtime_env=None, required_resources=None, death_cause=None, is_detached=None, placement_group_id=None, repr_name=None)]
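Each entry in the returned list exposes fields such as `state` and `pid`, so the list can be filtered in ordinary Python. The sketch below uses stand-in records instead of a live cluster: `pids_by_state` and `sample_actors` are hypothetical names, and the DEAD entry with its pid is made up for illustration. It assumes the fields are readable as attributes, as the ActorState output above suggests.

```python
from types import SimpleNamespace


def pids_by_state(actors, state="ALIVE"):
    """Return the pid of each actor record whose state matches `state`."""
    return [actor.pid for actor in actors if actor.state == state]


# Stand-in records with the same `state` and `pid` fields as the ActorState
# entries printed above. The DEAD entry is invented for illustration only.
sample_actors = [
    SimpleNamespace(class_name="Actor", state="ALIVE", pid=96956),
    SimpleNamespace(class_name="Actor", state="ALIVE", pid=96955),
    SimpleNamespace(class_name="Actor", state="DEAD", pid=12345),
]

print(pids_by_state(sample_actors))
```

On a running cluster, you could pass the result of `list_actors()` to `pids_by_state` in place of `sample_actors`.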

Get the state of a single actor using the get API.

CLI (Recommended)

# In this case, 31405554844820381c2f0f8501000000
ray get actors <ACTOR_ID>

---
actor_id: 31405554844820381c2f0f8501000000
class_name: Actor
death_cause: null
is_detached: false
name: ''
pid: 96956
resource_mapping: []
serialized_runtime_env: '{}'
state: ALIVE

Python SDK (Internal Developer API)

from ray.util.state import get_actor

# Replace with your own ACTOR_ID; in this case, 31405554844820381c2f0f8501000000
actor_id = "31405554844820381c2f0f8501000000"
print(get_actor(id=actor_id))

Access logs through the ray logs API.

CLI (Recommended)

ray list actors
# In this case, ACTOR_ID is 31405554844820381c2f0f8501000000
ray logs actor --id <ACTOR_ID>

--- Log has been truncated to last 1000 lines. Use `--tail` flag to toggle. ---

:actor_name:Actor
Actor created

Python SDK (Internal Developer API)

from ray.util.state import get_log

# Replace with your own ACTOR_ID; in this case, 31405554844820381c2f0f8501000000
actor_id = "31405554844820381c2f0f8501000000"
for line in get_log(actor_id=actor_id):
    print(line)

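Key concepts

The Ray State API lets you access the state of resources through the summary, list, and get APIs. It also provides a logs API for accessing logs.

states: The state of the cluster for the corresponding resource. A state consists of immutable metadata (for example, an actor's name) and mutable state (for example, an actor's scheduling state or pid).
resources: Resources created by Ray, such as actors, tasks, objects, and placement groups.
summary: API that returns a summarized view of resources.
list: API that returns every individual entity of a resource.
get: API that returns a single entity of a resource in detail.
logs: API for accessing the logs of actors, tasks, workers, or system log files.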

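API reference

For the CLI reference, see the State CLI reference.
For the SDK reference, see the State API reference.
For the Log CLI reference, see the Log CLI reference.

Using Ray CLI tools from outside the cluster

These CLI commands must be run on a node in the Ray cluster. The following examples show how to execute them from a machine outside the Ray cluster.

VM Cluster Launcher

Execute a command on the cluster using ray exec: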

$ ray exec <cluster config file> "ray status"

KubeRay

Execute a command on the cluster using kubectl exec and the configured RayCluster name. Ray uses the Service that targets the Ray head pod to execute CLI commands on the cluster.

# First, find the name of the Ray head pod.
$ kubectl get pod | grep <RayCluster name>-head
# NAME                                             READY   STATUS    RESTARTS   AGE
# <RayCluster name>-head-xxxxx                     2/2     Running   0          XXs

# Then, use the name of the Ray head pod to run `ray status`.
$ kubectl exec <RayCluster name>-head-xxxxx -- ray status
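If you drive these commands from a script, it can help to build the argument lists in one place. The sketch below only constructs and prints the two command lines shown above; it doesn't run them. `ray_exec_command` and `kubectl_exec_command` are hypothetical helper names, and "cluster.yaml" and "raycluster-sample-head-xxxxx" are placeholders for your own cluster config file and head pod name.

```python
import shlex


def ray_exec_command(cluster_config, ray_command="ray status"):
    """Build the argv for `ray exec <cluster config file> "<command>"`."""
    return ["ray", "exec", cluster_config, ray_command]


def kubectl_exec_command(head_pod, ray_command="ray status"):
    """Build the argv for `kubectl exec <head pod> -- <command>`."""
    return ["kubectl", "exec", head_pod, "--", *shlex.split(ray_command)]


# Placeholder names; substitute your own config file and head pod name.
print(shlex.join(ray_exec_command("cluster.yaml")))
print(shlex.join(kubectl_exec_command("raycluster-sample-head-xxxxx")))
```

To actually run a command, you could pass either list to `subprocess.run(argv, check=True)` on a machine where ray or kubectl is installed and configured.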
