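Cluster Launcher Commands

This document gives an overview of commonly used commands for the Ray cluster launcher. See the cluster configuration documentation for how to customize the configuration file.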

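Launching a cluster (`ray up`)

This command starts the machines in the cloud, installs your dependencies, runs any setup commands you have configured, configures the Ray cluster automatically, and prepares you to scale your distributed system. See the documentation for ray up. Example configuration files are available in the Ray repository under python/ray/autoscaler/ (the commands below use example-full.yaml).

Tip

The worker nodes start only after the head node has finished starting. To monitor the progress of the cluster setup, you can run ray monitor <cluster yaml>.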

# Replace '<your_backend>' with one of: 'aws', 'gcp', 'kubernetes', or 'local'.
$ BACKEND=<your_backend>

# Create or update the cluster.
$ ray up ray/python/ray/autoscaler/$BACKEND/example-full.yaml

# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/$BACKEND/example-full.yaml

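Updating an existing cluster (`ray up`)

If you want to update your cluster configuration (add more files, change dependencies), run ray up again on the existing cluster.

This command checks whether the local configuration differs from the configuration already applied to the cluster. This includes any changes to synced files specified in the file_mounts section of the config. If it differs, the new files and config are uploaded to the cluster. The Ray services and processes are then restarted.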
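To make the idea of change detection concrete, here is a minimal Python sketch that fingerprints a configuration together with the contents of its mounted files. It is not Ray's actual implementation: the function name `config_fingerprint` and the hashing scheme are assumptions chosen for illustration, and the sketch only handles individual files, not directories.

```python
import hashlib
import json


def config_fingerprint(config: dict, file_mounts: dict) -> str:
    """Return a hash of a config dict plus the contents of its mounted files.

    Hypothetical helper (not part of Ray): file_mounts maps a remote path
    to the path of a single local file.
    """
    digest = hashlib.sha256()
    # Serialize with sorted keys so that key order does not affect the hash.
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    for remote_path in sorted(file_mounts):
        local_path = file_mounts[remote_path]
        digest.update(remote_path.encode("utf-8"))
        # Include the file contents, so editing a mounted file changes the hash.
        with open(local_path, "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()
```

Comparing the fingerprint computed now with the one stored at the previous update tells you whether anything needs to be re-uploaded. Editing a mounted file changes the fingerprint even when the config itself is untouched.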

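Tip

Do not do this for the cloud provider specification (for example, changing from AWS to GCP on a running cluster) and do not change the cluster name. Doing so just starts a new cluster and orphans the original one.

You can also run ray up to restart a cluster that appears to be in a bad state. This restarts all Ray services even if there are no config changes.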

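Running ray up on an existing cluster does all of the following:

If the head node matches the cluster specification, the file mounts are reapplied and the setup_commands and ray start commands are run. Some caching may apply here to skip setup or file mounts.
If the head node is out of date relative to the specified YAML (for example, head_node_type has changed in the YAML), the out-of-date node is terminated and a new node is provisioned to replace it. Setup, file mounts, and ray start are then applied.
After the head node reaches a consistent state (after the ray start commands have finished), the same procedure is applied to all worker nodes. The ray start commands usually run ray stop followed by ray start, so this kills any currently running jobs.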

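If you do not want the update to restart services (for example, because the changes do not require a restart), pass --no-restart to the update call.

If you want to force regeneration of the config to pick up possible changes in the cloud environment, pass --no-config-cache to the update call.

If you want to skip the setup commands and only run ray stop and ray start on all nodes, pass --restart-only to the update call.

See the documentation for ray up.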

# Reconfigure autoscaling behavior without interrupting running jobs.
$ ray up ray/python/ray/autoscaler/$BACKEND/example-full.yaml \
    --max-workers=N --no-restart
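If you drive updates from a script, the flags above can be assembled programmatically. The following Python sketch is only an illustration: `build_ray_up_command` is a hypothetical helper, not part of Ray, and it treats --no-restart and --restart-only as mutually exclusive on the assumption that asking for both is contradictory.

```python
from typing import List, Optional


def build_ray_up_command(
    config_path: str,
    max_workers: Optional[int] = None,
    no_restart: bool = False,
    restart_only: bool = False,
    no_config_cache: bool = False,
) -> List[str]:
    """Build the argument list for a `ray up` call (hypothetical helper)."""
    # Assumption: skipping restarts and doing only restarts cannot be combined.
    if no_restart and restart_only:
        raise ValueError("--no-restart and --restart-only cannot be combined")
    cmd = ["ray", "up", config_path]
    if max_workers is not None:
        cmd.append(f"--max-workers={max_workers}")
    if no_restart:
        cmd.append("--no-restart")
    if restart_only:
        cmd.append("--restart-only")
    if no_config_cache:
        cmd.append("--no-config-cache")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a string avoids shell quoting problems with the config path.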

Running shell commands on the cluster (`ray exec`)

You can use ray exec to conveniently run commands on the cluster. See the documentation for ray exec.

# Run a command on the cluster
$ ray exec cluster.yaml 'echo "hello world"'

# Run a command on the cluster, starting it if needed
$ ray exec cluster.yaml 'echo "hello world"' --start

# Run a command on the cluster, stopping the cluster after it finishes
$ ray exec cluster.yaml 'echo "hello world"' --stop

# Run a command on a new cluster called 'experiment-1', stopping it afterwards
$ ray exec cluster.yaml 'echo "hello world"' \
    --start --stop --cluster-name experiment-1

# Run a command in a detached tmux session
$ ray exec cluster.yaml 'echo "hello world"' --tmux

# Run a command in a screen (experimental)
$ ray exec cluster.yaml 'echo "hello world"' --screen

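If you want to run applications on the cluster that are accessible from a web browser (for example, a Jupyter notebook), you can use --port-forward. The port opened locally is the same as the remote port.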

$ ray exec cluster.yaml --port-forward=8899 'source ~/anaconda3/bin/activate tensorflow_p36 && jupyter notebook --port=8899'

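Note

For Kubernetes clusters, the port-forward option cannot be used while executing a command. To forward a port and run a command at the same time, you need to call ray exec twice, separately.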
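Once a port is forwarded, you may want to confirm that it is reachable locally before opening a browser. The following Python sketch shows one way to do that with the standard library. The helper name `is_port_open` is hypothetical, and 8899 in the usage note is simply the port from the example above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError if nothing is listening
        # or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `is_port_open("localhost", 8899)` should return True while the port forward from the example is active, and False once it is closed.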

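Running Ray scripts on the cluster (`ray submit`)

You can also use ray submit to execute Python scripts on the cluster. This rsyncs the designated file to the cluster's head node and executes it with the given arguments. See the documentation for ray submit.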

# Run a Python script in a detached tmux session
$ ray submit cluster.yaml --tmux --start --stop tune_experiment.py

# Run a Python script with arguments.
# This executes script.py on the head node of the cluster, using
# the command: python ~/script.py --arg1 --arg2 --arg3
$ ray submit cluster.yaml script.py -- --arg1 --arg2 --arg3
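For reference, here is a minimal sketch of what a script.py accepting those three flags might look like. The flag names come from the example above. Everything else, including treating them as boolean switches and ignoring unrecognized arguments, is an assumption made for illustration.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the flags that follow `--` on the ray submit command line."""
    parser = argparse.ArgumentParser(description="Example script for ray submit")
    # Assumption: the three flags are simple on/off switches.
    parser.add_argument("--arg1", action="store_true")
    parser.add_argument("--arg2", action="store_true")
    parser.add_argument("--arg3", action="store_true")
    # parse_known_args tolerates extra arguments instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


def main(argv=None):
    args = parse_args(sys.argv[1:] if argv is None else argv)
    print(f"arg1={args.arg1} arg2={args.arg2} arg3={args.arg3}")
    return args


if __name__ == "__main__":
    main()
```

Everything after the `--` separator reaches the script through `sys.argv`, exactly as if you had run `python ~/script.py --arg1 --arg2 --arg3` on the head node yourself.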

Attaching to a running cluster (`ray attach`)

You can use ray attach to attach to an interactive screen session on the cluster. See the documentation for ray attach, or run ray attach --help.

# Open a screen on the cluster
$ ray attach cluster.yaml

# Open a screen on a new cluster called 'session-1'
$ ray attach cluster.yaml --start --cluster-name=session-1

# Attach to tmux session on cluster (creates a new one if none available)
$ ray attach cluster.yaml --tmux

Synchronizing files to and from the cluster (`ray rsync-up/down`)

To download files from or upload files to the cluster head node, use ray rsync_down or ray rsync_up.

$ ray rsync_down cluster.yaml '/path/on/cluster' '/local/path'
$ ray rsync_up cluster.yaml '/local/path' '/path/on/cluster'

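Monitoring cluster status (`ray dashboard/status`)

Ray also comes with an online dashboard. The dashboard is accessible over HTTP on the head node (by default it listens on localhost:8265). You can also use the built-in ray dashboard command to set up port forwarding automatically, so that the remote dashboard is viewable in your local browser at localhost:8265.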

$ ray dashboard cluster.yaml

You can monitor cluster usage and autoscaling status by running the following on the head node:

$ ray status

To see live updates to the status:

$ watch -n 1 ray status

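The Ray autoscaler also reports per-node status in the form of instance tags. In your cloud provider console, you can click on a node, go to the "Tags" pane, and add the ray-node-status tag as a column. This lets you see per-node statuses at a glance.

Common workflow: syncing a Git branch

A common use case is syncing a particular local Git branch to all workers of the cluster. However, if you just put a git checkout <branch> in the setup commands, the autoscaler does not know when to rerun the command to pull in updates. A good workaround is to include the Git SHA in the input, because the hash of the file changes whenever the branch is updated: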

file_mounts: {
    "/tmp/current_branch_sha": "/path/to/local/repo/.git/refs/heads/<YOUR_BRANCH_NAME>",
}

setup_commands:
    - test -e <REPO_NAME> || git clone https://github.com/<REPO_ORG>/<REPO_NAME>.git
    - cd <REPO_NAME> && git fetch && git checkout `cat /tmp/current_branch_sha`

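This tells ray up to sync the current Git branch SHA from your personal computer to a temporary file on the cluster (assuming you have pushed the branch head already). The setup commands then read that file to determine which SHA to check out on the nodes. Note that each command runs in its own session. The final workflow for updating the cluster therefore becomes the following steps:

1. Make local changes to a Git branch.
2. Commit the changes with git commit and push them with git push.
3. Update the files on your Ray cluster with ray up.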
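The file_mounts entry above relies on the loose ref file .git/refs/heads/<YOUR_BRANCH_NAME> existing locally. Git can move refs into .git/packed-refs (for example after git pack-refs or garbage collection), in which case the loose file may be absent. The following Python sketch shows how you might check which SHA a local branch points to before running ray up. The helper `read_branch_sha` is hypothetical and reads the repository files directly rather than calling git.

```python
import os
from typing import Optional


def read_branch_sha(repo_path: str, branch: str) -> Optional[str]:
    """Return the commit SHA a local branch points to, or None if not found."""
    git_dir = os.path.join(repo_path, ".git")
    # First try the loose ref file, which is what file_mounts would sync.
    loose_ref = os.path.join(git_dir, "refs", "heads", branch)
    if os.path.isfile(loose_ref):
        with open(loose_ref) as f:
            return f.read().strip()
    # Fall back to packed-refs, where each entry is "<sha> <refname>".
    packed_refs = os.path.join(git_dir, "packed-refs")
    if os.path.isfile(packed_refs):
        target = f"refs/heads/{branch}"
        with open(packed_refs) as f:
            for line in f:
                line = line.strip()
                # Skip blank lines, header comments, and peeled-tag lines.
                if not line or line.startswith("#") or line.startswith("^"):
                    continue
                sha, _, ref = line.partition(" ")
                if ref == target:
                    return sha
    return None
```

If the function finds the SHA only in packed-refs, the path used in file_mounts does not exist on disk. Making a new commit on the branch normally recreates the loose ref file.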
