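Getting Started#
This quick start demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray launches clusters and scales Python with just a few commands.
To launch a Ray cluster manually, refer to the on-premise cluster setup guide.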
About the demo#
This demo walks through an end-to-end flow:
Create a (basic) Python application.
Launch a cluster on a cloud provider.
Run the application in the cloud.
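Requirements#
To run this demo, you will need:
Python installed on your development machine (typically your laptop), and
an account at your preferred cloud provider (AWS, GCP, Azure, Aliyun, or vSphere).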
Setup#
Before we start, you will need to install some Python dependencies. Run the command for your cloud provider:
AWS:
$ pip install -U "ray[default]" boto3
Azure:
$ pip install -U "ray[default]" azure-cli azure-core
GCP:
$ pip install -U "ray[default]" google-api-python-client
Aliyun:
$ pip install -U "ray[default]" aliyun-python-sdk-core aliyun-python-sdk-ecs
Aliyun cluster launcher maintainers (GitHub handles): @zhuangzhuang131419, @chenk008
vSphere:
$ pip install -U "ray[default]"
vSphere cluster launcher maintainers (GitHub handles): @roshankathawate, @ankitasonawane30, @VamshikShetty
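Next, if you are not yet set up to use your cloud provider from the command line, you will need to configure your credentials:
Aliyun:
Obtain and set the AccessKey pair of the Aliyun account as described in the docs.
Make sure to grant the necessary permissions to the RAM user and set the AccessKey pair in your cluster config file. Refer to the provided aliyun/example-full.yaml for a sample cluster config.
vSphere:
Make sure the Ray supervisor service is up and running, as described in the Ray-on-VCF docs <https://github-vcf.devops.broadcom.net/vcf/vmray>.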
Create a (basic) Python application#
We will write a simple Python application that tracks the IP addresses of the machines that its tasks are executed on:
from collections import Counter
import socket
import time

def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname("localhost")

ip_addresses = [f() for _ in range(10000)]
print(Counter(ip_addresses))
Save this application as script.py and execute it by running the command python script.py. The application should take about 10 seconds to run and output something similar to Counter({'127.0.0.1': 10000}).
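The 10-second figure follows from the loop itself: 10,000 sequential calls that each sleep for 1 ms give a lower bound of roughly 10 s. The sketch below makes that arithmetic visible with a scaled-down call count. The `time_serial_calls` helper and the `num_calls` value are illustrative names chosen here, not part of script.py, and the IP lookup is left out so that only the sleep is measured.

```python
import time


def f():
    # Same sleep as f() in script.py; the IP lookup is omitted here.
    time.sleep(0.001)


def time_serial_calls(num_calls):
    """Run f() num_calls times in sequence and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(num_calls):
        f()
    return time.perf_counter() - start


num_calls = 200  # scaled down from 10000 to keep the check quick
elapsed = time_serial_calls(num_calls)
# Each call sleeps for at least 1 ms, so the total cannot be much below
# num_calls * 0.001 seconds; scheduling overhead usually adds to it.
print(f"{num_calls} calls took {elapsed:.3f} s "
      f"(lower bound about {num_calls * 0.001:.3f} s)")
```

Because the calls run one after another, the total time grows linearly with the number of calls, which is the cost the Ray version spreads across workers.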
With some small changes, we can make this application run on Ray (for more information on how to do this, refer to the Ray Core Walkthrough):
from collections import Counter
import socket
import time

import ray

ray.init()

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname("localhost")

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)
print(Counter(ip_addresses))
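The shape of the change is "submit everything first, gather results afterwards": `f.remote()` returns immediately with a reference to a future result, and `ray.get` blocks until the results are available. As a rough analogy only, the same submit-then-gather pattern can be sketched with the standard library's `concurrent.futures`. This is not Ray: it runs threads on a single machine, the pool size of 8 is an arbitrary assumption, and the returned address is a hard-coded stand-in for the lookup in script.py.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import time


def f():
    time.sleep(0.001)
    # Hard-coded stand-in for the IP lookup in script.py.
    return "127.0.0.1"


with ThreadPoolExecutor(max_workers=8) as pool:
    # submit() returns a Future right away, much like f.remote()
    # returns an object reference without waiting for the task.
    futures = [pool.submit(f) for _ in range(1000)]
    # result() blocks until each value is ready, much like ray.get().
    ip_addresses = [future.result() for future in futures]

print(Counter(ip_addresses))
```

The key point carried over to Ray is that no result is requested until every task has been submitted, so the tasks can overlap instead of running strictly one after another.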
Finally, let's add some code to make the output more interesting:
from collections import Counter
import socket
import time

import ray

ray.init()

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname("localhost")

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))
Running python script.py should now output something like:
This cluster consists of
    1 nodes in total
    4.0 CPU resources in total

Tasks executed
    10000 tasks on 127.0.0.1
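The reporting loop depends only on the list of returned addresses. The small sketch below isolates it in a hypothetical `format_task_summary` helper and feeds it made-up addresses (the `10.0.0.x` values are illustrative, not real cluster output) to show how the tally would read if tasks reported more than one address.

```python
from collections import Counter


def format_task_summary(ip_addresses):
    """Return the 'Tasks executed' report as a single string."""
    lines = ["Tasks executed"]
    # Counter keeps addresses in the order they were first seen.
    for ip_address, num_tasks in Counter(ip_addresses).items():
        lines.append("    {} tasks on {}".format(num_tasks, ip_address))
    return "\n".join(lines)


# Made-up addresses standing in for results gathered from several nodes.
sample = ["10.0.0.1"] * 6 + ["10.0.0.2"] * 3 + ["10.0.0.3"]
print(format_task_summary(sample))
```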
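Launch a cluster on a cloud provider#
To start a Ray cluster, we first need to define the cluster configuration. The cluster configuration is defined in a YAML file that is used by the cluster launcher to launch the head node, and by the autoscaler to launch worker nodes.
A minimal sample cluster configuration file for each provider looks as follows: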
AWS:
# A unique identifier for the head node and workers of this cluster.
cluster_name: aws-example-minimal

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 3

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 3
        # The maximum number of worker nodes of this type to launch.
        # This parameter takes precedence over min_workers.
        max_workers: 3
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
Azure:
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# Cloud-provider specific configuration.
provider:
    type: azure
    location: westus2
    resource_group: ray-cluster

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # You must specify paths to matching private and public key pair files.
    # Use `ssh-keygen -t rsa -b 4096` to generate a new SSH key pair.
    ssh_private_key: ~/.ssh/id_rsa
    # Changes to this should match what is specified in file_mounts.
    ssh_public_key: ~/.ssh/id_rsa.pub
GCP:
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
Aliyun:
Please refer to example-full.yaml.
Make sure your account balance is not less than 100 RMB, otherwise you will receive the error InvalidAccountStatus.NotEnoughBalance.
vSphere:
# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 5

# Cloud-provider specific configuration.
provider:
    type: vsphere

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ray
    # By default Ray creates a new private keypair, but you can also use your own.
    # If you do so, make sure to also set "KeyName" in the head and worker node
    # configurations below.
    ssh_private_key: ~/ray-bootstrap-key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # You can override the resources here. Adding GPU to the head node is not recommended.
        # resources: { "CPU": 2, "Memory": 4096}
        resources: {}
    ray.worker.default:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        max_workers: 3
        # You can override the resources here. For GPU, currently only NVIDIA GPU is supported. If no ESXi host can
        # fulfill the requirement, the Ray node creation will fail. The number of created nodes may not meet the desired
        # minimum number. The vSphere node provider will not distinguish the GPU type. It will just count the quantity:
        # mount the first k random available NVIDIA GPU to the VM, if the user set {"GPU": k}.
        # resources: {"CPU": 2, "Memory": 4096, "GPU": 1}
        resources: {}

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default
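Save this configuration file as config.yaml. You can specify many more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, refer to the cluster YAML configuration options reference.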
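Before launching, it can help to sanity-check the worker counts in a configuration. The sketch below is an illustration only: it writes the worker-related fields of the AWS example as a plain Python dict (rather than parsing config.yaml) and applies a few simple consistency checks in a hypothetical `check_worker_limits` helper. The checks and the defaults used for missing fields are assumptions made for this example; they are not Ray's own validation logic.

```python
def check_worker_limits(config):
    """Return a list of problems found in the worker counts (empty if none)."""
    problems = []
    cluster_max = config["max_workers"]
    for name, node_type in config["available_node_types"].items():
        # Node types without explicit limits (such as the head node) are
        # treated as min 0 / max cluster_max; that default is an assumption.
        min_workers = node_type.get("min_workers", 0)
        max_workers = node_type.get("max_workers", cluster_max)
        if min_workers < 0:
            problems.append(f"{name}: min_workers must be >= 0")
        if min_workers > max_workers:
            problems.append(
                f"{name}: min_workers {min_workers} > max_workers {max_workers}")
        if max_workers > cluster_max:
            problems.append(
                f"{name}: max_workers {max_workers} > cluster max_workers {cluster_max}")
    return problems


# Worker-related fields of the AWS example, written as a dict.
aws_example = {
    "cluster_name": "aws-example-minimal",
    "max_workers": 3,
    "available_node_types": {
        "ray.head.default": {
            "resources": {},
            "node_config": {"InstanceType": "m5.large"},
        },
        "ray.worker.default": {
            "min_workers": 3,
            "max_workers": 3,
            "resources": {},
            "node_config": {"InstanceType": "m5.large"},
        },
    },
}

print(check_worker_limits(aws_example))
```

An empty list means none of these illustrative checks found a conflict between the per-type and cluster-wide worker limits.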
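After defining our configuration, we will use the Ray cluster launcher to start a cluster in the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the Ray CLI. Run the following command: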
$ ray up -y config.yaml
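Running applications on a Ray cluster#
We are now ready to execute an application on our Ray cluster. ray.init() will now automatically connect to the newly created cluster.
As a quick example, we execute a Python command on the Ray cluster that connects to Ray and exits: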
$ ray exec config.yaml 'python -c "import ray; ray.init()"'
2022-08-10 11:23:17,093 INFO worker.py:1312 -- Connecting to existing Ray cluster at address: <remote IP address>:6379...
2022-08-10 11:23:17,097 INFO worker.py:1490 -- Connected to Ray cluster.
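You can also optionally get a remote shell using ray attach and run commands directly on the cluster. This command creates an SSH connection to the head node of the Ray cluster.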
# From a remote client:
$ ray attach config.yaml
# Now on the head node...
$ python -c "import ray; ray.init()"
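For a full reference on the Ray cluster CLI tools, refer to the cluster commands reference.
While these tools are useful for ad-hoc execution on the Ray cluster, the recommended way to execute an application on a Ray cluster is to use Ray Jobs. Check out the quickstart guide to get started!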
Deleting a Ray cluster#
To shut down your cluster, run the following command:
$ ray down -y config.yaml
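The three CLI steps used in this guide (ray up, ray exec, ray down) can also be driven from a script. The following is a minimal sketch, not an official Ray API: the `cluster_commands` and `run_lifecycle` helpers are hypothetical names, and only the command lines themselves come from this guide. With `dry_run=True` (the default chosen here) nothing is executed and the commands are only returned.

```python
import shlex
import subprocess


def cluster_commands(config_path, remote_command):
    """Build the up / exec / down command lines used in this guide."""
    return [
        ["ray", "up", "-y", config_path],
        ["ray", "exec", config_path, remote_command],
        ["ray", "down", "-y", config_path],
    ]


def run_lifecycle(config_path, remote_command, dry_run=True):
    """Start the cluster, run one command on it, then shut it down."""
    up, exec_, down = cluster_commands(config_path, remote_command)
    if not dry_run:
        subprocess.run(up, check=True)
        try:
            subprocess.run(exec_, check=True)
        finally:
            # Tear the cluster down even if the remote command fails,
            # so that no cloud instances are left running.
            subprocess.run(down, check=True)
    return [up, exec_, down]


commands = run_lifecycle("config.yaml", 'python -c "import ray; ray.init()"')
for command in commands:
    # shlex.join quotes arguments the way a POSIX shell expects them.
    print(shlex.join(command))
```

The try/finally is the main design choice: a failed remote command should not leave billable instances behind.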