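Getting Started#

This quick start demonstrates the capabilities of Ray clusters. Using a Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray launches clusters and scales Python with just a few commands.

To launch a Ray cluster manually, refer to the on-premise cluster setup guide.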

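About the demo#

This demo walks through an end-to-end flow:

  1. Create a (basic) Python application.

  2. Launch a cluster on a cloud provider.

  3. Run the application in the cloud.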

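Requirements#

To run this demo, you need:

  • Python installed on your development machine (typically your laptop), and

  • an account with your preferred cloud provider (AWS, GCP, Azure, Aliyun, or vSphere).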

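Setup#

Before you start, install the Python dependencies for your cloud provider. Run only the command that matches your provider; the commands are listed in the order AWS, GCP, Azure, Aliyun, vSphere: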

$ pip install -U "ray[default]" boto3
$ pip install -U "ray[default]" google-api-python-client
$ pip install -U "ray[default]" azure-cli azure-core
$ pip install -U "ray[default]" aliyun-python-sdk-core aliyun-python-sdk-ecs

Aliyun Cluster Launcher maintainers (GitHub handles): @zhuangzhuang131419, @chenk008

$ pip install -U "ray[default]" "git+https://github.com/vmware/vsphere-automation-sdk-python.git"

vSphere Cluster Launcher maintainers (GitHub handles): @LaynePeng, @roshankathawate, @JingChen23

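Next, if you haven't already set up your cloud provider for command-line use, configure your credentials:

AWS: Configure your credentials in ~/.aws/credentials as described in the AWS docs.

GCP: Set the GOOGLE_APPLICATION_CREDENTIALS environment variable as described in the GCP docs.

Azure: Log in using az login, then configure your credentials with az account set -s <subscription_id>.

Aliyun: Obtain and set the AccessKey pair of your Aliyun account as described in the docs.

Make sure to grant the necessary permissions to the RAM user and set the AccessKey pair in your cluster config file. Refer to the provided aliyun/example-full.yaml for a sample cluster config.

vSphere: Set the following environment variables: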

$ export VSPHERE_SERVER=192.168.0.1 # Enter your vSphere vCenter Address
$ export VSPHERE_USER=user # Enter your username
$ export VSPHERE_PASSWORD=password # Enter your password
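
Before launching anything, it can help to confirm that the credentials described above are actually in place. The following is a minimal sketch, not part of Ray: `missing_credentials` is a hypothetical helper, and it only covers the providers whose setup is a file or an environment variable (AWS, GCP, vSphere). Azure and Aliyun are left out because their setup goes through `az login` and the cluster config file.

```python
import os
from pathlib import Path

# Environment variables named in the setup steps, keyed by provider.
REQUIRED_ENV_VARS = {
    "gcp": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "vsphere": ["VSPHERE_SERVER", "VSPHERE_USER", "VSPHERE_PASSWORD"],
}


def missing_credentials(provider, env=None, home=None):
    """Return descriptions of the credentials that appear to be missing.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    missing = []
    if provider == "aws":
        # AWS credentials are expected in ~/.aws/credentials.
        credentials_file = home / ".aws" / "credentials"
        if not credentials_file.is_file():
            missing.append(f"file {credentials_file}")
    for name in REQUIRED_ENV_VARS.get(provider, []):
        if not env.get(name):
            missing.append(f"environment variable {name}")
    return missing


for provider in ("aws", "gcp", "vsphere"):
    problems = missing_credentials(provider)
    status = "ok" if not problems else "missing " + ", ".join(problems)
    print(f"{provider}: {status}")
```

The check only tests for presence; it does not verify that the credentials are valid or carry the permissions the Cluster Launcher needs.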

Create a (basic) Python application#

We will write a simple Python application that tracks the IP addresses of the machines that its tasks are executed on:

from collections import Counter
import socket
import time

def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname("localhost")

ip_addresses = [f() for _ in range(10000)]
print(Counter(ip_addresses))

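Save this application as script.py and execute it by running the command python script.py. The application should take about 10 seconds to run (10,000 calls that each sleep for 1 ms) and output something similar to Counter({'127.0.0.1': 10000}).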
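
The 10-second figure follows directly from the script's two constants. As a small worked example (the function name is made up for illustration), the sequential lower bound is simply the number of calls times the sleep per call:

```python
def estimated_sequential_seconds(num_calls, sleep_seconds):
    """Lower bound on runtime when the calls run one after another."""
    return num_calls * sleep_seconds


# Values taken from script.py: 10,000 calls, each sleeping 0.001 s.
print(estimated_sequential_seconds(10000, 0.001))  # prints 10.0
```

The real runtime will be somewhat higher, since `time.sleep` can overshoot and the hostname lookup takes time too.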

With some small changes, we can make this application run on Ray (for more information on how to do this, refer to the Ray Core Walkthrough):

from collections import Counter
import socket
import time

import ray

ray.init()

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname("localhost")

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)
print(Counter(ip_addresses))
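
In the Ray version, f.remote() returns a reference immediately and ray.get() blocks until the results are available. If that submit-then-collect pattern is unfamiliar, the sketch below shows the same shape using the standard library's `concurrent.futures`. This is an analogy only: it uses threads on a single machine rather than Ray workers, and the pool size and task count are arbitrary choices for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import socket
import time


def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname("localhost")


# Analogy only: threads on one machine, not Ray workers on a cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns a Future right away, much like f.remote().
    futures = [pool.submit(f) for _ in range(100)]
    # result() blocks until each value is ready, much like ray.get().
    ip_addresses = [future.result() for future in futures]

print(Counter(ip_addresses))
```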

Finally, let's add some code to make the output more interesting:

from collections import Counter
import socket
import time

import ray

ray.init()

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname("localhost")

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))

Running python script.py should now output something like:

This cluster consists of
    1 nodes in total
    4.0 CPU resources in total

Tasks executed
    10000 tasks on 127.0.0.1
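
To see what the summary loop would print when results come from more than one machine, it can be run on a made-up list of addresses. The three addresses and the task split below are hypothetical, chosen only to exercise the formatting code. Note also that because script.py resolves "localhost", it may report the loopback address on every node; reporting per-node addresses would need a different lookup.

```python
from collections import Counter

# Hypothetical task results: made-up private addresses for a three-node cluster.
ip_addresses = ["10.0.0.1"] * 4000 + ["10.0.0.2"] * 3500 + ["10.0.0.3"] * 2500


def summarize(ip_addresses):
    """Format per-address task counts the same way script.py does."""
    lines = ["Tasks executed"]
    for ip_address, num_tasks in Counter(ip_addresses).items():
        lines.append("    {} tasks on {}".format(num_tasks, ip_address))
    return "\n".join(lines)


print(summarize(ip_addresses))
```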

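Launch a cluster on a cloud provider#

To start a Ray cluster, first define the cluster configuration. The cluster configuration is defined in a YAML file that the Cluster Launcher uses to launch the head node and that the Autoscaler uses to launch worker nodes.

Minimal sample cluster configuration files look as follows, one per provider: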

# AWS minimal example (a separate config file).
# A unique identifier for the head node and workers of this cluster.
cluster_name: aws-example-minimal

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 3

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 3
        # The maximum number of worker nodes of this type to launch.
        # This parameter takes precedence over min_workers.
        max_workers: 3
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large

# GCP minimal example (a separate config file).
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1

# Azure minimal example (a separate config file).
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# Cloud-provider specific configuration.
provider:
    type: azure
    location: westus2
    resource_group: ray-cluster

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub

Aliyun: Refer to example-full.yaml.

Make sure your account balance is not less than 100 RMB; otherwise you will receive the error InvalidAccountStatus.NotEnoughBalance.

# vSphere minimal example (a separate config file).
# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 5

# Cloud-provider specific configuration.
provider:
    type: vsphere

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ray
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
    ssh_private_key: ~/ray-bootstrap-key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # You can override the resources here. Adding GPU to the head node is not recommended.
        # resources: { "CPU": 2, "Memory": 4096}
        resources: {}
    ray.worker.default:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        max_workers: 3
        # You can override the resources here. For GPU, currently only Nvidia GPU is supported. If no ESXi host can
        # fulfill the requirement, the Ray node creation will fail. The number of created nodes may not meet the desired
        # minimum number. The vSphere node provider will not distinguish the GPU type. It will just count the quantity:
        # mount the first k random available Nvidia GPU to the VM, if the user set {"GPU": k}.
        # resources: {"CPU": 2, "Memory": 4096, "GPU": 1}
        resources: {}

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

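Save this configuration file as config.yaml. You can specify many more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, refer to the cluster YAML configuration options reference.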
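
The comments in the example configs state two constraints on worker counts: min_workers should be >= 0, and max_workers takes precedence over min_workers. A rough pre-flight check along those lines might look like the sketch below. It is an illustration under stated assumptions, not Ray's own validation: `check_worker_limits` is a hypothetical helper, the config is written as a plain dict instead of being parsed from YAML, and treating a min_workers above the cluster-wide max_workers as a problem is this sketch's assumption.

```python
# Mirrors the worker settings of the minimal AWS example as a plain dict.
config = {
    "cluster_name": "aws-example-minimal",
    "max_workers": 3,
    "available_node_types": {
        "ray.head.default": {"resources": {}},
        "ray.worker.default": {"min_workers": 3, "max_workers": 3, "resources": {}},
    },
}


def check_worker_limits(config):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper for illustration; it is not Ray's own validation.
    """
    problems = []
    cluster_max = config.get("max_workers")
    for name, node_type in config.get("available_node_types", {}).items():
        min_workers = node_type.get("min_workers", 0)
        max_workers = node_type.get("max_workers")
        if min_workers < 0:
            problems.append(f"{name}: min_workers must be >= 0")
        if max_workers is not None and min_workers > max_workers:
            problems.append(f"{name}: min_workers exceeds max_workers")
        # Assumption of this sketch: the cluster-wide limit also caps min_workers.
        if cluster_max is not None and min_workers > cluster_max:
            problems.append(f"{name}: min_workers exceeds cluster max_workers")
    return problems


print(check_worker_limits(config))  # prints []
```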

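After defining the configuration, we use the Ray Cluster Launcher to start a cluster in the cloud, creating the designated "head node" and worker nodes. To start the Ray cluster, we use the Ray CLI. Run the following command: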

$ ray up -y config.yaml

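Running applications on a Ray cluster#

We are now ready to execute an application on the Ray cluster. ray.init() will now automatically connect to the newly created cluster.

As a quick example, we execute a Python command on the Ray cluster that connects to Ray and exits: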

$ ray exec config.yaml 'python -c "import ray; ray.init()"'
2022-08-10 11:23:17,093 INFO worker.py:1312 -- Connecting to existing Ray cluster at address: <remote IP address>:6379...
2022-08-10 11:23:17,097 INFO worker.py:1490 -- Connected to Ray cluster.

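You can also optionally get a remote shell using ray attach and run commands directly on the cluster. This command creates an SSH connection to the head node of the Ray cluster.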

# From a remote client:
$ ray attach config.yaml

# Now on the head node...
$ python -c "import ray; ray.init()"

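For a full reference on the Ray cluster CLI tools, refer to the cluster commands reference.

While these tools are useful for ad-hoc execution on the Ray cluster, the recommended way to execute an application on a Ray cluster is to use Ray Jobs. Check out the quickstart guide to get started!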

Deleting a Ray cluster#

To shut down your cluster, run the following command:

$ ray down -y config.yaml
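
Because cloud nodes keep costing money until they are shut down, one option is to script the whole cycle so that the teardown always runs. The sketch below chains the three CLI commands from this guide (ray up, ray exec, ray down) with a try/finally. `run_on_cluster` and `print_only` are hypothetical names introduced for illustration; the `runner` parameter is injectable so the sequence can be tried without a cloud account, and here it only prints the commands.

```python
import shlex
import subprocess


def run_on_cluster(config_path, command, runner=subprocess.run):
    """Start the cluster, run one command on it, and always tear it down.

    Hypothetical wrapper for illustration; it is not part of Ray.
    """
    runner(["ray", "up", "-y", config_path], check=True)
    try:
        runner(["ray", "exec", config_path, command], check=True)
    finally:
        # Runs even if the command above fails, so nodes are not left running.
        runner(["ray", "down", "-y", config_path], check=True)


def print_only(args, check=True):
    """Stand-in runner that prints the command instead of executing it."""
    print(shlex.join(args))


# Dry run: prints the three commands in order without touching any cloud.
run_on_cluster("config.yaml", 'python -c "import ray; ray.init()"', runner=print_only)
```

Dropping the `runner=print_only` argument would execute the commands for real through `subprocess.run`. Note that if `ray up` itself fails partway, the teardown is not attempted in this sketch, so it is worth checking the provider console in that case.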