Cluster YAML Configuration Options#
The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes. Once the cluster configuration is defined, you will need to use the Ray CLI to perform any operations such as starting and stopping the cluster.
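For orientation, here is a minimal sketch of such a file (all field values are illustrative), together with the CLI commands used to launch and tear down the cluster, shown as comments:
# minimal-cluster.yaml -- an illustrative sketch, not a complete reference
cluster_name: my-cluster
max_workers: 2
provider:
    type: aws
    region: us-west-2
auth:
    ssh_user: ubuntu
# Launch or update the cluster:  ray up minimal-cluster.yaml
# Tear the cluster down:         ray down minimal-cluster.yaml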
Syntax#
cluster_name: str
max_workers: int
upscaling_speed: float
idle_timeout_minutes: int
docker:
    docker
provider:
    provider
auth:
    auth
available_node_types:
    node_types
head_node_type: str
file_mounts:
    file_mounts
cluster_synced_files:
    - str
rsync_exclude:
    - str
rsync_filter:
    - str
initialization_commands:
    - str
setup_commands:
    - str
head_setup_commands:
    - str
worker_setup_commands:
    - str
head_start_ray_commands:
    - str
worker_start_ray_commands:
    - str
Custom types#
Docker#
image: str
head_image: str
worker_image: str
container_name: str
pull_before_run: bool
run_options:
    - str
head_run_options:
    - str
worker_run_options:
    - str
disable_automatic_runtime_detection: bool
disable_shm_size_detection: bool
Auth#
# AWS
ssh_user: str
ssh_private_key: str

# Azure
ssh_user: str
ssh_private_key: str
ssh_public_key: str

# GCP
ssh_user: str
ssh_private_key: str

# vSphere
ssh_user: str
Provider#
# AWS
type: str
region: str
availability_zone: str
cache_stopped_nodes: bool
security_group: Security Group
use_internal_ips: bool

# Azure
type: str
location: str
resource_group: str
subscription_id: str
msi_name: str
msi_resource_group: str
cache_stopped_nodes: bool
use_internal_ips: bool
use_external_head_ip: bool

# GCP
type: str
region: str
availability_zone: str
project_id: str
cache_stopped_nodes: bool
use_internal_ips: bool

# vSphere
type: str
vsphere_config: vSphere Config
Security Group#
GroupName: str
IpPermissions:
    - IpPermission
vSphere Config#
credentials: vSphere Credentials
frozen_vm: vSphere Frozen VM Configs
gpu_config: vSphere GPU Configs
vSphere Credentials#
user: str
password: str
server: str
vSphere Frozen VM Configs#
name: str
library_item: str
resource_pool: str
cluster: str
datastore: str
vSphere GPU Configs#
dynamic_pci_passthrough: bool
Node types#
The keys of the available_node_types object represent the names of the different node types.
Deleting a node type from available_node_types and updating with ray up will cause the autoscaler to scale down all nodes of that type. In particular, changing the key of a node type object will result in removal of the nodes corresponding to the old key; nodes with the new key name will then be created according to the cluster configuration and Ray resource demands.
<node_type_1_name>:
    node_config:
        Node config
    resources:
        Resources
    min_workers: int
    max_workers: int
    worker_setup_commands:
        - str
    docker:
        Node Docker
<node_type_2_name>:
    ...
...
Node config#
Cloud-provider-specific configuration of nodes of a given node type.
Modifying the node_config and updating with ray up will cause the autoscaler to scale down all existing nodes of that node type; nodes with the newly applied node_config will then be created according to the cluster configuration and Ray resource demands.
AWS: A YAML object that conforms to the EC2 create_instances API as described in the AWS docs.
GCP: A YAML object as defined in the GCP docs.
vSphere:
# The resource pool where the head node should live, if unset, will be
# the frozen VM's resource pool.
resource_pool: str
# The datastore to store the vmdk of the head node vm, if unset, will be
# the frozen VM's datastore.
datastore: str
Node Docker#
worker_image: str
pull_before_run: bool
worker_run_options:
    - str
disable_automatic_runtime_detection: bool
disable_shm_size_detection: bool
Resources#
CPU: int
GPU: int
object_store_memory: int
memory: int
<custom_resource1>: int
<custom_resource2>: int
...
File mounts#
<path1_on_remote_machine>: str # Path 1 on local machine
<path2_on_remote_machine>: str # Path 2 on local machine
...
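For example, a concrete mapping might look like the following sketch (the paths are illustrative):
file_mounts:
    # REMOTE_PATH: LOCAL_PATH
    "/home/ubuntu/project": "~/dev/project"
    "/home/ubuntu/data": "./data"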
Properties and Definitions#
cluster_name#
The name of the cluster. This is the namespace of the cluster.
Required: Yes
Importance: High
Type: String
Default: "default"
Pattern: [a-zA-Z0-9_]+
max_workers#
The maximum number of worker nodes the cluster is allowed to have at any given time.
Required: No
Importance: High
Type: Integer
Default: 2
Minimum: 0
Maximum: Unbounded
upscaling_speed#
The number of nodes allowed to be pending as a multiple of the current number of nodes. For example, if set to 1.0, the cluster can grow in size by at most 100% at any time, so if the cluster currently has 20 nodes, at most 20 pending launches are allowed. Note that although the autoscaler will scale down to min_workers (which may be 0), it will always scale up to at least 5 nodes when scaling up.
Required: No
Importance: Medium
Type: Float
Default: 1.0
Minimum: 0.0
Maximum: Unbounded
idle_timeout_minutes#
The number of minutes that need to pass before the autoscaler removes an idle worker node.
Required: No
Importance: Medium
Type: Integer
Default: 5
Minimum: 0
Maximum: Unbounded
docker#
Configure Ray to run in Docker containers.
Required: No
Importance: High
Type: Docker
Default: {}
In rare cases when Docker is not available on the system by default (e.g., bad AMI), add the following commands to initialization_commands to install it.
initialization_commands:
- curl -fsSL https://get.docker.com -o get-docker.sh
- sudo sh get-docker.sh
- sudo usermod -aG docker $USER
- sudo systemctl restart docker -f
provider#
Cloud-provider-specific configuration properties.
Required: Yes
Importance: High
Type: Provider
auth#
Authentication credentials that Ray will use to launch nodes.
Required: Yes
Importance: High
Type: Auth
available_node_types#
Tells the autoscaler the allowed node types and the resources they provide. Each node type is identified by a user-specified key.
Required: No
Importance: High
Type: Node types
Default:
available_node_types:
ray.head.default:
node_config:
InstanceType: m5.large
BlockDeviceMappings:
- DeviceName: /dev/sda1
Ebs:
VolumeSize: 140
resources: {"CPU": 2}
ray.worker.default:
node_config:
InstanceType: m5.large
InstanceMarketOptions:
MarketType: spot
resources: {"CPU": 2}
min_workers: 0
head_node_type#
The key of one of the node types in available_node_types. This node type will be used to launch the head node.
If the field head_node_type is changed and an update is executed with ray up, the currently running head node will be considered outdated. The user will receive a prompt asking to confirm scaling down the outdated head node, and the cluster will restart with a new head node. Changing the node_config of the node_type whose key is head_node_type will also result in a cluster restart after a user prompt.
Required: Yes
Importance: High
Type: String
Pattern: [a-zA-Z0-9_]+
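A minimal sketch of the relationship (the key name is illustrative):
available_node_types:
    ray.head.default:
        resources: {}
        node_config: {}
head_node_type: ray.head.default   # must match a key under available_node_types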
file_mounts#
The files or directories to copy to the head and worker nodes.
Required: No
Importance: High
Type: File mounts
Default: {}
cluster_synced_files#
A list of paths to files or directories to copy from the head node to the worker nodes. The same path on the head node will be copied to the worker nodes. This behavior is a subset of the file_mounts behavior, so in the vast majority of cases you should just use file_mounts.
Required: No
Importance: Low
Type: List of String
Default: []
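A short sketch, assuming a shared directory exists on the head node (the path is illustrative):
cluster_synced_files:
    - "/home/ubuntu/shared"   # the same path on the head node is copied to each worker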
rsync_exclude#
A list of patterns for files to exclude when running rsync up or rsync down. The filter is applied only to the source directory.
Example of a pattern in the list: **/.git/**.
Required: No
Importance: Low
Type: List of String
Default: []
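For example, the full configurations later in this document use:
rsync_exclude:
    - "**/.git"
    - "**/.git/**"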
rsync_filter#
A list of patterns for files to exclude when running rsync up or rsync down. The filter is applied to the source directory and recursively through all subdirectories.
Example of a pattern in the list: .gitignore.
Required: No
Importance: Low
Type: List of String
Default: []
initialization_commands#
A list of commands that will be run before the setup commands. If Docker is enabled, these commands will run outside the container and before Docker is set up.
Required: No
Importance: Medium
Type: List of String
Default: []
setup_commands#
A list of commands to run to set up nodes. These commands will always run on the head and worker nodes, and will be merged with the head setup commands for the head node and with the worker setup commands for worker nodes.
Required: No
Importance: Medium
Type: List of String
Default:
# Default setup_commands:
setup_commands:
- echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc
- pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
Setup commands should ideally be idempotent (i.e., they can be run multiple times without changing the result); this allows Ray to safely update nodes after they have been created. Commands can usually be made idempotent with small modifications; for example, git clone foo can be rewritten as test -e foo || git clone foo, which first checks whether the repo has already been cloned.
Setup commands are run sequentially but separately. For example, if you are using anaconda, you need to run conda activate env && pip install -U ray because splitting the command into two setup commands will not work.
Ideally, you should avoid using setup_commands by creating a Docker image with all the dependencies preinstalled, to minimize startup time.
Tip: if you also want to run apt-get commands during setup, add the following list of commands:
setup_commands:
    - sudo pkill -9 apt-get || true
    - sudo pkill -9 dpkg || true
    - sudo dpkg --configure -a
head_setup_commands#
A list of commands to run to set up the head node. These commands will be merged with the general setup commands.
Required: No
Importance: Low
Type: List of String
Default: []
worker_setup_commands#
A list of commands to run to set up worker nodes. These commands will be merged with the general setup commands.
Required: No
Importance: Low
Type: List of String
Default: []
head_start_ray_commands#
Commands to start Ray on the head node. You don't need to change this.
Required: No
Importance: Low
Type: List of String
Default:
head_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands#
Commands to start Ray on worker nodes. You don't need to change this.
Required: No
Importance: Low
Type: List of String
Default:
worker_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
docker.image#
The default Docker image to pull on the head and worker nodes. This can be overridden by the head_image and worker_image fields. If neither image nor (head_image and worker_image) are specified, Ray will not use Docker.
Required: Yes (if Docker is in use).
Importance: High
Type: String
The Ray project provides Docker images on DockerHub. The repository includes the following images:
rayproject/ray-ml:latest-gpu: CUDA support, includes ML dependencies.
rayproject/ray:latest-gpu: CUDA support, no ML dependencies.
rayproject/ray-ml:latest: No CUDA support, includes ML dependencies.
rayproject/ray:latest: No CUDA support, no ML dependencies.
docker.head_image#
The Docker image for the head node to override the default docker image.
Required: No
Importance: Low
Type: String
docker.worker_image#
The Docker image for worker nodes to override the default docker image.
Required: No
Importance: Low
Type: String
docker.container_name#
The name to use when starting the Docker container.
Required: Yes (if Docker is in use).
Importance: Low
Type: String
Default: ray_container
docker.pull_before_run#
If enabled, the latest version of the image will be pulled when starting Docker. If disabled, docker run will only pull the image if no cached version is present.
Required: No
Importance: Medium
Type: Boolean
Default: True
docker.run_options#
The extra options to pass to docker run.
Required: No
Importance: Medium
Type: List of String
Default: []
docker.head_run_options#
The extra options to pass to docker run for the head node only.
Required: No
Importance: Low
Type: List of String
Default: []
docker.worker_run_options#
The extra options to pass to docker run for worker nodes only.
Required: No
Importance: Low
Type: List of String
Default: []
docker.disable_automatic_runtime_detection#
If enabled, Ray will not try to use the NVIDIA Container Runtime if GPUs are present.
Required: No
Importance: Low
Type: Boolean
Default: False
docker.disable_shm_size_detection#
If enabled, Ray will not automatically specify the size of /dev/shm for the launched container, and the runtime's default value (64MiB for Docker) will be used. If --shm-size=<> is manually added to run_options, this is automatically set to True, meaning that Ray will defer to the user-provided value.
Required: No
Importance: Low
Type: Boolean
Default: False
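A sketch of pinning the shared-memory size yourself (the size is illustrative):
docker:
    image: rayproject/ray:latest
    container_name: ray_container
    run_options:
        - --shm-size=8g   # with this present, Ray defers to the user-provided value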
auth.ssh_user#
The user that Ray will authenticate with when launching new nodes.
Required: Yes
Importance: High
Type: String
auth.ssh_private_key#
AWS: The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata, and KeyName must be defined in the node config.
Required: No
Importance: Low
Type: String
Azure: The path to an existing private key for Ray to use.
Required: Yes
Importance: High
Type: String
You can use ssh-keygen -t rsa -b 4096 to generate a new ssh keypair.
GCP: The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata, and KeyName must be defined in the node config.
Required: No
Importance: Low
Type: String
vSphere: Not available. The vSphere provider expects the key to be located at the fixed path ~/ray-bootstrap-key.pem.
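For example, on AWS, a configuration that supplies its own key might look like the following sketch (the paths and key name are illustrative):
auth:
    ssh_user: ubuntu
    ssh_private_key: /path/to/your/key.pem
available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.large
            KeyName: your-key-name   # must correspond to the private key above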
auth.ssh_public_key#
AWS: Not available.
Azure: The path to an existing public key for Ray to use.
Required: Yes
Importance: High
Type: String
GCP: Not available.
vSphere: Not available.
provider.type#
AWS: The cloud service provider. Must be set to aws for AWS.
Required: Yes
Importance: High
Type: String
Azure: The cloud service provider. Must be set to azure for Azure.
Required: Yes
Importance: High
Type: String
GCP: The cloud service provider. Must be set to gcp for GCP.
Required: Yes
Importance: High
Type: String
vSphere: The cloud service provider. Must be set to vsphere for vSphere and VCF.
Required: Yes
Importance: High
Type: String
provider.region#
AWS: The region to use for deployment of the Ray cluster.
Required: Yes
Importance: High
Type: String
Default: us-west-2
Azure: Not available.
GCP: The region to use for deployment of the Ray cluster.
Required: Yes
Importance: High
Type: String
Default: us-west1
vSphere: Not available.
provider.availability_zone#
AWS: A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. Nodes will be launched in the first listed availability zone; if launching fails, the subsequent availability zones will be tried.
Required: No
Importance: Low
Type: String
Default: us-west-2a,us-west-2b
Azure: Not available.
GCP: A string specifying a comma-separated list of availability zone(s) that nodes may be launched in.
Required: No
Importance: Low
Type: String
Default: us-west1-a
vSphere: Not available.
provider.location#
AWS: Not available.
Azure: The location to use for deployment of the Ray cluster.
Required: Yes
Importance: High
Type: String
Default: westus2
GCP: Not available.
vSphere: Not available.
provider.resource_group#
AWS: Not available.
Azure: The resource group to use for deployment of the Ray cluster.
Required: Yes
Importance: High
Type: String
Default: ray-cluster
GCP: Not available.
vSphere: Not available.
provider.subscription_id#
AWS: Not available.
Azure: The subscription ID to use for deployment of the Ray cluster. If not specified, Ray will use the default from the Azure CLI.
Required: No
Importance: High
Type: String
Default: ""
GCP: Not available.
vSphere: Not available.
provider.msi_name#
AWS: Not available.
Azure: The name of the managed identity to use for deployment of the Ray cluster. If not specified, Ray will create a default user-assigned managed identity.
Required: No
Importance: Low
Type: String
Default: ray-default-msi
GCP: Not available.
vSphere: Not available.
provider.msi_resource_group#
AWS: Not available.
Azure: The name of the resource group of the managed identity used for deployment of the Ray cluster, used together with msi_name. If not specified, Ray will create a default user-assigned managed identity in the resource group specified in the provider config.
Required: No
Importance: Low
Type: String
Default: ray-cluster
GCP: Not available.
vSphere: Not available.
provider.project_id#
AWS: Not available.
Azure: Not available.
GCP: The globally unique project ID to use for deployment of the Ray cluster.
Required: Yes
Importance: Low
Type: String
Default: null
vSphere: Not available.
provider.cache_stopped_nodes#
If enabled, nodes will be stopped when the cluster scales down. If disabled, nodes will be terminated instead. Stopped nodes launch faster than terminated ones.
Required: No
Importance: Low
Type: Boolean
Default: True
provider.use_internal_ips#
If enabled, Ray will use private IP addresses for communication between nodes. This setting should be omitted if your network interfaces use public IP addresses.
If enabled, Ray CLI commands (e.g. ray up) must be run from a machine that is part of the same VPC as the cluster.
This option does not affect whether the nodes have public IP addresses; it only affects which IP addresses Ray uses. The existence of public IP addresses on the nodes is controlled by your cloud provider's configuration.
Required: No
Importance: Low
Type: Boolean
Default: False
provider.use_external_head_ip#
AWS: Not available.
Azure: If enabled, Ray will provision and use a public IP address for communication with the head node, regardless of the value of use_internal_ips. This option can be combined with use_internal_ips to avoid provisioning excess public IPs for worker nodes (i.e., communicate among nodes using private IPs, but provision a public IP only for head node communication). If use_internal_ips is False, this option has no effect.
Required: No
Importance: Low
Type: Boolean
Default: False
GCP: Not available.
vSphere: Not available.
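A sketch of combining the two options on Azure, so that only the head node gets a public IP (the values are illustrative):
provider:
    type: azure
    location: westus2
    resource_group: ray-cluster
    use_internal_ips: True
    use_external_head_ip: True   # workers communicate over private IPs only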
provider.security_group#
AWS: A security group that can be used to specify custom inbound rules for the cluster nodes.
Required: No
Importance: Medium
Type: Security Group
provider.vsphere_config#
AWS: Not available.
Azure: Not available.
GCP: Not available.
vSphere: The vSphere configurations used to connect to the vCenter Server. If not configured, the VSPHERE_* environment variables will be used.
Required: No
Importance: Low
Type: vSphere Config
security_group.GroupName#
The name of the security group. This name must be unique within the VPC.
Required: No
Importance: Low
Type: String
Default: "ray-autoscaler-{cluster-name}"
security_group.IpPermissions#
The inbound rules associated with the security group.
Required: No
Importance: Medium
Type: IpPermission
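As an illustration, here is a sketch that opens SSH ingress via a custom security group (the rule values are illustrative; IpPermission follows the EC2 data model):
provider:
    type: aws
    region: us-west-2
    security_group:
        GroupName: my-ray-security-group
        IpPermissions:
            - FromPort: 22
              ToPort: 22
              IpProtocol: tcp
              IpRanges:
                  - CidrIp: 0.0.0.0/0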
vsphere_config.credentials#
The credentials to connect to the vSphere vCenter Server.
Required: No
Importance: Low
Type: vSphere Credentials
vsphere_config.credentials.user#
The username to connect to the vCenter Server.
Required: No
Importance: Low
Type: String
vsphere_config.credentials.password#
The password of the user to connect to the vCenter Server.
Required: No
Importance: Low
Type: String
vsphere_config.credentials.server#
The address of the vSphere vCenter Server.
Required: No
Importance: Low
Type: String
vsphere_config.frozen_vm#
The configuration for the frozen VM(s).
If a frozen VM already exists, then library_item should not be set. Either an existing frozen VM should be specified by name, or the name of a resource pool of frozen VMs on every ESXi (https://docs.vmware.com/en/VMware-vSphere/index.html) host should be specified by resource_pool.
If the frozen VM(s) should be deployed from an OVF template, library_item must be set to point to an OVF template (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-AFEDC48B-C96F-4088-9C1F-4F0A30E965DE.html) in the content library. In this case, name must be set to indicate the name or the name prefix of the frozen VM(s). Then, either resource_pool should be set to indicate that a set of frozen VMs will be created on each ESXi host of the resource pool, or cluster should be set to indicate that a single frozen VM will be created in the vSphere cluster. Configuring datastore (https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-D5AB2BAD-C69A-4B8D-B468-25D86B8D39CE.html) is mandatory in this case.
Valid examples:
Deploy from an OVF template and run ray up on the frozen VM:
frozen_vm:
    name: single-frozen-vm
    library_item: frozen-vm-template
    cluster: vsanCluster
    datastore: vsanDatastore
Run ray up on an existing frozen VM:
frozen_vm:
    name: existing-single-frozen-vm
Deploy from an OVF template and run ray up on a resource pool of frozen VMs:
frozen_vm:
    name: frozen-vm-prefix
    library_item: frozen-vm-template
    resource_pool: frozen-vm-resource-pool
    datastore: vsanDatastore
Run ray up on an existing resource pool of frozen VMs:
frozen_vm:
    resource_pool: frozen-vm-resource-pool
Cases other than the examples above are invalid.
Required: Yes
Importance: High
Type: vSphere Frozen VM Configs
vsphere_config.frozen_vm.name#
The name or the name prefix of the frozen VM.
Can only be unset when resource_pool is set and pointing to an existing resource pool of frozen VMs.
Required: No
Importance: Medium
Type: String
vsphere_config.frozen_vm.library_item#
The library item (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-D3DD122F-16A5-4F36-8467-97994A854B16.html#GUID-D3DD122F-16A5-4F36-8467-97994A854B16) of the OVF template of the frozen VM. If set, the frozen VM or a set of frozen VMs will be deployed from the OVF template specified by library_item. Otherwise, the frozen VM(s) should already exist.
Visit VM Packer for Ray (vmware-ai-labs/vm-packer-for-ray) to learn how to create an OVF template for frozen VMs.
Required: No
Importance: Low
Type: String
vsphere_config.frozen_vm.resource_pool#
The name of the resource pool of frozen VMs; can point to an existing resource pool of frozen VMs. Otherwise, library_item must be specified, and a set of frozen VMs will be deployed on each ESXi host.
The frozen VMs will be named "{frozen_vm.name}-{the vm's ip address}".
Required: No
Importance: Medium
Type: String
vsphere_config.frozen_vm.cluster#
The vSphere cluster name; takes effect only when library_item is set and resource_pool is not set. Indicates that a single frozen VM will be deployed on the vSphere cluster from the OVF template.
Required: No
Importance: Medium
Type: String
vsphere_config.frozen_vm.datastore#
The target vSphere datastore name for storing the virtual machine files of the frozen VM(s) to be deployed from the OVF template. Takes effect only when library_item is set. If resource_pool is also set, this datastore must be a shared datastore among the ESXi hosts.
Required: No
Importance: Low
Type: String
vsphere_config.gpu_config#
vsphere_config.gpu_config.dynamic_pci_passthrough#
The switch controlling the way a GPU is bound from the ESXi host to the Ray node VM. The default value is False, which means regular PCI passthrough. If set to True, dynamic PCI passthrough (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-esxi-host-client/GUID-2B6D43A6-9598-47C4-A2E7-5924E3367BB6.html) will be enabled for the GPU. A VM with a dynamic-PCI-passthrough GPU can still support vSphere DRS (https://www.vmware.com/products/vsphere/drs-dpm.html).
Required: No
Importance: Low
Type: Boolean
available_node_types.<node_type_name>.node_type.node_config#
The configuration used to launch nodes of a given node type on the cloud service provider. Among other things, this will specify the instance type to launch.
Required: Yes
Importance: High
Type: Node config
available_node_types.<node_type_name>.node_type.resources#
The resources that a node type provides, which enables the autoscaler to automatically select the right type of nodes to launch given the resource demands of the application. The resources specified will be automatically passed to the ray start command for the node via an environment variable. If not provided, the autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. For more information, see also the resource demand scheduler.
Required: Yes (except for AWS/K8s)
Importance: High
Type: Resources
Default: {}
In some cases, adding special nodes without any resources may be desirable. Such nodes can be used as a driver that connects to the cluster to launch jobs. In order to manually add a node to an autoscaled cluster, the ray-cluster-name tag should be set and the ray-node-type tag should be set to unmanaged. Unmanaged nodes can be created by setting the resources to {} and the maximum workers to 0. The autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes.
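A sketch of such an unmanaged node type (the key name is illustrative):
available_node_types:
    ray.unmanaged.driver:
        resources: {}
        min_workers: 0
        max_workers: 0   # the autoscaler will not start, stop, or update these nodes
        node_config: {}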
available_node_types.<node_type_name>.node_type.min_workers#
The minimum number of workers to maintain for this node type regardless of utilization.
Required: No
Importance: High
Type: Integer
Default: 0
Minimum: 0
Maximum: Unbounded
available_node_types.<node_type_name>.node_type.max_workers#
The maximum number of workers to have in the cluster for this node type regardless of utilization. This takes precedence over minimum workers. By default, the number of workers of a node type is unbounded, constrained only by the cluster-wide max_workers. (Prior to Ray 1.3.0, the default value for this field was 0.)
Note that the default value of max workers is 0 for nodes of type head_node_type.
Required: No
Importance: High
Type: Integer
Default: cluster-wide max_workers
Minimum: 0
Maximum: cluster-wide max_workers
available_node_types.<node_type_name>.node_type.worker_setup_commands#
A list of commands to run to set up worker nodes of this type. These commands will replace the general worker setup commands for the node.
Required: No
Importance: Low
Type: List of String
Default: []
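For instance, a sketch that overrides the worker setup for a single node type (the key name and command are illustrative):
available_node_types:
    ray.worker.gpu:
        worker_setup_commands:
            - pip install -U torch   # replaces the general worker setup commands for this type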
available_node_types.<node_type_name>.node_type.resources.CPU#
AWS: The number of CPUs made available by this node. If not configured, the autoscaler can automatically detect them only for AWS/Kubernetes cloud providers.
Required: Yes (except for AWS/K8s)
Importance: High
Type: Integer
Azure: The number of CPUs made available by this node.
Required: Yes
Importance: High
Type: Integer
GCP: The number of CPUs made available by this node.
Required: No
Importance: High
Type: Integer
vSphere: The number of CPUs made available by this node. If not configured, the node will use the same settings as the frozen VM.
Required: No
Importance: High
Type: Integer
available_node_types.<node_type_name>.node_type.resources.GPU#
AWS: The number of GPUs made available by this node. If not configured, the autoscaler can automatically detect them only for AWS/Kubernetes cloud providers.
Required: No
Importance: Low
Type: Integer
Azure: The number of GPUs made available by this node.
Required: No
Importance: High
Type: Integer
GCP: The number of GPUs made available by this node.
Required: No
Importance: High
Type: Integer
vSphere: The number of GPUs made available by this node.
Required: No
Importance: High
Type: Integer
available_node_types.<node_type_name>.node_type.resources.memory#
AWS: The memory in bytes allocated for Python worker heap memory on the node. If not configured, the autoscaler will automatically detect the amount of RAM on AWS/Kubernetes nodes and allocate 70% of it for heap memory.
Required: No
Importance: Low
Type: Integer
Azure: The memory in bytes allocated for Python worker heap memory on the node.
Required: No
Importance: High
Type: Integer
GCP: The memory in bytes allocated for Python worker heap memory on the node.
Required: No
Importance: High
Type: Integer
vSphere: The memory in megabytes allocated for Python worker heap memory on the node. If not configured, the node will use the same memory settings as the frozen VM.
Required: No
Importance: High
Type: Integer
available_node_types.<node_type_name>.node_type.resources.object-store-memory#
AWS: The memory in bytes allocated for the object store on the node. If not configured, the autoscaler will automatically detect the amount of RAM on AWS/Kubernetes nodes and allocate 30% of it for the object store.
Required: No
Importance: Low
Type: Integer
Azure: The memory in bytes allocated for the object store on the node.
Required: No
Importance: High
Type: Integer
GCP: The memory in bytes allocated for the object store on the node.
Required: No
Importance: High
Type: Integer
vSphere: The memory in bytes allocated for the object store on the node.
Required: No
Importance: High
Type: Integer
available_node_types.<node_type_name>.docker#
A set of overrides to the top-level Docker configuration.
Required: No
Importance: Low
Type: docker
Default: {}
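A sketch of a per-node-type override, e.g. running CPU workers under a GPU head (the image choice mirrors the commented examples later in this document):
available_node_types:
    ray.worker.default:
        docker:
            worker_image: rayproject/ray-ml:latest-cpu   # overrides the top-level image for this type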
Examples#
Minimal configuration#
# A unique identifier for the head node and workers of this cluster.
cluster_name: aws-example-minimal
# Cloud-provider specific configuration.
provider:
type: aws
region: us-west-2
# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 3
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
ray.head.default:
# The node type's CPU and GPU resources are auto-detected based on AWS instance type.
# If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
# You can also set custom resources.
# For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
# resources: {"CPU": 1, "GPU": 1, "custom": 5}
resources: {}
# Provider-specific config for this node type, e.g., instance type. By default
# Ray auto-configures unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
InstanceType: m5.large
ray.worker.default:
# The minimum number of worker nodes of this type to launch.
# This number should be >= 0.
min_workers: 3
# The maximum number of worker nodes of this type to launch.
# This parameter takes precedence over min_workers.
max_workers: 3
# The node type's CPU and GPU resources are auto-detected based on AWS instance type.
# If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
# You can also set custom resources.
# For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
# resources: {"CPU": 1, "GPU": 1, "custom": 5}
resources: {}
# Provider-specific config for this node type, e.g., instance type. By default
# Ray auto-configures unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
InstanceType: m5.large
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal
# The maximum number of worker nodes to launch in addition to the head
# node. min_workers defaults to 0.
max_workers: 1
# Cloud-provider specific configuration.
provider:
type: azure
location: westus2
resource_group: ray-cluster
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: ubuntu
# you must specify paths to matching private and public key pair files
# use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
ssh_private_key: ~/.ssh/id_rsa
# changes to this should match what is specified in file_mounts
ssh_public_key: ~/.ssh/id_rsa.pub
auth:
ssh_user: ubuntu
cluster_name: minimal
provider:
availability_zone: us-west1-a
project_id: null # TODO: set your GCP project ID here
region: us-west1
type: gcp
# A unique identifier for the head node and workers of this cluster.
cluster_name: default
# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 5
# Cloud-provider specific configuration.
provider:
type: vsphere
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: ray
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
ssh_private_key: ~/ray-bootstrap-key.pem
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
ray.head.default:
# You can override the resources here. Adding GPU to the head node is not recommended.
# resources: { "CPU": 2, "Memory": 4096}
resources: {}
ray.worker.default:
# The minimum number of nodes of this type to launch.
# This number should be >= 0.
min_workers: 1
max_workers: 3
# You can override the resources here. For GPU, currently only Nvidia GPU is supported. If no ESXi host can
# fulfill the requirement, the Ray node creation will fail. The number of created nodes may not meet the desired
# minimum number. The vSphere node provider will not distinguish the GPU type. It will just count the quantity:
# mount the first k random available Nvidia GPU to the VM, if the user set {"GPU": k}.
# resources: {"CPU": 2, "Memory": 4096, "GPU": 1}
resources: {}
# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default
Full configuration#
# A unique identifier for the head node and workers of this cluster.
cluster_name: default
# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
# image: rayproject/ray:latest-cpu # use this one if you don't need ML dependencies, it's faster to pull
container_name: "ray_container"
# If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
# if no cached version is present.
pull_before_run: True
run_options: # Extra options to pass into "docker run"
- --ulimit nofile=65536:65536
# Example of running a GPU head with CPU workers
# head_image: "rayproject/ray-ml:latest-gpu"
# Allow Ray to automatically detect GPUs
# worker_image: "rayproject/ray-ml:latest-cpu"
# worker_run_options: []
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5
# Cloud-provider specific configuration.
provider:
type: aws
region: us-west-2
# Availability zone(s), comma-separated, that nodes may be launched in.
# Nodes will be launched in the first listed availability zone and will
# be tried in the subsequent availability zones if launching fails.
availability_zone: us-west-2a,us-west-2b
# Whether to allow node reuse. If set to False, nodes will be terminated
# instead of stopped.
cache_stopped_nodes: True # If not present, the default is True.
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
# ssh_private_key: /path/to/your/key.pem
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
ray.head.default:
# The node type's CPU and GPU resources are auto-detected based on AWS instance type.
# If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
# You can also set custom resources.
# For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
# resources: {"CPU": 1, "GPU": 1, "custom": 5}
resources: {}
# Provider-specific config for this node type, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
InstanceType: m5.large
# Default AMI for us-west-2.
# Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
# for default images for other zones.
ImageId: ami-0387d929287ab193e
# You can provision additional disk space with a conf as follows
BlockDeviceMappings:
- DeviceName: /dev/sda1
Ebs:
VolumeSize: 140
VolumeType: gp3
# Additional options in the boto docs.
ray.worker.default:
# The minimum number of worker nodes of this type to launch.
# This number should be >= 0.
min_workers: 1
# The maximum number of worker nodes of this type to launch.
# This takes precedence over min_workers.
max_workers: 2
# The node type's CPU and GPU resources are auto-detected based on AWS instance type.
# If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
# You can also set custom resources.
# For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
# resources: {"CPU": 1, "GPU": 1, "custom": 5}
resources: {}
# Provider-specific config for this node type, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
node_config:
InstanceType: m5.large
# Default AMI for us-west-2.
# Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
# for default images for other zones.
ImageId: ami-0387d929287ab193e
# Run workers on spot by default. Comment this out to use on-demand.
# NOTE: If relying on spot instances, it is best to specify multiple different instance
# types to avoid interruption when one instance type is experiencing heightened demand.
# Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
InstanceMarketOptions:
MarketType: spot
# Additional options can be found in the boto docs, e.g.
# SpotOptions:
# MaxPrice: MAX_HOURLY_PRICE
# Additional options in the boto docs.
# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# List of shell commands to run to set up nodes.
setup_commands: []
# Note: if you're developing Ray, you probably want to create a Docker image that
# has your Ray repo pre-cloned. Then, you can replace the pip installs
# below with a git checkout <your_sha> (and possibly a recompile).
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
# Custom commands that will be run on the head node after common setup.
head_setup_commands: []
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
# A unique identifier for the head node and workers of this cluster.
cluster_name: default
# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
# image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
container_name: "ray_container"
# If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
# if no cached version is present.
pull_before_run: True
run_options: # Extra options to pass into "docker run"
- --ulimit nofile=65536:65536
# Example of running a GPU head with CPU workers
# head_image: "rayproject/ray-ml:latest-gpu"
# Allow Ray to automatically detect GPUs
# worker_image: "rayproject/ray-ml:latest-cpu"
# worker_run_options: []
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5
# Cloud-provider specific configuration.
provider:
type: azure
# https://azure.microsoft.com/en-us/global-infrastructure/locations
location: westus2
resource_group: ray-cluster
# set subscription id otherwise the default from az cli will be used
# subscription_id: 00000000-0000-0000-0000-000000000000
# set unique subnet mask or a random mask will be used
# subnet_mask: 10.0.0.0/16
# set unique id for resources in this cluster
# if not set a default id will be generated based on the resource group and cluster name
# unique_id: RAY1
# set managed identity name and resource group
# if not set, a default user-assigned identity will be generated in the resource group specified above
# msi_name: ray-cluster-msi
# msi_resource_group: other-rg
# Set provisioning and use of public/private IPs for head and worker nodes. If both options below are true,
# only the head node will have a public IP address provisioned.
# use_internal_ips: True
# use_external_head_ip: True
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: ubuntu
# you must specify paths to matching private and public key pair files
# use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
ssh_private_key: ~/.ssh/id_rsa
# changes to this should match what is specified in file_mounts
ssh_public_key: ~/.ssh/id_rsa.pub
# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
ray.head.default:
# The resources provided by this node type.
resources: {"CPU": 2}
# Provider-specific config, e.g. instance type.
node_config:
azure_arm_parameters:
vmSize: Standard_D2s_v3
# List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
imagePublisher: microsoft-dsvm
imageOffer: ubuntu-1804
imageSku: 1804-gen2
imageVersion: latest
ray.worker.default:
# The minimum number of worker nodes of this type to launch.
# This number should be >= 0.
min_workers: 0
# The maximum number of worker nodes of this type to launch.
# This takes precedence over min_workers.
max_workers: 2
# The resources provided by this node type.
resources: {"CPU": 2}
# Provider-specific config, e.g. instance type.
node_config:
azure_arm_parameters:
vmSize: Standard_D2s_v3
# List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
imagePublisher: microsoft-dsvm
imageOffer: ubuntu-1804
imageSku: 1804-gen2
imageVersion: latest
# optionally set priority to use Spot instances
priority: Spot
# set a maximum price for spot instances if desired
# billingProfile:
# maxPrice: -1
# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
"~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub"
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
# enable docker setup
- sudo usermod -aG docker $USER || true
- sleep 10 # delay to avoid docker permission denied errors
# get rid of annoying Ubuntu message
- touch ~/.sudo_as_admin_successful
# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: []
# Note: if you're developing Ray, you probably want to create a Docker image that
# has your Ray repo pre-cloned. Then, you can replace the pip installs
# below with a git checkout <your_sha> (and possibly a recompile).
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"
# Custom commands that will be run on the head node after common setup.
head_setup_commands:
- pip install -U azure-cli-core==2.29.1 azure-identity==1.7.0 azure-mgmt-compute==23.1.0 azure-mgmt-network==19.0.0 azure-mgmt-resource==20.0.0 msrestazure==0.6.4
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
# A unique identifier for the head node and workers of this cluster.
cluster_name: default
# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
# image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
container_name: "ray_container"
# If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
# if no cached version is present.
pull_before_run: True
run_options: # Extra options to pass into "docker run"
- --ulimit nofile=65536:65536
# Example of running a GPU head with CPU workers
# head_image: "rayproject/ray-ml:latest-gpu"
# Allow Ray to automatically detect GPUs
# worker_image: "rayproject/ray-ml:latest-cpu"
# worker_run_options: []
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5
# Cloud-provider specific configuration.
provider:
type: gcp
region: us-west1
availability_zone: us-west1-a
project_id: null # Globally unique project id
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
# ssh_private_key: /path/to/your/key.pem
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
ray_head_default:
# The resources provided by this node type.
resources: {"CPU": 2}
# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as subnets and ssh-keys.
# For more documentation on available fields, see:
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
node_config:
machineType: n1-standard-2
disks:
- boot: true
autoDelete: true
type: PERSISTENT
initializeParams:
diskSizeGb: 50
# See https://cloud.google.com/compute/docs/images for more images
sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
# Additional options can be found in the compute docs at
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
# If the network interface is specified as below in both head and worker
# nodes, the manual network config is used. Otherwise an existing subnet is
# used. To use a shared subnet, ask the subnet owner to grant permission
# for 'compute.subnetworks.use' to the ray autoscaler account...
# networkInterfaces:
# - kind: compute#networkInterface
# subnetwork: path/to/subnet
# aliasIpRanges: []
ray_worker_small:
# The minimum number of worker nodes of this type to launch.
# This number should be >= 0.
min_workers: 1
# The maximum number of worker nodes of this type to launch.
# This takes precedence over min_workers.
max_workers: 2
# The resources provided by this node type.
resources: {"CPU": 2}
# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as subnets and ssh-keys.
# For more documentation on available fields, see:
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
node_config:
machineType: n1-standard-2
disks:
- boot: true
autoDelete: true
type: PERSISTENT
initializeParams:
diskSizeGb: 50
# See https://cloud.google.com/compute/docs/images for more images
sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
# Run workers on preemptible instances by default.
# Comment this out to use on-demand.
scheduling:
- preemptible: true
# Un-Comment this to launch workers with the Service Account of the Head Node
# serviceAccounts:
# - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
# scopes:
# - https://www.googleapis.com/auth/cloud-platform
# Additional options can be found in the compute docs at
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# List of shell commands to run to set up nodes.
setup_commands: []
# Note: if you're developing Ray, you probably want to create a Docker image that
# has your Ray repo pre-cloned. Then, you can replace the pip installs
# below with a git checkout <your_sha> (and possibly a recompile).
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
# Custom commands that will be run on the head node after common setup.
head_setup_commands:
- pip install google-api-python-client==1.7.8
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- >-
ray start
--head
--port=6379
--object-manager-port=8076
--autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- >-
ray start
--address=$RAY_HEAD_IP:6379
--object-manager-port=8076
# A unique identifier for the head node and workers of this cluster.
cluster_name: default
# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 5
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
image: "rayproject/ray-ml:latest"
# image: rayproject/ray:latest # use this one if you don't need ML dependencies, it's faster to pull
container_name: "ray_container"
# If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
# if no cached version is present.
pull_before_run: True
run_options: # Extra options to pass into "docker run"
- --ulimit nofile=65536:65536
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5
# Cloud-provider specific configuration.
provider:
type: vsphere
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: ray
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
ssh_private_key: ~/ray-bootstrap-key.pem
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
ray.head.default:
# You can override the resources here. Adding GPU to the head node is not recommended.
# resources: { "CPU": 2, "Memory": 4096}
resources: {}
node_config: {"vm_class": "best-effort-xlarge"}
worker:
# The minimum number of nodes of this type to launch.
# This number should be >= 0.
min_workers: 1
max_workers: 3
# You can override the resources here. For GPU, currently only Nvidia GPU is supported. If no ESXi host can
# fulfill the requirement, the Ray node creation will fail. The number of created nodes may not meet the desired
# minimum number. The vSphere node provider will not distinguish the GPU type. It will just count the quantity:
# mount the first k random available Nvidia GPU to the VM, if the user set {"GPU": k}.
# resources: {"CPU": 2, "Memory": 4096, "GPU": 1}
resources: {}
node_config: {"vm_class": "best-effort-xlarge"}
worker_2:
# The minimum number of nodes of this type to launch.
# This number should be >= 0.
min_workers: 1
max_workers: 2
# You can override the resources here. For GPU, currently only Nvidia GPU is supported. If no ESXi host can
# fulfill the requirement, the Ray node creation will fail. The number of created nodes may not meet the desired
# minimum number. The vSphere node provider will not distinguish the GPU type. It will just count the quantity:
# mount the first k random available Nvidia GPU to the VM, if the user set {"GPU": k}.
# resources: {"CPU": 2, "Memory": 4096, "GPU": 1}
resources: {}
node_config: {"vm_class": "best-effort-xlarge"}
# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude: []
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter: []
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# List of shell commands to run to set up nodes.
setup_commands: []
# Custom commands that will be run on the head node after common setup.
head_setup_commands:
- pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git'
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
TPU Configuration#
It is possible to use TPU VMs on GCP. Currently, TPU pods (TPUs other than v2-8, v3-8 and v4-8) are not supported.
Before using a configuration with TPUs, ensure that the TPU API is enabled for your GCP project.
# A unique identifier for the head node and workers of this cluster.
cluster_name: tputest
# The maximum number of worker nodes to launch in addition to the head node.
max_workers: 7
available_node_types:
ray_head_default:
resources: {"TPU": 1} # use TPU custom resource in your code
node_config:
# Only v2-8, v3-8 and v4-8 accelerator types are currently supported.
# Support for TPU pods will be added in the future.
acceleratorType: v2-8
runtimeVersion: v2-alpha
schedulingConfig:
# Set to false to use non-preemptible TPUs
preemptible: false
ray_tpu:
min_workers: 1
resources: {"TPU": 1} # use TPU custom resource in your code
node_config:
acceleratorType: v2-8
runtimeVersion: v2-alpha
schedulingConfig:
preemptible: true
provider:
type: gcp
region: us-central1
availability_zone: us-central1-b
project_id: null # Replace this with your GCP project ID.
setup_commands:
- sudo apt install python-is-python3 -y
- pip3 install --upgrade pip
- pip3 install -U "ray[default]"
# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default