Cluster YAML Configuration Options#

The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes. Once the cluster configuration is defined, you will need to use the Ray CLI to perform any operations such as starting and stopping the cluster.
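The Ray CLI drives that lifecycle. For example, assuming the configuration below is saved as cluster.yaml (the filename is arbitrary):

```shell
# Create or update the cluster described by the config.
ray up cluster.yaml

# Open an interactive shell on the head node.
ray attach cluster.yaml

# Tear the cluster down.
ray down cluster.yaml
```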

Syntax#

cluster_name: str
max_workers: int
upscaling_speed: float
idle_timeout_minutes: int
docker:
    docker
provider:
    provider
auth:
    auth
available_node_types:
    node_types
head_node_type: str
file_mounts:
    file_mounts
cluster_synced_files:
    - str
rsync_exclude:
    - str
rsync_filter:
    - str
initialization_commands:
    - str
setup_commands:
    - str
head_setup_commands:
    - str
worker_setup_commands:
    - str
head_start_ray_commands:
    - str
worker_start_ray_commands:
    - str

Custom types#

Docker#

image: str
head_image: str
worker_image: str
container_name: str
pull_before_run: bool
run_options:
    - str
head_run_options:
    - str
worker_run_options:
    - str
disable_automatic_runtime_detection: bool
disable_shm_size_detection: bool

Auth#

ssh_user: str

Provider#

Security Group#

vSphere Config#

vSphere Credentials#

user: str
password: str
server: str

vSphere Frozen VM Configs#

name: str
library_item: str
resource_pool: str
cluster: str
datastore: str

vSphere GPU Configs#

Node types#

The available_node_types object’s keys represent the names of the different node types.

Deleting a node type from available_node_types and updating with ray up will cause the autoscaler to scale down all nodes of that type. In particular, changing the key of a node type object will result in removal of nodes corresponding to the old key; nodes with the new key name will then be created according to cluster configuration and Ray resource demands.

<node_type_1_name>:
    node_config:
        Node config
    resources:
        Resources
    min_workers: int
    max_workers: int
    worker_setup_commands:
        - str
    docker:
        Node Docker
<node_type_2_name>:
    ...
...

Node config#

Cloud-specific configuration for nodes of a given node type.

Modifying the node_config and updating with ray up will cause the autoscaler to scale down all existing nodes of the node type; nodes with the newly applied node_config will then be created according to cluster configuration and Ray resource demands.

AWS: A YAML object which conforms to the EC2 create_instances API in the AWS docs.

Azure: A YAML object as defined in the deployment template whose resources are defined in the Azure docs.

GCP: A YAML object as defined in the GCP docs.

vSphere: A YAML object with the following fields:

# The resource pool where the head node should live, if unset, will be
# the frozen VM's resource pool.
resource_pool: str
# The datastore to store the vmdk of the head node vm, if unset, will be
# the frozen VM's datastore.
datastore: str

Node Docker#

worker_image: str
pull_before_run: bool
worker_run_options:
    - str
disable_automatic_runtime_detection: bool
disable_shm_size_detection: bool

Resources#

CPU: int
GPU: int
object_store_memory: int
memory: int
<custom_resource1>: int
<custom_resource2>: int
...

File mounts#

<path1_on_remote_machine>: str # Path 1 on local machine
<path2_on_remote_machine>: str # Path 2 on local machine
...
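For instance, a concrete file_mounts block might look like the following (both paths are purely illustrative):

```yaml
file_mounts: {
    "/tmp/current_project": "/home/me/current_project",
    "/etc/app.conf": "/home/me/app.conf",
}
```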

Properties and Definitions#

cluster_name#

The name of the cluster. This is the namespace of the cluster.

  • Required: Yes

  • Importance: High

  • Type: String

  • Default: “default”

  • Pattern: [a-zA-Z0-9_]+

max_workers#

The maximum number of workers the cluster will have at any given time.

  • Required: No

  • Importance: High

  • Type: Integer

  • Default: 2

  • Minimum: 0

  • Maximum: Unbounded

upscaling_speed#

The number of nodes allowed to be pending as a multiple of the current number of nodes. For example, if set to 1.0, the cluster can grow in size by at most 100% at any time, so if the cluster currently has 20 nodes, at most 20 pending launches are allowed. Note that although the autoscaler will scale down to min_workers (which could be 0), it will always scale up to 5 nodes at a minimum when scaling up.

  • Required: No

  • Importance: Medium

  • Type: Float

  • Default: 1.0

  • Minimum: 0.0

  • Maximum: Unbounded
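The pending-launch budget described above can be sketched in a few lines (this is an illustration of the documented rule, not the autoscaler's actual implementation; the function name is made up):

```python
import math

def allowed_pending_launches(upscaling_speed: float, current_nodes: int) -> int:
    # The cluster may grow by upscaling_speed * current size at once,
    # but the autoscaler always allows at least 5 pending launches.
    return max(5, math.ceil(upscaling_speed * current_nodes))

# With upscaling_speed=1.0 and 20 nodes, up to 20 launches may be pending.
print(allowed_pending_launches(1.0, 20))
```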

idle_timeout_minutes#

The number of minutes that need to pass before an idle worker node is removed by the Autoscaler.

  • Required: No

  • Importance: Medium

  • Type: Integer

  • Default: 5

  • Minimum: 0

  • Maximum: Unbounded

docker#

Configure Ray to run in Docker containers.

  • Required: No

  • Importance: High

  • Type: Docker

  • Default: {}

In rare cases when Docker is not available on the system by default (e.g., bad AMI), add the following commands to initialization_commands to install it.

initialization_commands:
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f

provider#

The cloud provider-specific configuration properties.

  • Required: Yes

  • Importance: High

  • Type: Provider

auth#

Authentication credentials that Ray will use to launch nodes.

  • Required: Yes

  • Importance: High

  • Type: Auth

available_node_types#

Tells the autoscaler the allowed node types and the resources they provide. Each node type is identified by a user-specified key.

  • Required: No

  • Importance: High

  • Type: Node types

  • Default:

available_node_types:
  ray.head.default:
      node_config:
        InstanceType: m5.large
        BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                  VolumeSize: 140
      resources: {"CPU": 2}
  ray.worker.default:
      node_config:
        InstanceType: m5.large
        InstanceMarketOptions:
            MarketType: spot
      resources: {"CPU": 2}
      min_workers: 0

head_node_type#

The key for one of the node types in available_node_types. This node type will be used to launch the head node.

If the field head_node_type is changed and an update is executed with ray up, the currently running head node will be considered outdated. The user will receive a prompt asking to confirm scale-down of the outdated head node, and the cluster will restart with a new head node. Changing the node_config of the node_type with key head_node_type will also result in cluster restart after a user prompt.

  • Required: Yes

  • Importance: High

  • Type: String

  • Pattern: [a-zA-Z0-9_]+

file_mounts#

The files or directories to copy to the head and worker nodes.

  • Required: No

  • Importance: High

  • Type: File mounts

  • Default: []

cluster_synced_files#

A list of paths to the files or directories to copy from the head node to the worker nodes. The same path on the head node will be copied to the worker node. This behavior is a subset of the file_mounts behavior, so in the vast majority of cases one should just use file_mounts.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

rsync_exclude#

A list of patterns for files to exclude when running rsync up or rsync down. The filter is applied on the source directory only.

Example for a pattern in the list: **/.git/**.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

rsync_filter#

A list of patterns for files to exclude when running rsync up or rsync down. The filter is applied on the source directory and recursively through all subdirectories.

Example for a pattern in the list: .gitignore.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

initialization_commands#

A list of commands that will be run before the setup commands. If Docker is enabled, these commands will run outside the container and before Docker is setup.

  • Required: No

  • Importance: Medium

  • Type: List of String

  • Default: []

setup_commands#

A list of commands to run to set up nodes. These commands will always run on the head and worker nodes and will be merged with head setup commands for head and with worker setup commands for workers.

  • Required: No

  • Importance: Medium

  • Type: List of String

  • Default:

# Default setup_commands:
setup_commands:
  - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc
  - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
  • Setup commands should ideally be idempotent (i.e., can be run multiple times without changing the result); this allows Ray to safely update nodes after they have been created. You can usually make commands idempotent with small modifications, e.g. git clone foo can be rewritten as test -e foo || git clone foo which checks if the repo is already cloned first.

  • Setup commands are run sequentially but separately. For example, if you are using anaconda, you need to run conda activate env && pip install -U ray because splitting the command into two setup commands will not work.

  • Ideally, you should avoid using setup_commands by creating a docker image with all the dependencies preinstalled to minimize startup time.

  • Tip: if you also want to run apt-get commands during setup add the following list of commands

    setup_commands:
      - sudo pkill -9 apt-get || true
      - sudo pkill -9 dpkg || true
      - sudo dpkg --configure -a
    

head_setup_commands#

A list of commands to run to set up the head node. These commands will be merged with the general setup commands.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

worker_setup_commands#

A list of commands to run to set up the worker nodes. These commands will be merged with the general setup commands.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

head_start_ray_commands#

Commands to start ray on the head node. You don’t need to change this.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default:

head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands#

Command to start ray on worker nodes. You don’t need to change this.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default:

worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

docker.image#

The default Docker image to pull in the head and worker nodes. This can be overridden by the head_image and worker_image fields. If neither image nor (head_image and worker_image) are specified, Ray will not use Docker.

  • Required: Yes (If Docker is in use.)

  • Importance: High

  • Type: String

The Ray project provides Docker images on DockerHub. The repository includes the following images:

  • rayproject/ray-ml:latest-gpu: CUDA support, includes ML dependencies.

  • rayproject/ray:latest-gpu: CUDA support, no ML dependencies.

  • rayproject/ray-ml:latest: No CUDA support, includes ML dependencies.

  • rayproject/ray:latest: No CUDA support, no ML dependencies.

docker.head_image#

Docker image for the head node to override the default docker image.

  • Required: No

  • Importance: Low

  • Type: String

docker.worker_image#

Docker image for the worker nodes to override the default docker image.

  • Required: No

  • Importance: Low

  • Type: String

docker.container_name#

The name to use when starting the Docker container.

  • Required: Yes (If Docker is in use.)

  • Importance: Low

  • Type: String

  • Default: ray_container

docker.pull_before_run#

If enabled, the latest version of image will be pulled when starting Docker. If disabled, docker run will only pull the image if no cached version is present.

  • Required: No

  • Importance: Medium

  • Type: Boolean

  • Default: True

docker.run_options#

The extra options to pass to docker run.

  • Required: No

  • Importance: Medium

  • Type: List of String

  • Default: []

docker.head_run_options#

The extra options to pass to docker run for head node only.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

docker.worker_run_options#

The extra options to pass to docker run for worker nodes only.

  • Required: No

  • Importance: Low

  • Type: List of String

  • Default: []

docker.disable_automatic_runtime_detection#

If enabled, Ray will not try to use the NVIDIA Container Runtime if GPUs are present.

  • Required: No

  • Importance: Low

  • Type: Boolean

  • Default: False

docker.disable_shm_size_detection#

If enabled, Ray will not automatically specify the size of /dev/shm for the started container and the runtime’s default value (64MiB for Docker) will be used. If --shm-size=<> is manually added to run_options, this is automatically set to True, meaning that Ray will defer to the user-provided value.

  • Required: No

  • Importance: Low

  • Type: Boolean

  • Default: False
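As a sketch, a docker section that sets the shared-memory size manually (and thereby implicitly enables this flag) might look like:

```yaml
docker:
    image: rayproject/ray:latest
    container_name: ray_container
    run_options:
        - --shm-size=8g  # Ray defers to this value; shm size detection is skipped
```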

auth.ssh_user#

The user that Ray will authenticate with when launching new nodes.

  • Required: Yes

  • Importance: High

  • Type: String

auth.ssh_private_key#

AWS: The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and KeyName has to be defined in the node configuration.

  • Required: No

  • Importance: Low

  • Type: String

Azure: The path to an existing private key for Ray to use.

  • Required: Yes

  • Importance: High

  • Type: String

You may use ssh-keygen -t rsa -b 4096 to generate a new ssh keypair.

GCP: The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and KeyName has to be defined in the node configuration.

  • Required: No

  • Importance: Low

  • Type: String

vSphere: Not available. The vSphere provider expects the key to be located at the fixed path ~/ray-bootstrap-key.pem.

auth.ssh_public_key#

AWS: Not available.

Azure: The path to an existing public key for Ray to use.

  • Required: Yes

  • Importance: High

  • Type: String

GCP: Not available.

vSphere: Not available.

provider.type#

The cloud service provider. For AWS, this must be set to aws.

  • Required: Yes

  • Importance: High

  • Type: String

The cloud service provider. For Azure, this must be set to azure.

  • Required: Yes

  • Importance: High

  • Type: String

The cloud service provider. For GCP, this must be set to gcp.

  • Required: Yes

  • Importance: High

  • Type: String

The cloud service provider. For vSphere and VCF, this must be set to vsphere.

  • Required: Yes

  • Importance: High

  • Type: String

provider.region#

AWS: The region to use for deployment of the Ray cluster.

  • Required: Yes

  • Importance: High

  • Type: String

  • Default: us-west-2

Azure: Not available.

GCP: The region to use for deployment of the Ray cluster.

  • Required: Yes

  • Importance: High

  • Type: String

  • Default: us-west1

vSphere: Not available.

provider.availability_zone#

AWS: A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. Nodes will be launched in the first listed availability zone and will be tried in the following availability zones if launching fails.

  • Required: No

  • Importance: Low

  • Type: String

  • Default: us-west-2a,us-west-2b

Azure: A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. This can be specified at the provider level to set defaults for all node types, or at the node level to override the provider setting for specific node types.

For Azure, availability zone availability depends on each specific VM size / location combination. Node-level configuration in available_node_types.<node_type_name>.node_config.azure_arm_parameters.availability_zone takes precedence over provider-level configuration.

  • Required: No

  • Importance: Low

  • Type: String

  • Default: “auto” (let Azure automatically pick zones)

  • Example values

    • "1,2,3" - Use zones 1, 2, and 3

    • "1" - Use only zone 1

    • "none" - Explicitly disable zones

    • "auto" or omit - Let Azure automatically pick zones

See the following example Azure cluster config for more details

# Unique identifier for the head node and workers of this cluster.
cluster_name: nightly-cpu-minimal-2
max_workers: 6
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
  type: azure
  # https://azure.microsoft.com/en-us/global-infrastructure/locations
  location: westus2
  resource_group: ray-zones
  cache_stopped_nodes: False
  # Provider-level availability zone configuration (comma-separated)
  # This will be used as default for all node types unless overridden
  availability_zone: "1,2,3"

auth:
  ssh_user: ubuntu

available_node_types:
  ray.head.default:
    resources: {"CPU": 2}
    node_config:
      azure_arm_parameters:
        vmSize: Standard_D2s_v3
        imagePublisher: microsoft-dsvm
        imageOffer: ubuntu-2204
        imageSku: 2204-gen2
        imageVersion: latest
        # Head node: explicitly disable availability zones
        availability_zone: "none"
  ray.worker.default:
    min_workers: 0
    max_workers: 2
    resources: {"CPU": 2}
    node_config:
      azure_arm_parameters:
        vmSize: Standard_D2s_v3
        imagePublisher: microsoft-dsvm
        imageOffer: ubuntu-2204
        imageSku: 2204-gen2
        imageVersion: latest
        # Workers will use provider specified availability zones
  ray.worker.specific_zone:
    min_workers: 0
    max_workers: 2
    resources: {"CPU": 2}
    node_config:
      azure_arm_parameters:
        vmSize: Standard_D2s_v3
        imagePublisher: microsoft-dsvm
        imageOffer: ubuntu-2204
        imageSku: 2204-gen2
        imageVersion: latest
        # Workers will use availability zone 2 only (overrides provider setting)
        availability_zone: "2"

# Note: The Ubuntu 20.04 dsvm image has a few venvs already configured but
# they all contain python modules that are not compatible with Ray at the moment.
setup_commands:
    - (which conda && echo 'eval "$(conda shell.bash hook)"' >> ~/.bashrc) || true
    - conda tos accept
    - conda create -n ray-env python=3.10 -y
    - conda activate ray-env && echo 'conda activate ray-env' >> ~/.bashrc
    - which ray || pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"

file_mounts_sync_continuously: False

file_mounts: {
}

GCP: A string specifying a comma-separated list of availability zone(s) that nodes may be launched in.

  • Required: No

  • Importance: Low

  • Type: String

  • Default: us-west1-a

vSphere: Not available.

provider.location#

AWS: Not available.

Azure: The location to use for deployment of the Ray cluster.

  • Required: Yes

  • Importance: High

  • Type: String

  • Default: westus2

GCP: Not available.

vSphere: Not available.

provider.resource_group#

AWS: Not available.

Azure: The resource group to use for deployment of the Ray cluster.

  • Required: Yes

  • Importance: High

  • Type: String

  • Default: ray-cluster

GCP: Not available.

vSphere: Not available.

provider.subscription_id#

AWS: Not available.

Azure: The subscription ID to use for deployment of the Ray cluster. If not specified, Ray will use the default from the Azure CLI.

  • Required: No

  • Importance: High

  • Type: String

  • Default: ""

GCP: Not available.

vSphere: Not available.

provider.msi_name#

AWS: Not available.

Azure: The name of the managed identity to use for deployment of the Ray cluster. If not specified, Ray will create a default user-assigned managed identity.

  • Required: No

  • Importance: Low

  • Type: String

  • Default: ray-default-msi

GCP: Not available.

vSphere: Not available.

provider.msi_resource_group#

AWS: Not available.

Azure: The name of the managed identity’s resource group to use for deployment of the Ray cluster, used in conjunction with msi_name. If not specified, Ray will create a default user-assigned managed identity in the resource group specified in the provider config.

  • Required: No

  • Importance: Low

  • Type: String

  • Default: ray-cluster

GCP: Not available.

vSphere: Not available.

provider.project_id#

AWS: Not available.

Azure: Not available.

GCP: The globally unique project ID to use for deployment of the Ray cluster.

  • Required: Yes

  • Importance: Low

  • Type: String

  • Default: null

vSphere: Not available.

provider.cache_stopped_nodes#

If enabled, nodes will be stopped when the cluster scales down. If disabled, nodes will be terminated instead. Stopped nodes launch faster than terminated nodes.

  • Required: No

  • Importance: Low

  • Type: Boolean

  • Default: True

provider.use_internal_ips#

If enabled, Ray will use private IP addresses for communication between nodes. This should be omitted if your network interfaces use public IP addresses.

If enabled, Ray CLI commands (e.g. ray up) will have to be run from a machine that is part of the same VPC as the cluster.

This option does not affect the existence of public IP addresses for the nodes, it only affects which IP addresses are used by Ray. The existence of public IP addresses is controlled by your cloud provider’s configuration.

  • Required: No

  • Importance: Low

  • Type: Boolean

  • Default: False
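As an illustration, a provider block for a private-subnet AWS deployment (region and type as in the earlier examples) might look like:

```yaml
provider:
    type: aws
    region: us-west-2
    use_internal_ips: True
```

Remember that ray up must then run from a machine inside the same VPC.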

provider.use_external_head_ip#

AWS: Not available.

Azure: If enabled, Ray will provision and use a public IP address for communication with the head node, regardless of the value of use_internal_ips. This option can be used in combination with use_internal_ips to avoid provisioning excess public IPs for worker nodes (i.e., communicate among nodes using private IPs, but provision a public IP for head node communication only). If use_internal_ips is False, then this option has no effect.

  • Required: No

  • Importance: Low

  • Type: Boolean

  • Default: False

GCP: Not available.

vSphere: Not available.

provider.security_group#

AWS: A security group that can be used to specify custom inbound rules.

Azure: Not available.

GCP: Not available.

vSphere: Not available.

provider.vsphere_config#

AWS: Not available.

Azure: Not available.

GCP: Not available.

vSphere: vSphere configurations used to connect to vCenter Server. If not configured, the VSPHERE_* environment variables will be used.

security_group.GroupName#

The name of the security group. This name must be unique within the VPC.

  • Required: No

  • Importance: Low

  • Type: String

  • Default: "ray-autoscaler-{cluster-name}"

security_group.IpPermissions#

The inbound rules associated with the security group.
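For example, a security_group that allows inbound HTTPS only might be written as follows (the group name and CIDR range are illustrative; the rule fields follow the EC2 IpPermissions schema):

```yaml
provider:
    type: aws
    region: us-west-2
    security_group:
        GroupName: ray_client_security_group
        IpPermissions:
            - FromPort: 443
              ToPort: 443
              IpProtocol: TCP
              IpRanges:
                  - CidrIp: 0.0.0.0/0
```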

vsphere_config.credentials#

The credential to connect to the vSphere vCenter Server.

vsphere_config.credentials.user#

The username to use when connecting to vCenter Server.

  • Required: No

  • Importance: Low

  • Type: String

vsphere_config.credentials.password#

The password of the user to connect to vCenter Server.

  • Required: No

  • Importance: Low

  • Type: String

vsphere_config.credentials.server#

The vSphere vCenter Server address.

  • Required: No

  • Importance: Low

  • Type: String

vsphere_config.frozen_vm#

The configurations related to the frozen VM(s).

If the frozen VM(s) already exist, library_item should be unset. Either an existing frozen VM should be specified by name, or the name of a resource pool of frozen VMs on each ESXi host (https://docs.vmware.com/en/VMware-vSphere/index.html) should be specified by resource_pool.

If the frozen VM(s) are to be deployed from an OVF template, library_item must be set to point to an OVF template (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-AFEDC48B-C96F-4088-9C1F-4F0A30E965DE.html) in the content library. In this case, name must be set to indicate the name or the name prefix of the frozen VM(s). Then, either resource_pool should be set to indicate that a set of frozen VMs will be created on each ESXi host, or cluster should be set to indicate that a single frozen VM will be created in the vSphere cluster. The datastore config (https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-D5AB2BAD-C69A-4B8D-B468-25D86B8D39CE.html) is mandatory in this case.

Valid examples:

  1. ray up deploys a frozen VM from an OVF template:

    frozen_vm:
        name: single-frozen-vm
        library_item: frozen-vm-template
        cluster: vsanCluster
        datastore: vsanDatastore
    
  2. ray up against an existing frozen VM:

    frozen_vm:
        name: existing-single-frozen-vm
    
  3. ray up deploys a resource pool of frozen VMs from an OVF template:

    frozen_vm:
        name: frozen-vm-prefix
        library_item: frozen-vm-template
        resource_pool: frozen-vm-resource-pool
        datastore: vsanDatastore
    
  4. ray up against an existing resource pool of frozen VMs:

    frozen_vm:
        resource_pool: frozen-vm-resource-pool
    

Any other combination not listed in the above examples is invalid.

vsphere_config.frozen_vm.name#

The name or the name prefix of the frozen VM.

It can only be unset when resource_pool is set and points to an existing resource pool of frozen VMs.

  • Required: No

  • Importance: Medium

  • Type: String

vsphere_config.frozen_vm.library_item#

The library item (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-vm-administration/GUID-D3DD122F-16A5-4F36-8467-97994A854B16.html#GUID-D3DD122F-16A5-4F36-8467-97994A854B16) of the OVF template of the frozen VM. If set, the frozen VM, or a set of frozen VMs, will be deployed from the OVF template specified by library_item. Otherwise, the frozen VM(s) should already exist.

Visit the Ray project's VM Packer (vmware-ai-labs/vm-packer-for-ray) to learn how to create an OVF template for frozen VMs.

  • Required: No

  • Importance: Low

  • Type: String

vsphere_config.frozen_vm.resource_pool#

The resource pool name of the frozen VMs, which can point to an existing resource pool of frozen VMs. Otherwise, library_item must be specified and a set of frozen VMs will be deployed on each ESXi host.

The frozen VMs will be named "{frozen_vm.name}-{the vm's ip address}".

  • Required: No

  • Importance: Medium

  • Type: String

vsphere_config.frozen_vm.cluster#

The vSphere cluster name. This takes effect only when library_item is set and resource_pool is unset. It indicates that a single frozen VM will be deployed on the vSphere cluster from an OVF template.

  • Required: No

  • Importance: Medium

  • Type: String

vsphere_config.frozen_vm.datastore#

The target vSphere datastore name for storing the virtual machine files of the frozen VM(s) deployed from an OVF template. This takes effect only when library_item is set. If resource_pool is also set, this datastore must be a datastore shared among the ESXi hosts.

  • Required: No

  • Importance: Low

  • Type: String

vsphere_config.gpu_config#

vsphere_config.gpu_config.dynamic_pci_passthrough#

The switch controlling how GPUs are bound from the ESXi host to the Ray node VM. The default value is False, meaning regular PCI Passthrough. If set to True, dynamic PCI Passthrough (https://docs.vmware.com/en/VMware-vSphere/8.0/vsphere-esxi-host-client/GUID-2B6D43A6-9598-47C4-A2E7-5924E3367BB6.html) will be enabled for the GPUs. A VM with dynamic PCI Passthrough GPUs still supports vSphere DRS (https://www.vmware.com/products/vsphere/drs-dpm.html).

  • Required: No

  • Importance: Low

  • Type: Boolean

available_node_types.<node_type_name>.node_type.node_config#

The configuration to be used to launch the nodes on the cloud service provider. Among other things, this will specify the instance type to be launched.

available_node_types.<node_type_name>.node_type.resources#

The resources that a node type provides, which enables the autoscaler to automatically select the right type of nodes to launch given the resource demands of the application. The resources specified will be automatically passed to the ray start command for the node via an environment variable. If not provided, the autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. For more information, see also the resource demand scheduler.

  • Required: Yes (except for AWS/K8s)

  • Importance: High

  • Type: Resources

  • Default: {}

In some cases, adding special nodes without any resources may be desirable. Such nodes can be used as a driver which connects to the cluster to launch jobs. In order to manually add a node to an autoscaled cluster, the ray-cluster-name tag should be set and the ray-node-type tag should be set to unmanaged. Unmanaged nodes can be created by setting the resources to {} and the maximum workers to 0. The autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes.
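A minimal sketch of such an unmanaged node type (the type name is hypothetical):

```yaml
available_node_types:
    ray.worker.unmanaged:
        min_workers: 0
        max_workers: 0
        resources: {}
        node_config: {}
```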

available_node_types.<node_type_name>.node_type.min_workers#

The minimum number of workers to maintain for this node type, regardless of utilization.

  • Required: No

  • Importance: High

  • Type: Integer

  • Default: 0

  • Minimum: 0

  • Maximum: Unbounded

available_node_types.<node_type_name>.node_type.max_workers#

The maximum number of workers to have in the cluster for this node type, regardless of utilization. This takes precedence over minimum workers. By default, the number of workers of a node type is unbounded, constrained only by the cluster-wide max_workers. (Prior to Ray 1.3.0, the default value for this field was 0.)

Note that for nodes of type head_node_type the default number of max workers is 0.

available_node_types.<node_type_name>.node_type.worker_setup_commands#

A list of commands to run to set up worker nodes of this type. These commands will replace the general worker setup commands for the node.

  • Required: No

  • Importance:

  • Type: List of String

  • Default: []

available_node_types.<node_type_name>.node_type.resources.CPU#

AWS: The number of CPUs made available by this node. If not configured, the autoscaler can automatically detect them only for AWS/Kubernetes cloud providers.

  • Required: Yes (except for AWS/K8s)

  • Importance: High

  • Type: Integer

Azure: The number of CPUs made available by this node.

  • Required: Yes

  • Importance: High

  • Type: Integer

GCP: The number of CPUs made available by this node.

  • Required: No

  • Importance: High

  • Type: Integer

vSphere: The number of CPUs made available by this node. If not configured, the node will use the same settings as the frozen VM.

  • Required: No

  • Importance: High

  • Type: Integer

available_node_types.<node_type_name>.node_type.resources.GPU#

AWS: The number of GPUs made available by this node. If not configured, the autoscaler can automatically detect them only for AWS/Kubernetes cloud providers.

  • Required: No

  • Importance: Low

  • Type: Integer

Azure: The number of GPUs made available by this node.

  • Required: No

  • Importance: High

  • Type: Integer

GCP: The number of GPUs made available by this node.

  • Required: No

  • Importance: High

  • Type: Integer

vSphere: The number of GPUs made available by this node.

  • Required: No

  • Importance: High

  • Type: Integer

available_node_types.<node_type_name>.node_type.resources.memory#

AWS: The memory in bytes allocated for Python worker heap memory on the node. If not configured, the autoscaler will automatically detect the amount of RAM on AWS/Kubernetes nodes and allocate 70% of it for the heap.

  • Required: No

  • Importance: Low

  • Type: Integer

Azure: The memory in bytes allocated for Python worker heap memory on the node.

  • Required: No

  • Importance: High

  • Type: Integer

GCP: The memory in bytes allocated for Python worker heap memory on the node.

  • Required: No

  • Importance: High

  • Type: Integer

vSphere: The memory in megabytes allocated for Python worker heap memory on the node. If not configured, the node will use the same memory settings as the frozen VM.

  • Required: No

  • Importance: High

  • Type: Integer

available_node_types.<node_type_name>.node_type.resources.object-store-memory#

AWS: The memory in bytes allocated for the object store on the node. If not configured, the autoscaler will automatically detect the amount of RAM on AWS/Kubernetes nodes and allocate 30% of it for the object store.

  • Required: No

  • Importance: Low

  • Type: Integer

Azure: The memory in bytes allocated for the object store on the node.

  • Required: No

  • Importance: High

  • Type: Integer

GCP: The memory in bytes allocated for the object store on the node.

  • Required: No

  • Importance: High

  • Type: Integer

vSphere: The memory in bytes allocated for the object store on the node.

  • Required: No

  • Importance: High

  • Type: Integer
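The 70%/30% AWS/Kubernetes defaults described in the two sections above can be illustrated as follows (the rounding here is an assumption; the autoscaler's exact arithmetic may differ):

```python
def default_memory_split(detected_ram_bytes: int) -> tuple:
    # ~70% of detected RAM goes to the Python worker heap,
    # ~30% to the object store.
    heap = int(detected_ram_bytes * 0.7)
    object_store = int(detected_ram_bytes * 0.3)
    return heap, object_store

# For a 16 GiB node:
heap, object_store = default_memory_split(16 * 1024**3)
```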

available_node_types.<node_type_name>.docker#

A set of overrides to the top-level Docker configuration.

  • Required: No

  • Importance: Low

  • Type: Docker

  • Default: {}
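For example, a GPU worker type can override just the image while inheriting the rest of the top-level Docker configuration (the node type name and instance type here are illustrative):

```yaml
docker:
    image: rayproject/ray:latest
    container_name: ray_container

available_node_types:
    ray.worker.gpu:
        min_workers: 0
        resources: {"GPU": 1}
        node_config:
            InstanceType: p2.xlarge
        docker:
            worker_image: rayproject/ray:latest-gpu
```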

Examples#

Minimal configuration#

# A unique identifier for the head node and workers of this cluster.
cluster_name: aws-example-minimal

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 3

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 3
        # The maximum number of worker nodes of this type to launch.
        # This parameter takes precedence over min_workers.
        max_workers: 3
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g., instance type. By default
        # Ray auto-configures unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# The maximum number of worker nodes to launch in addition to the head
# node. min_workers defaults to 0.
max_workers: 2

# Cloud-provider specific configuration.
provider:
    type: azure
    location: westus2
    resource_group: ray-cluster

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # SSH keys will be auto-generated with Ray-specific names if not specified
    # Uncomment and specify custom paths if you want to use different existing keys:
    # ssh_private_key: /path/to/your/key.pem
    # ssh_public_key: /path/to/your/key.pub

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}
auth:
  ssh_user: ubuntu
cluster_name: minimal
provider:
  availability_zone: us-west1-a
  project_id: null # TODO: set your GCP project ID here
  region: us-west1
  type: gcp
# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 5

# Cloud-provider specific configuration.
provider:
    type: vsphere

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ray
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
    ssh_private_key: ~/ray-bootstrap-key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # You can override the resources here. Adding a GPU to the head node is not recommended.
        # resources: { "CPU": 2, "Memory": 4096}
        resources: {}
    ray.worker.default:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        max_workers: 3
        # You can override the resources here. For GPU, currently only NVIDIA GPUs are supported. If no ESXi host
        # can fulfill the requirement, Ray node creation will fail, and the number of created nodes may not reach
        # the desired minimum. The vSphere node provider does not distinguish between GPU types; it only counts
        # quantity: if the user sets {"GPU": k}, the first k available NVIDIA GPUs are mounted to the VM.
        # resources: {"CPU": 2, "Memory": 4096, "GPU": 1}
        resources: {}

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

Full configuration#

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
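# (Illustrative, applying the formula above: with upscaling_speed: 1.0 and 10
# currently running nodes, a single autoscaler round may add up to about
# 1.0 * 10 = 10 nodes, still capped by the max_workers limits.)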

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-cpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes will be launched in the first listed availability zone; if launching
    # fails there, the subsequent availability zones will be tried in order.
    availability_zone: us-west-2a,us-west-2b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: True # If not present, the default is True.

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            # Default AMI for us-west-2.
            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
            # for default images for other zones.
            ImageId: ami-0387d929287ab193e
            # You can provision additional disk space with a configuration as follows.
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 140
                      VolumeType: gp3
            # Additional options in the boto docs.
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            InstanceType: m5.large
            # Default AMI for us-west-2.
            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
            # for default images for other zones.
            ImageId: ami-0387d929287ab193e
            # Run workers on spot by default. Comment this out to use on-demand.
            # NOTE: If relying on spot instances, it is best to specify multiple different instance
            # types to avoid interruption when one instance type is experiencing heightened demand.
            # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
            InstanceMarketOptions:
                MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.
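    # Illustrative addition (not part of the default example): per the spot note
    # above, you can reduce interruption risk by defining a second spot worker
    # type with a different instance type. The name "ray.worker.large-spot" is
    # hypothetical; adjust the instance type and AMI for your region.
    # ray.worker.large-spot:
    #     min_workers: 0
    #     max_workers: 2
    #     resources: {}
    #     node_config:
    #         InstanceType: m5.xlarge
    #         ImageId: ami-0387d929287ab193e
    #         InstanceMarketOptions:
    #             MarketType: spot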

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
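Each full configuration on this page is driven with the Ray cluster CLI. A typical session, assuming the example above is saved as `cluster.yaml` (a filename chosen here for illustration), might look like:

```shell
# Launch or update the cluster: creates the head node, then the
# autoscaler launches workers according to available_node_types.
ray up -y cluster.yaml

# Open an interactive SSH shell on the head node.
ray attach cluster.yaml

# Tear down all nodes when finished.
ray down -y cluster.yaml
```

The `-y` flag skips the interactive confirmation prompt; omit it to review the planned changes before they are applied.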
# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: westus2
    resource_group: ray-cluster
    # Set subscription id otherwise the default from az cli will be used.
    # subscription_id: 00000000-0000-0000-0000-000000000000
    # Set unique subnet mask or a random mask will be used.
    # subnet_mask: 10.0.0.0/16
    # Set unique id for resources in this cluster.
    # If not set a default id will be generated based on the resource group and cluster name.
    # unique_id: RAY1
    # Set managed identity name and resource group;
    # If not set, a default user-assigned identity will be generated in the resource group specified above.
    # msi_name: ray-cluster-msi
    # msi_resource_group: other-rg
    # Set provisioning and use of public/private IPs for head and worker nodes;
    # If both options below are true, only the head node will have a public IP address provisioned.
    # use_internal_ips: True
    # use_external_head_ip: True

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # SSH keys will be auto-generated with Ray-specific names if not specified
    # Uncomment and specify custom paths if you want to use different existing keys:
    # ssh_private_key: /path/to/your/key.pem
    # ssh_public_key: /path/to/your/key.pub

# More specific customization of node configurations can be made using the ARM template azure-vm-template.json file.
# See the documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node; however, worker node deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in the setup_commands section below.

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        resources: {"CPU": 4}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D4s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-2204
                imageSku: 2204-gen2
                imageVersion: latest

                # Or, use a custom image from Azure Compute Gallery.
                # Note: if you use a custom image, then imagePublisher,
                # imageOffer, imageSku, and imageVersion are ignored.
                # imageId: /subscriptions/[subscription-id]/resourceGroups/[resource-group-id]/providers/Microsoft.Compute/galleries/[azure-compute-gallery-id]/images/[image-id]/versions/[image-version]

                # Optionally set osDiskSize if you want to use a custom disk size.
                # osDiskSize: 128

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 4}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D4s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-2204
                imageSku: 2204-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
    #    "/path1/on/remote/machine": "/path1/on/local/machine",
    #    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. Ray copies the same path on the head node to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously.
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down.
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10 # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: []
# Note: if you're developing Ray, you probably want to create a Docker image that
# has your Ray repo pre-cloned. Then, you can replace the pip installs
# below with a git checkout <your_sha> (and possibly a recompile).
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install -U azure-core==1.35.0 azure-cli-core==2.77.0 azure-identity==1.23.1 azure-mgmt-compute==35.0.0 azure-mgmt-network==29.0.0 azure-mgmt-resource==24.0.0 azure-common==1.1.28 msrest==0.7.1 msrestazure==0.6.4.post1

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
  image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
  # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
  container_name: "ray_container"
  # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
  # if no cached version is present.
  pull_before_run: True
  run_options:  # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536

  # Example of running a GPU head with CPU workers
  # head_image: "rayproject/ray-ml:latest-gpu"
  # Allow Ray to automatically detect GPUs

  # worker_image: "rayproject/ray-ml:latest-cpu"
  # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: null # Globally unique project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_default:
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922

            # Additional options can be found in the compute docs at
            # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

            # If the network interface is specified as below in both head and worker
            # nodes, the manual network config is used.  Otherwise an existing subnet is
            # used.  To use a shared subnet, ask the subnet owner to grant permission
            # for 'compute.subnetworks.use' to the ray autoscaler account...
            # networkInterfaces:
            #   - kind: compute#networkInterface
            #     subnetwork: path/to/subnet
            #     aliasIpRanges: []
    ray_worker_small:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
            # Uncomment this to launch workers with the Service Account of the Head Node
            # serviceAccounts:
            # - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
            #   scopes:
            #   - https://www.googleapis.com/auth/cloud-platform

    # Additional options can be found in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"


# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 5

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "rayproject/ray-ml:latest"
    # image: rayproject/ray:latest   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: vsphere

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ray
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
    ssh_private_key: ~/ray-bootstrap-key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # You can override the resources here. Adding a GPU to the head node is not recommended.
        # resources: { "CPU": 2, "Memory": 4096}
        resources: {}
        node_config: {"vm_class": "best-effort-xlarge"}
    worker:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        max_workers: 3
        # You can override the resources here. For GPU, currently only NVIDIA GPUs are supported. If no ESXi host
        # can fulfill the requirement, Ray node creation will fail, and the number of created nodes may not reach
        # the desired minimum. The vSphere node provider does not distinguish between GPU types; it only counts
        # quantity: if the user sets {"GPU": k}, the first k available NVIDIA GPUs are mounted to the VM.
        # resources: {"CPU": 2, "Memory": 4096, "GPU": 1}
        resources: {}
        node_config: {"vm_class": "best-effort-xlarge"}
    worker_2:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        max_workers: 2
        # You can override the resources here. For GPUs, currently only NVIDIA GPUs are
        # supported. If no ESXi host can fulfill the requirement, node creation will fail,
        # and the number of created nodes may fall short of the desired minimum. The vSphere
        # node provider does not distinguish between GPU models; it only counts the quantity:
        # if you set {"GPU": k}, k of the available NVIDIA GPUs are mounted to the VM.
        # resources: {"CPU": 2, "Memory": 4096, "GPU": 1}
        resources: {}
        node_config: {"vm_class": "best-effort-xlarge"}
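Before launching, it can be useful to sanity-check the worker bounds in `available_node_types`: each type must satisfy `0 <= min_workers <= max_workers`, and the autoscaler will never run more workers of a type than that type's `max_workers`. The following is a hypothetical helper, not part of Ray; the dict mirrors the two worker types above.

```python
# Hypothetical sanity check for the available_node_types section above.
node_types = {
    "worker":   {"min_workers": 1, "max_workers": 3},
    "worker_2": {"min_workers": 1, "max_workers": 2},
}

def validate(types):
    """Check 0 <= min_workers <= max_workers and return the total worker capacity."""
    for name, cfg in types.items():
        lo, hi = cfg.get("min_workers", 0), cfg.get("max_workers", 0)
        if not 0 <= lo <= hi:
            raise ValueError(f"{name}: need 0 <= min_workers <= max_workers, got {lo} > {hi}")
    # The autoscaler never exceeds the sum of the per-type maxima.
    return sum(cfg["max_workers"] for cfg in types.values())

print(validate(node_types))  # → 5
```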

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude: []

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter: []
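The exclude patterns use rsync's glob syntax. rsync's full filter language is richer, but for simple globs the effect can be previewed with Python's `fnmatch`; the patterns and filenames below are illustrative, not taken from the config above.

```python
import fnmatch

# Illustrative rsync_exclude-style patterns (hypothetical values).
patterns = ["*.log", "__pycache__"]
files = ["train.py", "debug.log", "__pycache__", "data.csv"]

# Keep only files that match none of the exclude patterns.
synced = [f for f in files
          if not any(fnmatch.fnmatch(f, p) for p in patterns)]
print(synced)  # → ['train.py', 'data.csv']
```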

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install 'git+https://github.com/vmware/vsphere-automation-sdk-python.git'

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
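With the configuration above saved to a file, the cluster is managed through the Ray CLI. The filename `vsphere.yaml` here is a hypothetical example:

```shell
ray up vsphere.yaml        # create or update the cluster
ray attach vsphere.yaml    # open an SSH shell on the head node
ray down vsphere.yaml      # terminate the head and worker nodes
```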

TPU Configuration#

TPU VMs can be used on GCP. Currently, TPU pods (TPUs other than v2-8, v3-8, and v4-8) are not supported.

Before using a configuration with TPUs, make sure the TPU API is enabled for your GCP project.

# A unique identifier for the head node and workers of this cluster.
cluster_name: tputest

# The maximum number of worker nodes to launch in addition to the head node.
max_workers: 7

available_node_types:
    ray_head_default:
        resources: {"TPU": 1}  # use TPU custom resource in your code
        node_config:
            # Only v2-8, v3-8 and v4-8 accelerator types are currently supported.
            # Support for TPU pods will be added in the future.
            acceleratorType: v2-8
            runtimeVersion: v2-alpha
            schedulingConfig:
                # Set to false to use non-preemptible TPUs
                preemptible: false
    ray_tpu:
        min_workers: 1
        resources: {"TPU": 1}  # use TPU custom resource in your code
        node_config:
            acceleratorType: v2-8
            runtimeVersion: v2-alpha
            schedulingConfig:
                preemptible: true

provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-b
    project_id: null # Replace this with your GCP project ID.

setup_commands:
  - sudo apt install python-is-python3 -y
  - pip3 install --upgrade pip
  - pip3 install -U "ray[default]"

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_default
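The `"TPU": 1` entries above expose the accelerator as a Ray custom resource; application code then requests it through the `resources` argument of `ray.remote`. A minimal sketch, assuming Ray is installed and `ray.init()` connects to this cluster:

```python
import ray

ray.init()  # connect to the cluster launched from the config above

@ray.remote(resources={"TPU": 1})
def tpu_task():
    # This task is scheduled only on a node advertising the TPU custom resource.
    ...
```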