Train a PyTorch ResNet model with GPUs on Kubernetes

This guide shows how to run an example Ray machine learning training workload with GPUs on Kubernetes infrastructure. It runs Ray's PyTorch image training benchmark with a 1 GB training set.

Note

To learn the basics of Ray on Kubernetes, we recommend reading the getting started guide first.

Kubernetes and kubectl must both be version 1.19 or later.
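
The version requirement above can be checked programmatically. The following is a minimal sketch, assuming you feed it the JSON printed by `kubectl version -o json`; the helper names `parse_version` and `meets_minimum` are hypothetical and not part of Ray or kubectl.

```python
import json
import re


def parse_version(git_version: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as 'v1.27.3-gke.100'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version_json: str, minimum: tuple[int, int] = (1, 19)) -> dict[str, bool]:
    """Report, per component, whether its version is at least `minimum`.

    `version_json` is assumed to be the output of `kubectl version -o json`,
    which contains a `clientVersion` entry and, when the cluster is
    reachable, a `serverVersion` entry, each with a `gitVersion` field.
    """
    report = json.loads(version_json)
    result = {}
    for component in ("clientVersion", "serverVersion"):
        info = report.get(component)
        if info is not None:
            result[component] = parse_version(info["gitVersion"]) >= minimum
    return result
```

One way to use it is to capture the output with `subprocess.check_output(["kubectl", "version", "-o", "json"], text=True)` and pass the resulting string to `meets_minimum`.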

The end-to-end workflow

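The following script summarizes the end-to-end workflow for GPU training. These instructions are for GCP, but a similar setup works on any major cloud provider. The script consists of:

Step 1: Set up a Kubernetes cluster on GCP.
Step 2: Deploy a Ray cluster on Kubernetes with the KubeRay operator.
Step 3: Run the PyTorch image training benchmark.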

# Step 1: Set up a Kubernetes cluster on GCP
# Create a node-pool for a CPU-only head node
# e2-standard-8 => 8 vCPU; 32 GB RAM
gcloud container clusters create gpu-cluster-1 \
    --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
    --zone=us-central1-c --machine-type e2-standard-8

# Create a node-pool for GPU. The node is for a GPU Ray worker node.
# n1-standard-8 => 8 vCPU; 30 GB RAM
gcloud container node-pools create gpu-node-pool \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --zone us-central1-c --cluster gpu-cluster-1 \
  --num-nodes 1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
  --machine-type n1-standard-8

# Install NVIDIA GPU device driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Step 2: Deploy a Ray cluster on Kubernetes with the KubeRay operator.
# Please make sure you are connected to your Kubernetes cluster. For GCP, you can do so by:
#   (Method 1) Copy the connection command from the GKE console
#   (Method 2) "gcloud container clusters get-credentials <your-cluster-name> --region <your-region> --project <your-project>"
#   (Method 3) "kubectl config use-context ..."

# Install both CRDs and KubeRay operator.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0

# Create a Ray cluster
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml

# Set up port-forwarding
kubectl port-forward services/raycluster-head-svc 8265:8265

# Step 3: Run the PyTorch image training benchmark.
# Install Ray if needed
pip3 install -U "ray[default]"

# Download the Python script
curl https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/doc_code/pytorch_training_e2e_submit.py -o pytorch_training_e2e_submit.py

# Submit the training job to your ray cluster
python3 pytorch_training_e2e_submit.py

# Use the following command to follow this Job's logs:
# Substitute the Ray Job's submission id.
ray job logs 'raysubmit_xxxxxxxxxxxxxxxx' --address http://127.0.0.1:8265 --follow

The rest of this document breaks down the workflow above in more detail.

Step 1: Set up a Kubernetes cluster on GCP.

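In this section, we set up a Kubernetes cluster with CPU and GPU node pools. These instructions are for GCP, but a similar setup works on any major cloud provider. If you already have a Kubernetes cluster with GPUs, you can skip this step.

If you are new to Kubernetes and plan to deploy Ray workloads on a managed Kubernetes service, we recommend reading this introductory guide first.

You do not need a cluster with this much RAM (more than 30 GB per node in the commands below) to run this example. Feel free to change the machine-type option and the resource requirements in ray-cluster.gpu.yaml.

The first command creates a Kubernetes cluster named gpu-cluster-1 with one CPU node (e2-standard-8: 8 vCPU; 32 GB RAM). The second command adds a new node (n1-standard-8: 8 vCPU; 30 GB RAM) with a GPU (nvidia-tesla-t4) to the cluster.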

# Step 1: Set up a Kubernetes cluster on GCP.
# e2-standard-8 => 8 vCPU; 32 GB RAM
gcloud container clusters create gpu-cluster-1 \
    --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
    --zone=us-central1-c --machine-type e2-standard-8

# Create a node-pool for GPU
# n1-standard-8 => 8 vCPU; 30 GB RAM
gcloud container node-pools create gpu-node-pool \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --zone us-central1-c --cluster gpu-cluster-1 \
  --num-nodes 1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
  --machine-type n1-standard-8

# Install NVIDIA GPU device driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
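
If you want to try a smaller machine type, it can help to generate the node-pool command from parameters rather than editing it by hand. The sketch below only reassembles the flags already used in the command above; the function name `node_pool_command` and the choice of `n1-standard-4` are illustrative assumptions, not a recommendation from the Ray documentation.

```python
import shlex


def node_pool_command(
    cluster: str = "gpu-cluster-1",
    pool: str = "gpu-node-pool",
    zone: str = "us-central1-c",
    machine_type: str = "n1-standard-8",
    accelerator: str = "nvidia-tesla-t4",
    gpu_count: int = 1,
    max_nodes: int = 1,
) -> list[str]:
    """Build the argv for `gcloud container node-pools create`.

    Defaults mirror the values used in this guide. Only the flags that
    appear in the guide's own command are emitted.
    """
    return [
        "gcloud", "container", "node-pools", "create", pool,
        "--accelerator", f"type={accelerator},count={gpu_count}",
        "--zone", zone,
        "--cluster", cluster,
        "--num-nodes", "1",
        "--min-nodes", "0",
        "--max-nodes", str(max_nodes),
        "--enable-autoscaling",
        "--machine-type", machine_type,
    ]


# Print a copy-pasteable command for a smaller (hypothetical) machine type.
print(shlex.join(node_pool_command(machine_type="n1-standard-4")))
```

If you pick a smaller machine type, remember to lower the CPU and memory requests in ray-cluster.gpu.yaml accordingly, otherwise the Ray worker Pod may not be schedulable on the node.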

Step 2: Deploy a Ray cluster on Kubernetes with the KubeRay operator.

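To run the following steps, make sure you are connected to your Kubernetes cluster. For GCP, you can connect in one of these ways:

Copy the connection command from the GKE console
gcloud container clusters get-credentials <your-cluster-name> --region <your-region> --project <your-project>
kubectl config use-context

The first command deploys KubeRay (ray-operator) to your Kubernetes cluster. The second command creates a Ray cluster with the help of KubeRay.

The third command forwards port 8265 of the ray-head Pod to 127.0.0.1:8265. You can open 127.0.0.1:8265 to view the dashboard. The last command submits a simple job to test your Ray cluster; it is optional.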

# Step 2: Deploy a Ray cluster on Kubernetes with the KubeRay operator.
# Create the KubeRay operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0

# Create a Ray cluster
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/ray-cluster.gpu.yaml

# port forwarding
kubectl port-forward services/raycluster-head-svc 8265:8265

# Test cluster (optional)
ray job submit --address http://127.0.0.1:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
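
Before submitting jobs, it can be useful to confirm that the port-forward is actually listening. The following is a minimal sketch that only tests whether a TCP connection to 127.0.0.1:8265 succeeds; it does not verify that the Ray dashboard itself is healthy, and the function name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8265, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the forwarded port.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

If `port_is_open()` returns False, check that the `kubectl port-forward` command is still running in another terminal and that the Ray head Pod is ready.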

Step 3: Run the PyTorch image training benchmark.

We use the Ray Job Python SDK to submit the PyTorch workload.

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

kick_off_pytorch_benchmark = (
    # Clone ray. If ray is already present, don't clone again.
    "git clone -b ray-2.2.0 https://github.com/ray-project/ray || true;"
    # Run the benchmark.
    "python ray/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py"
    " --data-size-gb=1 --num-epochs=2 --num-workers=1"
)


submission_id = client.submit_job(
    entrypoint=kick_off_pytorch_benchmark,
)

print("Use the following command to follow this Job's logs:")
print(f"ray job logs '{submission_id}' --address http://127.0.0.1:8265 --follow")

To submit the workload, run the Python script above. The script is available in the Ray repository.

# Step 3: Run the PyTorch image training benchmark.
# Install Ray if needed
pip3 install -U "ray[default]"

# Download the Python script
curl https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/doc_code/pytorch_training_e2e_submit.py -o pytorch_training_e2e_submit.py

# Submit the training job to your ray cluster
python3 pytorch_training_e2e_submit.py
# Example STDOUT:
# Use the following command to follow this Job's logs:
# ray job logs 'raysubmit_jNQxy92MJ4zinaDX' --address http://127.0.0.1:8265 --follow

# Track job status
# Substitute the Ray Job's submission id.
ray job logs 'raysubmit_xxxxxxxxxxxxxxxx' --address http://127.0.0.1:8265 --follow
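
As an alternative to following the logs, you can poll the job until it reaches a terminal state. The sketch below is an assumption-laden illustration rather than part of the benchmark script: `wait_for_job` and `wait_for_ray_job` are hypothetical helper names, the terminal states are taken to be SUCCEEDED, FAILED, and STOPPED, and the Ray-specific wrapper relies on `JobSubmissionClient.get_job_status` from the Ray Job Python SDK.

```python
import time
from typing import Callable

# Job states after which the status no longer changes (assumed set).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(
    get_status: Callable[[], object],
    timeout_s: float = 3600.0,
    poll_s: float = 5.0,
) -> str:
    """Poll `get_status` until it reports a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        raw = get_status()
        # Accept either plain strings or enum members that carry a `.value`.
        status = str(getattr(raw, "value", raw))
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_s} s")
        time.sleep(poll_s)


def wait_for_ray_job(submission_id: str, address: str = "http://127.0.0.1:8265") -> str:
    """Wrap `wait_for_job` around the Ray Job SDK (requires `ray[default]`)."""
    # Imported lazily so the generic helper above works without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return wait_for_job(lambda: client.get_job_status(submission_id))
```

With the port-forward from Step 2 still running, calling `wait_for_ray_job("raysubmit_xxxxxxxxxxxxxxxx")` with your own submission id blocks until the benchmark finishes and returns its final status.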

Clean-up

Delete your Ray cluster and KubeRay with the following commands.

kubectl delete raycluster raycluster

# Make sure the Ray cluster has already been removed before deleting the operator.
helm uninstall kuberay-operator

If you are using a public cloud, don't forget to clean up the underlying node group and/or Kubernetes cluster.