Train a PyTorch model on Fashion MNIST with CPUs on Kubernetes

This example uses Ray Train to run distributed training of a PyTorch model on Fashion MNIST. For more details, see Train a PyTorch model on Fashion MNIST.

Step 1: Create a Kubernetes cluster

This step creates a local Kubernetes cluster using Kind. If you already have a Kubernetes cluster, you can skip this step.

kind create cluster --image=kindest/node:v1.26.0

Step 2: Install the KubeRay operator

Follow the KubeRay operator installation documentation to install the latest stable KubeRay operator from the Helm repository.
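As a rough sketch of what that installation typically looks like, the commands below add the KubeRay Helm repository and install the operator chart. The release name `kuberay-operator` is an arbitrary choice, and no chart version is pinned here; check the installation documentation for the version you want and pass it with `--version` if needed.

```shell
# Add the KubeRay Helm repository and refresh the local chart index.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install the operator chart (release name is an arbitrary choice).
# Append `--version <x.y.z>` to pin a specific release.
helm install kuberay-operator kuberay/kuberay-operator

# Confirm that the operator Pod reaches the Running status.
kubectl get pods
```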

Step 3: Create a RayJob

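A RayJob consists of a RayCluster custom resource and a job that can be submitted to the RayCluster. With RayJob, KubeRay creates a RayCluster and submits the job once the cluster is ready. The following is a CPU-only RayJob description YAML file for training a PyTorch model on Fashion MNIST.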

# Download `ray-job.pytorch-mnist.yaml`
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-mnist/ray-job.pytorch-mnist.yaml

You might need to adjust some fields in the RayJob description YAML file so that it can run in your environment:

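replicas (under workerGroupSpecs in rayClusterSpec): This field specifies the number of worker Pods that KubeRay schedules to the Kubernetes cluster. Each worker Pod requires 3 CPUs and the head Pod requires 1 CPU, as described in the template field. The RayJob submitter Pod requires 1 CPU. For example, if your machine has 8 CPUs, the maximum value of replicas is 2 (1 + 1 + 2 × 3 = 8 CPUs), which still allows all Pods to reach the Running status.
NUM_WORKERS (under runtimeEnvYAML in spec): This field indicates the number of Ray actors to launch (see ScalingConfig in the Ray Train documentation for more information). Each Ray actor must be served by a worker Pod in the Kubernetes cluster. Therefore, NUM_WORKERS must be less than or equal to replicas.
CPUS_PER_WORKER: This value must be less than or equal to (the CPU resource request of each worker Pod) - 1. For example, in the sample YAML file, the CPU resource request of each worker Pod is 3 CPUs, so CPUS_PER_WORKER must be set to 2 or less.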
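The sizing rules above can be sketched as a small shell check. The function names `max_replicas` and `check_train_config` are hypothetical helpers for illustration, not part of KubeRay or Ray; the defaults (3 CPUs per worker Pod, 1 CPU each for the head and submitter Pods) come from the sample YAML file described above.

```shell
# Hypothetical helpers (not part of KubeRay) that encode the sizing rules above.

# Largest `replicas` that fits on the machine: the head Pod and the submitter
# Pod take 1 CPU each by default, and every worker Pod takes 3 CPUs by default.
max_replicas() {
  local total_cpus=$1 worker_cpus=${2:-3} head_cpus=${3:-1} submitter_cpus=${4:-1}
  echo $(( (total_cpus - head_cpus - submitter_cpus) / worker_cpus ))
}

# Succeeds only if NUM_WORKERS <= replicas and
# CPUS_PER_WORKER <= (worker Pod CPU request) - 1.
check_train_config() {
  local replicas=$1 num_workers=$2 cpus_per_worker=$3 worker_cpus=${4:-3}
  if (( num_workers > replicas )); then
    echo "NUM_WORKERS must be <= replicas (${replicas})" >&2
    return 1
  fi
  if (( cpus_per_worker > worker_cpus - 1 )); then
    echo "CPUS_PER_WORKER must be <= $(( worker_cpus - 1 ))" >&2
    return 1
  fi
}

max_replicas 8                                   # prints 2 for an 8-CPU machine
check_train_config 2 2 2 && echo "config OK"     # replicas=2, NUM_WORKERS=2, CPUS_PER_WORKER=2
```

If your worker Pods request a different number of CPUs, pass that value as the last argument to both functions.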

# `replicas` and `NUM_WORKERS` set to 2.
# Create a RayJob.
kubectl apply -f ray-job.pytorch-mnist.yaml

# Check existing Pods: According to `replicas`, there should be 2 worker Pods.
# Make sure all the Pods are in the `Running` status.
kubectl get pods
# NAME                                                             READY   STATUS    RESTARTS   AGE
# kuberay-operator-6dddd689fb-ksmcs                                1/1     Running   0          6m8s
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx   1/1     Running   0          5m32s
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm   1/1     Running   0          5m32s
# rayjob-pytorch-mnist-nxmj2                                       1/1     Running   0          4m17s
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl                 1/1     Running   0          5m32s

Check that the RayJob is in the RUNNING status:

kubectl get rayjob
# NAME                   JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
# rayjob-pytorch-mnist   RUNNING      Running             2024-06-17T04:08:25Z              11m

Step 4: Wait until the RayJob completes and check the training results

Wait until the RayJob completes. This might take several minutes.

kubectl get rayjob
# NAME                   JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
# rayjob-pytorch-mnist   SUCCEEDED    Complete            2024-06-17T04:08:25Z   2024-06-17T04:22:21Z   16m
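Instead of polling `kubectl get rayjob` by hand, you can block until the job finishes. The sketch below assumes that the JOB STATUS column is backed by the `.status.jobStatus` field of the RayJob resource and that your kubectl supports `--for=jsonpath` (v1.23 or later). `wait_for_rayjob` is a hypothetical helper name.

```shell
# Hypothetical helper: block until the RayJob reports SUCCEEDED or the timeout expires.
# Assumes the JOB STATUS column maps to `.status.jobStatus` on the RayJob resource.
wait_for_rayjob() {
  local name=$1 timeout=${2:-30m}
  kubectl wait "rayjob/${name}" \
    --for=jsonpath='{.status.jobStatus}'=SUCCEEDED \
    --timeout="${timeout}"
}

# Usage: wait_for_rayjob rayjob-pytorch-mnist 30m
```

Note that a job that fails never reaches SUCCEEDED, so in that case the command returns only when the timeout expires.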

After the JOB STATUS column shows SUCCEEDED, you can check the training logs:

# Check the Pod names.
kubectl get pods
# NAME                                                             READY   STATUS      RESTARTS   AGE
# kuberay-operator-6dddd689fb-ksmcs                                1/1     Running     0          113m
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-c8bwx   1/1     Running     0          38m
# rayjob-pytorch-mnist-raycluster-rkdmq-small-group-worker-s7wvm   1/1     Running     0          38m
# rayjob-pytorch-mnist-nxmj2                                       0/1     Completed   0          38m
# rayjob-pytorch-mnist-raycluster-rkdmq-head-m4dsl                 1/1     Running     0          38m

# Check training logs.
kubectl logs -f rayjob-pytorch-mnist-nxmj2

# 2024-06-16 22:23:01,047 INFO cli.py:36 -- Job submission server address: http://rayjob-pytorch-mnist-raycluster-rkdmq-head-svc.default.svc.cluster.local:8265
# 2024-06-16 22:23:01,844 SUCC cli.py:60 -- -------------------------------------------------------
# 2024-06-16 22:23:01,844 SUCC cli.py:61 -- Job 'rayjob-pytorch-mnist-l6ccc' submitted successfully
# 2024-06-16 22:23:01,844 SUCC cli.py:62 -- -------------------------------------------------------
# ...
# (RayTrainWorker pid=1138, ip=10.244.0.18)
#   0%|          | 0/26421880 [00:00<?, ?it/s]
# (RayTrainWorker pid=1138, ip=10.244.0.18)
#   0%|          | 32768/26421880 [00:00<01:27, 301113.97it/s]
# ...
# Training finished iteration 10 at 2024-06-16 22:33:05. Total running time: 7min 9s
# ╭───────────────────────────────╮
# │ Training result               │
# ├───────────────────────────────┤
# │ checkpoint_dir_name           │
# │ time_this_iter_s      28.2635 │
# │ time_total_s          423.388 │
# │ training_iteration         10 │
# │ accuracy               0.8748 │
# │ loss                  0.35477 │
# ╰───────────────────────────────╯

# Training completed after 10 iterations at 2024-06-16 22:33:06. Total running time: 7min 10s

# Training result: Result(
#   metrics={'loss': 0.35476621258825347, 'accuracy': 0.8748},
#   path='/home/ray/ray_results/TorchTrainer_2024-06-16_22-25-55/TorchTrainer_122aa_00000_0_2024-06-16_22-25-55',
#   filesystem='local',
#   checkpoint=None
# )
# ...
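The submitter Pod name carries a random suffix, so it can be convenient to select the Pod by label and print only the metric lines. This is a minimal sketch that assumes the submitter Pod carries the standard `job-name=<RayJob name>` label that Kubernetes adds to Pods created by a Job; `rayjob_metrics` is a hypothetical helper name.

```shell
# Hypothetical helper: print only the accuracy and loss lines from the submitter Pod's logs.
# Assumes the submitter Pod is labeled `job-name=<RayJob name>`.
rayjob_metrics() {
  local rayjob=$1
  # --tail=-1 prints the full log; with a label selector kubectl otherwise shows only the last lines.
  kubectl logs -l "job-name=${rayjob}" --tail=-1 | grep -E 'accuracy|loss'
}

# Usage: rayjob_metrics rayjob-pytorch-mnist
```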

Clean up

Delete your RayJob with the following command:

kubectl delete -f ray-job.pytorch-mnist.yaml
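If you created the local Kind cluster in Step 1 only for this example, you can remove it as well. This assumes the cluster uses Kind's default name, which is the case when `kind create cluster` runs without `--name`.

```shell
# Delete the local Kind cluster created in Step 1 (default cluster name).
kind delete cluster
```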