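RayJob Batch Inference Example

This example shows how to use the RayJob custom resource to run a batch inference job for an image classification workload on a Ray cluster. For a full explanation of the code, see Image Classification Batch Inference with HuggingFace Vision Transformer.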

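Prerequisites

You need a running Kubernetes cluster with GPUs available, and kubectl configured to use it. This example includes a brief tutorial for setting up the necessary GPUs on Google Kubernetes Engine (GKE), but any Kubernetes cluster with GPUs works.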

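Step 0: Create a Kubernetes cluster on GKE (Optional)

If you already have a Kubernetes cluster with GPUs, you can skip this step.

Otherwise, follow this tutorial, but substitute the following GPU node pool creation command to create a Kubernetes cluster with four Nvidia T4 GPUs on GKE: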

gcloud container node-pools create gpu-node-pool \
  --accelerator type=nvidia-tesla-t4,count=4,gpu-driver-version=default \
  --zone us-west1-b \
  --cluster kuberay-gpu-cluster \
  --num-nodes 1 \
  --min-nodes 0 \
  --max-nodes 1 \
  --enable-autoscaling \
  --machine-type n1-standard-64

This example uses four Nvidia T4 GPUs. The machine type is n1-standard-64, which has 64 vCPUs and 240 GB of RAM.

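Step 1: Install the KubeRay Operator

Follow this document to install the latest stable KubeRay operator from the Helm repository. If the taint on the GPU node pool is set up correctly, the KubeRay operator Pod should be scheduled on a CPU node.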

Step 2: Submit the RayJob

Create the RayJob custom resource with ray-job.batch-inference.yaml.

Download the file with curl:

curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.3.0/ray-operator/config/samples/ray-job.batch-inference.yaml

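Note that the RayJob spec contains a spec for the RayCluster. This tutorial uses a single-node cluster with 4 GPUs. For production use cases, use a multi-node cluster whose head node has no GPUs, so that Ray can automatically schedule GPU workloads on worker nodes, where they don't interfere with critical Ray processes on the head node.

Note the following fields in the RayJob spec, which specify the Ray image and the GPU resources for the Ray node: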

        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.6.3-gpu
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  cpu: "54"
                  memory: "54Gi"
                requests:
                  nvidia.com/gpu: "4"
                  cpu: "54"
                  memory: "54Gi"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4 # This is the GPU type we used in the GPU node pool.

To submit the job, run the following command:

kubectl apply -f ray-job.batch-inference.yaml

Check the status with kubectl describe rayjob rayjob-sample.

Example output:

[...]
Status:
  Dashboard URL:          rayjob-sample-raycluster-j6t8n-head-svc.default.svc.cluster.local:8265
  End Time:               ...
  Job Deployment Status:  Complete
  Job Id:                 rayjob-sample-ft8lh
  Job Status:             SUCCEEDED
  Message:                Job finished successfully.
  Observed Generation:    2
  ...
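
When the status check needs to be scripted, the Status block above can be read mechanically. The following is a minimal sketch, not part of KubeRay: it assumes the describe output has been captured as text, and that SUCCEEDED, FAILED, and STOPPED are the statuses after which the job no longer changes. The helper names parse_status_block and is_finished are hypothetical.

```python
# Assumption: these are the Ray job statuses that are final.
TERMINAL_JOB_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def parse_status_block(describe_output: str) -> dict[str, str]:
    """Collect the indented 'Key: value' pairs that follow the 'Status:' line."""
    fields: dict[str, str] = {}
    in_status = False
    for line in describe_output.splitlines():
        stripped = line.strip()
        if stripped == "Status:":
            in_status = True
            continue
        if not in_status:
            continue
        # A non-indented line starts the next top-level section (e.g. Events).
        if line and not line[0].isspace():
            break
        if ":" not in stripped:
            continue
        # Split on the first colon only, so values such as host:port survive.
        key, _, value = stripped.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def is_finished(status: dict[str, str]) -> bool:
    """Return True once 'Job Status' holds one of the assumed terminal values."""
    return status.get("Job Status") in TERMINAL_JOB_STATUSES


# Abbreviated copy of the example output shown above.
sample = """\
[...]
Status:
  Dashboard URL:          rayjob-sample-raycluster-j6t8n-head-svc.default.svc.cluster.local:8265
  Job Deployment Status:  Complete
  Job Id:                 rayjob-sample-ft8lh
  Job Status:             SUCCEEDED
  Message:                Job finished successfully.
"""

status = parse_status_block(sample)
print(status["Job Status"], is_finished(status))
```

Parsing human-readable output is fragile; a structured format such as kubectl's JSON output is usually a better basis for automation, and the sketch only mirrors the text shown in this example.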

To view the logs, first find the name of the Pod running the job with kubectl get pods.

Example output:

NAME                                        READY   STATUS      RESTARTS   AGE
kuberay-operator-8b86754c-r4rc2             1/1     Running     0          25h
rayjob-sample-raycluster-j6t8n-head-kx2gz   1/1     Running     0          35m
rayjob-sample-w98c7                         0/1     Completed   0          30m
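
Picking the job Pod out of that table can also be sketched in code. The snippet below is an illustration only: it assumes the naming pattern visible in the example output, where Ray cluster Pods are named `<rayjob>-raycluster-...` and the Pod that ran the entrypoint is named `<rayjob>-<suffix>`. The function find_submitter_pods is a hypothetical helper, and the naming pattern may differ between KubeRay versions.

```python
def find_submitter_pods(get_pods_output: str, rayjob_name: str) -> list[str]:
    """Return Pod names that belong to the RayJob but not to its RayCluster."""
    pods = []
    for line in get_pods_output.splitlines()[1:]:  # skip the header row
        columns = line.split()
        if not columns:
            continue
        name = columns[0]
        # Assumption: cluster Pods carry a '-raycluster-' infix after the job name.
        is_cluster_pod = name.startswith(f"{rayjob_name}-raycluster-")
        if name.startswith(f"{rayjob_name}-") and not is_cluster_pod:
            pods.append(name)
    return pods


# Copy of the example output shown above.
sample = """\
NAME                                        READY   STATUS      RESTARTS   AGE
kuberay-operator-8b86754c-r4rc2             1/1     Running     0          25h
rayjob-sample-raycluster-j6t8n-head-kx2gz   1/1     Running     0          35m
rayjob-sample-w98c7                         0/1     Completed   0          30m
"""

print(find_submitter_pods(sample, "rayjob-sample"))
```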

The Ray cluster is still running because shutdownAfterJobFinishes isn't set in the RayJob spec. If you set shutdownAfterJobFinishes to true, the cluster shuts down after the job finishes.
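
As a rough illustration of that setting, the sketch below adds shutdownAfterJobFinishes to a manifest held as a Python dictionary before it is submitted. The manifest here is a stand-in with a placeholder entrypoint, not the contents of ray-job.batch-inference.yaml. The helper with_auto_shutdown is hypothetical, and ttlSecondsAfterFinished is assumed to be the RayJob field that delays cleanup after the job ends; check the KubeRay API reference for your version before relying on it.

```python
import copy
import json


def with_auto_shutdown(manifest: dict, ttl_seconds: int = 0) -> dict:
    """Return a copy of a RayJob manifest that tears down its cluster when done."""
    patched = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    spec = patched.setdefault("spec", {})
    spec["shutdownAfterJobFinishes"] = True
    # Assumption: seconds to wait after the job finishes before cleanup.
    spec["ttlSecondsAfterFinished"] = ttl_seconds
    return patched


# Stand-in manifest; the real file contains many more fields.
manifest = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "rayjob-sample"},
    "spec": {"entrypoint": "<entrypoint command>"},
}

patched = with_auto_shutdown(manifest, ttl_seconds=60)
print(json.dumps(patched, indent=2))
```

In practice the same two fields can simply be added by hand under spec in the YAML file before running kubectl apply.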

Next, run the following command:

kubectl logs rayjob-sample-w98c7

This prints the standard output of the RayJob's entrypoint command. Example output:

[...]
Running: 62.0/64.0 CPU, 4.0/4.0 GPU, 955.57 MiB/12.83 GiB object_store_memory:   0%|          | 0/200 [00:05<?, ?it/s]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 999.41 MiB/12.83 GiB object_store_memory:   0%|          | 0/200 [00:05<?, ?it/s]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 999.41 MiB/12.83 GiB object_store_memory:   0%|          | 1/200 [00:05<17:04,  5.15s/it]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 1008.68 MiB/12.83 GiB object_store_memory:   0%|          | 1/200 [00:05<17:04,  5.15s/it]
Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 1008.68 MiB/12.83 GiB object_store_memory: 100%|██████████| 1/1 [00:05<00:00,  5.15s/it]

2023-08-22 15:48:33,905 WARNING actor_pool_map_operator.py:267 -- To ensure full parallelization across an actor pool of size 4, the specified batch size should be at most 5. Your configured batch size for this operator was 16.
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546CF7F0>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546AE430>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546CF430>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546AE430>
Label:  tench, Tinca tinca
<PIL.Image.Image image mode=RGB size=500x375 at 0x7B37546CF7F0>
Label:  tench, Tinca tinca
2023-08-22 15:48:36,522 SUCC cli.py:33 -- -----------------------------------
2023-08-22 15:48:36,522 SUCC cli.py:34 -- Job 'rayjob-sample-ft8lh' succeeded
2023-08-22 15:48:36,522 SUCC cli.py:35 -- -----------------------------------
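
The "Running:" progress lines in this output report used and total CPUs and GPUs. As a small sketch of how to read them, the snippet below extracts those numbers and turns them into utilization fractions. It assumes the line format shown in the example output above, which may change between Ray versions; resource_utilization is a hypothetical helper.

```python
import re

# Matches e.g. "Running: 61.0/64.0 CPU, 4.0/4.0 GPU, ..."
PROGRESS_RE = re.compile(
    r"Running: (?P<cpu_used>[\d.]+)/(?P<cpu_total>[\d.]+) CPU, "
    r"(?P<gpu_used>[\d.]+)/(?P<gpu_total>[\d.]+) GPU"
)


def resource_utilization(line: str):
    """Return CPU and GPU utilization fractions, or None if the line doesn't match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    values = {key: float(text) for key, text in match.groupdict().items()}

    def fraction(used: float, total: float) -> float:
        # Guard against a zero total, e.g. on a cluster without GPUs.
        return used / total if total else 0.0

    return {
        "cpu": fraction(values["cpu_used"], values["cpu_total"]),
        "gpu": fraction(values["gpu_used"], values["gpu_total"]),
    }


line = (
    "Running: 61.0/64.0 CPU, 4.0/4.0 GPU, 999.41 MiB/12.83 GiB "
    "object_store_memory:   0%|          | 1/200 [00:05<17:04,  5.15s/it]"
)
util = resource_utilization(line)
print(util)
```

For the sample line, all four GPUs are in use, which is what a batch inference job sized for this node is expected to show.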