Fine-tune a Torch object detection model#


This tutorial explains how to fine-tune a fasterrcnn_resnet50_fpn model using the Ray AI libraries for parallel data ingest and training.

You'll perform the following tasks:

  1. Load raw images and VOC-style annotations into a Dataset

  2. Fine-tune fasterrcnn_resnet50_fpn (the backbone is pre-trained on ImageNet)

  3. Evaluate the model's accuracy

You should be familiar with PyTorch before starting the tutorial. If you need a refresher, read PyTorch's training a classifier tutorial.

Before you begin#

  • Install the dependencies for Ray Data and Ray Train.

!pip install 'ray[data,train]'

  • Install torch, torchmetrics, torchvision, and xmltodict.

!pip install torch torchmetrics torchvision xmltodict

Create a Dataset#

You'll work with a subset of Pascal VOC that contains cats and dogs (the full dataset has 20 classes).

CLASS_TO_LABEL = {
    "background": 0,
    "cat": 1,
    "dog": 2,
}

The dataset contains two subdirectories: JPEGImages and Annotations. JPEGImages contains raw images, and Annotations contains XML annotations.

AnimalDetection
├── Annotations
│   ├── 2007_000063.xml
│   ├── 2007_000528.xml
│   └──  ...
└── JPEGImages
    ├── 2007_000063.jpg
    ├── 2007_000528.jpg
    └──  ...

Parse annotations#

Each annotation describes the objects in an image.

For example, look at this image of a dog:

import io

from PIL import Image
import requests

response = requests.get("https://s3-us-west-2.amazonaws.com/air-example-data/AnimalDetection/JPEGImages/2007_000063.jpg")
image = Image.open(io.BytesIO(response.content))
image

Then, print the image's annotation:

!curl "https://s3-us-west-2.amazonaws.com/air-example-data/AnimalDetection/Annotations/2007_000063.xml"
<?xml version="1.0" encoding="utf-8"?>
<annotation>
	<folder>VOC2012</folder>
	<filename>2007_000063.jpg</filename>
	<source>
		<database>The VOC2007 Database</database>
		<annotation>PASCAL VOC2007</annotation>
		<image>flickr</image>
	</source>
	<size>
		<width>500</width>
		<height>375</height>
		<depth>3</depth>
	</size>
	<segmented>1</segmented>
	<object>
		<name>dog</name>
		<pose>Unspecified</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<bndbox>
			<xmin>123</xmin>
			<ymin>115</ymin>
			<xmax>379</xmax>
			<ymax>275</ymax>
		</bndbox>
	</object>
</annotation>

Notice that there's only one object, and it's labeled "dog":

<name>dog</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
        <xmin>123</xmin>
        <ymin>115</ymin>
        <xmax>379</xmax>
        <ymax>275</ymax>
</bndbox>
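
xmltodict turns this XML into nested dictionaries, which is what the parsing logic below relies on. As a quick illustrative sketch (fetching the same annotation shown above), note the gotcha that a single <object> parses to one mapping while multiple objects parse to a list:

import requests
import xmltodict

# Fetch and parse the same annotation file printed above.
url = "https://s3-us-west-2.amazonaws.com/air-example-data/AnimalDetection/Annotations/2007_000063.xml"
annotation = xmltodict.parse(requests.get(url).text)["annotation"]

# With one <object>, xmltodict returns a single mapping; with several, a list of mappings.
obj = annotation["object"]
print(isinstance(obj, list))  # False for this annotation
print(obj["bndbox"])          # values are strings, for example {'xmin': '123', ...}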

Ray Data lets you read and preprocess data in parallel. Ray Data doesn't have built-in support for VOC-style annotations, so you need to define logic to parse the annotations.

from typing import Any, Dict, List, Tuple

import xmltodict


def decode_annotation(row: Dict[str, Any]) -> Dict[str, Any]:
    text = row["bytes"].decode("utf-8")
    annotation = xmltodict.parse(text)["annotation"]

    objects = annotation["object"]
    # If there's one object, `objects` is a `dict`; otherwise, it's a `list[dict]`.
    if isinstance(objects, dict):
        objects = [objects]

    boxes: List[Tuple] = []
    for obj in objects:
        x1 = float(obj["bndbox"]["xmin"])
        y1 = float(obj["bndbox"]["ymin"])
        x2 = float(obj["bndbox"]["xmax"])
        y2 = float(obj["bndbox"]["ymax"])
        boxes.append((x1, y1, x2, y2))

    labels: List[int] = [CLASS_TO_LABEL[obj["name"]] for obj in objects]

    filename = annotation["filename"]

    return {
        "boxes": boxes,
        "labels": labels,
        "filename": filename,
    }

import os
import ray


path = "s3://anonymous@air-example-data/AnimalDetection/Annotations"
annotations: ray.data.Dataset = (
    ray.data.read_binary_files(path)
    .map(decode_annotation)
)
2025-03-10 21:44:27,975	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.49.230:6379...
2025-03-10 21:44:27,985	INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at https://session-l9djtlqx7qqeui6ppmc55icd5n.i.anyscaleuserdata.com 
2025-03-10 21:44:27,989	INFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_5faf6ec023d53f8af2bda5976e1d23b9edab9b2d.zip' (0.61MiB) to Ray cluster...
2025-03-10 21:44:27,992	INFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_5faf6ec023d53f8af2bda5976e1d23b9edab9b2d.zip'.

Look at the first two samples. Ray Data should have correctly parsed the labels and bounding boxes.

annotations.take(2)
2025-03-10 21:44:29,880	INFO dataset.py:2799 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2025-03-10 21:44:29,890	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-10_20-48-49_801322_2254/logs/ray-data
2025-03-10 21:44:29,890	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[PartitionFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(decode_annotation)] -> LimitOperator[limit=2]
(autoscaler +24s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
[{'boxes': [[31.0, 166.0, 275.0, 415.0]],
  'labels': [1],
  'filename': '2010_002026.jpg'},
 {'boxes': [[109.0, 29.0, 428.0, 394.0]],
  'labels': [2],
  'filename': '2010_002029.jpg'}]

Load images#

Each row of annotations contains the filename of an image.

Write a user-defined function that loads those images. For each annotation, your function will:

  1. Open the image associated with the annotation.

  2. Add the image to a new "image" column.

from typing import Dict

import numpy as np
from PIL import Image


def read_images(row: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    url = os.path.join("https://s3-us-west-2.amazonaws.com/air-example-data/AnimalDetection/JPEGImages", row["filename"])
    response = requests.get(url)
    image = Image.open(io.BytesIO(response.content))
    row["image"] = np.array(image)
    return row


dataset = annotations.map(read_images)
dataset
2025-03-10 21:45:01,288	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-10_20-48-49_801322_2254/logs/ray-data
2025-03-10 21:45:01,289	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[PartitionFiles] -> TaskPoolMapOperator[ReadFiles]

Split the dataset into train and test sets#

Once you've created a Dataset, split it into train and test sets.

train_dataset, test_dataset = dataset.train_test_split(0.2)
2025-03-10 21:46:46,178	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-10_20-48-49_801322_2254/logs/ray-data
2025-03-10 21:46:46,178	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[PartitionFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(decode_annotation)->Map(read_images)] -> AggregateNumRows[AggregateNumRows]
2025-03-10 21:49:52,272	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-10_20-48-49_801322_2254/logs/ray-data
2025-03-10 21:49:52,272	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[PartitionFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(decode_annotation)->Map(read_images)]
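
If you want to sanity-check the split, you could count the rows on each side. This is only a sketch; counting triggers execution of the lazy pipeline, so it re-reads the data:

print(train_dataset.count(), test_dataset.count())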

Define preprocessing logic#

Create a function that preprocesses the images in the dataset. First, transpose and scale the images (ToTensor). Then, randomly augment the images every epoch (RandomHorizontalFlip). Apply this transformation to each row in the dataset with map.

from typing import Any
from torchvision import transforms

def preprocess_image(row: Dict[str, Any]) -> Dict[str, Any]:
    transform = transforms.Compose([transforms.ToTensor(), transforms.RandomHorizontalFlip(p=0.5)])
    row["image"] = transform(row["image"])
    return row
    

# The following transform operation is lazy.
# It will be re-run every epoch.
train_dataset = train_dataset.map(preprocess_image)
test_dataset.take(1)
2025-03-10 21:55:10,274	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-10_20-48-49_801322_2254/logs/ray-data
2025-03-10 21:55:10,275	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> LimitOperator[limit=1]
[{'boxes': [[188.0, 72.0, 319.0, 341.0]],
  'labels': [1],
  'filename': '2009_002936.jpg',
  'image': array([[[  0,   4,   0],
          [  0,   4,   0],
          [  1,   6,   2],
          ...,
          [  2,   6,   0],
          [ 60,  79,  96],
          [134, 186, 233]],
  
         [[  6,   5,   0],
          [ 13,  12,   8],
          [  0,   1,   0],
          ...,
          [  0,  11,   0],
          [125, 162, 178],
          [132, 176, 221]],
  
         [[ 10,   3,   0],
          [ 44,  39,  35],
          [ 14,  10,   7],
          ...,
          [ 54,  81,  92],
          [144, 190, 223],
          [124, 160, 212]],
  
         ...,
  
         [[  9,  18,   1],
          [ 12,  21,   2],
          [ 10,  17,  10],
          ...,
          [182, 176, 180],
          [ 92,  78,  49],
          [166, 167, 123]],
  
         [[ 74,  83,  40],
          [  9,  16,   0],
          [  4,   9,   2],
          ...,
          [255, 250, 255],
          [134, 122, 122],
          [148, 142, 116]],
  
         [[164, 166, 155],
          [ 64,  72,  35],
          [ 33,  41,  17],
          ...,
          [107,  95, 107],
          [184, 180, 179],
          [119, 115,  68]]], dtype=uint8)}]

Fine-tune the object detection model#

Define the training loop#

Write a function that trains fasterrcnn_resnet50_fpn. Your code will look like standard Torch code with a few modifications.

Here's what to note:

  1. Distribute the model with ray.train.torch.prepare_model. Don't use DistributedDataParallel directly.

  2. Pass your Dataset to the Trainer. The Trainer automatically shards the data across workers.

  3. Iterate over data with DataIterator.iter_batches. Don't use a Torch DataLoader.

  4. Pass preprocessors to the Trainer.

In addition, report metrics and checkpoints with train.report. train.report tracks these metrics in Ray Train's internal bookkeeping, letting you monitor training and analyze the run after training completes.
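
The training loop below only reports metrics. If you also want to save model weights each epoch, one hedged sketch is to write the state dict to a temporary directory and pass it to train.report as a checkpoint. The helper name report_checkpoint is illustrative and not part of this tutorial:

import os
import tempfile

import torch

from ray import train
from ray.train import Checkpoint


def report_checkpoint(model, metrics):
    # Write the weights to a temporary directory, then hand that directory
    # to Ray Train as a checkpoint alongside the metrics.
    with tempfile.TemporaryDirectory() as tmpdir:
        # prepare_model may wrap the model in DistributedDataParallel;
        # unwrap it so the state dict keys stay consistent.
        state_dict = getattr(model, "module", model).state_dict()
        torch.save(state_dict, os.path.join(tmpdir, "model.pt"))
        train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))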

import os
import torch
from torchvision import models
from tempfile import TemporaryDirectory

from ray import train


def train_one_epoch(*, model, optimizer, batch_size, epoch):
    model.train()

    lr_scheduler = None
    if epoch == 0:
        warmup_factor = 1.0 / 1000
        lr_scheduler = torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=warmup_factor, total_iters=250
        )

    device = ray.train.torch.get_device()
    train_dataset_shard = train.get_dataset_shard("train")

    batches = train_dataset_shard.iter_batches(batch_size=batch_size)
    for batch in batches:
        inputs = [torch.as_tensor(image).to(device) for image in batch["image"]]

        targets = []
        for i in range(len(batch["boxes"])):
            # `boxes` is a (B, 4) tensor, where B is the number of boxes in the image.
            boxes = torch.as_tensor([box for box in batch["boxes"][i]]).to(device)
            # `labels` is a (B,) tensor, where B is the number of boxes in the image.
            labels = torch.as_tensor(batch["labels"][i]).to(device)
            targets.append({"boxes": boxes, "labels": labels})

        loss_dict = model(inputs, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        if lr_scheduler is not None:
            lr_scheduler.step()

        train.report(
            {
                "losses": losses.item(),
                "epoch": epoch,
                "lr": optimizer.param_groups[0]["lr"],
                **{key: value.item() for key, value in loss_dict.items()},
            }
        )


def train_loop_per_worker(config):
    # By default, `fasterrcnn_resnet50_fpn`'s backbone is pre-trained on ImageNet.
    model = models.detection.fasterrcnn_resnet50_fpn(num_classes=3)
    model = ray.train.torch.prepare_model(model)
    parameters = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(
        parameters,
        lr=config["lr"],
        momentum=config["momentum"],
        weight_decay=config["weight_decay"],
    )
    lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=config["lr_steps"], gamma=config["lr_gamma"]
    )

    for epoch in range(0, config["epochs"]):
        train_one_epoch(
            model=model,
            optimizer=optimizer,
            batch_size=config["batch_size"],
            epoch=epoch,
        )
        lr_scheduler.step()

Fine-tune the model#

Once you've defined the training loop, create a TorchTrainer and pass the training loop to the constructor. Then, call TorchTrainer.fit to train the model.

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={
        "batch_size": 2,
        "lr": 0.02,
        "epochs": 1,  # You'd normally train for 26 epochs.
        "momentum": 0.9,
        "weight_decay": 1e-4,
        "lr_steps": [16, 22],
        "lr_gamma": 0.1,
    },
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": train_dataset},
)
results = trainer.fit()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/tune/impl/tuner_internal.py:125: RayDeprecationWarning: The `RunConfig` class should be imported from `ray.tune` when passing it to the Tuner. Please update your imports. See this issue for more context and migration options: https://github.com/ray-project/ray/issues/49454. Disable these warnings by setting the environment variable: RAY_TRAIN_ENABLE_V2_MIGRATION_WARNINGS=0
  _log_deprecation_warning(
2025-03-10 21:55:21,525	INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949
== Status ==
Current time: 2025-03-10 21:55:21 (running for 00:00:00.11)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 4.0/4 GPUs (0.0/1.0 anyscale/node-group:head, 0.0/2.0 anyscale/provider:aws, 0.0/1.0 anyscale/cpu_only:true, 0.0/2.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/node-group:4xL4:48CPU-192GB, 0.0/1.0 anyscale/accelerator_shape:4xL4, 0.0/1.0 accelerator_type:L4)
Result logdir: /tmp/ray/session_2025-03-10_20-48-49_801322_2254/artifacts/2025-03-10_21-55-21/TorchTrainer_2025-03-10_21-55-21/driver_artifacts
Number of trials: 1/1 (1 PENDING)


== Status ==
Current time: 2025-03-10 21:55:26 (running for 00:00:05.13)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 4.0/4 GPUs (0.0/1.0 anyscale/node-group:head, 0.0/2.0 anyscale/provider:aws, 0.0/1.0 anyscale/cpu_only:true, 0.0/2.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/node-group:4xL4:48CPU-192GB, 0.0/1.0 anyscale/accelerator_shape:4xL4, 0.0/1.0 accelerator_type:L4)
Result logdir: /tmp/ray/session_2025-03-10_20-48-49_801322_2254/artifacts/2025-03-10_21-55-21/TorchTrainer_2025-03-10_21-55-21/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
(RayTrainWorker pid=25530, ip=10.0.211.239) Setting up process group for: env:// [rank=0, world_size=4]
(TorchTrainer pid=25443, ip=10.0.211.239) Started distributed worker processes: 
(TorchTrainer pid=25443, ip=10.0.211.239) - (node_id=35eba17ece4a52de5138ed69f190e701d93372fa47b22ac14d9e0523, ip=10.0.211.239, pid=25530) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=25443, ip=10.0.211.239) - (node_id=35eba17ece4a52de5138ed69f190e701d93372fa47b22ac14d9e0523, ip=10.0.211.239, pid=25529) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=25443, ip=10.0.211.239) - (node_id=35eba17ece4a52de5138ed69f190e701d93372fa47b22ac14d9e0523, ip=10.0.211.239, pid=25532) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=25443, ip=10.0.211.239) - (node_id=35eba17ece4a52de5138ed69f190e701d93372fa47b22ac14d9e0523, ip=10.0.211.239, pid=25531) world_rank=3, local_rank=3, node_rank=0
(RayTrainWorker pid=25530, ip=10.0.211.239) Moving model to device: cuda:0
(RayTrainWorker pid=25530, ip=10.0.211.239) Wrapping provided model in DistributedDataParallel.
== Status ==
Current time: 2025-03-10 21:55:31 (running for 00:00:10.15)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 4.0/4 GPUs (0.0/1.0 anyscale/cpu_only:true, 0.0/1.0 anyscale/node-group:head, 0.0/2.0 anyscale/region:us-west-2, 0.0/2.0 anyscale/provider:aws, 0.0/1.0 anyscale/node-group:4xL4:48CPU-192GB, 0.0/1.0 accelerator_type:L4, 0.0/1.0 anyscale/accelerator_shape:4xL4)
Result logdir: /tmp/ray/session_2025-03-10_20-48-49_801322_2254/artifacts/2025-03-10_21-55-21/TorchTrainer_2025-03-10_21-55-21/driver_artifacts
Number of trials: 1/1 (1 RUNNING)
(SplitCoordinator pid=25864, ip=10.0.211.239) Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-03-10_20-48-49_801322_2254/logs/ray-data
(SplitCoordinator pid=25864, ip=10.0.211.239) Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[Map(preprocess_image)] -> OutputSplitter[split(4, equal=True)]
(Map(preprocess_image) pid=25996, ip=10.0.211.239) /tmp/ray/session_2025-03-10_20-48-49_801322_2254/runtime_resources/pip/eeb9e5d99a956859fc3b865940547678c4d9c242/virtualenv/lib/python3.12/site-packages/torchvision/transforms/functional.py:154: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(Map(preprocess_image) pid=25996, ip=10.0.211.239)   img = torch.from_numpy(pic.transpose((2, 0, 1))).contiguous()
(RayTrainWorker pid=25529, ip=10.0.211.239) /tmp/ipykernel_20675/2152789653.py:29: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:254.)
2025-03-10 21:56:52,736	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/ray/ray_results/TorchTrainer_2025-03-10_21-55-21' in 0.0025s.
2025-03-10 21:56:52,738	INFO tune.py:1041 -- Total run time: 91.21 seconds (91.10 seconds for the tuning loop).
== Status ==
Current time: 2025-03-10 21:56:52 (running for 00:01:31.10)
Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 4.0/4 GPUs (0.0/2.0 anyscale/region:us-west-2, 0.0/1.0 anyscale/cpu_only:true, 0.0/2.0 anyscale/provider:aws, 0.0/1.0 anyscale/node-group:head, 0.0/1.0 accelerator_type:L4, 0.0/1.0 anyscale/node-group:4xL4:48CPU-192GB, 0.0/1.0 anyscale/accelerator_shape:4xL4)
Result logdir: /tmp/ray/session_2025-03-10_20-48-49_801322_2254/artifacts/2025-03-10_21-55-21/TorchTrainer_2025-03-10_21-55-21/driver_artifacts
Number of trials: 1/1 (1 TERMINATED)
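
After fit() returns, you can inspect the returned Result object. A minimal sketch using attributes from Ray Train's Result API:

# Final metrics reported by train.report during the run.
print(results.metrics)

# Checkpoint from the run, if one was reported (None for the loop above).
print(results.checkpoint)

# Path where Ray Train stored logs and artifacts for this run.
print(results.path)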

Next steps#