Fine-tune a Hugging Face Transformers Model#

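This notebook is based on an official Hugging Face example, "How to fine-tune a model on text classification". It shows how to convert the vanilla HF code to Ray Train without having to modify the training logic.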

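This notebook consists of the following steps:

  1. Set up Ray

  2. Load the dataset

  3. Preprocess the dataset with Ray Data

  4. Run the training with Ray Train

  5. Optionally, share the model with the community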

Uncomment and run the following line to install all the necessary dependencies. (This notebook was tested with transformers==4.19.1.)

#! pip install "datasets" "transformers>=4.19.0" "torch>=1.10.0" "mlflow"

Set up Ray#

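Use ray.init() to initialize a local cluster. By default, this cluster contains only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.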

from pprint import pprint
import ray

ray.init()

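Check the resources that your cluster contains. If you are running this notebook on your local machine or on Google Colab, you should see the number of CPU cores and GPUs available on that machine.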

pprint(ray.cluster_resources())
{'CPU': 48.0,
 'GPU': 4.0,
 'accelerator_type:None': 1.0,
 'memory': 206158430208.0,
 'node:10.0.27.125': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 59052625920.0}

This notebook uses Ray Train to fine-tune an HF Transformers model on one of the text classification tasks of the GLUE Benchmark.

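You can change these two variables to control whether the training, which happens later, uses CPUs or GPUs, and how many workers to spawn. Each worker claims one CPU or GPU. Make sure not to request more resources than are available. By default, the training runs with one GPU worker.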

use_gpu = True  # set this to False to run on CPUs
num_workers = 1  # set this to number of GPUs or CPUs you want to use
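As a quick sanity check before training, the request can be compared with what the cluster reports. The following is a minimal sketch under the simplifying assumption stated above (one CPU or one GPU per worker). The helper name check_request is hypothetical and not part of Ray; the dictionary holds values copied from the cluster_resources() output shown earlier.

```python
from typing import Dict


def check_request(resources: Dict[str, float], use_gpu: bool, num_workers: int) -> None:
    """Raise ValueError if the workers would need more than the cluster offers."""
    # Simplifying assumption: each worker claims exactly one CPU or one GPU.
    key = "GPU" if use_gpu else "CPU"
    available = resources.get(key, 0.0)
    if num_workers > available:
        raise ValueError(
            f"Requested {num_workers} {key} worker(s), "
            f"but only {available:g} {key}(s) are available."
        )


# Values copied from the cluster_resources() output shown earlier.
cluster = {"CPU": 48.0, "GPU": 4.0}
check_request(cluster, use_gpu=True, num_workers=1)  # passes silently
```

In the notebook itself, the same check could be called as check_request(ray.cluster_resources(), use_gpu, num_workers).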

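Fine-tune a model on a text classification task#

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. To learn more, see the original notebook.

Each task has a name that is its acronym, with mnli-mm indicating that it is the mismatched version of MNLI. The mismatched version has the same training set as mnli, but different validation and test sets.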

GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]

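This notebook runs on any of the tasks in the preceding list, with any model checkpoint from the Model Hub that has a classification head. Depending on the model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. After you set these three parameters, the rest of the notebook should run smoothly.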

task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

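Load the dataset#

Use the HF Datasets library to download the data and to get the metric used for evaluation and for comparing your model to the benchmark. You can do both easily with the load_dataset and load_metric functions.

Apart from mnli-mm, which is a special code, you can pass the task name directly to those functions.

Run the normal HF Datasets code to load the dataset from the Hub.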

from datasets import load_dataset

actual_task = "mnli" if task == "mnli-mm" else task
datasets = load_dataset("glue", actual_task)
Reusing dataset glue (/home/ray/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)

The dataset object itself is a DatasetDict, which contains one key for the training, validation, and test set, with more keys for the mismatched validation and test set in the special case of mnli.
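To make the mnli-mm special case concrete, here is a small sketch that maps a task name to the dataset configuration to load and the validation split to read. The helper name resolve_glue_names is hypothetical; the split names validation_matched and validation_mismatched are the ones this notebook's training code selects for the two MNLI variants.

```python
from typing import Tuple


def resolve_glue_names(task: str) -> Tuple[str, str]:
    """Return (dataset config name, validation split name) for a GLUE task."""
    # "mnli-mm" is not a dataset config of its own; it reuses the "mnli" data.
    actual_task = "mnli" if task == "mnli-mm" else task
    if task == "mnli-mm":
        validation_key = "validation_mismatched"
    elif task == "mnli":
        validation_key = "validation_matched"
    else:
        validation_key = "validation"
    return actual_task, validation_key


print(resolve_glue_names("cola"))     # ('cola', 'validation')
print(resolve_glue_names("mnli-mm"))  # ('mnli', 'validation_mismatched')
```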

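Preprocess the data with Ray Data#

Before you can feed these texts to the model, you need to preprocess them. Do this with an HF Transformers Tokenizer, which tokenizes the inputs, converts the tokens to their corresponding IDs in the pretrained vocabulary, and puts them in the format the model expects. It also generates the other inputs that the model requires.

To do all of this preprocessing, instantiate your tokenizer with the AutoTokenizer.from_pretrained method, which ensures that you:

  • Get a tokenizer that corresponds to the model architecture you want to use.

  • Download the vocabulary that was used when pretraining this specific checkpoint.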

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

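Pass use_fast=True to the preceding call to use one of the fast, Rust-backed tokenizers from the HF Tokenizers library. These fast tokenizers are available for almost all models, but if the preceding call raises an error, remove the argument.

To preprocess the dataset, you need the names of the columns that contain the sentences. The following dictionary keeps track of the correspondence between tasks and column names: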

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

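Instead of using HF Dataset objects directly, convert them to Ray Data. Both are backed by Arrow tables, so the conversion is straightforward. Use the built-in from_huggingface() function.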

import ray.data

ray_datasets = {
    "train": ray.data.from_huggingface(datasets["train"]),
    "validation": ray.data.from_huggingface(datasets["validation"]),
    "test": ray.data.from_huggingface(datasets["test"]),
}
ray_datasets
{'train': MaterializedDataset(
    num_blocks=1,
    num_rows=8551,
    schema={sentence: string, label: int64, idx: int32}
 ),
 'validation': MaterializedDataset(
    num_blocks=1,
    num_rows=1043,
    schema={sentence: string, label: int64, idx: int32}
 ),
 'test': MaterializedDataset(
    num_blocks=1,
    num_rows=1063,
    schema={sentence: string, label: int64, idx: int32}
 )}

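You can then write the function that preprocesses the samples. Pass them to the tokenizer with truncation=True and padding="longest". With this configuration, the tokenizer truncates any input that is longer than the selected model can handle and pads the remaining sequences in the batch to the length of the longest one.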

import numpy as np
import torch
from typing import Dict


# Tokenize input sentences
def collate_fn(examples: Dict[str, np.ndarray]):
    sentence1_key, sentence2_key = task_to_keys[task]
    if sentence2_key is None:
        outputs = tokenizer(
            list(examples[sentence1_key]),
            truncation=True,
            padding="longest",
            return_tensors="pt",
        )
    else:
        outputs = tokenizer(
            list(examples[sentence1_key]),
            list(examples[sentence2_key]),
            truncation=True,
            padding="longest",
            return_tensors="pt",
        )

    outputs["labels"] = torch.LongTensor(examples["label"])

    # Move all input tensors to the GPU, but only when training on GPUs
    if use_gpu:
        for key, value in outputs.items():
            outputs[key] = value.cuda()

    return outputs
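To illustrate what truncation=True and padding="longest" do to a batch, here is a pure-Python sketch. It is not the tokenizer's implementation: the function name pad_and_truncate, the token IDs, and the pad ID of 0 are assumptions for illustration only, and a real tokenizer also keeps special tokens in place when it truncates.

```python
from typing import Dict, List


def pad_and_truncate(
    batch_ids: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Truncate to max_length, then pad every row to the longest remaining row."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs: one short row and one row longer than max_length.
batch = pad_and_truncate([[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 102]], max_length=5)
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 2146]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding to the longest row in each batch, rather than to a fixed length, keeps the tensors as small as the batch allows.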

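Fine-tune the model with Ray Train#

Now that the data is ready, download the pretrained model and fine-tune it.

Because all of the tasks involve sentence classification, use the AutoModelForSequenceClassification class. For more details on each of the training components, see the original notebook. The original notebook uses the same tokenizer that this notebook used earlier to encode the dataset.

The main difference when using Ray Train is that you need to define the training logic as a function (train_func). You pass this training function to the TorchTrainer, which runs it on every Ray worker. The training then proceeds using PyTorch DDP.

Note

Be sure to initialize the model, metric, and tokenizer inside the function. Otherwise, you may run into serialization errors.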

import torch
import numpy as np

from datasets import load_metric
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

import ray.train
from ray.train.huggingface.transformers import prepare_trainer, RayTrainReportCallback

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)
model_name = model_checkpoint.split("/")[-1]
validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)
name = f"{model_name}-finetuned-{task}"

# Calculate the maximum steps per epoch based on the number of rows in the training dataset.
# Make sure to scale by the total number of training workers and the per device batch size.
max_steps_per_epoch = ray_datasets["train"].count() // (batch_size * num_workers)


def train_func(config):
    print(f"Is CUDA available: {torch.cuda.is_available()}")

    metric = load_metric("glue", actual_task)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, num_labels=num_labels
    )

    train_ds = ray.train.get_dataset_shard("train")
    eval_ds = ray.train.get_dataset_shard("eval")

    train_ds_iterable = train_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=collate_fn
    )
    eval_ds_iterable = eval_ds.iter_torch_batches(
        batch_size=batch_size, collate_fn=collate_fn
    )

    print("max_steps_per_epoch: ", max_steps_per_epoch)

    args = TrainingArguments(
        name,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=config.get("learning_rate", 2e-5),
        num_train_epochs=config.get("epochs", 2),
        weight_decay=config.get("weight_decay", 0.01),
        push_to_hub=False,
        max_steps=max_steps_per_epoch * config.get("epochs", 2),
        disable_tqdm=True,  # declutter the output a little
        no_cuda=not use_gpu,  # you need to explicitly set no_cuda if you want CPUs
        report_to="none",
    )

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        if task != "stsb":
            predictions = np.argmax(predictions, axis=1)
        else:
            predictions = predictions[:, 0]
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model,
        args,
        train_dataset=train_ds_iterable,
        eval_dataset=eval_ds_iterable,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    trainer.add_callback(RayTrainReportCallback())

    trainer = prepare_trainer(trainer)

    print("Starting training")
    trainer.train()
2023-09-06 14:25:28.144428: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-06 14:25:28.284936: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-06 14:25:29.025734: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-09-06 14:25:29.025801: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-09-06 14:25:29.025807: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
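The max_steps_per_epoch computation in the preceding code is needed because the batches come from an iterator with no known length, so the HF Trainer cannot derive the number of steps on its own. As a worked example of that arithmetic, the sketch below plugs in this notebook's values; the helper name max_steps is hypothetical.

```python
def max_steps(num_rows: int, batch_size: int, num_workers: int, epochs: int) -> int:
    """Total optimizer steps when every worker consumes batch_size rows per step."""
    steps_per_epoch = num_rows // (batch_size * num_workers)
    return steps_per_epoch * epochs


# Values from this notebook: 8551 training rows, batch size 16, 1 worker, 2 epochs.
print(8551 // (16 * 1))           # 534
print(max_steps(8551, 16, 1, 2))  # 1068
```

These values agree with the max_steps_per_epoch of 534 and the 1068 total optimization steps that this run logs. With more workers, each worker sees fewer batches per epoch, which is why the divisor includes num_workers.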

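With your train_func complete, you can now instantiate the TorchTrainer. Aside from the function, set the scaling_config, which controls the number of workers and the resources they use, and the datasets to use for training and evaluation.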

from ray.train.torch import TorchTrainer
from ray.train import RunConfig, ScalingConfig, CheckpointConfig

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": ray_datasets["train"],
        "eval": ray_datasets["validation"],
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)
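The CheckpointConfig above asks to keep a single checkpoint, ranked by the lowest eval_loss. The following sketch is a simplified model of what those three settings describe, not Ray's actual implementation, and Ray's real bookkeeping may differ in details. The helper name keep_best and the metric values are made up for illustration.

```python
from typing import Dict, List, Tuple

Checkpoint = Tuple[str, Dict[str, float]]  # (name, reported metrics)


def keep_best(
    checkpoints: List[Checkpoint],
    num_to_keep: int,
    score_attribute: str,
    order: str = "min",
) -> List[Checkpoint]:
    """Rank checkpoints by one metric and keep only the best num_to_keep."""
    ranked = sorted(
        checkpoints,
        key=lambda item: item[1][score_attribute],
        reverse=(order == "max"),
    )
    return ranked[:num_to_keep]


# Made-up metrics for two reported checkpoints.
reported = [
    ("checkpoint_000000", {"eval_loss": 0.50}),
    ("checkpoint_000001", {"eval_loss": 0.55}),
]
print(keep_best(reported, num_to_keep=1, score_attribute="eval_loss"))
# [('checkpoint_000000', {'eval_loss': 0.5})]
```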

Finally, call the fit method to start training with Ray Train. Save the Result object to a variable so that you can access the metrics and checkpoints.

result = trainer.fit()

Tune Status

Current time: 2023-09-06 14:27:12
Running for: 00:01:40.12
Memory: 18.4/186.6 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:None)

Trial Status

Trial name                status      loc                iter  total time (s)  loss    learning_rate  epoch
TorchTrainer_e8bd4_00000  TERMINATED  10.0.27.125:43821  2     76.6259         0.3866  0              1.5
(TorchTrainer pid=43821) Starting distributed worker processes: ['43946 (10.0.27.125)']
(RayTrainWorker pid=43946) Setting up process group for: env:// [rank=0, world_size=1]
(SplitCoordinator pid=44017) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']
(RayTrainWorker pid=43946) Is CUDA available: True
(RayTrainWorker pid=43946) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight']
(RayTrainWorker pid=43946) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=43946) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=43946) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight']
(RayTrainWorker pid=43946) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(SplitCoordinator pid=44016) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']
(RayTrainWorker pid=43946) max_steps_per_epoch:  534
(RayTrainWorker pid=43946) max_steps is given, it will override any value given in num_train_epochs
(RayTrainWorker pid=43946) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=43946)   warnings.warn(
(RayTrainWorker pid=43946) Starting training
(RayTrainWorker pid=43946) ***** Running training *****
(RayTrainWorker pid=43946)   Num examples = 17088
(RayTrainWorker pid=43946)   Num Epochs = 9223372036854775807
(RayTrainWorker pid=43946)   Instantaneous batch size per device = 16
(RayTrainWorker pid=43946)   Total train batch size (w. parallel, distributed & accumulation) = 16
(RayTrainWorker pid=43946)   Gradient Accumulation steps = 1
(RayTrainWorker pid=43946)   Total optimization steps = 1068
(RayTrainWorker pid=43946) /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(SplitCoordinator pid=44016) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44016) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44016) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(RayTrainWorker pid=43946) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(RayTrainWorker pid=43946) {'loss': 0.5414, 'learning_rate': 9.9812734082397e-06, 'epoch': 0.5}
(RayTrainWorker pid=43946) ***** Running Evaluation *****
(RayTrainWorker pid=43946)   Num examples: Unknown
(RayTrainWorker pid=43946)   Batch size = 16
(SplitCoordinator pid=44017) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44017) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44017) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(RayTrainWorker pid=43946) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=43946) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=43946) {'eval_loss': 0.5018134117126465, 'eval_matthews_correlation': 0.4145623770066859, 'eval_runtime': 0.6595, 'eval_samples_per_second': 1581.584, 'eval_steps_per_second': 100.081, 'epoch': 0.5}
(RayTrainWorker pid=43946) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=43946) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=43946) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
(RayTrainWorker pid=43946) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32/checkpoint_000000)
(SplitCoordinator pid=44016) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44016) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44016) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(RayTrainWorker pid=43946) {'loss': 0.3866, 'learning_rate': 0.0, 'epoch': 1.5}
(RayTrainWorker pid=43946) ***** Running Evaluation *****
(RayTrainWorker pid=43946)   Num examples: Unknown
(RayTrainWorker pid=43946)   Batch size = 16
(SplitCoordinator pid=44017) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=44017) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=44017) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(RayTrainWorker pid=43946) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1068
(RayTrainWorker pid=43946) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/config.json
(RayTrainWorker pid=43946) {'eval_loss': 0.5527923107147217, 'eval_matthews_correlation': 0.44860917123689154, 'eval_runtime': 0.6646, 'eval_samples_per_second': 1569.42, 'eval_steps_per_second': 99.311, 'epoch': 1.5}
(RayTrainWorker pid=43946) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/pytorch_model.bin
(RayTrainWorker pid=43946) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/tokenizer_config.json
(RayTrainWorker pid=43946) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1068/special_tokens_map.json
(RayTrainWorker pid=43946) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32/checkpoint_000001)
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) 
(RayTrainWorker pid=43946) {'train_runtime': 66.0485, 'train_samples_per_second': 258.719, 'train_steps_per_second': 16.17, 'train_loss': 0.46413421630859375, 'epoch': 1.5}
2023-09-06 14:27:12,180	WARNING experiment_state.py:371 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-09-06 14:27:12,184	INFO tune.py:1141 -- Total run time: 100.17 seconds (85.12 seconds for the tuning loop).

You can use the returned Result object to access the metrics and the Ray Train Checkpoint associated with the last iteration.

result
Result(
  metrics={'loss': 0.3866, 'learning_rate': 0.0, 'epoch': 1.5, 'step': 1068, 'eval_loss': 0.5527923107147217, 'eval_matthews_correlation': 0.44860917123689154, 'eval_runtime': 0.6646, 'eval_samples_per_second': 1569.42, 'eval_steps_per_second': 99.311},
  path='/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32',
  filesystem='local',
  checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/ray_results/TorchTrainer_2023-09-06_14-25-31/TorchTrainer_e8bd4_00000_0_2023-09-06_14-25-32/checkpoint_000001)
)
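The eval_matthews_correlation value in these metrics comes from compute_metrics, which takes the argmax of the model's logits and passes the predictions to the GLUE metric. The notebook itself relies on load_metric for this. The pure-Python stand-in below only sketches the two steps for binary labels, as in cola; the function names and the logits are made up for illustration.

```python
import math
from typing import List, Sequence


def argmax_rows(logits: Sequence[Sequence[float]]) -> List[int]:
    """Pick the index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def matthews_corrcoef(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when the coefficient is undefined (a zero denominator).
    return (tp * tn - fp * fn) / denom if denom else 0.0


logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]  # made-up model outputs
labels = [1, 0, 0, 0]
preds = argmax_rows(logits)
print(preds)                                        # [1, 0, 1, 0]
print(round(matthews_corrcoef(preds, labels), 4))   # 0.5774
```

The coefficient ranges from -1 to 1, where 0 corresponds to chance-level predictions, which is why it is preferred over accuracy for a label-imbalanced task.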

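Tune hyperparameters with Ray Tune#

To tune any of the model's hyperparameters, pass your TorchTrainer into a Tuner and define the search space.

You can also take advantage of the advanced search algorithms and schedulers that Ray Tune provides. This example uses an ASHAScheduler to aggressively terminate underperforming trials.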

from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers.async_hyperband import ASHAScheduler

tune_epochs = 4
tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {
            "learning_rate": tune.grid_search([2e-5, 2e-4, 2e-3, 2e-2]),
            "epochs": tune_epochs,
        }
    },
    tune_config=tune.TuneConfig(
        metric="eval_loss",
        mode="min",
        num_samples=1,
        scheduler=ASHAScheduler(
            max_t=tune_epochs,
        ),
    ),
    run_config=RunConfig(
        name="tune_transformers",
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
)
2023-09-06 14:46:47,821	INFO tuner_internal.py:508 -- A `RunConfig` was passed to both the `Tuner` and the `TorchTrainer`. The run config passed to the `Tuner` is the one that will be used.
tune_results = tuner.fit()

Tune Status

Current time: 2023-09-06 14:49:04
Running for: 00:02:16.18
Memory: 19.6/186.6 GiB

System Info

Using AsyncHyperBand: num_stopped=4
Bracket: Iter 4.000: -0.6517604142427444 | Iter 1.000: -0.5936744660139084
Logical resource usage: 1.0/48 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:None)

Trial Status

Trial name                status      loc                train_loop_config/learning_rate  iter  total time (s)  loss    learning_rate  epoch
TorchTrainer_e1825_00000  TERMINATED  10.0.27.125:57496  2e-05                            4     128.443         0.1934  0              3.25
TorchTrainer_e1825_00001  TERMINATED  10.0.27.125:57497  0.0002                           1     41.2486         0.616   0.000149906    0.25
TorchTrainer_e1825_00002  TERMINATED  10.0.27.125:57498  0.002                            1     41.1336         0.6699  0.00149906     0.25
TorchTrainer_e1825_00003  TERMINATED  10.0.27.125:57499  0.02                             4     126.699         0.6073  0              3.25
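The four trials in the preceding table come from expanding the grid_search over learning_rate, with epochs held fixed and num_samples=1. The sketch below reproduces that expansion in plain Python to show where the trial count comes from. It is an illustration rather than Ray Tune's code, and the helper name expand_grid is hypothetical.

```python
from itertools import product
from typing import Any, Dict, List


def expand_grid(
    grid: Dict[str, List[Any]], fixed: Dict[str, Any], num_samples: int = 1
) -> List[Dict[str, Any]]:
    """Return one config per combination of grid values, repeated num_samples times."""
    names = list(grid)
    configs = []
    for _ in range(num_samples):
        for values in product(*(grid[name] for name in names)):
            configs.append({**fixed, **dict(zip(names, values))})
    return configs


trials = expand_grid({"learning_rate": [2e-5, 2e-4, 2e-3, 2e-2]}, fixed={"epochs": 4})
print(len(trials))  # 4
print(trials[0])    # {'epochs': 4, 'learning_rate': 2e-05}
```

Adding a second grid-searched hyperparameter multiplies the trial count, so a grid of 4 learning rates and 3 weight decays would yield 12 trials per sample.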
(TorchTrainer pid=57498) Starting distributed worker processes: ['57731 (10.0.27.125)']
(RayTrainWorker pid=57741) Setting up process group for: env:// [rank=0, world_size=1]
(SplitCoordinator pid=57927) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026']
(RayTrainWorker pid=57741) Is CUDA available: True
(RayTrainWorker pid=57741) max_steps_per_epoch:  534
(RayTrainWorker pid=57741) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
(RayTrainWorker pid=57741) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=57741) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=57741) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
(RayTrainWorker pid=57741) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(TorchTrainer pid=57499) Starting distributed worker processes: ['57746 (10.0.27.125)'] [repeated 3x across cluster]
(RayTrainWorker pid=57746) Setting up process group for: env:// [rank=0, world_size=1] [repeated 3x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:01.085704: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 8x across cluster]
(RayTrainWorker pid=57740) 2023-09-06 14:47:01.085711: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [repeated 4x across cluster]
(RayTrainWorker pid=57740) comet_ml is installed but `COMET_API_KEY` is not set. [repeated 4x across cluster]
(SplitCoordinator pid=57965) Auto configuring locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'] [repeated 7x across cluster]
(RayTrainWorker pid=57741) Starting training
(RayTrainWorker pid=57741) max_steps is given, it will override any value given in num_train_epochs
(RayTrainWorker pid=57741) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=57741)   warnings.warn(
(RayTrainWorker pid=57746) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias']
(RayTrainWorker pid=57746) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
(RayTrainWorker pid=57731) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
(RayTrainWorker pid=57740) Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
(RayTrainWorker pid=57740) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
(RayTrainWorker pid=57741) ***** Running training *****
(RayTrainWorker pid=57741)   Num examples = 34176
(RayTrainWorker pid=57741)   Num Epochs = 9223372036854775807
(RayTrainWorker pid=57741)   Instantaneous batch size per device = 16
(RayTrainWorker pid=57741)   Total train batch size (w. parallel, distributed & accumulation) = 16
(RayTrainWorker pid=57741)   Gradient Accumulation steps = 1
(RayTrainWorker pid=57741)   Total optimization steps = 2136
(RayTrainWorker pid=57741) /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(SplitCoordinator pid=57927) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)]
(SplitCoordinator pid=57927) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=57927) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(RayTrainWorker pid=57741) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(RayTrainWorker pid=57741) {'loss': 0.5481, 'learning_rate': 1.4990636704119851e-05, 'epoch': 0.25}
(RayTrainWorker pid=57740) Is CUDA available: True [repeated 3x across cluster]
(RayTrainWorker pid=57740) max_steps_per_epoch:  534 [repeated 3x across cluster]
(RayTrainWorker pid=57740) Starting training [repeated 3x across cluster]
(RayTrainWorker pid=57741) ***** Running Evaluation *****
(RayTrainWorker pid=57741)   Num examples: Unknown
(RayTrainWorker pid=57741)   Batch size = 16
(RayTrainWorker pid=57740) - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). [repeated 3x across cluster]
(RayTrainWorker pid=57740) - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). [repeated 3x across cluster]
(RayTrainWorker pid=57731) Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
(RayTrainWorker pid=57740) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [repeated 3x across cluster]
(RayTrainWorker pid=57740) max_steps is given, it will override any value given in num_train_epochs [repeated 3x across cluster]
(RayTrainWorker pid=57740) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning [repeated 3x across cluster]
(RayTrainWorker pid=57740)   warnings.warn( [repeated 3x across cluster]
(RayTrainWorker pid=57740) ***** Running training ***** [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Num examples = 34176 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Num Epochs = 9223372036854775807 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Instantaneous batch size per device = 16 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Total train batch size (w. parallel, distributed & accumulation) = 16 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Gradient Accumulation steps = 1 [repeated 3x across cluster]
(RayTrainWorker pid=57740)   Total optimization steps = 2136 [repeated 3x across cluster]
(RayTrainWorker pid=57740) /tmp/ipykernel_43503/4088900328.py:23: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.) [repeated 3x across cluster]
(SplitCoordinator pid=57965) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 3x across cluster]
(SplitCoordinator pid=57965) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 3x across cluster]
(SplitCoordinator pid=57965) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 3x across cluster]
(RayTrainWorker pid=57740) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [repeated 3x across cluster]
(RayTrainWorker pid=57741) {'eval_loss': 0.5202918648719788, 'eval_matthews_correlation': 0.37321205597032797, 'eval_runtime': 0.7255, 'eval_samples_per_second': 1437.704, 'eval_steps_per_second': 90.976, 'epoch': 0.25}
(RayTrainWorker pid=57741) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535
(RayTrainWorker pid=57741) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json
(RayTrainWorker pid=57741) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin
(RayTrainWorker pid=57741) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json
(RayTrainWorker pid=57741) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json
(RayTrainWorker pid=57741) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000000)
(RayTrainWorker pid=57746) {'loss': 0.6064, 'learning_rate': 0.009981273408239701, 'epoch': 1.25} [repeated 4x across cluster]
(RayTrainWorker pid=57740) {'eval_loss': 0.6181353330612183, 'eval_matthews_correlation': 0.0, 'eval_runtime': 0.7543, 'eval_samples_per_second': 1382.828, 'eval_steps_per_second': 87.504, 'epoch': 0.25} [repeated 3x across cluster]
(RayTrainWorker pid=57746) ***** Running Evaluation ***** [repeated 4x across cluster]
(RayTrainWorker pid=57746)   Num examples: Unknown [repeated 4x across cluster]
(RayTrainWorker pid=57746)   Batch size = 16 [repeated 4x across cluster]
(SplitCoordinator pid=57954) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 6x across cluster]
(SplitCoordinator pid=57954) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 6x across cluster]
(SplitCoordinator pid=57954) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 6x across cluster]
(RayTrainWorker pid=57740) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-535 [repeated 3x across cluster]
(RayTrainWorker pid=57740) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/config.json [repeated 3x across cluster]
(RayTrainWorker pid=57740) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/pytorch_model.bin [repeated 3x across cluster]
(RayTrainWorker pid=57740) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/tokenizer_config.json [repeated 3x across cluster]
(RayTrainWorker pid=57740) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-535/special_tokens_map.json [repeated 3x across cluster]
(RayTrainWorker pid=57740) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00001_1_learning_rate=0.0002_2023-09-06_14-46-48/checkpoint_000000) [repeated 3x across cluster]
(RayTrainWorker pid=57746) {'loss': 0.6061, 'learning_rate': 0.004971910112359551, 'epoch': 2.25} [repeated 2x across cluster]
(RayTrainWorker pid=57741) {'eval_loss': 0.5246258974075317, 'eval_matthews_correlation': 0.489934557943789, 'eval_runtime': 0.6462, 'eval_samples_per_second': 1614.032, 'eval_steps_per_second': 102.134, 'epoch': 1.25} [repeated 2x across cluster]
(RayTrainWorker pid=57746) ***** Running Evaluation ***** [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Num examples: Unknown [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Batch size = 16 [repeated 2x across cluster]
(SplitCoordinator pid=57927) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 4x across cluster]
(SplitCoordinator pid=57927) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 4x across cluster]
(SplitCoordinator pid=57927) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 4x across cluster]
(RayTrainWorker pid=57741) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1070 [repeated 2x across cluster]
(RayTrainWorker pid=57741) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/pytorch_model.bin [repeated 2x across cluster]
(RayTrainWorker pid=57741) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/tokenizer_config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1070/special_tokens_map.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000001) [repeated 2x across cluster]
(RayTrainWorker pid=57746) {'loss': 0.6073, 'learning_rate': 0.0, 'epoch': 3.25} [repeated 2x across cluster]
(RayTrainWorker pid=57741) {'eval_loss': 0.6450843811035156, 'eval_matthews_correlation': 0.5259674254268325, 'eval_runtime': 0.6474, 'eval_samples_per_second': 1611.106, 'eval_steps_per_second': 101.949, 'epoch': 2.25} [repeated 2x across cluster]
(RayTrainWorker pid=57746) ***** Running Evaluation ***** [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Num examples: Unknown [repeated 2x across cluster]
(RayTrainWorker pid=57746)   Batch size = 16 [repeated 2x across cluster]
(SplitCoordinator pid=57927) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(1, equal=True)] [repeated 4x across cluster]
(SplitCoordinator pid=57927) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=['84374908fd32ea9885fdd6d21aadf2ce3e296daf28a26522e7a8d026'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 4x across cluster]
(SplitCoordinator pid=57927) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 4x across cluster]
(RayTrainWorker pid=57741) Saving model checkpoint to distilbert-base-uncased-finetuned-cola/checkpoint-1605 [repeated 2x across cluster]
(RayTrainWorker pid=57741) Configuration saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Model weights saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/pytorch_model.bin [repeated 2x across cluster]
(RayTrainWorker pid=57741) tokenizer config file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/tokenizer_config.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Special tokens file saved in distilbert-base-uncased-finetuned-cola/checkpoint-1605/special_tokens_map.json [repeated 2x across cluster]
(RayTrainWorker pid=57741) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/tune_transformers/TorchTrainer_e1825_00000_0_learning_rate=0.0000_2023-09-06_14-46-48/checkpoint_000002) [repeated 2x across cluster]
(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) 
(RayTrainWorker pid=57746) {'train_runtime': 115.5377, 'train_samples_per_second': 295.8, 'train_steps_per_second': 18.487, 'train_loss': 0.6787891173630618, 'epoch': 3.25}
2023-09-06 14:49:04,574	INFO tune.py:1141 -- Total run time: 136.19 seconds (136.17 seconds for the tuning loop).
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) 
(RayTrainWorker pid=57741) {'train_runtime': 117.6791, 'train_samples_per_second': 290.417, 'train_steps_per_second': 18.151, 'train_loss': 0.3468295286657212, 'epoch': 3.25}

View the results of the tuning run as a dataframe, and find the best result.

tune_results.get_dataframe().sort_values("eval_loss")
loss learning_rate epoch step eval_loss eval_matthews_correlation eval_runtime eval_samples_per_second eval_steps_per_second timestamp ... time_total_s pid hostname node_ip time_since_restore iterations_since_restore checkpoint_dir_name config/train_loop_config/learning_rate config/train_loop_config/epochs logdir
1 0.6160 0.000150 0.25 535 0.618135 0.000000 0.7543 1382.828 87.504 1694036857 ... 41.248600 57497 ip-10-0-27-125 10.0.27.125 41.248600 1 checkpoint_000000 0.00020 4 e1825_00001
2 0.6699 0.001499 0.25 535 0.619657 0.000000 0.7449 1400.202 88.603 1694036856 ... 41.133609 57498 ip-10-0-27-125 10.0.27.125 41.133609 1 checkpoint_000000 0.00200 4 e1825_00002
3 0.6073 0.000000 3.25 2136 0.619694 0.000000 0.6329 1648.039 104.286 1694036942 ... 126.699238 57499 ip-10-0-27-125 10.0.27.125 126.699238 4 checkpoint_000003 0.02000 4 e1825_00003
0 0.1934 0.000000 3.25 2136 0.747960 0.520756 0.6530 1597.187 101.068 1694036944 ... 128.443495 57496 ip-10-0-27-125 10.0.27.125 128.443495 4 checkpoint_000003 0.00002 4 e1825_00000

4 rows × 26 columns
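
The logged learning rate in the table differs from the configured one (for example, 0.000150 is logged for the trial configured with 0.00020). The logged values are consistent with a linear decay to zero without warmup. The sketch below reproduces them under that assumption; the schedule type, the helper name `linear_decay_lr`, and the logging steps (taken from the checkpoint names 535, 1070, 1605) are assumptions, not something this section states.

```python
import math


def linear_decay_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Learning rate after `step` optimizer steps, decaying linearly to zero (no warmup)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


TOTAL_STEPS = 2136  # "Total optimization steps" reported in the training log

# (configured learning rate, assumed logging step)
for base_lr, step in [(2e-5, 535), (2e-4, 535), (2e-2, 1070), (2e-2, 1605), (2e-2, 2136)]:
    lr = linear_decay_lr(base_lr, step, TOTAL_STEPS)
    print(f"configured={base_lr:g} step={step}: logged lr would be {lr:.6g}")
```

Under this assumption, a trial that stopped after its first report shows a partially decayed rate, while a trial that ran all 2136 steps shows 0.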

best_result = tune_results.get_best_result()
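
Which trial counts as "best" depends on the metric used for the comparison. The following minimal sketch uses plain Python and the last reported values copied from the table above to show that the lowest `eval_loss` and the highest `eval_matthews_correlation` point to different trials. The dictionary keys `trial` and `configured_lr` are illustrative names, and the sketch assumes that lower loss and higher Matthews correlation are better.

```python
# Last reported metrics per trial, copied from the results table.
trials = [
    {"trial": "e1825_00000", "configured_lr": 0.00002,
     "eval_loss": 0.747960, "eval_matthews_correlation": 0.520756},
    {"trial": "e1825_00001", "configured_lr": 0.00020,
     "eval_loss": 0.618135, "eval_matthews_correlation": 0.000000},
    {"trial": "e1825_00002", "configured_lr": 0.00200,
     "eval_loss": 0.619657, "eval_matthews_correlation": 0.000000},
    {"trial": "e1825_00003", "configured_lr": 0.02000,
     "eval_loss": 0.619694, "eval_matthews_correlation": 0.000000},
]

# Select by lowest evaluation loss.
best_by_loss = min(trials, key=lambda t: t["eval_loss"])

# Select by highest Matthews correlation, the task metric for CoLA.
best_by_mcc = max(trials, key=lambda t: t["eval_matthews_correlation"])

print("lowest eval_loss:", best_by_loss["trial"], best_by_loss["configured_lr"])
print("highest Matthews correlation:", best_by_mcc["trial"], best_by_mcc["configured_lr"])
```

If the task metric matters more than the loss, the selection can be made explicit with `tune_results.get_best_result(metric="eval_matthews_correlation", mode="max")`.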

Share the model#

To share the model with the community, a few more steps follow.

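You ran the training on the Ray cluster, but want to share the model from your local environment. This configuration lets you authenticate easily.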

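First, store your authentication token from the Hugging Face website. Sign up here if you haven't already. Then execute the following cell and enter your credentials when prompted: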

from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS. Uncomment the following instruction:

# !apt install git-lfs

Load the model from the best-performing checkpoint:

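import os

from ray.train import Checkpoint
from transformers import AutoModelForSequenceClassification

# Checkpoint of the trial selected by get_best_result() above.
checkpoint: Checkpoint = best_result.checkpoint

# Materialize the checkpoint as a local directory, then load the model
# from its "checkpoint" subdirectory.
with checkpoint.as_directory() as checkpoint_dir:
    checkpoint_path = os.path.join(checkpoint_dir, "checkpoint")
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path)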
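
The loading code above assumes that the Hugging Face files sit in a "checkpoint" subdirectory of the Ray checkpoint directory. If that layout is uncertain, a small check before calling `from_pretrained` gives a clearer error than a failed load. The sketch below is one way to do this with the standard library only; `resolve_model_dir` is a hypothetical helper, not part of Ray or Transformers, and it relies on the `config.json` file that the training log shows being saved with each checkpoint.

```python
import os
import tempfile


def resolve_model_dir(checkpoint_dir: str, subdir: str = "checkpoint") -> str:
    """Return the directory containing config.json, preferring `subdir`."""
    candidates = (os.path.join(checkpoint_dir, subdir), checkpoint_dir)
    for candidate in candidates:
        if os.path.isfile(os.path.join(candidate, "config.json")):
            return candidate
    raise FileNotFoundError(f"no config.json found under {checkpoint_dir!r}")


# Demo on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "checkpoint"))
    with open(os.path.join(tmp, "checkpoint", "config.json"), "w") as f:
        f.write("{}")
    found_in_subdir = resolve_model_dir(tmp) == os.path.join(tmp, "checkpoint")
    print("found in subdirectory:", found_in_subdir)
```

Inside the `with checkpoint.as_directory()` block, the returned path could be passed to `from_pretrained` in place of `checkpoint_path`.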

You can now upload the result of the training to the Hub. Execute this instruction:

model.push_to_hub("the-name-you-picked")  # replace with your own repository name

You can now share this model. Others can load it with the identifier "your-username/the-name-you-picked". For example:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model")

See also#