Fine-tune GPT-J-6B with Ray Train and DeepSpeed#


This example showcases how to use Ray Train for GPT-J fine-tuning. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information, see GPT-J.

This example uses the Ray Train 🤗 Transformers integration and a pre-trained model from the Hugging Face Hub. Note that this example is adaptable to other similar models.

This is an advanced example that focuses on the performance and distributed computing aspects of Ray Train. For a beginner-friendly introduction to the Ray Train 🤗 Transformers integration, see the basic example for HuggingFace Transformers.

Read the Ray Train Key Concepts and the Ray Data Integration user guides before starting this example.

Note

To run this example, make sure your Ray cluster has access to at least one GPU with 16 GB or more of memory. The required amount of memory depends on the model. This notebook was tested with 16 g4dn.4xlarge instances (including the head node).

This notebook has the following steps:

  1. Set up Ray

  2. Load the dataset

  3. Preprocess the dataset with Ray Data

  4. Run the training with Ray Train

  5. Generate text from a prompt

Uncomment and run the following line to install all the necessary dependencies (this notebook was tested with accelerate==0.18.0, transformers==4.26.0, and deepspeed==0.12.3):

! pip install -q "datasets" "evaluate" "accelerate==0.18.0" "transformers==4.26.0" "torch>=1.12.0" "deepspeed==0.12.3"
import numpy as np
import pandas as pd
import os

Set up Ray#

First, let's set some global variables. We use 16 workers, each being assigned 1 GPU and 8 CPUs.

model_name = "EleutherAI/gpt-j-6B"
use_gpu = True
num_workers = 16
cpus_per_worker = 8

We initialize a local cluster with ray.init(). By default, this cluster consists of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.

We define a runtime environment to ensure that the Ray workers have access to all the necessary packages. You can omit the runtime_env argument if all the packages are already installed on every node in your cluster.

import ray

ray.init(
    runtime_env={
        "pip": [
            "datasets",
            "evaluate",
            # The latest combination accelerate==0.25.0, transformers==4.36.0, deepspeed==0.12.4
            # has issues with DeepSpeed process group initialization,
            # and will result in a batch_size validation problem.
            # TODO(ml-team): get rid of the pins once the issue is fixed.
            "accelerate==0.18.0",
            "transformers==4.26.0",
            "torch>=1.12.0",
            "deepspeed==0.12.3",
        ],
    },
)
# THIS SHOULD BE HIDDEN IN DOCS AND ONLY RAN IN CI
# Download the model from our S3 mirror as it's faster

import ray
import subprocess
import ray.util.scheduling_strategies


def force_on_node(node_id: str, remote_func_or_actor_class):
    scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
        node_id=node_id, soft=False
    )
    options = {"scheduling_strategy": scheduling_strategy}
    return remote_func_or_actor_class.options(**options)


def run_on_every_node(remote_func_or_actor_class, **remote_kwargs):
    refs = []
    for node in ray.nodes():
        if node["Alive"] and node["Resources"].get("GPU", None):
            refs.append(
                force_on_node(node["NodeID"], remote_func_or_actor_class).remote(
                    **remote_kwargs
                )
            )
    return ray.get(refs)


@ray.remote(num_gpus=1)
def download_model():
    from transformers.utils.hub import TRANSFORMERS_CACHE

    path = os.path.expanduser(
        os.path.join(TRANSFORMERS_CACHE, "models--EleutherAI--gpt-j-6B")
    )
    subprocess.run(["mkdir", "-p", os.path.join(path, "snapshots", "main")])
    subprocess.run(["mkdir", "-p", os.path.join(path, "refs")])
    if os.path.exists(os.path.join(path, "refs", "main")):
        return
    subprocess.run(
        [
            "aws",
            "s3",
            "sync",
            "--no-sign-request",
            "s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/",
            os.path.join(path, "snapshots", "main"),
        ]
    )
    with open(os.path.join(path, "snapshots", "main", "hash"), "r") as f:
        f_hash = f.read().strip()
    with open(os.path.join(path, "refs", "main"), "w") as f:
        f.write(f_hash)
    os.rename(
        os.path.join(path, "snapshots", "main"), os.path.join(path, "snapshots", f_hash)
    )


_ = run_on_every_node(download_model)

Load the dataset#

We fine-tune the model on the tiny_shakespeare dataset, comprised of 40,000 lines of text from a variety of Shakespeare's plays. The aim is to make the GPT-J model better at generating text in the style of Shakespeare.

from datasets import load_dataset

print("Loading tiny_shakespeare dataset")
current_dataset = load_dataset("tiny_shakespeare")
current_dataset

We use Ray Data for distributed preprocessing and data ingestion. With ray.data.from_huggingface(), we can easily convert a dataset obtained from the Hugging Face Hub into Ray Data.

import ray.data

ray_datasets = {
    "train": ray.data.from_huggingface(current_dataset["train"]),
    "validation": ray.data.from_huggingface(current_dataset["validation"]),
}

ray_datasets
{'train': MaterializedDataset(num_blocks=1, num_rows=1, schema={text: string}),
 'validation': MaterializedDataset(num_blocks=1, num_rows=1, schema={text: string})}

Note that the dataset is represented as a single row containing one large string, so it needs some preprocessing. To do this, use the map_batches() API to apply transformation functions to batches of data.

The split_text function takes the single string and splits it into separate lines, removing empty lines and character names ending with ':' (for example, 'ROMEO:'). The tokenize function takes the lines and tokenizes them using the 🤗 Tokenizer associated with the model, ensuring each entry has the same length (block_size) by padding and truncating. This preprocessing is necessary for training.

Note

This preprocessing can be done in other ways. A common pattern is to tokenize first, and then split the obtained tokens into equally-sized blocks.
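For illustration, a minimal sketch of that alternative pattern (not the approach used in this example; the tokenize_then_chunk helper is hypothetical) could look like this:

import numpy as np
import pandas as pd
from transformers import AutoTokenizer


def tokenize_then_chunk(batch: pd.DataFrame, block_size: int = 512) -> dict:
    # Tokenize first, without padding or truncation, so no tokens are wasted.
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    tokenizer.pad_token = tokenizer.eos_token
    tokenized = tokenizer(list(batch["text"]))
    # Concatenate every sequence into one long stream of token ids.
    concatenated = [tok for seq in tokenized["input_ids"] for tok in seq]
    # Drop the remainder so that every block has exactly block_size tokens.
    total_length = (len(concatenated) // block_size) * block_size
    input_ids = [
        concatenated[i : i + block_size] for i in range(0, total_length, block_size)
    ]
    # No padding was added, so the attention mask is all ones.
    return {
        "input_ids": np.array(input_ids, dtype=np.int64),
        "attention_mask": np.ones((len(input_ids), block_size), dtype=np.int64),
        "labels": np.array(input_ids, dtype=np.int64),
    }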

block_size = 512
from transformers import AutoTokenizer


def split_text(batch: pd.DataFrame) -> pd.DataFrame:
    text = list(batch["text"])
    flat_text = "".join(text)
    split_text = [
        x.strip()
        for x in flat_text.split("\n")
        if x.strip() and not x.strip()[-1] == ":"
    ]
    return pd.DataFrame(split_text, columns=["text"])


def tokenize(batch: pd.DataFrame) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    tokenizer.pad_token = tokenizer.eos_token
    ret = tokenizer(
        list(batch["text"]),
        truncation=True,
        max_length=block_size,
        padding="max_length",
        return_tensors="np",
    )
    ret["labels"] = ret["input_ids"].copy()
    return dict(ret)


processed_datasets = {
    key: (
        ds.map_batches(split_text, batch_format="pandas")
        .map_batches(tokenize, batch_format="pandas")
    )
    for key, ds in ray_datasets.items()
}
processed_datasets
{'train': MapBatches(tokenize)
 +- MapBatches(split_text)
    +- Dataset(num_blocks=1, num_rows=1, schema={text: string}),
 'validation': MapBatches(tokenize)
 +- MapBatches(split_text)
    +- Dataset(num_blocks=1, num_rows=1, schema={text: string})}

Fine-tune the model with Ray Train#

Configure Ray Train's TorchTrainer to perform distributed fine-tuning of the model. Specify a train_loop_per_worker function, which defines the training logic that Ray distributes using Distributed Data Parallelism (which uses the PyTorch Distributed backend internally). Each worker has its own copy of the model, but operates on different data. At the end of each step, all the workers sync gradients.

Because GPT-J is a relatively large model, it may not fit on smaller GPU types (<=16 GB GRAM). To deal with that issue, this example uses DeepSpeed, a library that optimizes the training process and offloads and partitions optimizer and parameter states, reducing GRAM usage. Furthermore, DeepSpeed ZeRO Stage 3 can load large models without running out of memory.

The 🤗 Transformers and Ray Train integrations allow you to easily configure and use DDP and DeepSpeed. All you need to do is specify the DeepSpeed configuration in the TrainingArguments object.

Tip

There are many DeepSpeed settings that let you trade off speed for memory usage. The settings used below are tailored to the cluster setup used (16 g4dn.4xlarge nodes) and a per-device batch size of 16. Some things to keep in mind:

  • If your GPUs support bfloat16, use it instead of float16 mixed precision to get better performance and prevent overflows. Replace fp16=True with bf16=True in TrainingArguments.

  • If you are running out of GRAM: try decreasing the batch size (defined in the cell below the next one) and set "overlap_comm": False in the DeepSpeed configuration, as shown in the sketch after this list.

  • If you are running out of RAM: add more nodes to your cluster, use nodes with more RAM, set "pin_memory": False in the DeepSpeed configuration, decrease the batch size, and remove "offload_param" from the DeepSpeed configuration.
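For reference, here is a small, purely illustrative helper (the apply_low_memory_tweaks name is hypothetical) showing how the GRAM- and RAM-saving tweaks above could be applied to the deepspeed configuration dict defined inside train_func below; on bfloat16-capable GPUs you would additionally pass bf16=True instead of fp16=True to TrainingArguments:

def apply_low_memory_tweaks(deepspeed_config: dict) -> dict:
    # Return a copy of the DeepSpeed config with the memory-saving tweaks applied.
    cfg = dict(deepspeed_config)
    zero = dict(cfg["zero_optimization"])
    # Running out of GRAM: disable communication/computation overlap.
    zero["overlap_comm"] = False
    # Running out of RAM: stop pinning the offloaded optimizer state,
    # and drop parameter offloading if the config contains it.
    zero["offload_optimizer"] = {"device": "cpu", "pin_memory": False}
    zero.pop("offload_param", None)
    cfg["zero_optimization"] = zero
    return cfg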

For more information on DeepSpeed configuration, refer to the Hugging Face documentation and the DeepSpeed documentation.

Additionally, if you prefer a lower-level API, the logic below can be expressed as an Accelerate training loop distributed by a Ray Train TorchTrainer.
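As a rough, purely illustrative sketch (not the linked example), such a loop inside a Ray Train worker might be structured as follows. It reuses model_name and the datasets from this notebook, and in practice a model of this size still needs DeepSpeed (for example through Accelerate's DeepSpeedPlugin) to fit in memory:

import torch
from accelerate import Accelerator
from transformers import GPTJForCausalLM

from ray import train


def accelerate_train_func(config):
    # Ray Train has already set up the torch.distributed process group,
    # so Accelerator() attaches to the existing environment.
    accelerator = Accelerator(mixed_precision="fp16")

    model = GPTJForCausalLM.from_pretrained(model_name, use_cache=False)
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=config.get("learning_rate", 2e-5)
    )
    model, optimizer = accelerator.prepare(model, optimizer)

    train_ds = train.get_dataset_shard("train")
    for epoch in range(config.get("epochs", 1)):
        for batch in train_ds.iter_torch_batches(
            batch_size=config.get("batch_size", 16), device=accelerator.device
        ):
            outputs = model(**batch)
            accelerator.backward(outputs.loss)
            optimizer.step()
            optimizer.zero_grad()
        train.report({"epoch": epoch, "loss": outputs.loss.item()})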

Training speed#

Because this example uses data parallelism, each worker operates on its own shard of the data. The batch size set in train_ds.iter_torch_batches is the per-device batch size (per-worker batch size). By changing the number of workers, you can change the effective batch size and thus the time needed for training to complete. Calculate the effective batch size as per-device batch size * number of workers * number of gradient accumulation steps. As you add more workers, the effective batch size rises, and thus less time is needed to complete a full epoch. While the speedup isn't exactly linear due to the extra communication overhead, in many cases it can be close to linear.
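As a quick sanity check of that formula against the cluster sizes discussed below (the helper is only illustrative):

def effective_batch_size(per_device_batch_size, num_workers, grad_accum_steps=1):
    # Effective batch size = per-device batch size * number of workers
    #                        * gradient accumulation steps.
    return per_device_batch_size * num_workers * grad_accum_steps


effective_batch_size(16, 16)  # 16 g4dn.4xlarge workers -> 256
effective_batch_size(16, 32)  # 32 g4dn.4xlarge workers -> 512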

The preprocessed dataset has 1,348 examples. We set the per-device batch size to 16.

  • With 16 g4dn.4xlarge nodes, the effective batch size was 256, which equals 85 steps per epoch. One epoch took ~2440 seconds (including initialization time).

  • With 32 g4dn.4xlarge nodes, the effective batch size was 512, which equals 43 steps per epoch. One epoch took ~1280 seconds (including initialization time).

import evaluate
import torch
from transformers import (
    Trainer,
    TrainingArguments,
    GPTJForCausalLM,
    AutoTokenizer,
    default_data_collator,
)
from transformers.utils.logging import disable_progress_bar, enable_progress_bar

from ray import train
from ray.train.huggingface.transformers import prepare_trainer, RayTrainReportCallback


def train_func(config):
    # Use the actual number of CPUs assigned by Ray
    os.environ["OMP_NUM_THREADS"] = str(
        train.get_context().get_trial_resources().bundles[-1].get("CPU", 1)
    )
    # Enable tf32 for better performance
    torch.backends.cuda.matmul.allow_tf32 = True

    batch_size = config.get("batch_size", 4)
    epochs = config.get("epochs", 2)
    warmup_steps = config.get("warmup_steps", 0)
    learning_rate = config.get("learning_rate", 0.00002)
    weight_decay = config.get("weight_decay", 0.01)
    steps_per_epoch = config.get("steps_per_epoch")

    deepspeed = {
        "fp16": {
            "enabled": "auto",
            "initial_scale_power": 8,
            "hysteresis": 4,
            "consecutive_hysteresis": True,
        },
        "bf16": {"enabled": "auto"},
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
            },
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "gather_16bit_weights_on_model_save": True,
            "round_robin_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 10,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": False,
    }

    print("Preparing training arguments")
    training_args = TrainingArguments(
        "output",
        logging_steps=1,
        save_strategy="steps",
        save_steps=steps_per_epoch,
        max_steps=steps_per_epoch * epochs,
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=1,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        warmup_steps=warmup_steps,
        label_names=["input_ids", "attention_mask"],
        push_to_hub=False,
        report_to="none",
        disable_tqdm=True,  # declutter the output a little
        fp16=True,
        gradient_checkpointing=True,
        deepspeed=deepspeed,
    )
    disable_progress_bar()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    print("Loading model")

    model = GPTJForCausalLM.from_pretrained(model_name, use_cache=False)
    model.resize_token_embeddings(len(tokenizer))

    print("Model loaded")

    enable_progress_bar()

    metric = evaluate.load("accuracy")

    train_ds = train.get_dataset_shard("train")
    eval_ds = train.get_dataset_shard("validation")

    train_ds_iterable = train_ds.iter_torch_batches(
        batch_size=batch_size,
        local_shuffle_buffer_size=train.get_context().get_world_size() * batch_size,
    )
    eval_ds_iterable = eval_ds.iter_torch_batches(batch_size=batch_size)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds_iterable,
        eval_dataset=eval_ds_iterable,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    # Add callback to report checkpoints to Ray Train
    trainer.add_callback(RayTrainReportCallback())
    trainer = prepare_trainer(trainer)
    trainer.train()

After defining the training function, instantiate the TorchTrainer. Aside from the function, set the scaling_config to control the number of workers and the amount of resources to use, and set the datasets (the preprocessed Ray Datasets) to use for training and evaluation.

Note

Running on multiple nodes requires persisting checkpoints and other outputs to external storage so they can be accessed after training completes. You should set up cloud storage or NFS, then replace storage_path with your own cloud bucket URI or NFS path.

See Configuration and Persistent Storage for more details.

storage_path = "s3://your-bucket-here"  # TODO: Set up cloud storage
# storage_path="/mnt/path/to/nfs"     # TODO: Alternatively, set up NFS
batch_size = 16
train_ds_size = processed_datasets["train"].count()
steps_per_epoch = train_ds_size // (batch_size * num_workers)
from ray.train.torch import TorchTrainer
from ray.train import RunConfig, ScalingConfig

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={
        "epochs": 1,
        "batch_size": batch_size,  # per device
        "steps_per_epoch": steps_per_epoch,
    },
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets=processed_datasets,
    run_config=RunConfig(storage_path=storage_path),
)

Finally, call the fit() method to start training with Ray Train. Save the Result object to a variable so you can access metrics and checkpoints.

results = trainer.fit()

Tune Status

Current time: 2023-08-18 18:54:02
Running for: 00:44:50.37
Memory: 10.2/62.0 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 129.0/256 CPUs, 16.0/16 GPUs

Trial Status

Trial name                 status      loc              iter   total time (s)   loss    learning_rate   epoch
TorchTrainer_01ea5_00000   TERMINATED  10.0.60.59:8839  1      2663.78          0.069   2.38095e-07     1
(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:16.315108: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(TrainTrainable pid=8839, ip=10.0.60.59) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:16.462944: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:17.336229: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:17.336299: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=8839, ip=10.0.60.59) 2023-08-18 18:09:17.336306: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(TrainTrainable pid=8839, ip=10.0.60.59) --------------------------------------------------------------------------
(TrainTrainable pid=8839, ip=10.0.60.59)                  Aim collects anonymous usage analytics.                 
(TrainTrainable pid=8839, ip=10.0.60.59)                         Read how to opt-out here:                         
(TrainTrainable pid=8839, ip=10.0.60.59)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
(TrainTrainable pid=8839, ip=10.0.60.59) --------------------------------------------------------------------------
(TrainTrainable pid=8839, ip=10.0.60.59) comet_ml is installed but `COMET_API_KEY` is not set.
(TorchTrainer pid=8839, ip=10.0.60.59) Starting distributed worker processes: ['8911 (10.0.60.59)', '36675 (10.0.13.222)', '8880 (10.0.63.99)', '8867 (10.0.49.236)', '49329 (10.0.40.253)', '8845 (10.0.18.195)', '36249 (10.0.11.26)', '8858 (10.0.0.119)', '8857 (10.0.44.114)', '8885 (10.0.47.209)', '36311 (10.0.27.53)', '8830 (10.0.30.35)', '8875 (10.0.0.80)', '8851 (10.0.43.240)', '9631 (10.0.57.153)', '36262 (10.0.52.191)']
(RayTrainWorker pid=8911, ip=10.0.60.59) Setting up process group for: env:// [rank=0, world_size=16]
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:25.209122: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(RayTrainWorker pid=8911, ip=10.0.60.59) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:25.358493: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:26.095161: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:26.095229: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=8911, ip=10.0.60.59) 2023-08-18 18:09:26.095236: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(SplitCoordinator pid=8980, ip=10.0.60.59) Auto configuring locality_with_output=['6002ded0aaa53ce9a0351d22a72b344ef411a422919132f41d9f937a', 'd3bbd390b6fe73f26202f96d75998946cf3e8b457528d426db0c6e07', 'fe6aaf54317ee630a02d23e0d49581b57b5cd51316eaf769e28bb045', 'f7de4694a4f764c05a9c51a6a4bd40ac33f3fced3b25127b25cd4ac3', '42866a2fba4ce2ab4b6645c4d731d486b762e2b23ac24cafccba7096', '8a7272830662c7e756a656de0a9b433a3a1f9b990768f692b6fe11a7', 'bba62e8b57552509c62a6b6b7fd67c1a2280b9d81b3d9c41eb4d1b9b', 'b40764f303538c24bc439106f2e7b2144d382bfed6c9fdec15ab828e', 'd1de4d4b6d44eff93857026df4ef0f70e24e3dc91e15d87015f2ed32', '4d6a9dc1aa7bfc80cb73d9f66f4e28041807f12769391f5643bce143', '8bcc7235f459b61be21fe158d0bae4fef2ec6de013ec60e7aaf7897a', '73c50b995811afa0ece70fd3d4466b7fd0dc85a97d6807128b2c47da', '03bf3d374a9f857b1cd1aebdbe028208f7904b077fb151790e03e9fe', '9f7fc101a7d6b3e17b72e57ca1c92f91d13aa385a6740f99d58ec016', '867844d104a8e9351a1dcc8bbd61d99906a8dc5b53e220c2ae2efbe1', '7677b344c59d6b30c3db451f48e346d61bb60cc798e5567aa4e0a1ea']
(RayTrainWorker pid=49329) comet_ml is installed but `COMET_API_KEY` is not set.
(RayTrainWorker pid=8867, ip=10.0.49.236) --------------------------------------------------------------------------
(RayTrainWorker pid=8867, ip=10.0.49.236)                  Aim collects anonymous usage analytics.                 
(RayTrainWorker pid=8867, ip=10.0.49.236)                         Read how to opt-out here:                         
(RayTrainWorker pid=8867, ip=10.0.49.236)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
(RayTrainWorker pid=8867, ip=10.0.49.236) --------------------------------------------------------------------------
(SplitCoordinator pid=8980, ip=10.0.60.59) 2023-08-18 18:09:26.534936: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA [repeated 16x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.rayai.org.cn/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(SplitCoordinator pid=8980, ip=10.0.60.59) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 16x across cluster]
(SplitCoordinator pid=8980, ip=10.0.60.59) 2023-08-18 18:09:26.667181: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`. [repeated 16x across cluster]
(RayTrainWorker pid=8885, ip=10.0.47.209) Preparing training arguments
(RayTrainWorker pid=36675, ip=10.0.13.222) Loading model
(autoscaler +3m53s) [workspace snapshot] New snapshot created successfully (size: 172.52 MB).
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:12:01,852] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.05B parameters
(RayTrainWorker pid=36675, ip=10.0.13.222) Preparing training arguments [repeated 15x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) Loading model [repeated 15x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) Model loaded
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 22.1MB/s]
(SplitCoordinator pid=8980, ip=10.0.60.59) 2023-08-18 18:09:27.424862: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 32x across cluster]
(SplitCoordinator pid=8980, ip=10.0.60.59) 2023-08-18 18:09:27.424869: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) comet_ml is installed but `COMET_API_KEY` is not set. [repeated 15x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) -------------------------------------------------------------------------- [repeated 26x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191)                  Aim collects anonymous usage analytics.                  [repeated 13x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191)                         Read how to opt-out here:                          [repeated 13x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html     [repeated 13x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) max_steps is given, it will override any value given in num_train_epochs
(RayTrainWorker pid=8911, ip=10.0.60.59) Using cuda_amp half precision backend
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:12:36,256] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:12:36,373] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(RayTrainWorker pid=8858, ip=10.0.0.119) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
(RayTrainWorker pid=8858, ip=10.0.0.119) Creating extension directory /home/ray/.cache/torch_extensions/py39_cu118/cpu_adam...
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 19.8MB/s] [repeated 15x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) max_steps is given, it will override any value given in num_train_epochs [repeated 15x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) Using cuda_amp half precision backend [repeated 15x across cluster]
(RayTrainWorker pid=49329) Detected CUDA files, patching ldflags
(RayTrainWorker pid=49329) Emitting ninja build file /home/ray/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja...
(RayTrainWorker pid=49329) Building extension module cpu_adam...
(RayTrainWorker pid=49329) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=8858, ip=10.0.0.119) [1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
(RayTrainWorker pid=8830, ip=10.0.30.35) Model loaded [repeated 15x across cluster]
(RayTrainWorker pid=49329) [2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
(RayTrainWorker pid=36675, ip=10.0.13.222) [1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o  [repeated 15x across cluster]
(RayTrainWorker pid=49329) [3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/ray/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
(RayTrainWorker pid=49329) Time to load cpu_adam op: 31.202290058135986 seconds
(RayTrainWorker pid=49329) Loading extension module cpu_adam...
(RayTrainWorker pid=36675, ip=10.0.13.222) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Creating extension directory /home/ray/.cache/torch_extensions/py39_cu118/cpu_adam... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Detected CUDA files, patching ldflags [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Emitting ninja build file /home/ray/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Building extension module cpu_adam... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [repeated 15x across cluster]
(RayTrainWorker pid=49329) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=49329) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
(RayTrainWorker pid=49329) Building extension module utils...
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,196] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,212] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,212] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,212] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,212] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,520] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,521] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 1.26 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,521] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.96 GB, percent = 14.4%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,523] [INFO] [stage3.py:113:__init__] Reduce bucket size 16777216
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:13,523] [INFO] [stage3.py:114:__init__] Prefetch bucket size 15099494
(RayTrainWorker pid=49329) [1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
(RayTrainWorker pid=36675, ip=10.0.13.222) [2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o  [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) [3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/ray/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Time to load cpu_adam op: 34.29589319229126 seconds [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Adam Optimizer #0 is created with AVX512 arithmetic capability. [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1 [repeated 15x across cluster]
(RayTrainWorker pid=49329) [2/2] c++ flatten_unflatten.o -shared -L/home/ray/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
(RayTrainWorker pid=49329) Time to load utils op: 15.381849527359009 seconds
(RayTrainWorker pid=49329) Loading extension module utils...
(RayTrainWorker pid=36675, ip=10.0.13.222) Loading extension module cpu_adam... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root... [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Creating extension directory /home/ray/.cache/torch_extensions/py39_cu118/utils... [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Emitting ninja build file /home/ray/.cache/torch_extensions/py39_cu118/utils/build.ninja... [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Building extension module utils... [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,490] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,491] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,491] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.96 GB, percent = 14.5%
(RayTrainWorker pid=8911, ip=10.0.60.59) Parameter Offload: Total persistent parameters: 811008 in 114 params
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,763] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,764] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:29,764] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.96 GB, percent = 14.5%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:30,012] [INFO] [utils.py:785:see_memory_usage] Before creating fp16 partitions
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:30,013] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:30,013] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.96 GB, percent = 14.5%
(RayTrainWorker pid=36675, ip=10.0.13.222) [1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o  [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Loading extension module utils... [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) [2/2] c++ flatten_unflatten.o -shared -L/home/ray/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so [repeated 15x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) Time to load utils op: 16.94431161880493 seconds [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:31,872] [INFO] [utils.py:785:see_memory_usage] After creating fp16 partitions: 1
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:31,873] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:31,873] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 9.98 GB, percent = 16.1%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,120] [INFO] [utils.py:785:see_memory_usage] Before creating fp32 partitions
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,121] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,121] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 9.98 GB, percent = 16.1%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,624] [INFO] [utils.py:785:see_memory_usage] After creating fp32 partitions
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,624] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,625] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 11.39 GB, percent = 18.4%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,870] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,870] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:32,871] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 11.39 GB, percent = 18.4%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:34,834] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:34,835] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:34,835] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.25 GB, percent = 26.2%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:34,835] [INFO] [stage3.py:392:_setup_for_real_optimizer] optimizer state initialized
(RayTrainWorker pid=8830, ip=10.0.30.35) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
(RayTrainWorker pid=8830, ip=10.0.30.35) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=8830, ip=10.0.30.35) Loading extension module utils...
(RayTrainWorker pid=9631, ip=10.0.57.153) Loading extension module utils...
(RayTrainWorker pid=9631, ip=10.0.57.153) ***** Running training *****
(RayTrainWorker pid=9631, ip=10.0.57.153)   Num examples = 10752
(RayTrainWorker pid=9631, ip=10.0.57.153)   Num Epochs = 9223372036854775807
(RayTrainWorker pid=9631, ip=10.0.57.153)   Instantaneous batch size per device = 8
(RayTrainWorker pid=9631, ip=10.0.57.153)   Total train batch size (w. parallel, distributed & accumulation) = 128
(RayTrainWorker pid=9631, ip=10.0.57.153)   Gradient Accumulation steps = 1
(RayTrainWorker pid=9631, ip=10.0.57.153)   Total optimization steps = 84
(RayTrainWorker pid=9631, ip=10.0.57.153)   Number of trainable parameters = 0
(RayTrainWorker pid=8830, ip=10.0.30.35) Time to load utils op: 0.0005006790161132812 seconds
(RayTrainWorker pid=9631, ip=10.0.57.153) Time to load utils op: 0.0005137920379638672 seconds
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,692] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,693] [INFO] [utils.py:786:see_memory_usage] MA 0.14 GB         Max_MA 0.91 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,693] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 17.3 GB, percent = 27.9%
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,694] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,694] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,694] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f50b45fbfd0>
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,694] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[2e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,695] [INFO] [config.py:955:print] DeepSpeedEngine configuration:
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   activation_checkpointing_config  {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "partition_activations": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "contiguous_memory_optimization": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "cpu_checkpointing": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "number_checkpoints": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "synchronize_checkpoint_boundary": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "profile": false
(RayTrainWorker pid=8911, ip=10.0.60.59) }
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   amp_enabled .................. False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   amp_params ................... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   autotuning_config ............ {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "enabled": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "start_step": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "end_step": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "metric_path": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "arg_mappings": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "metric": "throughput", 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "model_info": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "results_dir": "autotuning_results", 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "exps_dir": "autotuning_exps", 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "overwrite": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "fast": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "start_profile_step": 3, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "end_profile_step": 5, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "tuner_type": "gridsearch", 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "tuner_early_stopping": 5, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "tuner_num_trials": 50, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "model_info_path": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "mp_size": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "max_train_batch_size": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "min_train_batch_size": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "min_train_micro_batch_size_per_gpu": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "num_tuning_micro_batch_sizes": 3
(RayTrainWorker pid=8911, ip=10.0.60.59) }
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   bfloat16_enabled ............. False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   checkpoint_parallel_write_pipeline  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   checkpoint_tag_validation_enabled  True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   checkpoint_tag_validation_fail  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f50c6da2370>
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   communication_data_type ...... None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   curriculum_enabled_legacy .... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   curriculum_params_legacy ..... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   data_efficiency_enabled ...... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,696] [INFO] [config.py:959:print]   dataloader_drop_last ......... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   disable_allgather ............ False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   dump_state ................... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   dynamic_loss_scale_args ...... {'init_scale': 256, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_enabled ........... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_gas_boundary_resolution  1
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_layer_name ........ bert.encoder.layer
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_layer_num ......... 0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_max_iter .......... 100
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_stability ......... 1e-06
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_tol ............... 0.01
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   eigenvalue_verbose ........... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   elasticity_enabled ........... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   flops_profiler_config ........ {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "enabled": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "profile_step": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "module_depth": -1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "top_modules": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "detailed": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "output_file": null
(RayTrainWorker pid=8911, ip=10.0.60.59) }
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   fp16_auto_cast ............... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   fp16_enabled ................. True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   fp16_master_weights_and_gradients  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   global_rank .................. 0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   grad_accum_dtype ............. None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   gradient_accumulation_steps .. 1
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   gradient_clipping ............ 1.0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   gradient_predivide_factor .... 1.0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   initial_dynamic_scale ........ 256
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   load_universal_checkpoint .... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   loss_scale ................... 0
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   memory_breakdown ............. False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   mics_hierarchial_params_gather  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   mics_shard_size .............. -1
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   nebula_config ................ {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "enabled": false, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "persistent_storage_path": null, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "persistent_time_interval": 100, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "num_of_version_in_retention": 2, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "enable_nebula_load": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "load_path": null
(RayTrainWorker pid=8911, ip=10.0.60.59) }
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   optimizer_legacy_fusion ...... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   optimizer_name ............... adamw
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.999], 'eps': 1e-08}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,697] [INFO] [config.py:959:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   pld_enabled .................. False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   pld_params ................... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   prescale_gradients ........... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   scheduler_name ............... None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   scheduler_params ............. None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   sparse_attention ............. None
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   sparse_gradients_enabled ..... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   steps_per_print .............. 10
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   train_batch_size ............. 128
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   train_micro_batch_size_per_gpu  8
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   use_node_local_storage ....... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   wall_clock_breakdown ......... False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   world_size ................... 16
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_allow_untested_optimizer  False
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=True mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_enabled ................. True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_force_ds_cpu_optimizer .. True
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:959:print]   zero_optimization_stage ...... 3
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:13:40,698] [INFO] [config.py:945:print_user_config]   json = {
(RayTrainWorker pid=8911, ip=10.0.60.59)     "fp16": {
(RayTrainWorker pid=8911, ip=10.0.60.59)         "enabled": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "initial_scale_power": 8
(RayTrainWorker pid=8911, ip=10.0.60.59)     }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "bf16": {
(RayTrainWorker pid=8911, ip=10.0.60.59)         "enabled": false
(RayTrainWorker pid=8911, ip=10.0.60.59)     }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "optimizer": {
(RayTrainWorker pid=8911, ip=10.0.60.59)         "type": "AdamW", 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "params": {
(RayTrainWorker pid=8911, ip=10.0.60.59)             "lr": 2e-05, 
(RayTrainWorker pid=8911, ip=10.0.60.59)             "betas": [0.9, 0.999], 
(RayTrainWorker pid=8911, ip=10.0.60.59)             "eps": 1e-08
(RayTrainWorker pid=8911, ip=10.0.60.59)         }
(RayTrainWorker pid=8911, ip=10.0.60.59)     }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "zero_optimization": {
(RayTrainWorker pid=8911, ip=10.0.60.59)         "stage": 3, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "offload_optimizer": {
(RayTrainWorker pid=8911, ip=10.0.60.59)             "device": "cpu", 
(RayTrainWorker pid=8911, ip=10.0.60.59)             "pin_memory": true
(RayTrainWorker pid=8911, ip=10.0.60.59)         }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "offload_param": {
(RayTrainWorker pid=8911, ip=10.0.60.59)             "device": "cpu", 
(RayTrainWorker pid=8911, ip=10.0.60.59)             "pin_memory": true
(RayTrainWorker pid=8911, ip=10.0.60.59)         }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "overlap_comm": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "contiguous_gradients": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "reduce_bucket_size": 1.677722e+07, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "stage3_prefetch_bucket_size": 1.509949e+07, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "stage3_param_persistence_threshold": 4.096000e+04, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "gather_16bit_weights_on_model_save": true, 
(RayTrainWorker pid=8911, ip=10.0.60.59)         "round_robin_gradients": true
(RayTrainWorker pid=8911, ip=10.0.60.59)     }, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "gradient_accumulation_steps": 1, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "gradient_clipping": 1.0, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "steps_per_print": 10, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "train_batch_size": 128, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "train_micro_batch_size_per_gpu": 8, 
(RayTrainWorker pid=8911, ip=10.0.60.59)     "wall_clock_breakdown": false
(RayTrainWorker pid=8911, ip=10.0.60.59) }
(SplitCoordinator pid=8980, ip=10.0.60.59) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(split_text)->MapBatches(tokenize)] -> OutputSplitter[split(16, equal=True)]
(SplitCoordinator pid=8980, ip=10.0.60.59) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=2000000000.0), locality_with_output=['6002ded0aaa53ce9a0351d22a72b344ef411a422919132f41d9f937a', 'd3bbd390b6fe73f26202f96d75998946cf3e8b457528d426db0c6e07', 'fe6aaf54317ee630a02d23e0d49581b57b5cd51316eaf769e28bb045', 'f7de4694a4f764c05a9c51a6a4bd40ac33f3fced3b25127b25cd4ac3', '42866a2fba4ce2ab4b6645c4d731d486b762e2b23ac24cafccba7096', '8a7272830662c7e756a656de0a9b433a3a1f9b990768f692b6fe11a7', 'bba62e8b57552509c62a6b6b7fd67c1a2280b9d81b3d9c41eb4d1b9b', 'b40764f303538c24bc439106f2e7b2144d382bfed6c9fdec15ab828e', 'd1de4d4b6d44eff93857026df4ef0f70e24e3dc91e15d87015f2ed32', '4d6a9dc1aa7bfc80cb73d9f66f4e28041807f12769391f5643bce143', '8bcc7235f459b61be21fe158d0bae4fef2ec6de013ec60e7aaf7897a', '73c50b995811afa0ece70fd3d4466b7fd0dc85a97d6807128b2c47da', '03bf3d374a9f857b1cd1aebdbe028208f7904b077fb151790e03e9fe', '9f7fc101a7d6b3e17b72e57ca1c92f91d13aa385a6740f99d58ec016', '867844d104a8e9351a1dcc8bbd61d99906a8dc5b53e220c2ae2efbe1', '7677b344c59d6b30c3db451f48e346d61bb60cc798e5567aa4e0a1ea'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=8980, ip=10.0.60.59) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:42.547741: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:42.685843: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:43.506819: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:43.506880: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(MapBatches(split_text)->MapBatches(tokenize) pid=10097, ip=10.0.60.59) 2023-08-18 18:13:43.506887: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(RayTrainWorker pid=8911, ip=10.0.60.59) Time to load utils op: 0.0003864765167236328 seconds [repeated 14x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 12.1235, 'learning_rate': 1.9761904761904763e-05, 'epoch': 0.01}
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 6.7834, 'learning_rate': 1.9523809523809524e-05, 'epoch': 0.02} [repeated 16x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) {'loss': 2.2151, 'learning_rate': 1.928571428571429e-05, 'epoch': 0.04} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.1739, 'learning_rate': 1.904761904761905e-05, 'epoch': 0.05} [repeated 16x across cluster]
(autoscaler +8m53s) [workspace snapshot] New snapshot created successfully (size: 172.58 MB).
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.121, 'learning_rate': 1.880952380952381e-05, 'epoch': 0.06} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.1422, 'learning_rate': 1.8571428571428575e-05, 'epoch': 0.07} [repeated 16x across cluster]
(RayTrainWorker pid=36249, ip=10.0.11.26) {'loss': 0.1007, 'learning_rate': 1.8333333333333333e-05, 'epoch': 0.08} [repeated 16x across cluster]
(RayTrainWorker pid=8867, ip=10.0.49.236) {'loss': 0.1082, 'learning_rate': 1.8095238095238097e-05, 'epoch': 0.1} [repeated 16x across cluster]
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.094, 'learning_rate': 1.785714285714286e-05, 'epoch': 0.11} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0936, 'learning_rate': 1.761904761904762e-05, 'epoch': 0.12} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:18:36,553] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[1.761904761904762e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:18:36,554] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=4.768458258762969, CurrSamplesPerSec=4.833942877725304, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8857, ip=10.0.44.114) {'loss': 0.0921, 'learning_rate': 1.7380952380952384e-05, 'epoch': 0.13} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0915, 'learning_rate': 1.7142857142857142e-05, 'epoch': 0.14} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0883, 'learning_rate': 1.6904761904761906e-05, 'epoch': 0.15} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0868, 'learning_rate': 1.6666666666666667e-05, 'epoch': 0.17} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) {'loss': 0.0815, 'learning_rate': 1.642857142857143e-05, 'epoch': 0.18} [repeated 16x across cluster]
(autoscaler +13m58s) [workspace snapshot] New snapshot created successfully (size: 172.58 MB).
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0825, 'learning_rate': 1.6190476190476193e-05, 'epoch': 0.19} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0813, 'learning_rate': 1.5952380952380954e-05, 'epoch': 0.2} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0816, 'learning_rate': 1.5714285714285715e-05, 'epoch': 0.21} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0813, 'learning_rate': 1.5476190476190476e-05, 'epoch': 0.23} [repeated 16x across cluster]
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.0765, 'learning_rate': 1.523809523809524e-05, 'epoch': 0.24} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:23:03,756] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[1.523809523809524e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:23:03,756] [INFO] [timer.py:199:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=4.781402482813706, CurrSamplesPerSec=4.7832870646183325, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.0833, 'learning_rate': 1.5000000000000002e-05, 'epoch': 0.25} [repeated 16x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) {'loss': 0.084, 'learning_rate': 1.4761904761904763e-05, 'epoch': 0.26} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0839, 'learning_rate': 1.4523809523809524e-05, 'epoch': 0.27} [repeated 16x across cluster]
(RayTrainWorker pid=8867, ip=10.0.49.236) {'loss': 0.0825, 'learning_rate': 1.4285714285714287e-05, 'epoch': 0.29} [repeated 16x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) {'loss': 0.0838, 'learning_rate': 1.4047619047619048e-05, 'epoch': 0.3} [repeated 16x across cluster]
(RayTrainWorker pid=8867, ip=10.0.49.236) {'loss': 0.0847, 'learning_rate': 1.3809523809523811e-05, 'epoch': 0.31} [repeated 16x across cluster]
(autoscaler +18m58s) [workspace snapshot] New snapshot created successfully (size: 172.58 MB).
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0788, 'learning_rate': 1.3571428571428574e-05, 'epoch': 0.32} [repeated 16x across cluster]
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0832, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.33} [repeated 16x across cluster]
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0811, 'learning_rate': 1.3095238095238096e-05, 'epoch': 0.35} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0759, 'learning_rate': 1.2857142857142859e-05, 'epoch': 0.36} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:27:35,516] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.2857142857142859e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:27:35,517] [INFO] [timer.py:199:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=4.756191577689035, CurrSamplesPerSec=4.775146730091594, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0774, 'learning_rate': 1.261904761904762e-05, 'epoch': 0.37} [repeated 16x across cluster]
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.0751, 'learning_rate': 1.2380952380952383e-05, 'epoch': 0.38} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0744, 'learning_rate': 1.2142857142857142e-05, 'epoch': 0.39} [repeated 16x across cluster]
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0722, 'learning_rate': 1.1904761904761905e-05, 'epoch': 0.4} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0742, 'learning_rate': 1.1666666666666668e-05, 'epoch': 0.42} [repeated 16x across cluster]
(RayTrainWorker pid=8857, ip=10.0.44.114) {'loss': 0.0764, 'learning_rate': 1.1428571428571429e-05, 'epoch': 0.43} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0786, 'learning_rate': 1.1190476190476192e-05, 'epoch': 0.44} [repeated 16x across cluster]
(autoscaler +24m4s) [workspace snapshot] New snapshot created successfully (size: 172.58 MB).
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0738, 'learning_rate': 1.0952380952380955e-05, 'epoch': 0.45} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0784, 'learning_rate': 1.0714285714285714e-05, 'epoch': 0.46} [repeated 16x across cluster]
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.0786, 'learning_rate': 1.0476190476190477e-05, 'epoch': 0.48} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:32:06,009] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[1.0476190476190477e-05], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:32:06,009] [INFO] [timer.py:199:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=4.750214082000028, CurrSamplesPerSec=4.781755388354574, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0714, 'learning_rate': 1.0238095238095238e-05, 'epoch': 0.49} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0739, 'learning_rate': 1e-05, 'epoch': 0.5} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0767, 'learning_rate': 9.761904761904762e-06, 'epoch': 0.51} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0827, 'learning_rate': 9.523809523809525e-06, 'epoch': 0.52} [repeated 16x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) {'loss': 0.0751, 'learning_rate': 9.285714285714288e-06, 'epoch': 0.54} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0737, 'learning_rate': 9.047619047619049e-06, 'epoch': 0.55} [repeated 16x across cluster]
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.0755, 'learning_rate': 8.80952380952381e-06, 'epoch': 0.56} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0745, 'learning_rate': 8.571428571428571e-06, 'epoch': 0.57} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0753, 'learning_rate': 8.333333333333334e-06, 'epoch': 0.58} [repeated 16x across cluster]
(autoscaler +29m9s) [workspace snapshot] New snapshot created successfully (size: 172.59 MB).
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0739, 'learning_rate': 8.095238095238097e-06, 'epoch': 0.6} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:36:34,033] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[8.095238095238097e-06], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:36:34,033] [INFO] [timer.py:199:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=4.75579745222066, CurrSamplesPerSec=4.705258125568294, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.073, 'learning_rate': 7.857142857142858e-06, 'epoch': 0.61} [repeated 16x across cluster]
(RayTrainWorker pid=8830, ip=10.0.30.35) {'loss': 0.0721, 'learning_rate': 7.61904761904762e-06, 'epoch': 0.62} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0729, 'learning_rate': 7.380952380952382e-06, 'epoch': 0.63} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0714, 'learning_rate': 7.1428571428571436e-06, 'epoch': 0.64} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0745, 'learning_rate': 6.9047619047619055e-06, 'epoch': 0.65} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0726, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.67} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) {'loss': 0.0699, 'learning_rate': 6.4285714285714295e-06, 'epoch': 0.68} [repeated 16x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) {'loss': 0.0732, 'learning_rate': 6.1904761904761914e-06, 'epoch': 0.69} [repeated 16x across cluster]
(RayTrainWorker pid=49329) {'loss': 0.0714, 'learning_rate': 5.9523809523809525e-06, 'epoch': 0.7} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.0709, 'learning_rate': 5.7142857142857145e-06, 'epoch': 0.71} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:41:07,338] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[5.7142857142857145e-06], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:41:07,338] [INFO] [timer.py:199:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=4.74341422313603, CurrSamplesPerSec=4.640637786972311, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(autoscaler +34m9s) [workspace snapshot] New snapshot created successfully (size: 172.59 MB).
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.071, 'learning_rate': 5.476190476190477e-06, 'epoch': 0.73} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0714, 'learning_rate': 5.2380952380952384e-06, 'epoch': 0.74} [repeated 16x across cluster]
(RayTrainWorker pid=8875, ip=10.0.0.80) {'loss': 0.0703, 'learning_rate': 5e-06, 'epoch': 0.75} [repeated 16x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0733, 'learning_rate': 4.761904761904762e-06, 'epoch': 0.76} [repeated 16x across cluster]
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0686, 'learning_rate': 4.523809523809524e-06, 'epoch': 0.77} [repeated 16x across cluster]
(RayTrainWorker pid=8851, ip=10.0.43.240) {'loss': 0.068, 'learning_rate': 4.2857142857142855e-06, 'epoch': 0.79} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.071, 'learning_rate': 4.047619047619048e-06, 'epoch': 0.8} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) {'loss': 0.0708, 'learning_rate': 3.80952380952381e-06, 'epoch': 0.81} [repeated 16x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) {'loss': 0.0766, 'learning_rate': 3.5714285714285718e-06, 'epoch': 0.82} [repeated 16x across cluster]
(RayTrainWorker pid=8858, ip=10.0.0.119) {'loss': 0.0743, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.83} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:45:31,965] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[3.3333333333333333e-06], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:45:31,965] [INFO] [timer.py:199:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=4.757168325507401, CurrSamplesPerSec=4.8146031804109555, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=8830, ip=10.0.30.35) {'loss': 0.0752, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.85} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:45:58,184] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 256, but hysteresis is 2. Reducing hysteresis to 1
(autoscaler +39m14s) [workspace snapshot] New snapshot created successfully (size: 172.59 MB).
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0717, 'learning_rate': 3.0952380952380957e-06, 'epoch': 0.86} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:46:26,433] [WARNING] [stage3.py:1826:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0695, 'learning_rate': 2.8571428571428573e-06, 'epoch': 0.87} [repeated 16x across cluster]
(RayTrainWorker pid=36249, ip=10.0.11.26) {'loss': 0.0709, 'learning_rate': 2.6190476190476192e-06, 'epoch': 0.88} [repeated 16x across cluster]
(RayTrainWorker pid=8885, ip=10.0.47.209) {'loss': 0.0729, 'learning_rate': 2.380952380952381e-06, 'epoch': 0.89} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.0752, 'learning_rate': 2.1428571428571427e-06, 'epoch': 0.9} [repeated 16x across cluster]
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0712, 'learning_rate': 1.904761904761905e-06, 'epoch': 0.92} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0708, 'learning_rate': 1.6666666666666667e-06, 'epoch': 0.93} [repeated 16x across cluster]
(RayTrainWorker pid=36249, ip=10.0.11.26) {'loss': 0.0723, 'learning_rate': 1.4285714285714286e-06, 'epoch': 0.94} [repeated 16x across cluster]
(RayTrainWorker pid=8845, ip=10.0.18.195) {'loss': 0.0689, 'learning_rate': 1.1904761904761906e-06, 'epoch': 0.95} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:50:01,494] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=1, lr=[1.1904761904761906e-06], mom=[[0.9, 0.999]]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:50:01,494] [INFO] [timer.py:199:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=4.756310378443122, CurrSamplesPerSec=4.758170892979721, MemAllocated=0.16GB, MaxMemAllocated=8.93GB
(RayTrainWorker pid=36675, ip=10.0.13.222) {'loss': 0.0715, 'learning_rate': 9.523809523809525e-07, 'epoch': 0.96} [repeated 16x across cluster]
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.07, 'learning_rate': 7.142857142857143e-07, 'epoch': 0.98} [repeated 16x across cluster]
(RayTrainWorker pid=9631, ip=10.0.57.153) {'loss': 0.0716, 'learning_rate': 4.7619047619047623e-07, 'epoch': 0.99} [repeated 16x across cluster]
(autoscaler +44m19s) [workspace snapshot] New snapshot created successfully (size: 172.60 MB).
(RayTrainWorker pid=8880, ip=10.0.63.99) {'loss': 0.069, 'learning_rate': 2.3809523809523811e-07, 'epoch': 1.0} [repeated 16x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) Saving model checkpoint to output/checkpoint-84
(RayTrainWorker pid=8911, ip=10.0.60.59) Configuration saved in output/checkpoint-84/config.json
(RayTrainWorker pid=8911, ip=10.0.60.59) Configuration saved in output/checkpoint-84/generation_config.json
(RayTrainWorker pid=8911, ip=10.0.60.59) Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root... [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) No modifications detected for re-loaded extension module utils, skipping build step... [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) Loading extension module utils... [repeated 14x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) ***** Running training ***** [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Num examples = 10752 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Num Epochs = 9223372036854775807 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Instantaneous batch size per device = 8 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Total train batch size (w. parallel, distributed & accumulation) = 128 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Gradient Accumulation steps = 1 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Total optimization steps = 84 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)   Number of trainable parameters = 0 [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) Model weights saved in output/checkpoint-84/pytorch_model.bin
(RayTrainWorker pid=8911, ip=10.0.60.59) tokenizer config file saved in output/checkpoint-84/tokenizer_config.json
(RayTrainWorker pid=8911, ip=10.0.60.59) Special tokens file saved in output/checkpoint-84/special_tokens_map.json
(RayTrainWorker pid=49329) [2023-08-18 18:52:12,213] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step84 is ready now!
(RayTrainWorker pid=36249, ip=10.0.11.26) {'loss': 0.069, 'learning_rate': 2.3809523809523811e-07, 'epoch': 1.0} [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:12,213] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step84 is about to be saved!
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:12,213] [INFO] [engine.py:3337:save_16bit_model] Saving model weights to output/checkpoint-84/pytorch_model.bin, tag: global_step84
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:12,213] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-84/pytorch_model.bin...
(RayTrainWorker pid=49329) /home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
(RayTrainWorker pid=49329)   warnings.warn(
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,660] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved output/checkpoint-84/pytorch_model.bin.
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,673] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step84 is about to be saved!
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,684] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: output/checkpoint-84/global_step84/zero_pp_rank_0_mp_rank_00_model_states.pt
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,685] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-84/global_step84/zero_pp_rank_0_mp_rank_00_model_states.pt...
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:27,660] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step84 is ready now! [repeated 15x across cluster]
(RayTrainWorker pid=36262, ip=10.0.52.191) [2023-08-18 18:52:27,685] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-84/global_step84/zero_pp_rank_15_mp_rank_00_model_states.pt...
(RayTrainWorker pid=9631, ip=10.0.57.153) [2023-08-18 18:52:32,337] [INFO] [engine.py:3228:_save_zero_checkpoint] zero checkpoint saved output/checkpoint-84/global_step84/zero_pp_rank_14_mp_rank_00_optim_states.pt
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:36,011] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved output/checkpoint-84/global_step84/zero_pp_rank_0_mp_rank_00_optim_states.pt. [repeated 32x across cluster]
(RayTrainWorker pid=36675, ip=10.0.13.222) [2023-08-18 18:52:27,684] [INFO] [logging.py:96:log_dist] [Rank 1] Saving model checkpoint: output/checkpoint-84/global_step84/zero_pp_rank_1_mp_rank_00_model_states.pt
(RayTrainWorker pid=8867, ip=10.0.49.236) [2023-08-18 18:52:27,873] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-84/global_step84/zero_pp_rank_3_mp_rank_00_optim_states.pt... [repeated 30x across cluster]
(RayTrainWorker pid=36311, ip=10.0.27.53) [2023-08-18 18:52:36,193] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step84 is ready now!
(RayTrainWorker pid=8885, ip=10.0.47.209) 
(RayTrainWorker pid=8885, ip=10.0.47.209) 
(RayTrainWorker pid=8885, ip=10.0.47.209) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=8885, ip=10.0.47.209) 
(RayTrainWorker pid=8885, ip=10.0.47.209) 
(RayTrainWorker pid=8867, ip=10.0.49.236) /home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. [repeated 15x across cluster]
(RayTrainWorker pid=8867, ip=10.0.49.236)   warnings.warn( [repeated 15x across cluster]
2023-08-18 18:53:44,782	WARNING syncer.py:853 -- Ray AIR no longer supports the synchronization of checkpoints and other artifacts from worker nodes to the head node. This means that the checkpoints and artifacts saved by trials scheduled on worker nodes will not be accessible during the run (e.g., resuming from a checkpoint after a failure) or after the run (e.g., loading the checkpoint of a trial that ran on an already terminated worker node).

To fix this issue, configure AIR to use either:
(1) Cloud storage: `RunConfig(storage_path='s3://your/bucket')`
(2) A network filesystem mounted on all nodes: `RunConfig(storage_path='/mnt/path/to/nfs_storage')`
See this Github issue for more details on transitioning to cloud storage/NFS as well as an explanation on why this functionality is being removed: https://github.com/ray-project/ray/issues/37177
If you are already using NFS, you can ignore this warning message.

Other temporary workarounds:
- If you want to avoid errors/warnings and continue running with syncing explicitly turned off, set `RunConfig(SyncConfig(syncer=None))`
- Or, to re-enable the head node syncing behavior, set the environment variable RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1
  - **Note that this functionality will tentatively be hard-deprecated in Ray 2.7.** See the linked issue for the latest information.
(RayTrainWorker pid=36262, ip=10.0.52.191) {'train_runtime': 2355.3551, 'train_samples_per_second': 4.565, 'train_steps_per_second': 0.036, 'train_loss': 0.32820896875290645, 'epoch': 1.0}
(RayTrainWorker pid=8911, ip=10.0.60.59) [2023-08-18 18:52:36,012] [INFO] [engine.py:3228:_save_zero_checkpoint] zero checkpoint saved output/checkpoint-84/global_step84/zero_pp_rank_0_mp_rank_00_optim_states.pt [repeated 15x across cluster]
(RayTrainWorker pid=8875, ip=10.0.0.80) [2023-08-18 18:52:36,193] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step84 is ready now! [repeated 15x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59)  [repeated 60x across cluster]
(RayTrainWorker pid=8911, ip=10.0.60.59) Training completed. Do not forget to share your model on huggingface.co/models =) [repeated 15x across cluster]
2023-08-18 18:54:02,594	INFO tune.py:1146 -- Total run time: 2691.03 seconds (2676.82 seconds for the tuning loop).

Use the returned Result object to access metrics and the Ray Train Checkpoint associated with the last iteration.

checkpoint = results.checkpoint
checkpoint
Checkpoint(filesystem=<pyarrow._s3fs.S3FileSystem object at 0x7f8c59d311b0>, path=anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/yunxuan__xiao/gptj-deepspeed-finetune/TorchTrainer_2023-08-18_18-09-11/TorchTrainer_01ea5_00000_0_2023-08-18_18-09-12/checkpoint_000000)
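Besides the checkpoint, the Result object also carries the metrics reported during training. The following is a minimal sketch, assuming `results` is the object returned by `trainer.fit()` earlier; the available metric keys depend on what the training loop reported.

# A minimal sketch, assuming `results` is the ray.train Result returned
# by trainer.fit(). Metric keys depend on what the training loop reported.
final_metrics = results.metrics
print(final_metrics.get("train_loss"))  # e.g. ~0.33 in the run above
print(results.path)  # storage location of this run's artifacts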

Generate text from a prompt#

First, download the persisted Ray Train checkpoint to local storage, then load the fine-tuned model weights and tokenizer from it. Finally, use a 🤗 Transformers pipeline to generate predictions from the fine-tuned model.

Tip

For large-scale batch inference, see End-to-end: Offline Batch Inference.

import os

# Download the persisted checkpoint from cloud storage to local disk.
os.system(f"aws s3 sync s3://{checkpoint.path} /mnt/local_storage/")
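As an alternative sketch, the Checkpoint object can materialize its contents to a local directory directly, without shelling out to the AWS CLI; this assumes the `Checkpoint.to_directory()` API is available in this Ray version.

# Hedged alternative to the AWS CLI call above: assumes
# Checkpoint.to_directory() is available in this Ray version.
local_dir = checkpoint.to_directory("/mnt/local_storage/checkpoint")
print(local_dir)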

Set the task to "text-generation", and set device_map="auto" so that the model is automatically placed on the right device.

import torch
from transformers import pipeline, AutoTokenizer, GPTJForCausalLM

# Load the fine-tuned weights and tokenizer from the downloaded checkpoint.
model = GPTJForCausalLM.from_pretrained("/mnt/local_storage/checkpoint")
tokenizer = AutoTokenizer.from_pretrained("/mnt/local_storage/checkpoint")

pipe = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    torch_dtype=torch.float16,  # match the fp16 training precision
    device_map="auto",  # place the model on the available GPU(s)
)
# Generate from prompts!
for sentence in pipe(
    ["Romeo and Juliet", "Romeo", "Juliet"], do_sample=True, min_length=20
):
    print(sentence)
[{'generated_text': 'Romeo and Juliet. This very night shall they come. A word with you, sir.'}]
[{'generated_text': 'Romeo! I know thee not. Lord Mercutio, is it you! Signior Montague.'}]
[{'generated_text': 'Juliet, look up in the vault, and there shalt find a grave; within the monument there is a table:'}]
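The same pipeline accepts the standard 🤗 Transformers generation keyword arguments, so sampling can be tuned per call. Below is a minimal sketch; temperature and max_new_tokens are standard generate() parameters and are not part of the original example.

# A sketch of tuning generation. temperature and max_new_tokens are
# standard 🤗 Transformers generate() kwargs, not from the original example.
for sentence in pipe(
    "Romeo, wherefore art thou",
    do_sample=True,
    temperature=0.8,
    max_new_tokens=40,
):
    print(sentence["generated_text"])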