In this example, we demonstrate how to do full fine-tuning of the vicuna-13b-v1.3 model using Ray Train's PyTorch Lightning integration with the DeepSpeed ZeRO-3 strategy.

DeepSpeed is an open-source deep learning optimization library for PyTorch. It is designed to reduce compute and memory usage and to train large distributed models by leveraging state-of-the-art innovations such as ZeRO, 3D parallelism, DeepSpeed-MoE, and ZeRO-Infinity.

  • PyTorch Lightning offers a DeepSpeed integration, which provides a simple interface for configuring DeepSpeed's options and automatically runs your training with the DeepSpeed Engine.

  • Ray TorchTrainer lets you easily scale your PyTorch Lightning job across multiple nodes in a Ray cluster, without worrying about the underlying cluster management, autoscaling, and distributed process group setup.

  • Our demo aims to illustrate how these three tools can be combined effectively to fine-tune the Vicuna-13B model, leveraging the strengths of each to create an efficient and high-performance deep learning solution.

Note

This is an advanced example of fine-tuning large language models with Ray Train. If you are a beginner or new to the concepts of Ray Train and our Lightning integrations, it is recommended to start with the following introductory documentation to build a foundational understanding.

Ray Train Key Concepts

Compute instances#

In this example, we set up a Ray cluster on AWS with the following configuration:

               Num    Instance type    GPUs per node    GPU memory    CPU memory
Head node        1    g5.16xlarge      1 x A10G         24 GB         256 GB
Worker node     15    g5.4xlarge       1 x A10G         24 GB         64 GB

In this example, we use 16 A10G GPUs for model training and tune the DeepSpeed settings for this configuration. If you have a different cluster setup or GPUs with lower memory capacity, you may need to modify the DeepSpeed configuration and batch size to fit the model into the GPUs.
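
For instance, here is a minimal sketch of the kind of adjustments you might make on GPUs with less memory. The keys follow the standard DeepSpeed ZeRO JSON schema (offload_param is a stock ZeRO-3 option), but the exact values are only a starting point and are not part of this example's tested configuration.

BATCH_SIZE_PER_WORKER = 4  # e.g., halve the per-worker batch size used later in this example

# Hypothetical DeepSpeed overrides for smaller GPUs; tune for your own cluster.
deepspeed_overrides = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Additionally offload parameters to CPU if the sharded model still
        # does not fit in GPU memory (slower, but saves GPU memory).
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}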

Tip

We chose a GPU instance with additional CPU memory for the head node in order to demonstrate single-node offline inference. If you are only doing training, you can still use a g5.4xlarge instance for the head node.

Cloud storage#

Additionally, since the checkpoints of this 13B-parameter model can be large (~140 GB), we choose to store them in AWS S3. Thanks to the distributed checkpointing feature newly introduced in Ray 2.5, each worker can upload its own shard to the S3 bucket individually, greatly reducing the latency and network traffic of checkpoint synchronization.
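
As a quick preview (the full TorchTrainer setup later in this example does the same thing), storing checkpoints in S3 only requires pointing RunConfig.storage_path at a bucket. The bucket name below is a placeholder, not the one used in this example.

from ray.train import CheckpointConfig, RunConfig

run_config = RunConfig(
    name="vicuna-13b-finetune",
    storage_path="s3://your-bucket/experiment-results",  # placeholder S3 path
    checkpoint_config=CheckpointConfig(num_to_keep=1),
)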

Local storage#

To demonstrate offline inference, we need to download and consolidate the model checkpoint onto the head node. This operation requires about 200 GB of disk space. Therefore, we mounted the NVMe SSD provided by the g5 instances (located at /dev/nvme1n1) to /mnt/local_storage and will save the checkpoints in this folder.

For more details, see Amazon EBS and NVMe on Linux instances.

Set up the Ray environment#

We define a runtime environment to ensure that the Ray workers have access to all the necessary packages. You can omit the runtime_env argument if you have already included these dependencies in your Docker image or installed them on each node.

Note that the codebases of transformers, accelerate, and deepspeed are all rapidly changing, so we have pinned the package versions here to ensure testing stability. You can try other version combinations and feel free to report any issues you encounter.

import ray

NUM_WORKERS = 16
BATCH_SIZE_PER_WORKER = 8
MODEL_NAME = "lmsys/vicuna-13b-v1.3"

ray.init(
    runtime_env={
        "pip": [
            "datasets==2.13.1",
            "torch>=1.13.0",
            "deepspeed==0.12.3",
            "accelerate==0.20.3",
            "transformers==4.30.2",
            "lightning==2.0.3",
        ],
    }
)

Load and preprocess datasets#

We are impressed by LLMs' capabilities in zero-shot text generation, but some LLMs may not perform well on code generation due to a lack of code in their training corpora. The CMU CoNaLa (Code/Natural Language Challenge) was designed to test systems for generating program snippets from natural language. Each data record contains an intent sentence and a one-line code snippet. The goal is to fine-tune the Vicuna model on this dataset so that it generates correct and runnable code snippets that fulfill the natural-language intent. Here are some examples:

Intent                                                     Code snippet
"convert a list of integers into a single integer"         r = int(''.join(map(str, x)))
"normalize a pandas dataframe `df` by row"                  df.div(df.sum(axis=1), axis=0)
"Convert string '03:55' into a datetime.time object"        datetime.datetime.strptime('03:55', '%H:%M').time()

The CoNaLa team has released a dataset crawled from Stack Overflow, automatically filtered, then curated by annotators, and split into 2,379 training and 500 test examples. In addition, they also included an automatically-mined dataset with 600k examples. In this demo, we take all the curated data and the first 5,000 mined samples for fine-tuning.

Here we preprocess the CoNaLa dataset with Ray Data. You can also use HuggingFace Datasets and pass them directly to LightningConfigBuilder.fit_params().

import re
import ray
import json
from transformers import AutoTokenizer
from datasets import concatenate_datasets, load_dataset

# Combine the curated dataset and automatically-mined dataset
hf_dataset_curated = load_dataset("neulab/conala")
hf_dataset_mined = load_dataset("neulab/conala", "mined", split="train[:5000]")
hf_dataset_merged = concatenate_datasets(
    [hf_dataset_curated["train"], hf_dataset_mined]
)
print(hf_dataset_merged)

# Convert it into Ray Dataset
ray_ds = ray.data.from_huggingface(hf_dataset_merged)

# Build a prompt template for Vicuna-13b model
PROMPT_TEMPLATE = "Intent: {intent}\nOne-line code snippet: {snippet}"


def fill_prompt(batch):
    batch["input_sentence"] = batch.apply(
        lambda row: PROMPT_TEMPLATE.format(
            intent=row["rewritten_intent"]
            if row["rewritten_intent"]
            else row["intent"],
            snippet=f"`{row['snippet']}`",
        )
        + "</s>",
        axis=1,
    )
    return batch[["input_sentence"]]


# Tokenize input sentences to tensors
def tokenize(batch):
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_NAME, padding_side="left", use_fast=False
    )
    tokenizer.pad_token = tokenizer.eos_token
    ret = tokenizer(
        list(batch["input_sentence"]),
        truncation=True,
        max_length=128,
        padding="max_length",
        return_tensors="np",
    )
    ret["labels"] = ret["input_ids"].copy()
    return dict(ret)

# Preprocess train dataset
processed_ds = ray_ds.map_batches(fill_prompt, batch_format="pandas").map_batches(tokenize, batch_format="pandas")
Dataset({
    features: ['question_id', 'intent', 'rewritten_intent', 'snippet', 'parent_answer_post_id', 'prob', 'id'],
    num_rows: 7379
})

Define a Lightning Module#

Here we load the pre-trained model weights from the HuggingFace Model Hub and wrap them into pl.LightningModule. We adopted the efficient model initialization techniques introduced in Lightning-transformers to avoid unnecessary full weight loading.

import torch
import transformers
import lightning.pytorch as pl
from transformers import AutoTokenizer, AutoModelForCausalLM
from deepspeed.ops.adam import DeepSpeedCPUAdam


class ZeRO3Config:
    def __init__(self, pl_module):
        self.config = pl_module.trainer.strategy.config

    def __call__(self, *args, **kwargs):
        return self

    def is_zero3(self) -> bool:
        return True


def enable_transformers_pretrained_deepspeed_sharding(
    pl_module: "pl.LightningModule",
) -> None:
    transformers.deepspeed._hf_deepspeed_config_weak_ref = ZeRO3Config(pl_module)


class Vicuna13BModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Enable tf32 for better performance
        torch.backends.cuda.matmul.allow_tf32 = True

    def setup(self, stage) -> None:
        # Defer model initialization to inject deepspeed configs to HF.
        # During initialization, HF transformers can immediately partition
        # the model across all GPUs, avoiding the overhead in time and memory
        # of copying it to CPU or to each GPU first.
        enable_transformers_pretrained_deepspeed_sharding(self)
        self.model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
        if self.global_rank == 0:
            print("DeepSpeed Configs: ", self.trainer.strategy.config)
            print("Model Archetecture: ", self.model)

    def forward(self, batch):
        outputs = self.model(
            batch["input_ids"],
            labels=batch["labels"],
            attention_mask=batch["attention_mask"],
        )
        return outputs.loss

    def training_step(self, batch, batch_idx):
        loss = self.forward(batch)
        self.log("train_loss", loss, prog_bar=True, on_step=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return DeepSpeedCPUAdam(self.parameters(), lr=2e-5, weight_decay=0.01)
[2023-06-30 17:39:35,109] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed configurations#

Before training, let's calculate the memory usage of fine-tuning a vicuna-13b model. Assume we use FP16 mixed-precision training and an Adam optimizer with FP32 states.

  • Model parameters: 13 (billion parameters) * 2 bytes (FP16) ≈ 26 GB

  • Optimizer states: 13 (billion parameters) * 2 (momentums per parameter) * 4 bytes (FP32) ≈ 104 GB

As we can see, the model parameters themselves already take 26 GB, which cannot fit in a single A10G GPU, let alone the activations and optimizer states. Here, we use ZeRO Stage 3 to partition the model, gradients, and optimizer states across the 16 workers. In addition, we employ optimizer CPU offloading to reduce GRAM usage and increase throughput with larger batch sizes. We also disabled parameter offloading and activation checkpointing to improve training speed.
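
For intuition, here is a rough, self-contained version of this arithmetic in plain Python. It is illustrative only and not part of the training script; the 16-way split simply mirrors the ZeRO-3 sharding across our 16 workers.

# Back-of-the-envelope memory estimate for full fine-tuning of a 13B model.
NUM_PARAMS = 13e9
GB = 1e9

params_fp16 = NUM_PARAMS * 2 / GB       # FP16 weights           ~26 GB
grads_fp16 = NUM_PARAMS * 2 / GB        # FP16 gradients         ~26 GB
optim_fp32 = NUM_PARAMS * 2 * 4 / GB    # Adam momentum+variance ~104 GB

num_shards = 16  # ZeRO-3 partitions these states across all workers
print(f"Parameter shard per GPU: {params_fp16 / num_shards:.1f} GB")
print(f"Gradient shard per GPU: {grads_fp16 / num_shards:.1f} GB")
print(f"Optimizer states per worker (offloaded to CPU): {optim_fp32 / num_shards:.1f} GB")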

As for other options such as reduce_bucket_size, stage3_prefetch_bucket_size, and stage3_param_persistence_threshold, we kept the default values from HuggingFace. Feel free to adjust them further to speed up the training process.

from transformers import AutoConfig

config = AutoConfig.from_pretrained(MODEL_NAME)
HIDDEN_SIZE = config.hidden_size

deepspeed_configs = {
    "zero_allow_untested_optimizer": True,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": HIDDEN_SIZE * HIDDEN_SIZE,
        "stage3_prefetch_bucket_size": 0.9 * HIDDEN_SIZE * HIDDEN_SIZE,
        "stage3_param_persistence_threshold": 10 * HIDDEN_SIZE,
    },
}

Define your training function#

Finally, define the training function that will be launched on multiple workers. The training function is generally the same as pure PyTorch Lightning training code, with additional Ray Train utilities:

  • RayDeepSpeedStrategy: Same argument list as Lightning DeepSpeedStrategy, but integrated with Ray Train.

import ray.train
from ray.train import CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    prepare_trainer,
    RayDeepSpeedStrategy, 
    RayLightningEnvironment, 
    RayTrainReportCallback
)


def train_func(config):
    """Training function for each worker."""

    # Unpack the `train_loop_config`
    max_epochs = config["max_epochs"]
    batch_size = config["batch_size"]
    accumulate_grad_batches = config["accumulate_grad_batches"]

    model = Vicuna13BModel()
    
    # Prepare Ray Data Ingestion
    train_ds = ray.train.get_dataset_shard("train")
    train_dataloader = train_ds.iter_torch_batches(batch_size=batch_size)
    
    pl_trainer = pl.Trainer(
        devices="auto",
        accelerator="auto",
        strategy=RayDeepSpeedStrategy(config=deepspeed_configs),
        plugins=[RayLightningEnvironment()],
        callbacks=[RayTrainReportCallback()],
        enable_checkpointing=False, # RayTrainReportCallback will save the checkpoints
        max_epochs=max_epochs,
        precision="bf16-mixed",
        accumulate_grad_batches=accumulate_grad_batches,
    )
    pl_trainer = prepare_trainer(pl_trainer)

    pl_trainer.fit(model, train_dataloaders=train_dataloader)
    

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={
        "max_epochs": 1,
        "batch_size": BATCH_SIZE_PER_WORKER,
        "accumulate_grad_batches": 2
    },
    run_config=RunConfig(
        name="vicuna-13b-finetune",
        storage_path="s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/air-release-tests",
        checkpoint_config=CheckpointConfig(num_to_keep=1),
    ),
    scaling_config=ScalingConfig(
        num_workers=NUM_WORKERS,
        use_gpu=True,
        resources_per_worker={"CPU": 15, "GPU": 1},
    ),
    datasets={"train": processed_ds},
)

Model fine-tuning#

With everything configured in the TorchTrainer, training becomes straightforward. Simply call trainer.fit(), and your workload scales out to the Ray cluster, kicking off ZeRO-3 parallel training.

result = trainer.fit()

Tune Status

Current time: 2023-06-30 18:21:59
Running for: 00:42:22.75
Memory: 10.7/249.1 GiB

System Info

Using FIFO scheduling algorithm.
Logical resource usage: 241.0/304 CPUs, 16.0/16 GPUs (0.0/16.0 accelerator_type:A10G)

Trial Status

Trial name                      status        loc                  iter    total time (s)    train_loss    epoch    step
LightningTrainer_c1544_00000    TERMINATED    10.0.55.20:134103       1           2473.94      0.523438        0      29
(pid=134103) [2023-06-30 17:39:41,637] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(LightningTrainer pid=134103) The `preprocessor` arg to Trainer is deprecated. Apply preprocessor transformations ahead of time by calling `preprocessor.transform(ds)`. Support for the preprocessor arg will be dropped in a future release.
(LightningTrainer pid=134103) Important: Ray Data requires schemas for all datasets in Ray 2.5. This means that standalone Python objects are no longer supported. In addition, the default batch format is fixed to NumPy. To revert to legacy behavior temporarily, set the environment variable RAY_DATA_STRICT_MODE=0 on all cluster processes.
(LightningTrainer pid=134103) 
(LightningTrainer pid=134103) Learn more here: https://docs.rayai.org.cn/en/master/data/faq.html#migrating-to-strict-mode
(LightningTrainer pid=134103) Starting distributed worker processes: ['134267 (10.0.55.20)', '74152 (10.0.63.141)', '75476 (10.0.51.205)', '75547 (10.0.42.158)', '74711 (10.0.45.211)', '75132 (10.0.20.140)', '74502 (10.0.60.86)', '75695 (10.0.53.69)', '74457 (10.0.47.2)', '74569 (10.0.33.23)', '74341 (10.0.29.61)', '74274 (10.0.36.152)', '74561 (10.0.35.16)', '74427 (10.0.16.236)', '74273 (10.0.54.55)', '74996 (10.0.9.249)']
(RayTrainWorker pid=134267) Setting up process group for: env:// [rank=0, world_size=16]
(LightningTrainer pid=134103) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(BatchMapper._transform_pandas)->MapBatches(BatchMapper._transform_pandas)] -> AllToAllOperator[RandomizeBlockOrder]
(LightningTrainer pid=134103) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(LightningTrainer pid=134103) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 8.86MB/s]m_pandas) pid=74329, ip=10.0.54.55) 
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 18.2MB/s]ansform_pandas) pid=74329, ip=10.0.54.55) 
Downloading (…)cial_tokens_map.json: 100%|██████████| 435/435 [00:00<00:00, 3.33MB/s]m_pandas) pid=74329, ip=10.0.54.55) 
(RayTrainWorker pid=74152, ip=10.0.63.141) [2023-06-30 17:39:54,612] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 7.86MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 7.57MB/s]
(RayTrainWorker pid=134267) GPU available: True (cuda), used: True
(RayTrainWorker pid=134267) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=134267) IPU available: False, using: 0 IPUs
(RayTrainWorker pid=134267) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=134267) `Trainer(limit_val_batches=1)` was configured so 1 batch will be used.
Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 14.9MB/s]
(RayTrainWorker pid=134267) initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/16
(RayTrainWorker pid=74273, ip=10.0.54.55) Missing logger folder: /home/ray/ray_results/vicuna-13b-relation-extraction/LightningTrainer_c1544_00000_0_2023-06-30_17-39-36/rank_all/lightning_logs
(RayTrainWorker pid=134267) [2023-06-30 17:39:55,589] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 18.2MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 435/435 [00:00<00:00, 6.49MB/s]
Downloading (…)lve/main/config.json:   0%|          | 0.00/585 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 100%|██████████| 585/585 [00:00<00:00, 7.81MB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 585/585 [00:00<00:00, 7.09MB/s]
Downloading (…)model.bin.index.json: 100%|██████████| 33.4k/33.4k [00:00<00:00, 35.1MB/s]
Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]
(RayTrainWorker pid=75547, ip=10.0.42.158) 
Downloading (…)l-00001-of-00003.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s]
Downloading (…)l-00001-of-00003.bin:   0%|          | 21.0M/9.95G [00:00<00:59, 167MB/s]
Downloading (…)l-00001-of-00003.bin:   0%|          | 41.9M/9.95G [00:00<00:58, 170MB/s] 
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 8.33MB/s] [repeated 9x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.rayai.org.cn/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 17.5MB/s] [repeated 8x across cluster]
(RayTrainWorker pid=74561, ip=10.0.35.16) initializing deepspeed distributed: GLOBAL_RANK: 12, MEMBER: 13/16 [repeated 15x across cluster]
(RayTrainWorker pid=74561, ip=10.0.35.16) Missing logger folder: /home/ray/ray_results/vicuna-13b-relation-extraction/LightningTrainer_c1544_00000_0_2023-06-30_17-39-36/rank_all/lightning_logs [repeated 15x across cluster]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 8.85MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 435/435 [00:00<00:00, 5.23MB/s] [repeated 10x across cluster]
Downloading (…)lve/main/config.json: 100%|██████████| 585/585 [00:00<00:00, 7.03MB/s] [repeated 13x across cluster]
Downloading (…)model.bin.index.json: 100%|██████████| 33.4k/33.4k [00:00<00:00, 87.9MB/s] [repeated 15x across cluster]
Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s] [repeated 15x across cluster]
(RayTrainWorker pid=74341, ip=10.0.29.61)  [repeated 650x across cluster]
Downloading (…)l-00001-of-00003.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s] [repeated 15x across cluster]
Downloading (…)l-00001-of-00003.bin:  13%|█▎        | 1.31G/9.95G [00:05<00:36, 239MB/s] [repeated 636x across cluster]
Downloading (…)l-00001-of-00003.bin:   1%|          | 105M/9.95G [00:00<00:41, 239MB/s]  [repeated 17x across cluster]
(RayTrainWorker pid=74711, ip=10.0.45.211)  [repeated 640x across cluster]
Downloading (…)l-00001-of-00003.bin:  26%|██▌       | 2.58G/9.95G [00:10<00:28, 256MB/s] [repeated 635x across cluster]
(RayTrainWorker pid=74502, ip=10.0.60.86)  [repeated 638x across cluster]
Downloading (…)l-00001-of-00003.bin:  37%|███▋      | 3.70G/9.95G [00:15<00:26, 238MB/s] [repeated 638x across cluster]
(RayTrainWorker pid=74274, ip=10.0.36.152)  [repeated 643x across cluster]
Downloading (…)l-00001-of-00003.bin:  51%|█████▏    | 5.12G/9.95G [00:20<00:18, 255MB/s] [repeated 649x across cluster]
(RayTrainWorker pid=75476, ip=10.0.51.205)  [repeated 638x across cluster]
Downloading (…)l-00001-of-00003.bin:  65%|██████▌   | 6.48G/9.95G [00:25<00:14, 246MB/s] [repeated 633x across cluster]
(RayTrainWorker pid=74457, ip=10.0.47.2)  [repeated 645x across cluster]
Downloading (…)l-00001-of-00003.bin:  76%|███████▌  | 7.52G/9.95G [00:29<00:09, 247MB/s] [repeated 644x across cluster]
Downloading (…)l-00001-of-00003.bin:  91%|█████████▏| 9.10G/9.95G [00:34<00:03, 263MB/s]
Downloading (…)l-00001-of-00003.bin:  92%|█████████▏| 9.13G/9.95G [00:34<00:03, 257MB/s]
(RayTrainWorker pid=74711, ip=10.0.45.211)  [repeated 634x across cluster]
Downloading (…)l-00001-of-00003.bin:  82%|████████▏ | 8.17G/9.95G [00:35<00:07, 228MB/s] [repeated 628x across cluster]
Downloading (…)l-00001-of-00003.bin: 100%|██████████| 9.95G/9.95G [00:37<00:00, 262MB/s]
Downloading shards:  33%|███▎      | 1/3 [00:38<01:16, 38.09s/it]
Downloading (…)l-00002-of-00003.bin:   0%|          | 0.00/9.90G [00:00<?, ?B/s]
Downloading (…)l-00002-of-00003.bin:   1%|▏         | 126M/9.90G [00:00<00:35, 273MB/s] 
Downloading (…)l-00001-of-00003.bin:  93%|█████████▎| 9.27G/9.95G [00:39<00:02, 228MB/s] [repeated 394x across cluster]
(RayTrainWorker pid=75547, ip=10.0.42.158)  [repeated 633x across cluster]
Downloading (…)l-00002-of-00003.bin:   2%|▏         | 241M/9.90G [00:01<00:38, 252MB/s] [repeated 213x across cluster]
Downloading (…)l-00001-of-00003.bin: 100%|██████████| 9.95G/9.95G [00:40<00:00, 243MB/s] [repeated 8x across cluster]
Downloading shards:  33%|███▎      | 1/3 [00:42<01:25, 42.77s/it] [repeated 15x across cluster]
Downloading (…)l-00002-of-00003.bin:   0%|          | 0.00/9.90G [00:00<?, ?B/s] [repeated 15x across cluster]
Downloading (…)l-00002-of-00003.bin:   1%|          | 115M/9.90G [00:00<00:46, 209MB/s]  [repeated 16x across cluster]
Downloading (…)l-00001-of-00003.bin: 100%|██████████| 9.95G/9.95G [00:42<00:00, 233MB/s] [repeated 50x across cluster]
(RayTrainWorker pid=74341, ip=10.0.29.61)  [repeated 636x across cluster]
Downloading (…)l-00002-of-00003.bin:  19%|█▊        | 1.86G/9.90G [00:06<00:29, 275MB/s] [repeated 589x across cluster]
(RayTrainWorker pid=74996, ip=10.0.9.249)  [repeated 649x across cluster]
Downloading (…)l-00002-of-00003.bin:  18%|█▊        | 1.75G/9.90G [00:07<00:34, 234MB/s] [repeated 643x across cluster]
(RayTrainWorker pid=74502, ip=10.0.60.86)  [repeated 645x across cluster]
Downloading (…)l-00002-of-00003.bin:  41%|████▏     | 4.09G/9.90G [00:15<00:21, 271MB/s] [repeated 644x across cluster]
(RayTrainWorker pid=74273, ip=10.0.54.55)  [repeated 652x across cluster]
Downloading (…)l-00002-of-00003.bin:  53%|█████▎    | 5.25G/9.90G [00:21<00:19, 242MB/s] [repeated 656x across cluster]
(RayTrainWorker pid=74152, ip=10.0.63.141)  [repeated 647x across cluster]
Downloading (…)l-00002-of-00003.bin:  67%|██████▋   | 6.66G/9.90G [00:25<00:13, 246MB/s] [repeated 646x across cluster]
(RayTrainWorker pid=75132, ip=10.0.20.140)  [repeated 629x across cluster]
Downloading (…)l-00002-of-00003.bin:  84%|████████▍ | 8.30G/9.90G [00:31<00:06, 234MB/s] [repeated 627x across cluster]
Downloading (…)l-00002-of-00003.bin:  91%|█████████▏| 9.06G/9.90G [00:34<00:03, 241MB/s]
(RayTrainWorker pid=74457, ip=10.0.47.2)  [repeated 627x across cluster]
Downloading (…)l-00002-of-00003.bin:  89%|████████▉ | 8.84G/9.90G [00:36<00:04, 228MB/s] [repeated 567x across cluster]
Downloading (…)l-00002-of-00003.bin: 100%|██████████| 9.90G/9.90G [00:38<00:00, 257MB/s]
Downloading shards:  67%|██████▋   | 2/3 [01:16<00:38, 38.38s/it]
Downloading (…)l-00003-of-00003.bin:   0%|          | 0.00/6.18G [00:00<?, ?B/s]
Downloading (…)l-00003-of-00003.bin:   2%|▏         | 126M/6.18G [00:00<00:22, 266MB/s] 
Downloading (…)l-00002-of-00003.bin:  98%|█████████▊| 9.69G/9.90G [00:38<00:00, 236MB/s] [repeated 310x across cluster]
(RayTrainWorker pid=75476, ip=10.0.51.205)  [repeated 629x across cluster]
Downloading (…)l-00003-of-00003.bin:   2%|▏         | 94.4M/6.18G [00:00<00:24, 247MB/s] [repeated 275x across cluster]
Downloading (…)l-00002-of-00003.bin: 100%|██████████| 9.90G/9.90G [00:39<00:00, 253MB/s] [repeated 10x across cluster]
Downloading shards:  67%|██████▋   | 2/3 [01:20<00:40, 40.01s/it] [repeated 13x across cluster]
Downloading (…)l-00003-of-00003.bin:   0%|          | 0.00/6.18G [00:00<?, ?B/s] [repeated 13x across cluster]
Downloading (…)l-00003-of-00003.bin:   2%|▏         | 126M/6.18G [00:00<00:24, 243MB/s]  [repeated 13x across cluster]
Downloading (…)l-00002-of-00003.bin: 100%|█████████▉| 9.88G/9.90G [00:41<00:00, 242MB/s] [repeated 122x across cluster]
(RayTrainWorker pid=74273, ip=10.0.54.55)  [repeated 638x across cluster]
Downloading (…)l-00003-of-00003.bin:  21%|██        | 1.31G/6.18G [00:05<00:20, 243MB/s] [repeated 569x across cluster]
Downloading (…)l-00002-of-00003.bin: 100%|██████████| 9.90G/9.90G [00:40<00:00, 242MB/s] [repeated 2x across cluster]
Downloading shards:  67%|██████▋   | 2/3 [01:23<00:41, 41.78s/it] [repeated 2x across cluster]
Downloading (…)l-00003-of-00003.bin:   0%|          | 0.00/6.18G [00:00<?, ?B/s] [repeated 2x across cluster]
Downloading (…)l-00003-of-00003.bin:   2%|▏         | 105M/6.18G [00:00<00:24, 248MB/s]  [repeated 2x across cluster]
Downloading (…)l-00002-of-00003.bin: 100%|█████████▉| 9.87G/9.90G [00:40<00:00, 260MB/s] [repeated 3x across cluster]
(RayTrainWorker pid=74274, ip=10.0.36.152)  [repeated 638x across cluster]
Downloading (…)l-00003-of-00003.bin:  41%|████▏     | 2.56G/6.18G [00:10<00:14, 256MB/s] [repeated 635x across cluster]
(RayTrainWorker pid=74152, ip=10.0.63.141)  [repeated 629x across cluster]
Downloading (…)l-00003-of-00003.bin:  62%|██████▏   | 3.84G/6.18G [00:15<00:08, 279MB/s] [repeated 627x across cluster]
Downloading (…)l-00003-of-00003.bin:  92%|█████████▏| 5.66G/6.18G [00:22<00:01, 268MB/s]
Downloading (…)l-00003-of-00003.bin:  92%|█████████▏| 5.69G/6.18G [00:22<00:01, 265MB/s]
Downloading (…)l-00003-of-00003.bin:  93%|█████████▎| 5.73G/6.18G [00:22<00:01, 268MB/s]
Downloading (…)l-00003-of-00003.bin:  93%|█████████▎| 5.76G/6.18G [00:22<00:01, 270MB/s]
(RayTrainWorker pid=75547, ip=10.0.42.158)  [repeated 644x across cluster]
Downloading (…)l-00003-of-00003.bin:  85%|████████▌ | 5.25G/6.18G [00:20<00:03, 270MB/s] [repeated 618x across cluster]
Downloading (…)l-00003-of-00003.bin: 100%|██████████| 6.18G/6.18G [00:24<00:00, 257MB/s]
Downloading shards: 100%|██████████| 3/3 [01:40<00:00, 33.61s/it]
Downloading (…)l-00003-of-00003.bin:  98%|█████████▊| 6.03G/6.18G [00:23<00:00, 269MB/s] [repeated 166x across cluster]
(RayTrainWorker pid=74274, ip=10.0.36.152)  [repeated 426x across cluster]
Downloading (…)l-00003-of-00003.bin:  86%|████████▌ | 5.30G/6.18G [00:21<00:03, 246MB/s] [repeated 222x across cluster]
Downloading (…)l-00003-of-00003.bin: 100%|██████████| 6.18G/6.18G [00:25<00:00, 239MB/s] [repeated 7x across cluster]
Downloading shards: 100%|██████████| 3/3 [01:45<00:00, 35.27s/it] [repeated 11x across cluster]
Downloading (…)l-00003-of-00003.bin:  98%|█████████▊| 6.04G/6.18G [00:25<00:00, 231MB/s] [repeated 98x across cluster]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
(RayTrainWorker pid=74274, ip=10.0.36.152)  [repeated 74x across cluster]
Downloading (…)l-00003-of-00003.bin:  91%|█████████ | 5.63G/6.18G [00:23<00:02, 242MB/s] [repeated 23x across cluster]
Downloading (…)l-00003-of-00003.bin: 100%|██████████| 6.18G/6.18G [00:24<00:00, 249MB/s]
Downloading shards: 100%|██████████| 3/3 [01:49<00:00, 36.47s/it] [repeated 4x across cluster]
Downloading (…)l-00003-of-00003.bin: 100%|██████████| 6.18G/6.18G [00:25<00:00, 241MB/s] [repeated 5x across cluster]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:12<00:24, 12.11s/it]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s] [repeated 15x across cluster]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:18<00:37, 18.54s/it] [repeated 15x across cluster]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:30<00:15, 15.63s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:30<00:15, 15.71s/it]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:35<00:17, 17.73s/it] [repeated 14x across cluster]
Loading checkpoint shards: 100%|██████████| 3/3 [00:40<00:00, 13.47s/it]
Downloading (…)neration_config.json: 100%|██████████| 132/132 [00:00<00:00, 458kB/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:45<00:00, 15.29s/it] [repeated 15x across cluster]
(RayTrainWorker pid=74996, ip=10.0.9.249) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Downloading (…)neration_config.json: 100%|██████████| 132/132 [00:00<00:00, 542kB/s] [repeated 14x across cluster]
(RayTrainWorker pid=134267) DeepSpeed Configs:  {'zero_allow_untested_optimizer': True, 'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu', 'pin_memory': True}, 'overlap_comm': True, 'contiguous_gradients': True, 'reduce_bucket_size': 26214400, 'stage3_prefetch_bucket_size': 23592960.0, 'stage3_param_persistence_threshold': 51200}, 'gradient_accumulation_steps': 2, 'train_micro_batch_size_per_gpu': 1, 'gradient_clipping': 0.0}
(RayTrainWorker pid=134267) Model Archetecture:  LlamaForCausalLM(
(RayTrainWorker pid=134267)   (model): LlamaModel(
(RayTrainWorker pid=134267)     (embed_tokens): Embedding(32000, 5120, padding_idx=0)
(RayTrainWorker pid=134267)     (layers): ModuleList(
(RayTrainWorker pid=134267)       (0-39): 40 x LlamaDecoderLayer(
(RayTrainWorker pid=134267)         (self_attn): LlamaAttention(
(RayTrainWorker pid=134267)           (q_proj): Linear(in_features=5120, out_features=5120, bias=False)
(RayTrainWorker pid=134267)           (k_proj): Linear(in_features=5120, out_features=5120, bias=False)
(RayTrainWorker pid=134267)           (v_proj): Linear(in_features=5120, out_features=5120, bias=False)
(RayTrainWorker pid=134267)           (o_proj): Linear(in_features=5120, out_features=5120, bias=False)
(RayTrainWorker pid=134267)           (rotary_emb): LlamaRotaryEmbedding()
(RayTrainWorker pid=134267)         )
(RayTrainWorker pid=134267)         (mlp): LlamaMLP(
(RayTrainWorker pid=134267)           (gate_proj): Linear(in_features=5120, out_features=13824, bias=False)
(RayTrainWorker pid=134267)           (down_proj): Linear(in_features=13824, out_features=5120, bias=False)
(RayTrainWorker pid=134267)           (up_proj): Linear(in_features=5120, out_features=13824, bias=False)
(RayTrainWorker pid=134267)           (act_fn): SiLUActivation()
(RayTrainWorker pid=134267)         )
(RayTrainWorker pid=134267)         (input_layernorm): LlamaRMSNorm()
(RayTrainWorker pid=134267)         (post_attention_layernorm): LlamaRMSNorm()
(RayTrainWorker pid=134267)       )
(RayTrainWorker pid=134267)     )
(RayTrainWorker pid=134267)     (norm): LlamaRMSNorm()
(RayTrainWorker pid=134267)   )
(RayTrainWorker pid=134267)   (lm_head): Linear(in_features=5120, out_features=32000, bias=False)
(RayTrainWorker pid=134267) )
(RayTrainWorker pid=74274, ip=10.0.36.152) [2023-06-30 17:39:54,688] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 15x across cluster]
(RayTrainWorker pid=74561, ip=10.0.35.16) [2023-06-30 17:39:56,220] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [repeated 15x across cluster]
(RayTrainWorker pid=134267) ninja: no work to do.
(RayTrainWorker pid=134267) Time to load cpu_adam op: 2.403524875640869 seconds
(RayTrainWorker pid=134267) Using /home/ray/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
(RayTrainWorker pid=134267) Detected CUDA files, patching ldflags
(RayTrainWorker pid=134267) Emitting ninja build file /home/ray/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
(RayTrainWorker pid=134267) Building extension module cpu_adam...
(RayTrainWorker pid=134267) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=134267) Loading extension module cpu_adam...
(RayTrainWorker pid=74502, ip=10.0.60.86) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] [repeated 15x across cluster]
Downloading (…)neration_config.json: 100%|██████████| 132/132 [00:00<00:00, 1.72MB/s]
(RayTrainWorker pid=74996, ip=10.0.9.249) Building extension module utils...
(RayTrainWorker pid=74152, ip=10.0.63.141) Loading extension module utils...
(RayTrainWorker pid=74152, ip=10.0.63.141) Time to load utils op: 0.0775597095489502 seconds
(RayTrainWorker pid=134267) Parameter Offload: Total persistent parameters: 414720 in 81 params
(RayTrainWorker pid=74152, ip=10.0.63.141) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=74152, ip=10.0.63.141) Using /home/ray/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... [repeated 32x across cluster]
(RayTrainWorker pid=74561, ip=10.0.35.16) Detected CUDA files, patching ldflags [repeated 15x across cluster]
(RayTrainWorker pid=134267) Emitting ninja build file /home/ray/.cache/torch_extensions/py310_cu118/utils/build.ninja... [repeated 31x across cluster]
(RayTrainWorker pid=74561, ip=10.0.35.16) Building extension module cpu_adam... [repeated 15x across cluster]
(RayTrainWorker pid=134267) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [repeated 31x across cluster]
(RayTrainWorker pid=75132, ip=10.0.20.140) Loading extension module cpu_adam... [repeated 15x across cluster]
(RayTrainWorker pid=134267) Building extension module utils... [repeated 15x across cluster]
(RayTrainWorker pid=74152, ip=10.0.63.141) Loading extension module utils... [repeated 16x across cluster]
(RayTrainWorker pid=134267) ninja: no work to do. [repeated 31x across cluster]
(RayTrainWorker pid=75132, ip=10.0.20.140) Time to load cpu_adam op: 2.3851447105407715 seconds [repeated 15x across cluster]
(RayTrainWorker pid=74152, ip=10.0.63.141) Time to load utils op: 0.0005815029144287109 seconds [repeated 16x across cluster]
(RayTrainWorker pid=134267) 
(RayTrainWorker pid=134267)   | Name  | Type             | Params | Params per Device
(RayTrainWorker pid=134267) ---------------------------------------------------------------
(RayTrainWorker pid=134267) 0 | model | LlamaForCausalLM | 13.0 B | 813 M            
(RayTrainWorker pid=134267) ---------------------------------------------------------------
(RayTrainWorker pid=134267) 13.0 B    Trainable params
(RayTrainWorker pid=134267) 0         Non-trainable params
(RayTrainWorker pid=134267) 13.0 B    Total params
(RayTrainWorker pid=134267) 52,063.457Total estimated model params size (MB)
Epoch 0:   0%|          | 0/57 [00:00<?, ?it/s]
(RayTrainWorker pid=134267) /home/ray/anaconda3/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:432: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 64 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
(RayTrainWorker pid=134267)   rank_zero_warn(
Epoch 0:   2%|▏         | 1/57 [00:38<35:42, 38.26s/it, v_num=0, train_loss=11.50]
(RayTrainWorker pid=134267) Time to load utils op: 0.00030732154846191406 seconds [repeated 15x across cluster]
(RayTrainWorker pid=134267) [2023-06-30 17:44:33,395] [WARNING] [stage3.py:1851:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:   4%|▎         | 2/57 [01:19<36:23, 39.69s/it, v_num=0, train_loss=10.70]
Epoch 0:   5%|▌         | 3/57 [01:52<33:52, 37.65s/it, v_num=0, train_loss=1.710]
(RayTrainWorker pid=134267) [2023-06-30 17:45:48,054] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:   7%|▋         | 4/57 [02:34<34:01, 38.51s/it, v_num=0, train_loss=1.610]
Epoch 0:   9%|▉         | 5/57 [03:08<32:35, 37.60s/it, v_num=0, train_loss=0.914]
(RayTrainWorker pid=134267) [2023-06-30 17:47:03,011] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  11%|█         | 6/57 [03:49<32:26, 38.17s/it, v_num=0, train_loss=0.973]
Epoch 0:  12%|█▏        | 7/57 [04:24<31:30, 37.81s/it, v_num=0, train_loss=0.801]
(RayTrainWorker pid=134267) [2023-06-30 17:48:19,362] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  14%|█▍        | 8/57 [05:05<31:10, 38.17s/it, v_num=0, train_loss=0.844]
Epoch 0:  16%|█▌        | 9/57 [05:39<30:12, 37.75s/it, v_num=0, train_loss=0.652]
(RayTrainWorker pid=134267) [2023-06-30 17:49:36,571] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  18%|█▊        | 10/57 [06:22<29:58, 38.26s/it, v_num=0, train_loss=0.633]
Epoch 0:  19%|█▉        | 11/57 [06:59<29:13, 38.12s/it, v_num=0, train_loss=0.629]
/arrow/cpp/src/arrow/filesystem/s3fs.cc:663: CompletedMultipartUpload got error embedded in a 200 OK response: InternalError ("We encountered an internal error. Please try again."), retry = 1
(RayTrainWorker pid=134267) [2023-06-30 17:50:54,177] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  21%|██        | 12/57 [07:40<28:45, 38.35s/it, v_num=0, train_loss=0.609]
Epoch 0:  23%|██▎       | 13/57 [08:14<27:53, 38.04s/it, v_num=0, train_loss=0.680]
(RayTrainWorker pid=134267) [2023-06-30 17:52:10,002] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  25%|██▍       | 14/57 [08:55<27:26, 38.29s/it, v_num=0, train_loss=0.648]
Epoch 0:  26%|██▋       | 15/57 [09:29<26:33, 37.95s/it, v_num=0, train_loss=0.645]
(RayTrainWorker pid=134267) [2023-06-30 17:53:23,209] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  28%|██▊       | 16/57 [10:09<26:01, 38.08s/it, v_num=0, train_loss=0.664]
Epoch 0:  30%|██▉       | 17/57 [10:43<25:13, 37.83s/it, v_num=0, train_loss=0.625]
(RayTrainWorker pid=134267) [2023-06-30 17:54:36,660] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  32%|███▏      | 18/57 [11:22<24:39, 37.93s/it, v_num=0, train_loss=0.617]
Epoch 0:  33%|███▎      | 19/57 [11:56<23:53, 37.71s/it, v_num=0, train_loss=0.609]
(RayTrainWorker pid=134267) [2023-06-30 17:55:51,289] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  35%|███▌      | 20/57 [12:37<23:20, 37.86s/it, v_num=0, train_loss=0.602]
Epoch 0:  37%|███▋      | 21/57 [13:11<22:36, 37.69s/it, v_num=0, train_loss=0.590]
(RayTrainWorker pid=134267) [2023-06-30 17:57:07,919] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  39%|███▊      | 22/57 [13:53<22:06, 37.91s/it, v_num=0, train_loss=0.555]
Epoch 0:  40%|████      | 23/57 [14:27<21:22, 37.72s/it, v_num=0, train_loss=0.598]
(RayTrainWorker pid=134267) [2023-06-30 17:58:22,349] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  42%|████▏     | 24/57 [15:08<20:48, 37.85s/it, v_num=0, train_loss=0.625]
Epoch 0:  44%|████▍     | 25/57 [15:43<20:07, 37.74s/it, v_num=0, train_loss=0.625]
Epoch 0:  44%|████▍     | 25/57 [15:43<20:07, 37.74s/it, v_num=0, train_loss=0.582]
(RayTrainWorker pid=134267) [2023-06-30 17:59:40,125] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  46%|████▌     | 26/57 [16:26<19:35, 37.93s/it, v_num=0, train_loss=0.535]
Epoch 0:  47%|████▋     | 27/57 [17:02<18:56, 37.88s/it, v_num=0, train_loss=0.578]
(RayTrainWorker pid=134267) [2023-06-30 18:00:58,164] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  49%|████▉     | 28/57 [17:44<18:22, 38.01s/it, v_num=0, train_loss=0.582]
Epoch 0:  51%|█████     | 29/57 [18:20<17:42, 37.93s/it, v_num=0, train_loss=0.578]
(RayTrainWorker pid=134267) [2023-06-30 18:02:15,097] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  53%|█████▎    | 30/57 [19:01<17:06, 38.04s/it, v_num=0, train_loss=0.598]
Epoch 0:  54%|█████▍    | 31/57 [19:36<16:26, 37.95s/it, v_num=0, train_loss=0.586]
(RayTrainWorker pid=134267) [2023-06-30 18:03:30,632] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  56%|█████▌    | 32/57 [20:16<15:50, 38.02s/it, v_num=0, train_loss=0.605]
Epoch 0:  58%|█████▊    | 33/57 [20:49<15:08, 37.87s/it, v_num=0, train_loss=0.594]
(RayTrainWorker pid=134267) [2023-06-30 18:04:45,362] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  60%|█████▉    | 34/57 [21:31<14:33, 37.98s/it, v_num=0, train_loss=0.598]
Epoch 0:  61%|██████▏   | 35/57 [22:08<13:54, 37.95s/it, v_num=0, train_loss=0.574]
(RayTrainWorker pid=134267) [2023-06-30 18:06:02,727] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  63%|██████▎   | 36/57 [22:48<13:18, 38.02s/it, v_num=0, train_loss=0.586]
Epoch 0:  65%|██████▍   | 37/57 [23:23<12:38, 37.94s/it, v_num=0, train_loss=0.562]
(RayTrainWorker pid=134267) [2023-06-30 18:07:19,126] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  67%|██████▋   | 38/57 [24:05<12:02, 38.03s/it, v_num=0, train_loss=0.535]
Epoch 0:  68%|██████▊   | 39/57 [24:38<11:22, 37.91s/it, v_num=0, train_loss=0.598]
(RayTrainWorker pid=134267) [2023-06-30 18:08:36,683] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  70%|███████   | 40/57 [25:22<10:47, 38.07s/it, v_num=0, train_loss=0.562]
Epoch 0:  72%|███████▏  | 41/57 [25:57<10:07, 37.98s/it, v_num=0, train_loss=0.555]
(RayTrainWorker pid=134267) [2023-06-30 18:09:52,426] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  74%|███████▎  | 42/57 [26:38<09:30, 38.06s/it, v_num=0, train_loss=0.555]
Epoch 0:  75%|███████▌  | 43/57 [27:13<08:51, 37.99s/it, v_num=0, train_loss=0.547]
(RayTrainWorker pid=134267) [2023-06-30 18:11:08,855] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  77%|███████▋  | 44/57 [27:54<08:14, 38.06s/it, v_num=0, train_loss=0.562]
Epoch 0:  79%|███████▉  | 45/57 [28:29<07:35, 37.98s/it, v_num=0, train_loss=0.535]
(RayTrainWorker pid=134267) [2023-06-30 18:12:25,181] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  81%|████████  | 46/57 [29:11<06:58, 38.07s/it, v_num=0, train_loss=0.531]
Epoch 0:  82%|████████▏ | 47/57 [29:45<06:19, 37.99s/it, v_num=0, train_loss=0.504]
(RayTrainWorker pid=134267) [2023-06-30 18:13:40,300] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  84%|████████▍ | 48/57 [30:26<05:42, 38.05s/it, v_num=0, train_loss=0.520]
Epoch 0:  86%|████████▌ | 49/57 [31:01<05:03, 37.99s/it, v_num=0, train_loss=0.523]
(RayTrainWorker pid=134267) [2023-06-30 18:14:55,542] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  88%|████████▊ | 50/57 [31:41<04:26, 38.03s/it, v_num=0, train_loss=0.520]
Epoch 0:  89%|████████▉ | 51/57 [32:16<03:47, 37.98s/it, v_num=0, train_loss=0.527]
(RayTrainWorker pid=134267) [2023-06-30 18:16:12,131] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  91%|█████████ | 52/57 [32:58<03:10, 38.04s/it, v_num=0, train_loss=0.562]
Epoch 0:  93%|█████████▎| 53/57 [33:34<02:32, 38.00s/it, v_num=0, train_loss=0.539]
(RayTrainWorker pid=134267) [2023-06-30 18:17:29,752] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  95%|█████████▍| 54/57 [34:15<01:54, 38.07s/it, v_num=0, train_loss=0.535]
Epoch 0:  96%|█████████▋| 55/57 [34:50<01:16, 38.01s/it, v_num=0, train_loss=0.512]
(RayTrainWorker pid=134267) [2023-06-30 18:18:45,986] [WARNING] [stage3.py:1851:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0:  98%|█████████▊| 56/57 [35:31<00:38, 38.07s/it, v_num=0, train_loss=0.516]
Epoch 0: 100%|██████████| 57/57 [36:06<00:00, 38.00s/it, v_num=0, train_loss=0.461]
(RayTrainWorker pid=134267) [2023-06-30 18:20:01,817] [WARNING] [stage3.py:1851:step] 3 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
Epoch 0: : 58it [36:47, 38.07s/it, v_num=0, train_loss=0.523]                      
(RayTrainWorker pid=74427, ip=10.0.16.236) /home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.ac.cn/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
(RayTrainWorker pid=74427, ip=10.0.16.236)   warnings.warn(
(RayTrainWorker pid=134267) No modifications detected for re-loaded extension module utils, skipping build step... [repeated 15x across cluster]
(RayTrainWorker pid=134267) Using /home/ray/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... [repeated 15x across cluster]
(RayTrainWorker pid=134267) Loading extension module utils... [repeated 15x across cluster]
(RayTrainWorker pid=134267) Uploading checkpoint files from worker rank 0 to cloud URI s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/yunxuanx-test/vicuna-13b-test/vicuna-13b-relation-extraction/LightningTrainer_c1544_00000_0_2023-06-30_17-39-36/checkpoint_000000.
(RayTrainWorker pid=134267) /home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.ac.cn/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. [repeated 15x across cluster]
(RayTrainWorker pid=134267)   warnings.warn( [repeated 15x across cluster]
(RayTrainWorker pid=75547, ip=10.0.42.158) Uploading checkpoint files from worker rank 3 to cloud URI s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/yunxuanx-test/vicuna-13b-test/vicuna-13b-relation-extraction/LightningTrainer_c1544_00000_0_2023-06-30_17-39-36/checkpoint_000000.
(RayTrainWorker pid=74152, ip=10.0.63.141) Uploading checkpoint files from worker rank 1 to cloud URI s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/yunxuanx-test/vicuna-13b-test/vicuna-13b-relation-extraction/LightningTrainer_c1544_00000_0_2023-06-30_17-39-36/checkpoint_000000.
(RayTrainWorker pid=134267) Done uploading checkpoint files.
(RayTrainWorker pid=74341, ip=10.0.29.61) Uploading checkpoint files from worker rank 10 to cloud URI s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/yunxuanx-test/vicuna-13b-test/vicuna-13b-relation-extraction/LightningTrainer_c1544_00000_0_2023-06-30_17-39-36/checkpoint_000000. [repeated 13x across cluster]
(RayTrainWorker pid=74427, ip=10.0.16.236) Done uploading checkpoint files.
(RayTrainWorker pid=74152, ip=10.0.63.141) Done uploading checkpoint files.
(RayTrainWorker pid=74711, ip=10.0.45.211) Done uploading checkpoint files. [repeated 11x across cluster]
Epoch 0: : 58it [37:42, 39.00s/it, v_num=0, train_loss=0.523]
(RayTrainWorker pid=134267) `Trainer.fit` stopped: `max_epochs=1` reached.
(LightningTrainer pid=134103) Uploading trial artifacts took 26.651 s, which may be a performance bottleneck. Consider saving fewer/smaller artifacts to the trial log directory, or disable artifact syncing with `SyncConfig(sync_artifacts=False)`.
(RayTrainWorker pid=75547, ip=10.0.42.158) Done uploading checkpoint files. [repeated 2x across cluster]
2023-06-30 18:21:59,316	INFO tune.py:1148 -- Total run time: 2542.82 seconds (2511.95 seconds for the tuning loop).

  • Training took 36:06 = 2166 s.

  • Training + initialization + checkpointing took 2473 s in total.

  • Model initialization and checkpoint synchronization took 307 s. This overhead gets amortized as you train on larger datasets for longer.

result
Result(
  metrics={'_report_on': 'train_epoch_end', 'train_loss': 0.5234375, 'epoch': 0, 'step': 29, 'should_checkpoint': True, 'done': True, 'trial_id': 'c1544_00000', 'experiment_tag': '0'},
  path='s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/yunxuanx-test/vicuna-13b-test/vicuna-13b-relation-extraction/LightningTrainer_c1544_00000_0_2023-06-30_17-39-36',
  checkpoint=LightningCheckpoint(uri=s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/yunxuanx-test/vicuna-13b-test/vicuna-13b-relation-extraction/LightningTrainer_c1544_00000_0_2023-06-30_17-39-36/checkpoint_000000)
)

LLM inference#

Now it's time to play with our fine-tuned Vicuna code generator!

Download and process your checkpoint#

First, download the checkpoints to your local machine using the AWS CLI.

Note that adding the following configurations can significantly improve the sync throughput compared to the default configuration. On a g5 instance with an NVMe SSD, the download speed increased from 200MB/s to roughly 1.5GB/s.

A DeepSpeed ZeRO-3 checkpoint is a directory containing k shards (k=16 in our case).

!aws configure set s3.max_concurrent_requests 32
!aws configure set default.s3.preferred_transfer_client crt
!aws configure set default.s3.target_bandwidth 100Gb/s
!aws configure set default.s3.multipart_chunksize 8MB
import os

os.system(f"aws s3 sync s3://{result.checkpoint.path} /mnt/local_storage")

  • zero_pp_rank_k_mp_rank_00_model_states.pt: contains the model parameter skeleton of shard k.

  • bf16_zero_pp_rank_k_mp_rank_00_optim_states.pt: contains the actual flattened model parameters and optimizer states of shard k.

Next, we remove the optimizer states and consolidate the checkpoint into a single binary file using DeepSpeed utilities. In addition, since we wrapped vicuna-13b within a LightningModule, we need to remove the prefix _forward_module.model.model so that we can directly load the checkpoint into a HF vicuna model.

import os
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

def extract_fp32_ckpt_from_zero(zero_ckpt_dir):
    state_dict = get_fp32_state_dict_from_zero_checkpoint(zero_ckpt_dir)
    vicuna_state_dict = {
        k.replace("_forward_module.model.", ""): v for k, v in state_dict.items()
    }
    torch.save(vicuna_state_dict, os.path.join(zero_ckpt_dir, "full_model.pt"))


full_model_ckpt_path = "/mnt/local_storage/checkpoint.ckpt/full_model.pt"
extract_fp32_ckpt_from_zero("/mnt/local_storage/checkpoint.ckpt")
Processing zero checkpoint '/mnt/local_storage/checkpoint/model/checkpoint'
Detected checkpoint of type zero stage 3, world_size: 16
Parsing checkpoint created by deepspeed==0.9.4
Reconstructed Trainable fp32 state dict with 363 params 13015864320 elements

Initialize Generation Pipeline#

Here, we leverage the Accelerate library to efficiently load the model onto suitable devices (GPUs and CPU) and generate an HF text-generation pipeline:

  • Initialize an empty model on the meta device

  • Create a valid device map for the vicuna-13b model

  • Load and distribute the model weights to the target devices

This ensures that only 1x model size of RAM is used during model initialization.

import torch
import ray
import lightning.pytorch as pl
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
from accelerate import (
    init_empty_weights,
    infer_auto_device_map,
    load_checkpoint_and_dispatch,
)

# Initialize a model on meta device
with init_empty_weights():
    config = AutoConfig.from_pretrained(MODEL_NAME)
    meta_model = AutoModelForCausalLM.from_config(config)
meta_model.tie_weights()

# Define the device mapping
device_map = infer_auto_device_map(
    meta_model,
    max_memory={0: "15GB", "cpu": "60GB"},
    no_split_module_classes=["LlamaDecoderLayer"],
)

# Load the model parameters
model = load_checkpoint_and_dispatch(
    meta_model,
    checkpoint=full_model_ckpt_path,
    device_map=device_map,
)
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model=model,
    device_map=device_map,
    tokenizer=AutoTokenizer.from_pretrained(
        MODEL_NAME, padding_side="left", use_fast=False
    ),
)

Case study#

We selected 3 examples from CoNaLa's test split for demonstration.

First, let's take a look at the outputs generated by models without fine-tuning. In this case study, we used Aviary Explorer, an open-source multi-LLM serving platform powered by Ray and Anyscale. You can easily select from a variety of open-source LLMs and compare their generation quality, cost, latency, and many other metrics.

testcases = [
    {
        "intent": "replace white spaces in colunm 'col' of dataframe `df` with '_'",
    },
    {
        "intent": "search for occurrences of regex pattern '>.*<' in xml string `line`",
    },
    {
        "intent": "send a signal `signal.SIGUSR1` to the current process",
    },
]

We constructed a prompt in a zero-shot learning manner and fed it to the 3 OSS LLMs:

  • vicuna-13b-v1.3 starts talking in Chinese.

  • mpt-7b-chat generates a reasonable code snippet, but it spans multiple lines.

  • falcon-7b-sft generates a one-line snippet, but it doesn't seem to work.

As we can see, none of them generated a satisfactory code snippet.

Now let's check the performance of our fine-tuned vicuna-13b-v1.3 model:

for case in testcases:
    prompt = PROMPT_TEMPLATE.format(intent=case["intent"], snippet="")
    output = generator(prompt, max_new_tokens=30, do_sample=True)
    print(output[0]["generated_text"])
/home/ray/anaconda3/lib/python3.10/site-packages/transformers/pipelines/base.py:1081: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
  warnings.warn(
Intent: replace white spaces in colunm 'col' of dataframe `df` with '_'
One-line code snippet:  `df['col'] = df['col'].str.replace(' ', '_')`

Intent: search for occurrences of regex pattern '>.*<' in xml string `line`
One-line code snippet:  `re.findall('>.*<', line)``

Intent: send a signal `signal.SIGUSR1` to the current process
One-line code snippet:  `os.kill(os.getpid(), signal.SIGUSR1)``

Test the generated code snippets#

The generated code snippets look pretty reasonable. The results cover Pandas operations, regular expressions, and Linux commands. Let's test them one by one.

import pandas as pd

df = pd.DataFrame.from_dict({"col": ["abc def ghi", " 12 3 456", "     "]})
print("Before\n", df)

df["col"] = df["col"].str.replace(" ", "_")
print("After\n", df)
Before
            col
0  abc def ghi
1     12 3 456
2             
After
            col
0  abc_def_ghi
1    _12_3_456
2        _____
import re

line = """
<bookstore>
  <book category="fiction">
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
  </book>
  <book category="non-fiction">
    <title>Sapiens: A Brief History of Humankind</title>
    <author>Yuval Noah Harari</author>
    <year>2011</year>
  </book>
</bookstore>
"""
re.findall(">.*<", line)
['>The Great Gatsby<',
 '>F. Scott Fitzgerald<',
 '>1925<',
 '>Sapiens: A Brief History of Humankind<',
 '>Yuval Noah Harari<',
 '>2011<']

Finally, let's hand it over to our LLM to wrap up this demo:

import os, signal

os.kill(os.getpid(), signal.SIGUSR1)  # Terminate the current process~

References:#

CoNaLa: The Code/Natural Language Challenge