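Distributed training of an XGBoost model#

This tutorial runs a distributed training workload that connects the following steps, which have heterogeneous compute requirements:

Preprocess the dataset with Ray Data
Train an XGBoost model in a distributed fashion with Ray Train
Save the model artifacts to a model registry with MLflow

Note: This tutorial doesn't cover model tuning. For experiment execution and hyperparameter tuning, see Ray Tune.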

https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/distributed_training.png

Dependencies#

To install the dependencies, run the following command:

pip install -r requirements.txt

Setup#

Import the necessary modules:

%load_ext autoreload
%autoreload all

# Enable importing from dist_xgboost module.
import os
import sys

sys.path.append(os.path.abspath(".."))

# Enable Ray Train v2. This is the default in an upcoming release.
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"
# Now it's safe to import from ray.train

import ray

from dist_xgboost.constants import storage_path, preprocessor_path

# Make Ray data less verbose.
ray.data.DataContext.get_current().enable_progress_bars = False
ray.data.DataContext.get_current().print_on_execution_start = False

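Dataset preparation#

This example uses the Breast Cancer Wisconsin (Diagnostic) dataset, which contains features computed from digitized images of breast mass cell nuclei.

Split the data into:

70% for training
15% for validation
15% for testing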

from ray.data import Dataset


def prepare_data() -> tuple[Dataset, Dataset, Dataset]:
    """Load and split the dataset into train, validation, and test sets."""
    # Load the dataset from S3.
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    seed = 42

    # Split 70% for training.
    train_dataset, rest = dataset.train_test_split(test_size=0.3, shuffle=True, seed=seed)
    # Split the remaining 30% into 15% validation and 15% testing.
    valid_dataset, test_dataset = rest.train_test_split(test_size=0.5, shuffle=True, seed=seed)
    return train_dataset, valid_dataset, test_dataset
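
The two chained `train_test_split` calls produce the 70/15/15 proportions because the second split only acts on the 30% remainder. The following minimal sketch checks that arithmetic in plain Python. `split_fractions` is a hypothetical helper written for illustration and isn't part of Ray Data.

```python
def split_fractions(first_test_size: float, second_test_size: float) -> tuple[float, float, float]:
    """Return the (train, validation, test) fractions of the full dataset."""
    # The first split keeps (1 - first_test_size) of the rows for training.
    train = 1.0 - first_test_size
    rest = first_test_size
    # The second split divides only the remainder.
    test = rest * second_test_size
    valid = rest - test
    return train, valid, test


train_frac, valid_frac, test_frac = split_fractions(0.3, 0.5)
print(f"train={train_frac:.2f}, validation={valid_frac:.2f}, test={test_frac:.2f}")
```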

# Load and split the dataset.
train_dataset, valid_dataset, _test_dataset = prepare_data()
train_dataset.take(1)

2025-04-16 21:01:53,956	INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.0.23.200:6379...
2025-04-16 21:01:53,966	INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at https://session-1kebpylz8tcjd34p4sv2h1f9tg.i.anyscaleuserdata.com 
2025-04-16 21:01:53,972	INFO packaging.py:575 -- Creating a file package for local module '/home/ray/default/e2e-xgboost/dist_xgboost'.
2025-04-16 21:01:53,975	INFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_aa0e5fd0ec6b8edc.zip' (0.02MiB) to Ray cluster...
2025-04-16 21:01:53,976	INFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_aa0e5fd0ec6b8edc.zip'.
2025-04-16 21:01:53,977	INFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_38ec1ca756a7ccf23a0c590d356f26fc87860d8a.zip' (0.07MiB) to Ray cluster...
2025-04-16 21:01:53,978	INFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_38ec1ca756a7ccf23a0c590d356f26fc87860d8a.zip'.

(autoscaler +11s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +11s) [autoscaler] [8CPU-32GB] Attempting to add 1 node(s) to the cluster (increasing from 0 to 1).
(autoscaler +11s) [autoscaler] [8CPU-32GB] Launched 1 instances.

2025-04-16 21:03:12,957	INFO dataset.py:2809 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.

[{'mean radius': 19.16,
  'mean texture': 26.6,
  'mean perimeter': 126.2,
  'mean area': 1138.0,
  'mean smoothness': 0.102,
  'mean compactness': 0.1453,
  'mean concavity': 0.1921,
  'mean concave points': 0.09664,
  'mean symmetry': 0.1902,
  'mean fractal dimension': 0.0622,
  'radius error': 0.6361,
  'texture error': 1.001,
  'perimeter error': 4.321,
  'area error': 69.65,
  'smoothness error': 0.007392,
  'compactness error': 0.02449,
  'concavity error': 0.03988,
  'concave points error': 0.01293,
  'symmetry error': 0.01435,
  'fractal dimension error': 0.003446,
  'worst radius': 23.72,
  'worst texture': 35.9,
  'worst perimeter': 159.8,
  'worst area': 1724.0,
  'worst smoothness': 0.1782,
  'worst compactness': 0.3841,
  'worst concavity': 0.5754,
  'worst concave points': 0.1872,
  'worst symmetry': 0.3258,
  'worst fractal dimension': 0.0972,
  'target': 0}]

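The output shows that the dataset contains features that characterize breast mass cell nuclei, such as radius, texture, perimeter, area, smoothness, compactness, concavity, and symmetry.

Data preprocessing#

Note that the features differ widely in magnitude and range. Tree-based models such as XGBoost are relatively insensitive to these differences, but feature scaling can still improve numerical stability in some cases.

Ray Data has built-in preprocessors that simplify common feature preprocessing tasks, especially for tabular data. You can integrate these preprocessors with Ray Datasets to preprocess data in a fault-tolerant, distributed way.

This example uses Ray's built-in StandardScaler to center each feature at zero mean and scale it to unit variance.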

from ray.data.preprocessors import StandardScaler


def train_preprocessor(train_dataset: ray.data.Dataset) -> StandardScaler:
    # Pick some dataset columns to scale.
    columns_to_scale = [c for c in train_dataset.columns() if c != "target"]

    # Initialize the preprocessor.
    preprocessor = StandardScaler(columns=columns_to_scale)
    # Train the preprocessor on the training set.
    preprocessor.fit(train_dataset)

    return preprocessor


preprocessor = train_preprocessor(train_dataset)
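
As a rough illustration of what fitting and applying a standard scaler involves, the sketch below learns a mean and standard deviation from training values only, then reuses those statistics on other data. It's plain Python, not Ray's implementation. The helper names and sample values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def fit_scaler(values: list[float]) -> tuple[float, float]:
    """Learn the mean and population standard deviation from training data."""
    return fmean(values), pstdev(values)


def apply_scaler(values: list[float], mean: float, std: float) -> list[float]:
    """Standardize values with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical values for a single feature column.
train_radius = [10.0, 12.0, 14.0, 16.0, 18.0]
valid_radius = [14.0, 30.0]

# Fit on the training split only, then apply the same statistics everywhere.
mean, std = fit_scaler(train_radius)
scaled_train = apply_scaler(train_radius, mean, std)
scaled_valid = apply_scaler(valid_radius, mean, std)
```

The scaled training values have zero mean and unit variance, but they aren't bounded: an outlier such as 30.0 in the validation data maps to a value well above 1.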

Now that you've fitted the preprocessor, save it to a file. Register this artifact in MLflow later so that downstream pipelines can reuse it.

import pickle

with open(preprocessor_path, "wb") as f:
    pickle.dump(preprocessor, f)
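
A downstream pipeline would read the file back with `pickle.load`. The following self-contained sketch shows the round trip, using a plain dictionary as a hypothetical stand-in for the fitted preprocessor and a temporary file in place of `preprocessor_path`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted preprocessor's state.
fitted_state = {"columns": ["mean radius", "mean texture"], "mean": [14.0, 19.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessor.pkl")

    # Save the object, as the tutorial does.
    with open(path, "wb") as f:
        pickle.dump(fitted_state, f)

    # Load it back, as a downstream pipeline would.
    with open(path, "rb") as f:
        restored_state = pickle.load(f)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.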

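Next, transform the datasets with the fitted preprocessor. Note that transform() is lazy: Ray Data doesn't apply it to the data until something consumes the dataset, such as take() or the Ray Train workers.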

train_dataset = preprocessor.transform(train_dataset)
valid_dataset = preprocessor.transform(valid_dataset)
train_dataset.take(1)

[{'mean radius': 1.3883915483364895,
  'mean texture': 1.6582900738074817,
  'mean perimeter': 1.3686612092802328,
  'mean area': 1.3271629358408426,
  'mean smoothness': 0.3726369329455741,
  'mean compactness': 0.7709391453349583,
  'mean concavity': 1.2156484038771678,
  'mean concave points': 1.1909841981870102,
  'mean symmetry': 0.33295997290846857,
  'mean fractal dimension': -0.07207903519571106,
  'radius error': 0.8074600624242092,
  'texture error': -0.3842391069975234,
  'perimeter error': 0.6925593054563496,
  'area error': 0.5852832746827147,
  'smoothness error': 0.13331319500721583,
  'compactness error': -0.03934175265392654,
  'concavity error': 0.22009334597724586,
  'concave points error': 0.16570998568362863,
  'symmetry error': -0.7220900323187186,
  'fractal dimension error': -0.13670701917436776,
  'worst radius': 1.5076654048043645,
  'worst texture': 1.6169142713721316,
  'worst perimeter': 1.5267353447826646,
  'worst area': 1.4332237868207693,
  'worst smoothness': 1.993402211865443,
  'worst compactness': 0.8646836438651355,
  'worst concavity': 1.3882655471454963,
  'worst concave points': 1.0898377217385602,
  'worst symmetry': 0.5707716568830431,
  'worst fractal dimension': 0.7444861349012516,
  'target': 0}]

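The take() output shows that Ray Data centered and rescaled the values to zero mean and unit variance, so most of them now fall within a few units of zero.

Data processing note:
For more advanced data loading and preprocessing techniques, see the comprehensive guide. Ray Data also supports performant joins, filters, aggregations, and other operations for more structured data processing, if needed.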

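Model training with XGBoost#

Checkpoint configuration#

Checkpointing is a powerful feature that lets you resume training from the last checkpoint after an interruption. It's especially useful for long-running training sessions.

XGBoostTrainer implements checkpointing out of the box. Configure CheckpointConfig to set the checkpoint frequency.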

from ray.train import CheckpointConfig, Result, RunConfig, ScalingConfig

# Configure checkpointing to save progress during training.
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        # Checkpoint every 10 iterations.
        checkpoint_frequency=10,
        # Only keep the latest checkpoint.
        num_to_keep=1,
    ),
    ## For multi-node clusters, configure storage that's accessible
    ## across all worker nodes with `storage_path="s3://..."`.
    storage_path=storage_path,
)
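
To make the effect of `checkpoint_frequency` and `num_to_keep` concrete, the following plain-Python sketch simulates which checkpoints would remain on disk after a run. It's a simplified model under stated assumptions, not Ray Train's implementation: `retained_checkpoints` is a hypothetical helper, and it ignores details such as a final checkpoint at the end of training or score-based retention.

```python
from collections import deque


def retained_checkpoints(num_iterations: int, frequency: int, num_to_keep: int) -> list[int]:
    """Return the iterations whose checkpoints remain after training."""
    # A bounded deque drops the oldest checkpoint once it's full.
    kept = deque(maxlen=num_to_keep)
    for iteration in range(1, num_iterations + 1):
        if iteration % frequency == 0:
            kept.append(iteration)
    return list(kept)


# 30 boosting rounds, a checkpoint every 10 rounds, keep only the latest one.
print(retained_checkpoints(30, 10, 1))
# Same run, but keep the two most recent checkpoints.
print(retained_checkpoints(30, 10, 2))
```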

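Note: With checkpointing enabled, you can follow this guide to enable fault tolerance.

Training with XGBoost#

Pass the training parameters as a dictionary, similar to the original xgboost.train() function.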

import xgboost
from ray.train.xgboost import RayTrainReportCallback, XGBoostTrainer

NUM_WORKERS = 5
USE_GPU = True


def train_fn_per_worker(config: dict):
    """Training function that runs on each worker.

    This function:
    1. Gets the dataset shard for this worker
    2. Converts to pandas for XGBoost
    3. Separates features and labels
    4. Creates DMatrix objects
    5. Trains the model using distributed communication
    """
    # Get this worker's dataset shard.
    train_ds, val_ds = (
        ray.train.get_dataset_shard("train"),
        ray.train.get_dataset_shard("validation"),
    )

    # Materialize the data and convert to pandas.
    train_ds = train_ds.materialize().to_pandas()
    val_ds = val_ds.materialize().to_pandas()

    # Separate the labels from the features.
    train_X, train_y = train_ds.drop("target", axis=1), train_ds["target"]
    eval_X, eval_y = val_ds.drop("target", axis=1), val_ds["target"]

    # Convert the data into DMatrix format for XGBoost.
    dtrain = xgboost.DMatrix(train_X, label=train_y)
    deval = xgboost.DMatrix(eval_X, label=eval_y)

    # Do distributed data-parallel training.
    # Ray Train sets up the necessary coordinator processes and
    # environment variables for workers to communicate with each other.
    _booster = xgboost.train(
        config["xgboost_params"],
        dtrain=dtrain,
        evals=[(dtrain, "train"), (deval, "validation")],
        num_boost_round=10,
        # Handles metric logging and checkpointing.
        callbacks=[RayTrainReportCallback()],
    )


# Parameters for the XGBoost model.
model_config = {
    "xgboost_params": {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }
}

trainer = XGBoostTrainer(
    train_fn_per_worker,
    train_loop_config=model_config,
    # Register the data subsets.
    datasets={"train": train_dataset, "validation": valid_dataset},
    # See "Scaling strategies" for more details.
    scaling_config=ScalingConfig(
        # Number of workers for data parallelism.
        num_workers=NUM_WORKERS,
        # Set to True to use GPU acceleration.
        use_gpu=USE_GPU,
    ),
    run_config=run_config,
)

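Advantages of Ray Train:

Multi-node orchestration: Automatically handles multi-node, multi-GPU setups without manual SSH or hostfile configuration.

Built-in fault tolerance: Supports automatic retries of failed workers and can continue from the last checkpoint.

Flexible training strategies: Supports various parallelism strategies beyond data-parallel training.

Heterogeneous cluster support: Define per-worker resource requirements and run on mixed hardware.

Ray Train integrates with popular frameworks such as PyTorch, TensorFlow, and XGBoost. For enterprise needs, RayTurbo Train offers additional features such as elastic training, advanced monitoring, and performance optimization.

Next, train the model.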

result: Result = trainer.fit()
result

(TrainController pid=19121) Attempting to start training worker group of size 5 with the following resources: [{'GPU': 1}] * 5

(autoscaler +1m31s) [autoscaler] [8xA10G:192CPU-768GB] Attempting to add 1 node(s) to the cluster (increasing from 0 to 1).
(autoscaler +1m31s) [autoscaler] Launching instances failed: NewInstances[g5.48xlarge;num:1;all:false]: could not launch any instances: api error Unsupported: Instance type g5.48xlarge is not supported in zone us-west-2d.
(autoscaler +1m31s) [autoscaler] [1xA10G:16CPU-64GB] Attempting to add 5 node(s) to the cluster (increasing from 0 to 5).
(autoscaler +1m31s) [autoscaler] Launching instances failed: NewInstances[g5.4xlarge;num:5;all:false]: could not launch any instances: api error Unsupported: Instance type g5.4xlarge is not supported in zone us-west-2d.
(autoscaler +1m31s) [autoscaler] [1xA10G:32CPU-128GB] Attempting to add 5 node(s) to the cluster (increasing from 0 to 5).
(autoscaler +1m36s) [autoscaler] Launching instances failed: NewInstances[g5.8xlarge;num:5;all:false]: could not launch any instances: api error Unsupported: Instance type g5.8xlarge is not supported in zone us-west-2d.
(autoscaler +1m36s) [autoscaler] [1xL4:4CPU-16GB] Attempting to add 1 node(s) to the cluster (increasing from 0 to 1).
(autoscaler +1m36s) [autoscaler] [4xL4:48CPU-192GB] Attempting to add 1 node(s) to the cluster (increasing from 0 to 1).
(autoscaler +1m36s) [autoscaler] [4xL4:48CPU-192GB] Launched 1 instances.
(autoscaler +1m36s) [autoscaler] [1xL4:4CPU-16GB] Launched 1 instances.

(TrainController pid=19121) Retrying the launch of the training worker group. The previous launch attempt encountered the following failure:
(TrainController pid=19121) The worker group startup timed out after 30.0 seconds waiting for 5 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
(TrainController pid=19121) Attempting to start training worker group of size 5 with the following resources: [{'GPU': 1}] * 5

(autoscaler +2m21s) [autoscaler] Cluster upscaled to {12 CPU, 1 GPU}.

(TrainController pid=19121) Retrying the launch of the training worker group. The previous launch attempt encountered the following failure:
(TrainController pid=19121) The worker group startup timed out after 30.0 seconds waiting for 5 workers. Potential causes include: (1) temporary insufficient cluster resources while waiting for autoscaling (ignore this warning in this case), (2) infeasible resource request where the provided `ScalingConfig` cannot be satisfied), and (3) transient network issues. Set the RAY_TRAIN_WORKER_GROUP_START_TIMEOUT_S environment variable to increase the timeout.
(TrainController pid=19121) Attempting to start training worker group of size 5 with the following resources: [{'GPU': 1}] * 5

(autoscaler +2m31s) [autoscaler] Cluster upscaled to {60 CPU, 5 GPU}.
(raylet) WARNING: 4 PYTHON worker processes have been started on node: dc30e171b93f61245644ba4d0147f8b27f64e9e1eaf34d1bb63c9c99 with address: 10.0.23.200. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).

(RayTrainWorker pid=3285, ip=10.0.223.105) [21:04:38] Task [xgboost.ray-rank=00000002]:fa43387771ebd5738fd50b6303000000 got rank 2
(TrainController pid=19121) [21:04:42] [0]	train-logloss:0.44514	train-error:0.04051	validation-logloss:0.43997	validation-error:0.04706
(TrainController pid=19121) [21:04:44] [1]	train-logloss:0.31649	train-error:0.01772	validation-logloss:0.31594	validation-error:0.04706
(RayTrainWorker pid=2313, ip=10.0.223.33) [21:04:38] Task [xgboost.ray-rank=00000004]:a6ed8004330660f5a370531f03000000 got rank 4 [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.rayai.org.cn/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(TrainController pid=19121) [21:04:46] [2]	train-logloss:0.23701	train-error:0.01266	validation-logloss:0.24072	validation-error:0.02353
(TrainController pid=19121) [21:04:48] [3]	train-logloss:0.18165	train-error:0.00759	validation-logloss:0.19038	validation-error:0.01176
(TrainController pid=19121) [21:04:50] [4]	train-logloss:0.14258	train-error:0.00759	validation-logloss:0.14917	validation-error:0.01176
(TrainController pid=19121) [21:04:52] [5]	train-logloss:0.11360	train-error:0.00759	validation-logloss:0.12113	validation-error:0.01176
(TrainController pid=19121) [21:04:54] [6]	train-logloss:0.09207	train-error:0.00759	validation-logloss:0.10018	validation-error:0.01176
(TrainController pid=19121) [21:04:56] [7]	train-logloss:0.07616	train-error:0.00506	validation-logloss:0.08632	validation-error:0.01176
(TrainController pid=19121) [21:04:58] [8]	train-logloss:0.06419	train-error:0.00506	validation-logloss:0.07705	validation-error:0.01176
(TrainController pid=19121) [21:05:00] [9]	train-logloss:0.05463	train-error:0.00506	validation-logloss:0.06741	validation-error:0.01176
(RayTrainWorker pid=3284, ip=10.0.223.105) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/user_storage/ray_train_run-2025-04-16_21-03-13/checkpoint_2025-04-16_21-05-00.160991)

Result(metrics=OrderedDict({'train-logloss': 0.05463397157248817, 'train-error': 0.00506329113924051, 'validation-logloss': 0.06741214815308066, 'validation-error': 0.01176470588235294}), checkpoint=Checkpoint(filesystem=local, path=/mnt/user_storage/ray_train_run-2025-04-16_21-03-13/checkpoint_2025-04-16_21-05-00.160991), error=None, path='/mnt/user_storage/ray_train_run-2025-04-16_21-03-13', metrics_dataframe=   train-logloss  train-error  validation-logloss  validation-error
0       0.054634     0.005063            0.067412          0.011765, best_checkpoints=[(Checkpoint(filesystem=local, path=/mnt/user_storage/ray_train_run-2025-04-16_21-03-13/checkpoint_2025-04-16_21-05-00.160991), OrderedDict({'train-logloss': 0.05463397157248817, 'train-error': 0.00506329113924051, 'validation-logloss': 0.06741214815308066, 'validation-error': 0.01176470588235294}))], _storage_filesystem=<pyarrow._fs.LocalFileSystem object at 0x7ea450adb130>)

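At the start of the training job, Ray begins requesting GPU nodes to satisfy the job's requirement of five GPU workers.

Ray Train returns a ray.train.Result object, which contains important attributes such as metrics, checkpoint information, and error details.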

metrics = result.metrics
metrics

OrderedDict([('train-logloss', 0.05463397157248817),
             ('train-error', 0.00506329113924051),
             ('validation-logloss', 0.06741214815308066),
             ('validation-error', 0.01176470588235294)])


As shown, Ray Train logs metrics based on the values you configured in eval_metric and evals.
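
For reference, the two configured metrics can be computed by hand from predicted probabilities. The sketch below is a plain-Python illustration with made-up labels and predictions. The clipping constant `eps` is an assumption, and XGBoost's internal implementation may differ in such details.

```python
import math


def logloss(labels: list[int], probs: list[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip to avoid log(0). The value of eps is an assumption.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def error_rate(labels: list[int], probs: list[float], threshold: float = 0.5) -> float:
    """Fraction of rows where the thresholded prediction disagrees with the label."""
    wrong = sum((p > threshold) != bool(y) for y, p in zip(labels, probs))
    return wrong / len(labels)


# Made-up labels and predicted probabilities of the positive class.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.1]

print(f"logloss={logloss(labels, probs):.4f}, error={error_rate(labels, probs):.2f}")
```

Only the third row is misclassified here, because its predicted probability of 0.4 falls below the 0.5 threshold while the label is 1.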

You can also reconstruct the trained model from the checkpoint directory.

booster = RayTrainReportCallback.get_model(result.checkpoint)
booster

<xgboost.core.Booster at 0x7ea4531beea0>

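Model registry#

After training the model, save it to a model registry for future use. Because this is a distributed training workload, the model registry storage must be accessible from all workers in the cluster. This storage can be S3, NFS, or another network-attached solution. Anyscale simplifies this process by automatically creating and mounting shared storage options on every cluster node, ensuring that model artifacts are readable and writable across the distributed environment.

The MLflow tracking server stores experiment metadata and model artifacts in the shared storage location, so they're available for future model serving, evaluation, or retraining. Ray also integrates with other experiment trackers.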

import shutil
from tempfile import TemporaryDirectory

import mlflow

from dist_xgboost.constants import (
    experiment_name,
    model_fname,
    model_registry,
    preprocessor_fname,
)


def clean_up_old_runs():
    # Clean up old MLflow runs.
    if os.path.isdir(model_registry):
        shutil.rmtree(model_registry)
    # mlflow.delete_experiment(experiment_name)
    os.makedirs(model_registry, exist_ok=True)


def log_run_to_mlflow(model_config, result, preprocessor_path):
    # Create a model registry in user storage.
    mlflow.set_tracking_uri(f"file:{model_registry}")

    # Create a new experiment and log metrics and artifacts.
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run(description="xgboost breast cancer classifier on all features"):
        mlflow.log_params(model_config)
        mlflow.log_metrics(result.metrics)

        # Selectively log just the preprocessor and model weights.
        with TemporaryDirectory() as tmp_dir:
            shutil.copy(
                os.path.join(result.checkpoint.path, model_fname),
                os.path.join(tmp_dir, model_fname),
            )
            shutil.copy(
                preprocessor_path,
                os.path.join(tmp_dir, preprocessor_fname),
            )

            mlflow.log_artifacts(tmp_dir)


clean_up_old_runs()
log_run_to_mlflow(model_config, result, preprocessor_path)

2025/04/16 21:07:07 INFO mlflow.tracking.fluent: Experiment with name 'breast_cancer_all_features' does not exist. Creating a new experiment.
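
The `log_run_to_mlflow` function above copies only the model file and the preprocessor into a temporary directory before calling `mlflow.log_artifacts`, so nothing else from the checkpoint directory gets logged. The following standard-library sketch isolates that staging pattern. The helper `stage_artifacts` and the file names are hypothetical.

```python
import os
import shutil
from tempfile import TemporaryDirectory


def stage_artifacts(sources: dict[str, str], staging_dir: str) -> list[str]:
    """Copy each source file into staging_dir under its target name."""
    for target_name, source_path in sources.items():
        shutil.copy(source_path, os.path.join(staging_dir, target_name))
    return sorted(os.listdir(staging_dir))


with TemporaryDirectory() as checkpoint_dir, TemporaryDirectory() as staging_dir:
    # Hypothetical checkpoint directory with one wanted file and one extra file.
    model_path = os.path.join(checkpoint_dir, "model.ubj")
    extra_path = os.path.join(checkpoint_dir, "scratch.tmp")
    for path in (model_path, extra_path):
        with open(path, "wb") as f:
            f.write(b"placeholder")

    # Stage only the model file. The extra file stays behind.
    staged = stage_artifacts({"model.ubj": model_path}, staging_dir)
    print(staged)
```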

Start the MLflow server to view the experiments.

mlflow server -h 0.0.0.0 -p 8080 --backend-store-uri {model_registry}

To view the dashboard, go to **Overview tab** > **Open Ports** > 8080.

https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/mlflow.png

You can also view the Ray Dashboard and the Train workload dashboards.

https://raw.githubusercontent.com/anyscale/e2e-xgboost/refs/heads/main/images/train_metrics.png

You can retrieve the best model from the registry.

from dist_xgboost.data import get_best_model_from_registry

best_model, artifacts_dir = get_best_model_from_registry()
artifacts_dir

'/mnt/user_storage/mlflow/290203875164933232/eb2666ca6cee4792bfda41a02b194d87/artifacts'
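
This tutorial doesn't show how `get_best_model_from_registry` chooses a run. One plausible rule, sketched below as an assumption, is to pick the run with the lowest validation error. `pick_best_run` and the run records are hypothetical and use plain dictionaries rather than MLflow objects.

```python
def pick_best_run(runs: list[dict], metric: str = "validation-error") -> dict:
    """Return the run with the lowest value of the given metric."""
    # Ignore runs that never reported the metric.
    scored = [run for run in runs if metric in run["metrics"]]
    if not scored:
        raise ValueError(f"No run reports {metric!r}")
    return min(scored, key=lambda run: run["metrics"][metric])


# Hypothetical run records.
runs = [
    {"run_id": "run-a", "metrics": {"validation-error": 0.047}},
    {"run_id": "run-b", "metrics": {"validation-error": 0.012}},
    {"run_id": "run-c", "metrics": {}},
]

best = pick_best_run(runs)
print(best["run_id"])
```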

Production deployment#

You can wrap the training workload as a production-grade Anyscale Job. See the API reference for more details.

from dist_xgboost.constants import root_dir

os.environ["WORKING_DIR"] = root_dir

Then submit the job with the anyscale CLI.

%%bash

# Production batch job -- note that this is a bash cell
anyscale job submit --name=train-xboost-breast-cancer-model \
  --containerfile="${WORKING_DIR}/containerfile" \
  --working-dir="${WORKING_DIR}" \
  --exclude="" \
  --max-retries=0 \
  --wait \
  -- cd notebooks && jupyter nbconvert --to script 01-Distributed_Training.ipynb && ipython 01-Distributed_Training.py

Output
(anyscale +0.9s) Submitting job with config JobConfig(name='train-xboost-breast-cancer-model', image_uri=None, compute_config=None, env_vars=None, py_modules=None, py_executable=None, cloud=None, project=None, ray_version=None, job_queue_config=None).
(anyscale +2.6s) Uploading local dir '/home/ray/default/e2e-xgboost' to cloud storage.
(anyscale +3.8s) Including workspace-managed pip dependencies.
(anyscale +4.2s) Job 'train-xboost-breast-cancer-model' submitted, ID: 'prodjob_bkbpnmhytt3ljt8ftlnyumjxdj'.
(anyscale +4.2s) View the job in the UI: https://console.anyscale.com/jobs/prodjob_bkbpnmhytt3ljt8ftlnyumjxdj
(anyscale +4.2s) Use `--wait` to wait for the job to run and stream logs.

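The containerfile defines the dependencies, but you can also use a prebuilt image.

You can specify the compute requirements as a compute config or inline in the job config.

When you launch from a workspace without specifying compute, the job defaults to the workspace's compute config.

Scaling strategies#

A major advantage of Ray Train is how easily it scales training workloads. By adjusting ScalingConfig, you can optimize resource utilization and reduce training time.

Scaling examples#

Multi-node CPU example: 4 nodes with 8 CPUs each.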

scaling_config = ScalingConfig(
    num_workers=4,
    resources_per_worker={"CPU": 8},
)

Single-node multi-GPU example: 1 node with 8 CPUs and 4 GPUs.

scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=True,
)

Multi-node multi-GPU example: 4 nodes with 8 CPUs and 4 GPUs each.

scaling_config = ScalingConfig(
    num_workers=16,
    use_gpu=True,
)
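
The total resources that each configuration requests follow from multiplying the per-worker request by the number of workers. The sketch below works through the three examples. `total_resources` is a hypothetical helper, and it assumes that `use_gpu=True` requests one GPU per worker, which matches the `[{'GPU': 1}] * 5` request in the training log earlier.

```python
def total_resources(num_workers: int, resources_per_worker: dict[str, float]) -> dict[str, float]:
    """Multiply the per-worker resource request by the number of workers."""
    return {name: amount * num_workers for name, amount in resources_per_worker.items()}


# Multi-node CPU example: 4 workers with 8 CPUs each.
multi_node_cpu = total_resources(4, {"CPU": 8})
# Single-node multi-GPU example: 4 workers with 1 GPU each (assumed default).
single_node_gpu = total_resources(4, {"GPU": 1})
# Multi-node multi-GPU example: 16 workers with 1 GPU each (assumed default).
multi_node_gpu = total_resources(16, {"GPU": 1})

print(multi_node_cpu, single_node_gpu, multi_node_gpu)
```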

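Important: For multi-node clusters, you must specify a shared storage location, such as cloud storage or NFS, in run_config. Using a local path raises an error during checkpointing.

trainer = XGBoostTrainer(
    ..., run_config=ray.train.RunConfig(storage_path="s3://...")
)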

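Worker configuration guidelines#

The optimal number of workers depends on the workload and cluster setup.

For CPU-only training, generally use one worker per node. XGBoost can use multiple CPUs through threading.
For multi-GPU training, use one worker per GPU.
For heterogeneous clusters, consider the greatest common divisor of the CPU counts.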
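
As a worked example of the greatest-common-divisor guideline, the sketch below sizes workers for a hypothetical cluster whose nodes have 8, 16, and 48 CPUs. The helper names are illustrative only. The idea is that a worker sized at the GCD tiles every node exactly, leaving no CPUs idle.

```python
from functools import reduce
from math import gcd


def cpus_per_worker(node_cpu_counts: list[int]) -> int:
    """Largest worker size that divides every node's CPU count evenly."""
    return reduce(gcd, node_cpu_counts)


def workers_per_node(node_cpu_counts: list[int]) -> list[int]:
    """Number of equally sized workers that fit on each node."""
    size = cpus_per_worker(node_cpu_counts)
    return [count // size for count in node_cpu_counts]


# Hypothetical heterogeneous cluster.
nodes = [8, 16, 48]
worker_size = cpus_per_worker(nodes)
layout = workers_per_node(nodes)
print(worker_size, layout, sum(layout))
```

With these numbers, each worker would request 8 CPUs, and the cluster would fit 1, 2, and 6 workers on the three nodes, for 9 workers in total.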

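GPU acceleration#

To train with GPUs:

Start one actor per GPU with use_gpu=True.
Set GPU-compatible parameters, for example XGBoost's tree_method="gpu_hist" (XGBoost 2.0 and later deprecate this value in favor of device="cuda" with tree_method="hist").
Divide the CPUs evenly across the actors on each machine.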
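
The last guideline amounts to integer division of a node's CPUs by its GPUs. The sketch below computes a per-worker resource request for a node with 48 CPUs and 4 GPUs, such as the `4xL4:48CPU-192GB` instance in the autoscaler log earlier. `cpus_per_gpu_actor` is a hypothetical helper.

```python
def cpus_per_gpu_actor(node_cpus: int, node_gpus: int) -> int:
    """Evenly divide a node's CPUs among its one-per-GPU actors."""
    if node_gpus <= 0:
        raise ValueError("node_gpus must be positive")
    return node_cpus // node_gpus


# One actor per GPU, each with an equal share of the node's CPUs.
resources_per_worker = {"CPU": cpus_per_gpu_actor(48, 4), "GPU": 1}
print(resources_per_worker)
```

You could pass a dictionary like this as `resources_per_worker` in `ScalingConfig`, as the multi-node CPU example above does.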

Example:#

trainer = XGBoostTrainer(
    # Other arguments elided.
    ...,
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration.
        use_gpu=True,
    ),
    params={
        # XGBoost specific params.
        "tree_method": "gpu_hist",  # GPU-specific parameter
        "eval_metric": ["logloss", "error"],
    },
)

For more advanced topics, see:

Ray Tune for hyperparameter optimization.
Ray Serve for model deployment.
Ray Data for more advanced data processing.