Getting Started#

Note

Ray 2.40 uses RLlib's new API stack by default. The Ray team has mostly completed the migration of algorithms, example scripts, and documentation to the new code base.

If you're still using the old API stack, see the new API stack migration guide for details on how to migrate.

RLlib in 60 minutes#


In this tutorial, you learn how to design, customize, and run an end-to-end RLlib learning experiment from scratch. This includes picking and configuring an Algorithm, running a couple of training iterations, saving the Algorithm's state from time to time, running a separate evaluation loop, and finally using one of the checkpoints to deploy the trained model to an environment outside of RLlib and compute actions with it.

You also learn how to customize your RL environment and your neural network model.

Installation#

First, install RLlib, PyTorch, and Farama Gymnasium as shown below:

pip install "ray[rllib]" torch "gymnasium[atari,accept-rom-license,mujoco]"

Python API#

RLlib's Python API provides all the flexibility required for applying the library to any type of RL problem.

You manage RLlib experiments through an instance of the Algorithm class. An Algorithm typically holds a neural network for computing actions, called the policy, the RL environment that you want to optimize against, a loss function, an optimizer, and some code describing the algorithm's execution logic, for example determining when to collect samples and when to update your model.

In multi-agent training, an Algorithm manages the querying and optimization of multiple policies at the same time.
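
For example, a minimal multi-agent setup could map different agent IDs to different policies through the config's multi_agent() method. The following is only a sketch: the environment name "my_multi_agent_env" and the agent-to-policy mapping are illustrative assumptions, not part of the examples in this guide.

from ray.rllib.algorithms.ppo import PPOConfig

# Sketch of a multi-agent setup. "my_multi_agent_env" and the agent IDs
# are hypothetical placeholders for an actual multi-agent environment.
multi_agent_config = (
    PPOConfig()
    .environment("my_multi_agent_env")
    .multi_agent(
        # Train two separate policies within the same Algorithm.
        policies={"policy_1", "policy_2"},
        # Map each agent in the env to one of the two policies.
        policy_mapping_fn=lambda agent_id, episode, **kwargs: (
            "policy_1" if agent_id == "agent_0" else "policy_2"
        ),
    )
)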

Through the algorithm's interface, you can train the policy, compute actions, or store your algorithm's state through checkpointing.

Configure and build the algorithm#

You first create an AlgorithmConfig instance and change some default settings through the config object's various methods.

For example, you can set the RL environment you want to use by calling the config's environment() method:

from ray.rllib.algorithms.ppo import PPOConfig

# Create a config instance for the PPO algorithm.
config = (
    PPOConfig()
    .environment("Pendulum-v1")
)

To scale your setup and define how many EnvRunner actors you want to leverage, you can call the env_runners() method. EnvRunners collect samples from your environment for the training updates.

config.env_runners(num_env_runners=2)

For training-related settings or any algorithm-specific settings, use the training() method:

config.training(
    lr=0.0002,
    train_batch_size_per_learner=2000,
    num_epochs=10,
)

Finally, build the actual Algorithm instance by calling your config's build_algo() method.

# Build the Algorithm (PPO).
ppo = config.build_algo()

Note

See here to learn about all the methods you can use to configure an Algorithm.
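
For instance, a few other commonly used config methods let you pick the deep learning framework and scale the learning side of the setup. The following is a sketch with illustrative values; exact argument names may vary slightly between Ray versions.

# A sketch of a few more config methods (values are illustrative).
config = (
    PPOConfig()
    .environment("Pendulum-v1")
    # Use PyTorch (the default on the new API stack).
    .framework("torch")
    # Scale the learning side: one Learner actor, CPU-only.
    .learners(num_learners=1, num_gpus_per_learner=0)
)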

Run the algorithm#

After you have built your PPO from its config, you can train it for a number of iterations by calling its train() method, which returns a result dictionary that you can pretty-print for debugging purposes:

from pprint import pprint

for _ in range(4):
    pprint(ppo.train())

Checkpoint the algorithm#

To save the current state of your Algorithm, create a checkpoint by calling its save_to_path() method, which returns the directory of the saved checkpoint.

Instead of not passing any arguments to this call and letting the Algorithm decide where to save the checkpoint, you can also provide a checkpoint directory yourself:

checkpoint_path = ppo.save_to_path()

# OR:
# ppo.save_to_path([a checkpoint location of your choice])
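
To later restore a new Algorithm instance from such a checkpoint, for example to continue training, you can use the Algorithm.from_checkpoint() class method. The following is a minimal sketch that reuses the checkpoint_path returned above:

from ray.rllib.algorithms.algorithm import Algorithm

# Restore a fresh Algorithm instance from the checkpoint saved above.
restored_ppo = Algorithm.from_checkpoint(checkpoint_path)

# Continue training from the restored state.
restored_ppo.train()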

Evaluate the algorithm#

RLlib supports setting up a separate EnvRunnerGroup whose only purpose is to evaluate your model from time to time on the RL environment.

Use your config's evaluation() method to set up the details. By default, RLlib doesn't perform evaluation during training and only reports the results of collecting training samples with its "regular" EnvRunnerGroup.

config.evaluation(
    # Run one evaluation round every iteration.
    evaluation_interval=1,

    # Create 2 eval EnvRunners in the extra EnvRunnerGroup.
    evaluation_num_env_runners=2,

    # Run evaluation for exactly 10 episodes. Note that because you have
    # 2 EnvRunners, each one runs through 5 episodes.
    evaluation_duration_unit="episodes",
    evaluation_duration=10,
)

# Rebuild the PPO, but with the extra evaluation EnvRunnerGroup
ppo_with_evaluation = config.build_algo()

for _ in range(3):
    pprint(ppo_with_evaluation.train())
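
The evaluation results appear under their own key in the returned result dictionary. The following sketch assumes the new API stack's metric layout, with evaluation results nested under "evaluation" and "env_runners":

# Run one more training iteration and read the evaluation metrics
# from the result dict (key layout assumed as described above).
eval_results = ppo_with_evaluation.train()
eval_return = eval_results["evaluation"]["env_runners"]["episode_return_mean"]
print(f"Mean episode return during evaluation: {eval_return}")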

Use RLlib with Ray Tune#

All online RLlib Algorithm classes are compatible with the Ray Tune API.

Note

Offline RL algorithms, such as BC, CQL, or MARWIL, require more work on Tune and Ray Data to add Ray Tune support.

This integration allows you to utilize your configured Algorithm in Ray Tune experiments.

For example, the following code performs a hyperparameter sweep over your PPO, creating three Trials, one for each configured learning rate:

from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("Pendulum-v1")
    # Specify a simple tune hyperparameter sweep.
    .training(
        lr=tune.grid_search([0.001, 0.0005, 0.0001]),
    )
)

# Create a Tuner instance to manage the trials.
tuner = tune.Tuner(
    config.algo_class,
    param_space=config,
    # Specify a stopping criterion. Note that the criterion has to match one of the
    # pretty printed result metrics from the results returned previously by
    # ``.train()``. Also note that -1100 is not a good episode return for
    # Pendulum-v1, we are using it here to shorten the experiment time.
    run_config=train.RunConfig(
        stop={"env_runners/episode_return_mean": -1100.0},
    ),
)
# Run the Tuner and capture the results.
results = tuner.fit()

Note that Ray Tune creates a separate Ray actor for each Trial, assigns compute resources to each Trial, and runs them in parallel, if possible, on your Ray cluster:

Trial status: 3 RUNNING
Current time: 2025-01-17 18:47:33. Total running time: 3min 0s
Logical resource usage: 9.0/12 CPUs, 0/0 GPUs
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                   status       lr   iter  total time (s)  episode_return_mean  .._sampled_lifetime │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_Pendulum-v1_b5c41_00000  RUNNING  0.001      29         86.2426             -998.449               108000 │
│ PPO_Pendulum-v1_b5c41_00001  RUNNING  0.0005     25         74.4335             -997.079               100000 │
│ PPO_Pendulum-v1_b5c41_00002  RUNNING  0.0001     20         60.0421             -960.293                80000 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Tuner.fit() returns a ResultGrid object that allows for a detailed analysis of the training process and for retrieving the checkpoints of the trained algorithms and their models:

# Get the best result from the preceding experiment, based on the optimized metric.
best_result = results.get_best_result(
    metric="env_runners/episode_return_mean", mode="max"
)
# Get the best checkpoint corresponding to the best result.
best_checkpoint = best_result.checkpoint

Deploy a trained model for inference#

After training, you might want to deploy your model into a new environment, for example to run inference in production. For this purpose, you can use the checkpoint directory created in the preceding example. To read more about checkpoints, model deployment, and restoring algorithm state, see the checkpointing page here.

Here's how to create a new model instance from the checkpoint and run inference through a single episode of your RL environment. Note in particular the use of the from_checkpoint() method to create the model and the forward_inference() method to compute actions:

from pathlib import Path
import gymnasium as gym
import numpy as np
import torch
from ray.rllib.core.rl_module import RLModule

# Create only the neural network (RLModule) from our algorithm checkpoint.
# See here (https://docs.rayai.org.cn/en/master/rllib/checkpoints.html)
# to learn more about checkpointing and the specific "path" used.
rl_module = RLModule.from_checkpoint(
    Path(best_checkpoint.path)
    / "learner_group"
    / "learner"
    / "rl_module"
    / "default_policy"
)

# Create the RL environment to test against (same as was used for training earlier).
env = gym.make("Pendulum-v1", render_mode="human")

episode_return = 0.0
done = False

# Reset the env to get the initial observation.
obs, info = env.reset()

while not done:
    # Uncomment this line to render the env.
    # env.render()

    # Compute the next action from a batch (B=1) of observations.
    obs_batch = torch.from_numpy(obs).unsqueeze(0)  # add batch B=1 dimension
    model_outputs = rl_module.forward_inference({"obs": obs_batch})

    # Extract the action distribution parameters from the output and dissolve batch dim.
    action_dist_params = model_outputs["action_dist_inputs"][0].numpy()

    # We have continuous actions -> take the mean (max likelihood).
    greedy_action = np.clip(
        action_dist_params[0:1],  # 0=mean, 1=log(stddev), [0:1]=use mean, but keep shape=(1,)
        a_min=env.action_space.low[0],
        a_max=env.action_space.high[0],
    )
    # For discrete actions, you should take the argmax over the logits:
    # greedy_action = np.argmax(action_dist_params)

    # Send the action to the environment for the next step.
    obs, reward, terminated, truncated, info = env.step(greedy_action)

    # Perform env-loop bookkeeping.
    episode_return += reward
    done = terminated or truncated

print(f"Reached episode return of {episode_return}.")

Alternatively, if you still have an Algorithm instance up and running in your script, you can get the RLModule through the get_module() method:

rl_module = ppo.get_module("default_policy")  # Equivalent to `rl_module = ppo.get_module()`

Customize your RL environment#

In the preceding examples, your RL environment was a pre-registered environment from Farama Gymnasium, such as Pendulum-v1 or CartPole-v1. However, if you want to run your experiments against a custom environment, see the tab below for an example in less than 50 lines of code.

See here for an in-depth guide on how to set up and customize RL environments in RLlib.

Quickstart: Custom RL environment
import gymnasium as gym
import numpy as np

from ray.rllib.algorithms.ppo import PPOConfig

# Define your custom env class by subclassing gymnasium.Env:

class ParrotEnv(gym.Env):
    """Environment in which the agent learns to repeat the seen observations.

    Observations are float numbers indicating the to-be-repeated values,
    e.g. -1.0, 5.1, or 3.2.
    The action space is the same as the observation space.
    Rewards are `r=-abs([observation] - [action])`, for all steps.
    """
    def __init__(self, config=None):
        # `config` may be None when the env is created directly, outside of RLlib.
        config = config or {}
        # Since actions should repeat observations, their spaces must be the same.
        self.observation_space = config.get(
            "obs_act_space",
            gym.spaces.Box(-1.0, 1.0, (1,), np.float32),
        )
        self.action_space = self.observation_space
        self._cur_obs = None
        self._episode_len = 0

    def reset(self, *, seed=None, options=None):
        """Resets the environment, starting a new episode."""
        # Reset the episode len.
        self._episode_len = 0
        # Sample a random number from our observation space.
        self._cur_obs = self.observation_space.sample()
        # Return initial observation.
        return self._cur_obs, {}

    def step(self, action):
        """Takes a single step in the episode given `action`."""
        # Set `terminated` and `truncated` flags to True after 10 steps.
        self._episode_len += 1
        terminated = truncated = self._episode_len >= 10
        # Compute the reward: `r = -abs([obs] - [action])`
        reward = -sum(abs(self._cur_obs - action))
        # Set a new observation (random sample).
        self._cur_obs = self.observation_space.sample()
        return self._cur_obs, reward, terminated, truncated, {}

# Point your config to your custom env class:
config = (
    PPOConfig()
    .environment(
        ParrotEnv,
        # Add `env_config={"obs_act_space": [some Box space]}` to customize.
    )
)

# Build a PPO algorithm and train it.
ppo_w_custom_env = config.build_algo()
ppo_w_custom_env.train()
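
If you prefer referring to your custom environment by a string name, as with the pre-registered Gymnasium envs, you can register it through Ray Tune's registry. The following sketch uses the hypothetical name "parrot_env" and an illustrative custom observation/action space:

from ray.tune.registry import register_env

# Register the custom env under a string name ("parrot_env" is arbitrary).
register_env("parrot_env", lambda env_config: ParrotEnv(env_config))

# The config can now refer to the env by name, including a custom space.
config_by_name = PPOConfig().environment(
    "parrot_env",
    env_config={"obs_act_space": gym.spaces.Box(-2.0, 2.0, (1,), np.float32)},
)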

Customize your model#

In the preceding examples, because you didn't specify anything in your AlgorithmConfig, RLlib provided a default neural network model. If you want to reconfigure the type and size of RLlib's default models, for example defining the number of hidden layers and their activation functions, or even write your own custom models from scratch using PyTorch, see here for a detailed guide on the RLModule class.

See the tab below for an example in roughly 30 lines of code.

Quickstart: Custom RLModule
import torch

from ray.rllib.core.columns import Columns
from ray.rllib.core.rl_module.torch import TorchRLModule

# Define your custom RLModule class by subclassing `TorchRLModule`:
class CustomTorchRLModule(TorchRLModule):
    def setup(self):
        # You have access here to the following already set attributes:
        # self.observation_space
        # self.action_space
        # self.inference_only
        # self.model_config  # <- a dict with custom settings
        input_dim = self.observation_space.shape[0]
        hidden_dim = self.model_config["hidden_dim"]
        output_dim = self.action_space.n

        # Define and assign your torch subcomponents.
        self._policy_net = torch.nn.Sequential(
            torch.nn.Linear(input_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, output_dim),
        )

    def _forward(self, batch, **kwargs):
        # Push the observations from the batch through our `self._policy_net`.
        action_logits = self._policy_net(batch[Columns.OBS])
        # Return parameters for the default action distribution, which is
        # `TorchCategorical` (due to our action space being `gym.spaces.Discrete`).
        return {Columns.ACTION_DIST_INPUTS: action_logits}
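
To have an Algorithm actually train this custom module, point your config to it through an RLModuleSpec. The following is a sketch; the exact spec class location and the model_config argument may differ slightly between Ray versions:

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.core.rl_module.rl_module import RLModuleSpec

# Plug the custom RLModule into a PPO config. CartPole-v1 has a Discrete
# action space, matching the module's categorical action logits.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rl_module(
        rl_module_spec=RLModuleSpec(
            module_class=CustomTorchRLModule,
            model_config={"hidden_dim": 64},
        ),
    )
)

ppo_w_custom_module = config.build_algo()
ppo_w_custom_module.train()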