Episode#

RLlib stores and transports all trajectory data in the form of Episodes, in particular SingleAgentEpisode for single-agent setups and MultiAgentEpisode for multi-agent setups. So-called connector pipelines translate this Episode format into tensor batches (possibly including a move to the GPU) only immediately before a neural network forward pass.

../_images/usage_of_episodes.svg

Episodes are the main vehicle for storing and transporting trajectory data across the different components of RLlib (for example, from EnvRunner to Learner, or from ReplayBuffer to Learner). One of the main design principles of RLlib's new API stack is that all trajectory data remains in this episodic form for as long as possible. Only immediately before the neural network passes do connector pipelines translate lists of Episodes into tensor batches. See the section on Connectors and Connector pipelines here for more details.#

The biggest advantage of collecting and moving data as whole trajectories (as opposed to tensor batches) is that it offers 360° visibility into, and full access to, the RL environment's history. This means users can extract arbitrary pieces of information from an episode for further processing by their custom components. Think of a transformer model that requires not just the latest observation to compute the next action, but the entire sequence of the last n observations. Using get_observations(), users can easily extract this information inside their custom ConnectorV2 pipelines and add it to the neural network batch.
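As a plain-Python sketch of that idea (not RLlib's actual connector code), a custom connector piece could take the observations pulled from an episode, keep the most recent n of them, and left-pad with a fill value while the episode is still shorter than n, before stacking them into a model batch:

```python
# Sketch only: `last_n_observations` is a hypothetical helper, not an RLlib API.
def last_n_observations(observations, n, fill=0.0):
    """Return the most recent `n` items, left-padded with `fill` if too short."""
    tail = observations[-n:]
    return [fill] * (n - len(tail)) + tail

# An ongoing episode that has only produced 3 observations so far:
obs = [0.1, 0.2, 0.3]
print(last_n_observations(obs, 5))  # -> [0.0, 0.0, 0.1, 0.2, 0.3]
```

In an actual ConnectorV2 piece you would obtain the source list via `episode.get_observations()` (which, as shown later on this page, can itself do the padding through its `fill` argument).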

Another advantage of episodes over batches is their smaller memory footprint. For example, algorithms such as DQN need both the observations and the next observations in the train batch (to compute the TD-error based loss), thereby duplicating an already large observation tensor. Keeping data in episode objects most of the time reduces the memory requirement to a single observation trajectory, containing all observations from reset to termination.
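A back-of-the-envelope sketch of this memory argument (illustrative numbers, not RLlib internals): a DQN-style train batch stores an `obs` and a `next_obs` entry for every timestep, while an episode stores each observation exactly once, plus the reset observation.

```python
# Illustrative numbers only (assumed trajectory length and observation size).
T = 1000                  # timesteps in one trajectory
obs_bytes = 84 * 84 * 4   # e.g. one stacked uint8 image frame

batch_obs_count = 2 * T       # `obs` column + `next_obs` column
episode_obs_count = T + 1     # reset obs + one obs per timestep

print(batch_obs_count, episode_obs_count)  # -> 2000 1001
# The batch holds almost twice the observation data of the episode.
```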

This page explains in detail how to use RLlib's Episode APIs.

SingleAgentEpisode#

This page describes the single-agent case only.

Note

The Ray team is working on a page that describes the multi-agent case in detail, analogous to this page, but for MultiAgentEpisode.

Creating a SingleAgentEpisode#

RLlib usually takes care of creating SingleAgentEpisode instances and moving them around, for example from EnvRunner to Learner. However, here is how to manually generate an initially empty episode and fill it with some dummy data:

from ray.rllib.env.single_agent_episode import SingleAgentEpisode

# Construct a new episode (without any data in it yet).
episode = SingleAgentEpisode()
assert len(episode) == 0

episode.add_env_reset(observation="obs_0", infos="info_0")
# Even with the initial obs/infos, the episode is still considered len=0.
assert len(episode) == 0

# Fill the episode with some fake data (5 timesteps).
for i in range(5):
    episode.add_env_step(
        observation=f"obs_{i+1}",
        action=f"act_{i}",
        reward=f"rew_{i}",
        terminated=False,
        truncated=False,
        infos=f"info_{i+1}",
    )
assert len(episode) == 5

The SingleAgentEpisode constructed and filled above should now look roughly like this:

../_images/sa_episode.svg

(Single-agent) Episode: The episode starts with a single observation (the "reset observation"), then continues on each timestep with a 3-tuple of (observation, action, reward). Note that because of the reset observation, every episode, at each timestep, always contains one more observation than it contains actions or rewards. Important additional properties of an Episode are its id_ (str) and its terminated/truncated (bool) flags. See further below for a detailed description of the SingleAgentEpisode APIs exposed to the user.#

Using the getter APIs of SingleAgentEpisode#

Now that there is a SingleAgentEpisode to work with, its different "getter" methods can be used to explore and extract the information stored in the episode:

../_images/sa_episode_getters.svg

SingleAgentEpisode getter APIs: "getter" methods exist for all of the episode's fields, which are observations, actions, rewards, infos, and extra_model_outputs. For simplicity, only the getters for observations, actions, and rewards are shown here. Their behavior is intuitive: they return a single item when given a single index and a list of items (in the non-numpy'ized case; see further below) when given a list of indices or a slice of indices.#

Note that for extra_model_outputs, the getter is slightly more complicated because sub-keys exist in this data (for example: action_logp). See get_extra_model_outputs() for more information.
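To illustrate the sub-key idea in plain Python (a hypothetical sketch of the storage layout, not RLlib's actual implementation of get_extra_model_outputs()): the extra model outputs can be thought of as a dict mapping each sub-key to its own per-timestep list, so a lookup first selects the sub-key and then indexes into that list.

```python
# Hypothetical layout for illustration; sub-key names like "action_logp"
# match the doc text, "state_out" is an assumed example.
extra_model_outputs = {
    "action_logp": [-0.7, -0.3, -1.2],
    "state_out": [[0.0], [0.1], [0.2]],
}

def get_extra_model_output(key, index):
    # First pick the sub-key, then index into that sub-key's per-timestep list.
    return extra_model_outputs[key][index]

print(get_extra_model_output("action_logp", -1))  # -> -1.2
```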

The following code snippet summarizes the various capabilities of the different getter methods:

# We can now access information from the episode via its getter APIs.

from ray.rllib.utils.test_utils import check

# Get the very first observation ("reset observation"). Note that a single observation
# is returned here (not a list of size 1 or a batch of size 1).
check(episode.get_observations(0), "obs_0")
# ... which is the same as using the indexing operator on the Episode's
# `observations` property:
check(episode.observations[0], "obs_0")

# You can also get several observations at once by providing a list of indices:
check(episode.get_observations([1, 2]), ["obs_1", "obs_2"])
# .. or a slice of observations by providing a python slice object:
check(episode.get_observations(slice(1, 3)), ["obs_1", "obs_2"])

# Note that when passing only a single index, a single item is returned.
# Whereas when passing a list of indices or a slice, a list of items is returned.

# Similarly for getting rewards:
# Get the last reward.
check(episode.get_rewards(-1), "rew_4")
# ... which is the same as using the slice operator on the `rewards` property:
check(episode.rewards[-1], "rew_4")

# Similarly for getting actions:
# Get the first action in the episode (single item, not batched).
# This works regardless of the action space.
check(episode.get_actions(0), "act_0")
# ... which is the same as using the indexing operator on the `actions` property:
check(episode.actions[0], "act_0")

# Finally, you can slice the entire episode using the []-operator with a slice notation:
sliced_episode = episode[3:4]
check(list(sliced_episode.observations), ["obs_3", "obs_4"])
check(list(sliced_episode.actions), ["act_3"])
check(list(sliced_episode.rewards), ["rew_3"])

Numpy'ized and non-numpy'ized Episodes#

The data in a SingleAgentEpisode can exist in two states: non-numpy'ized and numpy'ized. A non-numpy'ized episode stores its data items in plain Python lists and appends new timestep data to these lists. In a numpy'ized episode, these lists have been converted into possibly complex structures whose leaves are NumPy arrays. Note that a numpy'ized episode isn't necessarily terminated or truncated yet (in the sense that the RL environment declared the episode to be over or a maximum number of timesteps was reached).

../_images/sa_episode_non_finalized_vs_finalized.svg

SingleAgentEpisode objects start in the non-numpy'ized state, in which data is stored in Python lists, making it very fast to append data from an ongoing episode.


# Episodes start in the non-numpy'ized state (in which data is stored
# under the hood in lists).
assert episode.is_numpy is False

# Call `to_numpy()` to convert all stored data from lists of individual (possibly
# complex) items to numpy arrays. Note that RLlib normally performs this method call,
# so users don't need to call `to_numpy()` themselves.
episode.to_numpy()
assert episode.is_numpy is True

To illustrate the difference between the data stored in a non-numpy'ized episode and that in a numpy'ized one, take a look at this example of a complex observation, showing the exact same observation data in two episodes (one non-numpy'ized, the other numpy'ized):

../_images/sa_episode_non_finalized.svg

Complex observations in a non-numpy'ized episode: Each individual observation is a (complex) dict matching the gymnasium environment's observation space. So far, three such observation items have been stored in the episode.#

../_images/sa_episode_finalized.svg

Complex observations in a numpy'ized episode: The entire observation record is a single complex dict matching the gymnasium environment's observation space. At the bottom of the structure are NDArrays holding the leaf values. Note that these NDArrays have an extra batch dimension (axis=0) whose length matches the length of the episode stored (here 3).#
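A small sketch of this conversion in plain NumPy (an assumed observation structure for illustration; the real conversion happens inside to_numpy()): a list of per-timestep dicts becomes a single dict whose leaves are arrays with an added batch axis of length equal to the number of stored observations.

```python
import numpy as np

# Three per-timestep observations, each a dict (assumed structure).
per_timestep = [
    {"cam": [0.1, 0.2], "pos": 0.0},
    {"cam": [0.3, 0.4], "pos": 1.0},
    {"cam": [0.5, 0.6], "pos": 2.0},
]

# After "numpy'ization": one dict, with a batch axis (axis=0) of length 3
# on each leaf array.
numpyized = {
    "cam": np.array([o["cam"] for o in per_timestep]),  # shape (3, 2)
    "pos": np.array([o["pos"] for o in per_timestep]),  # shape (3,)
}
print(numpyized["cam"].shape)  # -> (3, 2)
```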

Episode.cut() and lookback buffers#

While collecting samples from an RL environment, the EnvRunner sometimes has to stop appending data to an ongoing (non-terminated) SingleAgentEpisode in order to return the data collected thus far. The EnvRunner then calls cut() on the SingleAgentEpisode object, which returns a new episode chunk with which to continue collecting in the next round of sampling.


# An ongoing episode (of length 5):
assert len(episode) == 5
assert episode.is_done is False

# During an `EnvRunner.sample()` rollout, when enough data has been collected into
# one or more Episodes, the `EnvRunner` calls the `cut()` method, interrupting
# the ongoing Episode and returning a new continuation chunk (with which the
# `EnvRunner` can continue collecting data during the next call to `sample()`):
continuation_episode = episode.cut()

# The length is still 5, but the length of the continuation chunk is 0.
assert len(episode) == 5
assert len(continuation_episode) == 0

# Thanks to the lookback buffer, we can still access the most recent observation
# in the continuation chunk:
check(continuation_episode.get_observations(-1), "obs_5")

Note that a "lookback" mechanism exists, allowing connectors to look back into the H previous timesteps of the cut episode from within the continuation chunk, where H is a configurable parameter.

../_images/sa_episode_cut_and_lookback.svg

The default lookback horizon (H) is 1. This means that, after a cut(), you can still access the most recent action (get_actions(-1)), the most recent reward (get_rewards(-1)), and the two most recent observations (get_observations([-2, -1])). If you need access to data further in the past, change this setting in your AlgorithmConfig:

from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

config = AlgorithmConfig()
# Change the lookback horizon setting, in case your connector (pipelines) need
# to access data further in the past.
config.env_runners(episode_lookback_horizon=10)

Lookback buffers and getters in more detail#

The following code demonstrates more options available to users of the SingleAgentEpisode getter APIs for accessing information further in the past (inside the lookback buffer). Imagine having to write a connector piece that must add the last 5 rewards to the tensor batch used by your model's forward pass to compute the next action:


# Construct a new episode (with some data in its lookback buffer).
episode = SingleAgentEpisode(
    observations=["o0", "o1", "o2", "o3"],
    actions=["a0", "a1", "a2"],
    rewards=[0.0, 1.0, 2.0],
    len_lookback_buffer=3,
)
# Since our lookback buffer is 3, all data already specified in the constructor should
# now be in the lookback buffer (and not be part of the `episode` chunk), meaning
# the length of `episode` should still be 0.
assert len(episode) == 0

# .. and trying to get the first reward will hence lead to an IndexError.
try:
    episode.get_rewards(0)
except IndexError:
    pass

# Get the last 3 rewards (using the lookback buffer).
check(episode.get_rewards(slice(-3, None)), [0.0, 1.0, 2.0])

# Assuming the episode actually started with `obs_0` (reset obs),
# then `obs_1` + `act_0` + reward=0.0, but your model always requires a 1D reward tensor
# of shape (5,) with the 5 most recent rewards in it.
# You could try to code for this by manually filling the missing 2 timesteps with zeros:
last_5_rewards = [0.0, 0.0] + episode.get_rewards(slice(-3, None))
# However, this will become extremely tedious, especially when moving to (possibly more
# complex) observations and actions.

# Instead, `SingleAgentEpisode` getters offer some useful options to solve this problem:
last_5_rewards = episode.get_rewards(slice(-5, None), fill=0.0)
# Note that the `fill` argument allows you to even go further back into the past, provided
# you are ok with filling timesteps that are not covered by the lookback buffer with
# a fixed value.

Another useful getter argument (besides fill) is the neg_index_as_lookback boolean flag. If set to True, negative indices aren't interpreted as "from the end", but as "into the lookback buffer". This lets you loop over a range of global timesteps while looking back a certain number of timesteps from each global timestep:


# Construct a new episode (len=3 and lookback buffer=3).
episode = SingleAgentEpisode(
    observations=[
        "o-3",
        "o-2",
        "o-1",  # <- lookback  # noqa
        "o0",
        "o1",
        "o2",
        "o3",  # <- actual episode data  # noqa
    ],
    actions=[
        "a-3",
        "a-2",
        "a-1",  # <- lookback  # noqa
        "a0",
        "a1",
        "a2",  # <- actual episode data  # noqa
    ],
    rewards=[
        -3.0,
        -2.0,
        -1.0,  # <- lookback  # noqa
        0.0,
        1.0,
        2.0,  # <- actual episode data  # noqa
    ],
    len_lookback_buffer=3,
)
assert len(episode) == 3

# In case you want to loop through global timesteps 0 to 2 (timesteps -3, -2, and -1
# being the lookback buffer) and at each such global timestep look 2 timesteps back,
# you can do so easily using the `neg_index_as_lookback` arg like so:
for global_ts in [0, 1, 2]:
    rewards = episode.get_rewards(
        slice(global_ts - 2, global_ts + 1),
        # Switch behavior of negative indices from "from-the-end" to
        # "into the lookback buffer":
        neg_index_as_lookback=True,
    )
    print(rewards)

# The expected output should be:
# [-2.0, -1.0, 0.0]  # global ts=0 (plus looking back 2 ts)
# [-1.0, 0.0, 1.0]   # global ts=1 (plus looking back 2 ts)
# [0.0, 1.0, 2.0]    # global ts=2 (plus looking back 2 ts)