Episode#

RLlib stores and transports all trajectory data in the form of Episodes, in particular SingleAgentEpisode for single-agent setups and MultiAgentEpisode for multi-agent setups. So-called connector pipelines translate this Episode format into tensor batches (possibly including a move to the GPU) only immediately before a neural network forward pass.

../_images/usage_of_episodes.svg

Episodes are the main vehicle for storing and transporting trajectory data across the different components of RLlib (for example, from EnvRunner to Learner, or from ReplayBuffer to Learner). One of the main design principles of RLlib's new API stack is that all trajectory data remains in this episodic form for as long as possible. Only immediately before the neural network passes do connector pipelines translate lists of Episodes into tensor batches. See the section on Connectors and Connector pipelines here for more details.#

The biggest advantage of collecting and moving data as whole trajectories (as opposed to tensor batches) is that it offers 360° visibility into, and full access to, the RL environment's history. This means users can extract arbitrary pieces of information from an episode for further processing by their custom components. Think of a transformer model that requires not just the latest observation to compute the next action, but the entire sequence of the last n observations. Using get_observations(), users can easily extract this information inside their custom ConnectorV2 pipelines and add it to the neural network batch.
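As a plain-Python sketch of that idea (not RLlib's actual connector code), a custom connector piece could take the observations pulled from an episode, keep the most recent n of them, and left-pad with a fill value while the episode is still shorter than n, before stacking them into a model batch:

```python
# Sketch only: `last_n_observations` is a hypothetical helper, not an RLlib API.
def last_n_observations(observations, n, fill=0.0):
    """Return the most recent `n` items, left-padded with `fill` if too short."""
    tail = observations[-n:]
    return [fill] * (n - len(tail)) + tail

# An ongoing episode that has only produced 3 observations so far:
obs = [0.1, 0.2, 0.3]
print(last_n_observations(obs, 5))  # -> [0.0, 0.0, 0.1, 0.2, 0.3]
```

In an actual ConnectorV2 piece you would obtain the source list via `episode.get_observations()` (which, as shown later on this page, can itself do the padding through its `fill` argument).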

Another advantage of episodes over batches is their smaller memory footprint. For example, algorithms such as DQN need both the observations and the next observations in the train batch (to compute the TD-error based loss), thereby duplicating an already large observation tensor. Keeping data in episode objects most of the time reduces the memory requirement to a single observation trajectory, containing all observations from reset to termination.
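A back-of-the-envelope sketch of this memory argument (illustrative numbers, not RLlib internals): a DQN-style train batch stores an `obs` and a `next_obs` entry for every timestep, while an episode stores each observation exactly once, plus the reset observation.

```python
# Illustrative numbers only (assumed trajectory length and observation size).
T = 1000                  # timesteps in one trajectory
obs_bytes = 84 * 84 * 4   # e.g. one stacked uint8 image frame

batch_obs_count = 2 * T       # `obs` column + `next_obs` column
episode_obs_count = T + 1     # reset obs + one obs per timestep

print(batch_obs_count, episode_obs_count)  # -> 2000 1001
# The batch holds almost twice the observation data of the episode.
```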

This page explains in detail how to use RLlib's Episode APIs.

SingleAgentEpisode#

This page describes the single-agent case only.

Note

The Ray team is working on a page that describes the multi-agent case in detail, analogous to this page, but for MultiAgentEpisode.

Creating a SingleAgentEpisode#

RLlib usually takes care of creating SingleAgentEpisode instances and moving them around, for example from EnvRunner to Learner. However, here is how to manually generate an initially empty episode and fill it with some dummy data:

from ray.rllib.env.single_agent_episode import SingleAgentEpisode

# Construct a new episode (without any data in it yet).
episode = SingleAgentEpisode()
assert len(episode) == 0

episode.add_env_reset(observation="obs_0", infos="info_0")
# Even with the initial obs/infos, the episode is still considered len=0.
assert len(episode) == 0

# Fill the episode with some fake data (5 timesteps).
for i in range(5):
    episode.add_env_step(
        observation=f"obs_{i+1}",
        action=f"act_{i}",
        reward=f"rew_{i}",
        terminated=False,
        truncated=False,
        infos=f"info_{i+1}",
    )
assert len(episode) == 5

The SingleAgentEpisode constructed and filled above should now look roughly like this:

../_images/sa_episode.svg

(Single-agent) Episode: The episode starts with a single observation (the "reset observation"), then continues on each timestep with a 3-tuple of (observation, action, reward). Note that because of the reset observation, every episode, at each timestep, always contains one more observation than it contains actions or rewards. Important additional properties of an Episode are its id_ (str) and its terminated/truncated (bool) flags. See further below for a detailed description of the SingleAgentEpisode APIs exposed to the user.#

Using the getter APIs of SingleAgentEpisode#

Now that there is a SingleAgentEpisode to work with, its different "getter" methods can be used to explore and extract the information stored in the episode:

../_images/sa_episode_getters.svg

SingleAgentEpisode getter APIs: "getter" methods exist for all of the episode's fields, which are observations, actions, rewards, infos, and extra_model_outputs. For simplicity, only the getters for observations, actions, and rewards are shown here. Their behavior is intuitive: they return a single item when given a single index and a list of items (in the non-numpy'ized case; see further below) when given a list of indices or a slice of indices.#

Note that for extra_model_outputs, the getter is slightly more complicated because sub-keys exist in this data (for example: action_logp). See get_extra_model_outputs() for more information.
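To illustrate the sub-key idea in plain Python (a hypothetical sketch of the storage layout, not RLlib's actual implementation of get_extra_model_outputs()): the extra model outputs can be thought of as a dict mapping each sub-key to its own per-timestep list, so a lookup first selects the sub-key and then indexes into that list.

```python
# Hypothetical layout for illustration; sub-key names like "action_logp"
# match the doc text, "state_out" is an assumed example.
extra_model_outputs = {
    "action_logp": [-0.7, -0.3, -1.2],
    "state_out": [[0.0], [0.1], [0.2]],
}

def get_extra_model_output(key, index):
    # First pick the sub-key, then index into that sub-key's per-timestep list.
    return extra_model_outputs[key][index]

print(get_extra_model_output("action_logp", -1))  # -> -1.2
```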

The following code snippet summarizes the various capabilities of the different getter methods:

# We can now access information from the episode via its getter APIs.

from ray.rllib.utils.test_utils import check

# Get the very first observation ("reset observation"). Note that a single observation
# is returned here (not a list of size 1 or a batch of size 1).
check(episode.get_observations(0), "obs_0")
# ... which is the same as using the indexing operator on the Episode's
# `observations` property:
check(episode.observations[0], "obs_0")

# You can also get several observations at once by providing a list of indices:
check(episode.get_observations([1, 2]), ["obs_1", "obs_2"])
# .. or a slice of observations by providing a python slice object:
check(episode.get_observations(slice(1, 3)), ["obs_1", "obs_2"])

# Note that when passing only a single index, a single item is returned.
# Whereas when passing a list of indices or a slice, a list of items is returned.

# Similarly for getting rewards:
# Get the last reward.
check(episode.get_rewards(-1), "rew_4")
# ... which is the same as using the slice operator on the `rewards` property:
check(episode.rewards[-1], "rew_4")

# Similarly for getting actions:
# Get the first action in the episode (single item, not batched).
# This works regardless of the action space.
check(episode.get_actions(0), "act_0")
# ... which is the same as using the indexing operator on the `actions` property:
check(episode.actions[0], "act_0")

# Finally, you can slice the entire episode using the []-operator with a slice notation:
sliced_episode = episode[3:4]
check(list(sliced_episode.observations), ["obs_3", "obs_4"])
check(list(sliced_episode.actions), ["act_3"])
check(list(sliced_episode.rewards), ["rew_3"])

Numpy'ized and non-numpy'ized Episodes#

The data in a SingleAgentEpisode can exist in two states: non-numpy'ized and numpy'ized. A non-numpy'ized episode stores its data items in plain Python lists and appends new timestep data to these lists. In a numpy'ized episode, these lists have been converted into possibly complex structures whose leaves are NumPy arrays. Note that a numpy'ized episode isn't necessarily terminated or truncated yet (in the sense that the RL environment declared the episode to be over or a maximum number of timesteps was reached).

../_images/sa_episode_non_finalized_vs_finalized.svg

SingleAgentEpisode objects start in the non-numpy'ized state, in which data is stored in Python lists, making it very fast to append data from an ongoing episode.


# Episodes start in the non-numpy'ized state (in which data is stored
# under the hood in lists).
assert episode.is_numpy is False

# Call `to_numpy()` to convert all stored data from lists of individual (possibly
# complex) items to numpy arrays. Note that RLlib normally performs this method call,
# so users don't need to call `to_numpy()` themselves.
episode.to_numpy()
assert episode.is_numpy is True

To illustrate the difference between the data stored in a non-numpy'ized episode and that in a numpy'ized one, take a look at this example of a complex observation, showing the exact same observation data in two episodes (one non-numpy'ized, the other numpy'ized):

../_images/sa_episode_non_finalized.svg

Complex observations in a non-numpy'ized episode: Each individual observation is a (complex) dict matching the gymnasium environment's observation space. So far, three such observation items have been stored in the episode.#

../_images/sa_episode_finalized.svg

Complex observations in a numpy'ized episode: The entire observation record is a single complex dict matching the gymnasium environment's observation space. At the bottom of the structure are NDArrays holding the leaf values. Note that these NDArrays have an extra batch dimension (axis=0) whose length matches the length of the episode stored (here 3).#
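A small sketch of this conversion in plain NumPy (an assumed observation structure for illustration; the real conversion happens inside to_numpy()): a list of per-timestep dicts becomes a single dict whose leaves are arrays with an added batch axis of length equal to the number of stored observations.

```python
import numpy as np

# Three per-timestep observations, each a dict (assumed structure).
per_timestep = [
    {"cam": [0.1, 0.2], "pos": 0.0},
    {"cam": [0.3, 0.4], "pos": 1.0},
    {"cam": [0.5, 0.6], "pos": 2.0},
]

# After "numpy'ization": one dict, with a batch axis (axis=0) of length 3
# on each leaf array.
numpyized = {
    "cam": np.array([o["cam"] for o in per_timestep]),  # shape (3, 2)
    "pos": np.array([o["pos"] for o in per_timestep]),  # shape (3,)
}
print(numpyized["cam"].shape)  # -> (3, 2)
```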

Episode.cut() and lookback buffers#

While collecting samples from an RL environment, the EnvRunner sometimes has to stop appending data to an ongoing (non-terminated) SingleAgentEpisode in order to return the data collected thus far. The EnvRunner then calls cut() on the SingleAgentEpisode object, which returns a new episode chunk with which to continue collecting in the next round of sampling.


# An ongoing episode (of length 5):
assert len(episode) == 5
assert episode.is_done is False

# During an `EnvRunner.sample()` rollout, when enough data has been collected into
# one or more Episodes, the `EnvRunner` calls the `cut()` method, interrupting
# the ongoing Episode and returning a new continuation chunk (with which the
# `EnvRunner` can continue collecting data during the next call to `sample()`):
continuation_episode = episode.cut()

# The length is still 5, but the length of the continuation chunk is 0.
assert len(episode) == 5
assert len(continuation_episode) == 0

# Thanks to the lookback buffer, we can still access the most recent observation
# in the continuation chunk:
check(continuation_episode.get_observations(-1), "obs_5")

Note that a "lookback" mechanism exists, allowing connectors to look back into the H previous timesteps of the cut episode from within the continuation chunk, where H is a configurable parameter.

../_images/sa_episode_cut_and_lookback.svg

The default lookback horizon (H) is 1. This means that, after a cut(), you can still access the most recent action (get_actions(-1)), the most recent reward (get_rewards(-1)), and the two most recent observations (get_observations([-2, -1])). If you need access to data further in the past, change this setting in your AlgorithmConfig:

from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

config = AlgorithmConfig()
# Change the lookback horizon setting, in case your connector (pipelines) need
# to access data further in the past.
config.env_runners(episode_lookback_horizon=10)

Lookback buffers and getters in more detail#

The following code demonstrates more options available to users of the SingleAgentEpisode getter APIs for accessing information further in the past (inside the lookback buffer). Imagine having to write a connector piece that must add the last 5 rewards to the tensor batch used by your model's forward pass to compute the next action:


# Construct a new episode (with some data in its lookback buffer).
episode = SingleAgentEpisode(
    observations=["o0", "o1", "o2", "o3"],
    actions=["a0", "a1", "a2"],
    rewards=[0.0, 1.0, 2.0],
    len_lookback_buffer=3,
)
# Since our lookback buffer is 3, all data already specified in the constructor should
# now be in the lookback buffer (and not be part of the `episode` chunk), meaning
# the length of `episode` should still be 0.
assert len(episode) == 0

# .. and trying to get the first reward will hence lead to an IndexError.
try:
    episode.get_rewards(0)
except IndexError:
    pass

# Get the last 3 rewards (using the lookback buffer).
check(episode.get_rewards(slice(-3, None)), [0.0, 1.0, 2.0])

# Assuming the episode actually started with `obs_0` (reset obs),
# then `obs_1` + `act_0` + reward=0.0, but your model always requires a 1D reward tensor
# of shape (5,) with the 5 most recent rewards in it.
# You could try to code for this by manually filling the missing 2 timesteps with zeros:
last_5_rewards = [0.0, 0.0] + episode.get_rewards(slice(-3, None))
# However, this will become extremely tedious, especially when moving to (possibly more
# complex) observations and actions.

# Instead, `SingleAgentEpisode` getters offer some useful options to solve this problem:
last_5_rewards = episode.get_rewards(slice(-5, None), fill=0.0)
# Note that the `fill` argument allows you to even go further back into the past, provided
# you are ok with filling timesteps that are not covered by the lookback buffer with
# a fixed value.

Another useful getter argument (besides fill) is the neg_index_as_lookback boolean flag. If set to True, negative indices aren't interpreted as "from the end", but as "into the lookback buffer". This lets you loop over a range of global timesteps while looking back a certain number of timesteps from each global timestep:


# Construct a new episode (len=3 and lookback buffer=3).
episode = SingleAgentEpisode(
    observations=[
        "o-3",
        "o-2",
        "o-1",  # <- lookback  # noqa
        "o0",
        "o1",
        "o2",
        "o3",  # <- actual episode data  # noqa
    ],
    actions=[
        "a-3",
        "a-2",
        "a-1",  # <- lookback  # noqa
        "a0",
        "a1",
        "a2",  # <- actual episode data  # noqa
    ],
    rewards=[
        -3.0,
        -2.0,
        -1.0,  # <- lookback  # noqa
        0.0,
        1.0,
        2.0,  # <- actual episode data  # noqa
    ],
    len_lookback_buffer=3,
)
assert len(episode) == 3

# In case you want to loop through global timesteps 0 to 2 (timesteps -3, -2, and -1
# being the lookback buffer) and at each such global timestep look 2 timesteps back,
# you can do so easily using the `neg_index_as_lookback` arg like so:
for global_ts in [0, 1, 2]:
    rewards = episode.get_rewards(
        slice(global_ts - 2, global_ts + 1),
        # Switch behavior of negative indices from "from-the-end" to
        # "into the lookback buffer":
        neg_index_as_lookback=True,
    )
    print(rewards)

# The expected output should be:
# [-2.0, -1.0, 0.0]  # global ts=0 (plus looking back 2 ts)
# [-1.0, 0.0, 1.0]   # global ts=1 (plus looking back 2 ts)
# [0.0, 1.0, 2.0]    # global ts=2 (plus looking back 2 ts)