Episode#
RLlib stores and transports all trajectory data in the form of Episodes, in particular SingleAgentEpisode for single-agent setups and MultiAgentEpisode for multi-agent setups. The data is translated from this Episode format into tensor batches (possibly including a move to the GPU) only immediately before the neural network forward pass, by so-called connector pipelines.
Episodes are the main vehicle for storing and moving trajectory data across the different components of RLlib (for example, from EnvRunner to Learner, or from ReplayBuffer to Learner). One of the main design principles of RLlib's new API stack is that all trajectory data stays in this episodic form for as long as possible. Only right before the neural network pass do connector pipelines translate lists of Episodes into tensor batches. See the section here on Connectors and Connector pipelines for more details.
The biggest advantage of collecting and moving data in the form of complete trajectories (as opposed to tensor batches) is that it offers 360° visibility into, and full access to, the history of the RL environment. This means users can extract arbitrary pieces of information from an episode for further processing in their custom components. Think of a Transformer model that requires not only the most recent observation to compute the next action, but the entire sequence of the n most recent observations. Using get_observations(), users can easily extract this information inside their custom ConnectorV2 pipeline and add it to the neural network batch.
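As a plain-Python sketch of that idea (the helper, the padding value, and the dummy trajectory below are made up for illustration; an actual connector piece would call `episode.get_observations()` with a slice instead of slicing a list):

```python
# Hypothetical helper sketching what a custom ConnectorV2 piece could do:
# build the sequence of the n most recent observations for a Transformer,
# left-padding with zeros near the start of the episode.
def last_n_observations(observations, n, pad=0.0):
    # Take (up to) the n most recent items ...
    seq = list(observations[-n:])
    # ... and left-pad if the episode is still shorter than n timesteps.
    return [pad] * (n - len(seq)) + seq

print(last_n_observations([1.0, 2.0], n=4))            # -> [0.0, 0.0, 1.0, 2.0]
print(last_n_observations([1.0, 2.0, 3.0, 4.0, 5.0], n=4))  # -> [2.0, 3.0, 4.0, 5.0]
```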
Another advantage of Episodes over batches is their more efficient memory footprint. For example, algorithms like DQN need both the observations and the next observations in the train batch (to compute TD-error based losses), thereby duplicating an already large observation tensor. Using episode objects most of the time reduces the memory needed to a single observation trajectory, which contains all observations from reset to termination.
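The duplication argument can be sketched with plain Python lists (dummy observations; the `obs`/`new_obs` key names are only illustrative, not the exact batch format):

```python
# Sketch of the duplication argument: a DQN-style train batch stores both
# the observations and the next observations, so every observation except
# the first and last appears twice.
trajectory = [f"obs_{t}" for t in range(6)]  # one obs per timestep, reset to end

# Batch view needed for TD-error based losses:
batch = {
    "obs": trajectory[:-1],     # obs_0 .. obs_4
    "new_obs": trajectory[1:],  # obs_1 .. obs_5  <- duplicates obs_1 .. obs_4
}

# The episodic form stores each observation exactly once:
print(len(trajectory))                            # -> 6
print(len(batch["obs"]) + len(batch["new_obs"]))  # -> 10
```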
This page explains in detail how to use RLlib's Episode APIs.
SingleAgentEpisode#
This page describes the single-agent case only.
Note
The Ray team is working on a detailed page describing the multi-agent case, which will be analogous to this page, but for MultiAgentEpisode.
Creating a SingleAgentEpisode#
RLlib normally takes care of creating SingleAgentEpisode instances and moving them around, for example from EnvRunner to Learner. However, here is how to manually generate an initially empty episode and fill it with some dummy data:
from ray.rllib.env.single_agent_episode import SingleAgentEpisode
# Construct a new episode (without any data in it yet).
episode = SingleAgentEpisode()
assert len(episode) == 0
episode.add_env_reset(observation="obs_0", infos="info_0")
# Even with the initial obs/infos, the episode is still considered len=0.
assert len(episode) == 0
# Fill the episode with some fake data (5 timesteps).
for i in range(5):
    episode.add_env_step(
        observation=f"obs_{i+1}",
        action=f"act_{i}",
        reward=f"rew_{i}",
        terminated=False,
        truncated=False,
        infos=f"info_{i+1}",
    )
assert len(episode) == 5
The SingleAgentEpisode constructed and filled above should roughly look like this:
(Single-agent) Episode: The episode starts with a single observation (the "reset observation"), then continues on each timestep with a 3-tuple of (observation, action, reward). Note that because of the reset observation, every episode, at each timestep, always contains one more observation than it contains actions or rewards. Important additional properties of an Episode are its id_ (str) and its terminated/truncated (bool) flags. See below for a detailed description of the SingleAgentEpisode APIs exposed to users.
Using the getter APIs of SingleAgentEpisode#
Now that there is a SingleAgentEpisode to work with, one can explore and extract information from it through its different "getter" methods:
SingleAgentEpisode getter APIs: "getter" methods exist for all of an Episode's fields, which are observations, actions, rewards, infos, and extra_model_outputs. For simplicity, only the getters for observations, actions, and rewards are shown here. Their behavior is intuitive: they return a single item when given a single index, and a list of items (in the non-numpy'ized case; see below) when given a list of indices or an index slice.
Note that for extra_model_outputs, the getter is slightly more complicated because sub-keys exist in this data (for example: action_logp). See get_extra_model_outputs() for more information.
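The following plain-dict mock (not the actual RLlib class; the sub-key names and values below are made up) only illustrates why this getter needs a sub-key first; on a real episode you would call `episode.get_extra_model_outputs("action_logp", -1)` instead:

```python
# Minimal mock of how `extra_model_outputs` is organized: one list (or array)
# per sub-key, each holding one entry per timestep.
extra_model_outputs = {
    "action_logp": [-0.5, -0.7, -0.2],
    "state_out": [[0.1], [0.2], [0.3]],
}

def get_extra_model_outputs(key, index):
    # The getter selects the sub-key first, then indexes like the other getters.
    return extra_model_outputs[key][index]

print(get_extra_model_outputs("action_logp", -1))  # -> -0.2
```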
The following code snippet summarizes the various capabilities of the different getter methods:
# We can now access information from the episode via its getter APIs.
from ray.rllib.utils.test_utils import check
# Get the very first observation ("reset observation"). Note that a single observation
# is returned here (not a list of size 1 or a batch of size 1).
check(episode.get_observations(0), "obs_0")
# ... which is the same as using the indexing operator on the Episode's
# `observations` property:
check(episode.observations[0], "obs_0")
# You can also get several observations at once by providing a list of indices:
check(episode.get_observations([1, 2]), ["obs_1", "obs_2"])
# .. or a slice of observations by providing a python slice object:
check(episode.get_observations(slice(1, 3)), ["obs_1", "obs_2"])
# Note that when passing only a single index, a single item is returned.
# Whereas when passing a list of indices or a slice, a list of items is returned.
# Similarly for getting rewards:
# Get the last reward.
check(episode.get_rewards(-1), "rew_4")
# ... which is the same as using the slice operator on the `rewards` property:
check(episode.rewards[-1], "rew_4")
# Similarly for getting actions:
# Get the first action in the episode (single item, not batched).
# This works regardless of the action space.
check(episode.get_actions(0), "act_0")
# ... which is the same as using the indexing operator on the `actions` property:
check(episode.actions[0], "act_0")
# Finally, you can slice the entire episode using the []-operator with a slice notation:
sliced_episode = episode[3:4]
check(list(sliced_episode.observations), ["obs_3", "obs_4"])
check(list(sliced_episode.actions), ["act_3"])
check(list(sliced_episode.rewards), ["rew_3"])
Numpy'ized and non-numpy'ized Episodes#
The data in a SingleAgentEpisode can exist in two states: non-numpy'ized and numpy'ized. A non-numpy'ized episode stores its data items in plain Python lists and appends new timestep data to these lists. In a numpy'ized episode, these lists have been converted into possibly complex structures whose leafs are NumPy arrays. Note that numpy'ized doesn't necessarily mean the episode is already terminated or truncated (in the sense of the RL environment declaring the episode over or a maximum number of timesteps having been reached).
SingleAgentEpisode objects start in the non-numpy'ized state, in which data is stored in Python lists, making it very fast to append data from an ongoing episode:
# Episodes start in the non-numpy'ized state (in which data is stored
# under the hood in lists).
assert episode.is_numpy is False
# Call `to_numpy()` to convert all stored data from lists of individual (possibly
# complex) items to numpy arrays. Note that RLlib normally performs this method call,
# so users don't need to call `to_numpy()` themselves.
episode.to_numpy()
assert episode.is_numpy is True
To illustrate the difference between the data stored in a non-numpy'ized episode and the same data in a numpy'ized one, take a look at this complex-observation example, showing the exact same observation data in two episodes (one non-numpy'ized, the other numpy'ized):
Complex observations in a non-numpy'ized episode: Each individual observation is a (complex) dict matching the gymnasium environment's observation space. So far, three such observation items have been stored in the episode.
Complex observations in a numpy'ized episode: The entire observation record is a single complex dict matching the gymnasium environment's observation space. At the leafs of the structure are NDArrays holding the leaf values. Note that these NDArrays carry an extra batch dim (axis=0), whose length matches the length of the episode stored (here 3).
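The conversion can be sketched with plain NumPy (the observation keys below are hypothetical; RLlib applies the equivalent transformation over the actual observation space structure):

```python
import numpy as np

# Non-numpy'ized form: a Python list of per-timestep (complex) dict observations.
non_numpyized = [
    {"pos": [0.0, 0.0], "health": 1.0},
    {"pos": [0.1, 0.2], "health": 0.9},
    {"pos": [0.3, 0.5], "health": 0.8},
]

# Numpy'ized form: a single dict whose leafs are NumPy arrays with an extra
# batch dim (axis=0) of length 3 (the episode length).
numpyized = {
    "pos": np.array([o["pos"] for o in non_numpyized]),        # shape (3, 2)
    "health": np.array([o["health"] for o in non_numpyized]),  # shape (3,)
}
print(numpyized["pos"].shape)     # -> (3, 2)
print(numpyized["health"].shape)  # -> (3,)
```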
Episode.cut() and lookback buffers#
While collecting samples from an RL environment, the EnvRunner sometimes has to stop appending data to an ongoing (non-terminated) SingleAgentEpisode in order to return the data collected thus far. The EnvRunner then calls cut() on the SingleAgentEpisode object, which returns a new episode chunk with which to continue collecting in the next round of sampling:
# An ongoing episode (of length 5):
assert len(episode) == 5
assert episode.is_done is False
# During an `EnvRunner.sample()` rollout, when enough data has been collected into
# one or more Episodes, the `EnvRunner` calls the `cut()` method, interrupting
# the ongoing Episode and returning a new continuation chunk (with which the
# `EnvRunner` can continue collecting data during the next call to `sample()`):
continuation_episode = episode.cut()
# The length is still 5, but the length of the continuation chunk is 0.
assert len(episode) == 5
assert len(continuation_episode) == 0
# Thanks to the lookback buffer, we can still access the most recent observation
# in the continuation chunk:
check(continuation_episode.get_observations(-1), "obs_5")
Note that a "lookback" mechanism exists to allow connectors to look back into the H most recent timesteps of the cut episode from within the continuation chunk, where H is a configurable parameter.
The default lookback horizon (H) is 1. This means that after a cut(), you can still access the most recent action (get_actions(-1)), the most recent reward (get_rewards(-1)), and the two most recent observations (get_observations([-2, -1])). If you would like to access data further in the past, change this setting in your AlgorithmConfig:
from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

config = AlgorithmConfig()
# Change the lookback horizon setting, in case your connector (pipelines) need
# to access data further in the past.
config.env_runners(episode_lookback_horizon=10)
Lookback buffers and getters in more detail#
The following code demonstrates more options available to users of the SingleAgentEpisode getter APIs for accessing information further in the past (inside the episode's lookback buffer). Imagine having to write a connector piece that must add the last 5 rewards to the tensor batch that your model's forward pass uses to compute the next action:
# Construct a new episode (with some data in its lookback buffer).
episode = SingleAgentEpisode(
    observations=["o0", "o1", "o2", "o3"],
    actions=["a0", "a1", "a2"],
    rewards=[0.0, 1.0, 2.0],
    len_lookback_buffer=3,
)
# Since our lookback buffer is 3, all data already specified in the constructor should
# now be in the lookback buffer (and not be part of the `episode` chunk), meaning
# the length of `episode` should still be 0.
assert len(episode) == 0
# .. and trying to get the first reward will hence lead to an IndexError.
try:
    episode.get_rewards(0)
except IndexError:
    pass
# Get the last 3 rewards (using the lookback buffer).
check(episode.get_rewards(slice(-3, None)), [0.0, 1.0, 2.0])
# Assuming the episode actually started with `obs_0` (reset obs),
# then `obs_1` + `act_0` + reward=0.0, but your model always requires a 1D reward tensor
# of shape (5,) with the 5 most recent rewards in it.
# You could try to code for this by manually filling the missing 2 timesteps with zeros:
last_5_rewards = [0.0, 0.0] + episode.get_rewards(slice(-3, None))
# However, this will become extremely tedious, especially when moving to (possibly more
# complex) observations and actions.
# Instead, `SingleAgentEpisode` getters offer some useful options to solve this problem:
last_5_rewards = episode.get_rewards(slice(-5, None), fill=0.0)
# Note that the `fill` argument allows you to even go further back into the past, provided
# you are ok with filling timesteps that are not covered by the lookback buffer with
# a fixed value.
Another useful getter argument (besides fill) is the neg_index_as_lookback boolean. If True, negative indices are interpreted as "into the lookback buffer" rather than "from the end". This allows you to loop through a range of global timesteps while also looking back a certain number of timesteps from each of those global timesteps:
# Construct a new episode (len=3 and lookback buffer=3).
episode = SingleAgentEpisode(
    observations=[
        "o-3", "o-2", "o-1",    # <- lookback
        "o0", "o1", "o2", "o3",  # <- actual episode data
    ],
    actions=[
        "a-3", "a-2", "a-1",  # <- lookback
        "a0", "a1", "a2",     # <- actual episode data
    ],
    rewards=[
        -3.0, -2.0, -1.0,  # <- lookback
        0.0, 1.0, 2.0,     # <- actual episode data
    ],
    len_lookback_buffer=3,
)
assert len(episode) == 3
# In case you want to loop through global timesteps 0 to 2 (timesteps -3, -2, and -1
# being the lookback buffer) and at each such global timestep look 2 timesteps back,
# you can do so easily using the `neg_index_as_lookback` arg like so:
for global_ts in [0, 1, 2]:
    rewards = episode.get_rewards(
        slice(global_ts - 2, global_ts + 1),
        # Switch behavior of negative indices from "from-the-end" to
        # "into the lookback buffer":
        neg_index_as_lookback=True,
    )
    print(rewards)
# The expected output should be:
# [-2.0, -1.0, 0.0] # global ts=0 (plus looking back 2 ts)
# [-1.0, 0.0, 1.0] # global ts=1 (plus looking back 2 ts)
# [0.0, 1.0, 2.0] # global ts=2 (plus looking back 2 ts)