Data parallel attention

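Deploy LLMs with data parallel attention to increase throughput and resource utilization, especially for sparse mixture-of-experts (MoE) models.

Data parallel attention creates multiple coordinated inference engine replicas that process requests in parallel. The pattern is most effective when combined with expert parallelism for sparse MoE models: the attention (QKV) layers are replicated across replicas while the MoE experts are sharded. This separation provides: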

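Increased throughput: Process more concurrent requests by distributing them across multiple replicas.
Better resource utilization: Especially beneficial for sparse MoE models, where only a subset of experts is active for each request.
KV cache scalability: Add KV cache capacity across replicas to handle larger batch sizes.
Expert saturation: Achieve a higher effective batch size during decoding to better saturate the MoE layers.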
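The expert saturation benefit can be made concrete with a rough estimate. The sketch below assumes tokens are routed uniformly across experts (real routing is usually skewed) and that, with expert parallelism, the MoE layers see the decode tokens from all DP replicas in each step. The function name and the example numbers (256 experts, top-8 routing, 32 sequences per replica) are illustrative assumptions, not values from any particular model.

```python
def tokens_per_expert(per_replica_batch: int, dp_size: int,
                      num_experts: int, top_k: int) -> float:
    """Average decode tokens each expert receives per step.

    Assumes uniform routing: every token activates `top_k` of
    `num_experts` experts, and the MoE layers see tokens from all
    `dp_size` replicas.
    """
    global_batch = per_replica_batch * dp_size
    return global_batch * top_k / num_experts


# Hypothetical model: 256 experts, top-8 routing, 32 sequences per replica.
single = tokens_per_expert(32, dp_size=1, num_experts=256, top_k=8)
scaled = tokens_per_expert(32, dp_size=8, num_experts=256, top_k=8)
print(single, scaled)
```

Under these assumptions, each expert receives about one token per step with a single replica and about eight tokens per step with eight replicas, which is why a larger effective batch helps keep the expert layers busy.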

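When to use data parallel attention

Consider this pattern when:

Sparse MoE models with MLA: You're serving a model with Multi-head Latent Attention (MLA), where the KV cache can't be sharded along the head dimension. MLA reduces the KV cache memory footprint, which makes data parallel replication more effective.
High throughput requirements: You need to serve many concurrent requests and want to maximize throughput.
KV cache limited: Throughput improves with more KV cache capacity, and data parallel attention effectively adds KV cache capacity across replicas.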

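When not to use data parallel attention

Low to medium throughput: If you can't saturate the MoE layers, data parallel attention adds unnecessary complexity.
Non-MoE models: The main benefit is a larger effective batch size for saturating experts, which doesn't apply to dense models.
Sufficient tensor parallelism: For models with GQA (Grouped Query Attention), first use tensor parallelism (TP) to shard the KV cache, as long as TP_size <= num_kv_heads. Beyond that point, TP requires KV cache replication, and data parallel attention becomes the better choice.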
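The TP_size <= num_kv_heads rule can be expressed as a small check. This is a minimal sketch, not part of Ray Serve or vLLM: `kv_replication_factor` is a hypothetical helper, and it assumes that once the TP size exceeds the number of KV heads, each KV head is copied to roughly TP_size / num_kv_heads ranks. Treating MLA as a single latent KV head is a simplifying assumption.

```python
def kv_replication_factor(tp_size: int, num_kv_heads: int) -> int:
    """How many TP ranks hold a copy of each KV head (1 means no replication)."""
    if tp_size <= num_kv_heads:
        return 1  # KV heads can still be sharded across the TP ranks
    # Ceiling division: each head has to live on several ranks.
    return -(-tp_size // num_kv_heads)


# GQA model with 8 KV heads: TP=8 shards cleanly, TP=16 duplicates the cache.
print(kv_replication_factor(8, 8), kv_replication_factor(16, 8))
# MLA modeled as one latent KV head: every TP rank holds a full copy.
print(kv_replication_factor(8, 1))
```

A factor above 1 means extra TP ranks spend memory on duplicate KV cache, which is the point where adding data parallel replicas is usually the better use of GPUs.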

Basic deployment

The following example shows how to deploy with data parallel attention.

from ray import serve
from ray.serve.llm import LLMConfig, build_dp_openai_app

# Configure the model with data parallel settings
config = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen2.5-0.5B-Instruct"
    },
    engine_kwargs={
        "data_parallel_size": 2,  # Number of DP replicas
        "tensor_parallel_size": 1,  # TP size per replica
        # Reduced for CI compatibility
        "max_model_len": 1024,
        "max_num_seqs": 32,
    },
    experimental_configs={
        # This is a temporary required config. We will remove this in future versions.
        "dp_size_per_node": 2,  # DP replicas per node
    },
)

app = build_dp_openai_app({
    "llm_config": config
})

serve.run(app, blocking=True)

Production YAML configuration

For production deployments, use a YAML configuration file.

applications:
- name: dp_llm_app
  route_prefix: /
  import_path: ray.serve.llm:build_dp_openai_app
  args:
    llm_config:
      model_loading_config:
        model_id: Qwen/Qwen2.5-0.5B-Instruct
      engine_kwargs:
        data_parallel_size: 4
        tensor_parallel_size: 2
      experimental_configs:
        dp_size_per_node: 4

Deploy with the following command, assuming the configuration above is saved as dp_config.yaml:

serve deploy dp_config.yaml

Note

The num_replicas value in deployment_config must equal data_parallel_size in engine_kwargs. Data parallel attention deployments don't support autoscaling, because all replicas must be present and coordinated.
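The constraint in this note can be checked before deploying. The helper below is a hypothetical pre-flight check on plain dictionaries, not a Ray Serve API. It assumes the configuration uses the keys named in this guide, plus Ray Serve's `autoscaling_config` deployment option as the marker for autoscaling.

```python
def check_dp_deployment(engine_kwargs: dict, deployment_config: dict) -> int:
    """Return the DP size if the two config dicts are consistent, else raise."""
    dp_size = engine_kwargs.get("data_parallel_size", 1)
    if not isinstance(dp_size, int) or dp_size < 1:
        raise ValueError("data_parallel_size must be a positive integer")
    if "autoscaling_config" in deployment_config:
        raise ValueError("autoscaling is not supported with data parallel attention")
    # If num_replicas is omitted, assume it is derived from the DP size.
    num_replicas = deployment_config.get("num_replicas", dp_size)
    if num_replicas != dp_size:
        raise ValueError(
            f"num_replicas ({num_replicas}) must equal data_parallel_size ({dp_size})"
        )
    return dp_size


print(check_dp_deployment({"data_parallel_size": 4}, {"num_replicas": 4}))
```

Running a check like this in CI catches a mismatched replica count before the deployment reaches the cluster.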

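Configuration parameters

Required parameters

data_parallel_size: Number of data parallel replicas to create. Must be a positive integer.
dp_size_per_node: Number of data parallel replicas per node. Must be set in experimental_configs. This parameter controls how replicas are distributed across nodes. It's a temporary required setting that a future release will remove.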
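To see how these two parameters translate into hardware, the sketch below estimates the node and GPU footprint. `dp_layout` is a hypothetical helper, and it assumes that each DP replica occupies tensor_parallel_size GPUs and that replicas fill nodes in groups of dp_size_per_node.

```python
import math


def dp_layout(data_parallel_size: int, dp_size_per_node: int,
              tensor_parallel_size: int = 1) -> dict:
    """Estimate the cluster footprint of a data parallel deployment."""
    return {
        # Nodes needed to host all replicas.
        "nodes": math.ceil(data_parallel_size / dp_size_per_node),
        # GPUs a fully packed node must provide.
        "gpus_per_node": dp_size_per_node * tensor_parallel_size,
        # GPUs used across the whole deployment.
        "total_gpus": data_parallel_size * tensor_parallel_size,
    }


# Values from the production YAML example: DP=4, TP=2, 4 replicas per node.
print(dp_layout(data_parallel_size=4, dp_size_per_node=4, tensor_parallel_size=2))
# prints {'nodes': 1, 'gpus_per_node': 8, 'total_gpus': 8}
```

With those values, the deployment needs one node with 8 GPUs. Halving dp_size_per_node to 2 would spread the same 8 GPUs over two nodes with 4 GPUs each.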

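Deployment configuration

num_replicas: Must be set to data_parallel_size. Data parallel attention requires a fixed number of replicas.
placement_group_strategy: Automatically set to "STRICT_PACK" to ensure replicas are placed correctly.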

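Understanding replica coordination

In data parallel attention, all replicas work together as a single unit.

Rank assignment: Each replica receives a unique rank (0 to dp_size-1) from a coordinator.
Request distribution: Ray Serve's request router distributes requests across replicas using load balancing.
Collective operations: Replicas coordinate the collective operations the model requires (for example, all-reduce).
Synchronization: All replicas must be present and healthy for the deployment to function correctly.

The coordination overhead is minimal:

Startup: Each replica makes one RPC call to get its rank.
Runtime: No additional coordination overhead during request processing.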
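The startup handshake can be pictured with a toy model. The sketch below is not Ray Serve's implementation: `RankCoordinator` and `start_replicas` are hypothetical names, threads stand in for replicas, and a lock-protected counter stands in for the coordinator that each replica calls once.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RankCoordinator:
    """Hands out each rank in [0, dp_size) exactly once."""

    def __init__(self, dp_size: int):
        self.dp_size = dp_size
        self._next_rank = 0
        self._lock = threading.Lock()

    def assign_rank(self) -> int:
        # The lock guarantees that concurrent callers never share a rank.
        with self._lock:
            if self._next_rank >= self.dp_size:
                raise RuntimeError("all ranks are already assigned")
            rank = self._next_rank
            self._next_rank += 1
            return rank


def start_replicas(dp_size: int) -> list:
    """Simulate dp_size replicas that each request a rank once at startup."""
    coordinator = RankCoordinator(dp_size)
    with ThreadPoolExecutor(max_workers=dp_size) as pool:
        futures = [pool.submit(coordinator.assign_rank) for _ in range(dp_size)]
        return sorted(f.result() for f in futures)


print(start_replicas(4))
```

Each simulated replica makes a single call, and no rank is handed out twice, which mirrors the one-RPC-per-replica startup cost described above.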

For more details, see the Data parallel attention architecture documentation.

Test your deployment

Test with a chat completion request.

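# Assumes Ray Serve is reachable on localhost at its default HTTP port; adjust the host for a remote cluster.
curl -X POST "http://localhost:8000/v1/chat/completions" \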
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fake-key" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain data parallel attention"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

You can also test programmatically.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumes Serve runs locally on the default port
    api_key="fake-key"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[
        {"role": "user", "content": "Explain data parallel attention"}
    ],
    max_tokens=100
)

print(response.choices[0].message.content)
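A single request only exercises one replica. To see the request router spread load, it helps to send several requests at once. The sketch below is one way to do that: `fan_out` and `fake_send` are hypothetical helpers, and `fake_send` stands in for a real client call so the snippet runs without a live deployment.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(send, prompts, max_workers=8):
    """Call send(prompt) concurrently and return results in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt):
    # Stand-in for a real request. With the OpenAI client above, this could be:
    #   client.chat.completions.create(
    #       model="Qwen/Qwen2.5-0.5B-Instruct",
    #       messages=[{"role": "user", "content": prompt}],
    #       max_tokens=100,
    #   ).choices[0].message.content
    return f"echo: {prompt}"


prompts = [f"Question {i}" for i in range(4)]
for answer in fan_out(fake_send, prompts):
    print(answer)
```

Threads are sufficient here because each call spends most of its time waiting on the network. Results come back in the same order as the prompts.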

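Combining with other patterns

Data parallel + prefill-decode disaggregation

You can combine data parallel attention with prefill-decode disaggregation to scale the two phases independently while using DP within each phase. This pattern is useful when you need high throughput in both the prefill and decode phases.

The following example shows a complete, functional deployment.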

from ray import serve
from ray.serve.llm import LLMConfig, build_dp_deployment
from ray.serve.llm.deployment import PDProxyServer
from ray.serve.llm.ingress import OpenAiIngress, make_fastapi_ingress

# Configure prefill with data parallel attention
prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen2.5-0.5B-Instruct"
    },
    engine_kwargs={
        "data_parallel_size": 2,  # 2 DP replicas for prefill
        "tensor_parallel_size": 1,
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
        # Reduced for CI compatibility
        "max_model_len": 1024,
        "max_num_seqs": 32,
    },
    experimental_configs={
        "dp_size_per_node": 2,
    },
)

# Configure decode with data parallel attention
decode_config = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen2.5-0.5B-Instruct"
    },
    engine_kwargs={
        "data_parallel_size": 2,  # 2 DP replicas for decode (adjusted for 4 GPU limit)
        "tensor_parallel_size": 1,
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
        # Reduced for CI compatibility
        "max_model_len": 1024,
        "max_num_seqs": 32,
    },
    experimental_configs={
        "dp_size_per_node": 2,
    },
)

# Build prefill and decode deployments with DP
prefill_deployment = build_dp_deployment(prefill_config, name_prefix="Prefill:")
decode_deployment = build_dp_deployment(decode_config, name_prefix="Decode:")

# Create PDProxyServer to coordinate between prefill and decode
proxy_options = PDProxyServer.get_deployment_options(prefill_config, decode_config)
proxy_deployment = serve.deployment(PDProxyServer).options(**proxy_options).bind(
    prefill_server=prefill_deployment,
    decode_server=decode_deployment,
)

# Create OpenAI-compatible ingress
ingress_options = OpenAiIngress.get_deployment_options([prefill_config, decode_config])
ingress_cls = make_fastapi_ingress(OpenAiIngress)
ingress_deployment = serve.deployment(ingress_cls).options(**ingress_options).bind(
    llm_deployments=[proxy_deployment]
)

# Deploy the application
serve.run(ingress_deployment, blocking=True)

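This configuration creates:

Prefill phase: 2 data parallel replicas that process input prompts.
Decode phase: 2 data parallel replicas that generate tokens.
PDProxyServer: Coordinates requests between the prefill and decode phases.
OpenAI ingress: Provides an OpenAI-compatible API endpoint.

This allows you to:

Optimize the prefill and decode phases independently based on workload characteristics.
Use data parallel attention within each phase to increase throughput.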

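Note

This example uses 4 GPUs in total (2 for prefill, 2 for decode). Adjust the data_parallel_size values to match your available GPU resources.

Note

For this example to work, you need NIXL installed. See the Prefill/decode disaggregation guide for prerequisites and installation instructions.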

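See also

Data parallel attention - Architecture details for data parallel attention
Prefill/decode disaggregation - Guide to prefill-decode disaggregation
Serving patterns - Overview of serving patterns
Quickstart examples - Basic LLM deployment examples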