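Deploy a hybrid reasoning LLM#

A hybrid reasoning model gives you flexibility by letting you enable or disable reasoning as needed. You can use structured, step-by-step thinking for complex queries and skip it for simpler ones, balancing accuracy and efficiency according to the task.

This tutorial deploys a hybrid reasoning LLM using Ray Serve LLM.

Distinction with purely reasoning models#

Hybrid reasoning models are reasoning-capable models that let you toggle the thinking process on and off. You can enable structured, step-by-step reasoning when you need it and skip it for simpler queries to reduce latency. Purely reasoning models always apply their reasoning behavior, while hybrid models give you fine-grained control over when reasoning is used.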

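Mode	Core behavior	Use case examples	Limitation
Thinking ON	Explicit multi-step thinking process	Math, coding, logic puzzles, multi-hop QA, CoT prompting	Slower response time, uses more tokens.
Thinking OFF	Direct answer generation	Casual queries, short instructions, single-step answers	May struggle with complex reasoning or explanations.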

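Note: Reasoning often benefits from long context windows (32K up to more than 1M tokens), high token throughput, low-temperature decoding (greedy sampling), and strong instruction tuning or scratchpad-style reasoning.

To see an example of deploying a purely reasoning model like QwQ-32B, see Deploy a reasoning LLM.

Enable or disable thinking#

Some hybrid reasoning models let you turn their "thinking" mode on or off. This section explains when to use thinking mode and when to skip it, and shows how to control the setting in practice.

When to enable or disable thinking mode#

Enable thinking mode for: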

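Complex, multi-step tasks that require reasoning, such as math, physics, or logic problems.
Ambiguous queries or situations with incomplete information.
Planning, workflow orchestration, or cases where the model needs to act as an "agent" coordinating other tools or models.
Analyzing intricate data, images, or charts.
In-depth code reviews or evaluating the outputs of other AI systems (the LLM-as-judge approach).

Disable thinking mode for:

Simple, well-defined, or routine tasks.
Cases where low latency and fast responses are the priority.
Repetitive, straightforward steps within a larger automated workflow.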

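How to enable or disable thinking mode#

How you toggle thinking mode varies by model and framework. Consult the model's documentation to see how it structures and controls thinking.

For example, to control reasoning in Qwen-3, you can:

Add "/think" or "/no_think" to the prompt.
Set enable_thinking in the request: extra_body={"chat_template_kwargs": {"enable_thinking": ...}}.

For practical examples, see Send request with thinking enabled or Send request with thinking disabled.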
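
The two Qwen-3 options above can be sketched as a small request builder. This is a minimal illustration, not part of Ray Serve LLM: the function name build_chat_request and its use_soft_switch parameter are hypothetical, and the returned dictionary is assumed to be passed as keyword arguments to the OpenAI client's chat.completions.create().

```python
from typing import Any


def build_chat_request(model: str, prompt: str, enable_thinking: bool,
                       use_soft_switch: bool = False) -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create().

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    if use_soft_switch:
        # Option 1: append the Qwen-3 tag to the prompt text.
        tag = "/think" if enable_thinking else "/no_think"
        return {
            "model": model,
            "messages": [{"role": "user", "content": f"{prompt} {tag}"}],
        }
    # Option 2: forward enable_thinking through extra_body.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }
```

With an OpenAI client pointed at the deployment, a call would look like client.chat.completions.create(**build_chat_request("my-qwen-3-32b", "What's the capital of France?", enable_thinking=False)).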

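Parse reasoning outputs#

In thinking mode, hybrid models often separate reasoning from the final answer using tags like <think>...</think>. Without a proper parser, this reasoning may end up in the content field instead of the dedicated reasoning_content field.

To ensure that Ray Serve LLM parses the reasoning output correctly, configure a reasoning_parser in your Ray Serve LLM deployment. This tells vLLM how to isolate the model's thought process from the rest of the output.
Note: For example, Qwen-3 uses the qwen3 parser. See the vLLM documentation or your model's documentation to find a supported parser, or build your own if needed.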

applications:
- ...
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-qwen-3-32b
          model_source: Qwen/Qwen3-32B
        ...
        engine_kwargs:
          ...
          reasoning_parser: qwen3 # <-- for Qwen-3 models

For a complete example, see Configure Ray Serve LLM.

Example response
When using a reasoning parser, the response is typically structured like this:

ChatCompletionMessage(
    content="The temperature is...",
    ...,
    reasoning_content="Okay, the user is asking for the temperature today and tomorrow..."
)

You can extract the content and the reasoning like this:

response = client.chat.completions.create(
  ...
)

print(f"Content: {response.choices[0].message.content}")
print(f"Reasoning: {response.choices[0].message.reasoning_content}")
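
If no reasoning parser is configured, the reasoning may arrive inside content wrapped in <think>...</think> tags. The sketch below shows one way to separate the two on the client side. It assumes a single <think>...</think> block in the Qwen-3 style; the helper name split_reasoning is hypothetical, and configuring reasoning_parser on the server remains the approach this tutorial describes.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) from text that may embed <think> tags."""
    match = _THINK_RE.search(content)
    if match is None:
        # No tags found: treat the whole text as the answer.
        return "", content.strip()
    reasoning = match.group(1).strip()
    # Remove the tagged block and keep everything around it as the answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning(
    "<think>\n7.8 > 7.11\n</think>\n\n7.8 is greater."
)
print(f"Reasoning: {reasoning}")
print(f"Answer: {answer}")
```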

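Configure Ray Serve LLM#

Set your Hugging Face token in the config file to access gated models.

Ray Serve LLM provides multiple Python APIs for defining your application. Use build_openai_app to build a full application from your LLMConfig object.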

Set tensor_parallel_size to distribute the model's weights across the GPUs on the node. The following configuration uses 4 GPUs.

# serve_qwen_3_32b.py
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen-3-32b",
        model_source="Qwen/Qwen3-32B",
    ),
    accelerator_type="L40S", # Or "A100-40G"
    deployment_config=dict(
        autoscaling_config=dict(
            # Increase number of replicas for higher throughput/concurrency.
            min_replicas=1,
            max_replicas=2,
        )
    ),
    ### Uncomment if your model is gated and needs your Hugging Face token to access it.
    # runtime_env=dict(env_vars={"HF_TOKEN": os.environ.get("HF_TOKEN")}),
    engine_kwargs=dict(
        # 4 GPUs is enough but you can increase tensor_parallel_size to fit larger models.
        tensor_parallel_size=4, max_model_len=32768, reasoning_parser="qwen3"
    ),
)
app = build_openai_app({"llm_configs": [llm_config]})
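
As a rough sanity check on the tensor_parallel_size choice, you can estimate the share of model weights each GPU holds. This is a back-of-the-envelope sketch only: it assumes about 32 billion parameters (inferred from the model name) stored at 2 bytes each (16-bit weights), and it ignores the KV cache, activations, and runtime overhead, all of which need additional memory. The function name is hypothetical.

```python
def weights_gib_per_gpu(num_params_billion: float, bytes_per_param: int,
                        tensor_parallel_size: int) -> float:
    """Estimate the weight memory per GPU in GiB under tensor parallelism."""
    total_bytes = num_params_billion * 1e9 * bytes_per_param
    return total_bytes / tensor_parallel_size / 2**30


# Assumed values: ~32B parameters, 2 bytes per parameter, 4 GPUs.
per_gpu = weights_gib_per_gpu(32, 2, 4)
print(f"{per_gpu:.1f} GiB")  # prints "14.9 GiB"
```

Compare the result with the memory of your accelerator and leave headroom for the KV cache, which grows with max_model_len and the number of concurrent requests.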

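Note: Before moving to production, migrate your settings to a Serve config file so that your deployment is version-controlled, reproducible, and easier to maintain in CI/CD pipelines. For an example, see Serving LLMs - Quickstart Examples: Production Guide.

Deploy locally#

Prerequisites

Access to GPU compute.
(Optional) A Hugging Face token if you use a gated model. Set it as an environment variable with export HF_TOKEN=<YOUR-TOKEN-HERE>.

Note: Depending on the organization, you can usually request access on the model's Hugging Face page. For example, approval for Meta's Llama models can take anywhere from a few hours to several weeks.

Dependencies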

pip install "ray[serve,llm]"

Launch#

Follow the instructions in Configure Ray Serve LLM to define your app in a Python module named serve_qwen_3_32b.py.

In a terminal, run:

serve run serve_qwen_3_32b:app --non-blocking

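Deployment typically takes a few minutes because the cluster is provisioned, the vLLM server starts, and the model is downloaded.

Your endpoint is available locally at http://localhost:8000, and you can use a placeholder authentication token for the OpenAI client, for example "FAKE_KEY".

Use the model_id defined in your config (here, my-qwen-3-32b) to query your model. The following examples show how to send requests to a Qwen-3 deployment with thinking enabled or disabled.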
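
Because startup can take several minutes, a client script may want to wait until the endpoint responds before sending requests. The following is a minimal polling sketch; wait_until_ready and its probe argument are hypothetical names, and the probe is whatever check you choose to supply.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 600.0,
                     interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Connection errors are expected while the server is starting.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

One possible probe, assuming the deployment serves the OpenAI-compatible model listing route, is lambda: any(m.id == "my-qwen-3-32b" for m in client.models.list()), where client is an OpenAI client pointed at the local endpoint.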

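Send request with thinking disabled#

You can disable thinking in Qwen-3 either by adding a /no_think tag to the prompt or by forwarding enable_thinking: False to the vLLM inference engine.

Example curl with /no_think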

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer FAKE_KEY" \
  -d '{ "model": "my-qwen-3-32b", "messages": [{"role": "user", "content": "What is greater between 7.8 and 7.11 ? /no_think"}] }'

Example Python with enable_thinking: False

#client_thinking_disabled.py
from urllib.parse import urljoin
from openai import OpenAI

API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

# Example: simple query with thinking disabled
response = client.chat.completions.create(
    model="my-qwen-3-32b",
    messages=[
        {"role": "user", "content": "What's the capital of France ?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

print(f"Reasoning: \n{response.choices[0].message.reasoning_content}\n\n")
print(f"Answer: \n {response.choices[0].message.content}")

Notice that reasoning_content is empty here. Note: Depending on your model's documentation, empty can mean None, an empty string, or even empty tags "<think></think>".
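
Since "empty" can take several forms, client code may want to normalize them before checking whether the model produced any reasoning. The helper below is a sketch under that assumption; the name normalize_reasoning is hypothetical, and the three cases it handles are the ones listed above.

```python
import re
from typing import Optional


def normalize_reasoning(reasoning: Optional[str]) -> str:
    """Map None, blank strings, and empty <think></think> tags to ""."""
    if reasoning is None:
        return ""
    stripped = reasoning.strip()
    # Treat a tag pair with only whitespace inside as no reasoning.
    if re.fullmatch(r"<think>\s*</think>", stripped):
        return ""
    return stripped


print(repr(normalize_reasoning(None)))
print(repr(normalize_reasoning("<think>\n\n</think>")))
```

After normalization, a plain truthiness check such as if normalize_reasoning(message.reasoning_content): covers all three cases.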

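Send request with thinking enabled#

You can enable thinking in Qwen-3 either by adding a /think tag to the prompt or by forwarding enable_thinking: True to the vLLM inference engine.

Example curl with /think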

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer FAKE_KEY" \
  -d '{ "model": "my-qwen-3-32b", "messages": [{"role": "user", "content": "What is greater between 7.8 and 7.11 ? /think"}] }'

Example Python with enable_thinking: True

#client_thinking_enabled.py
from urllib.parse import urljoin
from openai import OpenAI

API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

# Example: query with thinking enabled
response = client.chat.completions.create(
    model="my-qwen-3-32b",
    messages=[
        {"role": "user", "content": "What's the capital of France ?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

print(f"Reasoning: \n{response.choices[0].message.reasoning_content}\n\n")
print(f"Answer: \n {response.choices[0].message.content}")

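If you configure a valid reasoning parser, the reasoning output should appear in the reasoning_content field of the response message. Otherwise, it may be included in the main content field, typically wrapped in <think>...</think> tags. See Parse reasoning outputs for more information.

Shutdown#

Shut down your LLM service: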

serve shutdown -y

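Deploy to production with Anyscale services#

For production, use Anyscale services to deploy your Ray Serve app on a dedicated cluster without any code changes. Anyscale provides scalability, fault tolerance, and load balancing, which keeps the service resilient against node failures, high traffic, and rolling updates. For an example with a medium-sized model like the Qwen3-32B used in this tutorial, see Deploy to production with Anyscale services.

Stream reasoning content#

In thinking mode, hybrid reasoning models may take longer to begin generating the main content. You can stream the intermediate reasoning output in the same way as the main content.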

#client_streaming.py
from urllib.parse import urljoin
from openai import OpenAI

API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

# Example: Complex query with thinking process
response = client.chat.completions.create(
    model="my-qwen-3-32b",
    messages=[
        {"role": "user", "content": "I need to plan a trip to Paris from Seattle. Can you help me research flight costs, create an itinerary for 3 days, and suggest restaurants based on my dietary restrictions (vegetarian)?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    stream=True
)

# Stream 
for chunk in response:
    # Stream reasoning content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        data_reasoning = chunk.choices[0].delta.reasoning_content
        if data_reasoning:
            print(data_reasoning, end="", flush=True)
    # Later, stream the final answer
    if hasattr(chunk.choices[0].delta, "content"):
        data_content = chunk.choices[0].delta.content
        if data_content:
            print(data_content, end="", flush=True)
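
If you also need the full reasoning and answer after streaming finishes, you can accumulate the deltas into two separate buffers. The sketch below uses stand-in chunk objects built with SimpleNamespace so that it runs without a server; collect_stream and _fake_chunk are hypothetical names, and the chunk shape (choices[0].delta with optional reasoning_content and content) is assumed to match the streaming loop above.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> tuple[str, str]:
    """Accumulate streamed deltas into (reasoning, answer) strings."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            # Skip chunks that carry no choices.
            continue
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)


def _fake_chunk(**delta_fields):
    # Stand-in for a streamed chunk, used only for this illustration.
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk(reasoning_content="Compare "),
    _fake_chunk(reasoning_content="decimals."),
    SimpleNamespace(choices=[]),
    _fake_chunk(content="7.8 "),
    _fake_chunk(content="is greater."),
]
reasoning, answer = collect_stream(fake_stream)
print(f"Reasoning: {reasoning}")
print(f"Answer: {answer}")
```

In a real client, pass the object returned by client.chat.completions.create(..., stream=True) instead of fake_stream.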

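Summary#

In this tutorial, you deployed a hybrid reasoning LLM with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM with the right reasoning parser, deploy your service on your Ray cluster, send requests, and parse the reasoning output in the response.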