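Deploy gpt-oss#

gpt-oss is a family of open-source models designed for general-purpose language understanding and generation. The 20B parameter variant (gpt-oss-20b) offers strong reasoning capabilities with lower latency, which makes it well suited for local or specialized use cases. The larger 120B parameter variant (gpt-oss-120b) is designed for production-scale, high-reasoning workloads.

For more information, see the gpt-oss collection.

Configure Ray Serve LLM#

Ray Serve LLM provides multiple Python APIs for defining your application. Use build_openai_app to build a full application from your LLMConfig object.

Below are example configurations for both gpt-oss-20b and gpt-oss-120b, depending on your hardware and use case.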

gpt-oss-20b#

To deploy a small-sized model such as gpt-oss-20b, a single GPU is sufficient:

# serve_gpt_oss.py
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-gpt-oss",
        model_source="openai/gpt-oss-20b",
    ),
    accelerator_type="L4",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    engine_kwargs=dict(
        max_model_len=32768
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})

gpt-oss-120b#

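To deploy a medium-sized model such as gpt-oss-120b, a single node with multiple GPUs is sufficient. Set tensor_parallel_size to distribute the model's weights across the GPUs in your instance: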

# serve_gpt_oss.py
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-gpt-oss",
        model_source="openai/gpt-oss-120b",
    ),
    accelerator_type="L40S", # Or "A100-40G"
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    engine_kwargs=dict(
        max_model_len=32768,
        tensor_parallel_size=2,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})

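Note: Before moving to a production setup, migrate to a Serve config file to make your deployment version-controlled, reproducible, and easier to maintain for CI/CD pipelines. For an example, see Serving LLMs - Quickstart Examples: Production Guide.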
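
As a rough sketch of what such a config file could contain, the script below assembles the gpt-oss-20b settings from this page into a Serve-config-shaped document. The applications / args / llm_configs nesting follows the monitoring snippet later on this page. The application name gpt-oss-app, the route prefix, and the import_path value ray.serve.llm:build_openai_app are assumptions, so check them against the production guide before relying on them. The script prints JSON, which most YAML parsers also accept.

```python
import json


def build_serve_config(model_id, model_source, accelerator_type, max_model_len):
    """Return a Serve-config-shaped dict describing one LLM application."""
    llm_config = {
        "model_loading_config": {
            "model_id": model_id,
            "model_source": model_source,
        },
        "accelerator_type": accelerator_type,
        "deployment_config": {
            "autoscaling_config": {"min_replicas": 1, "max_replicas": 2},
        },
        "engine_kwargs": {"max_model_len": max_model_len},
    }
    return {
        "applications": [
            {
                "name": "gpt-oss-app",  # hypothetical application name
                "route_prefix": "/",  # assumed route prefix
                # Assumed builder path; verify against the production guide.
                "import_path": "ray.serve.llm:build_openai_app",
                "args": {"llm_configs": [llm_config]},
            }
        ]
    }


config = build_serve_config("my-gpt-oss", "openai/gpt-oss-20b", "L4", 32768)
print(json.dumps(config, indent=2))
```

Keeping the values in one function makes it easy to emit a second file for gpt-oss-120b by changing the arguments and adding tensor_parallel_size to engine_kwargs.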

Deploy locally#

Prerequisites#

Access to GPU compute.

Dependencies#

gpt-oss integration is available starting from ray>=2.49.0 and vllm==0.10.1.

pip install "ray[serve,llm]>=2.49.0"
pip install "vllm==0.10.1"

Launch the service#

Follow the instructions in Configure Ray Serve LLM according to the model size you choose, and define your app in a Python module serve_gpt_oss.py.

In a terminal, run:

serve run serve_gpt_oss:app --non-blocking

Deployment typically takes a few minutes as Ray provisions the cluster, the vLLM server starts, and Ray Serve downloads the model.
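
Because startup can take several minutes, a client script may want to wait until the endpoint answers before sending real requests. The following is a minimal polling sketch using only the standard library. It assumes the OpenAI-compatible app answers GET /v1/models once it is ready; the function names, timeout, and polling interval are illustrative choices, not part of Ray Serve.

```python
import http.client
import time
import urllib.request


def probe_models_endpoint(base_url):
    """Return True if GET <base_url>/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(base_url, timeout_s=600.0, interval_s=5.0,
                     probe=probe_models_endpoint, sleep=time.sleep):
    """Poll the endpoint until it responds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example usage (commented out so importing this module has no side effects):
# if wait_until_ready("http://localhost:8000"):
#     print("Service is ready")
```

The probe and sleep parameters are injectable so the loop can be exercised without a running server.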

Send requests#

Your endpoint is available locally at http://localhost:8000. You can use a placeholder authentication token for the OpenAI client, for example "FAKE_KEY".

Example curl#

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer FAKE_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "model": "my-gpt-oss", "messages": [{"role": "user", "content": "How many Rs in strawberry ?"}] }'

Example Python#

# client.py
from urllib.parse import urljoin
from openai import OpenAI

api_key = "FAKE_KEY"
base_url = "http://localhost:8000"

client = OpenAI(base_url=urljoin(base_url, "v1"), api_key=api_key)

# Example query
response = client.chat.completions.create(
    model="my-gpt-oss",
    messages=[
        {"role": "user", "content": "How many r's in strawberry"}
    ],
    stream=True
)

# Stream
for chunk in response:
    # Stream reasoning content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        data_reasoning = chunk.choices[0].delta.reasoning_content
        if data_reasoning:
            print(data_reasoning, end="", flush=True)
    # Later, stream the final answer
    if hasattr(chunk.choices[0].delta, "content"):
        data_content = chunk.choices[0].delta.content
        if data_content:
            print(data_content, end="", flush=True)
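
If you would rather collect the streamed output than print it as it arrives, the loop above can be wrapped in a small helper. This is a sketch under the assumption that each chunk exposes choices[0].delta with optional reasoning_content and content attributes, as in the client code above. The helper name collect_stream and the fake chunks used for the demo are hypothetical.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate streamed deltas into (reasoning_text, answer_text)."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def _fake_chunk(**delta_fields):
    """Build a stand-in object shaped like a streaming chunk (demo only)."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    _fake_chunk(reasoning_content="Count the r's. "),
    _fake_chunk(reasoning_content="s-t-r-a-w-b-e-r-r-y.", content=None),
    _fake_chunk(content="There are "),
    _fake_chunk(content="3."),
    SimpleNamespace(choices=[]),
]
reasoning, answer = collect_stream(demo_chunks)
print("reasoning:", reasoning)
print("answer:", answer)
```

With a live client, pass the response object returned by client.chat.completions.create(..., stream=True) in place of demo_chunks.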

Shutdown#

Shut down your LLM service:

serve shutdown -y

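Deploy to production with Anyscale services#

For production deployment, use Anyscale services to deploy the Ray Serve app to a dedicated cluster without modifying the code. Anyscale ensures scalability, fault tolerance, and load balancing, keeping the service resilient against node failures, high traffic, and rolling updates.

Launch the service#

Anyscale provides out-of-the-box images (anyscale/ray-llm), which come pre-loaded with Ray Serve LLM, vLLM, and all required GPU and runtime dependencies. See the Anyscale base images for details on what each image includes.

Build a minimal Dockerfile: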

FROM anyscale/ray:2.49.0-slim-py312-cu128

# C compiler for Triton’s runtime build step (vLLM V1 engine)
# https://github.com/vllm-project/vllm/issues/2997
RUN sudo apt-get update && \
    sudo apt-get install -y --no-install-recommends build-essential

RUN pip install vllm==0.10.1

Create your Anyscale service configuration in a new service.yaml file and reference the Dockerfile with containerfile:

# service.yaml
name: deploy-gpt-oss
containerfile: ./Dockerfile # Build Ray Serve LLM with vllm==0.10.1
compute_config:
  auto_select_worker_config: true 
working_dir: .
cloud:
applications:
  # Point to your app in your Python module
  - import_path: serve_gpt_oss:app

Deploy your service:

anyscale service deploy -f service.yaml

Send requests#

The anyscale service deploy command output shows both the endpoint and the authentication token:

(anyscale +3.9s) curl -H "Authorization: Bearer <YOUR-TOKEN>" <YOUR-ENDPOINT>

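You can also retrieve both from the service page in the Anyscale console. Click the **Query** button at the top. See Send requests for example requests, but make sure to use the correct endpoint and authentication token.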
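
To illustrate swapping in the deployed endpoint and token, here is a minimal standard-library sketch that builds the same chat request as the local curl example. It assumes the service exposes the same /v1/chat/completions route as the local deployment. The endpoint https://example-service.invalid is a placeholder, and the helper name build_chat_request is hypothetical.

```python
import json
import urllib.request


def build_chat_request(endpoint, token, model, prompt):
    """Build (but do not send) an authenticated chat completion request."""
    url = endpoint.rstrip("/") + "/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: replace with the endpoint and token from the deploy output.
request = build_chat_request(
    "https://example-service.invalid/",
    "<YOUR-TOKEN>",
    "my-gpt-oss",
    "How many r's in strawberry?",
)
print(request.get_method(), request.full_url)

# To actually send it once the placeholders are replaced:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```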

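Access the Serve LLM Dashboard#

See Enable LLM monitoring for instructions on enabling LLM-specific logging. To open the Ray Serve LLM Dashboard from an Anyscale service:

In the Anyscale console, go to your **Service** or **Workspace** tab.
Navigate to the **Metrics** tab.
Click **View in Grafana**, then click **Serve LLM Dashboard**.

Shutdown#

Shut down your Anyscale service: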

anyscale service terminate -n deploy-gpt-oss

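Enable LLM monitoring#

The Serve LLM Dashboard offers deep visibility into model performance, latency, and system behavior, including:

Token throughput (tokens/sec).
Latency metrics: Time To First Token (TTFT), Time Per Output Token (TPOT).
KV cache utilization.

To enable these metrics, go to your LLM config and set log_engine_metrics: true: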

applications:
- ...
  args:
    llm_configs:
      - ...
        log_engine_metrics: true

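Improve concurrency#

Ray Serve LLM uses vLLM as its backend engine, which logs the *maximum concurrency* it can support based on your configuration.

Example log for gpt-oss-20b with 1xL4: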

INFO 09-08 17:34:28 [kv_cache_utils.py:1017] Maximum concurrency for 32,768 tokens per request: 5.22x

Example log for gpt-oss-120b with 2xL40S:

INFO 09-09 00:32:32 [kv_cache_utils.py:1017] Maximum concurrency for 32,768 tokens per request: 6.18x
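
One way to read these log lines: the multiplier is the engine's estimate of how many requests fit in the KV cache at once if each one uses the full context length. The sketch below parses a log line and derives a rough token-equivalent budget by multiplying the two numbers. Treat that product as an approximation, since the exact accounting depends on how the engine lays out its cache, and requests shorter than the full context leave room for more concurrent requests. The function name and regular expression are illustrative.

```python
import re

_CONCURRENCY_PATTERN = re.compile(
    r"Maximum concurrency for ([\d,]+) tokens per request: ([\d.]+)x"
)


def parse_max_concurrency(log_line):
    """Extract (tokens_per_request, concurrency_multiplier) from a log line."""
    match = _CONCURRENCY_PATTERN.search(log_line)
    if match is None:
        raise ValueError("no maximum-concurrency entry found in log line")
    tokens_per_request = int(match.group(1).replace(",", ""))
    concurrency = float(match.group(2))
    return tokens_per_request, concurrency


line = (
    "INFO 09-08 17:34:28 [kv_cache_utils.py:1017] "
    "Maximum concurrency for 32,768 tokens per request: 5.22x"
)
tokens, concurrency = parse_max_concurrency(line)
# Rough token-equivalent budget: full-context requests times context length.
approx_budget = round(tokens * concurrency)
print(tokens, concurrency, approx_budget)
```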

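To improve concurrency for gpt-oss models, see Deploy a small-sized LLM: Improve concurrency for smaller models such as gpt-oss-20b, and Deploy a medium-sized LLM: Improve concurrency for medium-sized models such as gpt-oss-120b.

Note: Some example guides recommend using quantization to boost concurrency. gpt-oss weights are already 4-bit by default, so further quantization typically isn't applicable.

For broader guidance, also see Choose a GPU for LLM serving and Optimize performance for Ray Serve LLM.

Reasoning configuration#

You don't need a custom reasoning parser when deploying gpt-oss with Ray Serve LLM. You can access the reasoning content in the model's response directly, and you can also control the model's reasoning effort in the request.

Access reasoning output#

The reasoning content is available directly in the reasoning_content field of the response: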

response = client.chat.completions.create(
    model="my-gpt-oss",
    messages=[
        ...
    ]
)
reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

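Control reasoning effort#

gpt-oss supports three reasoning levels: low, medium, and high. The default level is medium.

You can control reasoning with the reasoning_effort request parameter: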

response = client.chat.completions.create(
    model="my-gpt-oss",
    messages=[
        {"role": "user", "content": "What are the three main touristic spots to see in Paris?"}
    ],
    reasoning_effort="low" # Or "medium", "high"
)

You can also set a level explicitly in the system prompt:

response = client.chat.completions.create(
    model="my-gpt-oss",
    messages=[
        {"role": "system", "content": "Reasoning: low. You are an AI travel assistant."},
        {"role": "user", "content": "What are the three main touristic spots to see in Paris?"}
    ]
)

Note: There's no reliable way to completely disable reasoning.
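
The two approaches above can be folded into one helper that validates the level before a request goes out. This is a sketch only: build_chat_kwargs and its parameters are hypothetical names, and the "Reasoning: <level>." prefix simply follows the system prompt example above.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_chat_kwargs(prompt, effort="medium", model="my-gpt-oss",
                      via_system_prompt=False, system_prompt=""):
    """Return keyword arguments for client.chat.completions.create()."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")

    kwargs = {"model": model}
    messages = []
    if via_system_prompt:
        # Encode the level in the system prompt, as in the example above.
        text = f"Reasoning: {effort}. {system_prompt}".strip()
        messages.append({"role": "system", "content": text})
    else:
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        # Pass the level through the reasoning_effort request parameter.
        kwargs["reasoning_effort"] = effort
    messages.append({"role": "user", "content": prompt})
    kwargs["messages"] = messages
    return kwargs


example = build_chat_kwargs(
    "What are the three main touristic spots to see in Paris?",
    effort="low",
    via_system_prompt=True,
    system_prompt="You are an AI travel assistant.",
)
print(example["messages"][0]["content"])
```

With a client from the earlier example, the call becomes client.chat.completions.create(**build_chat_kwargs(...)).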

Troubleshooting#

Can't download the vocab file#

openai_harmony.HarmonyError: error downloading or loading vocab file: failed to download or load vocab

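The openai_harmony library needs the tiktoken encoding files and tries to fetch them from OpenAI's public host. Common causes include:

A corporate firewall or proxy blocks openaipublic.blob.core.windows.net. You might need to whitelist this domain.
Intermittent network issues.
Race conditions when multiple processes try to download to the same cache. This can happen when deploying multiple models at the same time.

You can also download the tiktoken encoding files in advance and set the TIKTOKEN_ENCODINGS_BASE environment variable: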

mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings
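
The same pre-download can be scripted in Python, for example as a setup step that runs before the service starts. The sketch below uses the two URLs from the commands above, skips files that already exist, and writes each download to a per-process temporary file before renaming it, which is one way to soften the race condition described earlier. The function name is hypothetical. Setting os.environ only affects the current process and the children it starts afterwards, so the variable still has to reach the process that loads the model.

```python
import os
import urllib.request
from pathlib import Path

ENCODING_URLS = {
    "o200k_base.tiktoken":
        "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken",
    "cl100k_base.tiktoken":
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
}


def prepare_tiktoken_encodings(target_dir, fetch=urllib.request.urlretrieve):
    """Download missing encoding files and point TIKTOKEN_ENCODINGS_BASE at them."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    for filename, url in ENCODING_URLS.items():
        destination = target / filename
        if destination.exists():
            continue  # already downloaded
        # Download to a per-process temp file, then rename into place.
        tmp = target / f"{filename}.{os.getpid()}.part"
        fetch(url, str(tmp))
        os.replace(tmp, destination)
    os.environ["TIKTOKEN_ENCODINGS_BASE"] = str(target)
    return target


# Example usage (performs real downloads, so it is left commented out):
# prepare_tiktoken_encodings("tiktoken_encodings")
```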

`gpt-oss` architecture not recognized#

Value error, The checkpoint you are trying to load has model type `gpt_oss` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

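Older vLLM and Transformers versions don't register gpt_oss, which raises an error when vLLM hands off to Transformers. Upgrade to **vLLM ≥ 0.10.1** and let your package resolver, such as pip, handle the other dependencies.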

pip install -U "vllm>=0.10.1"
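
To confirm which versions are actually installed in the environment that runs the service, a quick check along these lines may help. It is a simplified sketch: the helper names are hypothetical, and the comparison only looks at the leading numeric components of each version, so pre-release and local suffixes are ignored. The minimum versions come from the Dependencies section of this page.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Leading numeric components of a version, e.g. '0.10.1' -> (0, 10, 1)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name, minimum):
    """Return a one-line status string for an installed package."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed (need >= {minimum})"
    if meets_minimum(installed, minimum):
        return f"{name} {installed}: ok"
    return f"{name} {installed}: too old, need >= {minimum}"


for package, minimum in (("ray", "2.49.0"), ("vllm", "0.10.1")):
    print(check_package(package, minimum))
```

Comparing numeric tuples rather than raw strings matters here, because a plain string comparison would rank 0.9.2 above 0.10.1.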

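Summary#

In this tutorial, you learned how to deploy gpt-oss models with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM, deploy your service on a Ray cluster, send requests, and monitor your service.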
