Best practices in production#

This section helps you:

  • Understand best practices when operating Serve in production

  • Learn how to manage Serve with the Serve CLI

  • Configure HTTP requests when querying Serve

CLI best practices#

This section summarizes the best practices for deploying to production using the Serve CLI:

  • Use serve run to manually test and improve your Serve application locally.

  • Use serve build to create a Serve config file for your Serve application.

    • For development, put your Serve application's code in a remote repository and manually configure the working_dir or py_modules fields in your Serve config file's runtime_env to point to that repository.

    • For production, put your Serve application's code in a custom Docker image instead of a runtime_env. See this tutorial to learn how to create custom Docker images and deploy them on KubeRay.

  • Use serve status to track your Serve application's health and deployment progress. See the monitoring guide for more info.

  • Use serve config to check the latest config that your Serve application received. This is its goal state. See the monitoring guide for more info.

  • Make lightweight configuration updates (for example, num_replicas or user_config) by modifying your Serve config file and redeploying it with serve deploy, as sketched below.
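
For reference, here's a minimal sketch of what such a config file might look like. Only the field names follow the Serve config schema; the application name, import path, and user_config contents are hypothetical placeholders:

# Hypothetical Serve config file, e.g. generated by `serve build`
# and then edited by hand.
applications:
  - name: my_app                  # hypothetical application name
    import_path: my_module:app    # hypothetical "module:variable" import path
    deployments:
      - name: SlowDeployment
        num_replicas: 2           # lightweight update: edit and redeploy
        user_config:              # lightweight update: passed to reconfigure()
          threshold: 0.5          # hypothetical user_config key

After editing fields such as num_replicas or user_config, reapply the file with serve deploy config.yaml; Serve applies the changes to the running application.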

Client HTTP requests#

Most examples in these docs use Python's requests library to send simple GET or POST requests, for example:

import requests

response = requests.get("http://localhost:8000/")
result = response.text

This pattern is useful for prototyping, but it isn't sufficient for production. In production, HTTP requests should use:

  • Retries: Requests may occasionally fail due to transient issues (for example, slow network, node failure, power outage, spike in traffic, and so on). Retry failed requests a handful of times to account for these issues.

  • Exponential backoff: To avoid bombarding the Serve application with retries during a transient error, apply exponential backoff on failure. Each retry should wait exponentially longer than the previous one before running. For example, the first retry might wait 0.1 s after a failure, and subsequent retries wait 0.4 s (4 x 0.1), 1.6 s, 6.4 s, 25.6 s, and so on after their respective failures.

  • Timeouts: Add a timeout to each retry to prevent requests from hanging. The timeout should be longer than your application's latency to give the application enough time to process requests. Additionally, set an end-to-end timeout in the Serve application itself, so slow requests don't bottleneck replicas; see the config snippet after the example below.

import requests
from requests.adapters import HTTPAdapter, Retry

session = requests.Session()

retries = Retry(
    total=5,  # 5 retries total
    backoff_factor=1,  # Exponential backoff
    status_forcelist=[  # Retry on server errors
        500,
        501,
        502,
        503,
        504,
    ],
)

session.mount("http://", HTTPAdapter(max_retries=retries))

response = session.get("http://localhost:8000/", timeout=10)  # Add timeout
result = response.text
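
The timeout argument above bounds each client-side attempt. To also set the end-to-end timeout mentioned earlier, use the request_timeout_s field in the http_options section of the Serve config, which applies cluster-wide. A minimal sketch; the 10-second value is an arbitrary example:

# Serve config snippet: end-to-end timeout for HTTP requests, in seconds.
http_options:
  request_timeout_s: 10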

Load shedding#

When a request is sent to the cluster, it's first received by a Serve proxy, which uses a DeploymentHandle to forward it to a replica for handling. Each replica can handle up to a configurable number of requests at a time; set this limit with the max_ongoing_requests option. If all replicas are busy and can't accept more requests, new requests queue up in the DeploymentHandle until a replica becomes available.

Under high load, DeploymentHandle queues can grow, causing high tail latencies and excessive load on the system. To avoid instability, it's often better to intentionally reject some requests so these queues can't grow without bound. This technique is called "load shedding," and it lets the system handle excess load gracefully without blowing up tail latencies or overloading components to the point of failure.

Configure load shedding for your Ray Serve deployments with the max_queued_requests parameter of the @serve.deployment decorator. It controls the maximum number of requests each DeploymentHandle (including the Serve proxy) will queue. Once the limit is reached, enqueueing any new request immediately raises a BackPressureError, and HTTP requests return a 503 (Service Unavailable) status code.
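
When you call a deployment through a DeploymentHandle yourself, for example from an upstream deployment, you can catch this error to degrade gracefully instead of letting queues grow. The following is a minimal sketch, not part of the original example: it assumes a Ray Serve version where BackPressureError is importable from ray.serve.exceptions, and the Upstream and Downstream deployments and the fallback message are hypothetical:

from ray import serve
from ray.serve.exceptions import BackPressureError
from ray.serve.handle import DeploymentHandle

@serve.deployment(max_ongoing_requests=2, max_queued_requests=2)
class Downstream:
    async def __call__(self) -> str:
        return "ok"

@serve.deployment
class Upstream:
    def __init__(self, downstream: DeploymentHandle):
        self._downstream = downstream

    async def __call__(self) -> str:
        try:
            # The BackPressureError surfaces when the response is awaited.
            return await self._downstream.remote()
        except BackPressureError:
            # Shed load: return a fallback instead of queueing further.
            return "overloaded, please retry later"

app = Upstream.bind(Downstream.bind())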

The following example defines a deployment that emulates slow request handling and configures max_ongoing_requests and max_queued_requests:

import time
from ray import serve
from starlette.requests import Request

@serve.deployment(
    # Each replica will be sent 2 requests at a time.
    max_ongoing_requests=2,
    # Each caller queues up to 2 requests at a time.
    # (beyond those that are sent to replicas).
    max_queued_requests=2,
)
class SlowDeployment:
    def __call__(self, request: Request) -> str:
        # Emulate a long-running request, such as ML inference.
        time.sleep(2)
        return "Hello!"

To test the behavior, send HTTP requests in parallel to simulate multiple clients. Serve accepts up to max_ongoing_requests plus max_queued_requests requests at a time and rejects any further requests with a 503, "Service Unavailable," status code.

import ray
import aiohttp

@ray.remote
class Requester:
    async def do_request(self) -> int:
        # Send one GET request to the Serve app and return its HTTP status code.
        async with aiohttp.ClientSession("http://localhost:8000/") as session:
            return (await session.get("/")).status

r = Requester.remote()
serve.run(SlowDeployment.bind())

# Send 4 requests first.
# 2 of these will be sent to the replica. These requests take a few seconds to execute.
first_refs = [r.do_request.remote() for _ in range(2)]
_, pending = ray.wait(first_refs, timeout=1)
assert len(pending) == 2
# 2 will be queued in the proxy.
queued_refs = [r.do_request.remote() for _ in range(2)]
_, pending = ray.wait(queued_refs, timeout=0.1)
assert len(pending) == 2

# Send an additional 5 requests. These will be rejected immediately because
# the replica and the proxy queue are already full.
for status_code in ray.get([r.do_request.remote() for _ in range(5)]):
    assert status_code == 503

# The accepted requests, both executing and queued, finish successfully.
for ref in first_refs + queued_refs:
    print(f"Request finished with status code {ray.get(ref)}.")

2024-02-28 11:12:22,287 INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(ProxyActor pid=21011) INFO 2024-02-28 11:12:24,088 proxy 127.0.0.1 proxy.py:1140 - Proxy actor 15b7c620e64c8c69fb45559001000000 starting on node ebc04d744a722577f3a049da12c9f83d9ba6a4d100e888e5fcfa19d9.
(ProxyActor pid=21011) INFO 2024-02-28 11:12:24,089 proxy 127.0.0.1 proxy.py:1357 - Starting HTTP server on node: ebc04d744a722577f3a049da12c9f83d9ba6a4d100e888e5fcfa19d9 listening on port 8000
(ProxyActor pid=21011) INFO:     Started server process [21011]
(ServeController pid=21008) INFO 2024-02-28 11:12:24,199 controller 21008 deployment_state.py:1614 - Deploying new version of deployment SlowDeployment in application 'default'. Setting initial target number of replicas to 1.
(ServeController pid=21008) INFO 2024-02-28 11:12:24,300 controller 21008 deployment_state.py:1924 - Adding 1 replica to deployment SlowDeployment in application 'default'.
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,141 proxy 127.0.0.1 544437ef-f53a-4991-bb37-0cda0b05cb6a / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,142 proxy 127.0.0.1 44dcebdc-5c07-4a92-b948-7843443d19cc / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,143 proxy 127.0.0.1 83b444ae-e9d6-4ac6-84b7-f127c48f6ba7 / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,144 proxy 127.0.0.1 f92b47c2-6bff-4a0d-8e5b-126d948748ea / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,145 proxy 127.0.0.1 cde44bcc-f3e7-4652-b487-f3f2077752aa / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:28,168 default_SlowDeployment 8ey9y40a e3b77013-7dc8-437b-bd52-b4839d215212 / replica.py:373 - __CALL__ OK 2007.7ms
(ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:30,175 default_SlowDeployment 8ey9y40a 601e7b0d-1cd3-426d-9318-43c2c4a57a53 / replica.py:373 - __CALL__ OK 4013.5ms
(ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:32,183 default_SlowDeployment 8ey9y40a 0655fa12-0b44-4196-8fc5-23d31ae6fcb9 / replica.py:373 - __CALL__ OK 3987.9ms
(ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:34,188 default_SlowDeployment 8ey9y40a c49dee09-8de1-4e7a-8c2f-8ce3f6d8ef34 / replica.py:373 - __CALL__ OK 3960.8ms
Request finished with status code 200.
Request finished with status code 200.
Request finished with status code 200.
Request finished with status code 200.