Best practices in production

This section helps you:

- Understand best practices when running Serve in production
- Learn how to manage Serve using the Serve CLI
- Configure your HTTP requests when querying Serve
CLI best practices

This section summarizes the best practices for deploying to production using the Serve CLI:

- Use serve run to manually test and improve your Serve application locally.
- Use serve build to create a Serve config file for your Serve application.
  - In development, put your Serve application's code in a remote repository and manually configure the working_dir or py_modules fields in your Serve config file's runtime_env to point to that repository.
  - In production, put your Serve application's code in a custom Docker image instead of a runtime_env. See this tutorial to learn how to create custom Docker images and deploy them on KubeRay.
- Use serve status to track the health and deployment progress of your Serve application. See the monitoring guide for more info.
- Use serve config to check the latest config that your Serve application received. This is its goal state. See the monitoring guide for more info.
- Make lightweight configuration updates (for example, changes to num_replicas or user_config) by modifying your Serve config file and redeploying it with serve deploy.
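As an illustration, a lightweight update might only touch the num_replicas or user_config fields in the config file before rerunning serve deploy. The application name, import path, and deployment name below are placeholders, not values from this guide:

```yaml
applications:
  - name: my_app            # Placeholder application name.
    import_path: app:graph  # Placeholder module:variable import path.
    deployments:
      - name: MyDeployment  # Placeholder deployment name.
        num_replicas: 4     # Lightweight update: scale out without redeploying code.
        user_config:        # Lightweight update: passed to the deployment's reconfigure().
          threshold: 0.5
```

Editing only these fields and redeploying avoids restarting replicas that don't need to change.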
Client-side HTTP requests

Most of the examples in these docs use Python's requests library for simple get or post requests, such as:
import requests

response = requests.get("http://localhost:8000/")
result = response.text
This pattern is useful for prototyping, but it isn't sufficient for production. In production, HTTP requests should use:

- Retries: Requests may occasionally fail due to transient issues (for example, a slow network, node failure, power outage, or spike in traffic). Retry failed requests a handful of times to account for these issues.
- Exponential backoff: To avoid bombarding the Serve application with retries during a transient error, apply exponential backoff on failure. Each retry should wait longer than the previous one before running. For example, the first retry might wait 0.1 s after a failure, with subsequent retries waiting 0.4 s (4 x 0.1), 1.6 s, 6.4 s, 25.6 s, and so on.
- Timeouts: Add a timeout to each retry to prevent requests from hanging. The timeout should be longer than the application's latency, to give your application enough time to process requests. Additionally, set an end-to-end timeout in the Serve application, so slow requests don't bottleneck replicas.
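The backoff schedule described above can be sketched as a small helper. The base delay of 0.1 s and multiplier of 4 are just the values from the example, not Serve constants:

```python
def backoff_delays(base: float, multiplier: float, max_retries: int):
    """Yield the wait time before each retry, growing exponentially."""
    delay = base
    for _ in range(max_retries):
        yield delay
        delay *= multiplier


# Waits between the example's five retries: 0.1, 0.4, 1.6, 6.4, and 25.6 seconds.
delays = list(backoff_delays(base=0.1, multiplier=4.0, max_retries=5))
```

In practice you rarely write this by hand; the requests example below gets the same behavior from the library's Retry helper.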
import requests
from requests.adapters import HTTPAdapter, Retry

session = requests.Session()

retries = Retry(
    total=5,  # 5 retries total
    backoff_factor=1,  # Exponential backoff
    status_forcelist=[  # Retry on server errors
        500,
        501,
        502,
        503,
        504,
    ],
)

session.mount("http://", HTTPAdapter(max_retries=retries))

response = session.get("http://localhost:8000/", timeout=10)  # Add timeout
result = response.text
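The same pattern can also be written without the requests adapter, for clients that can't use it. This is a minimal sketch, assuming fetch is any callable that issues the request and raises on failure; the delay values mirror the earlier example:

```python
import time


def call_with_retries(fetch, max_retries=5, base_delay=0.1, multiplier=4.0):
    """Call fetch(), retrying with exponential backoff on failure."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the error to the caller.
            time.sleep(delay)  # Exponential backoff between attempts.
            delay *= multiplier
```

For example, call_with_retries(lambda: session.get("http://localhost:8000/", timeout=10)) combines retries, backoff, and a per-attempt timeout in one call.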
Load shedding

When a request is sent to the cluster, it's first received by a Serve proxy, which forwards it to a replica for handling using a DeploymentHandle. A replica can handle at most a configured number of requests at a time; configure this number with the max_ongoing_requests option. If all replicas are busy and can't accept more requests, the request is queued in the DeploymentHandle until a replica becomes available.

Under high load, DeploymentHandle queues can grow, causing high tail latencies and overloading the system. To avoid instability, it's often preferable to intentionally reject some requests so that these queues can't grow indefinitely. This technique is called "load shedding," and it lets the system handle excess load gracefully without overloading components and causing failures.

You can configure load shedding for your Serve deployments using the max_queued_requests parameter of the @serve.deployment decorator. It controls the maximum number of requests that each DeploymentHandle (including the Serve proxies) will queue. Once the limit is reached, enqueuing any new request immediately raises a BackPressureError, and HTTP requests return a 503 status code (Service Unavailable).
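Conceptually, each handle's queue acts like a bounded buffer that rejects new work once full. This pure-Python sketch (not Serve's actual implementation) shows the load-shedding decision that produces those 503s:

```python
import queue


class BoundedRequestQueue:
    """Queue that sheds load once it holds max_queued_requests items."""

    def __init__(self, max_queued_requests: int):
        self._queue = queue.Queue(maxsize=max_queued_requests)

    def try_enqueue(self, request) -> bool:
        try:
            self._queue.put_nowait(request)
            return True  # Accepted: a replica picks it up later.
        except queue.Full:
            return False  # Shed: the proxy would answer with HTTP 503.


q = BoundedRequestQueue(max_queued_requests=2)
accepted = [q.try_enqueue(i) for i in range(3)]  # [True, True, False]
```

Rejecting at enqueue time keeps the wait bounded for requests that are accepted, at the cost of failing fast for the rest.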
The following example defines a deployment that emulates slow request handling and configures both max_ongoing_requests and max_queued_requests.
import time

from ray import serve
from starlette.requests import Request


@serve.deployment(
    # Each replica will be sent 2 requests at a time.
    max_ongoing_requests=2,
    # Each caller queues up to 2 requests at a time
    # (beyond those that are sent to replicas).
    max_queued_requests=2,
)
class SlowDeployment:
    def __call__(self, request: Request) -> str:
        # Emulate a long-running request, such as ML inference.
        time.sleep(2)
        return "Hello!"
To test this behavior, send HTTP requests in parallel to emulate multiple clients. Serve accepts up to max_ongoing_requests plus max_queued_requests requests and rejects the rest with a 503 (Service Unavailable) status code.
import ray
import aiohttp


@ray.remote
class Requester:
    async def do_request(self) -> int:
        async with aiohttp.ClientSession("http://localhost:8000/") as session:
            return (await session.get("/")).status


r = Requester.remote()
serve.run(SlowDeployment.bind())

# Send 4 requests. The first 2 are sent to the replica.
# These requests take a few seconds to execute.
first_refs = [r.do_request.remote() for _ in range(2)]
_, pending = ray.wait(first_refs, timeout=1)
assert len(pending) == 2
# The next 2 are queued in the proxy.
queued_refs = [r.do_request.remote() for _ in range(2)]
_, pending = ray.wait(queued_refs, timeout=0.1)
assert len(pending) == 2

# Send an additional 5 requests. These are rejected immediately because
# the replica and the proxy queue are already full.
for status_code in ray.get([r.do_request.remote() for _ in range(5)]):
    assert status_code == 503

# The accepted and queued requests finish successfully.
for ref in first_refs + queued_refs:
    print(f"Request finished with status code {ray.get(ref)}.")
2024-02-28 11:12:22,287 INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(ProxyActor pid=21011) INFO 2024-02-28 11:12:24,088 proxy 127.0.0.1 proxy.py:1140 - Proxy actor 15b7c620e64c8c69fb45559001000000 starting on node ebc04d744a722577f3a049da12c9f83d9ba6a4d100e888e5fcfa19d9.
(ProxyActor pid=21011) INFO 2024-02-28 11:12:24,089 proxy 127.0.0.1 proxy.py:1357 - Starting HTTP server on node: ebc04d744a722577f3a049da12c9f83d9ba6a4d100e888e5fcfa19d9 listening on port 8000
(ProxyActor pid=21011) INFO: Started server process [21011]
(ServeController pid=21008) INFO 2024-02-28 11:12:24,199 controller 21008 deployment_state.py:1614 - Deploying new version of deployment SlowDeployment in application 'default'. Setting initial target number of replicas to 1.
(ServeController pid=21008) INFO 2024-02-28 11:12:24,300 controller 21008 deployment_state.py:1924 - Adding 1 replica to deployment SlowDeployment in application 'default'.
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,141 proxy 127.0.0.1 544437ef-f53a-4991-bb37-0cda0b05cb6a / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,142 proxy 127.0.0.1 44dcebdc-5c07-4a92-b948-7843443d19cc / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,143 proxy 127.0.0.1 83b444ae-e9d6-4ac6-84b7-f127c48f6ba7 / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,144 proxy 127.0.0.1 f92b47c2-6bff-4a0d-8e5b-126d948748ea / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ProxyActor pid=21011) WARNING 2024-02-28 11:12:27,145 proxy 127.0.0.1 cde44bcc-f3e7-4652-b487-f3f2077752aa / router.py:96 - Request dropped due to backpressure (num_queued_requests=2, max_queued_requests=2).
(ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:28,168 default_SlowDeployment 8ey9y40a e3b77013-7dc8-437b-bd52-b4839d215212 / replica.py:373 - __CALL__ OK 2007.7ms
(ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:30,175 default_SlowDeployment 8ey9y40a 601e7b0d-1cd3-426d-9318-43c2c4a57a53 / replica.py:373 - __CALL__ OK 4013.5ms
(ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:32,183 default_SlowDeployment 8ey9y40a 0655fa12-0b44-4196-8fc5-23d31ae6fcb9 / replica.py:373 - __CALL__ OK 3987.9ms
(ServeReplica:default:SlowDeployment pid=21013) INFO 2024-02-28 11:12:34,188 default_SlowDeployment 8ey9y40a c49dee09-8de1-4e7a-8c2f-8ce3f6d8ef34 / replica.py:373 - __CALL__ OK 3960.8ms
Request finished with status code 200.
Request finished with status code 200.
Request finished with status code 200.
Request finished with status code 200.