Troubleshooting#

Frequently asked questions about Ray Serve LLM.

Frequently asked questions#

How do I use gated Hugging Face models?#

You can use runtime_env to specify the environment variables required to access the model. To retrieve the deployment options, you can use the get_deployment_options method on the LLMServer class; each deployment class has its own get_deployment_options method. For example:

from ray import serve
from ray.serve.llm import LLMConfig
from ray.serve.llm.deployment import LLMServer
from ray.serve.llm.ingress import OpenAiIngress
from ray.serve.llm.builders import build_openai_app

import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b-instruct",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g., A10G, L4, etc.)
    accelerator_type="A10G",
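    # Forward the Hugging Face token from the launching environment so that
    # each replica can authenticate when downloading the gated model.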
    runtime_env=dict(
        env_vars=dict(
            HF_TOKEN=os.environ["HF_TOKEN"]
        )
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
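
The get_deployment_options method mentioned above returns the Ray Serve deployment options that a deployment class derives from an LLMConfig. Below is a minimal sketch of inspecting those options, assuming the method accepts the config and returns a plain dict; check the API reference for your Ray version for the exact signature.

from ray.serve.llm import LLMConfig
from ray.serve.llm.deployment import LLMServer

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b-instruct",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
)

# Assumption: get_deployment_options(llm_config) returns a dict of Ray Serve
# deployment options (for example autoscaling and resource settings) derived
# from the config.
options = LLMServer.get_deployment_options(llm_config)
print(options)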

Why is downloading the model so slow?#

If you're using a model from Hugging Face, you can enable fast downloads by setting HF_HUB_ENABLE_HF_TRANSFER=1 and installing the hf_transfer package (pip install hf_transfer). For example:

from ray import serve
from ray.serve.llm import LLMConfig
from ray.serve.llm.deployment import LLMServer
from ray.serve.llm.ingress import OpenAiIngress
from ray.serve.llm.builders import build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b-instruct",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1, max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g., A10G, L4, etc.)
    accelerator_type="A10G",
    runtime_env=dict(
        env_vars=dict(
            HF_TOKEN=os.environ["HF_TOKEN"],
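            # Requires the hf_transfer package: pip install hf_transfer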
            HF_HUB_ENABLE_HF_TRANSFER="1"
        )
    ),
)

# Deploy the application
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
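
Once the app is running, you can confirm the deployment works end to end by sending a request to the OpenAI-compatible endpoint from a separate process. Below is a minimal sketch, assuming Serve is running locally on the default port 8000 and using the model_id configured above; the API key isn't validated by default, so any placeholder string works.

from openai import OpenAI

# Assumption: Serve is running locally on the default port with the app above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # matches model_id in the LLMConfig above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)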

Getting help#

If you run into an issue that this guide doesn't cover, reach out to the Ray community: ask a question on the Ray discussion forums, or file an issue on the Ray GitHub repository.

See also#