Deploy LLMs with Ray Serve LLM#

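This guide walks you through deploying a large language model (LLM) with Ray Serve LLM. It covers configuring the deployment, deploying it, and interacting with the model.

The example stays compatible with the OpenAI API while using Ray Serve LLM for production-grade deployment.

For more details, see the Ray Serve LLM API documentation: https://docs.rayai.org.cn/en/latest/serve/llm/serving-llms.html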

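Anyscale-specific configuration

Note: This tutorial is optimized for the Anyscale platform. Running it on open-source Ray requires additional configuration. For example, you need to do the following manually:

Configure your Ray cluster: Set up your multi-node environment (including the head node and worker nodes) and manage resource allocation (for example, autoscaling and GPU/CPU allocation) without Anyscale's automation. For details, see the Ray cluster setup documentation: https://docs.rayai.org.cn/en/latest/cluster/getting-started.html
Manage dependencies: Install and manage dependencies on every node, because Anyscale's Docker-based dependency management isn't available. See the Ray installation guide for instructions on installing and updating Ray in your environment: https://docs.rayai.org.cn/en/latest/ray-core/handling-dependencies.html
Set up storage: Configure your own distributed or shared storage system instead of relying on Anyscale's integrated cluster storage. See the Ray cluster configuration guide for recommendations on setting up shared storage: https://docs.rayai.org.cn/en/latest/train/user-guides/persistent-storage.html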

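Dependencies#

Make sure you have the right dependencies by installing the required Python packages with the following command:

pip install "ray[serve,llm]>=2.45.0"

Note: If you are on the Anyscale platform, you can use the following Docker image: anyscale/ray-llm:2.45.0-py311-cu124.

Alternatively, you can build your own Docker image on Anyscale, which may shorten workspace startup and worker node load times.

The workspace includes a Dockerfile.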

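Set up your LLM deployment#

Step 1: Configure the deployment and worker nodes#

Create an LLMConfig object that sets your model's runtime environment, loading parameters, and engine options.

In this deployment, we set accelerator_type='L4' to use L4 GPU nodes.

Because the model is large, we use tensor parallelism by setting 'tensor_parallel_size': 4, which splits the model across 4 GPUs.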
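A rough estimate shows why a single GPU isn't enough here. The sketch below assumes the weights shard evenly across GPUs and borrows two figures that appear elsewhere in this guide: about 60 GB of model files, and the 21.98 GiB of per-GPU memory with a 0.90 utilization factor reported in the vLLM startup log. The helper names are hypothetical, and GB and GiB are treated as interchangeable for this ballpark.

```python
def weights_per_gpu(total_weights_gib: float, tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU, assuming an even split across GPUs."""
    return total_weights_gib / tensor_parallel_size


def fits_on_gpu(per_gpu_gib: float, gpu_total_gib: float,
                gpu_memory_utilization: float = 0.90) -> bool:
    """Check whether the weight shard fits in the memory budget vLLM may use."""
    budget_gib = gpu_total_gib * gpu_memory_utilization
    return per_gpu_gib < budget_gib


TOTAL_WEIGHTS_GIB = 60.0  # approximate model size quoted in this guide
GPU_TOTAL_GIB = 21.98     # per-GPU memory reported in the vLLM startup log

for tp in (1, 2, 4):
    share = weights_per_gpu(TOTAL_WEIGHTS_GIB, tp)
    print(f"tensor_parallel_size={tp}: {share:.1f} GiB per GPU, "
          f"fits={fits_on_gpu(share, GPU_TOTAL_GIB)}")
```

Under these assumptions, only tensor_parallel_size=4 leaves room on each GPU. The weights are only part of the picture: the remaining memory also has to hold activations and the KV cache.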

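Loading gated Hugging Face models

Qwen models don't require a Hugging Face token, but some models (for example, the Llama 3.1 models) may require registration and access approval. To use gated Hugging Face models, see: https://docs.rayai.org.cn/en/latest/serve/llm/serving-llms.html#how-do-i-use-gated-huggingface-models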
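As a minimal sketch of what that looks like, the helper below builds a runtime environment that forwards a Hugging Face token to the replicas. It assumes the token is exported as HF_TOKEN in the shell that launches the deployment, and that LLMConfig accepts a runtime_env argument as the linked page describes. The function name is hypothetical.

```python
import os


def gated_model_runtime_env(token_env_var: str = "HF_TOKEN") -> dict:
    """Build a runtime_env dict that passes a Hugging Face token to replicas."""
    token = os.environ.get(token_env_var)
    if not token:
        # Fail early instead of letting the replica fail during model download.
        raise RuntimeError(f"Set {token_env_var} before deploying a gated model.")
    return {"env_vars": {"HF_TOKEN": token}}
```

You would then pass the result along the lines of LLMConfig(..., runtime_env=gated_model_runtime_env()). Reading the token from the environment keeps it out of notebooks and source control.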

Additional vLLM engine arguments

For more vLLM engine arguments, see: https://docs.vllm.com.cn/en/latest/serving/engine_args.html

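from ray import serve
from ray.serve.llm import LLMConfig


llm_config = LLMConfig(
    model_loading_config={
        # Hugging Face model ID; clients pass the same ID in their requests.
        'model_id': 'Qwen/Qwen2.5-32B-Instruct'
    },
    # Options forwarded to the vLLM engine.
    engine_kwargs={
        'max_num_batched_tokens': 8192,
        'max_model_len': 8192,  # maximum context length in tokens
        'max_num_seqs': 64,  # maximum number of sequences batched together
        'tensor_parallel_size': 4,  # split the model across 4 GPUs
        'trust_remote_code': True,
    },
    accelerator_type='L4',
    # Ray Serve deployment options.
    deployment_config={
        'autoscaling_config': {
            'target_ongoing_requests': 32
        },
        'max_ongoing_requests': 64,
    },
)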

INFO 05-19 09:42:05 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
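If you want to experiment with other vLLM engine arguments, one option is to layer overrides on top of the values used in this guide. The sketch below is plain Python and the helper name is hypothetical. The two overrides shown, gpu_memory_utilization and enforce_eager, are the ones the vLLM startup log suggests adjusting when CUDA graph capture runs out of memory; the value 0.85 is only an illustration.

```python
# Engine options used in this guide.
BASE_ENGINE_KWARGS = {
    "max_num_batched_tokens": 8192,
    "max_model_len": 8192,
    "max_num_seqs": 64,
    "tensor_parallel_size": 4,
    "trust_remote_code": True,
}


def with_engine_overrides(base: dict, **overrides) -> dict:
    """Return a new engine_kwargs dict with overrides applied; base is untouched."""
    return {**base, **overrides}


# Example: trade some throughput for a smaller memory footprint.
conservative_kwargs = with_engine_overrides(
    BASE_ENGINE_KWARGS,
    gpu_memory_utilization=0.85,  # illustrative value
    enforce_eager=True,           # skip CUDA graph capture
)
print(conservative_kwargs)
```

The resulting dict can be passed as engine_kwargs when constructing LLMConfig.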

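Tips for faster model download and loading#

To speed up model downloads, enable fast downloads by setting HF_HUB_ENABLE_HF_TRANSFER and installing the package with pip install hf_transfer. See: https://docs.rayai.org.cn/en/latest/serve/llm/serving-llms.html#why-is-downloading-the-model-so-slow
To speed up model loading, download the model weights to Anyscale's cluster storage, which can improve cluster startup time and scaling efficiency. For example, you can point to the Qwen model files (about 60 GB) stored at: /mnt/cluster_storage/Qwen/Qwen2.5-32B-Instruct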
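Both tips can be expressed as small configuration fragments. This is a sketch under two assumptions: that environment variables reach the replicas through a runtime_env with env_vars, and that model_loading_config accepts a model_source key pointing at a local directory. The helper names are hypothetical; check the linked documentation for the exact options your Ray version supports.

```python
import os
from typing import Optional


def fast_download_runtime_env() -> dict:
    """Runtime environment that turns on hf_transfer-based downloads."""
    return {"env_vars": {"HF_HUB_ENABLE_HF_TRANSFER": "1"}}


def build_model_loading_config(model_id: str,
                               local_dir: Optional[str] = None) -> dict:
    """Prefer pre-downloaded weights when the directory exists."""
    config = {"model_id": model_id}
    if local_dir is not None and os.path.isdir(local_dir):
        # Assumption: model_source may be a local path to the weights.
        config["model_source"] = local_dir
    return config


loading_config = build_model_loading_config(
    "Qwen/Qwen2.5-32B-Instruct",
    local_dir="/mnt/cluster_storage/Qwen/Qwen2.5-32B-Instruct",
)
print(loading_config)
print(fast_download_runtime_env())
```

Falling back to the plain model_id when the directory is missing keeps the same configuration usable on clusters without the cached weights.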

Step 2: Start Ray Serve and deploy your model#

You can build an OpenAI API-compatible application directly with build_openai_app.

from ray.serve.llm import build_openai_app

# Build and deploy the model with OpenAI api compatibility:
llm_app = build_openai_app({"llm_configs": [llm_config]})
serve.run(llm_app)

2025-05-19 09:42:09,192	INFO worker.py:1694 -- Connecting to existing Ray cluster at address: 10.0.26.188:6379...
2025-05-19 09:42:09,202	INFO worker.py:1879 -- Connected to Ray cluster. View the dashboard at https://session-xl5p5c8v2puhejgj5rjjn1g6ht.i.anyscaleuserdata.com 
2025-05-19 09:42:09,210	INFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_64fd167031b33f561d300e31010ccea98347bd4a.zip' (3.45MiB) to Ray cluster...
2025-05-19 09:42:09,225	INFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_64fd167031b33f561d300e31010ccea98347bd4a.zip'.
(ProxyActor pid=12151) INFO 2025-05-19 09:42:12,727 proxy 10.0.26.188 -- Proxy starting on node 7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f (HTTP port: 8000).
INFO 2025-05-19 09:42:12,805 serve 8988 -- Started Serve in namespace "serve".
(ServeController pid=12094) INFO 2025-05-19 09:42:12,834 controller 12094 -- Deploying new version of Deployment(name='LLMDeployment:Qwen--Qwen2_5-32B-Instruct', app='default') (initial target replicas: 1).
(ServeController pid=12094) INFO 2025-05-19 09:42:12,835 controller 12094 -- Deploying new version of Deployment(name='LLMRouter', app='default') (initial target replicas: 2).
(ProxyActor pid=12151) INFO 2025-05-19 09:42:12,772 proxy 10.0.26.188 -- Got updated endpoints: {}.
(ProxyActor pid=12151) INFO 2025-05-19 09:42:12,837 proxy 10.0.26.188 -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ServeController pid=12094) INFO 2025-05-19 09:42:12,938 controller 12094 -- Adding 1 replica to Deployment(name='LLMDeployment:Qwen--Qwen2_5-32B-Instruct', app='default').
(ServeController pid=12094) INFO 2025-05-19 09:42:12,940 controller 12094 -- Adding 2 replicas to Deployment(name='LLMRouter', app='default').
(ProxyActor pid=12151) INFO 2025-05-19 09:42:12,854 proxy 10.0.26.188 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x75df38c78890>.

(autoscaler +20s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +20s) [autoscaler] [4xA10G:48CPU-192GB] Attempting to add 1 node(s) to the cluster (increasing from 0 to 1).
(autoscaler +25s) [autoscaler] [4xA10G:48CPU-192GB] Launched 1 instances.

(ServeController pid=12094) WARNING 2025-05-19 09:42:43,001 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"CPU": 1.0}, {"GPU": 1.0, "accelerator_type:A10G": 0.001}, {"GPU": 1.0, "accelerator_type:A10G": 0.001}, {"GPU": 1.0, "accelerator_type:A10G": 0.001}, {"GPU": 1.0, "accelerator_type:A10G": 0.001}], total resources available: {}. Use `ray status` for more details.
(ServeController pid=12094) WARNING 2025-05-19 09:42:43,002 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1}, total resources available: {}. Use `ray status` for more details.
(ServeController pid=12094) WARNING 2025-05-19 09:43:13,011 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"CPU": 1.0}, {"GPU": 1.0, "accelerator_type:A10G": 0.001}, {"GPU": 1.0, "accelerator_type:A10G": 0.001}, {"GPU": 1.0, "accelerator_type:A10G": 0.001}, {"GPU": 1.0, "accelerator_type:A10G": 0.001}], total resources available: {}. Use `ray status` for more details.
(ServeController pid=12094) WARNING 2025-05-19 09:43:13,012 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1}, total resources available: {"CPU": 45.0}. Use `ray status` for more details.

(ServeReplica:default:LLMRouter pid=3173, ip=10.0.2.213) INFO 05-19 09:43:18 [__init__.py:239] Automatically detected platform cuda.

(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) WARNING 2025-05-19 09:43:18,926 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- VLLM_USE_V1 environment variable is not set, using vLLM v0 as default. Later we may switch default to use v1 once vLLM v1 is mature.
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) No cloud storage mirror configured
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:43:18,940 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- Downloading the tokenizer for Qwen/Qwen2.5-32B-Instruct
(ProxyActor pid=3181, ip=10.0.2.213) INFO 2025-05-19 09:43:19,673 proxy 10.0.2.213 -- Proxy starting on node bfba2b6c78ebe78a0517c4e46aa9a7d4229b0b17a65d9fadabc26c3e (HTTP port: 8000).
(ProxyActor pid=3181, ip=10.0.2.213) INFO 2025-05-19 09:43:19,722 proxy 10.0.2.213 -- Got updated endpoints: {Deployment(name='LLMRouter', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=3181, ip=10.0.2.213) INFO 2025-05-19 09:43:19,733 proxy 10.0.2.213 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x71013493c110>.
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) You are using a model of type qwen2 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) You are using a model of type qwen2 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.

(pid=4692, ip=10.0.2.213) INFO 05-19 09:43:27 [__init__.py:239] Automatically detected platform cuda. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.rayai.org.cn/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(_get_vllm_engine_config pid=4692, ip=10.0.2.213) INFO 05-19 09:43:36 [config.py:585] This model supports multiple tasks: {'score', 'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.

(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:43:36,631 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Getting the server ready ...

(pid=4800, ip=10.0.2.213) INFO 05-19 09:43:41 [__init__.py:239] Automatically detected platform cuda.
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:41 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2) with config: model='Qwen/Qwen2.5-32B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-32B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-32B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":64}, use_cached_outputs=True, 
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:42 [ray_utils.py:288] Ray is already initialized. Skipping Ray initialization.
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:42 [ray_utils.py:314] Using the existing placement group
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:42 [ray_distributed_executor.py:176] use_ray_spmd_worker: False

(ServeController pid=12094) WARNING 2025-05-19 09:43:43,110 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=12094) WARNING 2025-05-19 09:43:43,111 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:43:46,679 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...

(pid=4895, ip=10.0.2.213) INFO 05-19 09:43:47 [__init__.py:239] Automatically detected platform cuda.
(pid=4894, ip=10.0.2.213) INFO 05-19 09:43:47 [__init__.py:239] Automatically detected platform cuda.
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:48 [ray_distributed_executor.py:352] non_carry_over_env_vars from config: set()
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:48 [ray_distributed_executor.py:354] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USE_V1']
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:48 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /home/ray/.config/vllm/ray_non_carry_over_env_vars.json file
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:49 [cuda.py:291] Using Flash Attention backend.
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:51 [utils.py:931] Found nccl from library libnccl.so.2
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:51 [pynccl.py:69] vLLM is using nccl==2.21.5
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) WARNING 05-19 09:43:51 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:51 [shm_broadcast.py:259] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_be83b251'), local_subscribe_addr='ipc:///tmp/480e4aeb-e87f-414e-a20c-f85308fc985a', remote_subscribe_addr=None, remote_addr_ipv6=False)
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:51 [parallel_state.py:954] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:51 [model_runner.py:1110] Starting to load model Qwen/Qwen2.5-32B-Instruct...
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:43:52 [weight_utils.py:265] Using model weights format ['*.safetensors']
(pid=4893, ip=10.0.2.213) INFO 05-19 09:43:47 [__init__.py:239] Automatically detected platform cuda. [repeated 2x across cluster]

(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:43:57,728 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:44:08,780 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeController pid=12094) WARNING 2025-05-19 09:44:13,140 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=12094) WARNING 2025-05-19 09:44:13,141 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:44:19,829 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:44:30,880 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:44:41,928 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeController pid=12094) WARNING 2025-05-19 09:44:43,166 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=12094) WARNING 2025-05-19 09:44:43,166 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:44:52,976 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:45:04,024 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeController pid=12094) WARNING 2025-05-19 09:45:13,192 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=12094) WARNING 2025-05-19 09:45:13,193 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:45:15,073 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:45:26,119 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:45:37,134 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
(ServeController pid=12094) WARNING 2025-05-19 09:45:43,220 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=12094) WARNING 2025-05-19 09:45:43,220 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.

(RayWorkerWrapper pid=4895, ip=10.0.2.213) INFO 05-19 09:45:47 [weight_utils.py:281] Time spent downloading weights for Qwen/Qwen2.5-32B-Instruct: 114.670886 seconds
(RayWorkerWrapper pid=4893, ip=10.0.2.213) INFO 05-19 09:43:49 [cuda.py:291] Using Flash Attention backend. [repeated 3x across cluster]
(RayWorkerWrapper pid=4893, ip=10.0.2.213) INFO 05-19 09:43:51 [utils.py:931] Found nccl from library libnccl.so.2 [repeated 3x across cluster]
(RayWorkerWrapper pid=4893, ip=10.0.2.213) INFO 05-19 09:43:51 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 3x across cluster]
(RayWorkerWrapper pid=4893, ip=10.0.2.213) WARNING 05-19 09:43:51 [custom_all_reduce.py:137] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 3x across cluster]
(RayWorkerWrapper pid=4893, ip=10.0.2.213) INFO 05-19 09:43:51 [parallel_state.py:954] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3 [repeated 3x across cluster]
(RayWorkerWrapper pid=4893, ip=10.0.2.213) INFO 05-19 09:43:51 [model_runner.py:1110] Starting to load model Qwen/Qwen2.5-32B-Instruct... [repeated 3x across cluster]
(RayWorkerWrapper pid=4893, ip=10.0.2.213) INFO 05-19 09:43:52 [weight_utils.py:265] Using model weights format ['*.safetensors'] [repeated 3x across cluster]

Loading safetensors checkpoint shards:   0% Completed | 0/17 [00:00<?, ?it/s]
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:45:48,144 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
Loading safetensors checkpoint shards:   6% Completed | 1/17 [00:00<00:06,  2.50it/s]
Loading safetensors checkpoint shards:  12% Completed | 2/17 [00:00<00:06,  2.27it/s]
Loading safetensors checkpoint shards:  18% Completed | 3/17 [00:01<00:06,  2.22it/s]
Loading safetensors checkpoint shards:  24% Completed | 4/17 [00:01<00:05,  2.35it/s]
Loading safetensors checkpoint shards:  29% Completed | 5/17 [00:02<00:05,  2.31it/s]
Loading safetensors checkpoint shards:  35% Completed | 6/17 [00:02<00:04,  2.25it/s]
Loading safetensors checkpoint shards:  41% Completed | 7/17 [00:03<00:04,  2.22it/s]
Loading safetensors checkpoint shards:  47% Completed | 8/17 [00:03<00:04,  2.20it/s]
Loading safetensors checkpoint shards:  53% Completed | 9/17 [00:03<00:03,  2.43it/s]
Loading safetensors checkpoint shards:  59% Completed | 10/17 [00:04<00:02,  2.40it/s]
Loading safetensors checkpoint shards:  65% Completed | 11/17 [00:04<00:02,  2.32it/s]
Loading safetensors checkpoint shards:  71% Completed | 12/17 [00:05<00:02,  2.27it/s]
Loading safetensors checkpoint shards:  76% Completed | 13/17 [00:05<00:01,  2.23it/s]
Loading safetensors checkpoint shards:  82% Completed | 14/17 [00:06<00:01,  2.21it/s]
Loading safetensors checkpoint shards:  88% Completed | 15/17 [00:06<00:00,  2.20it/s]

(RayWorkerWrapper pid=4895, ip=10.0.2.213) INFO 05-19 09:45:55 [loader.py:447] Loading weights took 7.66 seconds

Loading safetensors checkpoint shards:  94% Completed | 16/17 [00:07<00:00,  2.20it/s]

(RayWorkerWrapper pid=4895, ip=10.0.2.213) INFO 05-19 09:45:55 [model_runner.py:1146] Model loading took 15.3918 GB and 123.242199 seconds

Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:07<00:00,  2.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:07<00:00,  2.26it/s]
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) 
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:45:59,195 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...

(RayWorkerWrapper pid=4895, ip=10.0.2.213) INFO 05-19 09:46:06 [worker.py:267] Memory profiling takes 10.85 seconds
(RayWorkerWrapper pid=4895, ip=10.0.2.213) INFO 05-19 09:46:06 [worker.py:267] the current vLLM instance can use total_gpu_memory (21.98GiB) x gpu_memory_utilization (0.90) = 19.78GiB
(RayWorkerWrapper pid=4895, ip=10.0.2.213) INFO 05-19 09:46:06 [worker.py:267] model weights take 15.39GiB; non_torch_memory takes 0.21GiB; PyTorch activation peak memory takes 0.72GiB; the rest of the memory reserved for KV Cache is 3.47GiB.
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:45:55 [loader.py:447] Loading weights took 7.61 seconds [repeated 3x across cluster]
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:45:55 [model_runner.py:1146] Model loading took 15.3918 GB and 123.829000 seconds [repeated 3x across cluster]
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:46:07 [executor_base.py:111] # cuda blocks: 3549, # CPU blocks: 4096
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:46:07 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 6.93x

(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:46:10,246 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Waiting for engine process ...
Capturing CUDA graph shapes:   0%|          | 0/11 [00:00<?, ?it/s]

(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:46:11 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Capturing CUDA graph shapes:   9%|▉         | 1/11 [00:00<00:07,  1.43it/s]
Capturing CUDA graph shapes:  18%|█▊        | 2/11 [00:01<00:06,  1.50it/s]
Capturing CUDA graph shapes:  27%|██▋       | 3/11 [00:01<00:05,  1.53it/s]
(ServeController pid=12094) WARNING 2025-05-19 09:46:13,234 controller 12094 -- Deployment 'LLMDeployment:Qwen--Qwen2_5-32B-Instruct' in application 'default' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=12094) WARNING 2025-05-19 09:46:13,235 controller 12094 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=12094) This may be caused by a slow __init__ or reconfigure method.
Capturing CUDA graph shapes:  36%|███▋      | 4/11 [00:02<00:04,  1.56it/s]
Capturing CUDA graph shapes:  45%|████▌     | 5/11 [00:03<00:03,  1.61it/s]
Capturing CUDA graph shapes:  55%|█████▍    | 6/11 [00:03<00:03,  1.65it/s]
Capturing CUDA graph shapes:  64%|██████▎   | 7/11 [00:04<00:02,  1.69it/s]
Capturing CUDA graph shapes:  73%|███████▎  | 8/11 [00:04<00:01,  1.73it/s]
Capturing CUDA graph shapes:  82%|████████▏ | 9/11 [00:05<00:01,  1.76it/s]
Capturing CUDA graph shapes:  91%|█████████ | 10/11 [00:05<00:00,  1.78it/s]
Capturing CUDA graph shapes: 100%|██████████| 11/11 [00:07<00:00,  1.53it/s]

(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:46:18 [model_runner.py:1570] Graph capturing finished in 7 secs, took 0.27 GiB
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:46:18 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 22.35 seconds
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:46:07 [worker.py:267] Memory profiling takes 11.01 seconds [repeated 3x across cluster]
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:46:07 [worker.py:267] the current vLLM instance can use total_gpu_memory (21.98GiB) x gpu_memory_utilization (0.90) = 19.78GiB [repeated 3x across cluster]
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:46:07 [worker.py:267] model weights take 15.39GiB; non_torch_memory takes 0.21GiB; PyTorch activation peak memory takes 0.72GiB; the rest of the memory reserved for KV Cache is 3.47GiB. [repeated 3x across cluster]
(RayWorkerWrapper pid=4893, ip=10.0.2.213) INFO 05-19 09:46:11 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. [repeated 3x across cluster]

(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:46:18,377 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- [STATUS] Server is ready.
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:46:18,377 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z -- Started vLLM engine.

(pid=13590) INFO 05-19 09:46:24 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
(RayWorkerWrapper pid=4893, ip=10.0.2.213) INFO 05-19 09:46:18 [model_runner.py:1570] Graph capturing finished in 7 secs, took 0.27 GiB [repeated 3x across cluster]

(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:46:25,011 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z d1bdf7ac-880f-4133-aebe-2c73e30cf68c -- CALL llm_config OK 192.3ms
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:46:25,016 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 1d024257-183d-4e5f-bdef-15d16a1fd9b7 -- CALL llm_config OK 196.5ms
INFO 2025-05-19 09:46:26,426 serve 8988 -- Application 'default' is ready at http://127.0.0.1:8000/.
INFO 2025-05-19 09:46:26,434 serve 8988 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7a0b8195bb90>.

DeploymentHandle(deployment='LLMRouter')
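As the log shows, the replica needs several minutes to download weights and warm up. If a separate client process has to wait for the deployment, a polling helper like the one below can help. It is a sketch that assumes the application serves the OpenAI-style model list at /v1/models on the address printed in the log. The function names and timeout values are hypothetical.

```python
import json
import time
import urllib.request


def list_model_ids(base_url: str, timeout_s: float = 5.0) -> list:
    """Fetch model IDs from the OpenAI-style /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload.get("data", [])]


def wait_until_ready(base_url: str, model_id: str, deadline_s: float = 600.0,
                     poll_s: float = 5.0, fetch=list_model_ids) -> bool:
    """Poll until model_id is listed or the deadline passes."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        try:
            if model_id in fetch(base_url):
                return True
        except (OSError, ValueError):
            # Connection refused, HTTP error, or a non-JSON body: not ready yet.
            pass
        time.sleep(poll_s)
    return False
```

For example, wait_until_ready("http://127.0.0.1:8000", "Qwen/Qwen2.5-32B-Instruct") returns True once the model is listed. The fetch parameter exists so the polling logic can be exercised without a running server.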

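Streaming chat completions with the OpenAI client#

If the deployment succeeded, you should see messages like the following:

INFO 2025-03-02 17:17:14,162 serve 61769 -- Application 'default' is ready at http://127.0.0.1:8000/.

INFO 2025-03-02 17:17:14,162 serve 61769 -- Deployed app 'default' successfully.

Note: The base URL has "v1" appended because the OpenAI client requires it.

Next, initialize an OpenAI client with this URL and an API key (although no key is currently required), then stream chat completions from the deployed model.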

from openai import OpenAI

# Initialize the client. The host and port come from the "ready at" address that Serve logs.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="fake-key")
model_id = 'Qwen/Qwen2.5-32B-Instruct'  # Must match the model_id in your deployment

# Basic chat completion with streaming
response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:49:36,349 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 4818c277-1144-44b4-a49d-2bfc8d64a11d -- Received streaming request 4818c277-1144-44b4-a49d-2bfc8d64a11d
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:49:36,359 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 4818c277-1144-44b4-a49d-2bfc8d64a11d -- Request 4818c277-1144-44b4-a49d-2bfc8d64a11d started. Prompt: <|im_start|>system
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) <|im_start|>user
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) Hello!<|im_end|>
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) <|im_start|>assistant
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) 

(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:49:36 [engine.py:310] Added request 4818c277-1144-44b4-a49d-2bfc8d64a11d.

{"asctime": "2025-05-19 09:49:36,767", "levelname": "INFO", "message": "HTTP Request: POST https://:8000/v1/chat/completions \"HTTP/1.1 200 OK\"", "filename": "_client.py", "lineno": 1025, "job_id": "02000000", "worker_id": "02000000ffffffffffffffffffffffffffffffffffffffffffffffff", "node_id": "7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f", "timestamp_ns": 1747673376767196110}

Hello!(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:49:36 [metrics.py:481] Avg prompt throughput: 4.2 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
 How can I assist you today?

(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:49:37 [engine.py:330] Aborted request 4818c277-1144-44b4-a49d-2bfc8d64a11d.

(ServeReplica:default:LLMRouter pid=3173, ip=10.0.2.213) INFO 2025-05-19 09:49:37,128 default_LLMRouter dqlghpps 4818c277-1144-44b4-a49d-2bfc8d64a11d -- POST /v1/chat/completions 200 800.7ms
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:49:37,124 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 4818c277-1144-44b4-a49d-2bfc8d64a11d -- Request 4818c277-1144-44b4-a49d-2bfc8d64a11d finished (stop). Total time: 0.7644044499999723s, Queue time: 0.0048291683197021484s, Generation+async time: 0.7595752816802701s, Input tokens: 31, Generated tokens: 10, tokens/s: 53.97753322001628, generated tokens/s: 13.165252004882019.
(ServeReplica:default:LLMDeployment:Qwen--Qwen2_5-32B-Instruct pid=3174, ip=10.0.2.213) INFO 2025-05-19 09:49:37,125 default_LLMDeployment:Qwen--Qwen2_5-32B-Instruct a4imwh9z 4818c277-1144-44b4-a49d-2bfc8d64a11d -- CALL /v1/chat/completions OK 777.8ms
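
The last "finished (stop)" line in the log above reports several timing and throughput fields at once. As a rough sanity check, the sketch below recomputes the two throughput figures from the other fields. It assumes, based only on the numbers in this one log line, that both rates are divided by the "Generation+async time" rather than by the total time. The helper name `throughput_from_log` is hypothetical and not part of Ray Serve.

```python
from typing import Tuple


def throughput_from_log(
    input_tokens: int, generated_tokens: int, generation_time_s: float
) -> Tuple[float, float]:
    """Return (total tokens/s, generated tokens/s) over the generation time."""
    total_tps = (input_tokens + generated_tokens) / generation_time_s
    generated_tps = generated_tokens / generation_time_s
    return total_tps, generated_tps


# Values copied from the "finished (stop)" log line above.
queue_time_s = 0.0048291683197021484
generation_time_s = 0.7595752816802701
total_tps, generated_tps = throughput_from_log(31, 10, generation_time_s)

# Queue time plus generation time should reproduce the logged total time.
print(f"total time: {queue_time_s + generation_time_s:.4f}s")
print(f"tokens/s: {total_tps:.2f}, generated tokens/s: {generated_tps:.2f}")
```

Under that assumption the recomputed values agree with the logged ones (about 53.98 and 13.17 tokens/s), and queue time plus generation time adds up to the logged total of about 0.7644 s.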

Shut down the service#

When you need to stop the service, run the following command:

!serve shutdown --yes

(ServeController pid=12094) INFO 2025-05-19 09:49:43,540 controller 12094 -- Removing 1 replica from Deployment(name='LLMDeployment:Qwen--Qwen2_5-32B-Instruct', app='default').
(ServeController pid=12094) INFO 2025-05-19 09:49:43,540 controller 12094 -- Removing 2 replicas from Deployment(name='LLMRouter', app='default').

(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:49:45 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) INFO 05-19 09:49:45 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.

(ServeController pid=12094) INFO 2025-05-19 09:49:45,560 controller 12094 -- Replica(id='a4imwh9z', deployment='LLMDeployment:Qwen--Qwen2_5-32B-Instruct', app='default') is stopped.
(ServeController pid=12094) INFO 2025-05-19 09:49:45,561 controller 12094 -- Replica(id='dqlghpps', deployment='LLMRouter', app='default') is stopped.
(ServeController pid=12094) INFO 2025-05-19 09:49:45,561 controller 12094 -- Replica(id='o2gy2ltw', deployment='LLMRouter', app='default') is stopped.
(_EngineBackgroundProcess pid=4800, ip=10.0.2.213) [rank0]:[W519 09:49:46.985931527 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.ac.cn/docs/stable/distributed.html#shutdown (function operator())

2025-05-19 09:49:46,647	SUCC scripts.py:772 -- Sent shutdown request; applications will be deleted asynchronously.
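
The last line of the output notes that applications are deleted asynchronously, so the endpoint may stay reachable for a short time after the command returns. The sketch below is one way to wait until nothing accepts connections on the HTTP port any more. It assumes the Serve HTTP endpoint listens on port 8000, as the request log earlier suggests, and that it stops listening once shutdown completes. Both helper names are hypothetical.

```python
import socket
import time


def is_port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_until_closed(
    host: str, port: int, deadline_s: float = 60.0, poll_s: float = 1.0
) -> bool:
    """Poll until nothing accepts connections on (host, port) or the deadline passes.

    Returns True if the port was observed closed, False otherwise.
    """
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if not is_port_open(host, port):
            return True
        time.sleep(poll_s)
    # One last check after the deadline.
    return not is_port_open(host, port)
```

For example, `wait_until_closed("localhost", 8000)` returns `True` once the port stops accepting connections, or `False` if it is still open after 60 seconds.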

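Production deployment#

For production deployments, use Anyscale Services. This lets you deploy the Ray Serve application to a dedicated cluster with built-in scalability, fault tolerance, and load balancing.

Place the deployment code in serve_llm.py, then deploy the service with the following command: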

anyscale service deploy serve_llm:llm_app --name=llm-service-qwen2p5-32B

(anyscale +1.6s) Restarting existing service 'llm-service-qwen2p5-32B'.
(anyscale +2.3s) Using workspace runtime dependencies env vars: {'HF_TOKEN': 'HF_TOKEN'}.
(anyscale +2.3s) Uploading local dir '.' to cloud storage.
(anyscale +8.1s) Service 'llm-service-qwen2p5-32B' deployed (version ID: 75vs71q8).
(anyscale +8.1s) View the service in the UI: 'https://console.anyscale.com/services/service2_ybvl7arasth81zdll29mfm1jts'
(anyscale +8.1s) Query the service once it's running using the following curl command (add the path you want to query):
(anyscale +8.1s) curl -H "Authorization: Bearer v-ysnEivLuvxo3ZITC8b7SkI0jZ1taqXk_eBprAr0TY" https://llm-service-qwen2p5-32b-xxx.xxx.x.anyscaleuserdata.com/
(autoscaler +11m55s) [autoscaler] Downscaling node i-018a78b690239bb67 (node IP: 10.0.2.213) due to node idle termination.

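Build an LLM client for chat interaction#

This section introduces a custom LLMClient class that wraps the OpenAI Python client. It supports both streaming responses (token by token) and complete message retrieval.

Note: Make sure the service is running before you continue; it can take some time to become fully available.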

from typing import Generator, Optional

from openai import OpenAI


class LLMClient:
    def __init__(self, base_url: str, api_key: Optional[str] = None, model_id: Optional[str] = None):
        # Normalize base_url so that it ends with a slash, and reject URLs
        # that include a '/routes' path: pass only the service root URL.
        if not base_url.endswith("/"):
            base_url += "/"
        if "/routes" in base_url:
            raise ValueError("base_url must be the service root URL and must not include '/routes'")

        self.model_id = model_id
        self.client = OpenAI(
            base_url=base_url + "v1",
            api_key=api_key or "NOT A REAL KEY",
        )

    def get_response_streaming(
        self,
        prompt: str,
        temperature: float = 0.01,
    ) -> Generator[str, None, None]:
        """
        Get a response from the model based on the provided prompt.
        Yields the response tokens as they are streamed.
        """
        chat_completions = self.client.chat.completions.create(
            model=self.model_id,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            stream=True
        )

        for chat in chat_completions:
            delta = chat.choices[0].delta
            if delta.content:
                yield delta.content

    def get_response(
        self,
        prompt: str,
        temperature: float = 0.01,
    ) -> str:
        """
        Get a complete response from the model based on the provided prompt.
        """
        chat_response = self.client.chat.completions.create(
            model=self.model_id,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            stream=False
        )
        return chat_response.choices[0].message.content
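
To see what the streaming loop in `get_response_streaming` does without a running service, the sketch below feeds the same filtering logic a few hand-built chunk objects. The `make_chunk` and `iter_text` helpers are hypothetical stand-ins: they mimic only the `choices[0].delta.content` attribute path that the class reads, not the full OpenAI response type.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Optional


def make_chunk(content: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object with the attribute path chunk.choices[0].delta.content."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=content))]
    )


def iter_text(chunks: Iterable[SimpleNamespace]) -> Iterator[str]:
    """Yield only the non-empty text pieces, mirroring get_response_streaming."""
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:
            yield delta.content


# Hand-built chunks: an empty first delta, two text deltas,
# and a final delta that carries no content.
fake_stream = [
    make_chunk(""),
    make_chunk("Hello!"),
    make_chunk(" How can I assist you today?"),
    make_chunk(None),
]
pieces = list(iter_text(fake_stream))
full_text = "".join(pieces)
print(pieces)
print(full_text)
```

The `if delta.content` check skips both empty strings and `None`, which is why the first and last chunks contribute nothing to the joined text.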

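Query the service#

When you deploy, the service is exposed at a publicly accessible address that you can send requests to. Copy your API_KEY and BASE_URL from the output of the anyscale service deploy command above.

Replace the BASE_URL and API_KEY placeholder values in the example code below.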
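
If you would rather not copy the values by hand, the sketch below pulls them out of the curl hint that the deploy output prints. It assumes the hint keeps the `curl -H "Authorization: Bearer <token>" <url>` shape shown earlier. The function name, the pattern, and the sample line (with a made-up token and host) are illustrative only.

```python
import re
from typing import Tuple

# Matches the curl hint printed at the end of the deploy output.
CURL_PATTERN = re.compile(
    r'curl -H "Authorization: Bearer (?P<key>[^"]+)" (?P<url>https?://\S+)'
)


def parse_service_credentials(deploy_output: str) -> Tuple[str, str]:
    """Return (base_url, api_key) extracted from the curl hint in the deploy output."""
    match = CURL_PATTERN.search(deploy_output)
    if match is None:
        raise ValueError("no curl hint found in the deploy output")
    # Drop the trailing slash; LLMClient adds it back when needed.
    return match.group("url").rstrip("/"), match.group("key")


# Placeholder output line; the token and host are made up for illustration.
sample_output = (
    '(anyscale +8.1s) curl -H "Authorization: Bearer EXAMPLE_TOKEN" '
    "https://llm-service-example.anyscaleuserdata.com/"
)
base_url, api_key = parse_service_credentials(sample_output)
print(base_url)
```

Treat the extracted key like any other secret: avoid committing it to source control or printing it in shared notebooks.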

Example: streaming response#

# Initialize the client
model_id = "Qwen/Qwen2.5-32B-Instruct"  # must match the model_id in your deployment
base_url = "https://llm-service-qwen2p5-32b-xxx.xxx.x.anyscaleuserdata.com"  # replace with your service base URL
api_key = ""  # replace with your API key


llm_client = LLMClient(
    base_url=base_url,
    api_key=api_key,
    model_id=model_id,
)


# --- Get the response with streaming ---
prompt = "what is ray?"
print("Model response (streaming):")
for token in llm_client.get_response_streaming(prompt, temperature=0.5):
    print(token, end="", flush=True)

Model response (streaming):
{"asctime": "2025-05-19 10:01:04,287", "levelname": "INFO", "message": "HTTP Request: POST https://llm-service-qwen2p5-32b-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1/chat/completions \"HTTP/1.1 200 OK\"", "filename": "_client.py", "lineno": 1025, "job_id": "02000000", "worker_id": "02000000ffffffffffffffffffffffffffffffffffffffffffffffff", "node_id": "7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f", "timestamp_ns": 1747674064287065217}
Ray is a high-performance distributed computing framework that was originally developed by researchers at the RISELab (formerly known as AMPLab) at the University of California, Berkeley. It is designed to make it easier to write and scale parallel and distributed applications in Python. Ray is particularly well-suited for machine learning, reinforcement learning, and other data-intensive computing tasks.

Ray provides several key features:

1. **Task Parallelism**: Ray allows you to define tasks that can be executed in parallel across multiple CPUs or GPUs.

2. **Actor Model**: Ray supports the actor model of concurrency, which means you can create and manage stateful objects (actors) that can be distributed across multiple nodes.

3. **Scalability**: Ray is designed to scale from a single machine to a large cluster, making it easy to distribute computation and data across multiple machines.

4. **Integration**: Ray integrates with many popular machine learning frameworks and libraries, such as TensorFlow, PyTorch, and Scikit-learn, making it a versatile tool for data scientists and machine learning engineers.

5. **Ease of Use**: Ray aims to be easy to use, with a simple API that allows you to write parallel and distributed applications without needing deep knowledge of distributed systems.

Ray is used in various industries for tasks ranging from training large machine learning models to real-time data processing and simulation. It's particularly popular in the reinforcement learning community due to its support for complex, multi-agent scenarios.
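
Streaming mainly improves perceived latency, because the first words arrive before the full answer is ready. The sketch below is one way to measure that for any iterator of text pieces, such as the generator returned by `llm_client.get_response_streaming(...)`. The helper name and the delayed stand-in generator are hypothetical, so the example runs without a service.

```python
import time
from typing import Iterable, Iterator, Tuple


def consume_with_timing(tokens: Iterable[str]) -> Tuple[str, float, float]:
    """Join streamed pieces and return (text, seconds to first piece, total seconds)."""
    start = time.perf_counter()
    first_piece_s = None
    pieces = []
    for piece in tokens:
        if first_piece_s is None:
            first_piece_s = time.perf_counter() - start
        pieces.append(piece)
    total_s = time.perf_counter() - start
    # If the stream was empty, report the total time as the time to first piece.
    if first_piece_s is None:
        first_piece_s = total_s
    return "".join(pieces), first_piece_s, total_s


def fake_stream() -> Iterator[str]:
    """Stand-in generator that emits pieces with a small delay."""
    for piece in ["Ray ", "is ", "a ", "framework."]:
        time.sleep(0.01)
        yield piece


text, ttft_s, total_s = consume_with_timing(fake_stream())
print(text)
print(f"first piece after {ttft_s:.3f}s, done after {total_s:.3f}s")
```

With a real service, pass `llm_client.get_response_streaming(prompt)` in place of `fake_stream()`; the timings then include network latency as well as model time.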

Example: non-streaming (complete response)#

prompt = "what is ray?"

# --- Get the response without streaming ---
response = llm_client.get_response(prompt, temperature=0.5)
print("\n\nModel response (non-streaming):")
print(response)

{"asctime": "2025-05-19 10:01:31,654", "levelname": "INFO", "message": "HTTP Request: POST https://llm-service-qwen2p5-32b-jgz99.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata.com/v1/chat/completions \"HTTP/1.1 200 OK\"", "filename": "_client.py", "lineno": 1025, "job_id": "02000000", "worker_id": "02000000ffffffffffffffffffffffffffffffffffffffffffffffff", "node_id": "7a87cdeb8936fafd92d0d4cab8456af74f2aae665f59cec80664527f", "timestamp_ns": 1747674091654282480}

Model response (non-streaming):
"Ray" can refer to different things depending on the context. Here are a few possibilities:

1. **Physics**: In physics, a ray is a line or beam of light, heat, or other form of electromagnetic radiation or particles traveling in a straight line. For example, in optics, rays are used to model the path that light takes.

2. **Computer Science**: Ray could refer to "Ray Tracing," a rendering technique used in computer graphics to generate an image by tracing the path of light as pixels in an image plane and simulating the effects of its encounters with virtual objects. It's widely used in video games, movies, and other applications requiring high-quality 3D graphics.

3. **Software**: "Ray" can also refer to an open-source distributed computing framework developed by the RISELab at UC Berkeley. This framework is designed to enable the development of scalable and high-performance applications, particularly in the fields of machine learning, reinforcement learning, and other data-intensive applications.

4. **Brand or Product Name**: "Ray" could also be part of a brand name or product name, such as Ray-Ban sunglasses or other consumer products.

If you're asking about a specific context or application of "Ray," please provide more details so I can give you a more accurate answer.
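
The two answers above read the same prompt differently, which is plausible for a short, ambiguous question sampled at temperature=0.5. LLMClient sends only a single user message, so one way to give the model more context is to put a system message in front of the prompt. The sketch below builds such a message list in the OpenAI chat format. The `build_messages` helper and the system prompt text are hypothetical and not part of the original client.

```python
from typing import Dict, List, Optional


def build_messages(prompt: str, system_prompt: Optional[str] = None) -> List[Dict[str, str]]:
    """Build an OpenAI-style chat message list, with an optional system message first."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    return messages


messages = build_messages(
    "what is ray?",
    system_prompt="You answer questions about the Ray distributed computing framework.",
)
print(messages)
```

To use it, pass the result as the `messages=` argument of `self.client.chat.completions.create(...)` in place of the hard-coded single-message list.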