GCS fault tolerance in KubeRay#
The Global Control Service (GCS) manages cluster-level metadata. By default, the GCS lacks fault tolerance because it stores all data in memory, and its failure can bring down the entire Ray cluster. To make the GCS fault tolerant, you must have a highly available Redis. Then, when the GCS restarts, it retrieves all of its data from the Redis instance and resumes its regular functions.
Fate-sharing without GCS fault tolerance
Without GCS fault tolerance, the Ray cluster, the GCS process, and the Ray head Pod are fate-sharing. If the GCS process dies, the Ray head Pod dies as well after RAY_gcs_rpc_server_reconnect_timeout_s seconds. If the Ray head Pod restarts according to the Pod's restartPolicy, the worker Pods attempt to reconnect to the new head Pod. The new head Pod terminates the worker Pods; without GCS fault tolerance enabled, the cluster state is lost, and the new head Pod treats the worker Pods as "unknown workers". This behavior is adequate for most Ray applications; however, it isn't ideal for Ray Serve, especially if high availability is crucial for your use case. Hence, we recommend enabling GCS fault tolerance on the RayService custom resource to ensure high availability. See the Ray Serve end-to-end fault tolerance documentation for more information.
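To see this fate-sharing behavior first-hand, you can kill the GCS process on a cluster that doesn't enable GCS fault tolerance and watch the Pods. The following is an illustrative sketch, assuming a plain RayCluster without gcsFaultToleranceOptions is running in the current namespace:
# Illustrative sketch: observe fate-sharing on a RayCluster WITHOUT GCS fault tolerance.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -i $HEAD_POD -- pkill gcs_server
# After RAY_gcs_rpc_server_reconnect_timeout_s seconds, the head Pod restarts, and the
# new head terminates and replaces the worker Pods because it sees them as unknown workers.
kubectl get pods -l=ray.io/is-ray-node=yes --watch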
See also
If you also need fault tolerance for Redis, see Tuning Redis for a Persistent Fault Tolerant GCS.
Use cases#
Ray Serve: The recommended configuration is enabling GCS fault tolerance on the RayService custom resource to ensure high availability. See the Ray Serve end-to-end fault tolerance documentation for more information.
Other workloads: GCS fault tolerance isn't recommended, and compatibility isn't guaranteed.
Prerequisites#
Ray 2.0.0+
KubeRay 1.3.0+
Redis: a single-shard Redis Cluster or Redis Sentinel, with one or more replicas
Quickstart#
Step 1: Create a Kubernetes cluster with Kind#
kind create cluster --image=kindest/node:v1.26.0
Step 2: Install the KubeRay operator#
Follow this document to install the latest stable KubeRay operator via the Helm repository.
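For reference, a typical Helm installation looks like the following sketch. The repository URL and chart name are the standard KubeRay ones; check the linked document for the currently recommended version:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install the KubeRay operator. The version below is an example; pick the latest stable release.
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0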
Step 3: Install a RayCluster with GCS FT enabled#
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
kubectl apply -f ray-cluster.external-redis.yaml
Step 4: Verify the Kubernetes cluster status#
# Step 4.1: List all Pods in the `default` namespace.
# The expected output should be 4 Pods: 1 head, 1 worker, 1 KubeRay, and 1 Redis.
kubectl get pods
NAME READY STATUS RESTARTS AGE
kuberay-operator-6bc45dd644-ktbnh 1/1 Running 0 3m4s
raycluster-external-redis-head-xrjff 1/1 Running 0 2m41s
raycluster-external-redis-small-group-worker-dwt98 1/1 Running 0 2m41s
redis-6cf756c755-qljcv 1/1 Running 0 2m41s
# Step 4.2: List all ConfigMaps in the `default` namespace.
kubectl get configmaps
NAME DATA AGE
kube-root-ca.crt 1 3m4s
ray-example 2 2m41s
redis-config 1 2m41s
The ray-cluster.external-redis.yaml file defines the Kubernetes resources for the RayCluster, Redis, and the ConfigMaps. There are two ConfigMaps in this example: ray-example and redis-config. The ray-example ConfigMap contains two Python scripts: detached_actor.py and increment_counter.py.
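If you want to read the scripts directly from the cluster, you can dump the ConfigMap:
# Optional: print the two Python scripts stored in the `ray-example` ConfigMap.
kubectl get configmap ray-example -o yaml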
detached_actor.py is a Python script that creates a detached actor with the name counter_actor.
import ray

@ray.remote(num_cpus=1)
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

ray.init(namespace="default_namespace")
Counter.options(name="counter_actor", lifetime="detached").remote()
increment_counter.py is a Python script that increments the counter.
import ray

ray.init(namespace="default_namespace")
counter = ray.get_actor("counter_actor")
print(ray.get(counter.increment.remote()))
Step 5: Create a detached actor#
# Step 5.1: Create a detached actor with the name, `counter_actor`.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py
# Step 5.2: Increment the counter.
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/increment_counter.py
2025-04-18 02:51:25,359 INFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-04-18 02:51:25,361 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.8:6379...
2025-04-18 02:51:25,557 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.8:8265
2025-04-18 02:51:29,069 INFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-04-18 02:51:29,072 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.8:6379...
2025-04-18 02:51:29,198 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.8:8265
1
Step 6: Check the data in Redis#
# Step 6.1: Check the RayCluster's UID.
kubectl get rayclusters.ray.io raycluster-external-redis -o=jsonpath='{.metadata.uid}'
# [Example output]: 66a4e2af-7c89-43db-a79c-71d1d0d9d71d
# Step 6.2: Check the head Pod's environment variable `RAY_external_storage_namespace`.
kubectl get pods $HEAD_POD -o=jsonpath='{.spec.containers[0].env}' | jq
[
{
"name": "RAY_external_storage_namespace",
"value": "66a4e2af-7c89-43db-a79c-71d1d0d9d71d"
},
{
"name": "RAY_REDIS_ADDRESS",
"value": "redis:6379"
},
{
"name": "REDIS_PASSWORD",
"valueFrom": {
"secretKeyRef": {
"key": "password",
"name": "redis-password-secret"
}
}
},
{
"name": "RAY_CLUSTER_NAME",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.labels['ray.io/cluster']"
}
}
},
{
"name": "RAY_CLOUD_INSTANCE_ID",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.name"
}
}
},
{
"name": "RAY_NODE_TYPE_NAME",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.labels['ray.io/group']"
}
}
},
{
"name": "KUBERAY_GEN_RAY_START_CMD",
"value": "ray start --head --block --dashboard-agent-listen-port=52365 --dashboard-host=0.0.0.0 --metrics-export-port=8080 --num-cpus=0 --redis-password=$REDIS_PASSWORD "
},
{
"name": "RAY_PORT",
"value": "6379"
},
{
"name": "RAY_ADDRESS",
"value": "127.0.0.1:6379"
},
{
"name": "RAY_USAGE_STATS_KUBERAY_IN_USE",
"value": "1"
},
{
"name": "RAY_USAGE_STATS_EXTRA_TAGS",
"value": "kuberay_version=v1.3.0;kuberay_crd=RayCluster"
},
{
"name": "RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE",
"value": "1"
}
]
# Step 6.3: Log into the Redis Pod.
# The password `5241590000000000` is defined in the `redis-config` ConfigMap.
# Step 6.4: Check the keys in Redis.
# Note: the schema changed in Ray 2.38.0. Previously, Ray used a single HASH table;
# now it uses multiple HASH tables with a common prefix.
export REDIS_POD=$(kubectl get pods --selector=app=redis -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -i $REDIS_POD -- env REDISCLI_AUTH="5241590000000000" redis-cli KEYS '*'
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@ACTOR
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@JOB
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@ACTOR_TASK_SPEC
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@WORKERS
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@JobCounter
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@KV
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@NODE
# Step 6.5: Check the value of the key.
kubectl exec -i $REDIS_POD -- env REDISCLI_AUTH="5241590000000000" redis-cli HGETALL RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@NODE
# Before Ray 2.38.0:
# HGETALL 66a4e2af-7c89-43db-a79c-71d1d0d9d71d
In ray-cluster.external-redis.yaml, the RayCluster doesn't set the gcsFaultToleranceOptions.externalStorageNamespace option. Therefore, by default, KubeRay automatically injects the environment variable RAY_external_storage_namespace into all Ray Pods managed by the RayCluster, using the RayCluster's UID as the external storage namespace. See this section for more information about this option.
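As a quick sanity check, the sketch below compares the RayCluster UID with the value injected into the head Pod; the two should match:
# Sketch: the default external storage namespace equals the RayCluster's UID.
CLUSTER_UID=$(kubectl get rayclusters.ray.io raycluster-external-redis -o=jsonpath='{.metadata.uid}')
STORAGE_NS=$(kubectl get pods $HEAD_POD -o=jsonpath='{.spec.containers[0].env[?(@.name=="RAY_external_storage_namespace")].value}')
[ "$CLUSTER_UID" = "$STORAGE_NS" ] && echo "external storage namespace matches the RayCluster UID"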
Step 7: Kill the GCS process in the head Pod#
# Step 7.1: Check the `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable in both the head Pod and worker Pod.
kubectl get pods $HEAD_POD -o=jsonpath='{.spec.containers[0].env}' | jq
[
{
"name": "RAY_external_storage_namespace",
"value": "66a4e2af-7c89-43db-a79c-71d1d0d9d71d"
},
{
"name": "RAY_REDIS_ADDRESS",
"value": "redis:6379"
},
{
"name": "REDIS_PASSWORD",
"valueFrom": {
"secretKeyRef": {
"key": "password",
"name": "redis-password-secret"
}
}
},
{
"name": "RAY_CLUSTER_NAME",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.labels['ray.io/cluster']"
}
}
},
{
"name": "RAY_CLOUD_INSTANCE_ID",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.name"
}
}
},
{
"name": "RAY_NODE_TYPE_NAME",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.labels['ray.io/group']"
}
}
},
{
"name": "KUBERAY_GEN_RAY_START_CMD",
"value": "ray start --head --block --dashboard-agent-listen-port=52365 --dashboard-host=0.0.0.0 --metrics-export-port=8080 --num-cpus=0 --redis-password=$REDIS_PASSWORD "
},
{
"name": "RAY_PORT",
"value": "6379"
},
{
"name": "RAY_ADDRESS",
"value": "127.0.0.1:6379"
},
{
"name": "RAY_USAGE_STATS_KUBERAY_IN_USE",
"value": "1"
},
{
"name": "RAY_USAGE_STATS_EXTRA_TAGS",
"value": "kuberay_version=v1.3.0;kuberay_crd=RayCluster"
},
{
"name": "RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE",
"value": "1"
}
]
export YOUR_WORKER_POD=$(kubectl get pods --selector=ray.io/node-type=worker -o custom-columns=POD:metadata.name --no-headers)
kubectl get pods $YOUR_WORKER_POD -o=jsonpath='{.spec.containers[0].env}' | jq
[
{
"name": "RAY_gcs_rpc_server_reconnect_timeout_s",
"value": "600"
},
{
"name": "FQ_RAY_IP",
"value": "raycluster-external-redis-head-svc.default.svc.cluster.local"
},
{
"name": "RAY_IP",
"value": "raycluster-external-redis-head-svc"
},
{
"name": "RAY_CLUSTER_NAME",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.labels['ray.io/cluster']"
}
}
},
{
"name": "RAY_CLOUD_INSTANCE_ID",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.name"
}
}
},
{
"name": "RAY_NODE_TYPE_NAME",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.labels['ray.io/group']"
}
}
},
{
"name": "KUBERAY_GEN_RAY_START_CMD",
"value": "ray start --address=raycluster-external-redis-head-svc.default.svc.cluster.local:6379 --block --dashboard-agent-listen-port=52365 --metrics-export-port=8080 --num-cpus=1 "
},
{
"name": "RAY_PORT",
"value": "6379"
},
{
"name": "RAY_ADDRESS",
"value": "raycluster-external-redis-head-svc.default.svc.cluster.local:6379"
},
{
"name": "RAY_USAGE_STATS_KUBERAY_IN_USE",
"value": "1"
},
{
"name": "RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE",
"value": "1"
}
]
# Step 7.2: Kill the GCS process in the head Pod.
kubectl exec -i $HEAD_POD -- pkill gcs_server
# Step 7.3: The head Pod fails and restarts after `RAY_gcs_rpc_server_reconnect_timeout_s` (60) seconds.
# In addition, the worker Pod isn't terminated by the new head after reconnecting because GCS fault
# tolerance is enabled.
kubectl get pods -l=ray.io/is-ray-node=yes
NAME READY STATUS RESTARTS AGE
raycluster-external-redis-head-xrjff 1/1 Running 1 (48s ago) 4m41s
raycluster-external-redis-small-group-worker-dwt98 1/1 Running 0 4m41s
In ray-cluster.external-redis.yaml, the RayCluster doesn't set the RAY_gcs_rpc_server_reconnect_timeout_s environment variable for either the head Pod or the worker Pods. Therefore, KubeRay automatically injects the RAY_gcs_rpc_server_reconnect_timeout_s environment variable with the value 600 into the worker Pods, while the head Pod uses the default value of 60. The worker Pods' timeout must be longer than the head Pod's timeout so that the worker Pods don't terminate before the head Pod restarts after a failure.
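As a quick check, the following sketch reads the injected value straight from the worker Pod; the expected output is 600:
kubectl get pods $YOUR_WORKER_POD -o=jsonpath='{.spec.containers[0].env[?(@.name=="RAY_gcs_rpc_server_reconnect_timeout_s")].value}'
# [Expected output]: 600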
Step 8: Access the detached actor again#
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/increment_counter.py
2025-04-18 02:53:25,356 INFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-04-18 02:53:25,359 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.8:6379...
2025-04-18 02:53:25,488 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.8:8265
2
In this example, the detached actor always sits on a worker Pod.
The head Pod's rayStartParams sets num-cpus: "0", so no tasks or actors are scheduled on the head Pod.
With GCS fault tolerance enabled, you can still access the detached actor even after the GCS process dies and restarts. Note that fault tolerance doesn't persist the actor's state. The result is 2 instead of 1 because the detached actor sits on the worker Pod, which runs the whole time. If the head Pod hosted the detached actor instead, the increment_counter.py script would yield 1 in this step.
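If you want to verify where the detached actor lives, the Ray State CLI can list actors from inside the head Pod. The sketch below assumes Ray 2.x, where ray list supports --filter; the exact output columns vary by version:
# Sketch: locate the detached actor. `Counter` is the actor class from detached_actor.py.
kubectl exec -it $HEAD_POD -- ray list actors --filter "class_name=Counter"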
Step 9: Remove the keys stored in Redis when deleting the RayCluster#
# Step 9.1: Delete the RayCluster custom resource.
kubectl delete raycluster raycluster-external-redis
# Step 9.2: KubeRay operator deletes all Pods in the RayCluster.
# Step 9.3: KubeRay operator creates a Kubernetes Job to delete the Redis key after the head Pod is terminated.
# Step 9.4: Check whether the RayCluster has been deleted.
kubectl get raycluster
No resources found in default namespace.
# Step 9.5: Check Redis keys after the Kubernetes Job finishes.
export REDIS_POD=$(kubectl get pods --selector=app=redis -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -i $REDIS_POD -- env REDISCLI_AUTH="5241590000000000" redis-cli KEYS "*"
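# [Expected output]: empty. The cleanup Job already removed the RAY<UID>@... keys of the deleted cluster.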
In KubeRay v1.0.0, the KubeRay operator adds a Kubernetes finalizer to RayClusters with GCS fault tolerance enabled to ensure Redis cleanup. KubeRay only removes this finalizer after a Kubernetes Job successfully cleans up Redis.
In other words, if the Kubernetes Job fails, KubeRay doesn't delete the RayCluster. In that case, you should remove the finalizer manually and clean up Redis:
kubectl patch rayclusters.ray.io raycluster-external-redis --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'
Starting with KubeRay v1.1.0, KubeRay changed the Redis cleanup behavior from mandatory to best-effort. KubeRay still removes the Kubernetes finalizer from the RayCluster even if the Kubernetes Job fails, thereby unblocking deletion of the RayCluster.
You can turn off this feature by setting the feature gate value ENABLE_GCS_FT_REDIS_CLEANUP. See the KubeRay GCS fault tolerance configurations section for more details.
Step 10: Delete the Kubernetes cluster#
kind delete cluster
KubeRay GCS fault tolerance configurations#
The ray-cluster.external-redis.yaml used in the quickstart contains detailed comments about the configuration options. Read this section together with the YAML file.
These configurations require KubeRay 1.3.0+.
The following sections use the gcsFaultToleranceOptions field introduced in KubeRay 1.3.0. For the old GCS fault tolerance configurations, including the ray.io/ft-enabled annotation, see the old document.
1. Enable GCS fault tolerance#
gcsFaultToleranceOptions: Add the gcsFaultToleranceOptions field to the RayCluster custom resource to enable GCS fault tolerance.
kind: RayCluster
metadata:
spec:
  gcsFaultToleranceOptions: # <- Add this field to enable GCS fault tolerance.
2. Connect to an external Redis#
redisAddress: Add redisAddress to the gcsFaultToleranceOptions field. Use this option to specify the address of the Redis service, allowing the Ray head to connect to it. In ray-cluster.external-redis.yaml, the RayCluster custom resource uses the redis Kubernetes ClusterIP service name as the connection point to the Redis server. The YAML file also creates the ClusterIP service.
kind: RayCluster
metadata:
spec:
  gcsFaultToleranceOptions:
    redisAddress: "redis:6379" # <- Add redis address here.
redisPassword: Add redisPassword to the gcsFaultToleranceOptions field. Use this option to specify the password of the Redis service, allowing the Ray head to connect to it. In ray-cluster.external-redis.yaml, the RayCluster custom resource loads the password from a Kubernetes Secret.
kind: RayCluster
metadata:
spec:
  gcsFaultToleranceOptions:
    redisAddress: "redis:6379"
    redisPassword: # <- Add redis password from a Kubernetes secret.
      valueFrom:
        secretKeyRef:
          name: redis-password-secret
          key: password
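For completeness, the following sketch creates a Secret equivalent to the one the sample YAML defines; the Secret name, key, and password mirror ray-cluster.external-redis.yaml:
# Sketch: create the Secret that `redisPassword.valueFrom.secretKeyRef` references.
kubectl create secret generic redis-password-secret --from-literal=password=5241590000000000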
3. Use an external storage namespace#
externalStorageNamespace (optional): Add externalStorageNamespace to the gcsFaultToleranceOptions field. KubeRay uses this option's value to set the environment variable RAY_external_storage_namespace for all Ray Pods managed by the RayCluster. In most cases, you don't need to set externalStorageNamespace because KubeRay automatically sets it to the RayCluster's UID. Only modify this option if you fully understand the behaviors of GCS fault tolerance and RayService, to avoid this issue. See this section in the earlier quickstart for more details.
kind: RayCluster
metadata:
spec:
  gcsFaultToleranceOptions:
    externalStorageNamespace: "my-raycluster-storage" # <- Add this option to specify a storage namespace
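To verify that the namespace propagated, a sketch like the following prints the injected value for every Ray Pod:
# Sketch: print RAY_external_storage_namespace for each Ray Pod.
kubectl get pods -l=ray.io/is-ray-node=yes -o=jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[0].env[?(@.name=="RAY_external_storage_namespace")].value}{"\n"}{end}'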
4. Turn off Redis cleanup#
ENABLE_GCS_FT_REDIS_CLEANUP: True by default. You can turn off this feature by setting this environment variable in the KubeRay operator's Helm chart.
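A hedged sketch of turning the feature off follows. The env values key is an assumption about the kuberay-operator chart; verify it against the values.yaml of your chart version:
# Sketch: disable best-effort Redis cleanup in the operator Deployment.
cat <<'EOF' > kuberay-values.yaml
env:
  - name: ENABLE_GCS_FT_REDIS_CLEANUP
    value: "false"
EOF
helm upgrade --install kuberay-operator kuberay/kuberay-operator -f kuberay-values.yaml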
Next steps#
See the Ray Serve end-to-end fault tolerance documentation for more information.
See the Ray Core GCS fault tolerance documentation for more information.