GCS fault tolerance in KubeRay#

The Global Control Service (GCS) manages cluster-level metadata. By default, the GCS isn't fault tolerant because it stores all data in memory, and a failure can cause the entire Ray cluster to fail. To make the GCS fault tolerant, you must have a highly-available Redis. Then, when the GCS restarts, it retrieves all the data from the Redis instance and resumes its regular functions.

Fate-sharing without GCS fault tolerance

Without GCS fault tolerance, the Ray cluster, the GCS process, and the Ray head Pod are fate-sharing. If the GCS process dies, the Ray head Pod also dies after RAY_gcs_rpc_server_reconnect_timeout_s seconds. If the Ray head Pod is restarted according to the Pod's restartPolicy, worker Pods attempt to reconnect to the new head Pod. The new head Pod terminates the worker Pods; without GCS fault tolerance enabled, the cluster state is lost, and the new head Pod treats the worker Pods as "unknown workers". This is adequate for most Ray applications; however, it isn't ideal for Ray Serve, especially if high availability is crucial for your use cases. Hence, we recommend enabling GCS fault tolerance on the RayService custom resource to ensure high availability. See the Ray Serve end-to-end fault tolerance documentation for more information.

See also

If you also need fault tolerance for Redis, see Tuning Redis for a Persistent Fault Tolerant GCS.

Use cases#

  • Ray Serve: The recommended configuration is enabling GCS fault tolerance on the RayService custom resource to ensure high availability. See the Ray Serve end-to-end fault tolerance documentation for more information.

  • Other workloads: GCS fault tolerance isn't recommended, and compatibility isn't guaranteed.

Prerequisites#

  • Ray 2.0.0+

  • KubeRay 1.3.0+

  • Redis: single shard Redis Cluster or Redis Sentinel, with one or multiple replicas

Quickstart#

Step 1: Create a Kubernetes cluster with Kind#

kind create cluster --image=kindest/node:v1.26.0

Step 2: Install the KubeRay operator#

Follow this document to install the latest stable KubeRay operator from the Helm repository.

Step 3: Install a RayCluster with GCS FT enabled#

curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
kubectl apply -f ray-cluster.external-redis.yaml

Step 4: Verify the Kubernetes cluster status#

# Step 4.1: List all Pods in the `default` namespace.
# The expected output should be 4 Pods: 1 head, 1 worker, 1 KubeRay operator, and 1 Redis.
kubectl get pods
NAME                                                 READY   STATUS    RESTARTS   AGE
kuberay-operator-6bc45dd644-ktbnh                    1/1     Running   0          3m4s
raycluster-external-redis-head-xrjff                 1/1     Running   0          2m41s
raycluster-external-redis-small-group-worker-dwt98   1/1     Running   0          2m41s
redis-6cf756c755-qljcv                               1/1     Running   0          2m41s
# Step 4.2: List all ConfigMaps in the `default` namespace.
kubectl get configmaps
NAME               DATA   AGE
kube-root-ca.crt   1      3m4s
ray-example        2      2m41s
redis-config       1      2m41s

The ray-cluster.external-redis.yaml file defines the Kubernetes resources for the RayCluster, Redis, and ConfigMaps. There are two ConfigMaps in this example: ray-example and redis-config. The ray-example ConfigMap contains two Python scripts: detached_actor.py and increment_counter.py.

  • detached_actor.py is a Python script that creates a detached actor with the name counter_actor.

    import ray
    
    @ray.remote(num_cpus=1)
    class Counter:
        def __init__(self):
            self.value = 0
    
        def increment(self):
            self.value += 1
            return self.value
    
    ray.init(namespace="default_namespace")
    Counter.options(name="counter_actor", lifetime="detached").remote()
    
  • increment_counter.py is a Python script that increments the counter.

    import ray
    
    ray.init(namespace="default_namespace")
    counter = ray.get_actor("counter_actor")
    print(ray.get(counter.increment.remote()))
    

Step 5: Create a detached actor#

# Step 5.1: Create a detached actor with the name, `counter_actor`.
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/detached_actor.py
# Step 5.2: Increment the counter.
kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/increment_counter.py
2025-04-18 02:51:25,359	INFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-04-18 02:51:25,361	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.8:6379...
2025-04-18 02:51:25,557	INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.8:8265 
2025-04-18 02:51:29,069	INFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-04-18 02:51:29,072	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.8:6379...
2025-04-18 02:51:29,198	INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.8:8265 
1

Step 6: Check the data in Redis#

# Step 6.1: Check the RayCluster's UID.
kubectl get rayclusters.ray.io raycluster-external-redis -o=jsonpath='{.metadata.uid}'
# [Example output]: 66a4e2af-7c89-43db-a79c-71d1d0d9d71d
# Step 6.2: Check the head Pod's environment variable `RAY_external_storage_namespace`.
kubectl get pods $HEAD_POD -o=jsonpath='{.spec.containers[0].env}' | jq
[
  {
    "name": "RAY_external_storage_namespace",
    "value": "66a4e2af-7c89-43db-a79c-71d1d0d9d71d"
  },
  {
    "name": "RAY_REDIS_ADDRESS",
    "value": "redis:6379"
  },
  {
    "name": "REDIS_PASSWORD",
    "valueFrom": {
      "secretKeyRef": {
        "key": "password",
        "name": "redis-password-secret"
      }
    }
  },
  {
    "name": "RAY_CLUSTER_NAME",
    "valueFrom": {
      "fieldRef": {
        "apiVersion": "v1",
        "fieldPath": "metadata.labels['ray.io/cluster']"
      }
    }
  },
  {
    "name": "RAY_CLOUD_INSTANCE_ID",
    "valueFrom": {
      "fieldRef": {
        "apiVersion": "v1",
        "fieldPath": "metadata.name"
      }
    }
  },
  {
    "name": "RAY_NODE_TYPE_NAME",
    "valueFrom": {
      "fieldRef": {
        "apiVersion": "v1",
        "fieldPath": "metadata.labels['ray.io/group']"
      }
    }
  },
  {
    "name": "KUBERAY_GEN_RAY_START_CMD",
    "value": "ray start --head  --block  --dashboard-agent-listen-port=52365  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --num-cpus=0  --redis-password=$REDIS_PASSWORD "
  },
  {
    "name": "RAY_PORT",
    "value": "6379"
  },
  {
    "name": "RAY_ADDRESS",
    "value": "127.0.0.1:6379"
  },
  {
    "name": "RAY_USAGE_STATS_KUBERAY_IN_USE",
    "value": "1"
  },
  {
    "name": "RAY_USAGE_STATS_EXTRA_TAGS",
    "value": "kuberay_version=v1.3.0;kuberay_crd=RayCluster"
  },
  {
    "name": "RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE",
    "value": "1"
  }
]
# Step 6.3: Log into the Redis Pod.
# The password `5241590000000000` is defined in the `redis-config` ConfigMap.
# Step 6.4: Check the keys in Redis.
# Note: the schema changed in Ray 2.38.0. Previously we use a single HASH table,
# now we use multiple HASH tables with a common prefix.
export REDIS_POD=$(kubectl get pods --selector=app=redis -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -i $REDIS_POD -- env REDISCLI_AUTH="5241590000000000" redis-cli KEYS '*'
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@ACTOR
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@JOB
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@ACTOR_TASK_SPEC
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@WORKERS
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@JobCounter
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@KV
RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@NODE
# Step 6.5: Check the value of the key.
kubectl exec -i $REDIS_POD -- env REDISCLI_AUTH="5241590000000000" redis-cli HGETALL RAY66a4e2af-7c89-43db-a79c-71d1d0d9d71d@NODE
# Before Ray 2.38.0:
# HGETALL 66a4e2af-7c89-43db-a79c-71d1d0d9d71d

ray-cluster.external-redis.yaml 中,RayCluster 没有设置 gcsFaultToleranceOptions.externalStorageNamespace 选项。因此,默认情况下,KubeRay 会自动将环境变量 RAY_external_storage_namespace 注入到 RayCluster 管理的所有 Ray Pod 中,并使用 RayCluster 的 UID 作为外部存储命名空间。有关此选项的更多信息,请参阅此部分

Step 7: Kill the GCS process in the head Pod#

# Step 7.1: Check the `RAY_gcs_rpc_server_reconnect_timeout_s` environment variable in both the head Pod and worker Pod.
kubectl get pods $HEAD_POD -o=jsonpath='{.spec.containers[0].env}' | jq
[
  {
    "name": "RAY_external_storage_namespace",
    "value": "66a4e2af-7c89-43db-a79c-71d1d0d9d71d"
  },
  {
    "name": "RAY_REDIS_ADDRESS",
    "value": "redis:6379"
  },
  {
    "name": "REDIS_PASSWORD",
    "valueFrom": {
      "secretKeyRef": {
        "key": "password",
        "name": "redis-password-secret"
      }
    }
  },
  {
    "name": "RAY_CLUSTER_NAME",
    "valueFrom": {
      "fieldRef": {
        "apiVersion": "v1",
        "fieldPath": "metadata.labels['ray.io/cluster']"
      }
    }
  },
  {
    "name": "RAY_CLOUD_INSTANCE_ID",
    "valueFrom": {
      "fieldRef": {
        "apiVersion": "v1",
        "fieldPath": "metadata.name"
      }
    }
  },
  {
    "name": "RAY_NODE_TYPE_NAME",
    "valueFrom": {
      "fieldRef": {
        "apiVersion": "v1",
        "fieldPath": "metadata.labels['ray.io/group']"
      }
    }
  },
  {
    "name": "KUBERAY_GEN_RAY_START_CMD",
    "value": "ray start --head  --block  --dashboard-agent-listen-port=52365  --dashboard-host=0.0.0.0  --metrics-export-port=8080  --num-cpus=0  --redis-password=$REDIS_PASSWORD "
  },
  {
    "name": "RAY_PORT",
    "value": "6379"
  },
  {
    "name": "RAY_ADDRESS",
    "value": "127.0.0.1:6379"
  },
  {
    "name": "RAY_USAGE_STATS_KUBERAY_IN_USE",
    "value": "1"
  },
  {
    "name": "RAY_USAGE_STATS_EXTRA_TAGS",
    "value": "kuberay_version=v1.3.0;kuberay_crd=RayCluster"
  },
  {
    "name": "RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE",
    "value": "1"
  }
]
kubectl get pods $YOUR_WORKER_POD -o=jsonpath='{.spec.containers[0].env}' | jq
[
  {
    "name": "RAY_gcs_rpc_server_reconnect_timeout_s",
    "value": "600"
  },
  {
    "name": "FQ_RAY_IP",
    "value": "raycluster-external-redis-head-svc.default.svc.cluster.local"
  },
  {
    "name": "RAY_IP",
    "value": "raycluster-external-redis-head-svc"
  },
  {
    "name": "RAY_CLUSTER_NAME",
    "valueFrom": {
      "fieldRef": {
        "apiVersion": "v1",
        "fieldPath": "metadata.labels['ray.io/cluster']"
      }
    }
  },
  {
    "name": "RAY_CLOUD_INSTANCE_ID",
    "valueFrom": {
      "fieldRef": {
        "apiVersion": "v1",
        "fieldPath": "metadata.name"
      }
    }
  },
  {
    "name": "RAY_NODE_TYPE_NAME",
    "valueFrom": {
      "fieldRef": {
        "apiVersion": "v1",
        "fieldPath": "metadata.labels['ray.io/group']"
      }
    }
  },
  {
    "name": "KUBERAY_GEN_RAY_START_CMD",
    "value": "ray start  --address=raycluster-external-redis-head-svc.default.svc.cluster.local:6379  --block  --dashboard-agent-listen-port=52365  --metrics-export-port=8080  --num-cpus=1 "
  },
  {
    "name": "RAY_PORT",
    "value": "6379"
  },
  {
    "name": "RAY_ADDRESS",
    "value": "raycluster-external-redis-head-svc.default.svc.cluster.local:6379"
  },
  {
    "name": "RAY_USAGE_STATS_KUBERAY_IN_USE",
    "value": "1"
  },
  {
    "name": "RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE",
    "value": "1"
  }
]
# Step 7.2: Kill the GCS process in the head Pod.
kubectl exec -i $HEAD_POD -- pkill gcs_server
# Step 7.3: The head Pod fails and restarts after `RAY_gcs_rpc_server_reconnect_timeout_s` (60) seconds.
# In addition, the worker Pod isn't terminated by the new head after reconnecting because GCS fault
# tolerance is enabled.
kubectl get pods -l=ray.io/is-ray-node=yes
NAME                                                 READY   STATUS    RESTARTS      AGE
raycluster-external-redis-head-xrjff                 1/1     Running   1 (48s ago)   4m41s
raycluster-external-redis-small-group-worker-dwt98   1/1     Running   0             4m41s

ray-cluster.external-redis.yaml 中,RayCluster 的 head Pod 或 worker Pod 都没有设置 RAY_gcs_rpc_server_reconnect_timeout_s 环境变量。因此,KubeRay 会自动将 RAY_gcs_rpc_server_reconnect_timeout_s 环境变量注入到 worker Pod 中,其值为 600,而 head Pod 使用默认值 60。worker Pod 的超时值必须长于 head Pod 的超时值,以便 worker Pod 不会在 head Pod 从故障中重启之前终止。

Step 8: Access the detached actor again#

kubectl exec -it $HEAD_POD -- python3 /home/ray/samples/increment_counter.py
2025-04-18 02:53:25,356	INFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2025-04-18 02:53:25,359	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.8:6379...
2025-04-18 02:53:25,488	INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.244.0.8:8265 
2

In this example, the detached actor always sits on the worker Pod.

The head Pod's rayStartParams sets num-cpus: "0". Hence, no tasks or actors are scheduled on the head Pod.

With GCS fault tolerance enabled, you can still access the detached actor after the GCS process dies and restarts. Note that fault tolerance doesn't persist the actor's state. The result is 2 instead of 1 because the detached actor sits on the worker Pod, which keeps running the whole time. If, on the other hand, the head Pod hosted the detached actor, the increment_counter.py script would yield 1 in this step.

Step 9: Remove the keys stored in Redis when deleting the RayCluster#

# Step 9.1: Delete the RayCluster custom resource.
kubectl delete raycluster raycluster-external-redis
# Step 9.2: KubeRay operator deletes all Pods in the RayCluster.
# Step 9.3: KubeRay operator creates a Kubernetes Job to delete the Redis key after the head Pod is terminated.
# Step 9.4: Check whether the RayCluster has been deleted.
kubectl get raycluster
No resources found in default namespace.
# Step 9.5: Check Redis keys after the Kubernetes Job finishes.
export REDIS_POD=$(kubectl get pods --selector=app=redis -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -i $REDIS_POD -- env REDISCLI_AUTH="5241590000000000" redis-cli KEYS "*"

In KubeRay v1.0.0, the KubeRay operator adds a Kubernetes finalizer to RayClusters with GCS fault tolerance enabled to ensure Redis cleanup. KubeRay only removes this finalizer after the Kubernetes Job successfully cleans up Redis.

  • In other words, if the Kubernetes Job fails, the RayCluster won't be deleted. In that case, you should remove the finalizer and clean up Redis manually.

    kubectl patch rayclusters.ray.io raycluster-external-redis --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'
    

Starting with KubeRay v1.1.0, KubeRay changes the Redis cleanup behavior from mandatory to best-effort. KubeRay removes the Kubernetes finalizer from the RayCluster even if the Kubernetes Job fails, thus unblocking the deletion of the RayCluster.

Users can turn off this feature by setting the feature gate value ENABLE_GCS_FT_REDIS_CLEANUP. Refer to the KubeRay GCS fault tolerance configurations section for more details.
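
One way to set the feature gate, assuming you manage the KubeRay operator Deployment directly (the exact mechanism depends on how you installed the operator, e.g. through Helm values):

```yaml
# Illustrative fragment of the kuberay-operator Deployment:
spec:
  template:
    spec:
      containers:
      - name: kuberay-operator
        env:
        - name: ENABLE_GCS_FT_REDIS_CLEANUP
          value: "false"  # skip creating the Redis cleanup Job on RayCluster deletion
```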

Step 10: Delete the Kubernetes cluster#

kind delete cluster

KubeRay GCS fault tolerance configurations#

The ray-cluster.external-redis.yaml used in the quickstart example contains detailed comments about the configuration options. Read this section in conjunction with the YAML file.

These configurations require KubeRay 1.3.0+

The following section uses the new field gcsFaultToleranceOptions introduced in KubeRay 1.3.0. For the old GCS fault tolerance configurations, including the ray.io/ft-enabled annotation, see the old document.

1. Enable GCS fault tolerance#

  • gcsFaultToleranceOptions: Add the gcsFaultToleranceOptions field to the RayCluster custom resource to enable GCS fault tolerance.

      kind: RayCluster
      metadata:
      spec:
        gcsFaultToleranceOptions: # <- Add this field to enable GCS fault tolerance.
    

2. Connect to an external Redis#

  • redisAddress: Add redisAddress to the gcsFaultToleranceOptions field. Use this option to specify the address of the Redis service, allowing the Ray head to connect to it. In ray-cluster.external-redis.yaml, the RayCluster custom resource uses the redis Kubernetes ClusterIP service name as the connection point to the Redis server. The YAML file also creates the ClusterIP service.

    kind: RayCluster
    metadata:
    spec:
      gcsFaultToleranceOptions:
        redisAddress: "redis:6379" # <- Add redis address here.
    
  • redisPassword: Add redisPassword to the gcsFaultToleranceOptions field. Use this option to specify the password of the Redis service, allowing the Ray head to connect to it. In ray-cluster.external-redis.yaml, the RayCluster custom resource loads the password from a Kubernetes Secret.

    kind: RayCluster
    metadata:
    spec:
      gcsFaultToleranceOptions:
        redisAddress: "redis:6379"
        redisPassword: # <- Add redis password from a Kubernetes secret.
          valueFrom:
            secretKeyRef:
              name: redis-password-secret
              key: password
    
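For reference, a Secret matching the secretKeyRef above could be created from a manifest like this (the password value is the quickstart's example password; use your own in production):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: redis-password-secret
type: Opaque
stringData:
  password: "5241590000000000"
```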

3. Use an external storage namespace#

  • externalStorageNamespace (optional): Add externalStorageNamespace to the gcsFaultToleranceOptions field. KubeRay uses the value of this option to set the environment variable RAY_external_storage_namespace for all Ray Pods managed by the RayCluster. In most cases, you don't need to set externalStorageNamespace because KubeRay automatically sets it to the UID of the RayCluster. Only modify this option if you fully understand the behaviors of GCS fault tolerance and RayService, to avoid this issue. Refer to this section in the earlier quickstart example for more details.

    kind: RayCluster
    metadata:
    spec:
      gcsFaultToleranceOptions:
        externalStorageNamespace: "my-raycluster-storage" # <- Add this option to specify a storage namespace
    

4. Turn off Redis cleanup#

Key eviction setup on Redis

If you disable ENABLE_GCS_FT_REDIS_CLEANUP but want Redis to remove GCS metadata automatically, set the following two options in your redis.conf or in the command line options of your redis-server command (example):

  • maxmemory=<your_memory_limit>

  • maxmemory-policy=allkeys-lru

These two options instruct Redis to delete the least recently used keys when it reaches the maxmemory limit. See Key eviction from Redis for more information.

Note that Redis performs this eviction, and it doesn't guarantee that Ray isn't still using the deleted keys.
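
In redis.conf syntax, the two options above look like this (the 500mb limit is an arbitrary example; size it for your cluster's metadata):

```
# redis.conf (equivalently: redis-server --maxmemory 500mb --maxmemory-policy allkeys-lru)
maxmemory 500mb
maxmemory-policy allkeys-lru
```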

Next steps#