Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
title: Determine cluster failures
title: Cluster Status Maintenance
---

Karmada supports both `Push` and `Pull` modes to manage member clusters. For more details about cluster registration, please refer to [Cluster Registration](../clustermanager/cluster-registration.md).
56 changes: 6 additions & 50 deletions docs/userguide/failover/failover-analysis.md
@@ -1,76 +1,32 @@
---
title: Failover Analysis
title: Cluster Failover Process Analysis
---

Let's briefly analyze the Karmada failover feature.

## Add taints to faulty clusters

After the cluster is determined to be unhealthy, a taint with `Effect` set to `NoSchedule` will be added to the cluster as follows:
After the [cluster status becomes unhealthy](./cluster-status-maintenance.md), a `taint{effect: NoSchedule}` will be added to the cluster as follows:

- when the cluster's `Ready` condition is `False`, add the following taint:
- when the cluster's `Ready` condition is `False`, the Karmada controller will add the following taint to the cluster object:

```yaml
key: cluster.karmada.io/not-ready
effect: NoSchedule
```

- when the cluster's `Ready` condition is `Unknown`, add the following taint:
- when the cluster's `Ready` condition is `Unknown`, the Karmada controller will add the following taint to the cluster object:

```yaml
key: cluster.karmada.io/unreachable
effect: NoSchedule
```
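The mapping from the `Ready` condition to a `NoSchedule` taint described above can be sketched as a small lookup. This is only an illustration of the rule as stated in the text, not Karmada's controller code; the function name `no_schedule_taint` and the dictionary are hypothetical.

```python
from typing import Dict, Optional

# Taint keys taken from the text above; the mapping itself is a hypothetical helper.
TAINT_KEY_BY_READY_STATUS = {
    "False": "cluster.karmada.io/not-ready",
    "Unknown": "cluster.karmada.io/unreachable",
}


def no_schedule_taint(ready_status: str) -> Optional[Dict[str, str]]:
    """Return the NoSchedule taint for a Ready condition status, or None if healthy."""
    key = TAINT_KEY_BY_READY_STATUS.get(ready_status)
    if key is None:
        # Any other status (for example "True") gets no taint in this sketch.
        return None
    return {"key": key, "effect": "NoSchedule"}
```

For example, `no_schedule_taint("Unknown")` yields the `cluster.karmada.io/unreachable` taint, while `no_schedule_taint("True")` yields `None`.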

If an unhealthy cluster does not recover within a period of time, which can be configured via the `--failover-eviction-timeout` flag (default is 5 minutes), a new taint with `Effect` set to `NoExecute` will be added to the cluster as follows:

- when the cluster's `Ready` condition is `False`, add the following taint:

```yaml
key: cluster.karmada.io/not-ready
effect: NoExecute
```

- when the cluster's `Ready` condition is `Unknown`, add the following taint:

```yaml
key: cluster.karmada.io/unreachable
effect: NoExecute
```

## Tolerate cluster taints

After a user creates a `PropagationPolicy/ClusterPropagationPolicy`, Karmada will automatically add the following tolerations through a webhook:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
  namespace: default
spec:
  placement:
    clusterTolerations:
    - effect: NoExecute
      key: cluster.karmada.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: cluster.karmada.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: nginx
    namespace: default
```

The `tolerationSeconds` can be configured via the `--default-not-ready-toleration-seconds` flag (default is 300) and the `--default-unreachable-toleration-seconds` flag (default is 300).
In addition, the Karmada controller will not actively add `NoExecute` taints to cluster objects. Users can actively manage taints on cluster objects, including `NoExecute` taints, through the [cluster taint management](./cluster-taint-management.md) feature.

## Failover

When Karmada detects that the faulty cluster is no longer tolerated by `PropagationPolicy/ClusterPropagationPolicy`, the cluster will be removed from the resource scheduling result and the Karmada scheduler will reschedule the referenced application.
When Karmada detects that a cluster has been tainted with `NoExecute` and the taint cannot be tolerated by the toleration strategy in `PropagationPolicy/ClusterPropagationPolicy`, the Karmada controller will remove the cluster from the resource scheduling result, and the Karmada scheduler will then reschedule the target workload.
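The eviction decision can be pictured as a taint/toleration match. The sketch below assumes Kubernetes-style matching semantics (an empty toleration `effect` or `key` matches anything, `Exists` ignores the value) and ignores `tolerationSeconds`; whether Karmada follows exactly these rules is an assumption here, and the function names `tolerates` and `should_evict` are hypothetical.

```python
from typing import Dict, List


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Kubernetes-style match between one toleration and one taint (assumed semantics)."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    # "Equal" operator: values must match (a missing value is treated as empty).
    return toleration.get("value", "") == taint.get("value", "")


def should_evict(cluster_taints: List[Dict[str, str]],
                 tolerations: List[Dict[str, str]]) -> bool:
    """True if some NoExecute taint on the cluster is not tolerated by the policy."""
    for taint in cluster_taints:
        if taint["effect"] != "NoExecute":
            continue  # NoSchedule taints only affect new scheduling decisions
        if not any(tolerates(t, taint) for t in tolerations):
            return True
    return False
```

In this model, a cluster carrying only `NoSchedule` taints is never evicted, which mirrors the statement that failover is triggered by `NoExecute` taints.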

There are several constraints:
- Each rescheduled application still needs to meet the restrictions of `PropagationPolicy/ClusterPropagationPolicy`, such as ClusterAffinity or SpreadConstraints.
@@ -1,5 +1,5 @@
---
title: Determine Cluster Failures
title: Cluster Status Maintenance
---

Karmada supports both `Push` and `Pull` modes to manage member clusters. For more details about cluster registration,
@@ -1,80 +1,40 @@
---
title: Failover Process Analysis
title: Cluster Failover Process Analysis
---

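Let's briefly analyze the Karmada cluster failover process.
Let's briefly analyze the Karmada cluster failover process.

## Add cluster taints

After the [cluster is determined to be unhealthy](./determine-cluster-failures.md), a taint with `Effect` set to `NoSchedule` will be added to the cluster as follows:
After the [cluster status becomes unhealthy](./cluster-status-maintenance.md), a `taint{effect: NoSchedule}` will be added to the cluster as follows:

- when the cluster's `Ready` status is `False`, the following taint will be added:
- when the cluster's `Ready` condition is `False`, the Karmada controller will add the following taint to the cluster object: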

```yaml
key: cluster.karmada.io/not-ready
effect: NoSchedule
```

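- when the cluster's `Ready` status is `Unknown`, the following taint will be added:
- when the cluster's `Ready` condition is `Unknown`, the Karmada controller will add the following taint to the cluster object: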

```yaml
key: cluster.karmada.io/unreachable
effect: NoSchedule
```

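If the cluster remains unhealthy for a period of time without recovering (the period can be configured via the `--failover-eviction-timeout` flag, default is 5 minutes), a taint with `Effect` set to `NoExecute` will be added to the cluster as follows:

- when the cluster's `Ready` status is `False`, the following taint will be added: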

```yaml
key: cluster.karmada.io/not-ready
effect: NoExecute
```

- when the cluster's `Ready` status is `Unknown`, the following taint will be added:

```yaml
key: cluster.karmada.io/unreachable
effect: NoExecute
```

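## Tolerate cluster taints

After a user creates a `PropagationPolicy/ClusterPropagationPolicy` resource, Karmada will automatically add the following cluster taint tolerations to it through a webhook: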

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
  namespace: default
spec:
  placement:
    clusterTolerations:
    - effect: NoExecute
      key: cluster.karmada.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: cluster.karmada.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  ...
```

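The `tolerationSeconds` value of these tolerations can be configured via the `--default-not-ready-toleration-seconds` and `--default-unreachable-toleration-seconds` flags, both of which default to 300.
In addition, the Karmada controller will not actively add `NoExecute` taints to cluster objects. Users can actively manage taints on cluster objects, including `NoExecute` taints, through the [cluster taint management](./cluster-taint-management.md) feature.

## Failover

When Karmada detects that the faulty cluster is no longer tolerated by the `PropagationPolicy/ClusterPropagationPolicy`, the cluster will be removed from the resource scheduling result, and the Karmada scheduler will then reschedule the affected workloads.
When Karmada finds that a cluster has been tainted with `NoExecute` and the taint cannot be tolerated by the toleration strategy in `PropagationPolicy/ClusterPropagationPolicy`, the Karmada controller will remove the cluster from the resource scheduling result, and the Karmada scheduler will then reschedule the target workload.

Rescheduling is subject to the following constraints:
- Each rescheduled workload still needs to meet the constraints of `PropagationPolicy/ClusterPropagationPolicy`, such as ClusterAffinity or SpreadConstraints.
- Each rescheduled workload still needs to meet the constraints of `PropagationPolicy/ClusterPropagationPolicy`, such as ClusterAffinity or SpreadConstraints.
- Clusters that were healthy in the initial scheduling result are retained during rescheduling.

### Duplicated schedule type
### Duplicated schedule type

For the `Duplicated` schedule type, when rescheduling after a cluster failure, scheduling proceeds only if the number of candidate clusters that meet the propagation policy constraints is greater than or equal to the number of failed clusters; otherwise it is not performed. A candidate cluster is a cluster newly computed in the current scheduling pass, as distinct from the clusters that were already scheduled.
For the `Duplicated` schedule type, when rescheduling after a cluster failure, scheduling proceeds only if the number of candidate clusters that meet the propagation policy constraints is greater than or equal to the number of failed clusters; otherwise it is not performed. A candidate cluster is a cluster newly computed in the current scheduling pass, as distinct from the clusters that were already scheduled.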
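The `Duplicated` rule above can be modelled in a few lines: keep the healthy clusters, count how many replacements are needed, and proceed only if enough candidates exist. This is a simplified sketch rather than the scheduler's algorithm; the function name `reschedule_duplicated` is hypothetical, `eligible` stands in for "clusters that satisfy the policy constraints", and picking the first candidates in list order is an arbitrary assumption.

```python
from typing import List, Optional, Set


def reschedule_duplicated(scheduled: List[str],
                          failed: Set[str],
                          eligible: List[str]) -> Optional[List[str]]:
    """Return the new cluster list, or None when rescheduling is not performed."""
    # Healthy clusters from the previous result are always retained.
    healthy = [c for c in scheduled if c not in failed]
    needed = len(scheduled) - len(healthy)
    # Candidates are eligible clusters that were not part of the previous result.
    candidates = [c for c in eligible if c not in scheduled and c not in failed]
    if len(candidates) < needed:
        return None
    # Assumption: take candidates in list order; any valid candidate would do.
    return healthy + candidates[:needed]
```

With `scheduled=["member1", "member2"]`, `failed={"member2"}` and `eligible=["member1", "member2", "member3", "member5"]`, the sketch keeps member1 and adds one of the two candidates.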

Take a `Deployment` resource as an example:

@@ -126,13 +86,13 @@ spec:
```
</details>

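Suppose there are 5 member clusters and the initial scheduling result is member1 and member2. When member2 fails, the scheduler is triggered to reschedule.
Suppose there are 5 member clusters and the initial scheduling result is member1 and member2. When member2 fails, the scheduler is triggered to reschedule.

Note that rescheduling does not delete the workload on member1, which was in the Ready state. Of the remaining 3 clusters, only member3 and member5 match the `clusterAffinity` policy.
Note that rescheduling does not delete the workload on member1, which was in the Ready state. Of the remaining 3 clusters, only member3 and member5 match the `clusterAffinity` policy.

Due to the spread constraints, the final scheduling result will be [member1, member3] or [member1, member5].
Due to the spread constraints, the final scheduling result will be [member1, member3] or [member1, member5].

### Divided schedule type
### Divided schedule type

For the `Divided` schedule type, the Karmada scheduler will try to migrate the application replicas to other healthy clusters.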

@@ -192,18 +152,18 @@ spec:
```
</details>

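The Karmada scheduler divides the application replicas according to the `weightPreference` weight table. In the initial scheduling result, member1 has 1 replica and member2 has 2 replicas.
The Karmada scheduler divides the application replicas according to the `weightPreference` weight table. In the initial scheduling result, member1 has 1 replica and member2 has 2 replicas.

After member1 fails, rescheduling is triggered, and the final scheduling result is 3 replicas on member2.
After member1 fails, rescheduling is triggered, and the final scheduling result is 3 replicas on member2.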
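The replica counts in this example are consistent with a proportional split by weight. The sketch below assumes static weights of 1 for member1 and 2 for member2 (the policy itself is not shown in this hunk) and 3 replicas in total. The function name `divide_replicas` and the tie-breaking rule for leftover replicas are hypothetical, not the scheduler's actual algorithm.

```python
from typing import Dict


def divide_replicas(replicas: int, weights: Dict[str, int]) -> Dict[str, int]:
    """Split replicas across clusters in proportion to their weights."""
    total = sum(weights.values())
    if total == 0:
        return {}
    # Floor of each cluster's proportional share.
    result = {name: replicas * w // total for name, w in weights.items()}
    remainder = replicas - sum(result.values())
    # Assumption: leftover replicas go to the highest-weight clusters first.
    for name in sorted(weights, key=lambda n: (-weights[n], n))[:remainder]:
        result[name] += 1
    return result


# Before the failure: both clusters are available.
before = divide_replicas(3, {"member1": 1, "member2": 2})
# After member1 fails it is dropped from the weight table.
after = divide_replicas(3, {"member2": 2})
```

Under these assumptions `before` places 1 replica on member1 and 2 on member2, and `after` places all 3 replicas on member2, matching the outcome described above.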

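## Graceful failover

To prevent service interruption during cluster failover, Karmada needs to ensure that the deletion of application replicas in the failed cluster is delayed until the replicas are available in the new cluster.
To prevent service interruption during cluster failover, Karmada needs to ensure that the deletion of application replicas in the failed cluster is delayed until the replicas are available in the new cluster.

The [GracefulEvictionTasks](https://github.com/karmada-io/karmada/blob/12e8f01d01571932e6fe45cb7f0d1bffd2e40fd9/pkg/apis/work/v1alpha2/binding_types.go#L75-L89) field has been added to `ResourceBinding/ClusterResourceBinding` to represent the graceful eviction task queue.
The [GracefulEvictionTasks](https://github.com/karmada-io/karmada/blob/12e8f01d01571932e6fe45cb7f0d1bffd2e40fd9/pkg/apis/work/v1alpha2/binding_types.go#L75-L89) field has been added to `ResourceBinding/ClusterResourceBinding` to represent the graceful eviction task queue.

When the taint-manager removes a failed cluster from the resource scheduling result, the cluster is added to the graceful eviction task queue.
When the taint-manager removes a failed cluster from the resource scheduling result, the cluster is added to the graceful eviction task queue.

The `gracefulEvction` controller is responsible for processing the tasks in the graceful eviction task queue. During processing, the `gracefulEvction` controller evaluates the tasks one by one to decide whether each can be removed from the queue. The conditions are as follows:
The `gracefulEvction` controller is responsible for processing the tasks in the graceful eviction task queue. During processing, the `gracefulEvction` controller evaluates the tasks one by one to decide whether each can be removed from the queue. The conditions are as follows:
- Check the health status of the resource in the current resource scheduling result. If the resource is healthy, the condition is met.
- Check whether the current task has waited longer than the timeout, which can be configured via the `graceful-evction-timeout` flag (default is 10 minutes). If so, the condition is met.
- Check whether the current task has waited longer than the timeout, which can be configured via the `graceful-evction-timeout` flag (default is 10 minutes). If so, the condition is met.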
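The two removal conditions can be sketched as a simple predicate over the task queue. This is an illustrative model only: `EvictionTask`, `can_remove`, and `process_queue` are hypothetical names, health is reduced to a single boolean, and the 600-second default mirrors the 10-minute timeout mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvictionTask:
    from_cluster: str      # cluster the workload is being evicted from
    waited_seconds: float  # how long the task has been in the queue


def can_remove(task: EvictionTask, resource_healthy: bool,
               timeout_seconds: float = 600.0) -> bool:
    """A task may leave the queue once either condition from the list is met."""
    return resource_healthy or task.waited_seconds > timeout_seconds


def process_queue(tasks: List[EvictionTask], resource_healthy: bool,
                  timeout_seconds: float = 600.0) -> List[EvictionTask]:
    """Return the tasks that must stay in the queue after one evaluation pass."""
    return [t for t in tasks
            if not can_remove(t, resource_healthy, timeout_seconds)]
```

While the resource is still unhealthy on the new clusters, only tasks that have exceeded the timeout are dropped; once it becomes healthy, the whole queue drains.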
2 changes: 1 addition & 1 deletion sidebars.js
@@ -82,9 +82,9 @@ module.exports = {
type: "category",
label: "Multi-cluster Failover",
items: [
"userguide/failover/cluster-status-maintenance",
"userguide/failover/cluster-failover",
"userguide/failover/cluster-taint-management",
"userguide/failover/determine-cluster-failures",
"userguide/failover/failover-analysis",
"userguide/failover/application-failover",
],