10 changes: 4 additions & 6 deletions docs/recover-control-plane.md
@@ -3,11 +3,6 @@

To recover from broken nodes in the control plane use the "recover\-control\-plane.yml" playbook.

* Backup what you can
* Provision new nodes to replace the broken ones
* Place the surviving nodes of the control plane first in the "etcd" and "kube\_control\_plane" groups
* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube\_control\_plane" groups

Examples of what broken means in this context:

* One or more bare metal node(s) suffer from unrecoverable hardware failure
@@ -19,8 +14,12 @@ __Note that you need at least one functional node to be able to recover using th

## Runbook

* Backup what you can
* Provision new nodes to replace the broken ones
* Move any broken etcd nodes into the "broken\_etcd" group, make sure the "etcd\_member\_name" variable is set.
* Move any broken control plane nodes into the "broken\_kube\_control\_plane" group.
* Place the surviving nodes of the control plane first in the "etcd" and "kube\_control\_plane" groups
* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube\_control\_plane" groups

Then run the playbook with ```--limit etcd,kube_control_plane``` and increase the number of ETCD retries by setting ```-e etcd_retries=10``` or something even larger. The amount of retries required is difficult to predict.
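
As an illustration of the runbook steps above, a minimal inventory sketch might look like the following. The host names (`node1` to `node4`) and the member name `etcd3` are hypothetical, and the layout assumes a YAML inventory with one surviving pair of nodes, one broken node, and one replacement node; only the group names and the `etcd_member_name` variable come from the documentation itself.

```yaml
# Hypothetical inventory sketch: node1 and node2 survived, node3 is broken,
# node4 is the newly provisioned replacement. All host names are assumptions.
all:
  children:
    kube_control_plane:
      hosts:
        node1:   # surviving nodes are listed first
        node2:
        node4:   # new nodes go below the surviving ones
    etcd:
      hosts:
        node1:
        node2:
        node4:
    broken_kube_control_plane:
      hosts:
        node3:
    broken_etcd:
      hosts:
        node3:
          etcd_member_name: etcd3   # assumed value; must match the broken member's name
```

With an inventory shaped like this, the playbook would be run with the `--limit` and `-e etcd_retries` options described above.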

@@ -35,7 +34,6 @@ The playbook attempts to figure out it the etcd quorum is intact. If quorum is l
## Caveats

* The playbook has only been tested with fairly small etcd databases.
* If your new control plane nodes have new ip addresses you may have to change settings in various places.
* There may be disruptions while running the playbook.
* There are absolutely no guarantees.

1 change: 1 addition & 0 deletions roles/recover_control_plane/etcd/tasks/main.yml
@@ -39,6 +39,7 @@
  delegate_to: "{{ item }}"
  with_items: "{{ groups['broken_etcd'] }}"
  ignore_errors: true  # noqa ignore-errors
  ignore_unreachable: true
  when:
    - groups['broken_etcd']
    - has_quorum