Wednesday, March 21, 2018

Dealing with failed hosts: Ansible playbooks - Part 1

Summary:
When using the serial keyword, a single failed host aborts the entire playbook with the error "FATAL: all hosts have already failed -- aborting".
When the serial keyword is removed, all hosts are evaluated and a single failure does not abort the playbook run.
Steps To Reproduce:
Create an inventory with a "test" group with three hosts, A, B, and C. Save the following playbook.
# test_fail.yml

---
- hosts: host1:host2:host3
  gather_facts: no
  tasks:
    - ping:
    - fail:
      when: inventory_hostname == 'host3'

- hosts: host4:host5
  gather_facts: no
  tasks:
    - ping:

- hosts: host6
  gather_facts: no
  tasks:
    - ping:

- hosts: host7:host8
  gather_facts: no
  tasks:
    - ping:
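
The playbook above does not set serial, while the run output in this post comes from a play against the "test" group with four named tasks. A minimal sketch of a serial playbook that would behave that way might look like the following. The task bodies (debug, command: /bin/true) and the failure condition on host B are assumptions inferred from the task names and results in the output, not the original file.

```yaml
# testserial.yml (hypothetical reconstruction; task bodies are assumptions)
---
- hosts: test
  serial: 1                      # process one host per batch
  tasks:
    - name: Who am I?
      debug:
        msg: "{{ inventory_hostname }}"

    - name: Test action
      command: /bin/true         # placeholder command, always reports "changed"

    - name: Test failure
      fail:
        msg: Bad host that fails
      when: inventory_hostname == 'B'   # assumed: only host B fails

    - name: Next action
      command: /bin/true         # placeholder command
```

With serial: 1, each batch contains exactly one host, so a failure on B means every host in that batch has failed.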


Run the playbook.
Expected Results:
This playbook should iterate through all three hosts, executing each action in sequence before moving on to the next host. A failure on one host should abort any further actions on that one host and move on to the next.
Actual Results:
The first failure that occurs aborts the entire playbook.
Run output:
$ ansible-playbook -i testinv testserial.yml 

PLAY [test] ******************************************************************* 

GATHERING FACTS *************************************************************** 
ok: [A]
ok: [C]
ok: [B]

TASK: [Who am I?] ************************************************************* 
ok: [C] => {
    "msg": "C"
}

TASK: [Test action] *********************************************************** 
changed: [C]

TASK: [Test failure] ********************************************************** 
skipping: [C]

TASK: [Next action] *********************************************************** 
changed: [C]

TASK: [Who am I?] ************************************************************* 
ok: [B] => {
    "msg": "B"
}

TASK: [Test action] *********************************************************** 
changed: [B]

TASK: [Test failure] ********************************************************** 
failed: [B] => {"failed": true}
msg: Bad host that fails

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/home/slack/testserial.retry

A                          : ok=1    changed=0    unreachable=0    failed=0   
B                          : ok=3    changed=1    unreachable=0    failed=1   
C                          : ok=4    changed=2    unreachable=0    failed=0   

When using serial: 1, you're telling the play to operate on only 1 host at a time. The working group failure percentage is based on the serial count (see #4407), so when there is only 1 host in the group and it fails, the failure percentage will always be 100%. To move past this error, you should increase the serial group size or use ignore_errors: yes on tasks that are not fatal to the workflow of your playbook.
ignore_errors causes the playbook to continue executing the remaining tasks on that host. It does not help if we want to stop executing tasks on the failed host but continue the playbook with the next host.
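
As a rough sketch of the second suggestion, ignore_errors can be set on the individual task that is allowed to fail. The play and task below reuse the names from the run output; the failure condition on host B is an assumption made for illustration.

```yaml
# Sketch only: the failing task and its condition are assumed
- hosts: test
  serial: 1
  tasks:
    - name: Test failure
      fail:
        msg: Bad host that fails
      when: inventory_hostname == 'B'
      ignore_errors: yes   # failure is reported and ignored; the host is not marked failed
```

The other suggestion, a larger batch, would mean changing serial: 1 to a higher value such as serial: 3, so that one failed host is no longer 100% of the batch.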
