Red Hat Ansible for Networking Automation – 1 of 2

October 2, 2018

By Taylor Owen, Kovarus Automation Solution Architect

Before we dive into the specifics of network automation with Red Hat Ansible, let’s first look at what Ansible is, what features make it compelling to use in the first place, and the common problems one might run into while using it.

One of the great things about Ansible, and Red Hat products in general, is that both open-source and enterprise versions are available. An organization can “kick the tires,” see if it fits current use cases, and then get enterprise-level support. Ansible’s backend is written in Python, and Python tends to be one of the more user-friendly programming languages. One of the most cited capabilities of Ansible is being agentless. The GitHub statistics for the Ansible open-source project show over 3,700 contributors and over 1,800 modules. As of this article, Ansible is approximately 6 years old. That is an impressive level of adoption in a short amount of time. I attribute most of this to Ansible being written in Python, to the low level of effort needed to get started because it is agentless, and to its support for many different platforms through its 1,800+ modules.

Now that we’ve talked about some of the positives of Ansible, let’s discuss the number one complaint I hear about it: “It is slow.” While I agree that it can be slow by default, there are many tweaks that can speed it up. First, to understand why it can be slow, we need to understand more about the transport that Ansible uses, which is SSH.

WARNING MATH AHEAD!

Ansible leverages the SSH parameter called “ControlPersist.” By default, this is set to 60 seconds. This means that for every minute Ansible spends executing tasks against a node, a new SSH connection is created. If each connection takes about 3 seconds to initiate and a playbook contains many tasks, that time becomes costly. Multiply that by the number of nodes, divide by the number of forks, and it adds up quickly. We can approximate the time wasted with the following formula:

Time on SSH = (SSH initiation time) x (num of tasks) x (num of nodes) x (avg task duration) / [(num of forks) x (ControlPersist timeout)]

Let’s take the following scenario:

SSH initiation time = 3 seconds

num of tasks = 60

num of nodes = 100

avg task duration = 6 seconds

num of forks = 5

ControlPersist timeout = 60 seconds

Using the formula above, an estimated 360 seconds, or 6 minutes, is wasted just initiating SSH connections. More nodes, more tasks, or higher network latency all push that number higher. If I haven’t lost your interest yet, then great! That will be the only math we do for the rest of this two-part blog post, I promise!
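If you would rather let a script do the arithmetic, the formula above can be sketched in a few lines of Python. The function and parameter names here are my own illustration, not part of Ansible, and the result is only the rough estimate the formula gives.

```python
def ssh_overhead_seconds(ssh_init_s, num_tasks, num_nodes,
                         avg_task_s, forks, control_persist_s):
    """Approximate seconds spent re-initiating SSH, per the formula above."""
    # Total task time across every node, before dividing the work among forks.
    total_task_time = num_tasks * num_nodes * avg_task_s
    return ssh_init_s * total_task_time / (forks * control_persist_s)


# The scenario from this post (variable name is illustrative).
scenario = dict(ssh_init_s=3, num_tasks=60, num_nodes=100,
                avg_task_s=6, forks=5, control_persist_s=60)

wasted = ssh_overhead_seconds(**scenario)
print(f"{wasted:.0f} seconds, about {wasted / 60:.0f} minutes")  # 360 seconds, about 6 minutes
```

Plugging in other values shows which inputs matter. Under this formula, raising forks from 5 to 50, or ControlPersist from 60 to 600 seconds, each cuts the estimate by a factor of ten.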

Obviously, we need to look at ways to reduce the time wasted. While we cannot reduce the number of nodes or the time it takes for SSH to initiate, there are a few other ways to cut it down. Most of these changes can be made in your ansible.cfg file. I’m going to go over a few of them briefly for awareness.

Forks are one of the first things to increase to reduce overall run time. The forks setting controls how many processes run concurrently, and by default it is set to 5. This means that Ansible will only configure 5 hosts at a time. As we increase the number of forks, we get a reduction in time spent. The tradeoff here is system resources. The recommendation is 4 GB of RAM per 100 forks. For most use cases this is more than enough. There is a point of diminishing returns when the limits of CPU and bandwidth start being exhausted. In very large-scale deployments, network bandwidth starts to become the limiting factor due to SSH and the content being delivered to the nodes.

Pipelining is the next tweak that is often made. It reduces the number of network operations required to execute a task, so more tasks can be executed over each connection. There are some caveats to this, such as ensuring “requiretty” is disabled in the sudoers file on the nodes.

The last common tweak is increasing the timeout for “ControlPersist.” As I mentioned earlier, this is set to 60 seconds by default. Ansible recommends that this be increased to 30 minutes, depending on your use cases. This allows the SSH connections to stay open longer and thus reduces the number of times SSH has to be reinitiated during a playbook run.
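To make the three tweaks above concrete, here is a small Python sketch that renders what the matching ansible.cfg entries might look like. The helper name and the forks value of 50 are hypothetical, and the section and option names (forks under [defaults]; pipelining and ssh_args under [ssh_connection]) are the ones I understand ansible.cfg to use, so check them against the documentation for your Ansible version before relying on them.

```python
import configparser
import io


def render_ansible_cfg(forks=50, control_persist="30m"):
    """Return ansible.cfg text with the three tweaks discussed above."""
    # Disable interpolation so values are written exactly as given.
    cfg = configparser.ConfigParser(interpolation=None)
    # More forks means more hosts worked on concurrently (illustrative value).
    cfg["defaults"] = {"forks": str(forks)}
    cfg["ssh_connection"] = {
        # Requires "requiretty" to be disabled in sudoers on the nodes.
        "pipelining": "True",
        # Keep the shared SSH connection open longer than the 60s default.
        "ssh_args": f"-o ControlMaster=auto -o ControlPersist={control_persist}",
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


print(render_ansible_cfg())
```

The printed text can be pasted into an ansible.cfg file. Treat the values as a starting point and adjust them to the RAM, CPU, and bandwidth of your control node.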

It is also good to look at the different Ansible playbook strategies that can be leveraged. By default, Ansible uses the “linear” strategy, where all hosts run each task before any host starts the next task. The interesting thing about playbook strategies is that Ansible treats them as plugins, so there are a few out there. One interesting plugin is Mitogen, found at https://mitogen.readthedocs.io/en/stable/ansible.html.

Now that we have talked about the high-level benefits of Ansible and how to resolve some of the issues one may run into, in the next post we’ll talk about using Ansible for network automation.