Parallel SSH Execution

Short definition

Parallel SSH execution is the practice of running the same SSH command on multiple remote servers simultaneously rather than sequentially, reducing the total time for fleet-wide operations from minutes or hours to roughly the time the slowest single server takes.

Extended definition

When a command needs to run on hundreds of servers, sequential execution is impractical. If each SSH connection takes two seconds to establish and the command takes three seconds to run, a fleet of two hundred servers takes 1,000 seconds (over sixteen minutes) sequentially. With parallel execution, all two hundred connections are opened simultaneously, and the entire operation completes in roughly the time taken by the slowest server. Parallel SSH execution is the standard approach for fleet-wide operations, including nginx reloads, cache purges, config deployments, and health checks. It requires that the control server can establish many simultaneous outbound SSH connections, which is typically constrained by the per-process open file descriptor limit rather than by network bandwidth.
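The arithmetic above can be checked with a short calculation. This is an idealised sketch: the function names and the max_concurrent parameter are illustrative, and it assumes every host takes exactly the same time.

```python
import math


def sequential_seconds(hosts, connect_s, run_s):
    """Wall-clock time when hosts are handled one after another."""
    return hosts * (connect_s + run_s)


def parallel_seconds(hosts, connect_s, run_s, max_concurrent):
    """Idealised wall-clock time with at most max_concurrent hosts in flight.

    Hosts are processed in waves; each wave costs one connect plus one run.
    """
    waves = math.ceil(hosts / max_concurrent)
    return waves * (connect_s + run_s)


print(sequential_seconds(200, 2, 3))     # 1000 seconds, i.e. 16 min 40 s
print(parallel_seconds(200, 2, 3, 200))  # 5 seconds, all hosts at once
print(parallel_seconds(200, 2, 3, 50))   # 20 seconds, four waves of fifty
```

Even with concurrency capped at fifty, the idealised total drops from 1,000 seconds to 20.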

Deep technical explanation

Threading vs async models: Parallel SSH tools typically use one of two concurrency models. Thread-based tools like Fabric’s ThreadingGroup spawn one thread per target host, each managing its own SSH session. Async tools use an event loop to multiplex many connections on fewer threads. For fleets of hundreds of servers, both approaches work, though async scales more efficiently at very high connection counts.
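The two models can be contrasted with a minimal sketch that uses only the Python standard library. The fake_ssh_* functions are stand-ins that simulate latency with a sleep; no real SSH connection is made, and none of this is taken from a particular tool's implementation.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_ssh_blocking(host, latency=0.05):
    """Stand-in for a blocking SSH call; it occupies a thread while waiting."""
    time.sleep(latency)
    return f"{host}: ok"


async def fake_ssh_async(host, latency=0.05):
    """Stand-in for a non-blocking SSH call; it yields to the event loop."""
    await asyncio.sleep(latency)
    return f"{host}: ok"


def run_threaded(hosts):
    """Thread model: one worker thread per target host."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return list(pool.map(fake_ssh_blocking, hosts))


async def _gather(hosts):
    # All coroutines share a single thread and are multiplexed by the loop.
    return await asyncio.gather(*(fake_ssh_async(h) for h in hosts))


def run_async(hosts):
    """Async model: one event loop multiplexing every host."""
    return asyncio.run(_gather(hosts))
```

Both functions return results in host order. The difference is in resource use: the threaded version holds one OS thread per host, while the async version holds one coroutine per host on a single thread.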

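Connection limits: The operating system caps the number of file descriptors each process may hold open; ulimit -n shows and sets the soft limit for the current shell. Each SSH connection consumes at least one file descriptor, and more if the tool spawns an ssh subprocess with pipes. For large fleets, this limit should be raised before running parallel operations, for example in /etc/security/limits.conf for login sessions managed by PAM, or with the LimitNOFILE= directive when the tool runs as a systemd service.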
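A tool can also check the limit from inside its own process before opening connections. The following is a minimal sketch for Unix-like systems using Python's resource module; ensure_fd_limit and the budget figures in the usage comment are hypothetical, not measured values.

```python
import resource


def ensure_fd_limit(needed):
    """Raise this process's soft RLIMIT_NOFILE to at least `needed`.

    Returns the soft limit in effect afterwards. Raises RuntimeError when
    the hard limit is too low, since an unprivileged process cannot exceed it.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= needed:
        return soft  # already sufficient, nothing to change
    if hard != resource.RLIM_INFINITY and needed > hard:
        raise RuntimeError(
            f"need {needed} descriptors but the hard limit is {hard}; "
            "raise it in the system configuration first"
        )
    # Only the soft limit is changed; the hard limit is left as it was.
    resource.setrlimit(resource.RLIMIT_NOFILE, (needed, hard))
    return needed


# Hypothetical budget: 200 hosts, 4 descriptors each, plus headroom.
# ensure_fd_limit(200 * 4 + 64)
```

Raising the soft limit this way affects only the current process and its children, so it does not replace a system-level change when the hard limit itself is too low.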

Partial failure handling: In any parallel operation across a large fleet, some connections will fail due to network issues, overloaded hosts, or misconfiguration. A well-designed parallel SSH tool collects results from all hosts independently, reports which succeeded and which failed, and does not abort the entire operation because one host was unreachable.
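As one illustration of per-host result collection, the sketch below uses Fabric's ThreadingGroup, whose run method returns (or attaches to a GroupException) a mapping from each connection to either a result or the exception raised for that host. The partition and run_everywhere helpers are hypothetical names, and authentication is assumed to come from the existing SSH configuration.

```python
def partition(results):
    """Split a mapping of connection -> result-or-exception by outcome.

    Returns (ok, failed): ok maps host to its stdout, failed maps host to
    a short reason string.
    """
    ok, failed = {}, {}
    for conn, outcome in results.items():
        if isinstance(outcome, BaseException):
            failed[conn.host] = f"error: {outcome!r}"  # e.g. unreachable
        elif not outcome.ok:
            failed[conn.host] = f"exit code {outcome.exited}"
        else:
            ok[conn.host] = outcome.stdout.strip()
    return ok, failed


def run_everywhere(hosts, command):
    """Run `command` on every host and report successes and failures."""
    # Imported lazily so partition() can be used without Fabric installed.
    from fabric import ThreadingGroup
    from fabric.exceptions import GroupException

    group = ThreadingGroup(*hosts)
    try:
        # warn=True: a non-zero exit status comes back as a result
        # instead of raising, so one failing host does not mask the rest.
        results = group.run(command, hide=True, warn=True)
    except GroupException as exc:
        # Raised when at least one host hit an exception; exc.result
        # still carries an entry for every host.
        results = exc.result
    finally:
        group.close()
    return partition(results)
```

The failed mapping gives the exact set of hosts to retry, while the hosts in ok need no further action.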

Batching: For operations with side effects (restarts, config changes), running against the entire fleet simultaneously may be undesirable. Batching divides the target set into groups and processes one group at a time with a configurable delay between batches, limiting the blast radius of an error while still being much faster than pure sequential execution.
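Batching can be sketched in a few lines. Here run_batch is a hypothetical callable that runs the operation on one batch in parallel and returns the hosts that failed. Stopping at the first batch with a failure is one possible policy, assumed here for illustration rather than described in the text above.

```python
import time


def batches(hosts, size):
    """Yield consecutive slices of `hosts` containing at most `size` items."""
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def run_in_batches(hosts, size, run_batch, delay_s=0.0, sleep=time.sleep):
    """Process hosts batch by batch, pausing delay_s between batches.

    Returns (done, failed, skipped). Processing stops after the first batch
    that reports a failure, so later batches are never touched.
    """
    done, failed, skipped = [], [], []
    groups = list(batches(hosts, size))
    for index, batch in enumerate(groups):
        batch_failed = set(run_batch(batch))
        failed.extend(h for h in batch if h in batch_failed)
        done.extend(h for h in batch if h not in batch_failed)
        if batch_failed:
            for rest in groups[index + 1:]:
                skipped.extend(rest)
            break
        if index < len(groups) - 1:
            sleep(delay_s)  # let the previous batch settle before moving on
    return done, failed, skipped
```

With two hundred hosts and a batch size of twenty, an error introduced by the change reaches at most twenty servers before the run stops.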

Idempotency: Commands run via parallel SSH should be idempotent where possible, meaning running the same command twice produces the same result as running it once. This allows safe retries against failed hosts without manual intervention.
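The difference is easiest to see with a command that adds a line to a file. The sketch below builds two shell commands as strings, suitable for passing to an SSH runner; the function names and the example line are hypothetical, and the commands assume a POSIX shell with grep and printf on the target.

```python
import shlex


def append_line_cmd(line, path):
    """NOT idempotent: every run appends another copy of the line."""
    return f"printf '%s\\n' {shlex.quote(line)} >> {shlex.quote(path)}"


def ensure_line_cmd(line, path):
    """Idempotent: appends the line only if no identical line exists yet."""
    q_line, q_path = shlex.quote(line), shlex.quote(path)
    # grep -q: quiet, -x: match the whole line, -F: treat it as a fixed string.
    return (
        f"grep -qxF -- {q_line} {q_path} 2>/dev/null "
        f"|| printf '%s\\n' {q_line} >> {q_path}"
    )
```

Retrying ensure_line_cmd against a host that may or may not have completed the first attempt leaves the file with exactly one copy of the line, whereas retrying append_line_cmd can leave duplicates.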

How BlueGrid.io uses it

BlueGrid.io’s fleet management pipeline executes all nginx operations in parallel across the entire target server set using Fabric’s ThreadingGroup. A config change that would take thirty minutes sequentially across two hundred servers completes in under a minute. Results are collected per server, and failures are surfaced immediately for targeted retry, without affecting the servers that completed successfully.

Why it matters

Sequential execution does not scale. At a certain fleet size, sequential SSH operations take longer than the maintenance windows available, making some operational tasks practically impossible without parallel execution. Parallel SSH is what makes centralised management of large infrastructure feasible from a single control point.