
Modbus TCP Gateway Failover: Building Redundant PLC Communication for Manufacturing [2026]


[Figure: Modbus TCP gateway failover architecture]

Modbus TCP remains the most widely deployed industrial protocol in manufacturing. Despite being a 1979 design extended to Ethernet in 1999, its simplicity — request/response over TCP, 16-bit registers, four function codes that cover 90% of use cases — makes it the lowest common denominator that virtually every PLC, VFD, and sensor hub supports.

But simplicity has a cost: Modbus TCP has zero built-in redundancy. No heartbeats. No automatic reconnection. No session recovery. When the TCP connection drops — and in a factory environment with electrical noise, cable vibrations, and switch reboots, it will drop — your data collection goes dark until someone manually restarts the gateway or the application logic handles recovery.

This guide covers the architecture patterns for building resilient Modbus TCP gateways that maintain data continuity through link failures, PLC reboots, and network partitions.

Understanding Why Modbus TCP Connections Fail

Before designing failover, you need to understand the failure modes. In a year of operating Modbus TCP gateways across manufacturing floors, you'll encounter all of these:

Failure Mode 1: TCP Connection Reset (ECONNRESET)

The PLC or an intermediate switch drops the TCP connection. Common causes:

  • PLC firmware update or watchdog reboot
  • Switch port flap (cable vibration, loose connector)
  • PLC connection limit exceeded (most support 6-16 simultaneous TCP connections)
  • Network switch spanning tree reconvergence (can take 30-50 seconds on older managed switches)

Detection time: Immediate — the next modbus_read_registers() call returns ECONNRESET.

Failure Mode 2: Connection Timeout (ETIMEDOUT)

The PLC stops responding but doesn't close the connection. The TCP socket remains open, but reads time out. Common causes:

  • PLC CPU overloaded (complex ladder logic consuming all scan cycles)
  • Network congestion (broadcast storms, misconfigured VLANs)
  • IP conflict (another device grabbed the PLC's address)
  • PLC in STOP mode (program halted, communication stack still partially active)

Detection time: Your configured response timeout (typically 500ms-2s) per read operation. For a 100-tag poll cycle, a full timeout can mean 50-200 seconds of dead time before you confirm the link is down.

Failure Mode 3: Connection Refused (ECONNREFUSED)

The PLC's TCP stack is active but Modbus is not. Common causes:

  • PLC in bootloader mode after firmware flash
  • Modbus TCP server disabled in PLC configuration
  • Firewall rule change on managed switch blocking port 502

Detection time: Immediate on the next connection attempt.

Failure Mode 4: Silent Failure (EPIPE/EBADF)

The connection appears open from the gateway's perspective, but the PLC has already closed it. The first write or read on a stale socket triggers EPIPE or EBADF. This happens when:

  • PLC reboots cleanly but the gateway missed the FIN packet (e.g. the segment was dropped during the reboot or a port flap)
  • OS socket cleanup runs asynchronously

Detection time: Only on the next read/write attempt — could be seconds to minutes if polling intervals are long.

The Connection Recovery State Machine

A resilient Modbus TCP gateway implements a state machine with five states:

                   ┌─────────────┐
                   │ CONNECTING  │
                   │  (backoff)  │
                   └──────┬──────┘
                          │ modbus_connect() success
                   ┌──────▼──────┐
         ┌─────────│  CONNECTED  │─────────┐
         │         │  (polling)  │         │
         │         └──────┬──────┘         │
         │                │                │
   timeout/error     link_state=1      read error
         │                │                │
  ┌──────▼─────┐    ┌─────▼─────┐   ┌──────▼──────┐
  │ RECONNECT  │    │  READING  │   │  LINK_DOWN  │
  │  (flush +  │    │  (normal) │   │  (notify +  │
  │   close)   │    │           │   │  reconnect) │
  └──────┬─────┘    └───────────┘   └──────┬──────┘
         │                                 │
         └─────────────────────────────────┘
                   close + backoff
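The state machine above can be captured as a plain transition function. This is a minimal sketch — the enum and event names are illustrative, not from a real gateway codebase — but it shows how each state reacts to only the events that matter to it:

```c
/* Hypothetical sketch of the five-state recovery machine. Event names
   and the transition table are illustrative. */
typedef enum { ST_CONNECTING, ST_CONNECTED, ST_READING,
               ST_RECONNECT, ST_LINK_DOWN } gw_state_t;

typedef enum { EV_CONNECT_OK, EV_LINK_UP, EV_READ_ERROR,
               EV_TIMEOUT, EV_CLOSED } gw_event_t;

gw_state_t next_state(gw_state_t s, gw_event_t e) {
    switch (s) {
    case ST_CONNECTING:
        return (e == EV_CONNECT_OK) ? ST_CONNECTED : ST_CONNECTING;
    case ST_CONNECTED:
        if (e == EV_LINK_UP)    return ST_READING;   /* link_state=1 */
        if (e == EV_TIMEOUT)    return ST_RECONNECT; /* flush + close */
        if (e == EV_READ_ERROR) return ST_LINK_DOWN; /* notify + reconnect */
        return s;
    case ST_READING:
        return (e == EV_READ_ERROR || e == EV_TIMEOUT) ? ST_RECONNECT : s;
    case ST_RECONNECT:
    case ST_LINK_DOWN:
        return (e == EV_CLOSED) ? ST_CONNECTING : s; /* close + backoff */
    }
    return s;
}
```

Keeping the transitions in one pure function makes the recovery logic testable without a PLC on the bench — you can drive it with synthetic error events.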

Key Implementation Details

1. Always close before reconnecting. A stale Modbus context will leak file descriptors and eventually exhaust the OS socket table. When any error occurs in the ETIMEDOUT/ECONNRESET/EPIPE/EBADF family, the correct sequence is:

modbus_flush(context)    → drain pending data
modbus_close(context)    → close the TCP socket
sleep(backoff_ms)        → prevent reconnection storms
modbus_connect(context)  → establish new connection

Never call modbus_connect() on a context that hasn't been closed first. The libmodbus library doesn't handle this gracefully — you'll get zombie sockets.

2. Implement exponential backoff with a ceiling. After a connection failure, don't retry immediately — the PLC may be rebooting and needs time. A practical backoff schedule:

Attempt   Delay                  Cumulative Time
1         1 second               1s
2         2 seconds              3s
3         4 seconds              7s
4         8 seconds              15s
5+        10 seconds (ceiling)   25s+

The 10-second ceiling is important — you don't want the backoff growing to minutes. PLC reboots typically complete in 15-45 seconds. A 10-second retry interval means you'll reconnect within one retry cycle after the PLC comes back.
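The schedule above is a plain doubling delay with a cap. A minimal sketch (the function name is illustrative):

```c
/* Backoff schedule from the table: 1s, 2s, 4s, 8s, then a 10s ceiling.
   attempt starts at 1. */
unsigned backoff_seconds(unsigned attempt) {
    if (attempt >= 5)
        return 10;              /* ceiling: never wait longer than 10s */
    return 1u << (attempt - 1); /* 1, 2, 4, 8 */
}
```

The early-return for attempt ≥ 5 also avoids shift overflow if the attempt counter keeps climbing during a long outage.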

3. Flush serial buffers for Modbus RTU. If your gateway also supports Modbus RTU (serial), always call modbus_flush() before reading after a reconnection. Serial buffers can contain stale response fragments from before the disconnection, and these will corrupt the first read's response parsing.

4. Track link state as a first-class data point. Don't just log connection status — deliver it to the cloud alongside your tag data. A special "link state" tag (boolean: 0 = disconnected, 1 = connected) transmitted immediately (not batched) gives operators real-time visibility into gateway health. When the link transitions from 1→0, send a notification. When it transitions from 0→1, force-read all tags to establish current values.

Register Grouping: Minimizing Round Trips

Modbus TCP's request/response model means each read operation incurs a full TCP round trip (~0.5-5ms on a local network, 50-200ms over cellular). Reading 100 individual registers one at a time takes 100 round trips — potentially 500ms on a good day.

The optimization is contiguous register grouping — instead of reading registers one at a time, read blocks of contiguous registers in a single request.

The Grouping Algorithm

Given a sorted list of register addresses to read, the gateway walks through them and groups contiguous registers that meet four criteria:

  1. Same function code — you can't mix input registers (FC 4, 3xxxxx) with holding registers (FC 3, 4xxxxx) in one request
  2. Contiguous addresses — each register in the group immediately follows the previous one
  3. Same polling interval — don't group a 1-second alarm tag with a 60-second temperature tag
  4. Maximum register count ≤ 50 — while Modbus allows up to 125 registers per read, keeping requests under 50 registers (~100 bytes) prevents fragmentation issues on constrained networks and limits the blast radius of a single failed read
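A minimal sketch of the grouping walk, assuming the addresses are pre-sorted and already share one function code and polling interval (the struct and function names are illustrative):

```c
#include <stddef.h>

#define MAX_GROUP 50  /* stay well under the Modbus limit of 125 */

typedef struct { unsigned start; unsigned count; } reg_group_t;

/* addrs must be sorted ascending. Writes up to max_out groups into out
   and returns the number of groups produced. */
size_t group_registers(const unsigned *addrs, size_t n,
                       reg_group_t *out, size_t max_out) {
    size_t g = 0;
    for (size_t i = 0; i < n && g < max_out; g++) {
        out[g].start = addrs[i];
        out[g].count = 1;
        i++;
        /* extend the group while the next address is contiguous */
        while (i < n && addrs[i] == out[g].start + out[g].count
                     && out[g].count < MAX_GROUP) {
            out[g].count++;
            i++;
        }
    }
    return g;
}
```

Each resulting group maps to exactly one Modbus read request (start address + register count).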

Example: Optimized vs Naive Polling

Consider a chiller with 10 compressor circuits, each reporting 16 process variables:

Naive approach: 160 individual reads = 160 round trips

Read register 300003 → 1 register  (CQT1 Condenser Inlet Temp)
Read register 300004 → 1 register (CQT1 Approach Temp)
Read register 300005 → 1 register (CQT1 Chill In Temp)
...
Read register 300016 → 1 register (CQT1 Superheat Temp)

Grouped approach: Registers 300003-300018 are contiguous, same function code (FC 4), same interval (60s)

Read registers 300003 → 16 registers (all CQT1 process data in ONE request)
Read registers 300350 → 16 registers (all CQT2 process data in ONE request)
...

Result: 160 round trips → 10 round trips. On a 2ms RTT network, that's 320ms → 20ms.

Handling Non-Contiguous Gaps

Real PLC register maps aren't perfectly contiguous. The chiller above has CQT1 data at registers 300003-300018 and CQT2 data starting at 300350 — a gap of 332 registers. Don't try to read 300003-300695 in one request to "fill the gap" — you'll read hundreds of irrelevant registers and waste bandwidth.

Instead, break at non-contiguous boundaries:

Group 1: 300003-300018  (16 registers, CQT1 process data)
Group 2: 300022-300023 (2 registers, CQT1 alarm bits)
Group 3: 300038-300043 (6 registers, CQT1 expansion + version)
Group 4: 300193-300194 (2 registers, CQT1 status words)
Group 5: 300260-300278 (19 registers, CQT2-10 alarm bits)
Group 6: 300350-300366 (17 registers, CQT2-3 temperatures)
...

The 50ms Inter-Read Delay

Between consecutive Modbus read requests, insert a 50ms delay. This sounds counterintuitive — why slow down? — but it serves two purposes:

  1. PLC scan cycle breathing room. Many PLCs process Modbus requests in their communication interrupt, which competes with the main scan cycle. Rapid-fire requests can extend the scan cycle, triggering watchdog timeouts on safety-critical programs.

  2. TCP congestion avoidance. On constrained networks (especially cellular gateways), bursting 50 reads in 100ms can overflow buffers. The 50ms spacing distributes the load evenly.

Dual-Path Failover Architecture

For mission-critical data collection (pharmaceutical batch records, automotive quality traceability), a single gateway represents a single point of failure. The dual-path architecture uses two independent gateways polling the same PLC:

Architecture

        ┌──────────┐
        │   PLC    │
        │ (Modbus) │
        └──┬───┬───┘
           │   │
  Port 502 │   │ Port 502
           │   │
  ┌────────▼┐ ┌▼────────┐
  │Gateway A│ │Gateway B│
  │(Primary)│ │(Standby)│
  └────┬────┘ └────┬────┘
       │           │
       ▼           ▼
  ┌────────────────────┐
  │    MQTT Broker     │
  │    (cloud/edge)    │
  └────────────────────┘

Active/Standby vs Active/Active

Active/Standby: Gateway A polls the PLC. Gateway B monitors A's heartbeat (via MQTT LWT or a shared health topic). If A goes silent for >30 seconds, B starts polling. When A recovers, it checks B's status and either resumes as primary or remains standby.

  • Pro: Only one gateway reads from the PLC, respecting the PLC's connection limit
  • Con: 30-second failover gap

Active/Active: Both gateways poll the PLC simultaneously. The cloud platform deduplicates data based on timestamps and device serial numbers. If one gateway fails, the other's data is already flowing.

  • Pro: Zero-downtime failover, no coordination needed
  • Con: Doubles PLC connection count and network traffic. Most PLCs support this (6-16 connections), but verify.

Recommendation: Active/Active with cloud-side deduplication. The PLC connection overhead is negligible compared to the operational cost of a 30-second data gap. Cloud-side deduplication is trivial — tag ID + timestamp + device serial number provides a natural composite key.

Store-and-Forward: Surviving Cloud Disconnections

Gateway-to-PLC failover handles half the problem. The other half is cloud connectivity — cellular links drop, VPN tunnels restart, and MQTT brokers undergo maintenance. During these outages, the gateway must buffer data locally and forward it when connectivity returns.

The Paged Ring Buffer

A production-grade store-and-forward buffer uses a paged ring buffer — pre-allocated memory divided into fixed-size pages, with separate write and read pointers:

┌──────────┐
│ Page 0 │ ← read_pointer (next to transmit)
│ [data] │
├──────────┤
│ Page 1 │
│ [data] │
├──────────┤
│ Page 2 │ ← write_pointer (next to fill)
│ [empty] │
├──────────┤
│ Page 3 │
│ [empty] │
└──────────┘

When the MQTT connection is healthy:

  1. Tag data is written to the current work page
  2. When the page fills, it moves to the "used" queue
  3. The buffer transmits the oldest used page to MQTT (QoS 1 for delivery confirmation)
  4. On publish acknowledgment, the page moves to the "free" queue

When the MQTT connection drops:

  1. Tag data continues writing to pages (the PLC doesn't stop producing data)
  2. Used pages accumulate in the queue
  3. If the queue fills, the oldest used page is recycled as a work page — sacrificing the oldest data to preserve the newest

This design guarantees:

  • Constant memory usage — no dynamic allocation on an embedded device
  • Graceful degradation — oldest data is sacrificed first
  • Thread safety — mutex-protected page transitions prevent race conditions between the writing thread (PLC poller) and the reading thread (MQTT publisher)
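The page-recycling behavior is the subtle part, so here is a minimal single-threaded sketch of it (page counts and sizes are toy values; a real buffer adds the mutex and a separate work page):

```c
#include <string.h>

#define NUM_PAGES 4
#define PAGE_SIZE 16

typedef struct {
    char data[NUM_PAGES][PAGE_SIZE];
    int  head;   /* index of the oldest used page (next to transmit) */
    int  count;  /* number of used pages */
} ring_t;

/* Commit a filled page into the used queue. When the queue is full, the
   oldest page is recycled so the newest data survives. Returns 1 if an
   old page was dropped. */
int ring_commit(ring_t *r, const char *page) {
    int dropped = 0;
    if (r->count == NUM_PAGES) {
        r->head = (r->head + 1) % NUM_PAGES;  /* drop the oldest */
        r->count--;
        dropped = 1;
    }
    int slot = (r->head + r->count) % NUM_PAGES;
    memcpy(r->data[slot], page, PAGE_SIZE);
    r->count++;
    return dropped;
}

/* Oldest used page, or NULL if the buffer is empty. */
const char *ring_peek(const ring_t *r) {
    return r->count ? r->data[r->head] : 0;
}

/* Called on MQTT publish acknowledgment: free the transmitted page. */
void ring_ack(ring_t *r) {
    if (r->count) { r->head = (r->head + 1) % NUM_PAGES; r->count--; }
}
```

Because the array is fixed at compile time, memory usage is constant regardless of how long the outage lasts — the tradeoff is that drops become possible once the queue is full.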

Sizing the Buffer

Buffer size depends on your data rate and expected maximum outage duration:

buffer_size = data_rate_bytes_per_second × max_outage_seconds × 1.2 (overhead)

For a typical deployment:

  • 100 tags × 4 bytes/value = 400 bytes per poll cycle
  • 1 poll per second = 400 bytes/second
  • Binary encoding with batch overhead: ~500 bytes/second
  • Target 4 hours of offline buffering: 500 × 14,400 = 7.2MB

With 512KB pages, that's ~14 pages. Allocate 16 pages (minimum 3 needed for operation: one writing, one transmitting, one free) for an 8MB buffer.

Binary vs JSON Encoding for Buffered Data

JSON is wasteful for buffered data. The same 100-tag reading:

  • JSON: {"groups":[{"ts":1709500800,"device_type":1018,"serial_number":23456,"values":[{"id":1,"values":[245]},{"id":2,"values":[312]},...]}]} → ~2KB
  • Binary: Header (0xF7 + group count + timestamp + device info) + packed tag values → ~500 bytes

Binary encoding uses a compact format:

[0xF7] [num_groups:4] [timestamp:4] [device_type:2] [serial_num:4] 
[num_values:4] [tag_id:2] [status:1] [value_count:1] [value_size:1] [values...]

Over a cellular connection billing at $5/GB, the 4× bandwidth savings of binary encoding pays for itself within days on a busy gateway.
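A sketch of packing the header fields from the layout above. The field widths follow the diagram; the byte order (little-endian here) and function names are assumptions, not a documented wire format:

```c
#include <stdint.h>
#include <stddef.h>

/* Write v into p as n little-endian bytes; returns n. */
static size_t put_le(uint8_t *p, uint32_t v, size_t n) {
    for (size_t i = 0; i < n; i++)
        p[i] = (uint8_t)(v >> (8 * i));
    return n;
}

/* Pack: [0xF7] [num_groups:4] [timestamp:4] [device_type:2] [serial:4].
   Returns the number of bytes written (15). */
size_t encode_header(uint8_t *buf, uint32_t num_groups, uint32_t timestamp,
                     uint16_t device_type, uint32_t serial) {
    size_t off = 0;
    buf[off++] = 0xF7;                      /* frame marker */
    off += put_le(buf + off, num_groups, 4);
    off += put_le(buf + off, timestamp, 4);
    off += put_le(buf + off, device_type, 2);
    off += put_le(buf + off, serial, 4);
    return off;
}
```

Fifteen bytes of header versus roughly eighty characters of JSON field names is where most of the 4× savings comes from.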

Alarm Tag Priority: Batched vs Immediate Delivery

Not all tags are created equal. A temperature reading that's 0.1°C different from the last poll can wait for the next batch. An alarm bit that just flipped from 0 to 1 cannot.

The gateway should support two delivery modes per tag:

Batched Delivery (Default)

Tags are accumulated in the batch buffer and delivered on the batch timeout (typically 5-30 seconds) or batch size limit (typically 10-500KB). This is efficient for process variables that change slowly.

Configuration:

{
  "name": "Tank Temperature",
  "id": 1,
  "addr": 300202,
  "type": "int16",
  "interval": 60,
  "compare": false
}

Immediate Delivery (do_not_batch)

Tags bypass the batch buffer entirely. When the value changes, a single-value batch is created, serialized, and pushed to the output buffer immediately. This is essential for:

  • Alarm words — operators need sub-second alarm notification
  • Machine state transitions — running/stopped/faulted changes trigger downstream actions
  • Safety interlocks — any safety-relevant state change must be delivered without batching delay

Configuration:

{
  "name": "CQT 1 Alarm Bits 1",
  "id": 163,
  "addr": 300022,
  "type": "uint16",
  "interval": 1,
  "compare": true,
  "do_not_batch": true
}

The compare: true flag is critical for immediate-delivery tags — without it, the gateway would transmit on every read cycle (every 1 second), flooding the network. With comparison enabled, the gateway only transmits when the alarm word actually changes — zero bandwidth during normal operation, instant delivery when an alarm fires.
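The compare logic itself is tiny — a sketch of the change-detection gate for a do_not_batch tag (struct and function names are illustrative):

```c
#include <stdint.h>

/* Per-tag state for change detection. have_last is 0 until the first
   read, so the first value after (re)connection is always transmitted. */
typedef struct { uint16_t last; int have_last; } alarm_tag_t;

/* Returns 1 if the value changed (or is the first read) and should be
   pushed to the output buffer immediately; 0 if it can be skipped. */
int should_transmit(alarm_tag_t *t, uint16_t value) {
    if (t->have_last && t->last == value)
        return 0;           /* unchanged: zero bandwidth */
    t->last = value;
    t->have_last = 1;
    return 1;               /* changed: send now */
}
```

Resetting have_last on reconnection is what implements the "force-read all tags" behavior described earlier — the first post-recovery poll always transmits.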

Calculated Tags: Extracting Bit-Level Alarms from PLC Words

Many PLCs pack multiple alarm states into a single 16-bit register. Bit 0 might indicate "high temperature," bit 1 "low flow," bit 2 "compressor fault," etc. Rather than requiring the cloud platform to perform bitwise decoding, a production gateway extracts individual bits and delivers them as separate boolean tags.

The extraction uses shift-and-mask arithmetic:

alarm_word = 0xA5 = 10100101 in binary

bit_0 = (alarm_word >> 0) & 0x01 = 1 → "High Temperature" = TRUE
bit_1 = (alarm_word >> 1) & 0x01 = 0 → "Low Flow" = FALSE
bit_2 = (alarm_word >> 2) & 0x01 = 1 → "Compressor Fault" = TRUE
...
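The shift-and-mask step above is one line of C:

```c
#include <stdint.h>

/* Extract bit n (0-15) of a packed alarm word, as in the worked
   example: 0xA5 = 10100101 binary. */
int alarm_bit(uint16_t word, unsigned n) {
    return (word >> n) & 0x01;
}
```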

These calculated tags are defined as children of the parent alarm word. When the parent tag changes value (detected by the compare flag), all child calculated tags are re-evaluated and delivered. If the parent doesn't change, no child processing occurs — zero CPU overhead during steady state.

This architecture keeps the PLC configuration simple (one alarm word per circuit) while giving cloud consumers individual, addressable alarm signals.

Putting It All Together: A Production Gateway Checklist

Before deploying a Modbus TCP gateway to production, verify:

  • Connection recovery handles all five error codes (ETIMEDOUT, ECONNRESET, ECONNREFUSED, EPIPE, EBADF)
  • Exponential backoff with 10-second ceiling prevents reconnection storms
  • Link state is delivered as a first-class tag (not just logged)
  • Register grouping batches contiguous same-function-code registers (max 50 per read)
  • 50ms inter-read delay protects PLC scan cycle integrity
  • Store-and-forward buffer sized for target offline duration
  • Binary encoding used for buffered data (not JSON)
  • Alarm tags configured with compare: true and immediate delivery
  • Calculated tags extract individual bits from alarm words
  • Force-read on reconnection ensures fresh values after any link recovery
  • Hourly full re-read resets all "read once" flags to catch any drift

machineCDN and Modbus TCP

machineCDN's edge gateway implements these patterns natively — connection state management, contiguous register grouping, binary batch encoding, paged ring buffers, and calculated alarm tags — so that plant engineers can focus on which tags to monitor rather than how to keep the data flowing. The gateway's JSON-based tag configuration maps directly to the PLC's register map, and the dual-format delivery system (binary for efficiency, JSON for interoperability) adapts to whatever network path is available.

For manufacturing teams running Modbus TCP equipment — from chillers and dryers to injection molding machines and conveying systems — getting the gateway layer right is the difference between a monitoring system that works in the lab and one that survives a year on the factory floor.


Building a Modbus TCP monitoring system? machineCDN handles protocol translation, buffering, and cloud delivery for manufacturing equipment — so your data keeps flowing even when your network doesn't.