13 posts tagged with "Industrial IoT"

Industrial Internet of Things technology and applications

Best MQTT Broker for Industrial IoT in 2026: Choosing the Right Message Broker for Manufacturing

March 3, 2026 · 9 min read

Industrial IoT Experts

MQTT has become the dominant messaging protocol for industrial IoT, and for good reason: it's lightweight, handles unreliable networks gracefully, and scales from a single sensor to millions of devices. But choosing the right MQTT broker for manufacturing is a different problem than choosing one for consumer IoT. Factory floor data has different latency requirements, reliability expectations, and security constraints than smart home sensors or fleet telemetry.

Best PLC Data Collection Software 2026: 10 Platforms for Extracting Value from Your Controllers

March 3, 2026 · 10 min read

MachineCDN Team

Industrial IoT Experts

Your PLCs already know everything about your manufacturing operation — cycle times, temperatures, pressures, motor speeds, part counts, alarm states. The problem isn't data. It's getting that data out of the PLC and into a place where humans and AI can actually use it. PLC data collection software bridges that gap, and choosing the right platform determines whether you get actionable intelligence or just another data silo.

Edge Gateway Hot-Reload and Watchdog Patterns for Industrial IoT [2026]

March 3, 2026 · 12 min read

Here's a scenario every IIoT engineer dreads: it's 2 AM on a Saturday, your edge gateway in a plastics manufacturing plant has lost its MQTT connection to the cloud, and nobody notices until Monday morning. More than two days of production data (temperatures, pressures, cycle counts, alarms) are gone. The maintenance team wanted to correlate a quality defect with process data from Saturday afternoon. They can't.

This is a reliability problem, and it's solvable. The patterns that separate a production-grade edge gateway from a prototype are: configuration hot-reload (change settings without restarting), connection watchdogs (detect and recover from silent failures), and graceful resource management (handle reconnections without memory leaks).

This guide covers the architecture behind each of these patterns, with practical design decisions drawn from real industrial deployments.

Edge gateway hot-reload and firmware patterns

The Problem: Why Edge Gateways Fail Silently

Industrial edge gateways operate in hostile environments: temperature swings, electrical noise, intermittent network connectivity, and 24/7 uptime requirements. The failure modes are rarely dramatic — they're insidious:

MQTT connection drops silently. The broker stops responding, but the client library doesn't fire a disconnect callback because the TCP connection is still half-open.
Configuration drift. An engineer updates tag definitions on the management server, but the gateway is still running the old configuration.
Memory exhaustion. Each reconnection allocates new buffers without properly freeing the old ones. After enough reconnections, the gateway runs out of memory and crashes.
PLC link flapping. The PLC reboots or loses power briefly. The gateway keeps polling, getting errors, but never properly re-detects or reconnects.

Solving these requires three interlocking systems: hot-reload for configuration, watchdogs for connections, and disciplined resource management.

Pattern 1: Configuration File Hot-Reload

The simplest and most robust approach to configuration hot-reload is file-based with stat polling. The gateway periodically checks if its configuration file has been modified (using the file's modification timestamp), and if so, reloads and applies the new configuration.

Design: stat() Polling vs. inotify

You have two options for detecting file changes:

stat() polling — Check the file's st_mtime on every main loop iteration:

on_each_cycle():
    current_stat = stat(config_file)
    if current_stat.mtime != last_known_mtime:
        reload_configuration()
        last_known_mtime = current_stat.mtime

inotify (Linux) — Register for kernel-level file change notifications:

fd = inotify_init()
wd = inotify_add_watch(fd, config_file, IN_MODIFY)
poll(fd)   // blocks until file changes
read(fd)   // drain the queued event before the next poll
reload_configuration()

For industrial edge gateways, stat() polling wins. Here's why:

It's simpler. No file descriptor management, no edge cases with inotify watches being silently dropped.
It works across filesystems. inotify does not report changes made by remote clients on NFS or CIFS mounts, and some embedded filesystems don't support it at all. stat() works everywhere.
The cost is negligible. A single stat() call on a local file typically takes on the order of microseconds. Even when it runs on every loop iteration, it's invisible.
It naturally integrates with the main loop. Industrial gateways already run a polling loop for PLC reads. Adding a stat() check is one line.
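A minimal Python sketch of the stat() polling check described above. The `ConfigWatcher` class and its method names are illustrative assumptions, not taken from any particular gateway. It treats a missing file as "no change" so the gateway keeps running on its last good configuration.

```python
import os


class ConfigWatcher:
    """Detects changes to a file by comparing modification timestamps."""

    def __init__(self, path):
        self.path = path
        self.last_mtime_ns = None  # nothing loaded yet

    def changed(self):
        """Return True once for each observed change in the file's mtime."""
        try:
            mtime_ns = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return False  # keep running on the old configuration
        if mtime_ns == self.last_mtime_ns:
            return False
        self.last_mtime_ns = mtime_ns
        return True
```

In a polling loop, a call such as `if watcher.changed(): reload_configuration()` would sit next to the PLC reads. The first call reports a change, which doubles as the initial load.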

Graceful Reload: The Teardown-Rebuild Cycle

When a configuration change is detected, the gateway must:

1. Stop active PLC connections. For EtherNet/IP, destroy all tag handles. For Modbus, close the serial port or TCP connection.
2. Free allocated memory. Tag definitions, batch buffers, connection contexts: all of it.
3. Re-read and validate the new configuration.
4. Re-detect the PLC and re-establish connections with the new tag map.
5. Resume data collection with a forced initial read of all tags.

The critical detail is step 2. Industrial gateways often use a pool allocator instead of individual malloc/free calls. All configuration-related memory is allocated from a single large buffer. On reload, you simply reset the allocator's pointer to the beginning of the buffer:

// Pseudo-code: pool allocator reset
config_memory.write_pointer = config_memory.base_address
config_memory.used_bytes = 0
config_memory.free_bytes = config_memory.total_size

This eliminates the risk of memory leaks during reconfiguration. No matter how many times you reload, memory usage stays constant.
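To make the reset idea concrete, here is a small Python model of a bump-style pool allocator. A real gateway would do this in C over a static buffer. The `ConfigPool` name and its methods are assumptions for illustration only.

```python
class ConfigPool:
    """Bump allocator over one fixed buffer; reset() reclaims everything at once."""

    def __init__(self, total_size):
        self.buffer = bytearray(total_size)
        self.used_bytes = 0

    @property
    def free_bytes(self):
        return len(self.buffer) - self.used_bytes

    def alloc(self, size):
        """Return a writable view of `size` bytes, or raise if the pool is full."""
        if size > self.free_bytes:
            raise MemoryError("configuration pool exhausted")
        start = self.used_bytes
        self.used_bytes += size
        return memoryview(self.buffer)[start:start + size]

    def reset(self):
        """Discard every allocation; views handed out earlier must not be reused."""
        self.used_bytes = 0
```

The tradeoff is that nothing can be freed individually. That is acceptable here because all configuration-related allocations share one lifetime: from one reload to the next.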

Multi-File Configuration

Production gateways often have multiple configuration files:

Daemon config — Network settings, serial port parameters, batch sizes, timeouts
Device configs — Per-device-type tag maps (one JSON file per machine model)
Connection config — MQTT broker address, TLS certificates, authentication tokens

Each file should be watched independently. If only the daemon config changes (e.g., someone adjusts the batch timeout), you don't need to re-detect the PLC — just update the runtime parameter. If a device config changes (e.g., someone adds a new tag), you need to rebuild the tag chain.

A practical approach: when the daemon config changes, set a flag to force a status report on the next MQTT cycle. When a device config changes, trigger a full teardown-rebuild of that device's tag chain.

Pattern 2: Connection Watchdogs

The most dangerous failure mode in MQTT-based telemetry is the silent disconnect. The TCP connection appears alive (no RST received), but the broker has stopped processing messages. The client's publish calls succeed (they're just writing to a local socket buffer), but data never reaches the cloud.

The MQTT Delivery Confirmation Watchdog

The robust solution uses MQTT QoS 1 delivery confirmations as a heartbeat:

// Track the timestamp of the last confirmed delivery
last_delivery_timestamp = 0

on_publish_confirmed(packet_id):
    last_delivery_timestamp = now()

on_watchdog_check():  // runs every N seconds
    if last_delivery_timestamp == 0:
        return  // no data sent yet, nothing to check

    elapsed = now() - last_delivery_timestamp
    if elapsed > WATCHDOG_TIMEOUT:
        trigger_reconnect()

With MQTT QoS 1, the broker sends a PUBACK for every published message. If you haven't received a PUBACK in, say, 120 seconds, but you've been publishing data, something is wrong.

The key insight is that you're not watching the connection state — you're watching the delivery pipeline. A connection can appear healthy (no disconnect callback fired) while the delivery pipeline is stalled.
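The watchdog logic above can be sketched in Python as follows. The class name and the injectable `clock` parameter are assumptions added so the behaviour can be exercised without waiting. A monotonic clock is used so that wall-clock adjustments do not trigger false alarms.

```python
import time


class DeliveryWatchdog:
    """Flags a stalled delivery pipeline based on the age of the last PUBACK."""

    def __init__(self, timeout_s=120.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_delivery = None  # no confirmed delivery yet

    def on_publish_confirmed(self):
        """Call from the MQTT client's publish-confirmed callback."""
        self.last_delivery = self.clock()

    def stalled(self):
        """True when data has been confirmed before but not within the timeout."""
        if self.last_delivery is None:
            return False  # nothing sent yet, nothing to check
        return self.clock() - self.last_delivery > self.timeout_s
```

Like the pseudo-code, this sketch never fires if the very first publish is never confirmed. Recording the time of the first unconfirmed publish as well would close that gap.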

Reconnection Strategy: Async with Backoff

When the watchdog triggers, the reconnection must be:

Asynchronous — Don't block the PLC polling loop. Data collection should continue even while MQTT is reconnecting. Collected data gets buffered locally.
Non-destructive — The MQTT loop thread must be stopped before destroying the client. Stopping the loop with force=true ensures no callbacks fire during teardown.
Complete — Disconnect, destroy the client, reinitialize the library, create a new client, set callbacks, start the loop, then connect. Half-measures (just calling reconnect) often leave stale state.

A dedicated reconnection thread works well:

reconnect_thread():
    while true:
        wait_for_signal()  // semaphore blocks until watchdog triggers

        log("Starting MQTT reconnection")
        stop_mqtt_loop(force=true)
        disconnect()
        destroy_client()
        cleanup_library()

        // Re-initialize from scratch
        init_library()
        create_client(device_id)
        set_credentials(username, password)
        set_tls(certificate_path)
        set_protocol(MQTT_3_1_1)
        set_callbacks(on_connect, on_disconnect, on_message, on_publish)
        start_loop()
        set_reconnect_delay(5, 5, no_exponential)
        connect_async(host, port, keepalive=60)

        signal_complete()  // release semaphore

Why a separate thread? The connect_async call can block for up to 60 seconds on DNS resolution or TCP handshake. If this runs on the main thread, PLC polling stops. Industrial processes don't wait for your network issues.
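One way to structure the hand-off between the watchdog and the reconnection thread is sketched below in Python. The `rebuild_client` callable is hypothetical and stands in for the full teardown-and-reconnect sequence. Two semaphores ensure that only one rebuild runs at a time and that the caller never blocks.

```python
import threading


class Reconnector:
    """Runs a full client rebuild on a worker thread so polling never blocks."""

    def __init__(self, rebuild_client):
        self.rebuild_client = rebuild_client   # hypothetical: teardown + fresh connect
        self.job_ready = threading.Semaphore(0)
        self.idle = threading.Semaphore(1)     # 1 means the worker is free
        self.rebuild_count = 0
        threading.Thread(target=self._run, daemon=True).start()

    def request(self):
        """Called by the watchdog; returns False if a rebuild is already running."""
        if not self.idle.acquire(blocking=False):
            return False
        self.job_ready.release()
        return True

    def _run(self):
        while True:
            self.job_ready.acquire()           # sleep until the watchdog signals
            try:
                self.rebuild_client()
                self.rebuild_count += 1
            except Exception:
                pass  # a real gateway would log this and let the watchdog retry
            finally:
                self.idle.release()            # ready for the next request
```

The main loop calls `request()` and carries on polling regardless of the return value. A `False` result simply means a reconnection is already in progress.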

PLC Connection Watchdog

MQTT isn't the only connection that needs watching. PLC connections — both EtherNet/IP and Modbus TCP — can also fail silently.

For Modbus TCP, the watchdog logic is simpler because each read returns an explicit error code:

on_modbus_read_error(error_code):
    if error_code in [ETIMEDOUT, ECONNRESET, ECONNREFUSED, EPIPE, EBADF]:
        close_modbus_connection()
        set_link_state(DOWN)
        // Will reconnect on next polling cycle

For EtherNet/IP via libraries like libplctag, a return code of -32 (PLCTAG_ERR_TIMEOUT in libplctag's status codes, meaning the PLC did not respond in time) should trigger:

Setting the link state to DOWN
Destroying the tag handles
Attempting re-detection on the next cycle

A critical detail: track consecutive errors, not individual ones. A single timeout might be a transient hiccup. Three consecutive timeouts (error_count >= 3) indicate a real problem. Break the polling cycle early to avoid hammering a dead connection.
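A short Python sketch of the consecutive-error rule. The `LinkMonitor` name is an assumption, and the threshold of 3 follows the text. `record()` returns whether the link state just changed, which is the moment a link-state event would be worth publishing.

```python
class LinkMonitor:
    """Declares a PLC link down only after several consecutive read failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.error_count = 0
        self.link_up = True

    def record(self, read_ok):
        """Feed one read result; returns True if the link state just changed."""
        previous = self.link_up
        if read_ok:
            self.error_count = 0   # any success clears the streak
            self.link_up = True
        else:
            self.error_count += 1
            if self.error_count >= self.threshold:
                self.link_up = False
        return self.link_up != previous
```

A single failed read followed by a success leaves the link marked up, so transient hiccups never reach the dashboard.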

Link State Telemetry

The gateway should treat the connection state itself as a telemetry point. When the PLC link goes up or down, immediately publish a link state tag — a boolean value with do_not_batch: true:

link_state_changed(device, new_state):
    publish_immediately(
        tag_id=LINK_STATE_TAG,
        value=new_state,  // true=up, false=down
        timestamp=now()
    )

This gives operators cloud-side visibility into gateway connectivity. A dashboard can show "Device offline since 2:47 AM" instead of just "no data" — which is ambiguous (was the device off, or was the gateway offline?).

Pattern 3: Store-and-Forward Buffering

When MQTT is disconnected, you can't just drop data. A production gateway needs a paged ring buffer that accumulates data during disconnections and drains it when connectivity returns.

Paged Buffer Architecture

The buffer divides a fixed-size memory region into pages of equal size:

Total buffer: 2 MB
Page size: ~4 KB (derived from max batch size)
Pages: ~500

Page states:
  FREE → Available for writing
  WORK → Currently being written to
  USED → Full, queued for delivery

The lifecycle:

Writing: Data is appended to the WORK page. When it's full, WORK moves to the USED queue, and a FREE page becomes the new WORK page.
Sending: When MQTT is connected, the first USED page is sent. On PUBACK confirmation, the page moves to FREE.
Overflow: If all pages are USED (buffer full, MQTT down for too long), the oldest USED page is recycled as the new WORK page. This loses the oldest data to preserve the newest — the right tradeoff for most industrial applications.

Thread safety is critical. The PLC polling thread writes to the buffer, the MQTT thread reads from it, and the PUBACK callback advances the read pointer. A mutex protects all buffer operations:

buffer_add_data(data, size):
    lock(mutex)
    append_to_work_page(data, size)
    if work_page_full():
        move_work_to_used()
    try_send_next()
    unlock(mutex)

on_puback(packet_id):
    lock(mutex)
    advance_read_pointer()
    if page_fully_delivered():
        move_page_to_free()
    try_send_next()
    unlock(mutex)

on_disconnect():
    lock(mutex)
    connected = false
    packet_in_flight = false  // reset delivery state
    unlock(mutex)
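The page lifecycle can be modelled in Python as shown below. This is a simplified sketch under several assumptions: bytearrays stand in for fixed memory pages, one page is in flight at a time, and PUBACKs are not matched against packet ids as a real implementation would need to do. The class and method names are illustrative.

```python
import threading
from collections import deque


class PagedBuffer:
    """Store-and-forward buffer: FREE pages, one WORK page, a queue of USED pages."""

    def __init__(self, total_size, page_size):
        page_count = total_size // page_size
        if page_count < 3:
            raise ValueError("buffer must hold at least 3 pages")
        self.page_size = page_size
        self.free = deque(bytearray() for _ in range(page_count - 1))
        self.work = bytearray()
        self.used = deque()
        self.in_flight = False
        self.dropped_pages = 0
        self.lock = threading.Lock()

    def add(self, record):
        """Append one record (bytes), rotating pages when WORK would overflow."""
        if len(record) > self.page_size:
            raise ValueError("record larger than a page")
        with self.lock:
            if len(self.work) + len(record) > self.page_size:
                self._rotate()
            self.work += record

    def _rotate(self):
        self.used.append(self.work)
        if self.free:
            self.work = self.free.popleft()
        else:
            # Overflow: recycle the oldest USED page, losing the oldest data.
            self.work = self.used.popleft()
            self.work.clear()
            self.in_flight = False  # a late PUBACK for the recycled page is ignored
            self.dropped_pages += 1

    def next_to_send(self):
        """Return a copy of the oldest USED page, or None if busy or empty."""
        with self.lock:
            if self.in_flight or not self.used:
                return None
            self.in_flight = True
            return bytes(self.used[0])

    def confirm_delivered(self):
        """Call on PUBACK: the delivered page returns to the FREE list."""
        with self.lock:
            if not self.in_flight:
                return
            self.in_flight = False
            page = self.used.popleft()
            page.clear()
            self.free.append(page)

    def on_disconnect(self):
        """Reset delivery state so the page is resent after reconnecting."""
        with self.lock:
            self.in_flight = False
```

The `dropped_pages` counter is an addition for observability. Publishing it as telemetry would show operators how much data an extended outage actually cost.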

Sizing the Buffer

Buffer sizing depends on your data rate and your maximum acceptable offline duration:

buffer_size = data_rate_bytes_per_second × max_offline_seconds

For a typical deployment:

50 tags × 4 bytes average × 1 read/second = 200 bytes/second
With binary encoding overhead: ~300 bytes/second
Maximum offline duration: 2 hours (7,200 seconds)
Buffer needed: 300 × 7,200 = ~2.1 MB

A 2 MB buffer with 4 KB pages gives you ~500 pages, which covers just under 2 hours of offline operation at this data rate. Round the buffer up if you need headroom beyond that.
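The sizing arithmetic can be checked with a few lines of Python. The 1.5 overhead factor is an assumption inferred from the 200 to 300 bytes/second step in the example, and "2 MB" is taken here to mean 2 × 1024 × 1024 bytes.

```python
import math


def required_buffer_bytes(tags, bytes_per_value, reads_per_second,
                          overhead_factor, max_offline_seconds):
    """Bytes needed to hold all data produced during the longest expected outage."""
    rate = tags * bytes_per_value * reads_per_second * overhead_factor
    return int(rate * max_offline_seconds)


PAGE_SIZE = 4096
needed = required_buffer_bytes(50, 4, 1, 1.5, 7200)   # 2,160,000 bytes
pages_needed = math.ceil(needed / PAGE_SIZE)          # 528 pages
seconds_covered = (2 * 1024 * 1024) / 300             # about 6,990 seconds
```

With these inputs a 2 MiB buffer (512 pages) covers roughly 1 hour 56 minutes, so the sizing is tight rather than generous.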

The Minimum Three-Page Rule

The buffer needs at minimum 3 pages to function:

One WORK page (currently being written to)
One USED page (queued for delivery)
One page in transition (being delivered, not yet confirmed)

If you can't fit 3 pages in the buffer, the page size is too large relative to the buffer. Validate this at initialization time and reject invalid configurations rather than failing at runtime.

Pattern 4: Periodic Forced Reads

Even with change-detection enabled (the compare flag), a production gateway should periodically force-read all tags and transmit their values regardless of whether they changed. This serves several purposes:

Proof of life. Downstream systems can distinguish "the value hasn't changed" from "the gateway is dead."
State synchronization. If the cloud-side database lost data (a rare but real scenario), periodic full-state updates resynchronize it.
Clock drift correction. Over time, individual tag timers can drift. A periodic full reset realigns all tags.

A practical approach: reset all tags on the hour boundary. Check the system clock, and when the hour rolls over, clear all "previously read" flags. Every tag will be read and transmitted on its next polling cycle, regardless of change detection:

on_each_read_cycle():
    current_hour = localtime(now()).hour
    previous_hour = localtime(last_read_time).hour

    if current_hour != previous_hour:
        reset_all_tags()  // clear read-once flags
        log("Hourly forced read: all tags will be re-read")

This adds at most one extra transmission per tag per hour — a negligible bandwidth cost for significant reliability improvement.
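A runnable Python version of the rollover check, as a sketch. Unlike the pseudo-code, it also compares the year and day of year, so a gap of exactly 24 hours between reads still counts as a rollover. The function name is an assumption.

```python
import time


def hour_rolled_over(last_read_time, now):
    """True when `now` falls in a different local clock hour than `last_read_time`."""
    a = time.localtime(last_read_time)
    b = time.localtime(now)
    return (a.tm_year, a.tm_yday, a.tm_hour) != (b.tm_year, b.tm_yday, b.tm_hour)
```

Both arguments are Unix timestamps in seconds. When the function returns True, the gateway would clear its "previously read" flags before the next cycle.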

Pattern 5: SAS Token and Certificate Expiry Monitoring

If your MQTT connection uses time-limited credentials (like Azure IoT Hub SAS tokens or short-lived TLS certificates), the gateway must monitor expiry and refresh proactively.

For SAS tokens, extract the se (expiry) parameter from the token, which is a Unix timestamp in seconds, and compare it against the current system time:

on_config_load(sas_token):
    expiry_timestamp = extract_se_parameter(sas_token)

    if current_time > expiry_timestamp:
        log_warning("Token has expired!")
        // Still attempt connection — the broker will reject it,
        // but the error path will trigger a config reload
    else:
        time_remaining = expiry_timestamp - current_time
        log("Token valid for %d hours", time_remaining / 3600)

Don't silently fail. If the token is expired, log a prominent warning. The gateway should still attempt to connect (the broker rejection will be informative), but operations teams need visibility into credential lifecycle.
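A minimal Python sketch of the expiry check. It assumes the usual Azure-style token layout, `SharedAccessSignature sr=...&sig=...&se=<unix seconds>`. The function name and the sample token in the usage note are made up for illustration.

```python
import time
from urllib.parse import parse_qs


def sas_seconds_remaining(sas_token, now=None):
    """Seconds until the token's `se` expiry; negative if it has already expired."""
    # Assumed format: "SharedAccessSignature sr=...&sig=...&se=<unix seconds>"
    _, _, params = sas_token.partition(" ")
    fields = parse_qs(params or sas_token)
    if "se" not in fields:
        raise ValueError("no 'se' field found in token")
    expiry = int(fields["se"][0])
    current = time.time() if now is None else now
    return expiry - current
```

For a token ending in `&se=1767225600`, calling the function with `now=1767222000` returns 3600, which a gateway could log as "Token valid for 1 hours". It could raise a warning once the remaining time drops below a chosen threshold.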

For TLS certificates, monitor both the certificate file's modification time (has a new cert been deployed?) and the certificate's validity period (is it about to expire?).

How machineCDN Implements These Patterns

machineCDN's edge gateway — deployed on OpenWRT-based industrial routers in plastics manufacturing plants — implements all five patterns:

Configuration hot-reload using stat() polling on the main loop, with pool-allocated memory for zero-leak teardown/rebuild cycles
Dual watchdogs for MQTT delivery confirmation (120-second timeout) and PLC link state (3 consecutive errors trigger reconnection)
Paged ring buffer with 2 MB capacity, supporting both JSON and binary encoding, with automatic overflow handling that preserves newest data
Hourly forced reads that ensure complete state synchronization regardless of change detection
SAS token monitoring with proactive expiry warnings

These patterns enable 99.9%+ data capture rates even in plants with intermittent cellular connectivity — because the gateway collects data continuously and back-fills when connectivity returns.

Implementation Checklist

If you're building or evaluating an edge gateway for industrial IoT, verify that it supports:

Capability	Why It Matters
Config hot-reload without restart	Zero-downtime updates, no data gaps during reconfiguration
Pool-based memory allocation	No memory leaks across reload cycles
MQTT delivery watchdog	Detects silent connection failures
Async reconnection thread	PLC polling continues during MQTT recovery
Paged store-and-forward buffer	Preserves data during network outages
Consecutive error thresholds	Avoids false-positive disconnections
Link state telemetry	Distinguishes "offline gateway" from "idle machine"
Periodic forced reads	State synchronization and proof-of-life
Credential expiry monitoring	Proactive certificate/token management

Conclusion

Reliability in industrial IoT isn't about preventing failures — it's about recovering from them automatically. Networks will drop. PLCs will reboot. Certificates will expire. The question is whether your edge gateway handles these events gracefully or silently loses data.

The patterns in this guide — hot-reload, watchdogs, store-and-forward, forced reads, and credential monitoring — are the difference between a gateway that works in the lab and one that works at 3 AM on a holiday weekend in a plant with spotty cellular coverage.

Build for the 3 AM scenario. Your operations team will thank you.

Dependent Tag Architectures: Building Event-Driven Data Hierarchies in Industrial IoT [2026]

March 2, 2026 · 10 min read

Most IIoT platforms treat every data point as equal. They poll each tag on a fixed schedule, blast everything to the cloud, and let someone else figure out what matters. That approach works fine when you have ten tags. It collapses when you have ten thousand.

Production-grade edge systems take a fundamentally different approach: they model relationships between tags — parent-child dependencies, calculated values derived from raw registers, and event-driven reads that fire only when upstream conditions change. The result is dramatically less bus traffic, lower latency on the signals that matter, and a data architecture that mirrors how the physical process actually works.

This article is a deep technical guide to building these hierarchical tag architectures from the ground up.

Dependent tag architecture for IIoT

The Problem with Flat Polling

In a traditional SCADA or IIoT setup, the edge gateway maintains a flat list of tags. Each tag has an address and a polling interval:

Tag: Barrel_Temperature    Address: 40001    Interval: 1s
Tag: Screw_Speed           Address: 40002    Interval: 1s
Tag: Mold_Pressure         Address: 40003    Interval: 1s
Tag: Machine_State         Address: 40010    Interval: 1s
Tag: Alarm_Word_1          Address: 40020    Interval: 1s
Tag: Alarm_Word_2          Address: 40021    Interval: 1s

Every second, the gateway reads every tag — regardless of whether anything changed. This creates three problems:

Bus saturation on serial links. A Modbus RTU link at 9600 baud can handle roughly 10–25 read transactions per second, depending on frame size and device turnaround time. With 200 ungrouped tags at 1-second intervals, you're mathematically guaranteed to fall behind.
Wasted bandwidth to the cloud. If barrel temperature hasn't changed in 30 seconds, you're uploading the same value 30 times. On cloud IoT services that bill per message, and on metered cellular links, that adds up.
Missing the events that matter. When everything polls at the same rate, a critical alarm state change gets the same priority as a temperature reading that hasn't moved in an hour.

Introducing Tag Hierarchies

A dependent tag architecture introduces three concepts:

1. Parent-Child Dependencies

A dependent tag is one that only gets read when its parent tag's value changes. Consider a machine status word. When the status word changes from "Running" to "Fault," you want to immediately read all the associated diagnostic registers. When the status word hasn't changed, those diagnostic registers are irrelevant.

# Conceptual configuration
parent_tag:
  name: machine_status_word
  address: 40010
  interval: 1s
  compare: true
  dependent_tags:
    - name: fault_code
      address: 40011
    - name: fault_timestamp
      address: 40012-40013
    - name: last_setpoint
      address: 40014

When machine_status_word changes, the edge daemon immediately performs a forced read of all three dependent tags and delivers them in the same telemetry group — with the same timestamp. This guarantees temporal coherence: the fault code, timestamp, and last setpoint all share the exact timestamp of the state change that triggered them.
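The parent-triggered read can be sketched in Python as follows. The tag dictionaries mirror the conceptual configuration above. `read_register` is a hypothetical callable that returns the value at an address, and multi-register tags are left out for brevity.

```python
def poll_parent(parent, read_register, last_values, now):
    """Read a parent tag; if it changed, force-read its dependents with one timestamp.

    Returns a list of (name, value, timestamp) tuples to deliver.
    """
    value = read_register(parent["address"])
    if last_values.get(parent["name"]) == value:
        return []  # unchanged: dependents are not read at all
    last_values[parent["name"]] = value
    group = [(parent["name"], value, now)]
    for dep in parent.get("dependent_tags", []):
        # Every dependent shares the parent's timestamp for temporal coherence.
        group.append((dep["name"], read_register(dep["address"]), now))
    return group
```

The first poll always counts as a change because no previous value exists, which gives downstream systems a complete initial snapshot.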

2. Calculated Tags

A calculated tag is a virtual data point derived from a parent tag's raw value through bitwise operations. The most common use case: decoding packed alarm words.

Industrial PLCs frequently pack 16 boolean alarms into a single 16-bit register. Rather than polling 16 separate coil addresses (up to 16 Modbus transactions if the coils are read individually or are not contiguous), you read one holding register and extract each bit:

Alarm_Word_1 (uint16 at 40020):
  Bit 0 → High Temperature Alarm
  Bit 1 → Low Pressure Alarm
  Bit 2 → Motor Overload
  Bit 3 → Emergency Stop Active
  ...
  Bit 15 → Communication Fault

A well-designed edge gateway handles this decomposition at the edge:

parent_tag:
  name: alarm_word_1
  address: 40020
  type: uint16
  interval: 1s
  compare: true       # Only process when value changes
  do_not_batch: true  # Deliver immediately — don't wait for batch timeout
  calculated_tags:
    - name: high_temp_alarm
      type: bool
      shift: 0
      mask: 0x01
    - name: low_pressure_alarm
      type: bool
      shift: 1
      mask: 0x01
    - name: motor_overload
      type: bool
      shift: 2
      mask: 0x01
    - name: estop_active
      type: bool
      shift: 3
      mask: 0x01

The beauty of this approach:

One Modbus read instead of sixteen
Zero cloud processing — the edge already decomposed the alarm word into named boolean tags
Change-driven delivery — if the alarm word hasn't changed, nothing gets sent. When bit 2 flips from 0 to 1, only the changed calculated tags get delivered.
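The shift-and-mask decomposition is small enough to show directly. This Python sketch assumes each calculated tag is evaluated as `(raw >> shift) & mask`, which matches the configuration fields above. The function names are illustrative.

```python
def decode_calculated_tags(raw_value, calculated_tags):
    """Apply (raw >> shift) & mask for each calculated tag definition."""
    return {
        tag["name"]: bool((raw_value >> tag["shift"]) & tag["mask"])
        for tag in calculated_tags
    }


def changed_tags(old_raw, new_raw, calculated_tags):
    """Return only the calculated tags whose value differs between two reads."""
    old = decode_calculated_tags(old_raw, calculated_tags)
    new = decode_calculated_tags(new_raw, calculated_tags)
    return {name: value for name, value in new.items() if old[name] != value}
```

If the alarm word goes from 0b0001 to 0b0101, `changed_tags` reports only `motor_overload` (bit 2), so a single boolean is delivered rather than the whole word.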

3. Comparison-Based Delivery

The compare flag on a tag definition tells the edge daemon to track the last-known value and suppress delivery when the new value matches. This is distinct from a polling interval — the tag still gets read on schedule, but the value only gets delivered when it changes.

This is particularly powerful for:

Status words and mode registers that change infrequently
Alarm bits where you care about transitions, not steady state
Setpoint registers that only change when an operator makes an adjustment

A well-implemented comparison handles type-aware equality. Comparing two float values with bitwise equality is fine for PLC registers (they're IEEE 754 representations read directly from memory — no floating-point arithmetic involved). Comparing two uint16 values is straightforward. The edge daemon should store the raw bytes, not a converted representation.

Register Grouping: The Foundation

Before dependent tags can work efficiently, the underlying polling engine needs contiguous register grouping. This is the practice of combining multiple tags into a single Modbus read request when their addresses are adjacent.

Consider these five tags:

Tag A: addr 40001, type uint16  (1 register)
Tag B: addr 40002, type uint16  (1 register)
Tag C: addr 40003, type float   (2 registers)
Tag D: addr 40005, type uint16  (1 register)
Tag E: addr 40010, type uint16  (1 register)  ← gap

An intelligent polling engine groups A through D into a single Read Holding Registers call: start address 40001, quantity 5. Tag E starts a new group because there's a 4-register gap (40006–40009) between D and E.

The grouping rules are:

Same function code. You can't combine holding registers (FC03) with input registers (FC04) in one read.
Contiguous addresses. Any gap breaks the group.
Same polling interval. A tag polling at 1s and a tag polling at 60s shouldn't be in the same group.
Maximum group size. The Modbus spec limits a single read to 125 registers (some devices impose lower limits — 50 is a safe practical maximum).

After the bulk read returns, the edge daemon dispatches individual register values to each tag definition, handling type conversion per tag (uint16, int16, float from two consecutive registers, etc.).
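The grouping rules can be expressed as a single pass over sorted tags. In this Python sketch each tag is a dict with assumed keys `fc` (function code), `interval`, `addr`, and `count` (number of registers). These key names are illustrative and not from any specific product.

```python
def group_tags(tags, max_registers=125):
    """Group tags into contiguous reads.

    Returns a list of (fc, interval, start_addr, quantity, [tags]) tuples.
    """
    groups = []
    ordered = sorted(tags, key=lambda t: (t["fc"], t["interval"], t["addr"]))
    for tag in ordered:
        if groups:
            fc, interval, start, qty, members = groups[-1]
            contiguous = tag["addr"] == start + qty
            same_kind = (tag["fc"], tag["interval"]) == (fc, interval)
            if same_kind and contiguous and qty + tag["count"] <= max_registers:
                groups[-1] = (fc, interval, start, qty + tag["count"], members + [tag])
                continue
        # Different function code or interval, a gap, or a full group: start anew.
        groups.append((tag["fc"], tag["interval"], tag["addr"], tag["count"], [tag]))
    return groups
```

Applied to tags A through E from the example, this yields two reads: (start 40001, quantity 5) and (start 40010, quantity 1).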

The 32-Bit Float Problem

When a tag spans two Modbus registers (common for 32-bit integers and IEEE 754 floats), the edge daemon must handle word ordering. Some PLCs store the high word first (big-endian), others store the low word first (little-endian). A typical edge system stores the raw register pair and then calls the appropriate conversion:

Big-endian (AB CD): value = (register[0] << 16) | register[1]
Little-endian (CD AB): value = (register[1] << 16) | register[0]

For IEEE 754 floats, the 32-bit integer is reinterpreted as a floating-point value. Getting this wrong produces garbage data — a common source of "the numbers look random" support tickets.
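A Python sketch of the conversion using the standard `struct` module. The `word_order` parameter name is an assumption. "big" and "little" refer to the order of the two 16-bit words, as in the formulas above.

```python
import struct


def registers_to_float(reg0, reg1, word_order="big"):
    """Combine two 16-bit registers into an IEEE 754 float.

    word_order="big" treats reg0 as the high word (AB CD);
    word_order="little" treats reg1 as the high word (CD AB).
    """
    high, low = (reg0, reg1) if word_order == "big" else (reg1, reg0)
    raw = (high << 16) | low
    # Reinterpret the 32-bit integer's bytes as a float; no numeric conversion.
    return struct.unpack(">f", struct.pack(">I", raw))[0]
```

The value 1.0 is 0x3F800000, so `registers_to_float(0x3F80, 0x0000)` returns 1.0. Decoding the same pair with the wrong word order gives a denormal number close to zero, the kind of value that shows up as "random" data on a dashboard.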

Architecture: Tying It Together

Here's how a production edge system processes a single polling cycle with dependent tags:

1. Start timestamp group (T = now)
2. For each tag in the poll list:
   a. Check if interval has elapsed since last read
   b. If not due, skip (but check if it's part of a contiguous group)
   c. Read tag (or group of tags) from PLC
   d. If compare=true and value unchanged: skip delivery
   e. If compare=true and value changed:
      i.   Deliver value (batched or immediate)
      ii.  If tag has calculated_tags: compute each one, deliver
      iii. If tag has dependent_tags:
           - Finalize current batch group
           - Force-read all dependent tags (recursive)
           - Start new batch group
   f. Update last-known value and last-read timestamp
3. Finalize timestamp group

The critical detail is step (e)(iii): when a parent tag triggers a dependent read, the current batch group gets finalized and the dependent tags are read in a forced mode (ignoring their individual interval timers). This ensures the dependent values reflect the state at the moment of the parent's change, not some future polling cycle.

Practical Considerations

Serial Link Timing

On Modbus RTU, the 3.5-character silent interval between frames is mandatory. At 9600 baud with 8N1 encoding, one character takes ~1.04ms, so the minimum inter-frame gap is ~3.64ms. With a typical request frame of 8 bytes and a response frame of 5 + 2*N bytes (for N registers), a single read of 10 registers takes approximately:

Request:    8 bytes × 1.04ms = 8.3ms
Turnaround: ~3.5ms (device processing)
Response:   (5 + 20) bytes × 1.04ms = 26ms
Gap:        3.64ms
Total:      ~41.4ms per read

This means you can fit roughly 24 read operations per second on a 9600-baud link. If you're polling 150 tags with 1-second intervals, grouping is not optional — it's survival.
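The timing estimate generalizes to any register count and baud rate. This Python sketch uses the same assumptions as the worked example: 10 bits per character (8N1), an 8-byte request, a (5 + 2N)-byte response, and about 3.5 ms of device turnaround. The function name is illustrative.

```python
def rtu_read_time_ms(registers, baud=9600, bits_per_char=10, turnaround_ms=3.5):
    """Estimated duration of one Modbus RTU read transaction in milliseconds."""
    char_ms = bits_per_char * 1000.0 / baud
    request = 8 * char_ms                      # addr + fc + 4 data bytes + CRC
    response = (5 + 2 * registers) * char_ms   # addr + fc + count + data + CRC
    gap = 3.5 * char_ms                        # mandatory inter-frame silence
    return request + turnaround_ms + response + gap
```

With the unrounded character time, a 10-register read comes to about 41.5 ms, which is still roughly 24 reads per second. Reading 150 single-register tags one at a time takes about 3.4 seconds, while two grouped reads of 75 registers take about 0.35 seconds.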

Alarm Tag Design

For alarm words, always configure:

compare: true — only deliver when an alarm state changes
do_not_batch: true — bypass the batch timeout and deliver immediately
interval: 1 (1 second) — poll frequently to catch transient alarms

Process variables like temperatures and pressures can safely use longer intervals (30–60 seconds) with compare: false since trending data benefits from regular samples.

Avoiding Circular Dependencies

If Tag A is dependent on Tag B, and Tag B is dependent on Tag A, you'll create an infinite recursion in the read loop. Production systems guard against this by either:

Limiting dependency depth (typically 1–2 levels)
Tracking a "reading" flag to prevent re-entry
Flattening the graph at configuration parse time
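Checking for cycles at configuration parse time can be done with a depth-first search. The Python sketch below assumes the dependency graph is available as a dict mapping each tag name to the names of its dependent tags. This is a hypothetical in-memory form of the configuration.

```python
def find_dependency_cycle(dependents):
    """Return one cycle as a list of tag names, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(tag):
        state[tag] = GREY
        path.append(tag)
        for child in dependents.get(tag, []):
            if state.get(child, WHITE) == GREY:
                # The child is already on the current path: a cycle is closed.
                return path[path.index(child):] + [child]
            if state.get(child, WHITE) == WHITE:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        state[tag] = BLACK
        return None

    for tag in list(dependents):
        if state.get(tag, WHITE) == WHITE:
            cycle = visit(tag)
            if cycle:
                return cycle
    return None
```

Rejecting the configuration when this returns a cycle such as `["A", "B", "A"]` moves the failure from a runtime stack overflow to a clear load-time error.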

Hourly Full-Refresh

Even with change-driven delivery, it's good practice to force-read and deliver all tags at least once per hour. This catches any edge cases where a value changed but the change was missed (e.g., a brief network hiccup that caused a read failure during the exact moment of change). A simple approach: track the hour boundary and reset the "already read" flag on all tags when the hour rolls over.

How machineCDN Handles Tag Hierarchies

machineCDN's edge infrastructure supports all three relationship types natively. When you configure a device in the platform, you define parent-child dependencies, calculated alarm bits, and comparison-based delivery in the device configuration — no custom scripting required.

The platform's edge daemon handles contiguous register grouping automatically, supports both EtherNet/IP and Modbus (TCP and RTU) from the same configuration model, and provides dual-format batch delivery (JSON for debugging, binary for bandwidth efficiency). Alarm tags are delivered immediately outside the batch cycle, ensuring sub-second alert latency even when the batch timeout is set to 30 seconds.

For teams managing fleets of machines across multiple plants, this means the tag architecture you define once gets deployed consistently to every edge gateway — whether it's monitoring a chiller system with 160+ process variables or a simple TCU with 20 tags.

Key Takeaways

Model relationships, not just addresses. Tags have dependencies that mirror the physical process. Your data architecture should reflect that.
Use comparison to suppress noise. A status word that hasn't changed in 6 hours doesn't need 21,600 duplicate deliveries.
Calculated tags eliminate cloud processing. Decompose packed alarm words at the edge — one Modbus read becomes 16 named boolean signals.
Dependent reads guarantee temporal coherence. When a parent changes, all children are read with the same timestamp.
Group contiguous registers ruthlessly. On serial links, the difference between grouped and ungrouped reads is the difference between working and not working.

The flat-list polling model was fine for SCADA systems monitoring 50 tags on a single HMI. For IIoT platforms handling thousands of data points across fleets of machines, hierarchical tag architectures aren't an optimization — they're the foundation.

MQTT Connection Resilience and Watchdog Patterns for Industrial IoT [2026]

March 2, 2026 · 14 min read

In industrial IoT, the MQTT connection between an edge gateway and the cloud isn't just another network link — it's the lifeline that carries every sensor reading, every alarm event, and every machine heartbeat from the factory floor to the platform where decisions get made. When that connection fails (and it will), the difference between losing data and delivering it reliably comes down to how well you've designed your resilience patterns.

This guide covers the engineering patterns that make MQTT connections production-hardened for industrial telemetry — the kind of patterns that emerge only after years of operating edge devices in factories with unreliable cellular connections, expired certificates, and firmware updates that reboot network interfaces at 2 AM.

The Industrial MQTT Reliability Challenge

Enterprise MQTT (monitoring dashboards, chat apps, consumer IoT) can tolerate occasional message loss. Industrial MQTT cannot. Here's why:

A single missed alarm could mean a $200,000 compressor failure goes undetected
Regulatory compliance may require continuous data records with no gaps
Production analytics (OEE, downtime tracking) become meaningless with data holes
Edge gateways operate unattended for months or years — there's nobody to restart the process

The standard MQTT client libraries provide reconnection, but reconnection alone isn't resilience. True resilience means:

Data generated during disconnection is preserved
Reconnection happens without blocking data acquisition
Authentication tokens are refreshed before they expire
The system detects and recovers from "zombie connections" (TCP says connected, but no data flows)
All of this works on devices with 32MB of RAM running on cellular networks

Asynchronous Connection Architecture

The first and most important pattern: never let MQTT connection attempts block your data acquisition loop.

The Problem with Synchronous Connect

A synchronous mqtt_connect() call blocks until it either succeeds or times out. On a cellular network with DNS issues, this can take 30–60 seconds. During that time, your edge device isn't reading any PLCs, which means:

Lost data points during the connection attempt
Stale data in the PLC's scan buffer
Potential PLC communication timeouts if you miss polling windows

The Async Pattern

The production-proven pattern separates the connection lifecycle into its own thread:

Main Thread:                    Connection Thread:
┌─────────────┐                ┌──────────────────┐
│ Read PLCs   │                │ Wait for signal   │
│ Batch data  │──signal───────>│ Connect async     │
│ Buffer data │                │ Set callbacks      │
│ Continue... │<──callback─────│ Report status      │
└─────────────┘                └──────────────────┘

Key design decisions:

Use a semaphore pair to coordinate: one "job ready" semaphore and one "thread idle" semaphore. The main thread only signals a new connection attempt if the connection thread is idle (try-wait on the idle semaphore).
Connection thread is long-lived — it starts at boot and runs forever, waiting for connection signals. Don't create/destroy threads for each connection attempt; the overhead on embedded Linux systems is significant.
Never block the main thread waiting for connection. If the connection thread is busy with a previous attempt, skip and try again on the next cycle.

// Pseudocode for async connection pattern.
// Assumes idle_semaphore starts at 1 (thread idle) and job_semaphore at 0.
void connection_thread() {
    while (true) {
        wait(job_semaphore);         // Block until signaled

        result = mqtt_connect_async(host, port, keepalive=60);
        if (result != SUCCESS) {
            log("Connection attempt failed: %d", result);
        }

        post(idle_semaphore);        // Signal that we're done
    }
}

void main_loop() {
    while (true) {
        read_plc_data();
        batch_and_buffer_data();

        // try_wait never blocks: it succeeds only if the thread is idle
        if (!mqtt_connected && try_wait(idle_semaphore)) {
            // Connection thread is idle, so kick off a new attempt
            post(job_semaphore);
        }
    }
}

Reconnection Delay

After a disconnection, don't immediately hammer the broker with reconnection attempts:

Fixed delay: 5 seconds between attempts works well for most industrial scenarios
Don't use exponential backoff for industrial MQTT — unlike consumer apps where millions of clients might storm a broker simultaneously, your edge gateway is one device connecting to one endpoint. A constant 5-second retry gets you reconnected faster than exponential backoff without creating meaningful load.
Disable jitter — again, you're not protecting against thundering herd. Get connected as fast as reliably possible.
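
The fixed-delay policy can be sketched in a few lines of Python without ever sleeping in the acquisition loop. This is a minimal illustration: the `RetryGate` class and its injectable `clock` parameter are assumptions made for the example, not part of any MQTT library.

```python
import time

RETRY_DELAY_S = 5.0  # fixed delay from the text; no backoff, no jitter


class RetryGate:
    """Allows at most one connection attempt per delay period (hypothetical helper)."""

    def __init__(self, delay_s=RETRY_DELAY_S, clock=time.monotonic):
        self._delay_s = delay_s
        self._clock = clock          # monotonic: immune to wall-clock jumps
        self._last_attempt = None    # None means no attempt has been made yet

    def should_attempt(self):
        """Return True (and record the attempt) if the delay has elapsed."""
        now = self._clock()
        if self._last_attempt is None or now - self._last_attempt >= self._delay_s:
            self._last_attempt = now
            return True
        return False
```

The main loop would call `should_attempt()` once per cycle while disconnected and signal the connection thread only when it returns True.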

Page-Based Output Buffering

The output buffer is where resilience lives. When MQTT is disconnected, data keeps flowing from PLCs. Without proper buffering, that data is lost.

Buffer Architecture

The most robust pattern for embedded systems uses a page-based ring buffer:

┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│  Page 0  │ │  Page 1  │ │  Page 2  │ │  Page 3  │
│ [filled] │ │ [filling]│ │  [free]  │ │  [free]  │
│  sent ✓  │ │  ← write │ │          │ │          │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
     ↑ read

Three page states:

Free pages: Available for new data
Work page: Currently being written to by the data acquisition loop
Used pages: Filled with data, waiting to be sent

How it flows:

Data arrives from the batch layer → written to the current work page
When the work page is full → moved to the used pages queue
When MQTT is connected → first used page begins transmission
When MQTT confirms delivery (via PUBACK for QoS 1) → page moves back to free pool
When the connection drops → stop sending, but keep accepting data

The Critical Overflow Case

What happens when all pages are full and new data arrives? You have two choices:

Drop new data (preserve old data) — generally wrong for industrial monitoring, where the most recent data is most valuable
Overwrite oldest data (preserve new data) — correct for most IIoT scenarios

The practical implementation: when no free pages are available, extract the oldest used page (which hasn't been sent yet), reuse it for new data, and log a buffer overflow warning. This means you lose the oldest unsent data, but you always have the most recent readings.
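
The page states and the overwrite-oldest rule can be sketched as follows. This is a simplified Python model under stated assumptions: the class and method names are invented for illustration, locking is omitted, and a real implementation would also have to handle a page that is mid-transmission when it gets recycled.

```python
from collections import deque


class PageBuffer:
    """Page-based buffer sketch: free pool, one work page, FIFO of used pages."""

    def __init__(self, page_count, page_size):
        if page_count < 3:
            raise ValueError("need at least 3 pages")
        self.page_size = page_size
        self.free = deque(bytearray() for _ in range(page_count - 1))
        self.work = bytearray()      # page currently being filled
        self.used = deque()          # filled pages, oldest first
        self.overflow_count = 0

    def write(self, record: bytes):
        """Append a record, rotating the work page when it would overflow."""
        if len(record) > self.page_size:
            raise ValueError("record larger than a page")
        if len(self.work) + len(record) > self.page_size:
            self._rotate()
        self.work += record

    def _rotate(self):
        self.used.append(self.work)
        if self.free:
            page = self.free.popleft()
        else:
            # Overflow: recycle the oldest unsent page, keeping the newest data
            page = self.used.popleft()
            self.overflow_count += 1
        page.clear()
        self.work = page

    def peek_oldest(self):
        """Oldest filled page awaiting delivery, or None."""
        return self.used[0] if self.used else None

    def ack_oldest(self):
        """Call after PUBACK: return the delivered page to the free pool."""
        self.free.append(self.used.popleft())
```

The `overflow_count` field is the counter you would surface in the gateway's status message, so the cloud side can tell that data was dropped.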

Page Size Tuning

Page size creates a trade-off:

Page Size	Pros	Cons
Small (4KB)	More pages → finer granularity	More overhead per page
Medium (16KB)	Good balance	—
Large (64KB)	Fewer MQTT publishes	Single corrupt byte wastes more data

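Practical recommendation: For industrial telemetry, 16–32KB pages work well. With a 500KB total buffer, that gives you roughly 15–31 pages. At typical telemetry rates (1KB every 10 seconds), 500KB holds about 5,000 seconds of data, or roughly 80 minutes of offline buffering, which is enough to ride through most network glitches. The window shrinks in proportion to the data rate: at 2KB per second, the same buffer lasts only about 4 minutes.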

Minimum page count: You need at least 3 pages for the system to function: one being written, one being sent, and one free for rotation. Validate this at initialization.
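
The sizing arithmetic and the minimum-page check are easy to encode as an initialization guard. A small sketch, with function names chosen for illustration and byte counts left to the caller (so it works whether you treat 1KB as 1,000 or 1,024 bytes):

```python
def page_count(total_bytes: int, page_bytes: int) -> int:
    """Number of whole pages that fit in the buffer; rejects unusable layouts."""
    count = total_bytes // page_bytes
    if count < 3:
        raise ValueError(f"only {count} pages fit; at least 3 are required")
    return count


def offline_seconds(total_bytes: int, bytes_per_second: float) -> float:
    """How long the buffer can absorb data while disconnected."""
    return total_bytes / bytes_per_second
```

Running both at startup with the configured buffer size and the expected telemetry rate gives an early warning when a configuration change silently shrinks the offline window.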

Thread Safety

The buffer must be thread-safe because it's accessed from:

The data acquisition thread (writes)
The MQTT publish callback (marks pages as delivered)
The connection/disconnection callbacks (enable/disable sending)

Use a single mutex protecting all buffer operations. Don't use multiple fine-grained locks — the complexity isn't worth it for the throughput levels of industrial telemetry (kilobytes per second, not gigabytes).

MQTT Delivery Pipeline: One Packet at a Time

For QoS 1 delivery (the minimum for industrial data), the edge gateway must track delivery acknowledgments. The pattern that works in production:

Stop-and-Wait Protocol

Rather than flooding the broker with multiple in-flight publishes, use a strict one-at-a-time delivery:

Send one message from the head of the buffer
Set a "packet sent" flag — no more sends until this clears
Wait for PUBACK via the publish callback
On PUBACK: Clear the flag, advance the read pointer, send the next message
On disconnect: Clear the flag (the retransmission will happen after reconnection)

// MQTT publish callback (called by network thread)
void on_publish(int packet_id) {
    lock(buffer_mutex);

    // Verify the acknowledged ID matches our sent packet
    if (current_page->read_pointer->message_id == packet_id) {
        // Advance read pointer past this message
        advance_read_pointer(current_page);

        // If page fully delivered, move to free pool
        if (current_page->read_pointer >= current_page->write_pointer) {
            move_page_to_free(current_page);
        }

        // Allow next send
        packet_in_flight = false;

        // Immediately try to send next message
        // (try_send_next must not re-lock buffer_mutex; we already hold it)
        try_send_next();
    }
    // A non-matching ID is ignored; the in-flight message stays pending

    unlock(buffer_mutex);
}

Why one at a time? Industrial edge devices have limited RAM. Maintaining a window of multiple in-flight messages requires tracking each one for retransmission. The throughput difference is negligible because industrial telemetry data rates are low (typically <100 messages per minute), and the round-trip to a cloud MQTT broker is 50–200ms. One-at-a-time gives you ~5–20 messages per second — more than enough.

Watchdog Patterns

Reconnection handles obvious disconnections. Watchdogs handle the subtle ones.

The Zombie Connection Problem

TCP connections can enter a state where:

The local TCP stack believes the connection is active
The remote broker has timed out and dropped the session
No PINGREQ/PINGRESP is exchanged because the network path is black-holed (packets leave but never arrive)
The MQTT library's internal keep-alive timer hasn't fired yet

During a zombie connection, your edge device is silently discarding data — it thinks it's publishing, but nothing reaches the broker.

MQTT Delivery Watchdog

Monitor the time since the last successfully delivered packet (confirmed by PUBACK):

// Record delivery time on every PUBACK
// (also reset last_delivered_timestamp in on_connect so a fresh
// connection starts with a full timeout window)
void on_publish(int packet_id) {
    clock_gettime(CLOCK_MONOTONIC, &last_delivered_timestamp);
    // ... rest of delivery handling
}

// In your main loop (every 60 seconds)
void check_mqtt_watchdog() {
    if (!mqtt_connected)
        return;

    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    long elapsed = (long)(now.tv_sec - last_delivered_timestamp.tv_sec);

    if (has_pending_data && elapsed > WATCHDOG_TIMEOUT) {
        log("MQTT watchdog: no delivery in %ld seconds, forcing reconnect", elapsed);
        mqtt_disconnect();
        // Reconnection thread will handle the rest
    }
}

Watchdog timeout: Set this to 2–3× your keep-alive interval. If your MQTT keep-alive is 60 seconds, set the watchdog to 120–180 seconds. This gives the MQTT library's built-in keep-alive mechanism time to detect the problem first, with the watchdog as a safety net.

Upstream Token/Certificate Watchdog

Cloud IoT platforms use time-limited credentials:

Azure IoT Hub: Shared Access Signature (SAS) tokens with expiry timestamps
AWS IoT Core: X.509 certificates with expiry dates
Google Cloud IoT Core (retired in August 2023): JWT tokens with a lifetime of at most 24 hours; the same short-lived-token pattern applies to other JWT-based brokers

When a token expires, the broker closes the connection. If your edge device doesn't handle this gracefully, it enters a reconnection loop that burns battery (for cellular devices) and creates connection storm load on the broker.

The pattern:

Parse the token expiry at startup — extract the se= (signature expiry) timestamp from SAS tokens
Log a warning when the token is approaching expiry (e.g., within 1 week)
Compare against system time — if the token is expired, log a critical alert but continue trying to connect (the token might be refreshable via a management API)
If the system clock is wrong (common on embedded devices without RTC), the token check will fail spuriously — log this case separately

// SAS token expiry check
time_t se_timestamp = parse_sas_expiry(token);
time_t now = time(NULL);

if (now > se_timestamp) {
    // ctime() output already ends with a newline
    log(CRITICAL, "SAS token expired! Token valid until: %s", ctime(&se_timestamp));
    log(CRITICAL, "Check that NTP is running. Current time: %s", ctime(&now));
    // Continue anyway; reconnection will fail with an auth error
} else {
    time_t remaining = se_timestamp - now;
    if (remaining < 604800) {  // Less than 1 week
        log(WARNING, "SAS token expires in %ld days", (long)(remaining / 86400));
    }
}
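
The snippet above leaves `parse_sas_expiry` undefined. One way to extract the `se=` field is sketched below in Python for brevity. It assumes the usual Azure layout of a `SharedAccessSignature ` prefix followed by `&`-separated fields; the second helper, `seconds_until_expiry`, is a name invented for this example.

```python
from urllib.parse import parse_qs


def parse_sas_expiry(token: str) -> int:
    """Return the se= expiry (Unix seconds) from an Azure-style SAS token."""
    prefix = "SharedAccessSignature "
    if not token.startswith(prefix):
        raise ValueError("not a SAS token")
    fields = parse_qs(token[len(prefix):])
    if "se" not in fields:
        raise ValueError("SAS token has no se= field")
    return int(fields["se"][0])


def seconds_until_expiry(token: str, now: float) -> float:
    """Positive while the token is valid, negative once it has expired."""
    return parse_sas_expiry(token) - now
```

Passing `now` in explicitly, rather than reading the clock inside the function, makes it easy to log the "system clock is wrong" case separately from a real expiry.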

System Uptime Reporting

Include system and daemon uptime in your status messages. This helps diagnose issues remotely:

System uptime tells you if the device rebooted (power outage, watchdog reset, kernel panic)
Daemon uptime tells you if just the software restarted (crash, OOM kill, manual restart)
MQTT connection uptime tells you how long the current broker session (Azure IoT Hub in this example) has been active

When you see a pattern of short MQTT uptimes with long system uptimes, you know it's a connectivity or authentication issue, not a hardware problem.

Status Reporting Over MQTT

Edge gateways should periodically publish their own health status, not just telemetry data. A well-designed status message includes:

{
  "cmd": "status",
  "ts": 1709391600,
  "version": {
    "sdk": "2.1.0",
    "firmware": "5.22",
    "revision": "a3f8c2d"
  },
  "system_uptime": 864000,
  "daemon_uptime": 72000,
  "sas_expiry": 1712070000,
  "plc": {
    "type": 1017,
    "link_state": 1,
    "config_version": "v3.2",
    "serial_number": 196612
  },
  "buffer": {
    "free_pages": 12,
    "used_pages": 3,
    "overflow_count": 0
  }
}

Publish status on two occasions:

Immediately after connecting — so the cloud knows the device is alive and what version it's running
Periodically (every 5–15 minutes) — for ongoing health monitoring

Extended status (including full tag listings and values) should only be sent on-demand (via cloud-to-device command) to avoid wasting bandwidth.
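
Assembling the compact status message might look like the sketch below. It reuses the field names from the JSON example above; the function signature, the `DAEMON_START` constant, and the choice to pass system uptime in as a parameter are assumptions made for illustration.

```python
import json
import time

DAEMON_START = time.monotonic()  # captured once when the process starts


def build_status(system_uptime_s, sas_expiry, free_pages, used_pages,
                 overflow_count, now=None, monotonic_now=None):
    """Assemble a compact status message using the field names shown above."""
    now = time.time() if now is None else now
    monotonic_now = time.monotonic() if monotonic_now is None else monotonic_now
    status = {
        "cmd": "status",
        "ts": int(now),
        "system_uptime": int(system_uptime_s),
        "daemon_uptime": int(monotonic_now - DAEMON_START),
        "sas_expiry": int(sas_expiry),
        "buffer": {
            "free_pages": free_pages,
            "used_pages": used_pages,
            "overflow_count": overflow_count,
        },
    }
    # Compact separators keep the payload small on metered cellular links
    return json.dumps(status, separators=(",", ":"))
```

Daemon uptime is derived from a monotonic clock so that an NTP correction after boot cannot make it jump or go negative.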

Protocol Version and QoS Selection

MQTT Protocol Version

Use MQTT 3.1.1 for industrial deployments in 2026. While MQTT 5.0 offers useful features (topic aliases, flow control, shared subscriptions), the library support on embedded Linux systems is less mature, and many cloud IoT brokers still have edge cases with v5 features.

MQTT 3.1.1 does everything an edge gateway needs:

QoS 0/1/2
Retained messages
Last Will and Testament
Keep-alive

QoS Level Selection

Data Type	Recommended QoS	Rationale
Telemetry batches	QoS 1	Guaranteed delivery, acceptable duplicate tolerance
Alarm events	QoS 1	Must not be lost
Status messages	QoS 1	Used for device health monitoring
Configuration commands (C2D)	QoS 1	Device must receive and acknowledge

Why not QoS 2? The exactly-once guarantee of QoS 2 requires a 4-step handshake (PUBLISH → PUBREC → PUBREL → PUBCOMP), doubling the round-trips. For industrial telemetry, occasional duplicates are easily handled by the cloud platform (deduplicate by timestamp + device serial), and the reduced latency of QoS 1 is worth it.

Why not QoS 0? Fire-and-forget has no delivery guarantee. For a consumer temperature sensor, losing one reading per hour is acceptable. For a $2M injection molding machine, losing the reading that showed the barrel temperature exceeded safe limits is not.
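
On the cloud side, the duplicate handling that QoS 1 requires can be as small as a bounded set of recently seen keys. A sketch, assuming each message carries a device serial and a timestamp as described above; the `Deduplicator` class and its `max_keys` bound are illustrative choices.

```python
from collections import OrderedDict


class Deduplicator:
    """Drops QoS 1 redeliveries keyed by (device serial, timestamp)."""

    def __init__(self, max_keys=10_000):
        self._seen = OrderedDict()   # insertion-ordered, used as a bounded set
        self._max_keys = max_keys

    def is_new(self, serial: int, ts: int) -> bool:
        """Return True the first time a key is seen, False for a duplicate."""
        key = (serial, ts)
        if key in self._seen:
            return False
        self._seen[key] = True
        if len(self._seen) > self._max_keys:
            self._seen.popitem(last=False)   # evict the oldest key
        return True
```

The bound matters: redeliveries arrive shortly after the original, so remembering only recent keys is usually enough, and memory use stays fixed.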

Cloud-to-Device Commands

Resilient MQTT isn't just about outbound telemetry. Edge gateways need to receive commands from the cloud:

Configuration updates — new tag definitions, changed polling intervals, updated batch sizes
Force read — immediately read and transmit all tag values
Status request — request a full status report including all tag values
Link state — report whether each connected PLC is reachable

Subscribe to the command topic immediately in the on-connect callback, before doing anything else:

void on_connect(status) {
    if (status == 0) {  // Connection successful
        mqtt_subscribe(command_topic, QoS=1);
        send_status(full=false);
        buffer_process_connect();  // Enable data transmission
    }
}

Topic structure for Azure IoT Hub:

Publish: devices/{device_id}/messages/events/
Subscribe: devices/{device_id}/messages/devicebound/#

The # wildcard on the subscribe topic captures all cloud-to-device messages regardless of their property bags.
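
A command handler should tolerate malformed input, since a bad cloud-to-device message must never take the gateway down. The sketch below builds the two topics in the layout shown above and dispatches on the `cmd` field. Only `"status"` appears in this article; any other command name, and the handler-table design itself, are assumptions.

```python
import json


def azure_topics(device_id: str):
    """Publish and subscribe topics in the Azure IoT Hub layout shown above."""
    return (f"devices/{device_id}/messages/events/",
            f"devices/{device_id}/messages/devicebound/#")


def dispatch_command(payload: bytes, handlers: dict):
    """Decode a JSON command and call the matching handler.

    Returns the handler's result, or None for malformed or unknown commands.
    """
    try:
        message = json.loads(payload)
    except (ValueError, UnicodeDecodeError):
        return None
    if not isinstance(message, dict):
        return None
    cmd = message.get("cmd")
    if not isinstance(cmd, str):
        return None
    handler = handlers.get(cmd)
    return handler(message) if handler else None
```

Keeping the handlers in a dictionary means adding a new command is a one-line change and unknown commands fall through harmlessly.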

TLS Configuration for Industrial MQTT

Virtually all cloud MQTT brokers require TLS. The configuration is straightforward but has operational pitfalls:

Certificate Management

Store the CA certificate file on the device filesystem
Monitor the file modification time — if the cert file is updated, reinitialize the MQTT client
Don't embed certificates in firmware — they expire, and firmware updates in factories are expensive
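
Watching the certificate file for changes needs nothing more than a periodic `stat`. A minimal sketch, where the `FileWatcher` class is an illustrative name and the decision to compare size alongside modification time is an added safeguard for filesystems with coarse timestamps:

```python
import os


class FileWatcher:
    """Detects replacement or modification of a file via its mtime and size."""

    def __init__(self, path):
        self._path = path
        self._signature = self._read_signature()

    def _read_signature(self):
        try:
            st = os.stat(self._path)
        except OSError:
            return None              # missing file: treated as its own state
        return (st.st_mtime_ns, st.st_size)

    def changed(self):
        """Return True once per change; the caller then reinitializes the client."""
        current = self._read_signature()
        if current != self._signature:
            self._signature = current
            return True
        return False
```

Calling `changed()` from the main loop every few seconds is cheap, and the same helper works for the gateway's configuration file.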

Common TLS Failures

Error	Cause	Fix
Certificate verify failed	CA cert expired or wrong	Update CA cert bundle
Handshake timeout	Firewall blocking port 8883	Check outbound rules for 8883
SNI mismatch	Wrong hostname in TLS SNI	Ensure the MQTT host matches the certificate's SAN (or CN on older certificates)
Memory allocation failed	Insufficient RAM for TLS buffers	Free memory before TLS init

Putting It All Together: The Resilient Edge Stack

The complete architecture for a production-hardened IIoT edge gateway:

┌──────────────────────────────────────────────┐
│                   Cloud                       │
│   ┌──────────────────────────────────┐       │
│   │  MQTT Broker (Azure/AWS/GCP)     │       │
│   └──────────────┬───────────────────┘       │
└──────────────────┼───────────────────────────┘
                   │ TLS + QoS 1
┌──────────────────┼───────────────────────────┐
│  Edge Gateway    │                            │
│   ┌──────────────┴───────────────────┐       │
│   │  MQTT Client (async connect)      │       │
│   │  - Reconnect thread               │       │
│   │  - Delivery watchdog              │       │
│   │  - Token expiry monitor           │       │
│   └──────────────┬───────────────────┘       │
│   ┌──────────────┴───────────────────┐       │
│   │  Page-Based Output Buffer         │       │
│   │  - Ring buffer with overflow      │       │
│   │  - Thread-safe page management    │       │
│   │  - Stop-and-wait delivery         │       │
│   └──────────────┬───────────────────┘       │
│   ┌──────────────┴───────────────────┐       │
│   │  Data Batch Layer                 │       │
│   │  - JSON or binary encoding        │       │
│   │  - Size-based finalization        │       │
│   │  - Timeout-based finalization     │       │
│   └──────────────┬───────────────────┘       │
│   ┌──────────────┴───────────────────┐       │
│   │  PLC Communication Layer          │       │
│   │  - Modbus TCP / RTU              │       │
│   │  - EtherNet/IP                    │       │
│   │  - Link state tracking            │       │
│   └──────────────────────────────────┘       │
└──────────────────────────────────────────────┘

Platforms like machineCDN implement this complete stack, handling the complexity of reliable MQTT delivery so that plant engineers can focus on what matters: understanding their machine data, not debugging network connections.

Key Takeaways

Never block PLC reads for MQTT connections — use asynchronous connection in a separate thread
Buffer everything — page-based ring buffers survive disconnections and minimize memory fragmentation
Deliver one message at a time with QoS 1 — simple, reliable, and sufficient for industrial data rates
Implement watchdogs — delivery watchdog for zombie connections, token expiry watchdog for authentication lifecycle
Report status — edge device health telemetry is as important as machine telemetry
Monitor file changes — detect certificate and configuration updates without restarting
Use MQTT 3.1.1 with QoS 1 — mature, well-supported, and sufficient for all industrial use cases
Design for unattended operation — the gateway must recover from any failure without human intervention

Building resilient MQTT connections isn't about handling the happy path — it's about handling every way the network, the broker, the certificates, and the device itself can fail, and ensuring that when everything comes back online, every data point makes it to the cloud.

RS-485 Serial Communication for IIoT: Modbus RTU Wiring, Timing, and Troubleshooting [2026]

March 2, 2026 · 14 min read

Despite the march toward Ethernet-based protocols, RS-485 serial communication remains the backbone of industrial connectivity. Millions of PLCs, variable frequency drives, temperature controllers, and sensors deployed across factory floors today still communicate exclusively over serial lines. If you're building an IIoT platform that connects to real equipment — not just greenfield installations — you need to understand RS-485 deeply.

This guide covers everything a plant engineer or IIoT integrator needs to know about making RS-485 serial links reliable in production environments.

Why RS-485 Still Matters in 2026

The industrial world moves slowly for good reason: stability matters more than speed when a communication failure could halt a $50,000-per-hour production line. RS-485 has several characteristics that keep it relevant:

Distance: Up to 1,200 meters (4,000 feet) on a single segment — far beyond Ethernet's 100-meter limit without switches
Multi-drop: Up to 32 devices on a single bus (256 with high-impedance receivers)
Noise immunity: Differential signaling rejects common-mode noise from VFDs, motors, and welders
Simplicity: Two wires (plus ground), no switches, no IP configuration, no DHCP servers
Installed base: Tens of millions of Modbus RTU devices deployed globally

The challenge isn't whether RS-485 works — it's making it work reliably in electrically hostile environments while meeting the throughput requirements of modern IIoT platforms.

Modbus RTU Over RS-485: The Protocol Stack

When we talk about RS-485 in industrial settings, we're almost always talking about Modbus RTU. Understanding the relationship between the physical layer and the protocol layer is critical for troubleshooting.

The Physical Layer: RS-485

RS-485 (technically TIA/EIA-485) defines the electrical characteristics:

Parameter	Specification
Signaling	Differential (two-wire)
Voltage swing	±1.5V to ±6V between A and B lines
Receiver threshold	±200mV minimum
Common-mode range	-7V to +12V
Max data rate	10 Mbps (at short distances)
Max distance	1,200m at 100 kbps
Max devices	32 unit loads (standard drivers)

The Protocol Layer: Modbus RTU

Modbus RTU sits on top of the serial link and defines:

Framing: Silent intervals of 3.5 character times delimit frames
Addressing: Slave addresses 1–247 (address 0 is broadcast)
Function codes: Define the operation (read coils, read registers, write registers, etc.)
Error detection: CRC-16 appended to every frame

The critical insight: Modbus RTU framing depends on timing, not special characters. Unlike Modbus ASCII (which uses : and CR/LF delimiters), RTU uses gaps of silence to mark frame boundaries. This makes timing parameters absolutely critical.
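
To make the framing concrete, here is a short Python sketch of the CRC-16 calculation and an 8-byte read request. The bitwise algorithm (initial value 0xFFFF, reflected polynomial 0xA001, CRC appended low byte first) is the standard Modbus one; the function names are chosen for this example.

```python
def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS: initial value 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def build_read_request(slave: int, function: int, start: int, count: int) -> bytes:
    """8-byte read request: address, function, start, count, CRC (low byte first)."""
    body = bytes([slave, function]) + start.to_bytes(2, "big") + count.to_bytes(2, "big")
    return body + crc16_modbus(body).to_bytes(2, "little")
```

A handy property for receivers: running `crc16_modbus` over a complete, uncorrupted frame, CRC bytes included, yields zero.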

Link Parameter Configuration: Getting It Right

Every RS-485 Modbus RTU connection requires five parameters to match between master and slave. Get any one of them wrong, and you'll see zero communication.

Baud Rate

Common industrial baud rates:

Baud Rate	Bytes/sec (8N1)	Typical Use Case
9600	~960	Legacy devices, long cable runs (>500m)
19200	~1,920	Standard industrial default
38400	~3,840	Modern PLCs, shorter runs
57600	~5,760	High-speed data acquisition
115200	~11,520	Point-to-point, short distance

Practical recommendation: Start at 9600 baud for commissioning. It's the most universally supported rate and gives you the best noise margin on long cable runs. Once communication is established and stable, increase the baud rate if throughput requires it.

The relationship between baud rate and maximum reliable distance is approximately:

baud  → 1,200m reliable
baud → 900m reliable
baud → 600m reliable
115200 baud → 200m reliable

These numbers assume proper termination and shielded twisted-pair cable.

Parity and Stop Bits

The Modbus RTU specification requires 11 bits per character:

8E1 (8 data bits, Even parity, 1 stop bit) — Modbus standard default
8O1 (8 data bits, Odd parity, 1 stop bit) — Alternative
8N2 (8 data bits, No parity, 2 stop bits) — Common substitute

Critical note: Many PLCs default to 8N1 (no parity, 1 stop bit = 10 bits), which technically violates the Modbus spec. If a device uses 8N1, the master must match, but be aware that frame timing calculations change because each character is 10 bits instead of 11.

Slave Address (Base Address)

Every device on the RS-485 bus needs a unique address between 1 and 247. This is typically set:

Via DIP switches on the device
Through the device's front-panel menu
In the device's configuration register

Common mistake: Address 0 is broadcast, so never assign it to a device. Addresses 248–255 are reserved.

Byte Timeout and Response Timeout

These two timeout values are critical and often misunderstood:

Byte Timeout (inter-character timeout): The maximum time allowed between consecutive bytes within a single frame. Modbus RTU specifies this as 1.5 character times. For 9600 baud with 8E1 (11 bits per character):

1 character time = 11 bits / 9600 bps = 1.146 ms
1.5 character times = 1.719 ms

In practice, setting the byte timeout to 3–5 ms at 9600 baud provides a safe margin for real-world serial port implementations.
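
The character-time arithmetic generalizes to any baud rate and frame format. A small sketch with illustrative function names; the fixed values above 19200 baud follow the recommendation in the Modbus serial line specification:

```python
def char_time_ms(baud: int, bits_per_char: int = 11) -> float:
    """Time to transmit one character, in milliseconds."""
    return bits_per_char * 1000.0 / baud


def rtu_timeouts_ms(baud: int, bits_per_char: int = 11):
    """Return (inter-character, inter-frame) gaps: 1.5 and 3.5 character times.

    Above 19200 baud the Modbus serial line specification recommends fixed
    values of 0.75 ms and 1.75 ms instead of scaling with the baud rate.
    """
    if baud > 19200:
        return 0.75, 1.75
    t = char_time_ms(baud, bits_per_char)
    return 1.5 * t, 3.5 * t
```

Pass `bits_per_char=10` for devices configured as 8N1 so the computed gaps match what is actually on the wire.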

Response Timeout: The maximum time to wait for a slave to begin responding after the master sends a request. The Modbus specification doesn't define this — it depends on the slave device's processing time.

Device Type	Typical Response Time
Simple I/O modules	5–20 ms
PLCs (scan-dependent)	10–100 ms
VFDs	20–50 ms
Smart sensors	50–200 ms
Older/slow devices	100–500 ms

Start conservative: Set response timeout to 100–200 ms initially. Reduce it once you know the actual response time of your devices.

Modbus Address Conventions and Function Code Selection

One of the most confusing aspects of Modbus is the addressing convention. Different manufacturers use different numbering schemes, and getting this wrong means reading from the wrong registers.

The Six-Digit Convention

Many IIoT platforms and configuration tools use a six-digit address convention to encode both the register type and the offset:

Address Range	Modbus Function Code	Register Type	Description
000001–065536	FC 01 (Read Coils)	Coils (bits)	Read/write discrete outputs
100001–165536	FC 02 (Read Discrete Inputs)	Discrete Inputs	Read-only digital inputs
300001–365536	FC 04 (Read Input Registers)	Input Registers	Read-only 16-bit analog values
400001–465536	FC 03 (Read Holding Registers)	Holding Registers	Read/write 16-bit configuration values

Example: An address of 300201 means:

Register type: Input Register (3xxxxx)
Modbus offset: 201 (subtract 300000)
Function code: FC 04

An address of 400006 means:

Register type: Holding Register (4xxxxx)
Modbus offset: 6 (subtract 400000)
Function code: FC 03

The Off-by-One Problem

The Modbus protocol uses zero-based addressing on the wire, but much vendor documentation and many HMI tools use one-based numbering. Register "40001" in documentation is actually address 0 in the Modbus frame.

Rule of thumb: If you're getting zeros or unexpected values, try shifting your address by ±1. This single issue causes more commissioning headaches than any other Modbus problem.
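
Decoding a six-digit address and handling the off-by-one question can be combined in one helper. In this sketch the `one_based` flag is an assumption: some tools treat the trailing digits as a one-based register number (so 300201 becomes wire address 200), while others use them directly as the offset (201). Check which your platform does.

```python
REGISTER_TYPES = {
    0: ("coil", 1),               # 0xxxxx -> FC 01
    1: ("discrete_input", 2),     # 1xxxxx -> FC 02
    3: ("input_register", 4),     # 3xxxxx -> FC 04
    4: ("holding_register", 3),   # 4xxxxx -> FC 03
}


def decode_address(address: int, one_based: bool = True):
    """Split a six-digit address into (type, function code, wire address)."""
    prefix, number = divmod(address, 100000)
    if prefix not in REGISTER_TYPES:
        raise ValueError(f"unknown register type prefix {prefix}")
    wire = number - 1 if one_based else number
    if not 0 <= wire <= 0xFFFF:
        raise ValueError(f"address {address} is out of range")
    name, function_code = REGISTER_TYPES[prefix]
    return name, function_code, wire
```

Making the convention an explicit parameter turns the "try shifting by one" commissioning step into a single configuration switch.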

Contiguous Register Optimization

When polling multiple tags from a Modbus device, the difference between naive polling (one request per tag) and optimized polling (grouped contiguous reads) is enormous.

The Problem with Per-Tag Polling

Consider reading 10 individual holding registers at 9600 baud:

Per request overhead:
  Request frame:  8 bytes (addr + FC + start + count + CRC)
  Response frame: 5 bytes overhead + 2 bytes data = 7 bytes
  Turnaround time: ~100 ms (response timeout)
  
10 individual reads:
  Wire time: 10 × (8 + 7) bytes × 11 bits / 9600 bps = 171.9 ms
  Turnaround: 10 × 100 ms = 1,000 ms
  Total: ~1,172 ms

Optimized Contiguous Read

Reading the same 10 registers in a single request (if they're contiguous):

Single request:
  Request frame:  8 bytes
  Response frame: 5 bytes overhead + 20 bytes data = 25 bytes
  Turnaround: 100 ms
  
  Wire time: (8 + 25) bytes × 11 bits / 9600 bps = 37.8 ms
  Total: ~138 ms

That's roughly an 8.5× improvement. For IIoT systems polling hundreds of tags across dozens of devices, this optimization is the difference between 1-second and 10-second update cycles.
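
The same estimate can be wrapped in a small function and rerun for your own baud rate, frame format, and device turnaround. A sketch with illustrative names, using the frame sizes from the breakdown above (8-byte request, 5 bytes of response overhead plus 2 bytes per register):

```python
def transaction_ms(registers: int, baud: int = 9600, bits_per_char: int = 11,
                   turnaround_ms: float = 100.0) -> float:
    """Time for one read of `registers` holding registers, including turnaround."""
    request_bytes = 8                      # addr + FC + start + count + CRC
    response_bytes = 5 + 2 * registers     # addr + FC + byte count + CRC + data
    wire_ms = (request_bytes + response_bytes) * bits_per_char * 1000.0 / baud
    return wire_ms + turnaround_ms


def poll_cycle_ms(reads, **kwargs) -> float:
    """Total time for a sequence of reads, given registers per read."""
    return sum(transaction_ms(n, **kwargs) for n in reads)
```

Call `poll_cycle_ms([1] * 10)` for ten single-register reads and `poll_cycle_ms([10])` for one grouped read. Because turnaround dominates, the number of transactions matters far more than the number of bytes.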

Grouping Rules

Tags can be grouped into a single Modbus read when:

Same function code — you can't mix coil reads (FC 01) with register reads (FC 03) in one request
Contiguous addresses — no gaps in the address range
Same polling interval — tags polled every 1 second shouldn't be grouped with tags polled every 60 seconds
Within size limits — Modbus limits a single read to 125 registers (FC 03/04) or 2,000 coils (FC 01/02)

A practical maximum for a single grouped read is around 50 registers. Beyond that, the response frame gets large enough that serial transmission time becomes significant, and a single corrupted byte invalidates the entire read.
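
The grouping rules translate into a short sort-and-scan routine. This sketch assumes each tag occupies exactly one register and is described by a `(function_code, interval_s, address)` tuple; both the tuple layout and the function name are illustrative, and 32-bit tags would need their register count added.

```python
from itertools import groupby


def group_reads(tags, max_registers=50):
    """Group tags into contiguous reads.

    `tags` is an iterable of (function_code, interval_s, address) tuples, one
    register per tag. Returns a list of (function_code, interval_s, start, count).
    """
    reads = []
    keyed = sorted(set(tags))
    for (fc, interval), members in groupby(keyed, key=lambda t: (t[0], t[1])):
        start = prev = None
        for _, _, addr in members:
            contiguous = prev is not None and addr == prev + 1
            if contiguous and addr - start < max_registers:
                prev = addr          # extend the current read
                continue
            if start is not None:
                reads.append((fc, interval, start, prev - start + 1))
            start = prev = addr      # begin a new read at this address
        reads.append((fc, interval, start, prev - start + 1))
    return reads
```

The grouping only needs to run when the tag configuration changes, so the cost of sorting is paid once rather than on every poll cycle.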

Handling Data Types Across Registers

Modbus registers are 16-bit words, but real-world values are often 32-bit integers or IEEE 754 floats. This requires reading multiple consecutive registers and assembling them correctly.

32-Bit Integer from Two Registers

For a 32-bit integer stored in registers R and R+1:

// High word first ("big-endian" word order, the most common layout):
value = ((uint32_t)registers[R] << 16) | registers[R+1]

// Low word first ("word-swapped", used by some vendors):
value = ((uint32_t)registers[R+1] << 16) | registers[R]

IEEE 754 Float from Two Registers

Floats are trickier because you need to interpret the raw bits as a floating-point value:

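#include <stdint.h>
#include <string.h>

// Read two 16-bit registers
uint16_t reg[2] = { registers[R], registers[R+1] };

// Assemble into 32-bit value, high word first (check vendor word order!)
uint32_t raw = ((uint32_t)reg[0] << 16) | reg[1];

// Reinterpret the bits as a float; memcpy avoids the undefined behavior
// of casting a uint32_t pointer to a float pointer
float value;
memcpy(&value, &raw, sizeof value);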

Critical warning: Byte and word ordering vary by manufacturer. The Modbus specification fixes the byte order within each 16-bit register as big-endian (high byte first), but it does not define the order of the two registers that make up a 32-bit value. Siemens PLCs typically send the high word first. Traditional Modicon devices send the low word first, and Allen-Bradley and other vendors follow their own conventions. Always consult the device manual and verify with known values.
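
When prototyping or verifying against known values, the same assembly is easy to do in Python with the standard `struct` module. The function names and the `high_word_first` flag are illustrative; flip the flag when a device uses the word-swapped layout.

```python
import struct


def decode_float32(first: int, second: int, high_word_first: bool = True) -> float:
    """Combine two 16-bit registers into an IEEE 754 single-precision float."""
    high, low = (first, second) if high_word_first else (second, first)
    # ">HH" packs two big-endian 16-bit words; ">f" reads them as one float
    return struct.unpack(">f", struct.pack(">HH", high, low))[0]


def decode_int32(first: int, second: int, high_word_first: bool = True) -> int:
    """Combine two 16-bit registers into a signed 32-bit integer."""
    high, low = (first, second) if high_word_first else (second, first)
    return struct.unpack(">i", struct.pack(">HH", high, low))[0]
```

A quick sanity check during commissioning: set a known value such as 25.0 on the device and confirm which flag setting reproduces it.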

Element Count Configuration

When configuring a tag that spans multiple registers, you need to specify:

Element count: 1 for a single 16-bit register, 2 for a 32-bit value across two registers
Data type: int16, uint16, int32, uint32, float
Start index: Position within an array (for array tags)

Getting the element count wrong is a common source of garbled data — you'll read a 32-bit float as two separate 16-bit integers, producing nonsensical values.

Compare-on-Change: Reducing Bandwidth

For IIoT systems monitoring hundreds of tags, not every value needs to be transmitted every poll cycle. A compare-on-change strategy dramatically reduces bandwidth:

Read the tag from the PLC at the configured interval
Compare the new value to the last transmitted value
Transmit only if changed — skip transmission for unchanged values
Force-read periodically — every hour, transmit all values regardless of change to ensure the cloud stays synchronized

This approach is especially effective for:

Boolean alarm tags that are "false" 99.9% of the time
Setpoints that rarely change
Status registers that hold steady during normal operation

For analog values like temperatures that fluctuate continuously, compare-on-change is less useful — a deadband (minimum change threshold) is typically needed instead.
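
Compare-on-change, the periodic forced transmit, and an optional deadband fit in one small per-tag filter. A sketch under stated assumptions: the `ChangeFilter` class is an illustrative name, the caller supplies a monotonic timestamp, and the deadband is measured against the last transmitted value rather than the last read value.

```python
class ChangeFilter:
    """Decides whether a tag value should be transmitted."""

    def __init__(self, deadband=0.0, force_interval_s=3600.0):
        self._deadband = deadband
        self._force_interval_s = force_interval_s
        self._last_value = None
        self._last_sent_at = None

    def should_send(self, value, now):
        """`now` is a monotonic timestamp in seconds supplied by the caller."""
        first = self._last_sent_at is None
        forced = not first and now - self._last_sent_at >= self._force_interval_s
        if first or forced:
            changed = True
        elif self._deadband > 0:
            changed = abs(value - self._last_value) >= self._deadband
        else:
            changed = value != self._last_value
        if changed:
            self._last_value = value
            self._last_sent_at = now
        return changed
```

Comparing against the last transmitted value prevents a slow drift from slipping through unreported: many small steps eventually add up to one deadband crossing.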

Wiring Best Practices

RS-485 wiring errors cause more field failures than any other issue. Follow these rules:

Cable Selection

Use shielded twisted-pair cable (Belden 9841 or equivalent)
Minimum 24 AWG for runs up to 300m, 22 AWG for longer runs
Characteristic impedance should be approximately 120Ω

Topology: Daisy-Chain Only

RS-485 is a bus topology. Every device must be connected in a daisy-chain:

[Master] ---A---[Device 1]---A---[Device 2]---A---[Device 3]
         ---B---            ---B---            ---B---

Never use star topology (home-run wiring from each device back to the master). Star wiring causes signal reflections that corrupt data. If your physical layout requires star wiring, use an RS-485 hub/repeater.

Termination

Place 120Ω termination resistors at both ends of the bus (master and last device). Without termination:

Short runs (<50m at 9600 baud): Usually works without termination
Medium runs (50–300m): Marginal — may work until environmental conditions change
Long runs (>300m): Will not work reliably without termination

Grounding

Connect the cable shield to earth ground at one end only (typically the master end) to avoid ground loops
If devices on the bus have different ground potentials, use isolated RS-485 converters
Always connect a reference ground wire between devices (third conductor)

Routing

Keep RS-485 cables at least 30cm from power cables carrying more than 10A
Cross power cables at 90° when unavoidable
Never route RS-485 in the same conduit as VFD output cables — the PWM noise will destroy signal integrity

Troubleshooting Guide

Symptom: No Communication at All

Verify wiring polarity: A to A, B to B (note: some vendors label these D+ and D-, and the mapping isn't always consistent)
Check baud rate match: Use an oscilloscope to measure the bit width on the wire
Verify slave address: Confirm the device address matches your master configuration
Try a different cable: Eliminate the physical layer first
Disconnect all devices except one: Isolate bus-level problems

Symptom: Intermittent Communication Errors

Check timeouts: Increase response timeout to 200–500 ms
Add delays between requests: Insert a 50 ms delay between consecutive Modbus transactions to give slow devices time to prepare for the next request
Check for electrical noise: Use a scope to look for noise spikes on the A/B lines
Verify termination: Add or adjust 120Ω termination resistors
Check ground connections: Missing reference ground causes common-mode voltage issues

Symptom: Reads Return Wrong Values

Verify byte and word ordering: For 32-bit values, try swapping the high and low registers
Check address offset: Try ±1 on the register address
Verify element count: Confirm you're reading the right number of registers for the data type
Check scaling: Some devices store temperatures as integer × 10 (e.g., 245 = 24.5°C)
Read the device manual: There's no substitute for the manufacturer's register map
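
The word-order and scaling checks above are quick to try in code. The following is a minimal sketch; the function names are illustrative, and which word order or scale factor a given device uses has to come from its manual.

```python
import struct

def decode_u32(regs, word_swap=False):
    """Combine two 16-bit registers into an unsigned 32-bit integer."""
    hi, lo = (regs[1], regs[0]) if word_swap else (regs[0], regs[1])
    return (hi << 16) | lo

def decode_f32(regs, word_swap=False):
    """Interpret two 16-bit registers as an IEEE 754 single-precision float."""
    raw = decode_u32(regs, word_swap)
    return struct.unpack(">f", struct.pack(">I", raw))[0]

def scaled(raw, factor=10):
    """Undo integer scaling, e.g. 245 stored for 24.5 degrees C."""
    return raw / factor
```

If a decoded float looks like garbage, calling the same function with `word_swap=True` is usually the first thing to try.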

Symptom: Communication Fails After Running for Hours

Check for buffer overflows: Ensure your master flushes the serial port receive buffer between transactions
Check SAS token/certificate expiry: If your edge gateway connects upstream via cloud IoT (MQTT/TLS), expired authentication tokens can cascade back to halt local serial polling when the output buffer fills
Monitor connection state: Track whether your Modbus context shows as connected — some serial port drivers silently drop the connection after errors
Implement reconnection logic: When errors like ETIMEDOUT, ECONNRESET, or EBADF occur, close the serial port, wait 1–5 seconds, and re-establish the connection
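
The reconnection step can be sketched as a thin wrapper around the serial port. Everything here is hypothetical: `ReconnectingLink`, `open_port`, and `request` stand in for whatever your Modbus library provides, and the sketch retries only once per transaction.

```python
import errno
import time

RECOVERABLE = {errno.ETIMEDOUT, errno.ECONNRESET, errno.EBADF}

class ReconnectingLink:
    """Reopen the port and retry once after a recoverable OSError."""

    def __init__(self, open_port, delay_s=1.0, sleep=time.sleep):
        self._open_port = open_port   # callable returning an object with close()
        self._delay_s = delay_s       # wait between close and reopen
        self._sleep = sleep
        self._port = None
        self.reconnects = 0

    def transact(self, request):
        """Run request(port) and return its result."""
        if self._port is None:
            self._port = self._open_port()
        try:
            return request(self._port)
        except OSError as exc:
            if exc.errno not in RECOVERABLE:
                raise
            self._port.close()              # drop the broken handle
            self._port = None
            self._sleep(self._delay_s)      # give the line time to settle
            self._port = self._open_port()
            self.reconnects += 1
            return request(self._port)      # a second failure propagates
```

Counting reconnects makes it easy to report link flapping as telemetry instead of hiding it.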

Serial Communication in the Age of IIoT

Modern IIoT platforms like machineCDN bridge the gap between serial-connected devices and cloud-based analytics. The edge gateway handles:

Protocol translation: Reading Modbus RTU over RS-485, batching the data, and transmitting to the cloud over MQTT
Buffering: When the cloud connection drops, data is buffered locally and sent when connectivity resumes
Optimization: Contiguous register grouping, compare-on-change filtering, and configurable batch sizes minimize both serial bus utilization and cloud bandwidth
Link state monitoring: The gateway tracks whether each serial device is responding and reports link-up/link-down events as first-class telemetry — so you know immediately when a PLC goes offline

This layered architecture means your RS-485 serial devices don't need to change. The intelligence lives at the edge, where the gateway handles all the complexity of reliable data delivery to the cloud.

Conclusion

RS-485 serial communication isn't glamorous, but it's the foundation that millions of industrial devices depend on. Getting the link parameters right — baud rate, parity, timeouts, and wiring — is the difference between a system that runs for years without intervention and one that generates daily support tickets.

The key takeaways:

Start conservative with 9600 baud and generous timeouts during commissioning
Match every parameter between master and slave — there are no auto-negotiation features
Group contiguous registers to maximize polling throughput
Handle data types carefully — byte ordering varies by manufacturer
Wire correctly — daisy-chain topology, proper termination, and shielded cable
Implement resilience — reconnection logic, buffering, and link state tracking

RS-485 will be with us for decades to come. Master it, and you can connect to virtually any industrial device on the planet.

Store-and-Forward Buffer Design for Reliable Industrial MQTT Telemetry [2026]

March 2, 2026 · 12 min read

Your edge gateway just collected 200 data points from six machines. The MQTT connection to the cloud dropped 47 seconds ago. What happens to that data?

In consumer IoT, the answer is usually "it gets dropped." In industrial IoT, that answer gets you fired. A single missed alarm delivery can mean a $50,000 chiller compressor failure. A gap in temperature logging can invalidate an entire production batch for FDA compliance.

The solution is a store-and-forward buffer — a memory structure that sits between your data collection layer and your MQTT transport, holding telemetry data during disconnections and draining it the moment connectivity returns. It sounds simple. The engineering details are anything but.

This article walks through the design of a production-grade store-and-forward buffer for resource-constrained edge gateways running on embedded Linux.

[Figure: Store-and-forward buffer architecture for MQTT telemetry]

Why MQTT QoS Isn't Enough

The first objection is always: "MQTT already has QoS 1 and QoS 2 — doesn't the broker handle retransmission?"

Technically yes, but only for messages that have already been handed to the MQTT client library. The problem is what happens before the publish call:

The TCP connection is down. mosquitto_publish() returns MOSQ_ERR_NO_CONN. Your data is gone unless you stored it somewhere.
The MQTT library's internal buffer is full. Most MQTT client libraries have a finite send queue. When it fills, new publishes get rejected.
The gateway rebooted. Any data in memory is lost. Only data written to persistent storage survives.

QoS handles message delivery within an established session. Store-and-forward handles data persistence across disconnections, reconnections, and reboots.

The Page-Based Buffer Architecture

A production buffer uses a paged memory pool — a contiguous block of memory divided into fixed-size pages that cycle through three states:

┌─────────────────────────────────────────────────────┐
│                  Buffer Memory Pool                  │
│                                                      │
│  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐  │
│  │Page 0│  │Page 1│  │Page 2│  │Page 3│  │Page 4│  │
│  │ FREE │  │ USED │  │ USED │  │ WORK │  │ FREE │  │
│  └──────┘  └──────┘  └──────┘  └──────┘  └──────┘  │
│                                                      │
│  FREE = empty, available for writing                 │
│  WORK = currently being filled with incoming data    │
│  USED = full, queued for delivery to MQTT broker     │
└─────────────────────────────────────────────────────┘

Page States

FREE pages form a linked list of available pages. When the buffer needs a new work page, it pulls from the free list.
WORK page is the single page currently accepting incoming data. New telemetry batches get appended here. There is always at most one work page.
USED pages form an ordered queue of pages waiting to be delivered. The buffer sends data from the head of the used queue, one message at a time.

Page Structure

Each page contains multiple messages, packed sequentially:

┌─────────────────────────────────────────────┐
│                    Page N                     │
│                                               │
│  ┌──────────┬──────────┬──────────────────┐  │
│  │ msg_id   │ msg_size │ message_data     │  │
│  │ (4 bytes)│ (4 bytes)│ (variable)       │  │
│  ├──────────┼──────────┼──────────────────┤  │
│  │ msg_id   │ msg_size │ message_data     │  │
│  │ (4 bytes)│ (4 bytes)│ (variable)       │  │
│  ├──────────┼──────────┼──────────────────┤  │
│  │          ... more messages ...          │  │
│  └──────────────────────────────────────────┘  │
│                                               │
│  write_p ──→ next write position              │
│  read_p  ──→ next read position (delivery)    │
│                                               │
└─────────────────────────────────────────────┘

The msg_id field is critical — it gets filled in by the MQTT library's publish() call, which returns a packet ID. When the broker acknowledges delivery (via the PUBACK callback in QoS 1), the buffer matches the acknowledged ID against the head of the delivery queue.
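
The page layout can be sketched with Python's `struct` module. This is an illustration under assumptions: the article does not state the byte order of the header fields, so little-endian is assumed here, and the function names are made up for the example.

```python
import struct

HEADER = struct.Struct("<II")   # msg_id, msg_size (4 bytes each; byte order assumed)

def append_message(page, write_p, payload):
    """Write [msg_id=0][msg_size][payload] at write_p and return the new write_p."""
    end = write_p + HEADER.size + len(payload)
    if end > len(page):
        raise ValueError("message does not fit in page")
    HEADER.pack_into(page, write_p, 0, len(payload))
    page[write_p + HEADER.size:end] = payload
    return end

def set_msg_id(page, read_p, msg_id):
    """Record the packet ID assigned at publish time for the message at read_p."""
    struct.pack_into("<I", page, read_p, msg_id)

def read_message(page, read_p):
    """Return (msg_id, payload, next_read_p) for the message at read_p."""
    msg_id, size = HEADER.unpack_from(page, read_p)
    start = read_p + HEADER.size
    return msg_id, bytes(page[start:start + size]), start + size
```

A `bytearray` of the page size plays the role of one page; `write_p` and `read_p` are plain integer offsets into it.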

Memory Sizing

The minimum viable buffer needs at least three pages:

One page being filled (WORK)
One page being transmitted (USED, head of queue)
One page available for the next batch (FREE)

In practice, you want more headroom. The formula:

pages_needed = desired_holdover_time / batch_interval
buffer_size  = pages_needed × page_size

Example (assuming each batch fills roughly one page):
- Page size: 32 KB
- Batch interval: 30 seconds
- Desired holdover: 10 minutes
- Pages needed: 600 s / 30 s = 20 pages
- Buffer size: 20 × 32 KB = 640 KB

On a typical embedded Linux gateway with 256MB–512MB RAM, dedicating 1–4 MB to the telemetry buffer is reasonable.
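
The sizing arithmetic is simple enough to keep as a helper. In this sketch the `batches_per_page` parameter is an added assumption (the worked example implies one batch per page), and `min_pages` encodes the three-page minimum.

```python
import math

def pages_needed(holdover_s, batch_interval_s, batches_per_page=1, min_pages=3):
    """Pages required to hold every batch produced during the holdover window."""
    batches = math.ceil(holdover_s / batch_interval_s)
    return max(min_pages, math.ceil(batches / batches_per_page))

def buffer_bytes(page_size, holdover_s, batch_interval_s, batches_per_page=1):
    """Total buffer memory in bytes for the given page size and holdover."""
    return page_size * pages_needed(holdover_s, batch_interval_s, batches_per_page)
```

For the example above, `pages_needed(600, 30)` gives 20 pages and `buffer_bytes(32 * 1024, 600, 30)` gives 655,360 bytes (640 KB).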

The Write Path: Accepting Incoming Data

When the data collection layer finishes a polling cycle and has a batch of tag values ready to deliver, it calls into the buffer:

Step 1: Check the Work Page

If no work page exists, allocate one from the free list. If the free list is empty, steal the oldest used page — this is the overflow strategy (more on this below).

Step 2: Size Check

Before writing, verify that the message (plus its 8-byte header) fits in the remaining space on the work page:

remaining = page_size - (write_p - start_p)
needed = 4 (msg_id) + 4 (msg_size) + payload_size

if needed > remaining:
    move work_page to used_pages queue
    allocate a new work page
    retry

Step 3: Write the Message

Write 4 zero bytes at write_p    (placeholder for msg_id)
Write message size as uint32     (4 bytes)
Write message payload            (N bytes)
Advance write_p by 8 + N

The msg_id is initially zero because we don't know it yet — it gets assigned when the message is actually published to MQTT.

Step 4: Trigger Delivery

After every write, the buffer checks if it can send data. If the connection is up and no message is currently awaiting acknowledgment, it initiates delivery of the next queued message.
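
Steps 1 to 3 of the write path can be sketched as follows. This is a simplified model rather than a reference implementation: the `Page` and `WriteBuffer` names are illustrative, the header byte order is assumed little-endian, delivery (Step 4) and locking are left out, and overflow uses the drop-oldest strategy.

```python
import struct
from collections import deque

class Page:
    def __init__(self, size):
        self.data = bytearray(size)
        self.write_p = 0
        self.read_p = 0

    def reset(self):
        self.write_p = self.read_p = 0

class WriteBuffer:
    HEADER = 8  # 4-byte msg_id + 4-byte msg_size

    def __init__(self, page_size, page_count):
        self.page_size = page_size
        self.free = deque(Page(page_size) for _ in range(page_count))
        self.used = deque()
        self.work = None
        self.overflow_count = 0

    def _new_work_page(self):
        if self.free:
            page = self.free.popleft()
        else:
            page = self.used.popleft()   # overflow: evict the oldest queued page
            self.overflow_count += 1
        page.reset()
        self.work = page

    def add(self, payload):
        needed = self.HEADER + len(payload)
        if needed > self.page_size:
            raise ValueError("payload larger than a page")
        if self.work is None:                        # Step 1: ensure a work page
            self._new_work_page()
        if needed > self.page_size - self.work.write_p:
            self.used.append(self.work)              # Step 2: page full, queue it
            self._new_work_page()
        p = self.work                                # Step 3: header, then payload
        struct.pack_into("<II", p.data, p.write_p, 0, len(payload))
        p.data[p.write_p + self.HEADER:p.write_p + needed] = payload
        p.write_p += needed
```

Rejecting payloads larger than a page up front keeps the retry in Step 2 from looping forever.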

The Read Path: Delivering to MQTT

Delivery follows a strict one-message-at-a-time discipline. The buffer maintains a packet_sent flag:

if connected == false:  return
if packet_sent == true: return    (waiting for PUBACK)

message = used_pages[0].read_p
result = mqtt_publish(message.data, message.size, &message.msg_id)

if result == success:
    packet_sent = true
else:
    packet_sent = false           (retry on next opportunity)

Why One at a Time?

Sending multiple messages without waiting for acknowledgment is tempting — it would be faster. But it creates a delivery ordering problem. If messages 1, 2, and 3 are sent simultaneously and message 2's PUBACK arrives first, you don't know whether messages 1 and 3 were delivered. With one-at-a-time, the delivery order is guaranteed to match the insertion order.

For higher throughput, some implementations pipeline 2–3 messages and track a small window of in-flight packet IDs. But for industrial telemetry where data integrity matters more than latency, sequential delivery is the safer choice.

The Delivery Confirmation Callback

When the MQTT library's on_publish callback fires with a packet ID:

Lock the buffer mutex
Check that the packet_id matches used_pages[0].read_p.msg_id
Advance read_p past the delivered message
If read_p >= write_p:
     - Page completely delivered
     - Move page from used_pages to free_pages
     - Reset the page's write_p and read_p
Set packet_sent = false
Attempt to send the next message
Unlock mutex

This is where the msg_id field in the page pays off — it's the correlation key between "we published this" and "the broker confirmed this."
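
The confirmation steps can be sketched as a single function over the used and free queues. The names are illustrative, the header byte order is assumed little-endian, and the mutex is omitted: in a real gateway the caller would hold the buffer lock around this call.

```python
import struct
from collections import deque

HEADER = struct.Struct("<II")  # msg_id, msg_size

class Page:
    def __init__(self, size):
        self.data = bytearray(size)
        self.write_p = 0
        self.read_p = 0

def confirm_delivery(used, free, packet_id):
    """Handle one PUBACK. Returns True if the head message was confirmed."""
    if not used:
        return False
    page = used[0]
    msg_id, size = HEADER.unpack_from(page.data, page.read_p)
    if msg_id != packet_id:
        return False                       # not the message we are waiting on
    page.read_p += HEADER.size + size      # skip past the delivered message
    if page.read_p >= page.write_p:        # page fully delivered: recycle it
        used.popleft()
        page.read_p = page.write_p = 0
        free.append(page)
    return True
```

Ignoring an acknowledgment whose ID does not match the head of the queue is a conservative choice: the message stays queued and is retried.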

Overflow Handling: When Memory Runs Out

On a constrained device, the buffer will eventually fill up during an extended outage. The question is: what do you sacrifice?

Strategy 1: Drop Newest (Tail Drop)

When the free list is empty, reject new writes. The data collection layer simply loses the current batch. This preserves historical data but creates gaps at the end of the outage.

Strategy 2: Drop Oldest (FIFO Eviction)

When the free list is empty, steal the oldest used page — the one at the head of the delivery queue. This preserves the most recent data but creates gaps at the beginning of the outage.

Which to Choose?

For industrial monitoring, drop-oldest is almost always correct. The reasoning:

During a long outage, the most recent data is more actionable than data from 20 minutes ago.
When connectivity returns, operators want to see current machine state, not historical state from the beginning of the outage.
Historical data from the outage period can often be reconstructed from PLC internal logs after the fact.

A production implementation logs a warning when it evicts a page:

Buffer: Overflow warning! Extracted USED page (#7)

This warning should be forwarded to the platform's monitoring layer so operators know data was lost.
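
The difference between the two strategies is easiest to see on a toy queue. This sketch models pages as plain items in a bounded `deque`; the `fill` helper is purely illustrative.

```python
from collections import deque

def fill(capacity, items, drop="oldest"):
    """Push items into a bounded queue; return (kept, dropped)."""
    queue = deque()
    dropped = []
    for item in items:
        if len(queue) < capacity:
            queue.append(item)
        elif drop == "oldest":
            dropped.append(queue.popleft())   # evict the head, keep the newcomer
            queue.append(item)
        else:
            dropped.append(item)              # reject the newcomer
    return list(queue), dropped
```

With a capacity of 3 and items 0 to 4, drop-oldest keeps `[2, 3, 4]` and drop-newest keeps `[0, 1, 2]`. The length of the `dropped` list is the number you would surface as an overflow counter.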

Thread Safety

The buffer is accessed from two threads:

The polling thread — calls buffer_add_data() after each collection cycle
The MQTT callback thread — calls buffer_process_data_delivered() when PUBACKs arrive

A mutex protects all buffer operations:

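// Sketch in C with POSIX threads (needs <pthread.h> and <stddef.h>).
// buffer_t and the helper functions are assumed to be defined elsewhere.
void buffer_add_data(buffer_t *buffer, const void *data, size_t size) {
    pthread_mutex_lock(&buffer->mutex);
    write_data_to_work_page(buffer, data, size);
    try_send_next_message(buffer);   // start delivery if nothing is in flight
    pthread_mutex_unlock(&buffer->mutex);
}

void buffer_process_data_delivered(buffer_t *buffer, int packet_id) {
    pthread_mutex_lock(&buffer->mutex);
    advance_read_pointer(buffer, packet_id);
    try_send_next_message(buffer);   // keep draining the queue
    pthread_mutex_unlock(&buffer->mutex);
}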

The key insight: try_send_next_message() is called from both code paths. After adding data, the buffer checks if it can immediately begin delivery. After confirming delivery, it checks if there's more data to send. This creates a self-draining pipeline that doesn't need a separate timer or polling loop.

Connection State Management

The buffer tracks connectivity through two callbacks:

On Connect

buffer->connected = true
try_send_next_message(buffer)    // Start draining the queue

On Disconnect

buffer->connected = false
buffer->packet_sent = false      // Reset in-flight tracking

The packet_sent = false on disconnect is critical. If a message was in flight when the connection dropped, we have no way of knowing whether the broker received it. Setting packet_sent = false means the message will be re-sent on reconnection. This may result in duplicate delivery — which is fine. Industrial telemetry systems should be idempotent anyway (a repeated temperature reading at timestamp T is the same as the original).
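
The interaction between the `packet_sent` flag, the connection callbacks, and the mutex can be sketched in a few lines. This model is deliberately simplified: it queues whole messages instead of pages, `StopAndWaitSender` is a made-up name, and `publish` is a hypothetical callable that returns a packet ID (or `None` on failure) and must not invoke `on_puback` synchronously, since the lock is not reentrant.

```python
import threading
from collections import deque

class StopAndWaitSender:
    def __init__(self, publish):
        self._publish = publish        # callable(payload) -> packet id or None
        self._queue = deque()
        self._lock = threading.Lock()
        self.connected = False
        self.packet_sent = False
        self._inflight_id = None

    def _try_send(self):               # caller must hold the lock
        if not self.connected or self.packet_sent or not self._queue:
            return
        mid = self._publish(self._queue[0])
        if mid is not None:
            self._inflight_id = mid
            self.packet_sent = True

    def add(self, payload):
        with self._lock:
            self._queue.append(payload)
            self._try_send()

    def on_connect(self):
        with self._lock:
            self.connected = True
            self._try_send()           # start draining the queue

    def on_disconnect(self):
        with self._lock:
            self.connected = False
            self.packet_sent = False   # in-flight message will be re-sent
            self._inflight_id = None

    def on_puback(self, packet_id):
        with self._lock:
            if not self.packet_sent or packet_id != self._inflight_id:
                return                 # stale or unexpected ack: ignore
            self._queue.popleft()
            self.packet_sent = False
            self._try_send()
```

A late acknowledgment for a publish attempt made before a reconnect is ignored here, so the message is delivered twice rather than lost, which matches the at-least-once behavior described above.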

Batch Finalization: When to Flush

Data arrives at the buffer through a batch layer that groups multiple tag values before serialization. The batch finalizes (and writes to the buffer) on two conditions:

1. Size Limit

When the accumulated batch exceeds a configured maximum size (for example, 32 KB for JSON, or 90% of the maximum payload size for binary), the batch is serialized and written to the buffer immediately:

if current_batch_size > max_batch_size:
    finalize_and_write_to_buffer(batch)
    reset_batch()

2. Time Limit

When the time since the batch started collecting exceeds a configured timeout (e.g., 30 seconds), the batch is finalized regardless of size:

elapsed = now - batch_start_time
if elapsed > max_batch_time:
    finalize_and_write_to_buffer(batch)
    reset_batch()

The time-based trigger is checked at the end of each tag group within a polling cycle, not on a separate timer. This avoids adding another thread and ensures the batch is finalized at a natural boundary in the data stream.
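
Both triggers fit in one small class. This is a sketch under assumptions: `Batcher`, `sink`, and `end_of_group` are illustrative names, and the caller is expected to pass each item's serialized size.

```python
import time

class Batcher:
    def __init__(self, sink, max_bytes=32 * 1024, max_age_s=30.0, clock=time.monotonic):
        self._sink = sink              # callable(list_of_items), e.g. serialize and buffer
        self._max_bytes = max_bytes
        self._max_age_s = max_age_s
        self._clock = clock
        self._items = []
        self._size = 0
        self._started = None

    def add(self, item, size):
        if self._started is None:
            self._started = self._clock()   # the batch clock starts with its first item
        self._items.append(item)
        self._size += size
        if self._size > self._max_bytes:    # size trigger
            self.flush()

    def end_of_group(self):
        """Call at the end of each tag group; flushes if the batch is too old."""
        if self._started is not None and self._clock() - self._started > self._max_age_s:
            self.flush()                    # time trigger, no extra thread needed

    def flush(self):
        if self._items:
            self._sink(self._items)
        self._items, self._size, self._started = [], 0, None
```

Injecting the clock keeps the time trigger testable without sleeping.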

Binary vs. JSON Serialization

Production edge systems typically support two serialization formats:

JSON Format

{
  "groups": [
    {
      "ts": 1709341200,
      "device_type": 1018,
      "serial_number": 12345,
      "values": [
        {"id": 1, "values": [452]},
        {"id": 2, "values": [38]},
        {"id": 162, "error": -5}
      ]
    }
  ]
}

JSON is human-readable and easy to debug but verbose. A batch of 25 tag values in JSON might be 800 bytes.

Binary Format

0xF7              Command byte
[4B] num_groups   Number of timestamp groups
  [4B] timestamp  Unix timestamp
  [2B] dev_type   Device type ID
  [4B] serial     Device serial number
  [4B] num_values Number of values in group
    [2B] tag_id   Tag identifier
    [1B] status   0x00=OK, other=error
    [1B] count    Array size
    [1B] elem_sz  Element size (1, 2, or 4 bytes)
    [N×S bytes]   Packed values (MSB first)

The same 25 tag values in binary format might be 180 bytes — a 4.4× reduction. On cellular connections where bandwidth is metered per megabyte, this matters enormously.
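
To make the layout concrete, here is a sketch of an encoder for the binary format using `struct`. It rests on several assumptions: all values are unsigned and share one element size, header fields are big-endian like the packed values, and error entries are left out because the layout above does not say what follows a non-zero status byte.

```python
import json
import struct

def encode_binary(groups, elem_sz=2):
    """Encode groups (same shape as the JSON example) into the binary layout."""
    fmt = {1: "B", 2: "H", 4: "I"}[elem_sz]
    out = bytearray([0xF7])                          # command byte
    out += struct.pack(">I", len(groups))            # num_groups
    for g in groups:
        out += struct.pack(">IHII", g["ts"], g["device_type"],
                           g["serial_number"], len(g["values"]))
        for v in g["values"]:
            vals = v["values"]
            out += struct.pack(">HBBB", v["id"], 0x00, len(vals), elem_sz)
            out += struct.pack(f">{len(vals)}{fmt}", *vals)   # MSB first
    return bytes(out)

groups = [{"ts": 1709341200, "device_type": 1018, "serial_number": 12345,
           "values": [{"id": 1, "values": [452]}, {"id": 2, "values": [38]}]}]
blob = encode_binary(groups)
json_len = len(json.dumps({"groups": groups}))
```

Comparing `len(blob)` with `json_len` for your own tag set gives a more reliable compression figure than any rule of thumb.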

The format choice is configured per device. Many deployments use binary for production and JSON for commissioning/debugging.

Monitoring the Buffer

A healthy buffer should have these characteristics:

Pages cycling regularly — pages move from FREE → WORK → USED → FREE in a steady rhythm
No overflow warnings — if you see "extracted USED page" in the logs, the buffer is undersized or the connection is too unreliable
Delivery timestamps advancing — track the timestamp of the last confirmed delivery. If it stops advancing while data is being collected, something is wrong with the MQTT connection

The edge daemon should publish buffer health as part of its periodic status message:

{
  "buffer": {
    "total_pages": 20,
    "free_pages": 14,
    "used_pages": 5,
    "work_pages": 1,
    "last_delivery_ts": 1709341200,
    "overflow_count": 0
  }
}
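
On the receiving side, the three health criteria can be checked mechanically. This sketch assumes the status shape shown above; the `buffer_alerts` name and the 120-second lag threshold are illustrative choices, not recommendations.

```python
def buffer_alerts(status, now, max_delivery_lag_s=120):
    """Return a list of human-readable problems found in a buffer status message."""
    b = status["buffer"]
    alerts = []
    if b["free_pages"] + b["used_pages"] + b["work_pages"] != b["total_pages"]:
        alerts.append("page accounting mismatch")
    if b["overflow_count"] > 0:
        alerts.append("overflow: data was evicted")
    if now - b["last_delivery_ts"] > max_delivery_lag_s:
        alerts.append("delivery stalled")
    return alerts
```

An empty list means the buffer looks healthy at the time of the report.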

How machineCDN Implements Store-and-Forward

machineCDN's edge gateway implements the full page-based buffer architecture described in this article. The buffer sits between the batch serialization layer and the MQTT transport, providing:

Automatic page management — the gateway sizes the buffer based on available memory and configured batch parameters
Drop-oldest overflow — during extended outages, the most recent data is always preserved
Dual-format support — JSON for commissioning, binary for production deployments, configurable per device
Connection-aware delivery — the buffer begins draining immediately when the MQTT connection comes back up, with sequential delivery confirmation via QoS 1 PUBACKs

For multi-machine deployments on cellular gateways, the binary format combined with batch-and-forward typically reduces bandwidth consumption by 70–80% compared to per-tag JSON publishing — which translates directly to lower cellular data costs.

Key Takeaways

MQTT QoS doesn't replace store-and-forward. QoS handles delivery within a session. Store-and-forward handles persistence across disconnections.
Use a paged memory pool. Fixed-size pages with three states (FREE/WORK/USED) give you predictable memory usage and simple overflow handling.
One message at a time for delivery integrity. Sequential delivery with PUBACK confirmation guarantees ordering and makes the system easy to reason about.
Drop oldest on overflow. In industrial monitoring, recent data is more valuable than historical data from the beginning of an outage.
Finalize batches on both size and time. Size limits prevent memory bloat; time limits prevent stale data sitting in an incomplete batch.
Thread safety is non-negotiable. The polling thread and MQTT callback thread both touch the buffer. A mutex with minimal critical sections keeps things safe without impacting throughput.

The store-and-forward buffer is the unsung hero of reliable industrial telemetry. It's not glamorous, it doesn't show up in marketing slides, but it's the component that determines whether your IIoT platform loses data at 2 AM on a Saturday when the cell tower goes down — or quietly holds everything until the connection comes back and delivers it all without anyone ever knowing there was a problem.

Edge Computing Architecture for IIoT: Store-and-Forward, Batch Processing, and Bandwidth Optimization [2026]

February 28, 2026 · 14 min read

MachineCDN Team

Industrial IoT Experts

Here's an uncomfortable truth about industrial IoT: your cloud platform is only as reliable as the worst cellular connection on your factory floor.

And in manufacturing environments — where concrete walls, metal enclosures, and electrical noise are the norm — that connection can drop for minutes, hours, or days. If your edge architecture doesn't account for this, you're not building an IIoT system. You're building a fair-weather dashboard that goes dark exactly when you need it most.

This guide covers the architecture patterns that separate production-grade edge gateways from science projects: store-and-forward buffering, intelligent batch processing, binary serialization, and the MQTT reliability patterns that actually work when deployed on a $200 industrial router with 256MB of RAM.

Modbus TCP vs RTU: A Practical Guide for Plant Engineers [2026]

February 28, 2026 · 14 min read

MachineCDN Team

Industrial IoT Experts

Modbus TCP vs RTU

Modbus has been the lingua franca of industrial automation for over four decades. Despite the rise of OPC-UA, MQTT, and EtherNet/IP, Modbus remains the most widely deployed protocol on factory floors worldwide. If you're connecting PLCs, chillers, temperature controllers, or blenders to any kind of monitoring or cloud platform, you will encounter Modbus — guaranteed.

But Modbus comes in two flavors that behave very differently at the wire level: Modbus RTU (serial) and Modbus TCP (Ethernet). Choosing the wrong one — or misconfiguring either — is the single most common source of data collection failures in IIoT deployments.

This guide covers the real differences that matter when you're wiring up a plant, not textbook definitions.

Getting Started with IIoT: The Complete Beginner's Guide for Manufacturers

January 24, 2026 · 11 min read

MachineCDN Team

Industrial IoT Experts

Industrial IoT sounds complicated. The reality is simpler than most vendors make it appear. At its core, IIoT is about connecting your factory equipment to the internet so you can see what's happening — in real time, from anywhere, with data you can actually use to make better decisions.

If you're a plant manager, maintenance engineer, or operations leader who's been hearing about IIoT but hasn't started yet, this guide is for you. No jargon walls, no PhD-level concepts. Just the practical foundation you need to go from "I should probably look into this" to "we have our first machines connected and delivering value."

