6 posts tagged with "architecture"


Best MQTT Broker for Industrial IoT in 2026: Choosing the Right Message Broker for Manufacturing

· 9 min read
MachineCDN Team
Industrial IoT Experts

MQTT has become the dominant messaging protocol for industrial IoT, and for good reason: it's lightweight, handles unreliable networks gracefully, and scales from a single sensor to millions of devices. But choosing the right MQTT broker for manufacturing is a different problem than choosing one for consumer IoT. Factory floor data has different latency requirements, reliability expectations, and security constraints than smart home sensors or fleet telemetry.

Edge Gateway Lifecycle Architecture: From Boot to Steady-State Telemetry in Industrial IoT [2026]

· 14 min read

Most IIoT content treats the edge gateway as a black box: PLC data goes in, cloud data comes out. That's fine for a sales deck. It's useless for the engineer who needs to understand why their gateway loses data during a network flap, or why configuration changes require a full restart, or why it takes 90 seconds after boot before the first telemetry packet reaches the cloud.

This article breaks down the complete lifecycle of a production industrial edge gateway — from the moment it powers on to steady-state telemetry delivery, including every decision point, failure mode, and recovery mechanism in between. These patterns are drawn from real-world gateways running on resource-constrained hardware (64MB RAM, MIPS processors) in plastics manufacturing plants, monitoring TCUs, chillers, blenders, and dryers 24/7.

Phase 1: Boot and Configuration Load

When a gateway boots (or restarts after a configuration change), the first task is loading its configuration. In production deployments, there are typically two configuration layers:

The Daemon Configuration

This is the central configuration that defines what equipment to talk to:

{
  "plc": {
    "ip": "192.168.5.5",
    "modbus_tcp_port": 502
  },
  "serial_device": {
    "port": "/dev/rs232",
    "baud": 9600,
    "parity": "none",
    "data_bits": 8,
    "stop_bits": 1,
    "byte_timeout_ms": 4,
    "response_timeout_ms": 100
  },
  "batch_size": 4000,
  "batch_timeout_sec": 60,
  "startup_delay_sec": 30
}

The startup delay is a critical design choice. When a gateway boots simultaneously with the PLCs it monitors (common after a power outage), the PLCs may need 10-30 seconds to initialize their communication stacks. If the gateway immediately tries to connect, it fails, marks the PLC as unreachable, and enters a slow retry loop. A 30-second startup delay avoids this race condition.

The serial link parameters (baud, parity, data bits, stop bits) must match the PLC exactly. A mismatch here produces zero error feedback — you just get silence. The byte timeout (time between consecutive bytes) and response timeout (time to wait for a complete response) are tuned per equipment type. TCUs with slower processors may need 100ms+ response timeouts; modern PLCs respond in 10-20ms.

The Device Configuration Files

Each equipment type gets its own configuration file that defines which registers to read, what data types to expect, and how often to poll. These files are loaded dynamically based on the device type detected during the discovery phase.

A real device configuration for a batch blender might define 40+ tags, each with:

  • A unique tag ID (1-32767)
  • The Modbus register address or EtherNet/IP tag name
  • Data type (bool, int8, uint8, int16, uint16, int32, uint32, float)
  • Element count (1 for scalars, 2+ for arrays or multi-register values)
  • Poll interval in seconds
  • Whether to compare with previous value (change-based delivery)
  • Whether to send immediately or batch with other values

Hot-reload capability is essential for production systems. The gateway should monitor configuration file timestamps and automatically detect changes. When a configuration file is modified (pushed via MQTT from the cloud, or copied via SSH during maintenance), the gateway reloads it without requiring a full restart. This means configuration updates can be deployed remotely to gateways in the field without disrupting data collection.
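The timestamp-based detection described above can be sketched in a few lines. This is illustrative Python (a production gateway on embedded hardware would do the same with stat() in C); the class and method names are my own:

```python
import os

class ConfigWatcher:
    """Detects configuration file changes by comparing st_mtime snapshots."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.mtimes = {p: self._mtime(p) for p in self.paths}

    @staticmethod
    def _mtime(path):
        try:
            return os.stat(path).st_mtime
        except FileNotFoundError:
            return None  # missing file: treated as "no config yet"

    def changed(self):
        """Return the files whose mtime differs since the last check."""
        modified = []
        for p in self.paths:
            current = self._mtime(p)
            if current != self.mtimes[p]:
                self.mtimes[p] = current
                modified.append(p)
        return modified
```

The main loop would call changed() once per second and reload only the files it returns, so a config pushed over MQTT or copied via SSH takes effect without a restart.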

Phase 2: Device Detection

After configuration loads successfully, the gateway enters the device detection phase. This is where protocol-level intelligence matters.

Multi-Protocol Discovery

A well-designed gateway doesn't assume which protocol the PLC speaks. Instead, it tries multiple protocols in order of preference:

Step 1: Try EtherNet/IP

The gateway sends a CIP (Common Industrial Protocol) request to the configured IP address, attempting to read a device_type tag. EtherNet/IP uses the ab-eip protocol with a micro800 CPU profile (for Allen-Bradley Micro8xx series). If the PLC responds with a valid device type, the gateway knows this is an EtherNet/IP device.

Connection path: protocol=ab-eip, gateway=192.168.5.5, cpu=micro800
Target tag: device_type (uint16)
Timeout: 2000ms

Step 2: Fall back to Modbus TCP

If EtherNet/IP fails (error code -32 = "no connection"), the gateway tries Modbus TCP on port 502. It reads input register 800 (address 300800) which, by convention, stores the device type identifier.

Function code: 4 (Read Input Registers)
Register: 800
Count: 1
Expected: uint16 device type code

Step 3: Serial detection for Modbus RTU

If TCP protocols fail, the gateway probes the serial port for Modbus RTU devices. RTU detection is trickier because there's no auto-discovery mechanism — you must know the slave address. Production gateways typically configure a default address (slave ID 1) and attempt a read.
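The three-step cascade above reduces to trying a list of probes in preference order. A sketch, with the probe functions as illustrative stand-ins rather than a real driver API:

```python
def detect_device(probes):
    """Try protocol probes in preference order; return (protocol, device_type).

    `probes` is an ordered list of (name, probe_fn) pairs. Each probe_fn
    returns a device-type code on success or raises ConnectionError on
    failure (e.g. a CIP "no connection" error, or a Modbus timeout).
    """
    for name, probe in probes:
        try:
            return name, probe()
        except ConnectionError:
            continue  # fall through to the next protocol
    return None, None  # nothing answered; keep monitoring other ports


# Stand-ins for illustration; a real EtherNet/IP probe would read the
# device_type tag via CIP, and a real Modbus probe would issue function
# code 4 on register 800.
def probe_ethernet_ip():
    raise ConnectionError("no connection (-32)")

def probe_modbus_tcp():
    return 1010  # device-type code read from input register 800
```

With the EtherNet/IP probe failing, detection falls through and returns ("modbus-tcp", 1010).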

Serial Number Extraction

After identifying the device type, the gateway reads the equipment's serial number. This is critical for fleet management — each physical machine needs a unique identifier for cloud-side tracking.

Different equipment types store serial numbers in different registers:

| Equipment Type   | Protocol    | Month Register | Year Register | Unit Register |
|------------------|-------------|----------------|---------------|---------------|
| Portable Chiller | Modbus TCP  | Input 22       | Input 23      | Input 24      |
| Central Chiller  | Modbus TCP  | Holding 520    | Holding 510   | Holding 500   |
| TCU              | Modbus RTU  | EtherNet/IP    | EtherNet/IP   | EtherNet/IP   |
| Batch Blender    | EtherNet/IP | CIP tag        | CIP tag       | CIP tag       |

The serial number is packed into a 32-bit value:

Byte 3: Year  (0x40=2010, 0x41=2011, ...)
Byte 2: Month (0x00=Jan, 0x01=Feb, ...)
Bytes 0-1: Unit number (sequential)

Example: 0x40000050 = January 2010, unit #80

Fallback serial generation: If the PLC doesn't have a programmed serial number (common with newly installed equipment), the gateway generates one using the router's serial number as a seed, with a prefix byte distinguishing PLCs (0x7F) from TCUs (0x7E). This ensures every device in the fleet has a unique identifier even before the serial number is programmed.
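Given the byte layout above (year code 0x40 = 2010, month 0x00 = January, low two bytes the unit number), packing and unpacking is straightforward bit arithmetic. An illustrative Python sketch:

```python
def pack_serial(year, month, unit):
    """Pack (year, month, unit) into the 32-bit serial layout:
    byte 3 = year code (0x40 = 2010), byte 2 = month (0x00 = Jan),
    bytes 0-1 = sequential unit number."""
    year_code = 0x40 + (year - 2010)
    month_code = month - 1
    return (year_code << 24) | (month_code << 16) | (unit & 0xFFFF)

def unpack_serial(value):
    """Invert pack_serial: recover (year, month, unit)."""
    year = 2010 + ((value >> 24) & 0xFF) - 0x40
    month = ((value >> 16) & 0xFF) + 1
    unit = value & 0xFFFF
    return year, month, unit
```

So January 2010, unit #80 packs to 0x40000050, and unpacking any serial recovers the original triple.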

Configuration Loading by Device Type

Once the device type is known, the gateway searches for a matching configuration file. If type 1010 is detected, it loads the batch blender configuration. If type 5000, it loads the TCU configuration. If no matching configuration exists, the gateway logs an error and continues monitoring other ports.

This pattern — detect → identify → configure — means a single gateway binary handles dozens of equipment types. Adding support for a new machine is a configuration file change, not a firmware update.

Phase 3: Cloud Connection via MQTT

With devices detected and configured, the gateway establishes its cloud connection via MQTT.

Connection Architecture

Production IIoT gateways use MQTT 3.1.1 over TLS (port 8883) for cloud connectivity. The connection setup involves:

  1. Certificate verification — the gateway validates the cloud broker's certificate against a CA root cert stored locally
  2. SAS token authentication — using a device-specific Shared Access Signature that encodes the hostname, device ID, and expiration timestamp
  3. Topic subscription — after connecting, the gateway subscribes to its command topic for receiving configuration updates and control commands from the cloud

Publish topic:   devices/{deviceId}/messages/events/
Subscribe topic: devices/{deviceId}/messages/devicebound/#
QoS: 1 (at least once delivery)

QoS 1 is the standard choice for industrial telemetry — it guarantees message delivery while avoiding the overhead and complexity of QoS 2 (exactly once). Since the data pipeline is designed to handle duplicates (via timestamp deduplication at the cloud layer), QoS 1 provides the right balance of reliability and performance.

The Async Connection Thread

Establishing an MQTT connection can take 5-30 seconds depending on network conditions, DNS resolution, and TLS handshake time. A naive implementation blocks the main loop during connection, which means no PLC data is read during this time.

The solution: run mosquitto_connect_async() in a separate thread. The main loop continues reading PLC tags and buffering data while the MQTT connection establishes in the background. Once the connection callback fires, buffered data starts flowing to the cloud.

This is implemented using a semaphore-based producer-consumer pattern:

  1. Main thread prepares connection parameters and posts to a semaphore
  2. Connection thread wakes up, calls connect_async(), and signals completion
  3. Main thread checks semaphore state before attempting reconnection (prevents double-connect)

Connection Watchdog

Network connections fail. Cell modems lose signal. Cloud brokers restart. A production gateway needs a watchdog that detects stale connections and forces reconnection.

The watchdog pattern:

Every 120 seconds:
  1. Check: have we received ANY confirmation from the broker?
     (delivery ACK, PUBACK, SUBACK — anything)
  2. If yes → connection is healthy, reset watchdog timer
  3. If no  → connection is stale. Destroy MQTT client and reinitiate.

The 120-second timeout is tuned for cellular networks where intermittent connectivity is expected. On wired Ethernet, you could reduce this to 30-60 seconds. The key insight: don't just check "is the TCP socket open?" — check "has the broker confirmed any data delivery recently?" A half-open socket can persist for hours without either side knowing.
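The staleness check amounts to tracking the timestamp of the last broker confirmation. A minimal sketch (the class name and injectable clock are my own, used to make the logic testable):

```python
import time

class ConnectionWatchdog:
    """Declares the connection stale if no broker confirmation (PUBACK,
    SUBACK, delivery ACK) has arrived within the timeout window."""

    def __init__(self, timeout_sec=120, clock=time.monotonic):
        self.timeout_sec = timeout_sec
        self.clock = clock
        self.last_confirmation = clock()

    def on_broker_confirmation(self):
        # Called from the MQTT callbacks (on_publish, on_subscribe, ...)
        self.last_confirmation = self.clock()

    def is_stale(self):
        # Note: checks broker acknowledgements, not TCP socket state.
        return (self.clock() - self.last_confirmation) > self.timeout_sec
```

When is_stale() returns true, the gateway destroys and reinitializes its MQTT client rather than trusting the half-open socket.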

Phase 4: Steady-State Tag Reading

Once PLC connections and MQTT are established, the gateway enters its main polling loop. This is where it spends 99.9% of its runtime.

The Main Loop (1-second resolution)

The core loop runs every second and performs three operations:

  1. Configuration check — detect if any configuration file has been modified (via file stat monitoring)
  2. Tag read cycle — iterate through all configured tags and read those whose polling interval has elapsed
  3. Command processing — check the incoming command queue for cloud-side instructions (config updates, manual reads, interval changes)

Interval-Based Polling

Each tag has a polling interval in seconds. The gateway maintains a monotonic clock timestamp of the last read for each tag. On each loop iteration:

for each tag in device.tags:
    elapsed = now - tag.last_read_time
    if elapsed >= tag.interval_sec:
        read_tag(tag)
        tag.last_read_time = now
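The same scheduling logic, made runnable against a monotonic timestamp (a sketch; Tag here is a minimal stand-in for the gateway's real tag structure):

```python
class Tag:
    def __init__(self, tag_id, interval_sec):
        self.tag_id = tag_id
        self.interval_sec = interval_sec
        self.last_read_time = float("-inf")  # force a read on the first pass

def due_tags(tags, now):
    """Return the tags whose polling interval has elapsed at `now`
    (a monotonic-clock timestamp), marking them as read."""
    due = []
    for tag in tags:
        if now - tag.last_read_time >= tag.interval_sec:
            tag.last_read_time = now
            due.append(tag)
    return due
```

On the first loop iteration every tag is due; afterwards a 1-second tag fires each second while a 60-second tag fires once a minute.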

Typical intervals by data category:

| Data Type                    | Interval | Rationale                    |
|------------------------------|----------|------------------------------|
| Temperatures, pressures      | 60s      | Slow-changing process values |
| Alarm states (booleans)      | 1s       | Immediate awareness needed   |
| Machine state (running/idle) | 1s      | OEE calculation accuracy     |
| Batch counts                 | 1s       | Production tracking          |
| Version, serial number       | 3600s    | Static values, verify hourly |

Compare Mode: Change-Based Delivery

For many tags, sending the same value every second is wasteful. If a chiller alarm bit is false for 8 hours straight, that's 28,800 redundant messages.

Compare mode solves this: the gateway stores the last-read value and only delivers to the cloud when the value changes. This is configured per tag:

{
  "name": "Compressor Fault Alarm",
  "type": "bool",
  "interval": 1,
  "compare": true,
  "do_not_batch": true
}

This tag is read every second, but only transmitted when it changes. The do_not_batch flag means changes are sent immediately rather than waiting for the next batch finalization — critical for alarm states where latency matters.

Hourly Full Refresh

There's a subtle problem with pure change-based delivery: if a value changes while the MQTT connection is down, the cloud never learns about the transition. And if a value stays constant for days, the cloud has no heartbeat confirming the sensor is still alive.

The solution: every hour (on the hour change), the gateway resets all "read once" flags, forcing a complete re-read and re-delivery of all tags. This guarantees the cloud has fresh values at least hourly, regardless of change activity.
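Compare mode and the hourly refresh together need only a little per-tag state: the last value seen and a force-delivery flag. An illustrative sketch (names are my own, not the gateway's internals):

```python
class CompareTag:
    """Change-based delivery with a force-refresh flag for the hourly reset."""

    def __init__(self, compare=True):
        self.compare = compare
        self.last_value = None
        self.force_next = True  # deliver the first read unconditionally

    def should_deliver(self, value):
        deliver = (not self.compare) or self.force_next or value != self.last_value
        self.last_value = value
        self.force_next = False
        return deliver

def hourly_refresh(tags):
    # On the hour change, force re-delivery of every tag so the cloud gets
    # fresh values even if nothing changed (or a change was missed offline).
    for t in tags:
        t.force_next = True
```

A steady alarm bit is read every second but delivered only on its first read, on each transition, and once per hour after the refresh.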

Phase 5: Data Batching and Delivery

Raw tag values don't get sent individually (except high-priority alarms). Instead, they're collected into batches for efficient delivery.

Binary Encoding

Production gateways use binary encoding rather than JSON to minimize bandwidth. The binary format packs values tightly:

Header:       1 byte   (0xF7 = tag values)
Group count:  4 bytes  (number of timestamp groups)

Per group:
  Timestamp:    4 bytes
  Device type:  2 bytes
  Serial num:   4 bytes
  Value count:  4 bytes

Per value:
  Tag ID:      2 bytes
  Status:      1 byte   (0x00=OK, else error code)
  Array size:  1 byte   (if status=OK)
  Elem size:   1 byte   (1, 2, or 4 bytes per element)
  Data:        size × count bytes

A batch containing 20 float values uses about 200 bytes in binary vs. ~2,000 bytes in JSON — a 10× bandwidth reduction that matters on cellular connections billed per megabyte.
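Assuming the field layout above (big-endian throughout, as in the worked hex example later in this series), an encoder is a few lines of packing. Illustrative Python; the production gateways do this in C:

```python
import struct

def encode_batch(groups):
    """Encode timestamp groups into the 0xF7 binary layout (big-endian).

    groups: list of (timestamp, device_type, serial, values) tuples, where
    values is a list of (tag_id, status, elem_size, elements).
    """
    out = bytearray([0xF7])                       # header: tag values
    out += struct.pack(">I", len(groups))         # group count
    for ts, dev_type, serial, values in groups:
        out += struct.pack(">IHII", ts, dev_type, serial, len(values))
        for tag_id, status, elem_size, elems in values:
            out += struct.pack(">HB", tag_id, status)
            if status == 0x00:                    # payload only when OK
                out += struct.pack(">BB", len(elems), elem_size)
                fmt = {1: ">B", 2: ">H", 4: ">I"}[elem_size]
                for e in elems:
                    out += struct.pack(fmt, e)
    return bytes(out)
```

One group carrying a single 2-byte value costs 26 bytes total: 5 bytes of batch header, 14 of group header, and 7 for the value itself.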

Batch Finalization Triggers

A batch is finalized (sent to MQTT) when either:

  1. Size threshold — the batch reaches the configured maximum size (default: 4,000 bytes)
  2. Time threshold — the batch has been collecting for longer than batch_timeout_sec (default: 60 seconds)

This ensures data reaches the cloud within 60 seconds even during low-activity periods, while maximizing batch efficiency during high-activity periods (like a blender running a batch cycle that triggers many dependent tag reads).
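A sketch of the dual-trigger finalization logic (the injectable clock is my own addition, there to make the time threshold testable):

```python
import time

class Batch:
    """Finalizes on whichever threshold trips first: size or age."""

    def __init__(self, max_bytes=4000, timeout_sec=60, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.timeout_sec = timeout_sec
        self.clock = clock
        self.data = bytearray()
        self.started = None  # timestamp of the first value in this batch

    def add(self, encoded):
        if self.started is None:
            self.started = self.clock()
        self.data += encoded

    def ready(self):
        if self.started is None:
            return False                      # nothing collected yet
        if len(self.data) >= self.max_bytes:
            return True                       # size threshold
        return self.clock() - self.started >= self.timeout_sec

    def finalize(self):
        out, self.data, self.started = bytes(self.data), bytearray(), None
        return out
```

The main loop checks ready() each second; finalize() hands the accumulated bytes to the buffering layer and starts a fresh batch.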

The Paged Ring Buffer

Between the batching layer and the MQTT publish layer sits a paged ring buffer. This is the gateway's resilience layer against network outages.

The buffer divides available memory into fixed-size pages. Each page holds one or more complete MQTT messages. The buffer operates as a queue:

  • Write side: Finalized batches are written to the current work page. When a page fills up, it moves to the "used" queue.
  • Read side: When MQTT is connected, the gateway publishes the oldest used page. Upon receiving a PUBACK (delivery confirmation), the page moves to the "free" pool.
  • Overflow: If all pages are used (network down too long), the gateway overwrites the oldest used page — losing the oldest data to preserve the newest.

This design means the gateway can buffer 15-60 minutes of telemetry data during a network outage (depending on available memory and data density), then drain the buffer once connectivity restores.

Disconnect Recovery

When the MQTT connection drops:

  1. The buffer's "connected" flag is cleared
  2. All pending publish operations are halted
  3. Incoming PLC data continues to be read, batched, and buffered
  4. The MQTT async thread begins reconnection
  5. On reconnection, the buffer's "connected" flag is set, and data delivery resumes from the oldest undelivered page

This means zero data loss during short outages (up to the buffer capacity), and newest-data-preserved during long outages (the overflow policy drops oldest data first).

Phase 6: Remote Configuration and Control

A production gateway accepts commands from the cloud over its MQTT subscription topic. This enables remote management without SSH access.

Supported Command Types

| Command        | Direction      | Description                                                |
|----------------|----------------|------------------------------------------------------------|
| daemon_config  | Cloud → Device | Update central configuration (IP addresses, serial params) |
| device_config  | Cloud → Device | Update device-specific tag configuration                   |
| get_status     | Cloud → Device | Request current daemon/PLC/TCU status report               |
| get_status_ext | Cloud → Device | Request extended status with last tag values               |
| read_now_plc   | Cloud → Device | Force immediate read of a specific tag                     |
| tag_update     | Cloud → Device | Change a tag's polling interval remotely                   |

Remote Interval Adjustment

This is a powerful production feature: the cloud can remotely change how often specific tags are polled. During a quality investigation, an engineer might temporarily increase temperature polling from 60s to 5s to capture rapid transients. After the investigation, they reset to 60s via another command.

The gateway applies interval changes immediately and persists them to the configuration file, so they survive a restart. The modified_intervals flag in status reports tells the cloud that intervals have been manually adjusted.
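Applying such a command to an in-memory configuration is a small operation. The command schema below is illustrative, not machineCDN's actual wire format:

```python
def apply_tag_update(config, command):
    """Apply a cloud-issued tag_update command to a parsed device config.

    Illustrative command shape: {"cmd": "tag_update", "tag_id": 80, "interval": 5}
    Returns True if a matching tag was found and updated.
    """
    for tag in config["tags"]:
        if tag["id"] == command["tag_id"]:
            tag["interval"] = command["interval"]
            # Surfaced in status reports so the cloud knows intervals
            # were manually adjusted; also triggers persistence to file.
            config["modified_intervals"] = True
            return True
    return False  # unknown tag: report an error to the cloud instead
```

After a successful update the gateway would write the modified config back to flash so the new interval survives a restart.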

Designing for Constrained Hardware

These gateways often run on embedded Linux routers with severely constrained resources:

  • RAM: 64-128MB (of which 30-40MB is available after OS)
  • CPU: MIPS or ARM, 500-800 MHz, single core
  • Storage: 16-32MB flash (no disk)
  • Network: Cellular (LTE Cat 4/Cat M1) or Ethernet

Design constraints this imposes:

  1. Fixed memory allocation — allocate all buffers at startup, never malloc() during runtime. A memory fragmentation crash at 3 AM in a factory with no IT staff is unrecoverable.

  2. No floating-point unit — older MIPS processors do software float emulation. Keep float operations to a minimum; do heavy math in the cloud.

  3. Flash wear — don't write configuration changes to flash more than necessary. Batch writes, use write-ahead logging if needed.

  4. Watchdog timer — use the hardware watchdog timer. If the main loop hangs, the hardware reboots the gateway automatically.

How machineCDN Implements These Patterns

machineCDN's ACS (Auxiliary Communication System) gateway embodies all of these lifecycle patterns in a production-hardened implementation that's been running on thousands of plastics manufacturing machines for years.

The gateway runs on Teltonika RUT9XX industrial cellular routers, providing cellular connectivity for machines in facilities without available Ethernet. It supports EtherNet/IP and Modbus (both TCP and RTU) simultaneously, auto-detecting device types at boot and loading the appropriate configuration from a library of pre-built equipment profiles.

For manufacturers deploying machineCDN, the complexity described in this article — protocol detection, configuration management, MQTT buffering, recovery — is entirely handled by the platform. The result is that plant engineers get reliable, continuous telemetry from their equipment without needing to understand (or debug) the edge gateway's internal lifecycle.


Understanding how edge gateways actually work — not just what they do, but how they manage their lifecycle — is essential for building reliable IIoT infrastructure. The patterns described here (startup sequencing, multi-protocol detection, buffered delivery, watchdog recovery) separate toy deployments from production systems that run for years without intervention.

ISA-95 and IIoT Integration: Bridging IT and OT in Modern Manufacturing

· 9 min read
MachineCDN Team
Industrial IoT Experts

ISA-95 was created in the late 1990s to solve a simple problem: how should enterprise systems (ERP) communicate with plant floor systems (PLCs and SCADA)? Two decades later, IIoT platforms have disrupted the neat hierarchical model that ISA-95 defined. Data now flows from sensors directly to the cloud, bypassing every layer in between. The question for manufacturing engineers in 2026 isn't whether ISA-95 is still relevant — it's how to reconcile a framework built for hierarchical, on-premises architectures with the reality of cloud-native, edge-computing IIoT platforms.

Thread-Safe Telemetry Pipelines: Building Concurrent IIoT Edge Gateways That Don't Lose Data [2026]

· 17 min read

An edge gateway on a factory floor isn't a REST API handling one request at a time. It's a real-time system juggling multiple competing demands simultaneously: polling a PLC for tag values every second, buffering data locally when the cloud connection drops, transmitting batched telemetry over MQTT, processing incoming configuration commands from the cloud, and monitoring its own health — all at once, on hardware with the computing power of a ten-year-old smartphone.

Get the concurrency wrong, and you don't get a 500 error in your logs. You get silent data loss, corrupted telemetry batches, or — worst case — a watchdog reboot loop that takes your monitoring offline during a critical production run.

This guide covers the architecture patterns that make industrial edge gateways reliable under real-world conditions: concurrent PLC polling, thread-safe buffering, MQTT delivery guarantees, and the store-and-forward patterns that keep data flowing when the network doesn't.

Thread-safe edge gateway architecture with concurrent data pipelines

The Concurrency Challenge in Industrial Edge Gateways

A typical edge gateway has at least three threads running concurrently:

  1. The polling thread — reads tags from PLCs at configured intervals (1-second to 60-second cycles)
  2. The MQTT network thread — manages the broker connection, handles publish/subscribe, reconnection
  3. The main control thread — processes incoming commands, monitors watchdog timers, manages configuration

These threads all share one critical resource: the outgoing data buffer. The polling thread writes telemetry into the buffer. The MQTT thread reads from the buffer and transmits data. When the connection drops, the buffer must hold data without the polling thread stalling. When the connection recovers, the buffer must drain in order without losing or duplicating messages.

This is a classic producer-consumer problem, but with industrial constraints that make textbook solutions insufficient.

Why Standard Queues Fall Short

Your first instinct might be to use a thread-safe queue — a ConcurrentLinkedQueue in Java, a queue.Queue in Python, or a lock-free ring buffer. These work fine for web applications, but industrial edge gateways have constraints that break standard queue implementations:

1. Memory Is Fixed and Finite

Edge gateways run on embedded hardware with 64 MB to 512 MB of RAM — no swap space, no dynamic allocation after startup. An unbounded queue will eventually exhaust memory during a long network outage. A fixed-size queue forces you to choose: block the producer (stalling PLC polling) or drop the oldest data.

2. Network Outages Last Hours, Not Seconds

In a factory, network outages aren't transient blips. A fiber cut, a misconfigured switch, or a power surge on the network infrastructure can take connectivity down for hours. Your buffer needs to hold potentially thousands of telemetry batches — not just a few dozen.

3. Delivery Confirmation Is Asynchronous

MQTT QoS 1 guarantees at-least-once delivery, but the PUBACK confirmation comes back asynchronously — possibly hundreds of milliseconds after the PUBLISH. During that window, you can't release the buffer space (the message might need retransmission), and you can't stall the producer (PLC data keeps flowing).

4. Data Must Survive Process Restarts

If the edge gateway daemon restarts (due to a configuration update, a watchdog trigger, or a power cycle), buffered-but-undelivered data must be recoverable. Purely in-memory queues lose everything.

The Paged Ring Buffer Pattern

The pattern that works in production is a paged ring buffer — a fixed-size memory region divided into pages, with explicit state tracking for each page. Here's how it works:

Memory Layout

At startup, the gateway allocates a single contiguous memory block and divides it into equal-sized pages:

┌─────────┬─────────┬─────────┬─────────┬─────────┐
│ Page 0  │ Page 1  │ Page 2  │ Page 3  │ Page 4  │
│  FREE   │  FREE   │  FREE   │  FREE   │  FREE   │
└─────────┴─────────┴─────────┴─────────┴─────────┘

Each page has its own header tracking:

  • A page number (for logging and debugging)
  • A start_p pointer (beginning of writable space)
  • A write_p pointer (current write position)
  • A read_p pointer (current read position for transmission)
  • A next pointer (linking to the next page in whatever list it's in)

Three Page Lists

Pages move between three linked lists:

  1. Free pages — available for the producer to write into
  2. Used pages — full of data, queued for transmission
  3. Work page — the single page currently being written to

Producer (Polling Thread)             Consumer (MQTT Thread)
        │
        ▼
  ┌──────────┐      When full       ┌──────────┐
  │Work Page │ ───────────────────► │Used Pages│ ──► MQTT Publish
  │(writing) │                      │ (queued) │
  └──────────┘                      └──────────┘
        ▲                                 │
        │          When delivered         │
  ┌──────────┐ ◄──────────────────────────┘
  │Free Pages│
  │ (empty)  │
  └──────────┘

The Producer Path

When the polling thread has a new batch of tag values to store:

  1. Check the work page — if there's no current work page, grab one from the free list
  2. Calculate space — check if the new data fits in the remaining space on the work page
  3. If it fits — write the data (with a size header) and advance write_p
  4. If it doesn't fit — move the work page to the used list, grab a new page (from free, or steal the oldest from used if free is empty), and write there
  5. After writing — check if there's data ready to transmit and kick the consumer

The critical detail: if the free list is empty, the producer steals the oldest used page. This means during extended outages, the buffer wraps around and overwrites the oldest data — exactly the behavior you want. Recent data is more valuable than stale data in industrial monitoring.
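The producer path, including the steal-oldest overflow policy, can be sketched with a small page pool. Python deques stand in for the C linked lists, and the sketch assumes every message fits within a page:

```python
from collections import deque

class PagedRingBuffer:
    """Fixed page pool: the producer writes the work page, full pages queue
    for transmission, and when no free page remains the oldest used page is
    overwritten (newest data wins during long outages)."""

    def __init__(self, page_count, page_size):
        self.page_size = page_size
        self.free = deque(bytearray() for _ in range(page_count))
        self.used = deque()   # full pages awaiting delivery
        self.work = None      # page currently being written

    def write(self, msg):
        if self.work is None:
            self.work = self._take_page()
        if len(self.work) + len(msg) > self.page_size:
            self.used.append(self.work)       # promote the full work page
            self.work = self._take_page()
        self.work += msg

    def _take_page(self):
        if self.free:
            return self.free.popleft()
        page = self.used.popleft()            # steal (and drop) the oldest
        page.clear()
        return page

    def oldest(self):
        """Peek at the oldest undelivered page (the consumer's next send)."""
        return bytes(self.used[0]) if self.used else None

    def ack(self):
        """Called on PUBACK: release the oldest used page back to free."""
        page = self.used.popleft()
        page.clear()
        self.free.append(page)
```

With two 4-byte pages, writing a third full page steals the oldest: the first page's data is gone, the second is next to transmit, and the newest stays in the work page.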

The Consumer Path

When the MQTT connection is active and there's data to send:

  1. Check the used page list — if empty, check if the work page has unsent data and promote it
  2. Read the next message from the first used page's read_p position
  3. Publish via MQTT with QoS 1
  4. Set a "packet sent" flag — this prevents sending the next message until the current one is acknowledged
  5. Wait for PUBACK — when the broker confirms receipt, advance read_p
  6. If read_p reaches write_p — the page is fully delivered; move it back to the free list
  7. Repeat — grab the next message from the next used page

The Mutex Strategy

The entire buffer is protected by a single mutex. This might seem like a bottleneck, but in practice:

  • Write operations (adding data) take microseconds
  • Read operations (preparing to transmit) take microseconds
  • The actual MQTT transmission happens outside the mutex — only the buffer state management is locked

The mutex is held for a few microseconds at a time, never during network I/O. This keeps the polling thread from ever blocking on network latency.

Polling Thread:                 MQTT Thread:
  lock(mutex)                     lock(mutex)
  write data to page              read data from page
  check if page full              mark as sent
  maybe promote page              unlock(mutex)
  trigger send check              ─── MQTT publish ───
  unlock(mutex)                     (outside mutex!)
                                  lock(mutex)
                                  process PUBACK
                                  maybe free page
                                  unlock(mutex)
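The same discipline in runnable form: the lock guards only the queue bookkeeping, and the publish call runs with the lock released. This sketch simulates a synchronous PUBACK for brevity; the names are my own:

```python
import threading

class BufferState:
    """A single mutex protects the bookkeeping; network I/O stays outside."""

    def __init__(self):
        self.lock = threading.Lock()
        self.queue = []        # finalized messages awaiting publish
        self.in_flight = None  # message published but not yet PUBACKed

    def put(self, msg):                  # producer side
        with self.lock:
            self.queue.append(msg)

    def next_to_send(self):              # consumer: claim under lock
        with self.lock:
            if self.in_flight is None and self.queue:
                self.in_flight = self.queue.pop(0)
                return self.in_flight
            return None

    def on_puback(self):                 # broker confirmed delivery
        with self.lock:
            msg, self.in_flight = self.in_flight, None
            return msg

def deliver(state, publish):
    """One publish-loop iteration: the slow network call holds no lock."""
    msg = state.next_to_send()
    if msg is not None:
        publish(msg)     # <-- runs without the mutex held
        state.on_puback()
```

The producer's put() never waits on network latency, only on the microsecond-scale list operations.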

Message Framing Inside Pages

Each page holds multiple messages packed sequentially. Each message has a simple header:

┌──────────────┬──────────────┬─────────────────────┐
│ Message ID │ Message Size │ Message Body │
│ (4 bytes) │ (4 bytes) │ (variable) │
└──────────────┴──────────────┴─────────────────────┘

The Message ID field is initially zero. When the MQTT library publishes the message, it fills in the packet ID assigned by the broker. This is how the consumer tracks which specific message was acknowledged — when the PUBACK callback fires with a packet ID, it can match it to the message at read_p and advance.

This framing makes the buffer self-describing. During recovery after a restart, the gateway can scan page contents by reading size headers sequentially.
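A recovery scan is then a sequential walk of the size headers. This sketch assumes a zero size field marks the end of valid data, which the article does not specify; treat that sentinel as an illustration:

```python
import struct

def scan_page(page):
    """Walk a page's [id(4) | size(4) | body] frames sequentially,
    recovering the message list after a restart."""
    messages, offset = [], 0
    while offset + 8 <= len(page):
        msg_id, size = struct.unpack_from(">II", page, offset)
        if size == 0 or offset + 8 + size > len(page):
            break                      # end of valid region: stop scanning
        messages.append((msg_id, page[offset + 8: offset + 8 + size]))
        offset += 8 + size
    return messages
```

Note the message IDs recovered this way are zero unless a publish already stamped them with a broker packet ID.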

Handling Disconnections Gracefully

When the MQTT connection drops, the consumer thread must handle it without corrupting the buffer:

Connection Lost:
  1. Set connected = 0
  2. Clear "packet sent" flag
  3. Do NOT touch any page pointers

That's it. The producer keeps writing — it doesn't know or care about the connection state. The buffer absorbs data normally.

When the connection recovers:

Connection Restored:
  1. Set connected = 1
  2. Trigger send check (under mutex)
  3. Consumer picks up where it left off

The key insight: the "packet sent" flag prevents double-sending. If a PUBLISH was in flight when the connection dropped, the PUBACK never arrived. The flag remains set, but the disconnection handler clears it. When the connection recovers, the consumer re-reads the same message from read_p (which was never advanced) and re-publishes it. The broker either receives a duplicate (handled by QoS 1 dedup) or receives it for the first time.
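The flag-and-pointer interplay can be captured in a few lines of state (an illustrative sketch; a list stands in for the page contents):

```python
class DeliveryState:
    """Tracks the single in-flight message: read_p advances only on PUBACK,
    so a message caught mid-flight by a disconnect is re-sent on reconnect."""

    def __init__(self, messages):
        self.messages = messages
        self.read_p = 0
        self.packet_sent = False
        self.connected = True

    def next_publish(self):
        if self.connected and not self.packet_sent and self.read_p < len(self.messages):
            self.packet_sent = True
            return self.messages[self.read_p]
        return None

    def on_puback(self):
        self.packet_sent = False
        self.read_p += 1          # only now is the message released

    def on_disconnect(self):
        self.connected = False
        self.packet_sent = False  # allow re-send; read_p is untouched

    def on_reconnect(self):
        self.connected = True
```

If the broker happened to receive the original PUBLISH before the drop, the re-send becomes a QoS 1 duplicate the cloud deduplicates by timestamp.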

Binary vs. JSON Batch Encoding

The telemetry data written into the buffer can be encoded in two formats, and the choice affects both bandwidth and reliability.

JSON Format

Each batch is a JSON object containing groups of timestamped values:

{
  "groups": [
    {
      "ts": 1709424000,
      "device_type": 1017,
      "serial_number": 123456,
      "values": [
        {"id": 80, "values": [725]},
        {"id": 81, "values": [680]},
        {"id": 82, "values": [285]}
      ]
    }
  ]
}

Pros: Human-readable, easy to debug, parseable by any language. Cons: 5-8× larger than binary, float precision loss (decimal representation), size estimation is rough.

Binary Format

A compact binary encoding with a header byte (0xF7), followed by big-endian packed groups:

F7                      ← Header
00 00 00 01             ← Number of groups (1)
65 E8 2C 00             ← Timestamp (Unix epoch)
03 F9                   ← Device type (1017)
00 01 E2 40             ← Serial number
00 00 00 03             ← Number of values (3)
00 50 00 01 02 02 D5    ← Tag 80: status=0, 1 value, 2 bytes, 725
00 51 00 01 02 02 A8    ← Tag 81: status=0, 1 value, 2 bytes, 680
00 52 00 01 02 01 1D    ← Tag 82: status=0, 1 value, 2 bytes, 285

Pros: 5-8× smaller, perfect float fidelity (raw bytes preserved), exact size calculation. Cons: Requires matching decoder on the cloud side, harder to debug without tools.

For gateways communicating over cellular connections — common in remote facilities like water treatment plants, oil wells, or distributed renewable energy sites — binary encoding is essentially mandatory. A gateway polling 100 tags every 10 seconds generates about 260 MB/month in JSON versus 35 MB/month in binary. At typical IoT cellular rates ($0.50-$2.00/MB), that's the difference between $130/month and $17/month per gateway.

The MQTT Watchdog Pattern

MQTT connections can enter a zombie state — technically connected according to the TCP stack, but the broker has stopped responding. This is especially common behind industrial firewalls and NAT devices with aggressive connection timeout policies.

The Problem

The MQTT library reports the connection as alive. The gateway publishes messages. No PUBACK comes back — ever. The buffer fills up because the consumer thinks each message is "in flight" (the packet_sent flag is set). Eventually the buffer wraps and data loss begins.

The Solution: Last-Delivered Timestamp

Track the timestamp of the last successful PUBACK. If more than N seconds have passed since the last acknowledged delivery, and there are messages waiting to be sent, the connection is stale:

monitor_watchdog():
    if connected AND packet_sent:
        elapsed = now - last_delivered_packet_timestamp
        if elapsed > WATCHDOG_THRESHOLD:
            // Force disconnect and reconnect
            force_disconnect()
            // Disconnection handler clears packet_sent
            // Reconnection handler will re-deliver from read_p
A typical threshold is 60 seconds for LAN connections and 120 seconds for cellular. This catches zombie connections that the TCP stack and MQTT keep-alive miss.

Reconnection with Backoff

When the watchdog (or a genuine disconnection) triggers a reconnect, use a dedicated thread for the connection attempt. The connect_async call can block for the TCP timeout duration (potentially 30+ seconds), and you don't want that blocking the main loop or the polling thread.

A semaphore controls the reconnection thread:

Main Thread:                     Reconnection Thread:
  Detects need to reconnect        (blocked on semaphore)
  Posts semaphore ────────────►    Wakes up
                                   Calls connect_async()
                                   (may block 30s)
                                   Success or failure
  Waits for "done"  ◄──────────    Posts "done" semaphore
  Checks result

The reconnect delay should be fixed and short (5 seconds is typical) for industrial applications, not exponential backoff. In a factory, the network outage either resolves quickly (a transient) or it's a hard failure that needs human intervention. Exponential backoff just delays reconnection after the network recovers.

Batching Strategy: Size vs. Time

Telemetry batches should be finalized and queued for transmission based on whichever threshold hits first: size or time.

Size-Based Finalization

When the accumulated batch data exceeds a configured maximum (typically 400-500 KB for JSON, 50-100 KB for binary), finalize and queue it. This prevents any single MQTT message from being too large for the broker or the network MTU.

Time-Based Finalization

When the batch has been collecting data for more than a configured timeout (typically 30-60 seconds), finalize it regardless of size. This ensures that even slowly-changing tags get transmitted within a bounded time window.
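The dual-threshold check can be sketched in Python. The byte and age limits below are illustrative values from the ranges above (binary encoding); an empty batch is never finalized, no matter how old.

```python
import time

MAX_BATCH_BYTES = 50 * 1024   # illustrative binary-encoding size limit
MAX_BATCH_AGE_S = 30          # illustrative time limit


class Batch:
    """Accumulates encoded tag groups; finalizes on size OR age, whichever first."""

    def __init__(self):
        self.data = bytearray()
        self.started = time.monotonic()

    def add(self, encoded_group: bytes):
        self.data += encoded_group

    def should_finalize(self) -> bool:
        if not self.data:
            return False  # nothing to send yet
        too_big = len(self.data) >= MAX_BATCH_BYTES
        too_old = time.monotonic() - self.started >= MAX_BATCH_AGE_S
        return too_big or too_old
```

When `should_finalize()` returns true, the encoded batch is moved into the ring buffer as one message and a fresh `Batch` is started.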

The Interaction Between Batching and Buffering

Batching and buffering are separate concerns that interact:

PLC Tags ──► Batch (collecting) ──► Buffer Page (queued) ──► MQTT (transmitted)

Tag reads accumulate     When batch finalizes,      Pages are transmitted
in the batch structure   the encoded batch goes     one at a time with
                         into the ring buffer       PUBACK confirmation
A batch contains one or more "groups" — each group is a set of tag values read at the same timestamp. Multiple polling cycles might go into a single batch before it's finalized by size or time. The finalized batch then goes into the ring buffer as a single message.

Dependent Tag Reads and Atomic Groups

In many PLC configurations, certain tags are only meaningful when read together. For example:

  • Alarm word tags — a uint16 register where each bit represents a different alarm. You read the alarm word, then extract the individual bits. If the alarm word changes, you need to read and deliver the extracted bits atomically with the parent.

  • Machine state transitions — when a "blender running" tag changes from 0 to 1, you might need to immediately read all associated process values (RPM, temperatures, pressures) to capture the startup snapshot.

The architecture handles this through dependent tag chains:

Parent Tag (alarm_word, interval=1s, compare=true)
  ├── Calculated Tag (alarm_bit_0, shift=0, mask=0x01)
  ├── Calculated Tag (alarm_bit_1, shift=1, mask=0x01)
  ├── Dependent Tag (motor_speed, read_on_change=true)
  └── Dependent Tag (temperature, read_on_change=true)

When the parent tag changes, the polling thread:

  1. Finalizes the current batch
  2. Recursively reads all dependent tags (forced read, ignoring intervals)
  3. Starts a new batch group with the same timestamp

This ensures that the dependent values are timestamped identically with the trigger event and delivered together.
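The shift/mask extraction for calculated tags is a one-liner; this sketch mirrors the `shift=` and `mask=` attributes in the chain above (the function name is an assumption).

```python
def extract_calculated(parent_value: int, shift: int, mask: int) -> int:
    """Derive a calculated tag (e.g. one alarm bit) from a parent register word."""
    return (parent_value >> shift) & mask
```

With an alarm word of `0b0101`, `shift=0, mask=0x01` yields bit 0 (active) and `shift=1, mask=0x01` yields bit 1 (clear) — both extracted from the same parent read, so they automatically share its timestamp.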

Hourly Full-Read Reset

Change-of-value (COV) filtering dramatically reduces bandwidth, but it introduces a subtle failure mode: if a value changes during a transient read error, the gateway might never know it changed.

Here's the scenario:

  1. At 10:00:00, tag value = 72.5 → transmitted
  2. At 10:00:01, the PLC returns a read error for that tag → nothing transmitted
  3. At 10:00:02, tag value = 73.0 → compared against the last successful read (72.5), change detected, transmitted

In this case the comparison against the last successfully read value still catches the change, so a transient read error by itself doesn't lose an update.

The real problem is when:

  1. At 10:00:00, tag value = 72.5 → transmitted
  2. The PLC program changes the tag to 73.0 and then back to 72.5 between polling cycles
  3. The gateway never sees 73.0 — it polls at 10:00:00 and 10:00:01 and gets 72.5 both times

For most industrial applications, this sub-second transient is irrelevant. But to guard against drift — where small rounding differences accumulate between the gateway's cached value and the PLC's actual value — a full reset is performed every hour:

Every hour boundary (when the system clock's hour changes):
1. Clear the "read once" flag on every tag
2. Clear all last-known values
3. Force read and transmit every tag regardless of COV

This guarantees that the cloud platform has a complete snapshot of every tag value at least once per hour, even for tags that haven't changed.
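A sketch of the hour-boundary check in Python, assuming tags are simple dicts with `read_once` and `last_value` fields (the real daemon would operate on its own tag structures):

```python
import time


def hour_of(ts: float) -> int:
    """Hour-of-day for a Unix timestamp (UTC)."""
    return time.gmtime(ts).tm_hour


def maybe_hourly_reset(tags, last_poll_ts: float, now_ts: float) -> bool:
    """On an hour boundary, clear COV state so every tag is re-read and re-sent."""
    if hour_of(now_ts) == hour_of(last_poll_ts):
        return False
    for tag in tags:
        tag["read_once"] = False   # forces a fresh read this cycle
        tag["last_value"] = None   # defeats the change-of-value comparison
    return True
```

Clearing `last_value` means the next read always looks like a change, so the full snapshot flows through the normal COV path without any special-case delivery code.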

Putting It All Together: The Polling Loop

Here's the complete polling loop architecture that ties all these patterns together:

main_polling_loop():
    FOREVER:
        current_time = monotonic_clock()

        FOR each configured device:
            // Hourly reset check
            if hour(current_time) != hour(last_poll_time):
                reset_all_tags(device)

            // Start a new batch group
            start_group(device.batch, unix_timestamp())

            FOR each tag in device.tags:
                // Check if this tag needs reading now
                if not tag.read_once OR elapsed(tag.last_read) >= tag.interval:

                    value, status = read_tag(device, tag)

                    if status == LINK_ERROR:
                        set_link_state(device, DOWN)
                        break  // Stop reading this device

                    set_link_state(device, UP)

                    // COV check
                    if tag.compare AND tag.read_once:
                        if value == tag.last_value AND status == tag.last_status:
                            continue  // No change, skip

                    // Deliver value
                    if tag.do_not_batch:
                        deliver_immediately(device, tag, value)
                    else:
                        add_to_batch(device.batch, tag, value)

                    // Check dependent tags
                    if value_changed AND tag.has_dependents:
                        finalize_batch()
                        read_dependents(device, tag)
                        start_new_group()

                    // Update tracking
                    tag.last_value = value
                    tag.last_status = status
                    tag.read_once = true
                    tag.last_read = current_time

            // Finalize batch group
            stop_group(device.batch, output_buffer)
            // ↑ This checks size/time thresholds and may
            //   queue the batch into the ring buffer

        sleep(polling_interval)

Performance Characteristics

On a typical industrial edge gateway (ARM Cortex-A9, 512 MB RAM, Linux):

| Operation | Time | Notes |
|---|---|---|
| Mutex lock/unlock | ~1 µs | Per buffer operation |
| Modbus TCP read (10 registers) | 5-15 ms | Network dependent |
| Modbus RTU read (10 registers) | 20-50 ms | Baud rate dependent (9600-115200) |
| EtherNet/IP tag read | 2-8 ms | CIP overhead |
| JSON batch encoding | 0.5-2 ms | 100 tags |
| Binary batch encoding | 0.1-0.5 ms | 100 tags |
| MQTT publish (QoS 1) | 1-5 ms | LAN broker |
| Buffer page write | 5-20 µs | memcpy only |

The bottleneck is always the PLC protocol reads, not the buffer or transmission logic. A gateway polling 200 Modbus TCP tags can complete a full cycle in under 200 ms, leaving plenty of headroom for a 1-second polling interval.

For Modbus RTU (serial), the bottleneck shifts to the baud rate. At 9600 baud, a single register read takes ~15 ms including response. Polling 50 registers individually would take 750 ms — too close to a 1-second interval. This is why contiguous register grouping matters: reading 50 consecutive registers in a single request takes about 50 ms, a 15× improvement.
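The contiguous-grouping step can be sketched in Python. The 125-register ceiling is the Modbus limit for a single holding-register read (function code 3); the function collapses a configured tag address list into `(start, count)` ranges, one read request each.

```python
def group_contiguous(registers):
    """Collapse register addresses into (start, count) ranges for single
    Modbus read requests. Max 125 registers per request (Modbus FC3 limit)."""
    MAX_PER_REQUEST = 125
    ranges = []
    for addr in sorted(set(registers)):
        if (ranges
                and addr == ranges[-1][0] + ranges[-1][1]   # adjacent to last range
                and ranges[-1][1] < MAX_PER_REQUEST):       # room in the request
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + 1)
        else:
            ranges.append((addr, 1))
    return ranges
```

Fifty consecutive registers collapse into one `(start, 50)` request instead of fifty round trips — exactly the 15× win described above for serial links.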

How machineCDN Implements These Patterns

machineCDN's edge gateway uses exactly these patterns — paged ring buffers with mutex-protected page management, QoS 1 MQTT with PUBACK-based buffer advancement, and both binary and JSON encoding depending on the deployment's bandwidth constraints.

The platform's gateway daemon runs on Linux-based edge hardware (including cellular routers like the Teltonika RUT series) and handles simultaneous Modbus RTU, Modbus TCP, and EtherNet/IP connections to mixed-vendor equipment. The buffer is sized during commissioning based on the expected outage duration — a 64 KB buffer holds roughly 4 hours of data at typical polling rates; a 512 KB buffer extends that to over 24 hours.

The result: plants running machineCDN don't lose telemetry during network outages. When connectivity recovers, the buffered data drains automatically and fills in the gaps in trending charts and analytics — no manual intervention, no missing data points.

Key Takeaways

  1. Use paged ring buffers, not unbounded queues — fixed memory, graceful overflow (oldest data dropped first)
  2. Protect buffer operations with a mutex, but never hold it during network I/O — microsecond lock durations keep producers and consumers non-blocking
  3. Track PUBACK per-message to prevent double-sending and enable reliable buffer advancement
  4. Implement an MQTT watchdog using last-delivery timestamps to catch zombie connections
  5. Batch by size OR time (whichever hits first) to balance bandwidth and latency
  6. Reset all tags hourly to guarantee complete snapshots and prevent drift
  7. Binary encoding saves 5-8× bandwidth with zero precision loss — essential for cellular-connected gateways
  8. Group contiguous Modbus registers into single requests — 15× faster than individual reads on RTU

Building a reliable IIoT edge gateway is fundamentally a systems programming challenge. The protocols, the buffering, the concurrency — each one is manageable alone, but getting them all right together, on constrained hardware, with zero tolerance for data loss, is what separates toy prototypes from production infrastructure.


See machineCDN's store-and-forward buffering in action with real factory data. Request a demo to explore the platform.

MQTT Topic Architecture for Multi-Site Manufacturing: Designing Scalable Namespaces That Don't Collapse at 10,000 Devices [2026]

· 14 min read
MachineCDN Team
Industrial IoT Experts

Every MQTT tutorial starts the same way: sensor/temperature. Clean, simple, obvious. Then you ship to production and discover that topic architecture is to MQTT what database schema is to SQL — get it wrong early and you'll spend the next two years paying for it.

Manufacturing environments are particularly brutal to bad topic design. A single plant might have 200 machines, each with 30–100 tags, across 8 production lines, reporting to 4 different consuming systems (historian, SCADA, analytics, alerting). Multiply by 5 plants across 3 countries, and your MQTT broker is routing messages across a topic tree with 50,000+ leaf nodes. The topic hierarchy you chose in month one determines whether this scales gracefully or becomes an operational nightmare.

Edge vs Cloud for Industrial Data: Where Should You Process Your Manufacturing Data?

· 9 min read
MachineCDN Team
Industrial IoT Experts

The edge vs. cloud debate in industrial IoT has been argued for years, and both sides have valid points. Edge advocates emphasize latency, reliability, and bandwidth costs. Cloud advocates point to scalability, advanced analytics, and reduced on-site infrastructure. The reality — as experienced by anyone who's actually deployed IIoT in a manufacturing environment — is that the answer is almost always "both."

But "both" isn't helpful without specifics. Which data should be processed at the edge? What belongs in the cloud? How should the two layers communicate? And what does this architecture actually look like when you're connecting PLCs on a factory floor to AI-powered analytics?

This guide provides practical answers for manufacturing engineers and plant managers who need to make architecture decisions without a PhD in distributed systems.