32 posts tagged with "mqtt"

Industrial Network Security for OT Engineers: TLS, Certificates, and Zero-Trust on the Plant Floor [2026]

· 15 min read

Industrial security used to mean padlocking the control room and keeping the plant network air-gapped. Those days ended the moment someone plugged a cellular gateway into the PLC cabinet. Now every edge device streaming telemetry to the cloud is an attack surface — and the cryptominer that quietly hijacked your VM last month was the gentle reminder.

This guide covers the practical security mechanisms you need to protect industrial data in transit — MQTT over TLS, certificate management for OPC-UA and cloud brokers, SAS token lifecycle, network segmentation patterns, and what zero-trust actually means when your "users" are PLC gateways running on ARM processors with 256MB of RAM.

MQTT Broker Architecture for Industrial Deployments: Clustering, Persistence, and High Availability [2026]

· 11 min read

Every IIoT tutorial makes MQTT look simple: connect, subscribe, publish. Three calls and you're streaming telemetry. What those tutorials don't tell you is what happens when your broker goes down at 2 AM, your edge gateway's cellular connection drops for 40 minutes, or your plant generates 50,000 messages per second and you need every single one to reach the historian.

Industrial MQTT isn't a protocol problem. It's an architecture problem. The protocol itself is elegant and well-specified. The hard part is designing the broker infrastructure — clustering, persistence, session management, and failover — so that zero messages are lost when (not if) something fails.

This article is for engineers who've gotten past "hello world" and need to build MQTT infrastructure that meets manufacturing reliability requirements. We'll cover the internal mechanics that matter, the failure modes you'll actually hit, and the architecture patterns that work at scale.

How MQTT Brokers Actually Handle Messages

Before discussing architecture, let's nail down what the broker is actually doing internally. This understanding is critical for sizing, troubleshooting, and making sensible design choices.

The Session State Machine

When a client connects with CleanSession=false (MQTT 3.1.1) or CleanStart=false with a non-zero SessionExpiryInterval (MQTT 5.0), the broker creates a persistent session bound to the client ID. This session maintains:

  • The set of subscriptions (topic filters + QoS levels)
  • QoS 1 and QoS 2 messages queued while the client is offline
  • In-flight QoS 2 message state (PUBLISH received, PUBREC sent, waiting for PUBREL)
  • The packet identifier namespace

This is the mechanism that makes MQTT suitable for unreliable networks — and it's the mechanism that will eat your broker's memory and disk if you don't manage it carefully.

Message Flow at QoS 1

Most industrial deployments use QoS 1 (at least once delivery). Here's what actually happens inside the broker:

  1. Publisher sends PUBLISH with QoS 1 and a packet identifier
  2. Broker receives the message and must:
    • Match the topic against all active subscription filters
    • For each matching subscription, enqueue the message
    • For connected subscribers with matching QoS, deliver immediately
    • For disconnected subscribers with persistent sessions, store in the session queue
    • Persist the message to disk (if persistence is enabled) before acknowledging
  3. Broker sends PUBACK to the publisher — only after all storage operations complete
  4. For each connected subscriber, broker sends PUBLISH and waits for PUBACK
  5. If PUBACK isn't received, broker retransmits on reconnection

The critical detail: step 3 is the durability guarantee. If the broker crashes between receiving the PUBLISH and sending the PUBACK, the publisher will retransmit. If the broker crashes after PUBACK but before delivering to all subscribers, the message must survive the crash — which means it must be on disk.
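The filter-matching work in step 2 is easy to underestimate. As a concrete illustration, here is a minimal sketch of MQTT's topic-filter wildcard rules in Python (it ignores the spec's corner cases around `$`-prefixed topics and empty levels):

```python
def topic_matches(filt: str, topic: str) -> bool:
    """MQTT filter matching: '+' matches exactly one level, '#' matches the rest."""
    flevels = filt.split("/")
    tlevels = topic.split("/")
    for i, f in enumerate(flevels):
        if f == "#":                       # multi-level wildcard: matches here and below
            return True
        if i >= len(tlevels):
            return False
        if f != "+" and f != tlevels[i]:   # '+' matches any single level
            return False
    return len(flevels) == len(tlevels)

print(topic_matches("plant/+/temperature", "plant/chiller-01/temperature"))  # True
print(topic_matches("plant/#", "plant/chiller-01/temperature"))              # True
print(topic_matches("plant/+", "plant/chiller-01/temperature"))              # False
```

A broker runs this comparison (usually via a topic trie rather than a linear scan) for every PUBLISH against every subscription, which is why subscription count matters for sizing.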

QoS 2: The Four-Phase Handshake

QoS 2 (exactly once) uses a four-message handshake: PUBLISH → PUBREC → PUBREL → PUBCOMP. The broker must maintain state for each in-flight QoS 2 transaction. In industrial settings, this is occasionally used for critical state changes (machine start/stop commands, recipe downloads) where duplicate delivery would cause real damage.

The operational cost: each QoS 2 message requires 4x the network round trips of QoS 0, and the broker must maintain per-message transaction state. For high-frequency telemetry, this is almost never worth the overhead. QoS 1 with application-level deduplication (using message timestamps or sequence numbers) is the standard industrial approach.
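Application-level deduplication can be as simple as tracking the highest sequence number seen per publisher. A minimal sketch (assuming each publisher embeds a monotonically increasing sequence number, which MQTT's per-publisher ordering guarantee makes safe):

```python
class Deduplicator:
    """Drop QoS 1 redeliveries using a per-publisher sequence number."""
    def __init__(self):
        self.last_seen = {}   # publisher_id -> highest sequence number processed

    def accept(self, publisher_id: str, seq: int) -> bool:
        if seq <= self.last_seen.get(publisher_id, -1):
            return False      # duplicate or stale redelivery: skip it
        self.last_seen[publisher_id] = seq
        return True

dedup = Deduplicator()
print(dedup.accept("gw-01", 1))  # True  (first delivery)
print(dedup.accept("gw-01", 1))  # False (QoS 1 redelivery)
print(dedup.accept("gw-01", 2))  # True
```

This gives you effectively-once processing at a fraction of QoS 2's protocol cost.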

Broker Persistence: What Gets Stored and Where

In-Memory vs Disk-Backed

A broker with no persistence is a broker that loses messages on restart. Period. For development and testing, in-memory operation is fine. For production industrial deployments, you need disk-backed persistence.

What needs to be persisted:

| Data | Purpose | Storage Impact |
| --- | --- | --- |
| Retained messages | Last-known-good value per topic | Grows with topic count |
| Session state | Offline subscriber queues | Grows with offline duration × message rate |
| In-flight messages | QoS 1/2 messages awaiting acknowledgment | Usually small, bounded by max_inflight |
| Will messages | Last-will-and-testament per client | One per connected client |

The session queue is where most storage problems originate. Consider: an edge gateway publishes 100 tags at 1-second intervals. Each message is ~200 bytes. If the cloud subscriber goes offline for 1 hour, that's 360,000 messages × 200 bytes = ~72 MB queued for that single client. Now multiply by 50 gateways across a plant.

Practical Queue Management

Every production broker deployment needs queue limits:

  • Maximum queue depth — Cap the number of messages per session queue. When the queue is full, either drop the oldest message (most common for telemetry) or reject new publishes (appropriate for control messages).
  • Maximum queue size in bytes — A secondary safeguard when message sizes vary.
  • Message expiry — MQTT 5.0 supports per-message expiry intervals. For telemetry data, 1-hour expiry is typical — a temperature reading from 3 hours ago has no operational value.
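The expiry behavior is straightforward to reason about with a sketch. This pure-Python illustration drops expired messages from an offline session queue at drain time, mimicking the 1-hour telemetry expiry described above (the function names are illustrative, not any broker's API):

```python
import time
from collections import deque

EXPIRY_SECONDS = 3600  # 1-hour telemetry expiry

def enqueue(queue: deque, payload: bytes, now=None):
    queue.append((now if now is not None else time.time(), payload))

def drain_fresh(queue: deque, now=None):
    """Yield only messages younger than the expiry interval; silently
    drop the rest, as MQTT 5.0 per-message expiry does for a session queue."""
    now = now if now is not None else time.time()
    while queue:
        ts, payload = queue.popleft()
        if now - ts < EXPIRY_SECONDS:
            yield payload   # still operationally relevant

q = deque()
enqueue(q, b"temp=21.4", now=0)       # stale: 90 minutes old at drain time
enqueue(q, b"temp=21.5", now=5000)    # fresh
print(list(drain_fresh(q, now=5400))) # [b'temp=21.5']
```

Expiry bounds the session-queue growth problem described above: a subscriber that reconnects after a long outage receives at most one expiry window of backlog.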

A well-configured broker with 4 GB of RAM can handle approximately:

  • 100,000 active sessions
  • 500,000 subscriptions
  • 10,000 messages/second throughput
  • 50 MB of retained messages

These are ballpark figures that vary enormously with message size, topic tree depth, and subscription overlap. Always benchmark with your actual traffic profile.

Clustering: Why and How

A single broker is a single point of failure. For industrial deployments where telemetry loss means blind spots in production monitoring, you need broker clustering.

Active-Active vs Active-Passive

Active-passive (warm standby): One broker handles all traffic. A secondary broker synchronizes state and takes over on failure. Failover time: typically 5-30 seconds depending on detection mechanism.

Active-active (load sharing): Multiple brokers share the client load. Messages published to any broker are replicated to subscribers on other brokers. This provides both high availability and horizontal scalability.

The Shared Subscription Problem

In a clustered setup, if three subscribers share a subscription (e.g., three historian instances for redundancy), each message should be delivered to exactly one of them — not all three. MQTT 5.0's shared subscriptions ($share/group/topic) handle this, distributing messages round-robin among group members.

Without shared subscriptions, each historian instance receives every message, tripling your write load. This is one of the strongest arguments for MQTT 5.0 over 3.1.1 in industrial architectures.
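The round-robin distribution is easy to picture with a small simulation (a deliberately simplified sketch; real brokers also rebalance when group members disconnect):

```python
from itertools import cycle

class SharedSubscription:
    """Round-robin dispatch for a $share/<group>/<topic> subscription."""
    def __init__(self, members):
        self._ring = cycle(members)

    def dispatch(self, message):
        member = next(self._ring)   # exactly one group member receives the message
        return member, message

historians = SharedSubscription(["historian-a", "historian-b", "historian-c"])
for i in range(4):
    print(historians.dispatch(f"msg-{i}"))
# msg-0 -> historian-a, msg-1 -> historian-b, msg-2 -> historian-c, msg-3 -> historian-a
```

Each historian instance sees one third of the stream, so total write load stays constant regardless of how many redundant consumers you add.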

Message Ordering Guarantees

MQTT guarantees message ordering per publisher, per topic, per QoS level. In a clustered broker, maintaining this guarantee across brokers requires careful replication design. Most broker clusters provide:

  • Strong ordering for messages within a single broker node
  • Eventual ordering for messages replicated across nodes (typically < 100ms delay)

For industrial telemetry where timestamps are embedded in the payload, eventual ordering is almost always acceptable. For control messages where sequencing matters, route the publisher and subscriber to the same broker node.

Designing the Edge-to-Cloud Pipeline

The most common industrial MQTT architecture has three layers:

Layer 1: Edge Broker (On-Premises)

Runs on the edge gateway or a local server within the plant network. Responsibilities:

  • Local subscribers — HMI panels, local alarm engines, historian
  • Store-and-forward buffer — Queues messages when cloud connectivity is lost
  • Protocol translation — Accepts data from Modbus/EtherNet/IP collectors and publishes to MQTT
  • Data reduction — Filters unchanged values, aggregates high-frequency data

The edge broker must run on reliable storage (SSD, not SD card) because it's your buffer against network outages. Size the storage for your worst-case outage duration:

Storage needed = (messages/sec) × (avg message size) × (max outage seconds)

Example: 500 msg/s × 200 bytes × 3600 sec = 360 MB per hour of outage

Layer 2: Bridge to Cloud

The edge broker bridges selected topics to a cloud-hosted broker or IoT hub. Key configuration decisions:

  • Bridge QoS — Use QoS 1 for the bridge connection. QoS 0 means any TCP reset loses messages in transit. QoS 2 adds overhead with little benefit, since duplicate telemetry readings are harmless to consumers that deduplicate by embedded timestamp.
  • Topic remapping — Prefix bridged topics with a plant/location identifier. A local topic machines/chiller-01/temperature becomes plant-detroit/machines/chiller-01/temperature in the cloud.
  • Bandwidth throttling — Limit the bridge's publish rate to avoid saturating the WAN link. If local collection runs at 500 msg/s but your link can sustain 200 msg/s, the edge broker must buffer or aggregate the difference.
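Topic remapping and throttling can both live in a thin publish wrapper. The sketch below uses a token bucket for rate limiting; `BridgePublisher` and its names are hypothetical, and production bridges usually configure this declaratively rather than in code:

```python
import time

class BridgePublisher:
    """Remap local topics with a plant prefix and throttle bridge publishes."""
    def __init__(self, prefix: str, max_rate: float):
        self.prefix = prefix
        self.max_rate = max_rate        # messages/second the WAN link sustains
        self.tokens = max_rate          # token bucket, refilled at max_rate
        self.last = time.monotonic()

    def try_publish(self, topic: str, payload: bytes):
        now = time.monotonic()
        self.tokens = min(self.max_rate,
                          self.tokens + (now - self.last) * self.max_rate)
        self.last = now
        if self.tokens < 1:
            return None                 # over budget: caller buffers or aggregates
        self.tokens -= 1
        return (f"{self.prefix}/{topic}", payload)

bridge = BridgePublisher("plant-detroit", max_rate=200)
print(bridge.try_publish("machines/chiller-01/temperature", b"21.4"))
# ('plant-detroit/machines/chiller-01/temperature', b'21.4')
```

When `try_publish` returns None, the edge broker's store-and-forward queue absorbs the difference between the local collection rate and the WAN budget.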

Layer 3: Cloud Broker Cluster

Receives bridged data from all plants. Serves cloud-hosted consumers: analytics pipelines, dashboards, ML training jobs. This layer typically uses a managed service (Azure IoT Hub, AWS IoT Core, HiveMQ Cloud) or a self-hosted cluster.

Key sizing for cloud brokers:

  • Concurrent connections — One per edge gateway, plus cloud consumers
  • Message throughput — Sum of all edge bridge rates
  • Retention — Typically short (minutes to hours). Long-term storage is the historian's job.

Connection Management: The Details That Bite You

Keep-Alive and Half-Open Connections

MQTT's keep-alive mechanism is your primary tool for detecting dead connections. When a client sets keepAlive=60, it must send a PINGREQ within 60 seconds if no other packets are sent. The broker will close the connection after 1.5× the keep-alive interval with no activity.

In industrial environments, be aware of:

  • NAT timeouts — Many firewalls and NAT devices close idle TCP connections after 30-120 seconds. Set keep-alive below your NAT timeout.
  • Cellular networks — 4G/5G connections can silently disconnect. A keep-alive of 30 seconds is aggressive but appropriate for cellular gateways.
  • Half-open connections — The TCP connection is dead but neither side has detected it. Until keep-alive expires, the broker maintains the session and queues messages that will never be delivered. This is why aggressive keep-alive matters.

Last Will and Testament for Device Health

Configure every edge gateway with a Last Will and Testament (LWT):

Topic: devices/{device-id}/status
Payload: {"status": "offline", "timestamp": 1709251200}
QoS: 1
Retain: true

On clean connection, publish a retained "online" message to the same topic. Now any subscriber can check device status by reading the retained message on the status topic. If the device disconnects uncleanly (network failure, power loss), the broker publishes the LWT automatically.

This pattern provides a real-time device health map across your entire fleet without any polling or heartbeat logic in your application.

Authentication and Authorization at Scale

Certificate-Based Authentication

For fleets of 100+ edge gateways, username/password authentication becomes an operational burden. Certificate-based TLS client authentication scales better:

  • Issue each gateway a unique X.509 certificate from your PKI
  • Configure the broker to extract the client identity from the certificate's Common Name (CN) or Subject Alternative Name (SAN)
  • Revoke compromised devices by updating the Certificate Revocation List (CRL) — no password rotation needed

Topic-Level Authorization

Not every device should publish to every topic. A well-designed ACL (Access Control List) restricts:

  • Each gateway can only publish to plants/{plant-id}/devices/{device-id}/#
  • Each gateway can only subscribe to plants/{plant-id}/devices/{device-id}/commands/#
  • Cloud services can subscribe to plants/+/devices/+/# (wildcard across all plants)
  • No device can subscribe to another device's command topics

This contains the blast radius of a compromised device. It can only pollute its own data stream, not inject false data into other devices' telemetry.
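The ACL rules above reduce to a small authorization predicate. This is an illustrative Python sketch of the policy, not any broker's ACL syntax (brokers like Mosquitto or HiveMQ express the same rules in their own configuration formats):

```python
def authorized(client_id: str, plant: str, action: str, topic: str) -> bool:
    """Per-device policy: publish only to your own namespace,
    subscribe only to your own command topics."""
    own = f"plants/{plant}/devices/{client_id}/"
    if action == "publish":
        return topic.startswith(own)
    if action == "subscribe":
        return topic.startswith(own + "commands/")
    return False

print(authorized("gw-7", "detroit", "publish",
                 "plants/detroit/devices/gw-7/temperature"))          # True
print(authorized("gw-7", "detroit", "subscribe",
                 "plants/detroit/devices/gw-9/commands/stop"))        # False
```

Cloud services get a separate, broader policy (wildcard subscribe across plants) that no device credential can satisfy.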

Monitoring Your Broker: The Metrics That Matter

$SYS Topics

Most MQTT brokers expose internal metrics via $SYS/ topics:

  • $SYS/broker/messages/received — Total messages received (track rate, not absolute)
  • $SYS/broker/clients/connected — Current connected client count
  • $SYS/broker/subscriptions/count — Active subscription count
  • $SYS/broker/retained/messages/count — Retained message store size
  • $SYS/broker/heap/current — Memory usage

Operational Alerts

Set alerts for:

  • Connected client count drops > 10% in 5 minutes → possible network issue
  • Message rate drops > 50% vs rolling average → possible edge gateway failure
  • Heap usage > 80% of available → approaching memory limit, check session queue sizes
  • Subscription count anomaly → possible subscription leak (client reconnecting without cleaning up)

Where machineCDN Fits

All of this broker infrastructure complexity is why industrial IIoT platforms exist. machineCDN's edge software handles the protocol collection layer (Modbus, EtherNet/IP, and more), implements the store-and-forward buffering that keeps data safe during connectivity gaps, and manages the secure delivery pipeline to cloud infrastructure. The goal is to let plant engineers focus on what the data means rather than how to transport it reliably.

Whether you build your own MQTT infrastructure or use a managed platform, the principles in this article apply. Understand your persistence requirements, size your queues for realistic outage durations, and test failover before you need it in production. The protocol is simple. The architecture is where the engineering happens.

Quick Reference: Broker Sizing Calculator

| Plant Size | Edge Gateways | Tags/Gateway | Msgs/sec (total) | Min Broker RAM | Storage (1hr buffer) |
| --- | --- | --- | --- | --- | --- |
| Small | 10 | 50 | 500 | 1 GB | 360 MB |
| Medium | 50 | 100 | 5,000 | 4 GB | 3.6 GB |
| Large | 200 | 200 | 40,000 | 16 GB | 28.8 GB |
| Enterprise | 500+ | 500 | 250,000 | 64 GB+ | 180 GB+ |

These assume 200-byte average message size, QoS 1, and 1-second publishing intervals per tag. Your mileage will vary — always benchmark with representative traffic.

MQTT Store-and-Forward for IIoT: Building Bulletproof Edge-to-Cloud Pipelines [2026]

· 12 min read

Factory networks go down. Cellular modems lose signal. Cloud endpoints hit capacity limits. VPN tunnels drop for seconds or hours. And through all of it, your PLCs keep generating data that cannot be lost.

Store-and-forward buffering is the difference between an IIoT platform that works in lab demos and one that survives a real factory. This guide covers the engineering patterns — memory buffer design, connection watchdogs, batch queuing, and delivery confirmation — that keep telemetry flowing even when the network doesn't.

Protocol Bridging: Translating Modbus to MQTT at the Industrial Edge [2026]

· 15 min read

Every plant floor speaks Modbus. Every cloud platform speaks MQTT. The 20 inches of Ethernet cable between them is where industrial IoT projects succeed or fail.

Protocol bridging — the act of reading data from one industrial protocol and publishing it via another — sounds trivial on paper. Poll a register, format a JSON payload, publish to a topic. Three lines of pseudocode. But the engineers who've actually deployed these bridges at scale know the truth: the hard problems aren't in the translation. They're in the timing, the buffering, the failure modes, and the dozens of edge cases that only surface when a PLC reboots at 2 AM while your MQTT broker is mid-failover.

This guide covers the real engineering of Modbus-to-MQTT bridges — from register-level data mapping to store-and-forward architectures that survive weeks of disconnection.

Why Bridging Is Harder Than It Looks

Modbus and MQTT are fundamentally different communication paradigms. Understanding these differences is critical to building a bridge that doesn't collapse under production conditions.

Modbus is synchronous and polled. The master (your gateway) initiates every transaction. It sends a request frame, waits for a response, processes the data, and moves on. There's no concept of subscriptions, push notifications, or asynchronous updates. If you want a value, you ask for it. Every. Single. Time.

MQTT is asynchronous and event-driven. Publishers send messages whenever they have data. Subscribers receive messages whenever they arrive. The broker decouples producers from consumers. There's no concept of polling — data flows when it's ready.

Bridging these two paradigms means your gateway must act as a Modbus master on one side (issuing timed read requests) and an MQTT client on the other (publishing messages asynchronously). The gateway is the only component that speaks both languages, and it bears the full burden of timing, error handling, and data integrity.

The Timing Mismatch

Modbus RTU on RS-485 at 9600 baud takes roughly 20ms per single-register transaction (request frame + inter-frame delay + response frame + turnaround time). Reading 100 registers individually would take 2 seconds — an eternity if you need sub-second update rates.

Modbus TCP eliminates the serial timing constraints but introduces TCP socket management, connection timeouts, and the possibility of the PLC's TCP stack running out of connections (most PLCs support only 4–8 simultaneous TCP connections).

MQTT, meanwhile, can handle thousands of messages per second. The bottleneck is never the MQTT side — it's always the Modbus side. Your bridge architecture must respect the slower protocol's constraints while maximizing throughput.

Register Mapping: The Foundation

The first engineering decision is how to map Modbus registers to MQTT topics and payloads. There are three common approaches, each with trade-offs.

Approach 1: One Register, One Message

Topic: plant/line3/plc1/holding/40001
Payload: {"value": 1847, "ts": 1709312400, "type": "uint16"}

Pros: Simple, granular, easy to subscribe to individual data points. Cons: Catastrophic at scale. 200 registers means 200 MQTT publishes per poll cycle. At a 1-second poll rate, that's 200 messages/second — sustainable for the broker, but wasteful in bandwidth and processing overhead on constrained gateways.

Approach 2: Batched JSON Messages

Topic: plant/line3/plc1/batch
Payload: {
  "ts": 1709312400,
  "device_type": 1010,
  "tags": [
    {"id": 1, "value": 1847, "type": "uint16"},
    {"id": 2, "value": 23.45, "type": "float"},
    {"id": 3, "value": true, "type": "bool"}
  ]
}

Pros: Drastically fewer MQTT messages. One publish carries an entire poll cycle's worth of data. Cons: JSON encoding adds CPU overhead on embedded gateways. Payload size can grow large if you have hundreds of tags.

Approach 3: Binary-Encoded Batches

Instead of JSON, encode tag values in a compact binary format: a header with timestamp and device metadata, followed by packed tag records (tag ID + status + type + value). A single 16-bit register value takes 2 bytes in binary vs. ~30 bytes in JSON.

Pros: Minimum bandwidth. Critical for cellular-connected gateways where data costs money per megabyte. Cons: Requires matching decoders on the cloud side. Harder to debug.

The right approach depends on your constraints. For Ethernet-connected gateways with ample bandwidth, batched JSON is the sweet spot. For cellular or satellite links, binary encoding can reduce data costs by 10–15x.

Contiguous Register Coalescing

The single most impactful optimization in any Modbus-to-MQTT bridge is contiguous register coalescing: instead of reading registers one at a time, group adjacent registers into a single Modbus read request.

Consider a tag list where you need registers at addresses 40100, 40101, 40102, 40103, and 40110. A naive implementation makes 5 read requests. A smart bridge recognizes that 40100–40103 are contiguous and reads them in one Read Holding Registers (function code 03) call with a quantity of 4. That's 2 transactions instead of 5.

The coalescing logic must respect several constraints:

  1. Same function code. You can't coalesce a coil read (FC 01) with a holding register read (FC 03). The bridge must group tags by their Modbus register type — coils (0xxxxx), discrete inputs (1xxxxx), input registers (3xxxxx), and holding registers (4xxxxx) — and coalesce within each group.

  2. Maximum register count per transaction. The Modbus specification limits a single read to 125 registers (for 16-bit registers) or 2000 coils. In practice, keeping blocks under 50 registers reduces the risk of timeout errors on slower PLCs.

  3. Addressing gaps. If registers 40100 and 40150 both need reading, coalescing them into a single 51-register read wastes 49 registers worth of response data. Set a maximum gap threshold (e.g., 10 registers) — if the gap exceeds it, split into separate transactions.

  4. Same polling interval. Tags polled every second shouldn't be grouped with tags polled every 60 seconds. Coalescing must respect per-tag timing configuration.

# Coalescing algorithm (Python sketch)
# Sorting by (function code, interval, address) keeps incompatible tags apart.
tags.sort(key=lambda t: (t.function_code, t.interval, t.address))
group_head = tags[0]
group_count = 1
group_registers = group_head.elem_count

for tag in tags[1:]:
    if (tag.function_code == group_head.function_code
            and tag.interval == group_head.interval
            and tag.address == group_head.address + group_registers
            and group_registers + tag.elem_count <= MAX_BLOCK_SIZE):
        # extend the current group
        group_registers += tag.elem_count
        group_count += 1
    else:
        # read the current group, start a new one
        read_modbus_block(group_head, group_count, group_registers)
        group_head = tag
        group_count = 1
        group_registers = tag.elem_count

read_modbus_block(group_head, group_count, group_registers)  # flush the final group

In production deployments, contiguous coalescing routinely reduces Modbus transaction counts by 5–10x, which directly translates to faster poll cycles and fresher data.

Data Type Handling: Where the Devil Lives

Modbus registers are 16-bit words. Everything else — 32-bit integers, IEEE 754 floats, booleans packed into bit fields — is a convention imposed by the PLC programmer. Your bridge must handle all of these correctly.

32-Bit Values Across Two Registers

A 32-bit float or integer spans two consecutive 16-bit Modbus registers. The critical question: which register contains the high word?

There's no standard. Some PLCs use big-endian word order (high word first, often called "ABCD" byte order). Others use little-endian word order (low word first, "CDAB"). Some use mid-endian orders ("BADC" or "DCBA"). You must know your PLC's convention, or your 23.45°C temperature reading becomes 1.7e+38 garbage.

For IEEE 754 floats specifically, the conversion from two 16-bit registers to a float is:

# Python sketch using the standard struct module
# Big-endian word order (ABCD): register[n] holds the high word
float_value = struct.unpack('>f', struct.pack('>HH', register[n], register[n+1]))[0]

# Little-endian word order (CDAB): register[n] holds the low word
float_value = struct.unpack('>f', struct.pack('>HH', register[n+1], register[n]))[0]

Production bridges must support configurable byte/word ordering on a per-tag basis, because it's common to have PLCs from different manufacturers on the same network.

Boolean Extraction From Status Words

PLCs frequently pack multiple boolean states into a single 16-bit register — machine running, alarm active, door open, etc. Extracting individual bits requires configurable shift-and-mask operations:

bit_value = (register_value >> shift_count) & mask

Where shift_count identifies the bit position (0–15) and mask is typically 0x01 for a single bit. The bridge's tag configuration should support this as a first-class feature, not a post-processing hack.
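A short worked example, using an invented bit layout for illustration:

```python
def extract_bit(register_value: int, shift_count: int, mask: int = 0x01) -> int:
    """Shift-and-mask extraction of a packed boolean from a 16-bit status word."""
    return (register_value >> shift_count) & mask

# Hypothetical layout: bit 0 = running, bit 2 = alarm active, bit 5 = door open
status = 0b0000_0000_0010_0100
print(extract_bit(status, 2))   # 1 -> alarm active
print(extract_bit(status, 5))   # 1 -> door open
print(extract_bit(status, 0))   # 0 -> machine not running
```

Each extracted bit becomes its own tag with its own MQTT topic, even though all of them originate from a single Modbus read.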

Type Safety Across the Bridge

When values cross from Modbus to MQTT, type information must be preserved. A uint16 register value of 65535 means something very different from a signed int16 value of -1 — even though the raw bits are identical. Your MQTT payload must carry the type alongside the value, whether in JSON field names or binary format headers.

Connection Resilience: The Store-and-Forward Pattern

The Modbus side of a protocol bridge is local — wired directly to PLCs over Ethernet or RS-485. It rarely fails. The MQTT side connects to a remote broker over a WAN link that will fail. Cellular drops out. VPN tunnels collapse. Cloud brokers restart for maintenance.

A production bridge must implement store-and-forward: continue reading from Modbus during MQTT outages, buffer the data locally, and drain the buffer when connectivity returns.

Page-Based Ring Buffers

The most robust buffering approach for embedded gateways uses a page-based ring buffer in pre-allocated memory:

  1. Format a fixed memory region into equal-sized pages at startup.
  2. Write incoming Modbus data to the current "work page." When a page fills, move it to the "used" queue.
  3. Send pages from the "used" queue to MQTT, one message at a time. Wait for the MQTT publish acknowledgment (at QoS 1) before advancing the read pointer.
  4. Recycle fully-delivered pages back to the "free" list.

If the MQTT connection drops:

  • Stop sending, but keep writing to new pages.
  • If all pages fill up (true buffer overflow), start overwriting the oldest used page. You lose the oldest data, but never the newest.

This design has several properties that matter for industrial deployments:

  • No dynamic memory allocation. The entire buffer is pre-allocated. No malloc, no fragmentation, no out-of-memory crashes at 3 AM.
  • Bounded memory usage. You know exactly how much RAM the buffer consumes. Critical on gateways with 64–256 MB.
  • Delivery guarantees. Each page tracks its own read pointer. If the gateway crashes mid-delivery, the page is re-sent on restart (at-least-once semantics).
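The page lifecycle above can be sketched in a few dozen lines. This is a deliberately simplified in-memory illustration (a real gateway pre-allocates the region once, persists pages to flash, and tracks a per-page read pointer for crash recovery; all names are hypothetical):

```python
class PageRingBuffer:
    """In-memory sketch of the free -> work -> used -> free page lifecycle."""
    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free = [bytearray(page_size) for _ in range(num_pages)]  # pre-allocated
        self.used = []                        # filled pages awaiting MQTT delivery
        self.work = self.free.pop()           # current work page
        self.fill = 0

    def write(self, record: bytes):
        """Append one record (assumed smaller than a page) to the work page."""
        if self.fill + len(record) > self.page_size:        # work page is full
            self.used.append(bytes(self.work[:self.fill]))  # snapshot to used queue
            if self.free:
                self.work = self.free.pop()
            else:                             # true overflow: drop the oldest page,
                self.used.pop(0)              # losing the oldest data, never the newest
                self.work = bytearray(self.page_size)
            self.fill = 0
        self.work[self.fill:self.fill + len(record)] = record
        self.fill += len(record)

    def drain(self, publish) -> int:
        """Deliver used pages oldest-first; advance only after an acked publish."""
        sent = 0
        while self.used and publish(self.used[0]):          # e.g. QoS 1 PUBACK received
            self.used.pop(0)
            self.free.append(bytearray(self.page_size))     # recycle the slot
            sent += 1
        return sent

buf = PageRingBuffer(num_pages=4, page_size=16)
for i in range(10):
    buf.write(b"rec%02d;" % i)        # 6-byte records, two fit per 16-byte page
print(buf.drain(lambda page: True))   # 3: four pages filled, the oldest dropped on overflow
```

Note how `drain` stops the moment `publish` reports failure: undelivered pages simply wait in the used queue until the connection returns.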

How Long Can You Buffer?

Quick math: A gateway reading 100 tags every 5 seconds generates roughly 2 KB of batched JSON per poll cycle. That's 24 KB/minute, 1.4 MB/hour, 34 MB/day. A 256 MB buffer holds 7+ days of data. In binary format, that extends to 50+ days.

For most industrial applications, 24–48 hours of buffering is sufficient to survive maintenance windows, network outages, and firmware upgrades.

MQTT Connection Management

The MQTT side of the bridge deserves careful engineering. Industrial connections aren't like web applications — they run for months without restart, traverse multiple NATs and firewalls, and must recover automatically from every failure mode.

Async Connection With Threaded Reconnect

Never block the Modbus polling loop waiting for an MQTT connection. The correct architecture uses a separate thread for MQTT connection management:

  1. The main thread polls Modbus on a tight timer and writes data to the buffer.
  2. A connection thread handles MQTT connect/reconnect attempts asynchronously.
  3. The buffer drains automatically when the MQTT connection becomes available.

This separation ensures that a 30-second MQTT connection timeout doesn't stall your 1-second Modbus poll cycle. Data keeps flowing into the buffer regardless of MQTT state.
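The thread split can be sketched with the standard library. `MqttLink` and `connect_fn` are hypothetical stand-ins for a real client's connect call; the retry delay is shortened here for illustration:

```python
import threading
import time

class MqttLink:
    """Non-blocking connection manager: the Modbus loop never enters this code."""
    def __init__(self, connect_fn, retry_delay=0.05):
        self.connected = threading.Event()
        self._connect_fn = connect_fn
        self._delay = retry_delay             # ~5 s in a real deployment
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Fixed-delay retry loop running entirely on the connection thread.
        while not self.connected.is_set():
            if self._connect_fn():            # attempt (re)connection
                self.connected.set()
            else:
                time.sleep(self._delay)

attempts = []
def flaky_connect():
    attempts.append(1)
    return len(attempts) >= 3                 # succeeds on the third attempt

link = MqttLink(flaky_connect, retry_delay=0.01)
link.connected.wait(timeout=2)
print(link.connected.is_set(), len(attempts))  # True 3
```

The main thread only ever checks `link.connected.is_set()` to decide whether the buffer can drain; it never waits on a socket.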

Reconnect Strategy

Use a fixed reconnect delay (5 seconds works well for most deployments) rather than exponential backoff. Industrial MQTT connections are long-lived — the overhead of a 5-second retry is negligible compared to the cost of missing data during a 60-second exponential backoff.

However, protect against connection storms: if the broker is down for an extended period, ensure reconnect attempts don't overwhelm the gateway's CPU or the broker's TCP listener.

TLS Certificate Management

Production MQTT bridges almost always use TLS (port 8883 rather than 1883). The bridge must handle:

  • Certificate expiration. Monitor the TLS certificate file's modification timestamp. If the cert file changes on disk, tear down the current MQTT connection and reinitialize with the new certificate. Don't wait for the existing connection to fail — proactively reconnect.
  • SAS token rotation. When using Azure IoT Hub or similar services with time-limited tokens, parse the token's expiration timestamp and reconnect before it expires.
  • CA certificate bundles. Embedded gateways often ship with minimal CA stores. Ensure your IoT hub's root CA is explicitly included in the gateway's certificate chain.

Change-of-Value vs. Periodic Reporting

Not all tags need the same reporting strategy. A bridge should support both:

Periodic reporting publishes every tag value at a fixed interval, regardless of whether the value changed. Simple, predictable, but wasteful for slowly-changing values like ambient temperature or firmware version.

Change-of-value (COV) reporting compares each newly read value against the previous value and only publishes when a change is detected. This dramatically reduces MQTT traffic for boolean states (machine on/off), setpoints, and alarm registers that change infrequently.

The implementation stores the last-read value for each tag and performs a comparison before deciding whether to publish:

if tag.compare_enabled:
    if new_value != tag.last_value:
        publish(tag, new_value)
        tag.last_value = new_value
else:
    publish(tag, new_value)  # always publish

A hybrid approach works best: use COV for digital signals and alarm words, periodic for analog measurements like temperature and pressure. Some tags (critical alarms, safety interlocks) should always be published immediately — bypassing both the normal comparison logic and the batching system — to minimize latency.

Calculated and Dependent Tags

Real-world PLCs don't always expose data in the format you need. A bridge should support calculated tags — values derived from raw register data through mathematical or bitwise operations.

Common patterns include:

  • Bit extraction from status words. A 16-bit register contains 16 individual boolean states. The bridge extracts each bit as a separate tag using shift-and-mask operations.
  • Scaling and offset. Raw register value 4000 represents 400.0°F when divided by 10. The bridge applies a linear transformation (value × k1 / k2) to produce engineering units.
  • Dependent tag chains. When a parent tag's value changes, the bridge automatically reads and publishes a set of dependent tags. Example: when the "recipe number" register changes, immediately read all recipe parameter registers.

These calculations must happen at the edge, inside the bridge, before data is published to MQTT. Pushing raw register values to the cloud and calculating there wastes bandwidth and adds latency.
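Both patterns are small edge-side transforms. A sketch of the scaling transform and a dependent-tag chain (the k1/k2 factors and register addresses 40200-40204 are illustrative, not from any real PLC map):

```python
def scale(raw: int, k1: int = 1, k2: int = 10, offset: float = 0.0) -> float:
    """Linear transform to engineering units: raw * k1 / k2 + offset."""
    return raw * k1 / k2 + offset

def on_parent_change(read_register):
    """Dependent-tag chain: a parent change (e.g. recipe number) triggers an
    immediate read of every dependent register."""
    return {addr: read_register(addr) for addr in range(40200, 40205)}

print(scale(4000))   # 400.0 -- matches the tenths-of-a-degree example above
```

Scaling raw register 4000 with k2=10 yields 400.0 engineering units at the edge, so the cloud never sees (or pays bandwidth for) the unscaled value.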

Link State and Health Reporting

A bridge should publish its own health status alongside machine data. The most critical metric is link state — whether the gateway can actually communicate with the PLC.

When a Modbus read fails with a connection error (timeout, connection reset, connection refused, or broken pipe), the bridge should:

  1. Set the link state to "down" and publish immediately (not batched).
  2. Close the existing Modbus connection and attempt reconnection.
  3. Continue publishing link-down status at intervals so the cloud system knows the gateway is alive but the PLC is unreachable.
  4. When reconnection succeeds, set link state to "up" and force-read all tags to re-establish baseline values.

This link state telemetry is invaluable for distinguishing between "the machine is off" and "the network cable is unplugged" — two very different problems that look identical without gateway-level diagnostics.
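A minimal sketch of this link-state logic, assuming a simple event-list interface (all names here are illustrative):

```python
class LinkMonitor:
    """Tracks PLC link state per the recovery steps above: publish
    'down' immediately on failure, repeat at intervals while down,
    publish 'up' and force a full re-read on recovery."""

    def __init__(self, repeat_interval=60):
        self.state = "up"
        self.repeat_interval = repeat_interval
        self.last_published = 0.0

    def on_read_error(self, now):
        """Connection-level Modbus failure (timeout, reset, etc.)."""
        events = []
        if self.state == "up":
            self.state = "down"
            events.append(("link_state", "down"))   # step 1: immediate
            self.last_published = now
        elif now - self.last_published >= self.repeat_interval:
            events.append(("link_state", "down"))   # step 3: keepalive
            self.last_published = now
        return events

    def on_reconnect(self, now):
        """Step 4: link restored, re-establish baseline values."""
        self.state = "up"
        self.last_published = now
        return [("link_state", "up"), ("force_read_all", True)]
```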

How machineCDN Handles Protocol Bridging

machineCDN's edge gateway was built from the ground up for exactly this problem. The gateway daemon handles Modbus RTU (serial), Modbus TCP, and EtherNet/IP on the device side, and publishes all data over MQTT with TLS to the cloud.

Key architectural decisions in the machineCDN gateway:

  • Pre-allocated page buffer with configurable page sizes for zero-allocation runtime operation.
  • Automatic contiguous register coalescing that respects function code boundaries, tag intervals, and register limits.
  • Per-tag COV comparison with an option to bypass batching for latency-critical values.
  • Calculated tag chains for bit extraction and dependent tag reads.
  • Hourly full refresh — every 60 minutes, the gateway resets all COV baselines and publishes every tag value, ensuring the cloud always has a complete snapshot even if individual change events were missed.
  • Async MQTT reconnection with certificate hot-reloading and SAS token expiration monitoring.

The result is a bridge that reliably moves data from plant-floor PLCs to cloud dashboards with sub-second latency during normal operation and zero data loss during outages lasting hours or days.

Deployment Checklist

Before deploying a Modbus-to-MQTT bridge in production:

  • Map every register — document address, data type, byte order, scaling factor, and engineering units
  • Set appropriate poll intervals — 1s for process-critical, 5–60s for environmental, 300s+ for configuration data
  • Size the buffer — calculate daily data volume and ensure the buffer can hold 24+ hours
  • Test byte ordering — verify float and 32-bit integer decoding against known PLC values before trusting the data
  • Configure COV vs periodic — boolean and alarm tags = COV, analog = periodic
  • Enable TLS — never run MQTT unencrypted on production networks
  • Monitor link state — alert on PLC disconnections, not just missing data
  • Test failover — unplug the WAN cable for 4 hours and verify data drains correctly when it reconnects
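The byte-order check in particular is easy to script. A small sketch using Python's struct module, decoding the same float under both common word orders:

```python
import struct

def decode_float(registers, word_order="big"):
    """Decode two 16-bit Modbus registers as an IEEE-754 float.
    Bytes are big-endian within each register (per the Modbus spec);
    word_order controls which register holds the high word."""
    hi, lo = registers if word_order == "big" else reversed(registers)
    return struct.unpack(">f", struct.pack(">HH", hi, lo))[0]

# 400.0 encodes as 0x43C80000; a big-word-order device sends
# [0x43C8, 0x0000], a word-swapped device sends [0x0000, 0x43C8]
big = decode_float([0x43C8, 0x0000], "big")
swapped = decode_float([0x0000, 0x43C8], "little")
```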

Protocol bridging isn't glamorous work. It's plumbing. But it's the plumbing that determines whether your IIoT deployment delivers reliable data or expensive noise. Get the bridge right, and everything downstream — analytics, dashboards, predictive maintenance — just works.

Reliable Telemetry Delivery in IIoT: Page Buffers, Batch Finalization, and Disconnection Recovery [2026]

· 13 min read

Your edge gateway reads 200 tags from a PLC every second. The MQTT connection to your cloud broker drops for 3 minutes because someone bumped the cellular antenna. What happens to the 36,000 data points collected during the outage?

If your answer is "they're gone," you have a toy system, not an industrial one.

Reliable telemetry delivery is the hardest unsolved problem in most IIoT architectures. Everyone focuses on the protocol layer — Modbus reads, EtherNet/IP connections, OPC-UA subscriptions — but the real engineering is in what happens between reading a value and confirming it reached the cloud. This article breaks down the buffer architecture that makes zero-data-loss telemetry possible on resource-constrained edge hardware.

Reliable telemetry delivery buffer architecture

The Problem: Three Asynchronous Timelines

In any edge-to-cloud telemetry system, you're managing three independent timelines:

  1. PLC read cycle — Tags are read at fixed intervals (1s, 60s, etc.). This never stops. The PLC doesn't care if your cloud connection is down.

  2. Batch collection — Raw tag values are grouped into batches by timestamp and device. Batches accumulate until they hit a size limit or a timeout.

  3. MQTT delivery — Batches are published to the broker. The broker acknowledges receipt. At QoS 1, the MQTT library handles retransmission, but only if you give it data in the right form.

These three timelines run independently. The PLC read loop runs on a tight 1-second cycle. Batch finalization might happen every 30–60 seconds. MQTT delivery depends on network availability. If any one of these stalls, the others must keep running without data loss.

This is fundamentally a producer-consumer problem with a twist: the consumer (MQTT) can disappear for minutes at a time, and the producer (PLC reads) cannot slow down.

The Batch Layer: Grouping Values for Efficient Transport

Raw tag values are tiny — a temperature reading is 4 bytes, a boolean is 1 byte. Sending each value as an individual MQTT message would be absurdly wasteful. Instead, values are collected into batches — structured payloads that contain multiple timestamped readings from one or more devices.

Batch Structure

A batch is organized as a series of groups, where each group represents one polling cycle (one timestamp, one device):

Batch
├── Group 0: { timestamp: 1709284800, device_type: 5000, serial: 12345 }
│   ├── Value: { id: 2, values: [72.4] }   // Delivery Temp
│   ├── Value: { id: 3, values: [68.1] }   // Mold Temp
│   └── Value: { id: 5, values: [12.6] }   // Flow Value
├── Group 1: { timestamp: 1709284860, device_type: 5000, serial: 12345 }
│   ├── Value: { id: 2, values: [72.8] }
│   ├── Value: { id: 3, values: [68.3] }
│   └── Value: { id: 5, values: [12.4] }
└── ...

Dual-Format Encoding: JSON vs Binary

Production edge daemons typically support two encoding formats for batches, and the choice has massive implications for bandwidth:

JSON format:

{
  "groups": [
    {
      "ts": 1709284800,
      "device_type": 5000,
      "serial_number": 12345,
      "values": [
        {"id": 2, "values": [72.4]},
        {"id": 3, "values": [68.1]}
      ]
    }
  ]
}

Binary format (same data):

Header:   F7                    (1 byte  - magic)
Groups:   00 00 00 01           (4 bytes - group count)
Group 0:  65 E1 9D C0           (4 bytes - timestamp: 1709284800)
          13 88                 (2 bytes - device type: 5000)
          00 00 30 39           (4 bytes - serial number: 12345)
          00 00 00 02           (4 bytes - value count)
Value 0:  00 02                 (2 bytes - tag id)
          00                    (1 byte  - status: OK)
          01                    (1 byte  - values count)
          04                    (1 byte  - element size: 4 bytes)
          42 90 CC CD           (4 bytes - float 72.4)
Value 1:  00 03                 (2 bytes - tag id)
          00                    (1 byte  - status: OK)
          01                    (1 byte  - values count)
          04                    (1 byte  - element size: 4 bytes)
          42 88 33 33           (4 bytes - float 68.1)

The JSON version of this payload: ~120 bytes. The binary version: ~38 bytes. That's a 3.2x reduction — and on a metered cellular connection at $0.01/MB, that savings compounds quickly when you're transmitting every 30 seconds 24/7.

The binary format uses a simple TLV-like structure: magic byte, group count (big-endian uint32), then for each group: timestamp (uint32), device type (uint16), serial number (uint32), value count (uint32), then for each value: tag ID (uint16), status byte, value count, element size, and raw value bytes. No field names, no delimiters, no escaping — just packed binary data.
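A sketch of that packing logic with Python's struct module (the field sizes follow the byte layout shown above; the function name is illustrative):

```python
import struct

MAGIC = 0xF7

def encode_batch(groups):
    """Pack groups into the TLV-like binary layout described above:
    magic byte, big-endian group count, then per group a fixed header
    followed by packed values. Only 4-byte floats are handled here."""
    out = bytearray([MAGIC])
    out += struct.pack(">I", len(groups))
    for g in groups:
        # timestamp (uint32), device type (uint16), serial (uint32)
        out += struct.pack(">IHI", g["ts"], g["device_type"], g["serial"])
        out += struct.pack(">I", len(g["values"]))
        for v in g["values"]:
            # tag id, status byte (0 = OK), value count, element size
            out += struct.pack(">HBBB", v["id"], 0, len(v["values"]), 4)
            for x in v["values"]:
                out += struct.pack(">f", x)
    return bytes(out)

batch = encode_batch([{
    "ts": 1709284800, "device_type": 5000, "serial": 12345,
    "values": [{"id": 2, "values": [72.4]}, {"id": 3, "values": [68.1]}],
}])
```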

Batch Finalization Triggers

A batch should be finalized (sealed and queued for delivery) when either condition is met:

  1. Size limit exceeded — When the accumulated batch size exceeds a configured maximum (e.g., 500KB for JSON, or when the binary buffer is 90%+ full). The 90% threshold for binary avoids the edge case where the next value would overflow the buffer.

  2. Collection timeout expired — When elapsed time since the batch started exceeds a configured maximum (e.g., 60 seconds). This ensures data flows even during quiet periods with few value changes.

if (elapsed_seconds > max_collection_time) → finalize
if (batch_size > max_batch_size) → finalize

Both checks happen after every group is closed (after every polling cycle). This means finalization granularity is tied to your polling interval — if you poll every 1 second and your batch timeout is 60 seconds, each batch will contain roughly 60 groups.
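The two triggers can be sketched as a small accumulator (thresholds here are the example values above):

```python
class Batch:
    """Tracks accumulated size and elapsed time; finalize when either
    the size limit or the collection timeout from the rules above is
    exceeded. Limits are illustrative defaults."""

    def __init__(self, max_bytes=500_000, max_seconds=60):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.size = 0
        self.started_at = None

    def add_group(self, nbytes, now):
        if self.started_at is None:
            self.started_at = now      # batch clock starts at first group
        self.size += nbytes

    def should_finalize(self, now):
        """Checked after every group is closed (every polling cycle)."""
        if self.started_at is None:
            return False
        return (self.size > self.max_bytes
                or now - self.started_at > self.max_seconds)
```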

The "Do Not Batch" Exception

Some values are too important to wait for batch finalization. Equipment alarms, pump state changes, emergency stops — these need to reach the cloud immediately. These tags are flagged as "do not batch" in the configuration.

When a do-not-batch tag changes value, it bypasses the normal batch pipeline entirely. A mini-batch is created on the spot — containing just that single value — and pushed directly to the outgoing buffer. This ensures sub-second cloud visibility for critical state changes, while bulk telemetry still benefits from batch efficiency.

Tag: "Pump Status"      interval: 1s     do_not_batch: true
Tag: "Heater Status"    interval: 1s     do_not_batch: true
Tag: "Delivery Temp"    interval: 60s    do_not_batch: false   ← normal batching
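The routing itself is a one-branch decision; a sketch with illustrative batch/buffer interfaces:

```python
def dispatch(tag, value, batch, buffer):
    """Route a reading: a do-not-batch tag becomes a single-value
    mini-batch pushed straight to the outgoing buffer; everything
    else accumulates in the normal batch pipeline."""
    if tag.get("do_not_batch"):
        buffer.append([{"id": tag["id"], "values": [value]}])  # mini-batch
    else:
        batch.append({"id": tag["id"], "values": [value]})
```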

The Buffer Layer: Surviving Disconnections

This is where most IIoT implementations fail. The batch layer produces data. The MQTT layer consumes it. But what sits between them? If it's just an in-memory queue, you'll lose everything on disconnect.

Page-Based Ring Buffer Architecture

The production-grade answer is a page-based ring buffer — a fixed-size memory region divided into equal-sized pages that cycle through three states:

States:
FREE → Available for writing
WORK → Currently being filled with batch data
USED → Filled, waiting for MQTT delivery

Lifecycle:
FREE → WORK (when first data is added)
WORK → USED (when page is full or batch is finalized)
USED → transmit → delivery ACK → FREE (recycled)

Here's how it works:

Memory layout: At startup, a contiguous block of memory is allocated (e.g., 2MB). This block is divided into pages of a configured size (matching the MQTT max packet size, typically matching the batch size). Each page has a small header tracking its state and a data area.

┌──────────────────────────────────────────────┐
│ [Page 0: USED] [Page 1: USED] [Page 2: WORK] │
│ [Page 3: FREE] [Page 4: FREE] [Page 5: FREE] │
│ [Page 6: FREE]  ...           [Page N: FREE] │
└──────────────────────────────────────────────┘

Writing data: When a batch is finalized, its serialized bytes are written to the current WORK page. Each message gets a small header: a 4-byte message ID slot (filled later by the MQTT library) and a 4-byte size field. If the current page can't fit the next message, it transitions to USED and a fresh FREE page becomes the new WORK page.

Overflow handling: When all FREE pages are exhausted, the buffer reclaims the oldest USED page — the one that's been waiting for delivery the longest. This means you lose old data rather than new data, which is the right trade-off: the most recent readings are the most valuable. An overflow warning is logged so operators know the buffer is under pressure.

Delivery: When the MQTT connection is active, the buffer walks through USED pages and publishes their contents. Each publish gets a packet ID from the MQTT library. When the broker ACKs the packet (via the PUBACK callback for QoS 1), the corresponding page is recycled to FREE.

Disconnection recovery: When the MQTT connection drops:

  1. The disconnect callback fires
  2. The buffer marks itself as disconnected
  3. Data continues accumulating in pages (WORK → USED)
  4. When reconnected, the buffer immediately starts draining USED pages

No data is lost unless the buffer physically overflows. With a 2MB buffer and 500KB pages, you get four pages of capacity — enough to survive several minutes of disconnection at typical telemetry rates.
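A condensed Python model of the page lifecycle, including the oldest-first reclaim on overflow (a real implementation packs bytes into a fixed memory region; this sketch tracks whole messages for clarity, and all names are illustrative):

```python
FREE, WORK, USED = "FREE", "WORK", "USED"

class PageBuffer:
    """Pages cycle FREE -> WORK -> USED -> (ACK) -> FREE."""

    def __init__(self, num_pages, page_capacity):
        self.pages = [{"state": FREE, "msgs": [], "bytes": 0}
                      for _ in range(num_pages)]
        self.page_capacity = page_capacity

    def _next_free(self):
        for p in self.pages:
            if p["state"] == FREE:
                return p
        # Overflow: reclaim a USED page (lose old data, not new)
        oldest = next(p for p in self.pages if p["state"] == USED)
        oldest.update(state=FREE, msgs=[], bytes=0)
        return oldest

    def add(self, msg):
        work = next((p for p in self.pages if p["state"] == WORK), None)
        if work is None or work["bytes"] + len(msg) > self.page_capacity:
            if work is not None:
                work["state"] = USED          # seal the full page
            work = self._next_free()
            work["state"] = WORK
        work["msgs"].append(msg)
        work["bytes"] += len(msg)

    def deliver_one(self):
        """Return the next USED page to publish, or None."""
        for i, p in enumerate(self.pages):
            if p["state"] == USED:
                return i, p["msgs"]
        return None

    def ack(self, index):
        """PUBACK received: only now is the page recycled to FREE."""
        self.pages[index].update(state=FREE, msgs=[], bytes=0)
```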

Thread Safety

The PLC read loop and the MQTT event loop run on different threads. The buffer must be thread-safe. Every buffer operation acquires a mutex:

  • buffer_add_data() — called from the PLC read thread after batch finalization
  • buffer_process_data_delivered() — called from the MQTT callback thread on PUBACK
  • buffer_process_connect() / buffer_process_disconnect() — called from MQTT lifecycle callbacks

Without proper locking, you'll see corrupted pages, double-free crashes, and mysterious data loss under load. This is non-negotiable.

Sizing the Buffer

Buffer sizing depends on three variables:

  1. Data rate: How many bytes per second does your polling loop produce?
  2. Expected outage duration: How long do you need to survive without MQTT?
  3. Available memory: Edge devices (especially industrial routers) have limited RAM

Example calculation:

  • 200 tags, average 6 bytes each (including binary overhead) = 1,200 bytes/group
  • Polling every 1 second = 1,200 bytes/second = 72KB/minute
  • Target: survive 30-minute outage = 2.16MB buffer
  • With 500KB pages = 5 pages minimum (round up for safety)

In practice, 2–4MB covers most scenarios. On a 32MB industrial router, that's well within budget.
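The sizing arithmetic above reduces to one formula:

```python
def buffer_size_bytes(num_tags, bytes_per_tag, poll_seconds, outage_minutes):
    """Back-of-envelope buffer sizing from the three variables above:
    data rate (tags x bytes / poll interval) times the outage window."""
    rate = num_tags * bytes_per_tag / poll_seconds   # bytes per second
    return rate * outage_minutes * 60

# The worked example: 200 tags x 6 bytes at 1s polling, 30-minute outage
size = buffer_size_bytes(200, 6, 1, 30)   # 2,160,000 bytes, i.e. 2.16MB
```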

The MQTT Layer: QoS, Reconnection, and Watchdogs

QoS 1: At-Least-Once Delivery

For industrial telemetry, QoS 1 is the right choice:

  • QoS 0 (fire and forget): No delivery guarantee. Unacceptable for production data.
  • QoS 1 (at least once): Broker ACKs every message. Duplicates possible but data loss prevented. Good trade-off.
  • QoS 2 (exactly once): Eliminates duplicates but doubles the handshake overhead. Rarely worth it for telemetry.

The page buffer's recycling logic depends on QoS 1: pages are only freed when the PUBACK arrives. If the ACK never comes (connection drops mid-transmission), the page stays in USED state and will be retransmitted after reconnection.

Connection Watchdog

MQTT connections can enter a zombie state — the TCP socket is open, the MQTT loop is running, but no data is actually flowing. This happens when network routing changes, firewalls silently drop the connection, or the broker becomes unresponsive.

The fix: a watchdog timer that monitors delivery acknowledgments. If no PUBACK has been received within a timeout window (e.g., 120 seconds) and data has been queued for transmission, force a reconnect:

if (now - last_delivered_packet_time > 120s) {
    if (has_pending_data) {
        // Force MQTT reconnection
        reset_mqtt_client();
    }
}

This catches the edge case where the MQTT library thinks it's connected but the network is actually dead. Without this watchdog, your edge daemon could silently accumulate hours of undelivered data in the buffer, eventually overflowing and losing it all.
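The check itself is trivially testable once pulled into a pure function (a sketch; the 120-second window matches the example above):

```python
def watchdog_should_reset(now, last_ack, has_pending, timeout=120.0):
    """Zombie-connection check: if data is queued for transmission but
    no PUBACK has arrived within the timeout window, force a reconnect."""
    return has_pending and (now - last_ack > timeout)
```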

Asynchronous Connection

MQTT connection establishment (DNS resolution, TLS handshake, CONNACK) can take several seconds, especially over cellular links. This must not block the PLC read loop. The connection should happen on a separate thread:

  1. Main thread detects connection is needed
  2. Connection thread starts connect_async()
  3. Main thread continues reading PLCs
  4. On successful connect, the callback fires and buffer delivery begins

If the connection thread is still working when a new connection attempt is needed, skip it — don't queue multiple connection attempts or you'll thrash the network stack.

TLS for Production

Any MQTT connection leaving your plant network must use TLS. Period. Industrial telemetry data — temperatures, pressures, equipment states, alarm conditions — is operationally sensitive. On the wire without encryption, anyone on the network path can see (and potentially modify) your readings.

For cloud brokers like Azure IoT Hub, TLS is mandatory. The edge daemon should:

  • Load the CA certificate from a PEM file
  • Use MQTT v3.1.1 protocol (widely supported, well-tested)
  • Monitor the SAS token expiration timestamp and alert before it expires
  • Automatically reinitialize the MQTT client when the certificate or connection string changes (file modification detected via stat())

Daemon Status Reporting

A well-designed edge daemon reports its own health back through the same MQTT channel it uses for telemetry. A periodic status message should include:

  • System uptime and daemon uptime — detect restarts
  • PLC link state — is the PLC connection healthy?
  • Buffer state — how full is the outgoing buffer?
  • MQTT state — connected/disconnected, last ACK time
  • SAS token expiration — days until credentials expire
  • Software version — for remote fleet management

An extended status format can include per-tag state: last read time, last delivery time, current value, and error count. This is invaluable for remote troubleshooting — you can see from the cloud exactly which tags are stale and why.

Value Comparison and Change Detection

Not all values need to be sent every polling cycle. A temperature that's been 72.4°F for the last hour doesn't need to be transmitted 3,600 times. Change detection — comparing the current value to the last sent value — can dramatically reduce bandwidth.

The implementation: each tag stores its last transmitted value. After reading, compare:

if (tag.compare_enabled && tag.has_been_read_once) {
    if (current_value == tag.last_value) {
        skip_this_value();  // Don't add to batch
    }
}

Important caveats:

  • Not all tags should use comparison. Continuous process variables (temperatures, flows) should always send, even if unchanged — the recipient needs the full time series to calculate trends and detect flatlines (a stuck sensor reads the same value forever, which is itself a fault condition).
  • Discrete state tags (booleans, enums) are ideal for comparison — they change rarely and each change is significant.
  • Floating-point comparison should use an epsilon threshold, not exact equality, to avoid sending noise from ADC jitter.
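A sketch combining these caveats, with an epsilon threshold for floats (the 0.05 default is illustrative and should match your sensor's noise floor):

```python
def should_send(tag, current, epsilon=0.05):
    """Change detection per the caveats above: tags with comparison
    disabled always send; floats compare against an epsilon threshold
    rather than exact equality, to suppress ADC jitter."""
    if not tag.get("compare_enabled") or "last_value" not in tag:
        tag["last_value"] = current
        return True          # first read, or comparison disabled
    changed = (abs(current - tag["last_value"]) > epsilon
               if isinstance(current, float)
               else current != tag["last_value"])
    if changed:
        tag["last_value"] = current
    return changed
```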

Putting It All Together: The Main Loop

The complete edge daemon main loop ties all these layers together:

1. Parse configuration (device addresses, tag lists, MQTT credentials)
2. Allocate memory (PLC config pool + output buffer)
3. Format output buffer into pages
4. Start MQTT connection thread
5. Detect PLC device (probe address, determine type/protocol)
6. Load device-specific tag configuration

MAIN LOOP (runs every 1 second):
a. Check for config file changes → restart if changed
b. Read PLC tags (coalesced Modbus/EtherNet/IP)
c. Add values to batch (with comparison filtering)
d. Check batch finalization triggers (size/timeout)
e. Process incoming commands (config updates, force reads)
f. Check MQTT connection watchdog
g. Sleep 1 second

Every component — polling, batching, buffering, delivery — operates within this single loop iteration, keeping the system deterministic and debuggable.

How machineCDN Implements This

The machineCDN edge runtime implements this full stack natively on resource-constrained industrial routers. The page-based ring buffer runs in pre-allocated memory (no dynamic allocation after startup), the MQTT layer handles Azure IoT Hub and local broker configurations interchangeably, and the batch layer supports both JSON and binary encoding selectable per-device.

On a Teltonika RUT9xx router with 256MB RAM, the daemon typically uses under 4MB total — including 2MB of buffer space that can store 20+ minutes of telemetry during a connectivity outage. Tags are automatically sorted, coalesced, and dispatched with zero configuration beyond listing the tag names and addresses.

The result: edge gateways that have been running continuously for years in production environments, surviving cellular dropouts, network reconfigurations, and even firmware updates without losing a single data point.

Conclusion

Reliable telemetry delivery isn't about the protocol — it's about the pipeline. Modbus reads are the easy part. The hard engineering is in the layers between: batching values efficiently, buffering them through disconnections, and confirming delivery before recycling memory.

The key design principles:

  1. Never block the read loop — PLC polling is sacred
  2. Buffer with finite, pre-allocated memory — dynamic allocation on embedded systems is asking for trouble
  3. Reclaim oldest data first — in overflow, recent values matter more
  4. Acknowledge before recycling — a page stays USED until the broker confirms receipt
  5. Watch for zombie connections — a connected socket doesn't mean data is flowing

Get these right, and your edge infrastructure becomes invisible — which is exactly what production IIoT should be.

Edge Computing Architecture for IIoT: Store-and-Forward, Batch Processing, and Bandwidth Optimization [2026]

· 14 min read
MachineCDN Team
Industrial IoT Experts

Here's an uncomfortable truth about industrial IoT: your cloud platform is only as reliable as the worst cellular connection on your factory floor.

And in manufacturing environments — where concrete walls, metal enclosures, and electrical noise are the norm — that connection can drop for minutes, hours, or days. If your edge architecture doesn't account for this, you're not building an IIoT system. You're building a fair-weather dashboard that goes dark exactly when you need it most.

This guide covers the architecture patterns that separate production-grade edge gateways from science projects: store-and-forward buffering, intelligent batch processing, binary serialization, and the MQTT reliability patterns that actually work when deployed on a $200 industrial router with 256MB of RAM.

Industrial OT Security for IIoT: TLS, Certificates, Network Segmentation, and Zero Trust at the Edge [2026 Guide]

· 14 min read
MachineCDN Team
Industrial IoT Experts

There's a persistent myth in manufacturing that "air-gapped" OT networks don't need security. The moment you connect a PLC to an edge gateway that publishes data to the cloud via MQTT, that air gap is gone. You've built a bridge between your operational technology and the internet, and every decision you make about that bridge — TLS configuration, certificate management, authentication, network architecture — determines whether you've built a secure connection or an open door.

This guide covers the practical security decisions for IIoT deployments, based on hard-won experience connecting industrial equipment in environments where a misconfiguration doesn't just leak data — it can affect physical processes.

Securing Industrial IoT: TLS for MQTT, OPC-UA Certificates, and Zero-Trust OT Networks [2026]

· 12 min read

Industrial OT Security Architecture

Here's an uncomfortable truth from the field: most industrial IoT deployments I've seen have at least one Modbus TCP device exposed without any authentication. No TLS. No access control. Just port 502, wide open, on a "segmented" network that's one misconfigured switch from the corporate LAN.

The excuse is always the same: "It's air-gapped." It never actually is.

This guide covers what securing industrial protocol communications looks like in practice — not the compliance checkbox version, but the engineering decisions that determine whether an attacker who lands on your OT network can read holding registers, inject false sensor data, or shut down a production line.

MQTT for Industrial IoT: QoS, Sparkplug B, and Broker Architecture Explained [2026]

· 15 min read

MQTT Industrial IoT Architecture

MQTT has become the dominant messaging protocol for Industrial IoT — and for good reason. It's lightweight enough to run on resource-constrained edge gateways, resilient enough to handle flaky cellular connections on remote sites, and flexible enough to carry everything from a single boolean alarm bit to a 500-tag batch payload from a production line.

But deploying MQTT in an industrial environment is fundamentally different from using it for consumer IoT. The stakes are higher, the data patterns are more complex, and getting the architecture wrong can mean lost production data or, worse, missed safety alarms.

This guide covers everything a plant engineer or controls integrator needs to know about running MQTT in production — from QoS level selection to broker architecture to the Sparkplug B specification that's finally bringing standardization to industrial MQTT payloads.

Why MQTT Won the Industrial IoT Protocol War

Before diving into the technical details, it's worth understanding why MQTT displaced so many competing approaches. Traditional industrial data collection relied on polling — a SCADA system or historian would periodically query PLCs via Modbus or OPC-DA, pulling register values on a fixed schedule.

This polling model has several problems at scale:

  • Bandwidth waste: Most register values don't change between polls. A temperature sensor reading 72.4°F doesn't need to be transmitted every second if it hasn't moved.
  • Latency on critical events: If a compressor fault fires 500ms after the last poll, you won't see it for another 500ms — or longer if the poll cycle is slow.
  • Scaling headaches: Every additional client polling the same PLC adds load. With 20 systems all querying the same controller, you're burning CPU cycles on the PLC answering redundant requests.

MQTT inverts this model. Instead of clients pulling data, edge devices publish data when it changes (or on a configurable interval), and any number of subscribers can consume that data without adding load to the source device.

The key insight that makes this work in industrial settings is change-of-value detection combined with periodic heartbeats. A well-designed edge gateway will:

  1. Read PLC tags on a fast cycle (typically 1-second intervals for critical tags)
  2. Compare each reading against the last delivered value
  3. Only publish to MQTT when a value actually changes
  4. Still publish unchanged values periodically (hourly is common) to confirm the connection is alive

This approach dramatically reduces bandwidth — often by 80-90% compared to blind periodic polling — while actually reducing latency for state changes since they're published immediately rather than waiting for the next poll window.
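Steps 2 through 4 can be sketched as one decision function (the hourly heartbeat default matches the text; names are illustrative):

```python
def needs_publish(tag, value, now, heartbeat_s=3600):
    """Change-of-value with a periodic heartbeat: publish when the
    value changes, or when the heartbeat interval has elapsed since
    the last publish, confirming the connection is alive."""
    changed = value != tag.get("last_sent")
    stale = now - tag.get("last_sent_at", -heartbeat_s) >= heartbeat_s
    if changed or stale:
        tag["last_sent"] = value
        tag["last_sent_at"] = now
        return True
    return False
```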

QoS Levels: Why QoS 1 Is Almost Always the Right Choice

MQTT defines three Quality of Service levels, and choosing the right one is critical in industrial deployments:

QoS 0 — Fire and Forget

The broker delivers the message at most once, with no acknowledgment. If the subscriber is disconnected, the message is lost.

When to use it: Almost never in industrial settings. The only exception is high-frequency telemetry where individual samples are expendable — vibration data at 1kHz, for example, where losing a few samples in a burst doesn't affect the analysis.

QoS 1 — At Least Once Delivery

The broker guarantees delivery but may deliver duplicates. The publisher sends the message, waits for a PUBACK from the broker, and retransmits if the acknowledgment doesn't arrive within a timeout.

When to use it: This is the standard for industrial IoT. It guarantees your alarm states and production data reach the broker, and the duplicate delivery risk is easily handled by idempotent processing on the subscriber side (if you receive the same batch timestamp twice, just ignore the duplicate).

In practice, the "at least once" guarantee is exactly what you need for event-driven tag data. When a PLC tag transitions from false to true — say a compressor fault alarm — you need assurance that transition reaches the cloud. QoS 1 provides that assurance with minimal overhead.

QoS 2 — Exactly Once Delivery

A four-step handshake (PUBLISH → PUBREC → PUBREL → PUBCOMP) guarantees exactly-once delivery. The overhead is significant — roughly 2x the round trips of QoS 1.

When to use it: Rarely justified in IIoT. The scenarios where duplicate delivery actually causes problems (financial transactions, one-time commands) are uncommon on the factory floor. The extra latency and bandwidth are almost never worth the guarantee.

The QoS 1 + Idempotent Subscriber Pattern

The production-proven pattern for industrial MQTT looks like this:

Edge Gateway ──► MQTT Broker (QoS 1) ──► Cloud Subscriber
      │                                        │
      ▼                                        ▼
Publish with message ID,          Deduplicate by batch timestamp
retry on missing PUBACK           + device serial number

Your edge device publishes each batch with a timestamp and a unique device identifier. On the subscriber side, you check whether you've already processed a message with that exact timestamp from that device. If yes, discard. If no, process and store.

This gives you effectively exactly-once semantics with QoS 1 performance.
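The subscriber-side half of the pattern is a few lines, keyed on (device serial, batch timestamp) — a sketch with an in-memory set standing in for your datastore's uniqueness check:

```python
def make_deduper():
    """Idempotent processing on top of QoS 1 redelivery: the first
    delivery of a (serial, timestamp) key is processed, any duplicate
    redelivery of the same key is discarded."""
    seen = set()
    def process(batch):
        key = (batch["serial_number"], batch["ts"])
        if key in seen:
            return False     # duplicate: discard
        seen.add(key)
        return True          # first delivery: process and store
    return process
```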

Retained Messages and Last Will: The Industrial Essentials

Two MQTT features are particularly important for industrial deployments:

Retained Messages

When a message is published with the retained flag set, the broker stores the last message on that topic and delivers it immediately to any new subscriber. This is essential for device status.

Consider the scenario: your cloud dashboard reconnects after a network outage. Without retained messages, you have no idea whether 50 devices on the factory floor are online or offline until each one publishes its next status update. With retained messages on the status topic, the dashboard gets the current state of every device the instant it subscribes.

Best practice is to publish retained messages on status/heartbeat topics, but not on telemetry topics. You don't want a new subscriber to receive a stale temperature reading from 3 hours ago as if it were current.

Last Will and Testament (LWT)

When an MQTT client connects to the broker, it can register a "last will" message — a message the broker will automatically publish if the client disconnects ungracefully (network failure, power loss, crash).

For edge gateways, the LWT should publish a status message indicating the device is offline:

{
  "cmd": "status",
  "status": "offline",
  "ts": 0
}

Combined with periodic status heartbeats (every 60 seconds is typical), this gives you a reliable presence detection system:

  • Normal operation: Edge gateway publishes status every 60 seconds → subscribers know device is online
  • Graceful shutdown: Edge gateway publishes "offline" status before disconnecting
  • Crash/power loss: Broker publishes LWT "offline" message after keepalive timeout

The keepalive interval is critical here. Too short (under 30 seconds) and you'll get false offline detections from temporary network hiccups. Too long (over 120 seconds) and there's an unacceptable delay between device failure and detection. 60 seconds is the sweet spot for most industrial deployments.
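As an illustration, here is how the LWT payload above might be registered with the paho-mqtt Python client (the topic name and host are hypothetical, and the connect call needs a reachable broker, so it is shown commented out):

```python
import json

KEEPALIVE_S = 60   # the sweet spot discussed above

def lwt_payload():
    """The offline status message registered as the Last Will."""
    return json.dumps({"cmd": "status", "status": "offline", "ts": 0})

# With paho-mqtt (illustrative topic/host):
#
#   import paho.mqtt.client as mqtt
#   client = mqtt.Client()
#   client.will_set("devices/gw-01/status", lwt_payload(),
#                   qos=1, retain=True)
#   client.connect("broker.example.com", 8883, keepalive=KEEPALIVE_S)
```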

Sparkplug B: Standardizing Industrial MQTT Payloads

The biggest challenge with raw MQTT in industrial settings has always been payload format. MQTT is transport-agnostic — it doesn't care whether you're sending JSON, binary, Protobuf, or plain text. This flexibility is a double-edged sword.

Without a standard, every integration becomes bespoke. One vendor sends JSON with camelCase keys, another uses snake_case, a third sends raw binary with a custom header format. Your cloud platform needs custom parsers for each.

Sparkplug B (now an Eclipse Foundation specification) solves this by defining:

  1. Topic namespace: spBv1.0/{group_id}/{message_type}/{edge_node_id}/{device_id}
  2. Payload format: Google Protocol Buffers (Protobuf) with a defined schema
  3. State management: Birth/death certificates, metric definitions, and state machines
  4. Data types: Boolean, integer (8/16/32/64 bit signed and unsigned), float, double, string, bytes, datetime
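The topic namespace is simple enough to build with a join (a sketch, not a Sparkplug library; the example IDs are hypothetical):

```python
def sparkplug_topic(group_id, message_type, edge_node_id, device_id=None):
    """Build a Sparkplug B topic per the namespace above:
    spBv1.0/{group_id}/{message_type}/{edge_node_id}[/{device_id}].
    device_id is present only for device-level messages (DBIRTH, DDATA)."""
    parts = ["spBv1.0", group_id, message_type, edge_node_id]
    if device_id:
        parts.append(device_id)
    return "/".join(parts)
```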

The Sparkplug State Machine

Sparkplug introduces a formal state machine for edge nodes and devices:

                ┌─────────┐
Power On ──────►│ OFFLINE │
                └────┬────┘
                     │ NBIRTH published
                     ▼
                ┌─────────┐
                │ ONLINE  │◄──── NDATA published
                └────┬────┘      (periodic updates)
                     │
          ┌──────────┼──────────┐
          │          │          │
     Lost Conn    NDEATH      Broker
    (LWT fires)  published    restart
          │          │          │
          ▼          ▼          ▼
                ┌─────────┐
                │ OFFLINE │──── Reconnect ────► NBIRTH
                └─────────┘

The birth certificate (NBIRTH) contains the complete metric definition for the edge node — every tag name, data type, and current value. This means a new subscriber can immediately understand the full data model without any out-of-band configuration.

Why Sparkplug B Matters for Scale

If you're connecting 5 devices to a single cloud platform, the payload format barely matters. At 500 or 5,000 devices across multiple sites, standardization becomes critical.

Sparkplug's use of Protobuf also provides significant bandwidth savings over JSON. A typical 50-tag batch that might be 2-3KB in JSON compresses to 400-600 bytes in Sparkplug Protobuf format — a 4-5x reduction that matters when you're pushing data over cellular connections with per-MB pricing.

Broker Architecture for Industrial Deployments

The MQTT broker is the single most critical component in your IIoT data pipeline. Every message flows through it, and if it goes down, your entire data collection stops.

Single Broker vs. Broker Cluster

For a single-site deployment with under 100 devices, a single broker instance (Mosquitto, HiveMQ, EMQX) on a dedicated VM is sufficient. Mosquitto can comfortably handle 10,000+ concurrent connections and 50,000+ messages/second on modest hardware (2 cores, 4GB RAM).

For multi-site or high-availability deployments, you need a clustered broker:

```
Site A Edge Gateways ──► Local Broker ──┐
                                        ├──► Cloud Broker Cluster
Site B Edge Gateways ──► Local Broker ──┘      (3-node minimum)
                                                      │
                                                      ▼
                                             Cloud Subscribers
                                       (Dashboards, Analytics,
                                        Historians, Alerting)
```

The local broker pattern is important: each site runs its own MQTT broker, which bridges to the cloud cluster. This provides:

  • Store-and-forward: If the WAN connection drops, the local broker queues messages and delivers them when connectivity returns
  • Local subscribers: Site-level dashboards and alarm systems can subscribe to the local broker with sub-millisecond latency
  • Reduced WAN traffic: The local broker can aggregate and compress data before forwarding
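In Mosquitto, the local-broker pattern is configured with a bridge section. The fragment below is illustrative — the hostname, topic prefix, and certificate paths are placeholders, and directive availability should be checked against your Mosquitto version's documentation:

```conf
# mosquitto.conf on the site-local broker (illustrative values)
connection site-a-to-cloud
address cloud-broker.example.com:8883

# Forward site telemetry upstream at QoS 1
topic plant/site-a/# out 1

# TLS with a device-specific client certificate
bridge_cafile /etc/mosquitto/certs/ca.crt
bridge_certfile /etc/mosquitto/certs/site-a.crt
bridge_keyfile /etc/mosquitto/certs/site-a.key

# Persistent session so queued messages survive WAN drops
cleansession false
```

With `cleansession false`, the local broker queues outbound messages during a WAN outage and delivers them when the bridge reconnects — the store-and-forward behavior described above.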

TLS Configuration for Industrial MQTT

MQTT over TLS (port 8883) is non-negotiable for any production deployment. The configuration details matter:

  1. Certificate management: Use device-specific certificates, not shared keys. Each edge gateway should have its own client certificate signed by your CA. When a device is decommissioned, revoke its certificate without affecting the rest of the fleet.

  2. Protocol version: TLS 1.2 minimum. TLS 1.3 preferred where both client and broker support it.

  3. Certificate rotation: Plan for certificate expiry. In industrial environments, devices may run for years. Set certificate validity to 2-5 years and implement a rotation mechanism (OPC-UA has built-in certificate management; for MQTT, you'll need a custom solution or a device management platform).

  4. Token expiry monitoring: If you're using SAS tokens (common with Azure IoT Hub), monitor the expiry timestamp. An expired token means silent disconnection — your edge gateway will fail to reconnect and you won't get an error unless you're checking. Best practice: compare the token's se (expiry) timestamp against current system time on every connection attempt and log a warning when within 7 days of expiry.
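The token-expiry check in point 4 is simple to implement. The sketch below assumes the common Azure IoT Hub SAS token shape — a query-string-style token whose `se` field is a Unix timestamp; the function name and 7-day threshold are choices for this example:

```python
import time
from urllib.parse import parse_qs

EXPIRY_WARNING_SECONDS = 7 * 24 * 3600  # warn within 7 days of expiry

def sas_token_status(token, now=None):
    """Classify a SAS token by its `se` (expiry) field.

    Returns "expired", "expiring_soon", or "valid". Call this on every
    connection attempt and log a warning on "expiring_soon" — an expired
    token otherwise fails silently as a refused reconnect.
    """
    now = time.time() if now is None else now
    # Tokens look like: SharedAccessSignature sr=...&sig=...&se=1709136000
    fields = parse_qs(token.split(" ", 1)[-1])
    expiry = int(fields["se"][0])
    if expiry <= now:
        return "expired"
    if expiry - now < EXPIRY_WARNING_SECONDS:
        return "expiring_soon"
    return "valid"
```

Wiring this into the connect path turns a silent disconnection into an actionable log line a week before it happens.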

Connection Resilience

Industrial networks are unreliable. Cellular connections drop, site VPNs flap, firewalls time out idle connections. Your MQTT client implementation must handle all of these gracefully:

  • Automatic reconnection: Use mosquitto_reconnect_delay_set() or equivalent to configure the reconnect delay — either a fixed interval or exponential backoff. A fixed 5-second retry is appropriate for small single-site deployments; for large fleets, prefer exponential backoff so that hundreds of clients reconnecting after an extended outage don't hammer the broker simultaneously.

  • Asynchronous connection: Never block the main data collection loop waiting for MQTT to connect. Run the connection process in a background thread so PLC tag reading continues even when MQTT is down. Buffer the data locally and deliver it when connectivity returns.

  • Clean session = false: Set clean_session to false (MQTT 3.1.1) or use persistent sessions (MQTT 5.0) so the broker maintains your subscription state across reconnections. This prevents missing messages during brief disconnections.
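For the backoff case, a jittered exponential schedule is a few lines of code. This is a generic sketch — the base, cap, and 10% jitter values are illustrative and should be tuned per deployment:

```python
import random

def reconnect_delays(base=1.0, cap=60.0, attempts=8):
    """Exponential backoff schedule with jitter for MQTT reconnects.

    Delay doubles each attempt, capped at `cap` seconds; jitter spreads
    out reconnection storms when many clients lose the broker at once.
    """
    delays = []
    for n in range(attempts):
        delay = min(cap, base * (2 ** n))
        delays.append(delay + random.uniform(0, delay * 0.1))  # up to 10% jitter
    return delays
```

The first few retries recover quickly from a blip; later retries settle near the cap so an hours-long outage generates only a trickle of connection attempts.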

Batching: The Performance Multiplier Nobody Talks About

One of the most impactful optimizations for industrial MQTT is intelligent batching — grouping multiple tag values into a single MQTT publish rather than publishing each tag individually.

Why Batching Matters

Consider a device with 100 tags, all updating every second. Without batching, that's 100 MQTT publishes per second — 100 TCP round trips, 100 broker message handling operations, 100 subscriber deliveries.

With batching, you group all tags that changed in the same read cycle into a single message. The structure typically looks like:

```json
{
  "cmd": "data",
  "ts": 1709136000,
  "sn": 16842753,
  "type": 1017,
  "groups": [
    {
      "ts": 1709136000,
      "values": [
        [1, 0, 0, 0, [1]],
        [80, 0, 0, 0, [724]],
        [82, 0, 0, 0, [185]]
      ]
    }
  ]
}
```

Each value entry carries the tag ID, status, and value(s) — compact enough that 50 tags fit in under 1KB. The result: 1 MQTT publish per second instead of 100, with identical data delivered.

Batch Size and Timeout Tuning

Two parameters control batching behavior:

  • Max batch size (bytes): The maximum payload size before the batch is flushed. 500KB is a reasonable upper limit — large enough to hold hundreds of tags but small enough to avoid memory pressure on constrained edge hardware.

  • Batch timeout (seconds): The maximum time a batch can be held open before flushing, regardless of size. This ensures low-frequency data gets delivered promptly. 5-10 seconds is typical.
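Both flush conditions — size and timeout — can be captured in a small batcher. This is a simplified single-threaded sketch (the class name and the JSON payload shape are choices for this example; a real gateway would flush on a timer, not only on `add`):

```python
import json
import time

class TagBatcher:
    """Accumulate tag samples; flush when the batch exceeds a byte limit
    or has been open longer than the timeout.

    `publish` is any callable that takes the serialized payload
    (e.g. an MQTT client's publish method).
    """

    def __init__(self, publish, max_bytes=500_000, timeout_s=5.0):
        self.publish = publish
        self.max_bytes = max_bytes
        self.timeout_s = timeout_s
        self.values = []
        self.opened_at = None

    def add(self, tag_id, status, value, now=None):
        now = time.time() if now is None else now
        if self.opened_at is None:
            self.opened_at = now  # batch opens with its first sample
        self.values.append([tag_id, status, value])
        if self._size() >= self.max_bytes or now - self.opened_at >= self.timeout_s:
            self.flush(now)

    def _size(self):
        return len(json.dumps(self.values).encode())

    def flush(self, now=None):
        if not self.values:
            return
        now = time.time() if now is None else now
        self.publish(json.dumps({"ts": int(now), "values": self.values}))
        self.values = []
        self.opened_at = None
```

The timeout guarantees that a slow-changing tag added just after a flush still reaches subscribers within one timeout interval.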

The Exception: Critical Alarms

Not every tag should be batched. Safety-critical alarms — compressor faults, high-pressure switches, flow switch failures — should bypass the batch entirely and be published immediately as individual messages.

The pattern: tag your alarm points with a "do not batch" flag. When these tags change value, publish them immediately via a direct MQTT publish, bypassing the batching layer. The latency difference between a batched delivery (up to 10 seconds) and a direct publish (under 100ms) can be the difference between catching a fault early and a costly shutdown.
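The routing decision itself is tiny. In this sketch, `no_batch` is an assumed per-tag configuration flag, and the batcher can be anything with an `append`-style interface:

```python
def route_tag(tag, batcher, publish_now):
    """Send alarm-flagged tags directly; batch everything else.

    `tag` is a dict assumed to carry a `no_batch` flag set at
    configuration time; `publish_now` is the direct MQTT publish path.
    """
    if tag.get("no_batch"):
        publish_now(tag)      # immediate delivery, bypasses the batch
    else:
        batcher.append(tag)   # normal batched path
```

The important design point is that the flag lives in tag configuration, not application code — commissioning engineers decide which points are safety-critical without a firmware change.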

Binary vs. JSON Payloads: The Bandwidth Tradeoff

For industrial MQTT, you have two practical payload format choices:

JSON

  • Pros: Human-readable, easy to debug, universally parsed
  • Cons: Verbose, ~3-5x larger than binary equivalents
  • Best for: Development, debugging, small deployments, or when bandwidth isn't a concern

Binary (Custom or Protobuf)

  • Pros: Compact (often 4-5x smaller than JSON), faster to serialize/deserialize
  • Cons: Requires schema documentation, harder to debug
  • Best for: Production deployments with cellular connectivity, large tag counts, or bandwidth-constrained environments

A well-designed binary format packs each tag value into a fixed-width structure: 2 bytes for tag ID, 1 byte for status, 1 byte for type, and 2-4 bytes for the value. A 50-tag batch becomes ~300 bytes instead of 2-3KB in JSON.
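That fixed-width layout maps directly onto a struct definition. The field layout below matches the description above but is otherwise illustrative — real binary formats vary by vendor and need schema documentation:

```python
import struct

# Fixed-width tag entry: 2-byte tag ID, 1-byte status, 1-byte type,
# 2-byte signed value. '<' = little-endian, no alignment padding.
TAG_ENTRY = struct.Struct("<HBBh")

def pack_batch(samples):
    """Pack (tag_id, status, type, value) tuples into one binary batch."""
    return b"".join(TAG_ENTRY.pack(*s) for s in samples)
```

At 6 bytes per entry, a 50-tag batch is exactly 300 bytes — consistent with the ~300-byte figure above, versus 2-3KB for the equivalent JSON.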

The practical recommendation: start with JSON during development and commissioning (the ability to read raw payloads in a debug tool is invaluable), then switch to binary for production when bandwidth matters.

Store-and-Forward: Don't Lose Data During Outages

The most common failure mode in industrial MQTT is losing data during connectivity outages. The edge gateway reads values from PLCs, tries to publish to MQTT, fails because the broker is unreachable, and... drops the data.

A production-grade edge gateway needs a local buffer that stores data when MQTT is disconnected and delivers it in order when connectivity returns.

The buffer architecture should:

  1. Pre-allocate memory: Don't dynamically allocate during operation. Pre-allocate a fixed buffer (512KB to 8MB depending on available RAM) and divide it into fixed-size pages.
  2. Use a page-based queue: Data flows into a "work page" until it's full, then the page moves to a "ready" queue. When MQTT is connected, pages are transmitted in order.
  3. Handle overflow gracefully: When the buffer is full and new data arrives, overwrite the oldest undelivered page (not the newest). In an extended outage, you want the most recent data, not the oldest.
  4. Track delivery confirmation: Don't free a buffer page until the MQTT PUBACK confirms the broker received it. If the connection drops mid-delivery, the page stays in the queue for retry.

This architecture ensures zero data loss during outages of minutes to hours (depending on buffer size and data rate) without any disk I/O — critical for edge devices running on flash storage where write endurance is a concern.
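The four requirements above can be sketched as a small in-memory page queue. This is a simplified single-threaded illustration — a production gateway adds locking and integrity checks, and the class and method names here are made up for the sketch:

```python
from collections import deque

class PageBuffer:
    """Pre-allocated, page-based store-and-forward queue (sketch).

    Pages flow work -> ready -> in-flight; a page is recycled only after
    the broker's PUBACK confirms delivery. On overflow, the OLDEST ready
    page is overwritten so the freshest data survives an outage.
    """

    def __init__(self, num_pages=8, page_size=4096):
        self.page_size = page_size
        # Requirement 1: all memory allocated up front, none at runtime.
        self.free = deque(bytearray(page_size) for _ in range(num_pages))
        self.ready = deque()   # full pages awaiting transmission, in order
        self.inflight = {}     # msg_id -> page awaiting PUBACK
        self.work = None
        self.work_len = 0

    def write(self, data):
        if self.work is None:
            self.work = self._get_page()
            self.work_len = 0
        if self.work_len + len(data) > self.page_size:
            # Requirement 2: full work page joins the ready queue.
            self.ready.append((self.work, self.work_len))
            self.work = self._get_page()
            self.work_len = 0
        self.work[self.work_len:self.work_len + len(data)] = data
        self.work_len += len(data)

    def _get_page(self):
        if self.free:
            return self.free.popleft()
        # Requirement 3: on overflow, sacrifice the oldest ready page.
        page, _ = self.ready.popleft()
        return page

    def next_to_send(self, msg_id):
        page, length = self.ready.popleft()
        self.inflight[msg_id] = page
        return bytes(page[:length])

    def on_puback(self, msg_id):
        # Requirement 4: free the page only once delivery is confirmed.
        self.free.append(self.inflight.pop(msg_id))
```

Because pages are recycled rather than freed, the buffer's memory footprint is constant regardless of outage length — only the amount of retained history varies.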

How machineCDN Handles Industrial MQTT

machineCDN's edge infrastructure implements all of the patterns described above. The edge gateway handles multi-protocol tag reading (Modbus RTU, Modbus TCP, EtherNet/IP), intelligent batching with change-of-value detection, and resilient MQTT delivery with a page-based store-and-forward buffer.

The platform supports both JSON and binary payload formats, configurable per device. Critical alarm tags can be flagged for immediate delivery, bypassing the batch. And the MQTT connection layer handles automatic reconnection with proper keepalive management — including SAS token expiry monitoring for Azure IoT Hub deployments.

For teams deploying MQTT in industrial environments, the combination of protocol-native tag reading and production-grade MQTT delivery eliminates the most common integration pitfalls — and lets engineers focus on the process data rather than the plumbing.

Key Takeaways

  1. Use QoS 1 with idempotent subscribers — it's the right balance for industrial data
  2. Implement change-of-value detection at the edge to reduce bandwidth by 80-90%
  3. Batch tag values into single publishes, but bypass the batch for critical alarms
  4. Build a store-and-forward buffer that pre-allocates memory and tracks delivery confirmation
  5. Use TLS with device-specific certificates — shared keys are a security liability at scale
  6. Deploy local brokers at each site to provide resilience and local subscriptions
  7. Consider Sparkplug B if you're connecting devices from multiple vendors or scaling past 100 endpoints
  8. Monitor connection health actively — check keepalive timers, token expiry, and buffer utilization

MQTT is not just a protocol choice — it's an architecture decision. Get the broker topology, QoS level, and buffering strategy right, and you'll have a data pipeline that's resilient enough for real industrial operations.

Protocol Bridging in IIoT: Translating Between Modbus, EtherNet/IP, and MQTT at the Edge [2026]

· 14 min read

Every manufacturing plant is a polyglot. Modbus RTU on the serial bus. Modbus TCP on the local network. EtherNet/IP talking to Allen-Bradley PLCs. And now someone wants all of that data in the cloud via MQTT.

Protocol bridging at the edge is the unglamorous but critical work that makes IIoT actually function. Get it right, and you have a seamless data pipeline from a 20-year-old Modbus RTU device to a modern cloud analytics platform. Get it wrong, and you have data gaps, crashed connections, and a plant floor that's lost trust in your "smart factory" initiative.

This guide covers the architecture, pitfalls, and hard-won lessons from building protocol bridges that run in production — not just in proof-of-concepts.