MQTT Store-and-Forward for IIoT: Building Bulletproof Edge-to-Cloud Pipelines [2026]

· 12 min read

Factory networks go down. Cellular modems lose signal. Cloud endpoints hit capacity limits. VPN tunnels drop for seconds or hours. And through all of it, your PLCs keep generating data that cannot be lost.

Store-and-forward buffering is the difference between an IIoT platform that works in lab demos and one that survives a real factory. This guide covers the engineering patterns — memory buffer design, connection watchdogs, batch queuing, and delivery confirmation — that keep telemetry flowing even when the network doesn't.

MQTT store-and-forward buffering for industrial IoT

Why MQTT QoS Isn't Enough

MQTT Quality of Service levels provide delivery guarantees:

  • QoS 0: Fire and forget — no acknowledgment
  • QoS 1: At least once — broker ACKs receipt
  • QoS 2: Exactly once — four-step handshake

Most IIoT deployments use QoS 1 for telemetry. The broker confirms receipt, and the client can safely discard the message. Problem solved, right?

Not quite. QoS guarantees only work while the MQTT connection is alive. When the connection drops — and on factory cellular links, it drops often — you have three problems:

  1. In-flight messages are lost. The client published, but the broker never received the ACK.
  2. New data keeps arriving. The PLC doesn't stop reading tags because the cloud is unreachable.
  3. Reconnection takes time. TLS handshakes over cellular can take 5–15 seconds. DNS resolution, certificate validation, and SAS token exchange add more.

During a 2-minute network outage on a system polling 40 tags every 5 seconds, you generate roughly 960 tag values that MQTT QoS cannot protect. Store-and-forward bridges this gap.
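To size that gap for other configurations, the same arithmetic can be sketched in a few lines (Python purely for illustration; the function name is mine):

```python
def values_at_risk(tag_count: int, poll_interval_s: float, outage_s: float) -> int:
    """Tag values generated during an outage that MQTT QoS alone cannot protect."""
    polls = int(outage_s // poll_interval_s)
    return polls * tag_count

# The example from the text: 40 tags, 5-second polls, 2-minute outage.
print(values_at_risk(40, 5, 120))  # 960
```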

The Architecture: Buffer Between Producer and Transport

The fundamental pattern places a persistent buffer between the data producer (PLC reader) and the data transport (MQTT client):

┌──────────┐    ┌─────────────┐    ┌──────────┐    ┌──────────┐
│   PLC    │───▶│    Batch    │───▶│  Buffer  │───▶│   MQTT   │
│  Reader  │    │ Accumulator │    │  (S&F)   │    │  Client  │
└──────────┘    └─────────────┘    └──────────┘    └──────────┘
                                         │
                                    ┌────┴────┐
                                    │  Page   │
                                    │  Pool   │
                                    │  (RAM)  │
                                    └─────────┘

The PLC reader polls tags on fixed intervals. The batch accumulator groups values by timestamp and serializes them into binary frames. When a batch is complete, it's written to the buffer — not directly to MQTT. The buffer manages a pool of memory pages and feeds them to the MQTT client as the network allows.

Why Not Just Use the MQTT Client's Internal Queue?

Most MQTT libraries (Mosquitto, Paho, etc.) have internal message queues. But they have critical limitations:

  1. Fixed queue depth, not byte-based — you can't predict how many messages fit in available RAM
  2. No delivery confirmation backpressure — when the queue fills, new messages are silently dropped
  3. No page-based memory management — memory fragmentation on long-running embedded systems causes crashes
  4. No visibility — you can't inspect, count, or prioritize queued messages

A purpose-built store-and-forward buffer solves all four problems.

Page-Based Memory Buffer Design

On resource-constrained edge gateways (32–128 MB RAM total, running Linux + MQTT + PLC drivers), memory management must be deterministic. Malloc/free patterns that work on servers cause fragmentation-induced OOM kills on embedded systems running for months without restart.

The Page Pool Architecture

Pre-allocate a fixed memory region at startup (e.g., 2 MB) and divide it into fixed-size pages:

┌────────────────────────────────────────────────┐
│               2 MB Buffer Memory               │
├──────┬──────┬──────┬──────┬──────┬──────┬──────┤
│ Page │ Page │ Page │ Page │ Page │ Page │ Page │
│  0   │  1   │  2   │  3   │  4   │  5   │  6   │
│ 256K │ 256K │ 256K │ 256K │ 256K │ 256K │ ...  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┘

Each page has a fixed structure:

┌─────────────────────────────────┐
│ Page Header                     │
│  ├─ Message ID (4 bytes)        │
│  ├─ Message Size (4 bytes)      │
│  ├─ Write Pointer               │
│  ├─ Read Pointer                │
│  ├─ Page Number                 │
│  └─ Next Page Pointer           │
├─────────────────────────────────┤
│ Payload Data                    │
│ (up to max_block_size bytes)    │
└─────────────────────────────────┘
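A minimal sketch of that header layout in code — the 4-byte Message ID and Message Size fields come from the diagram, while the 2-byte widths assumed for the remaining fields are my own choice:

```python
import struct

# Little-endian header mirroring the diagram: msg_id and msg_size are
# 4 bytes each (per the text); the 16-bit widths for write_ptr, read_ptr,
# page_no, and next_page are assumptions for illustration.
HEADER_FMT = "<IIHHHH"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 16 bytes with these widths

def pack_header(msg_id, msg_size, write_ptr, read_ptr, page_no, next_page):
    return struct.pack(HEADER_FMT, msg_id, msg_size,
                       write_ptr, read_ptr, page_no, next_page)

def unpack_header(raw):
    return struct.unpack(HEADER_FMT, raw[:HEADER_SIZE])
```

Fixed-width packing like this keeps every page header the same size, so the payload area always starts at a known offset.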

Pages are organized into three linked lists:

  • Free list: Available pages ready for writing
  • Write queue: Pages currently being filled with batch data
  • Send queue: Pages waiting for MQTT transmission + ACK

Minimum Page Count

The buffer needs at minimum 3 pages to function:

  1. One page being written (current batch accumulation)
  2. One page being sent (in-flight to MQTT broker)
  3. One page free (ready for the next write when the current page fills)

In practice, allocate 8–16 pages. The more pages, the longer you can buffer during extended outages.

Page Lifecycle

FREE → WRITING → QUEUED → SENDING → (ACK received) → FREE
                    ▲         │
                    └─────────┘  (NACK/timeout → retry)
  1. FREE → WRITING: Batch accumulator requests a page from the free list. Writes serialized tag data into the payload area.
  2. WRITING → QUEUED: Batch finalization triggers (size limit or timeout). Page moves to the send queue.
  3. QUEUED → SENDING: MQTT client is connected and idle. Buffer dequeues the oldest page and publishes its payload.
  4. SENDING → FREE: MQTT broker sends PUBACK (QoS 1). Page is recycled to the free list.
  5. SENDING → QUEUED: No PUBACK received within timeout. Page returns to the send queue for retry.
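The lifecycle above can be sketched as a small pool object (Python for illustration; a production embedded implementation would operate on pre-allocated fixed-size pages, and the single in-flight slot assumes one un-ACKed publish at a time):

```python
from collections import deque

class PagePool:
    """Sketch of the FREE → WRITING → QUEUED → SENDING lifecycle.
    Pages are plain dicts here, not pre-allocated memory regions."""

    def __init__(self, page_count=8):
        self.free = deque({"payload": None} for _ in range(page_count))
        self.send_queue = deque()   # QUEUED pages, FIFO order
        self.in_flight = None       # the single SENDING page

    def begin_write(self):                  # FREE → WRITING
        return self.free.popleft() if self.free else None

    def finalize(self, page, payload):      # WRITING → QUEUED
        page["payload"] = payload
        self.send_queue.append(page)

    def dequeue_for_send(self):             # QUEUED → SENDING
        if self.in_flight is None and self.send_queue:
            self.in_flight = self.send_queue.popleft()
        return self.in_flight

    def on_puback(self):                    # SENDING → FREE (ACK received)
        page, self.in_flight = self.in_flight, None
        page["payload"] = None
        self.free.append(page)

    def on_timeout(self):                   # SENDING → QUEUED (retry)
        page, self.in_flight = self.in_flight, None
        self.send_queue.appendleft(page)    # re-queue at head to keep FIFO
```

Re-queuing a timed-out page at the head of the send queue preserves delivery ordering across retries.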

What Happens When the Buffer Is Full

When all pages are occupied and a new batch needs a page, you have two options:

Option A: Drop oldest data. Recycle the oldest queued page to the free list. You lose historical data but keep current readings flowing. This is the right choice for process monitoring where recent data matters most.

Option B: Block the producer. Stop polling PLCs until a page frees up. This prevents data loss but creates gaps in the time series. This is the right choice for compliance-critical applications (FDA, EPA) where every reading must be preserved.

Most production systems use Option A with an alert — if the buffer fills, something is seriously wrong with the network, and the operations team needs to know.
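Both policies reduce to a single decision in the page-acquisition path. A sketch, assuming deque-based free and send lists:

```python
from collections import deque

def acquire_page(free: deque, send_queue: deque, drop_oldest: bool = True):
    """Get a page for a new batch when the pool may be exhausted.
    drop_oldest=True  → Option A: recycle the oldest queued page
                        (lose history, keep fresh data flowing)
    drop_oldest=False → Option B: return None so the caller blocks
                        the producer (no loss, but a time-series gap)"""
    if free:
        return free.popleft()
    if drop_oldest and send_queue:
        return send_queue.popleft()  # oldest buffered batch is discarded
    return None  # Option B: pause PLC polling until a page frees up
```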

Connection Management: The Watchdog Pattern

MQTT connections on industrial networks fail in subtle ways. The TCP socket might stay open while the broker stops responding. The TLS session might expire silently. The cellular modem might report "connected" while actually in a radio dead zone.

The Idle Watchdog

Track the timestamp of the last meaningful MQTT activity — any incoming message, any PUBACK, any SUBACK. If no activity occurs for a configurable timeout (typically 120 seconds), force a disconnect and reconnect:

on every MQTT activity:
    watchdog_timestamp = now()

every 10 seconds:
    elapsed = now() - watchdog_timestamp
    if elapsed > WATCHDOG_TIMEOUT:
        force_disconnect()
        schedule_reconnect()

This catches the "zombie connection" scenario that MQTT keepalive alone doesn't handle reliably over NAT and cellular networks.
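A runnable version of the watchdog pseudocode above — `on_stale` is a hypothetical callback the caller wires to its disconnect/reconnect logic:

```python
import threading
import time

class IdleWatchdog:
    """Activity-based watchdog: fire on_stale if no meaningful MQTT
    traffic (PUBACK, SUBACK, incoming message) within timeout_s."""

    def __init__(self, timeout_s, on_stale):
        self.timeout_s = timeout_s
        self.on_stale = on_stale
        self._last_activity = time.monotonic()
        self._lock = threading.Lock()

    def touch(self):
        """Call from every MQTT callback that indicates real activity."""
        with self._lock:
            self._last_activity = time.monotonic()

    def check(self):
        """Run periodically (e.g. every 10 s). Returns True if stale."""
        with self._lock:
            elapsed = time.monotonic() - self._last_activity
        if elapsed > self.timeout_s:
            self.on_stale()
            self.touch()  # reset so it doesn't re-fire every check cycle
            return True
        return False
```

Using `time.monotonic()` rather than wall-clock time keeps the watchdog immune to NTP adjustments, which matter on gateways that sync their clocks after reconnecting.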

Asynchronous Reconnection

Reconnection must be non-blocking. If the main PLC reading loop blocks on mqtt_connect(), you stop collecting data during the reconnection attempt — defeating the entire purpose of store-and-forward.

Run the MQTT reconnection in a separate thread:

Main Thread:
    while true:
        read PLC tags
        accumulate batch
        if batch complete:
            write to buffer        ← always succeeds (local memory)

MQTT Thread:
    while true:
        if not connected:
            attempt_connect()      ← may block for seconds
            if connected:
                notify_buffer(CONNECTED)
                subscribe_to_topics()
        else:
            mqtt_loop()            ← handles keepalive, PUBACK, incoming messages

The main thread never waits for the network. It writes to the buffer regardless. The MQTT thread drains the buffer when connectivity permits.
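The two-thread split can be simulated end to end with a queue standing in for the page buffer (names and timings here are illustrative, not a real MQTT client):

```python
import queue
import threading
import time

buffer = queue.Queue()          # stands in for the page-based S&F buffer
connected = threading.Event()   # set by the MQTT thread on connect

def plc_reader(batches, poll_interval=0.01):
    """Producer: never touches the network; buffer writes always succeed."""
    for batch in batches:
        buffer.put(batch)
        time.sleep(poll_interval)

def mqtt_drainer(published, stop):
    """Transport: drains the buffer only while 'connected'."""
    while not stop.is_set():
        if connected.is_set():
            try:
                published.append(buffer.get(timeout=0.05))
            except queue.Empty:
                pass
        else:
            time.sleep(0.02)  # reconnect attempts would happen here
```

Running the producer while `connected` is clear shows the buffer absorbing an outage; setting the event drains everything in FIFO order.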

Connection State Events

When the MQTT connection changes state, the buffer needs to know:

On connect:

  • Start draining the send queue
  • Deliver a status report (daemon version, device info, uptime) so the cloud knows the gateway is alive
  • Resume normal telemetry flow

On disconnect:

  • Stop attempting to publish (avoid blocking on dead sockets)
  • Continue accumulating data into buffer pages
  • Let the watchdog/reconnect loop handle recovery

Command Queue: Bidirectional Communication

Store-and-forward isn't just for outbound telemetry. The cloud also sends commands to edge devices — configuration updates, tag interval changes, on-demand read requests, status queries.

The Inbound Command Pattern

Incoming MQTT messages are parsed and placed into a command queue (a simple linked list of command objects):

┌─────────────┐
│ MQTT Message│
│   (JSON)    │
└──────┬──────┘
       ▼
┌──────────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Parse JSON   │───▶│ Command │───▶│ Command │───▶│ Command │
│ Extract "cmd"│    │  Queue  │    │ (next)  │    │ (next)  │
└──────────────┘    └────┬────┘    └─────────┘    └─────────┘
                         ▼
                  ┌───────────────┐
                  │   Main Loop   │
                  │  (dispatch)   │
                  └───────────────┘

Commands are dispatched in the main loop — not in the MQTT callback thread. This prevents long-running operations (like reconfiguring PLC tags) from blocking MQTT keepalive processing.
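A sketch of the enqueue-in-callback, dispatch-in-main-loop split — the handler functions here are hypothetical stand-ins, not a real command set:

```python
import json
from collections import deque

command_queue = deque()  # filled by the MQTT callback, drained by the main loop

def on_mqtt_message(payload: bytes):
    """MQTT callback: parse and enqueue only — never execute commands here."""
    try:
        msg = json.loads(payload)
        command_queue.append((msg["cmd"], msg.get("args", {})))
    except (ValueError, KeyError):
        pass  # malformed command: a real system would count and log this

# Hypothetical handlers keyed by command name
HANDLERS = {
    "get_status": lambda args: "status-report",
    "read_now":   lambda args: "read:" + args.get("tag", "?"),
}

def dispatch_pending():
    """Called from the main loop between PLC polls."""
    results = []
    while command_queue:
        cmd, args = command_queue.popleft()
        handler = HANDLERS.get(cmd)
        if handler:
            results.append(handler(args))
    return results
```

Because the callback only appends, it returns in microseconds and keepalive processing is never starved by a slow command.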

Supported Command Types in Typical IIoT Systems

Command         Direction      Purpose
daemon_config   Cloud → Edge   Update gateway configuration (IP, ports, timeouts)
device_config   Cloud → Edge   Update PLC tag definitions (add/remove/modify tags)
get_status      Cloud → Edge   Request current gateway status report
tag_update      Cloud → Edge   Change polling interval for a specific tag
read_now        Cloud → Edge   Immediately read and transmit a specific tag value

The beauty of this pattern is that configuration changes are persisted to the edge device's filesystem. If the gateway reboots, it loads the latest configuration — including any cloud-pushed updates — without requiring another round-trip.

TLS and Authentication for Industrial MQTT

MQTT connections from edge gateways to cloud brokers must be encrypted and authenticated. In industrial settings, this typically means:

Certificate-Based Authentication

Each edge gateway carries:

  1. A device certificate (PEM format) signed by the cloud platform's CA
  2. A shared access signature (SAS) with an expiration timestamp
  3. The broker's root CA certificate for TLS verification

The SAS token has a limited lifetime (often 1–2 years). Your edge software must track the expiration timestamp and alert operations teams well before it expires:

SAS Token Structure:
    SharedAccessSignature sr=<hub>.azure-devices.net/devices/<device-id>
        &sig=<signature>
        &se=<expiration-unix-timestamp>

Production tip: Include the SAS expiration timestamp in every status report. Have your cloud platform alert when any device's token is within 90 days of expiration. Expired tokens are the #1 cause of "sudden" fleet-wide connectivity failures.
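Extracting the `se=` timestamp for that status report can be sketched as follows (assumes the token follows the structure shown above):

```python
import time
from urllib.parse import parse_qs

def sas_days_remaining(sas, now=None):
    """Parse the se= expiry from a SAS token; return days until it lapses."""
    query = sas.split("SharedAccessSignature ", 1)[-1]
    expiry = int(parse_qs(query)["se"][0])
    now = time.time() if now is None else now
    return (expiry - now) / 86400  # seconds per day

def should_alert(sas, threshold_days=90):
    """True when the token is inside the 90-day warning window."""
    return sas_days_remaining(sas) < threshold_days
```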

TLS Over Cellular: Performance Considerations

TLS handshakes over cellular add 2–5 seconds to every reconnection. On a gateway that reconnects frequently (poor signal area), this overhead compounds. Strategies to minimize it:

  1. Session resumption: Many MQTT libraries support TLS session caching. A resumed session skips the full handshake and completes in ~200ms.
  2. Connection persistence: Use MQTT keepalive aggressively (30–60 seconds) to prevent NAT timeout disconnections.
  3. Watchdog over keepalive: Don't rely solely on MQTT PINGREQ/PINGRESP. Supplement with application-level activity tracking.

Real-World Reliability Numbers

Based on production deployments across manufacturing facilities with cellular connectivity:

Metric                  Without Store-and-Forward   With Store-and-Forward
Data delivery rate      94–97%                      99.7–99.99%
Avg gap during outage   Full outage duration        0 (buffered)
Max tolerable outage    0 seconds                   10–30 minutes (RAM buffer)
Recovery time           Manual restart              Automatic (< 30 seconds)
Data ordering           Unpredictable               FIFO guaranteed

The 99.99% isn't marketing fluff — it's achievable because the only data loss scenarios are:

  1. Buffer overflow (outage exceeds buffer capacity — typically 10–30 minutes of data)
  2. Power loss (RAM buffer is volatile — disk-backed buffers solve this at the cost of I/O latency)
  3. Hardware failure (edge device dies entirely)

Scenarios 1 and 2 can be addressed with disk-backed buffers (SQLite, flat files), but most production deployments accept the RAM-only tradeoff for simplicity and performance.

Implementing Store-and-Forward: Architecture Checklist

If you're building or evaluating an IIoT edge platform, here's what the store-and-forward layer must include:

Must Have

  • Pre-allocated memory pool (no runtime malloc for data pages)
  • Page-based buffer with free/write/send lists
  • Non-blocking MQTT reconnection (separate thread)
  • Activity-based connection watchdog (not just MQTT keepalive)
  • FIFO delivery ordering
  • Automatic retry on PUBACK failure
  • Status reporting on connect (uptime, version, buffer depth)

Should Have

  • Buffer utilization metrics (% full, pages in use)
  • Configurable behavior on buffer full (drop oldest vs. block)
  • Command queue for inbound cloud-to-edge messages
  • Configuration persistence (survive reboot with latest config)
  • SAS/certificate expiration monitoring

Nice to Have

  • Disk-backed overflow buffer for extended outages
  • Message prioritization (alarms before routine telemetry)
  • Compression for large batches
  • Local MQTT broker failover (write to localhost:1883 if cloud is down)

Common Pitfalls and How to Avoid Them

Pitfall 1: Blocking on MQTT Publish

Symptom: PLC data stops updating during network outages.

Cause: The MQTT publish call blocks until the broker responds or times out. If the network is down, this can block for 30+ seconds per message.

Fix: Never publish from the PLC reading thread. Write to the buffer, let the MQTT thread drain it. The buffer write is always a local memory operation — it completes in microseconds.

Pitfall 2: Unbounded Memory Growth

Symptom: Edge gateway OOM-kills after running for weeks.

Cause: Dynamic memory allocation for MQTT messages without bounds. Each new message allocates, and if delivery is slow, allocations outpace frees.

Fix: Pre-allocate all buffer memory at startup. The page pool has a fixed ceiling. When it's full, you make an explicit decision (drop oldest or block) instead of silently growing until the OS kills you.

Pitfall 3: Reconnection Storms

Symptom: Gateway hammers the broker with connection attempts during outages, consuming CPU and battery.

Cause: Reconnection loop without exponential backoff.

Fix: Implement exponential backoff with jitter: 1s, 2s, 4s, 8s, 16s, 32s (cap). Add ±20% random jitter to prevent thundering herd when power is restored to an entire factory.
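That fix can be sketched in a few lines; the defaults mirror the 1s–32s schedule with ±20% jitter described above:

```python
import random

def backoff_delay(attempt, base=1.0, cap=32.0, jitter=0.2):
    """Exponential backoff with jitter: nominally 1, 2, 4, 8, 16, 32 (capped),
    each scaled by a random factor in [1 - jitter, 1 + jitter]."""
    delay = min(base * (2 ** attempt), cap)
    return delay * random.uniform(1 - jitter, 1 + jitter)
```

The jitter matters as much as the exponent: after a factory-wide power restore, dozens of gateways reconnecting on identical schedules would otherwise hit the broker in synchronized waves.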

Pitfall 4: Split-Brain After Long Outage

Symptom: After a 30-minute outage, the cloud receives 30 minutes of buffered historical data mixed with real-time data, confusing dashboards and alerts.

Cause: Buffered data is timestamped correctly but arrives all at once, triggering alert rules that check "most recent value."

Fix: Cloud-side ingestion must respect the embedded timestamp, not the arrival time. Alert rules should use the tag's timestamp field, not the message receipt time. This is a cloud architecture decision, but the edge must include accurate timestamps in every batch group.

How machineCDN Handles Store-and-Forward

machineCDN's edge agent implements the full store-and-forward pattern described here: page-based memory buffers, non-blocking MQTT with activity watchdog, automatic reconnection, and bidirectional command queuing. The buffer survives cellular outages, VPN drops, and cloud endpoint maintenance windows without losing a single tag value.

For teams running equipment in locations with unreliable connectivity — remote water treatment plants, offshore platforms, mobile fleets — this buffering layer is the difference between an IIoT deployment that delivers 95% of data and one that delivers 99.9%.

Key Takeaways

  1. MQTT QoS doesn't protect data during disconnections. You need application-level store-and-forward buffering.
  2. Pre-allocate all buffer memory at startup to avoid fragmentation and OOM on long-running embedded systems.
  3. Use page-based buffer pools with explicit free/write/send lifecycle management.
  4. Never block the PLC reader on network operations. Separate the data producer from the transport layer.
  5. Implement an activity-based watchdog beyond MQTT keepalive to catch zombie connections.
  6. Run MQTT reconnection asynchronously with exponential backoff and jitter.
  7. Include accurate timestamps in every data group so cloud-side processing can distinguish buffered historical data from real-time data.

The edge-to-cloud pipeline is only as reliable as its weakest link. Store-and-forward buffering makes that link the network itself — not your software.