
187 posts tagged with "Industrial IoT"

Industrial Internet of Things insights and best practices


MachineCDN vs Savigent: IIoT Analytics Platform vs Manufacturing Execution System

· 8 min read
MachineCDN Team
Industrial IoT Experts

When manufacturing engineers evaluate platforms to digitize their factory floor, two very different approaches emerge: IIoT analytics platforms like MachineCDN that connect directly to your machines, and Manufacturing Execution Systems (MES) like Savigent that orchestrate workflows across your production process. Understanding the difference is critical before you commit budget and engineering time.

Modbus Float Encoding: How to Correctly Read IEEE 754 Values from Industrial PLCs [2026]

· 11 min read

If you've spent any time integrating PLCs with an IIoT platform, you've encountered the moment: you read a temperature register that should show 72.5°F, but instead you get 1,116,798,976 (the raw bits 0x42910000 read as an integer). Or worse: NaN. Or a negative number that makes zero physical sense.

Welcome to the Modbus float encoding problem. It's the #1 source of confusion in industrial data integration, and it trips up experienced engineers just as often as beginners.

This guide goes deep on how 32-bit floating-point values are actually stored and transmitted over Modbus — covering register pairing, word-swap variants, byte ordering, and the practical techniques that production IIoT systems use to get correct readings from heterogeneous equipment fleets.

Why Modbus and Floats Don't Play Nicely Together

The original Modbus specification (1979) defined only 16-bit registers. Each holding register (4xxxx) or input register (3xxxx) stores exactly one unsigned 16-bit word — values from 0 to 65,535.

But modern PLCs need to represent temperatures like 215.7°F, flow rates like 3.847 GPM, and pressures like 127.42 PSI. A 16-bit integer can't hold these values with the precision operators need.

The solution: pack an IEEE 754 single-precision float (32 bits) across two consecutive Modbus registers. Simple enough in theory. In practice, it's a minefield.

The IEEE 754 Layout

A 32-bit float uses this bit structure:

Bit:  31   30..23     22..0
      S    EEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMM
      │    │          └── Mantissa (23 bits)
      │    └── Exponent (8 bits, biased by 127)
      └── Sign (1 bit: 0=positive, 1=negative)

The float value 72.5 encodes as 0x42910000:

  • Sign: 0 (positive)
  • Exponent: 10000101 (133 - 127 = 6)
  • Mantissa: 00100010000000000000000
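To check a bit pattern like this by hand, it can help to split a float into its three fields in code. The following is a minimal sketch; `float_fields` and `split_float` are illustrative names, not part of any library, and it assumes the platform's `float` is an IEEE 754 single-precision value.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative container for the three IEEE 754 single-precision fields. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, still biased by 127 */
    uint32_t mantissa;  /* 23 bits, implicit leading 1 not included */
} float_fields;

/* Split a float into sign, biased exponent, and mantissa. */
float_fields split_float(float value) {
    uint32_t bits;
    float_fields f;
    memcpy(&bits, &value, sizeof bits);  /* copy the raw bits, no pointer casting */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For 72.5f this yields sign 0, biased exponent 133, and mantissa 0x110000, which is the same bit string listed above.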

That 32-bit value needs to be split across two 16-bit registers. Here's where the problems start.

The Four Word-Order Variants

Different PLC manufacturers split 32-bit floats into register pairs using different byte and word ordering. There are four possible arrangements, and encountering all four in a single plant is common:

Variant 1: Big-Endian (AB CD) — "Network Order"

The most intuitive layout. The high word occupies the lower register address.

Register N  :  0x4291  (bytes A, B)
Register N+1: 0x0000 (bytes C, D)

Reconstruct: (Register_N << 16) | Register_N+1 = 0x42910000 → 72.5

Used by: Many Allen-Bradley/Rockwell PLCs, Schneider Modicon M340/M580, some Siemens devices.

Variant 2: Little-Endian Word Swap (CD AB)

The low word comes first. This is surprisingly common.

Register N  :  0x0000  (bytes C, D)
Register N+1: 0x4291 (bytes A, B)

Reconstruct: (Register_N+1 << 16) | Register_N = 0x42910000 → 72.5

Used by: Many Modbus TCP devices, Conch controls, various Asian-manufactured PLCs.

Variant 3: Byte-Swapped Big-Endian (BA DC)

Each 16-bit word has its bytes reversed, but word order is normal.

Register N  :  0x9142  (bytes B, A)
Register N+1: 0x0000 (bytes D, C)

This requires swapping bytes within each word before combining.

Used by: Some older Emerson/Fisher devices, certain Yokogawa controllers.

Variant 4: Byte-Swapped Little-Endian (DC BA)

The least intuitive: both word order and byte order are reversed.

Register N  :  0x0000  (bytes D, C)
Register N+1: 0x9142 (bytes B, A)

Used by: Rare, but you'll find it in some legacy Fuji and Honeywell equipment.
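The four layouts above can be handled by one decoder that takes the word order as a parameter. This is a sketch under stated assumptions: the `word_order` enum and the function names are hypothetical, and `regs[0]` is taken to be register N exactly as it came off the wire.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the four layouts described above. */
typedef enum { ORDER_ABCD, ORDER_CDAB, ORDER_BADC, ORDER_DCBA } word_order;

/* Swap the two bytes inside one 16-bit register. */
uint16_t swap_bytes(uint16_t w) {
    return (uint16_t)((w << 8) | (w >> 8));
}

/* regs[0] is register N, regs[1] is register N+1. */
float decode_float(const uint16_t regs[2], word_order order) {
    uint16_t hi, lo;
    switch (order) {
    case ORDER_CDAB: hi = regs[1];             lo = regs[0];             break;
    case ORDER_BADC: hi = swap_bytes(regs[0]); lo = swap_bytes(regs[1]); break;
    case ORDER_DCBA: hi = swap_bytes(regs[1]); lo = swap_bytes(regs[0]); break;
    case ORDER_ABCD:
    default:         hi = regs[0];             lo = regs[1];             break;
    }
    uint32_t bits = ((uint32_t)hi << 16) | lo;
    float result;
    memcpy(&result, &bits, sizeof result);  /* reinterpret the bits as a float */
    return result;
}
```

Each of the four register pairs shown in this section decodes to 72.5 when paired with its matching `word_order` value.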

How Production IIoT Systems Handle This

In a real manufacturing environment, you don't get to choose which word order your equipment uses. A single plant might have:

  • TCU (Temperature Control Units) using Modbus RTU at 9600 baud, storing floats in registers 404000-404056 with big-endian word order
  • Portable chillers on Modbus TCP port 502, using 16-bit integers (no float encoding needed)
  • Batch blenders speaking EtherNet/IP natively, where float handling is built into the CIP protocol
  • Dryers with Modbus TCP and CD-AB word swapping

A well-designed edge gateway handles this with per-device configuration. The key insight: float decoding is a device-level property, not a global setting. Each equipment type gets its own configuration that specifies:

  1. Protocol (Modbus RTU, Modbus TCP, or EtherNet/IP)
  2. Register address (which pair of registers holds the float)
  3. Element count — set to 2 for a 32-bit float spanning two registers
  4. Data type — explicitly declared as float vs. int16 vs. uint32

Here's a generic configuration example for a temperature control unit reading float values over Modbus RTU:

{
  "protocol": "modbus-rtu",
  "tags": [
    {
      "name": "Delivery Temperature",
      "register": 4002,
      "type": "float",
      "element_count": 2,
      "poll_interval_sec": 60
    },
    {
      "name": "Mold Temperature",
      "register": 4004,
      "type": "float",
      "element_count": 2,
      "poll_interval_sec": 60
    },
    {
      "name": "Flow Rate",
      "register": 4008,
      "type": "float",
      "element_count": 2,
      "poll_interval_sec": 60
    }
  ]
}

Notice the element_count: 2. This tells the gateway: "read two consecutive registers starting at this address, then combine them into a single 32-bit float." Getting this wrong is the most common source of incorrect readings.

The modbus_get_float() Trap

If you're using libmodbus (the most common C library for Modbus), you'll encounter modbus_get_float() and its variants:

  • modbus_get_float_abcd() — big-endian (most standard)
  • modbus_get_float_dcba() — fully reversed
  • modbus_get_float_badc() — byte-swapped, word-normal
  • modbus_get_float_cdab() — word-swapped, byte-normal

The default modbus_get_float() function uses CDAB ordering (word-swapped). This catches many engineers off guard — they read two registers, call modbus_get_float(), and get garbage because their PLC uses ABCD ordering.

Rule of thumb: Always test with a known value. Write 72.5 to a register pair in your PLC, read both registers as raw uint16 values, and observe which bytes are where. Then select the appropriate decode function.
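The known-value test can be automated during commissioning. The sketch below tries all four layouts against the two raw registers and reports which one reproduces the expected value; `detect_order` and its helpers are hypothetical names, not libmodbus functions.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret 32 raw bits as a float. */
float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Swap the two bytes inside one 16-bit register. */
uint16_t bswap16(uint16_t w) {
    return (uint16_t)((w << 8) | (w >> 8));
}

/* Returns "ABCD", "CDAB", "BADC", "DCBA", or NULL if no layout matches.
   reg_n and reg_n1 are registers N and N+1 as raw uint16 values. */
const char *detect_order(uint16_t reg_n, uint16_t reg_n1, float expected) {
    uint32_t n = reg_n, n1 = reg_n1;
    uint32_t ns = bswap16(reg_n), n1s = bswap16(reg_n1);
    if (bits_to_float((n   << 16) | n1)  == expected) return "ABCD";
    if (bits_to_float((n1  << 16) | n)   == expected) return "CDAB";
    if (bits_to_float((ns  << 16) | n1s) == expected) return "BADC";
    if (bits_to_float((n1s << 16) | ns)  == expected) return "DCBA";
    return NULL;
}
```

A test value such as 0.0 matches every layout, so pick one whose register pair differs between layouts; 72.5 works because its high and low words are different.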

Practical Decoding in C

Here's how you'd manually decode a float from two Modbus registers, covering the big-endian and word-swapped cases:

#include <stdint.h>
#include <string.h>

// Big-endian (ABCD): high word in register[0], low word in register[1]
float decode_float_be(uint16_t reg_high, uint16_t reg_low) {
    uint32_t combined = ((uint32_t)reg_high << 16) | (uint32_t)reg_low;
    float result;
    memcpy(&result, &combined, sizeof(float));  // bit-for-bit copy
    return result;
}

// Word-swapped (CDAB): low word in register[0], high word in register[1]
// Same arithmetic as above; only the order of the arguments differs.
float decode_float_ws(uint16_t reg_low, uint16_t reg_high) {
    uint32_t combined = ((uint32_t)reg_high << 16) | (uint32_t)reg_low;
    float result;
    memcpy(&result, &combined, sizeof(float));  // bit-for-bit copy
    return result;
}

Never use pointer casting (*(float*)&combined). It violates strict aliasing rules and can produce incorrect results on optimizing compilers. Always use memcpy.

Element Count and Register Math

One subtle but critical detail: when you configure a tag to read a float, the element count tells the gateway how many 16-bit registers to request in a single Modbus transaction.

For a single float:

  • Element count = 2 (two 16-bit registers = 32 bits)
  • Read function code 3 (holding registers) or 4 (input registers)
  • The response contains 4 bytes of data

For an array of 8 floats (e.g., reading recipe values from a batch blender):

  • Element count = 16 (8 floats × 2 registers each)
  • Single Modbus read request for 16 consecutive registers
  • Far more efficient than 8 separate read requests

This is where contiguous register optimization matters. If you have tags at registers 4000, 4002, 4004, 4006, 4008 — all 2-element floats — a smart gateway combines them into a single Modbus read of 10 registers instead of 5 separate reads. This reduces bus traffic by 60-80% on RTU networks where every transaction costs 5-20ms of serial turnaround time.
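A rough sketch of that coalescing step follows. It assumes the tag addresses are already sorted and that every tag spans the same number of registers; `read_block` and `coalesce_reads` are illustrative names. The `max_regs` parameter caps each block, since a single read-registers request is limited to 125 registers.

```c
#include <stddef.h>
#include <stdint.h>

/* One Modbus read request: starting register and register count. */
typedef struct {
    uint16_t start;
    uint16_t count;
} read_block;

/* Merge sorted, fixed-width tags into contiguous read blocks.
   Returns the number of blocks written to out (at most out_cap). */
size_t coalesce_reads(const uint16_t *tag_addrs, size_t n_tags,
                      uint16_t regs_per_tag, uint16_t max_regs,
                      read_block *out, size_t out_cap) {
    size_t n_blocks = 0;
    for (size_t i = 0; i < n_tags; i++) {
        if (n_blocks > 0) {
            read_block *last = &out[n_blocks - 1];
            int contiguous = (tag_addrs[i] == last->start + last->count);
            int fits = (last->count + regs_per_tag <= max_regs);
            if (contiguous && fits) {
                /* Extend the current block instead of starting a new read. */
                last->count = (uint16_t)(last->count + regs_per_tag);
                continue;
            }
        }
        if (n_blocks == out_cap) break;  /* caller's array is full */
        out[n_blocks].start = tag_addrs[i];
        out[n_blocks].count = regs_per_tag;
        n_blocks++;
    }
    return n_blocks;
}
```

With tags at 4000, 4002, 4004, 4006, and 4008 this produces one block starting at 4000 with a count of 10. A gap in the addresses starts a new block.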

Modbus RTU vs TCP: Float Handling Differences

RTU (Serial)

Serial Modbus has strict timing requirements. The inter-frame gap (3.5 character times of silence) separates messages. At 9600 baud with 8N1 encoding:

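  • 1 character = 10 bits (start + 8 data + stop; 8N1 has no parity bit)
  • 1 character time = 10/9600 ≈ 1.04ms
  • 3.5 character silence ≈ 3.6ms, commonly rounded up to ~4ms
  • With a parity bit (8E1, 11 bits per character), the same math gives 1.146ms per character and ~4ms of silence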
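The timing arithmetic generalizes to any baud rate and framing. Here is a small sketch with hypothetical helper names; it takes the bits per character as a parameter (10 for a frame with no parity and one stop bit, 11 when a parity bit or a second stop bit is present) and returns rounded microseconds.

```c
#include <stdint.h>

/* Microseconds needed to transmit one character, rounded to nearest. */
uint32_t char_time_us(uint32_t baud, uint32_t bits_per_char) {
    return (bits_per_char * 1000000u + baud / 2) / baud;
}

/* Microseconds of silence equal to 3.5 character times, rounded to nearest. */
uint32_t frame_gap_us(uint32_t baud, uint32_t bits_per_char) {
    /* 3.5 characters = 35 tenths of a character, hence 35 * 100000. */
    return (35u * bits_per_char * 100000u + baud / 2) / baud;
}
```

At 9600 baud, 11 bits per character gives 1146 µs per character and a 4010 µs gap; 10 bits per character gives 1042 µs and 3646 µs.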

When reading float values over RTU, response timeout configuration matters. A typical setup:

Baud:             9600
Parity:           None
Data bits:        8
Stop bits:        1
Byte timeout:     4ms (gap between consecutive bytes)
Response timeout: 100ms (total time to receive response)

If your byte timeout is too tight, the response may be split into two frames, and the second register of your float pair gets dropped. If you're seeing correct first-register values but garbage in the combined float, increase byte timeout to 5-8ms.

TCP (Ethernet)

Modbus TCP eliminates timing issues but introduces transaction ID management. Each request gets a transaction ID that the slave echoes back. For float reads, the process is identical — request 2 registers, get 4 bytes back — but the framing is handled by TCP, so there's no byte-timeout concern.

The default Modbus TCP port is 502. Some devices use non-standard ports; always verify with the equipment manual.
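For reference, a Modbus TCP read request is a 7-byte MBAP header followed by a 5-byte PDU. The sketch below builds that 12-byte frame; `build_read_request` is an illustrative name, and sending the bytes over a socket is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define READ_REQUEST_BYTES 12u

/* Fill out[] with a Modbus TCP read request and return its length.
   function_code is 3 (holding registers) or 4 (input registers);
   start_offset is the zero-based register offset. */
size_t build_read_request(uint8_t out[READ_REQUEST_BYTES],
                          uint16_t transaction_id, uint8_t unit_id,
                          uint8_t function_code,
                          uint16_t start_offset, uint16_t quantity) {
    out[0]  = (uint8_t)(transaction_id >> 8);   /* echoed back in the response */
    out[1]  = (uint8_t)(transaction_id & 0xFF);
    out[2]  = 0;                                /* protocol ID: 0 for Modbus */
    out[3]  = 0;
    out[4]  = 0;                                /* length of what follows: */
    out[5]  = 6;                                /* unit ID + 5-byte PDU */
    out[6]  = unit_id;
    out[7]  = function_code;
    out[8]  = (uint8_t)(start_offset >> 8);
    out[9]  = (uint8_t)(start_offset & 0xFF);
    out[10] = (uint8_t)(quantity >> 8);         /* 2 for one 32-bit float */
    out[11] = (uint8_t)(quantity & 0xFF);
    return READ_REQUEST_BYTES;
}
```

A response to a 2-register read carries the same transaction ID, a byte count of 4, and then the four data bytes that make up the float.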

Common Pitfalls and Troubleshooting

1. Reading Zero Where You Expect a Float

Symptom: Register pair returns 0x0000 0x0000 → 0.0

Likely cause: Wrong register address. Remember the Modbus address convention:

  • Addresses 400001-465536 use function code 3 (read holding registers)
  • Addresses 300001-365536 use function code 4 (read input registers)
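  • The zero-based register offset sent on the wire = address - 400001 (for holding) or address - 300001 (for input)

A tag configured at address 404000 is the 4,000th holding register, which goes on the wire as offset 3999 with function code 3. If you accidentally use function code 4, you're reading the input register at the same offset instead, which holds a completely different value. An off-by-one in the subtraction has a similar effect: you read the neighboring register.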
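The address convention can be captured in a small lookup. This is a sketch with hypothetical names (`modbus_target`, `resolve_address`); it returns the function code and the zero-based offset that goes on the wire, so 404000 resolves to offset 3999, the 4,000th holding register.

```c
#include <stdint.h>

/* Function code and zero-based wire offset for one 6-digit address. */
typedef struct {
    int      function_code;
    uint16_t offset;
} modbus_target;

/* Returns 0 on success, -1 if the address is in neither range. */
int resolve_address(uint32_t address, modbus_target *out) {
    if (address >= 400001u && address <= 465536u) {
        out->function_code = 3;  /* read holding registers */
        out->offset = (uint16_t)(address - 400001u);
        return 0;
    }
    if (address >= 300001u && address <= 365536u) {
        out->function_code = 4;  /* read input registers */
        out->offset = (uint16_t)(address - 300001u);
        return 0;
    }
    return -1;
}
```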

2. Reading Extreme Values

Symptom: You get values like 4.5e+28 or -3.2e-15

Likely cause: Wrong word order. You're combining registers in the wrong sequence. Try swapping the two registers and recomputing.

3. Getting NaN or Inf

Symptom: NaN (0x7FC00000) or Inf (0x7F800000)

Likely causes:

  • Word-order mismatch producing an exponent field of all 1s
  • Reading a register that doesn't actually contain a float (it's a raw integer)
  • Sensor disconnected — some PLCs write NaN to indicate a failed sensor

4. Values That Are Close But Off By a Factor

Symptom: You read 7250.0 instead of 72.5

Likely cause: The PLC stores values as scaled integers, not floats. Many older PLCs store temperature as an integer × 100 (so 72.5°F = 7250). Check the PLC documentation for scaling factors. This is especially common with Modbus devices that use single registers (element count = 1) for process values.

5. Intermittent Corrupt Readings

Symptom: 99% of readings are correct, but occasionally you get wild values.

Likely cause: On Modbus RTU, this is usually CRC errors that weren't caught, or electrical noise on the RS-485 bus. Add retry logic — read the registers, if the float value is outside physical bounds (e.g., temperature > 500°F for a plastics process), retry up to 3 times before logging an error.
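The retry-on-implausible-value idea might look like the sketch below. The reader callback type and the `scripted_source` test double are assumptions made for illustration; in a real gateway the callback would wrap the actual Modbus read and decode.

```c
#include <stdbool.h>

/* Hypothetical callback: reads one float tag, returns true on success. */
typedef bool (*read_float_fn)(void *ctx, float *out);

/* True when v lies inside [lo, hi]. NaN fails both comparisons and
   infinities fall outside any finite bounds, so both are rejected. */
bool plausible(float v, float lo, float hi) {
    return v >= lo && v <= hi;
}

/* Try up to max_attempts reads; accept the first plausible value. */
bool read_with_retry(read_float_fn reader, void *ctx, float lo, float hi,
                     int max_attempts, float *out) {
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        float v;
        if (reader(ctx, &v) && plausible(v, lo, hi)) {
            *out = v;
            return true;
        }
    }
    return false;  /* caller logs an error status */
}

/* Test double: replays a fixed list of readings, one per call. */
typedef struct {
    const float *values;
    int count;
    int next;
} scripted_source;

bool scripted_read(void *ctx, float *out) {
    scripted_source *s = (scripted_source *)ctx;
    if (s->next >= s->count) return false;
    *out = s->values[s->next++];
    return true;
}
```

With bounds of 32°F to 500°F, a first reading of 10,000 is discarded and the second attempt's 72.5 is accepted.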

Real-World Benchmarks

In production IIoT deployments monitoring plastics manufacturing equipment, typical float-read performance:

Protocol            Float Read Time   Registers per Request     Effective Throughput
Modbus RTU @ 9600   15-25ms           2 (single float)          ~40 floats/sec
Modbus RTU @ 9600   30-45ms           50 (contiguous block)     ~1,000 values/sec
Modbus TCP          2-5ms             2 (single float)          ~200 floats/sec
Modbus TCP          3-8ms             125 (max block)           ~15,000 values/sec
EtherNet/IP         1-3ms             N/A (native types)        ~5,000+ tags/sec

The lesson: Modbus RTU float reads are slow individually but scale well with contiguous reads. If you have 30 float tags spread across non-contiguous addresses, it's 30 × 20ms = 600ms per polling cycle. Group your tags by contiguous address blocks to minimize transactions.

Best Practices for Production Systems

  1. Declare types explicitly in configuration. Never auto-detect float vs. integer — always specify the data type per tag.

  2. Use element count = 2 for floats. This is the most common source of misconfiguration. A float is 2 registers, always.

  3. Test with known values during commissioning. Before going live, write a known float (like 123.456) to the PLC and verify the IIoT platform reads it correctly.

  4. Document word order per device type. Build a device-specific configuration library. A TrueTemp TCU uses ABCD, a GP Chiller uses raw int16 — capture this per equipment model.

  5. Implement bounds checking. If a temperature reading suddenly shows 10,000°F, that's not a process event — it's a decode error. Log it, don't alert on it.

  6. Add retry logic for RTU reads. Serial networks are noisy. Retry failed reads up to 3 times before reporting an error status.

  7. Batch contiguous registers. Instead of reading registers 4000-4001, then 4002-4003, then 4004-4005 as three separate transactions, read 4000-4005 as a single 6-register request.

How MachineCDN Handles Float Encoding

MachineCDN's edge gateway is built to handle the float encoding problem across heterogeneous equipment fleets. Each device type gets a configuration profile that explicitly declares register addresses, data types, element counts, and polling intervals, eliminating the guesswork that causes most float decoding failures.

The platform supports Modbus RTU, Modbus TCP, and EtherNet/IP natively, with automatic protocol detection during initial device discovery. When a new PLC is connected, the gateway attempts EtherNet/IP first (reading the device type tag directly), then falls back to Modbus TCP on port 502. This dual-protocol detection means a single gateway can service mixed equipment floors without manual protocol configuration.

For plastics manufacturers running TCUs, chillers, blenders, dryers, and conveying systems, MachineCDN provides pre-built device profiles that include correct register maps, data types, and word-order settings, so the float encoding problem is solved before commissioning begins.


Getting float encoding right is the foundation of trustworthy IIoT data. Every OEE calculation, every alarm threshold, every predictive maintenance model depends on correct readings from the plant floor. Invest the time to verify your decoding — the downstream value is enormous.

MQTT Last Will and Testament for Industrial Device Health Monitoring [2026]

· 12 min read


In industrial environments, knowing that a device is offline is just as important as knowing what it reports when it's online. A temperature sensor that silently stops publishing doesn't trigger alarms — it creates a blind spot. And in manufacturing, blind spots kill uptime.

MQTT's Last Will and Testament (LWT) mechanism solves this problem at the protocol level. When properly implemented alongside birth certificates, status heartbeats, and connection watchdogs, LWT transforms MQTT from a simple pub/sub pipe into a self-diagnosing industrial nervous system.

This guide covers the practical engineering behind LWT in industrial deployments — not just the theory, but the real-world patterns that survive noisy factory networks.

MQTT QoS Levels for Industrial Telemetry: Choosing the Right Delivery Guarantee [2026]

· 11 min read

When an edge gateway publishes a temperature reading from a plastics extruder running at 230°C, does it matter if that message arrives exactly once, at least once, or possibly not at all? The answer depends on what you're doing with the data — and getting it wrong can mean either lost production insights or a network drowning in redundant traffic.

MQTT's Quality of Service (QoS) levels are one of the most misunderstood aspects of industrial IoT deployments. Most engineers default to QoS 1 for everything, which is rarely optimal. This guide breaks down each level with real industrial scenarios, bandwidth math, and patterns that actually work on factory floors where cellular links drop and PLCs generate thousands of data points per second.

OPC-UA Information Modeling and Subscriptions: A Deep Dive for IIoT Engineers [2026]

· 12 min read

If you've spent time wiring Modbus registers to cloud platforms, you know the pain: flat address spaces, no built-in semantics, and endless spreadsheets mapping register 40004 to "Mold Temperature Zone 2." OPC-UA was designed to solve exactly this problem — but its information modeling layer is far richer (and more complex) than most engineers realize when they first encounter it.

This guide goes deep on how OPC-UA structures industrial data, how subscriptions efficiently deliver changes to clients, and how security policies protect the entire stack. Whether you're evaluating OPC-UA for a greenfield deployment or bridging it into an existing Modbus/EtherNet-IP environment, this is the practical knowledge you need.

Paged Ring Buffers for Industrial MQTT: How to Never Lose a Data Point [2026]

· 10 min read

Here's the scenario every IIoT engineer dreads: your edge gateway is collecting temperature, pressure, and vibration data from 200 tags across 15 PLCs. The cellular modem on the factory roof drops its connection — maybe for 30 seconds during a handover, maybe for 4 hours because a backhoe hit a fiber line. When connectivity returns, what happens to the data?

If your answer is "it's gone," you have a buffer management problem. And fixing it properly requires understanding paged ring buffers — the unsung hero of reliable industrial telemetry.

Why Naive Buffering Fails

The simplest approach — queue MQTT messages in memory and retry on reconnect — has three fatal flaws:

  1. Memory exhaustion: A gateway reading 200 tags at 1-second intervals generates ~12,000 readings per minute. At ~100 bytes per JSON reading, that's 1.2 MB/minute. A 4-hour outage accumulates ~288 MB. Your 256 MB embedded gateway just died.

  2. No delivery confirmation: MQTT QoS 1 guarantees "at least once" delivery, but the Mosquitto client library's in-flight message queue is finite. If you publish 50,000 messages into a disconnected client, most will be silently dropped by the client library's internal buffer long before the broker sees them.

  3. Thundering herd on reconnect: When connectivity returns, dumping 288 MB of queued messages simultaneously will choke the cellular uplink (typically 1–5 Mbps), cause broker-side backpressure, and likely trigger another disconnect.

The Paged Ring Buffer Architecture

The solution is a fixed-size, page-based circular buffer that sits between the data collection layer and the MQTT client. Here's how it works:

Memory Layout

The buffer is allocated as a single contiguous block, typically 2 MB on an embedded gateway. This block is divided into equal-sized pages, each large enough to hold at least one complete MQTT payload.

┌──────────────────────────────────────────────────┐
│               2 MB Buffer Memory                 │
├────────┬────────┬────────┬────────┬────────┬─────┤
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ Page 4 │ ... │
│  4 KB  │  4 KB  │  4 KB  │  4 KB  │  4 KB  │     │
└────────┴────────┴────────┴────────┴────────┴─────┘

With a 4 KB page size and a 2 MB total buffer, you get 512 pages. Each page holds one or more MQTT messages packed sequentially.

Page States

Every page exists in exactly one of three states:

  • Free: Available for new data. Part of a singly-linked free list.
  • Work: Currently being filled with incoming data. Only one work page exists at a time.
  • Used: Full of data, waiting to be transmitted. Part of a singly-linked FIFO queue.

Free Pages → [P5] → [P6] → [P7] → null
Work Page  → [P3] (currently filling)
Used Pages → [P0] → [P1] → [P2] → null
              ↑ sending     waiting →

The Write Path

When a batch of PLC tag values arrives from the data collection layer:

  1. Check the work page: If there's no current work page, pop one from the free list. If the free list is empty, steal the oldest used page (overflow — we're losing old data to make room for new data, which is the correct trade-off for operational monitoring).

  2. Calculate fit: Each message is packed as: [4-byte message ID] [4-byte message size] [message payload]. Check if the current work page has enough remaining space for this overhead plus the payload.

  3. If it fits: Write the message ID (initially zero — will be filled by the MQTT client), the size, and the payload. Advance the write pointer.

  4. If it doesn't fit: Move the current work page to the tail of the used queue. Pop a new page from the free list (or steal from used queue). Write into the new page.

Page Internal Layout:

┌───────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ msg_id_1  │ msg_sz_1  │ payload_1 │ msg_id_2  │ msg_sz_2  │ payload_2 │
│ (4 bytes) │ (4 bytes) │ (N bytes) │ (4 bytes) │ (4 bytes) │ (M bytes) │
└───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
                                                                        ↑ write_p (current position)
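The fit check and record packing from the write path could be sketched as follows. The `page` struct, the constant names, and `page_append` are illustrative assumptions; page rotation and the free and used lists are left to the caller. The header fields are stored in host byte order because they never leave the gateway.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096u
#define MSG_HEADER_BYTES 8u  /* 4-byte message ID + 4-byte size */

typedef struct {
    uint8_t  data[PAGE_BYTES];
    uint32_t write_p;  /* offset of the next free byte */
    uint32_t read_p;   /* offset of the next undelivered message */
} page;

/* Append one record as [msg_id][size][payload].
   Returns false when it does not fit and the caller must rotate pages. */
bool page_append(page *p, const void *payload, uint32_t size) {
    if (size > PAGE_BYTES - MSG_HEADER_BYTES) return false;  /* can never fit */
    if (PAGE_BYTES - p->write_p < MSG_HEADER_BYTES + size) return false;

    uint32_t msg_id = 0;  /* placeholder until the publish call assigns an ID */
    memcpy(p->data + p->write_p, &msg_id, 4);
    memcpy(p->data + p->write_p + 4, &size, 4);
    memcpy(p->data + p->write_p + MSG_HEADER_BYTES, payload, size);
    p->write_p += MSG_HEADER_BYTES + size;
    return true;
}
```

When `page_append` returns false, the caller moves the page to the used queue and retries on a fresh page, as in step 4 above.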

The Send Path

The MQTT send logic runs after every write operation and follows strict rules:

  1. Check prerequisites: Connection must be up (connected == 1) AND no packet currently in-flight (packet_sent == 0). If either fails, do nothing — the data is safely buffered.

  2. Select the send source: If there are used pages, send from the first one in the FIFO. If no used pages exist but the work page has data, promote the work page to used and send from it.

  3. Read the next message from the current page's read pointer: extract the size, get the data pointer, and call mosquitto_publish() with QoS 1.

  4. Mark packet as in-flight: Set packet_sent = 1. This is critical — only one message can be in-flight at a time. This prevents the thundering herd problem and ensures ordered delivery.

  5. Wait for acknowledgment: The MQTT client library calls the publish callback when the broker confirms receipt (PUBACK for QoS 1). Only then do we advance the read pointer and send the next message.

The Acknowledgment Path

When the Mosquitto library fires the on_publish callback with a packet ID:

  1. Verify the ID matches the in-flight message on the current used page
  2. Advance the read pointer past the delivered message (skip message ID + size + payload bytes)
  3. Check if page is fully delivered: If read_p >= write_p, move the page back to the free list
  4. Clear the in-flight flag: Set packet_sent = 0
  5. Immediately attempt to send the next message. This creates a natural flow control where messages are delivered as fast as the broker can acknowledge them.

Delivery Flow:

               publish()
[Used Page] ──────────────────→ [MQTT Broker]
     ↑                                │
     │             PUBACK             │
     └────────────────────────────────┘
     advance read_p, try next
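The send and acknowledgment rules above amount to a small state machine. The sketch below models it with a single page and a hypothetical `publish_fn` hook standing in for the MQTT publish call, so it runs without a broker; the struct layout and every function name here are assumptions, and locking is omitted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BUF_BYTES 256u
#define HDR_BYTES 8u  /* 4-byte message ID + 4-byte size */

/* Hypothetical transport hook; returns the packet ID it assigned. */
typedef int (*publish_fn)(const uint8_t *payload, uint32_t size);

typedef struct {
    uint8_t    data[BUF_BYTES];
    uint32_t   read_p, write_p;
    bool       connected;
    bool       packet_sent;   /* true while one message awaits its PUBACK */
    int        inflight_id;
    publish_fn publish;
} sender;

/* Stub transport for demonstration: counts calls and uses the count as ID. */
int g_published = 0;
int counting_publish(const uint8_t *payload, uint32_t size) {
    (void)payload; (void)size;
    return ++g_published;
}

bool enqueue(sender *s, const void *payload, uint32_t size) {
    if (BUF_BYTES - s->write_p < HDR_BYTES + size) return false;
    memset(s->data + s->write_p, 0, 4);              /* message ID placeholder */
    memcpy(s->data + s->write_p + 4, &size, 4);
    memcpy(s->data + s->write_p + HDR_BYTES, payload, size);
    s->write_p += HDR_BYTES + size;
    return true;
}

void try_send_next(sender *s) {
    /* Prerequisites: connected, nothing in flight, something to send. */
    if (!s->connected || s->packet_sent || s->read_p >= s->write_p) return;
    uint32_t size;
    memcpy(&size, s->data + s->read_p + 4, 4);
    s->inflight_id = s->publish(s->data + s->read_p + HDR_BYTES, size);
    s->packet_sent = true;
}

void on_delivered(sender *s, int packet_id) {
    if (!s->packet_sent || packet_id != s->inflight_id) return;  /* stale ack */
    uint32_t size;
    memcpy(&size, s->data + s->read_p + 4, 4);
    s->read_p += HDR_BYTES + size;   /* only now is the message released */
    s->packet_sent = false;
    if (s->read_p >= s->write_p) s->read_p = s->write_p = 0;  /* page drained */
    try_send_next(s);                /* chain the next send */
}

void on_disconnect(sender *s) {
    s->connected = false;
    s->packet_sent = false;  /* read_p is untouched, so the message is re-sent */
}
```

Because `read_p` only advances inside `on_delivered`, a message that was in flight when the link dropped is still the first one sent after reconnecting.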

Thread Safety: The Mutex Dance

In a real gateway, data collection and MQTT delivery run on different threads. The PLC polling loop writes data every second, while the Mosquitto client library fires callbacks from its own network thread. Every buffer operation — add, send, acknowledge, connect, disconnect — must be wrapped in a mutex:

// Data collection thread:
mutex_lock(buffer)
add_data(payload)
try_send_next() // opportunistic send
mutex_unlock(buffer)

// MQTT callback thread:
mutex_lock(buffer)
mark_delivered(packet_id)
try_send_next() // chain next send
mutex_unlock(buffer)

The key insight is that try_send_next() is called from both threads — after every write (in case we're connected and idle) and after every acknowledgment (to chain the next message). This ensures maximum throughput without busy-waiting.

Handling Disconnects Gracefully

When the MQTT connection drops, two things happen:

  1. The disconnect callback fires: Set connected = 0 and packet_sent = 0. The in-flight message is NOT lost — it's still in the page at the current read pointer. When connectivity returns, it will be re-sent.

  2. Data keeps flowing in: The PLC polling loop doesn't stop. New data continues to fill pages. The used queue grows. If it fills all available pages, new pages will steal from the oldest used pages — but this only happens under extreme sustained outages.

When the connection re-establishes:

  1. The connect callback fires: Set connected = 1 and trigger try_send_next()
  2. Buffered data starts flowing: Messages are delivered in FIFO order, one at a time, with acknowledgment pacing

This means the broker receives data in chronological order, with timestamps embedded in each batch. Analytics systems downstream can seamlessly handle the gap — they see a burst of historical data followed by real-time data, all correctly timestamped.

The Cloud Watchdog: Detecting Silent Failures

There's a subtle failure mode: the MQTT connection appears healthy (no disconnect callback), but data isn't actually being delivered. This can happen with certain TLS middlebox issues, stale TCP connections that haven't timed out, or Azure IoT Hub token expirations.

The solution is a delivery watchdog:

  1. Track the timestamp of the last successful packet delivery
  2. On a periodic check (every 120 seconds), compare the current time against the last delivery timestamp
  3. If no data has been delivered in 120 seconds AND the connection claims to be up, force a reconnection:
    • Reset the MQTT configuration timestamp (triggers config reload)
    • Clear the watchdog timer
    • The main loop will detect the stale configuration and restart the MQTT client

if (now - last_delivery_time > 120s) AND (connected) {
    log("No data delivered in 120s, forcing MQTT reconnect")
    force_mqtt_restart()
}

This catches the "zombie connection" problem that plagues many IIoT deployments — the gateway thinks it's sending, but nothing is actually arriving at the cloud.

Binary vs. JSON: The Bandwidth Trade-off

The paged buffer doesn't care about the payload format — it stores raw bytes. But the choice between JSON and binary encoding has massive implications for buffer utilization:

JSON payload for one tag reading:

{"id":42,"values":[23.7],"ts":1709337600}

~45 bytes per reading.

Binary payload for the same reading:

Tag ID:    2 bytes (uint16)
Status:    1 byte
Value Cnt: 1 byte
Value Sz:  1 byte
Value:     4 bytes (float32)
─────────────────────
Total:     9 bytes per reading

That's a 5x reduction. With batching (multiple readings per batch header), the per-reading overhead drops further because the timestamp and device identity are shared across a group of values.

On a cellular connection billing per megabyte, this isn't academic — it's the difference between $15/month and $75/month per gateway. On satellite connections (Iridium, Starlink maritime), it can be $50 vs. $250.

Binary Batch Wire Format

A binary batch on the wire follows this structure:

[0xF7]                        - 1 byte, magic/version marker
[num_groups]                  - 4 bytes, big-endian uint32
For each group:
  [timestamp]                 - 4 bytes, big-endian time_t
  [device_type]               - 2 bytes, big-endian uint16
  [serial_number]             - 4 bytes, big-endian uint32
  [num_values]                - 4 bytes, big-endian uint32
  For each value:
    [tag_id]                  - 2 bytes, big-endian uint16
    [status]                  - 1 byte (0 = OK, else error code)
    If status == 0:
      [values_count]          - 1 byte
      [value_size]            - 1 byte (1, 2, or 4)
      [values...]             - values_count × value_size bytes
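As a concrete illustration of the per-value record, here is a sketch that encodes one successful single-float reading. `encode_float_record` is a hypothetical name, and writing the float's bytes in big-endian order is an assumption made to match the other fields; the format description above does not state the byte order of the values themselves.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write one value record for a single float32 reading with status OK.
   out must have room for 9 bytes. Returns the number of bytes written. */
size_t encode_float_record(uint8_t *out, uint16_t tag_id, float value) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);   /* raw IEEE 754 bits */

    size_t i = 0;
    out[i++] = (uint8_t)(tag_id >> 8);    /* tag_id, big-endian */
    out[i++] = (uint8_t)(tag_id & 0xFF);
    out[i++] = 0;                         /* status: 0 = OK */
    out[i++] = 1;                         /* values_count */
    out[i++] = 4;                         /* value_size */
    out[i++] = (uint8_t)(bits >> 24);     /* value, big-endian (assumed) */
    out[i++] = (uint8_t)(bits >> 16);
    out[i++] = (uint8_t)(bits >> 8);
    out[i++] = (uint8_t)(bits & 0xFF);
    return i;
}
```

Encoding tag 42 with a value of 72.5 produces the nine bytes 00 2A 00 01 04 42 91 00 00.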

A batch of 50 tag readings fits in ~600 bytes binary versus ~3,000 bytes JSON. Over a 4-hour outage with 200 tags at 60-second intervals (48,000 readings), that's the difference between buffering ~0.43 MB (binary, at 9 bytes per reading) versus ~2.2 MB (JSON, at ~45 bytes per reading): comfortably inside a 2 MB buffer, or already past it.

Sizing Your Buffer: The Math

For a given deployment, calculate your buffer needs:

Tags: 200
Read interval: 60 seconds
Binary payload per reading: ~9 bytes
Readings per minute: 200
Bytes per minute: 200 × 9 = 1,800 bytes
With batch overhead (~15 bytes per group): ~1,815 bytes/min

Buffer size: 2 MB = 2,097,152 bytes
Retention: 2,097,152 / 1,815 = ~1,155 minutes = ~19.2 hours

So a 2 MB buffer can hold approximately 19 hours of data for 200 tags at 60-second intervals using binary encoding. With JSON, that drops to ~3.8 hours. Size your buffer accordingly.
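The same arithmetic can be wrapped in a helper for comparing deployments. This is a back-of-the-envelope sketch with a hypothetical function name; it treats the batch overhead as a fixed cost per poll and ignores page headers and fragmentation, so real retention will be somewhat lower.

```c
#include <stdint.h>

/* Approximate minutes of data a buffer can hold.
   overhead_per_poll: batch header bytes added once per polling cycle. */
uint64_t retention_minutes(uint64_t buffer_bytes, uint32_t tags,
                           uint32_t bytes_per_reading,
                           uint32_t overhead_per_poll,
                           uint32_t interval_sec) {
    uint64_t bytes_per_poll =
        (uint64_t)tags * bytes_per_reading + overhead_per_poll;
    if (bytes_per_poll == 0 || interval_sec == 0) return 0;
    uint64_t polls_held = buffer_bytes / bytes_per_poll;
    return polls_held * interval_sec / 60;
}
```

For 200 tags at 9 bytes each with 15 bytes of overhead and a 60-second interval, a 2,097,152-byte buffer gives 1,155 minutes. Swapping in 45 bytes per reading for JSON gives 232 minutes, a little under 4 hours.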

What MachineCDN Does Differently

MachineCDN's edge gateway implements this paged ring buffer architecture natively. Every gateway shipped includes:

  • Fixed 2 MB paged buffer with configurable page sizes matching the MQTT broker's maximum packet size
  • Automatic binary encoding for all telemetry — 5x bandwidth reduction over JSON
  • Single-message flow control with QoS 1 acknowledgment tracking — no thundering herd on reconnect
  • 120-second delivery watchdog that detects zombie connections and forces reconnect
  • Graceful overflow handling — when buffer fills, oldest data is recycled (not newest), preserving the most recent operational state

For plant engineers, this means deploying a gateway on a cellular connection and knowing that a connectivity outage — whether 30 seconds or 12 hours — won't result in lost data. The buffer holds, the watchdog monitors, and data flows in order when the link comes back.

Key Takeaways

  1. Never use unbounded queues for industrial telemetry buffering — use fixed-size paged buffers that degrade gracefully under memory pressure
  2. One message in-flight at a time prevents the thundering herd problem and ensures ordered delivery
  3. Always track delivery acknowledgments — don't just publish and forget; verify the broker received each packet before advancing
  4. Implement a delivery watchdog — silent MQTT failures are harder to detect than disconnects
  5. Use binary encoding — 5x bandwidth reduction means 5x longer buffer retention on the same memory
  6. Size for your worst outage — calculate how much buffer you need based on tag count, interval, and the longest connectivity gap you expect
  7. Thread safety is non-negotiable — data collection and MQTT delivery run concurrently; every buffer operation needs mutex protection

The paged ring buffer isn't exotic computer science — it's a practical engineering pattern that's been battle-tested in thousands of industrial deployments. The difference between a prototype IIoT system and a production one often comes down to exactly this kind of infrastructure.

Planned Production Time vs Actual: How IIoT Closes the Capacity Gap in Manufacturing

· 10 min read
MachineCDN Team
Industrial IoT Experts

Every production manager has been asked the same question by their VP of Operations: "How much more capacity do we have?" And every production manager has given the same answer with varying degrees of confidence: "We think we have about 15-20% more capacity, but it depends."

It depends on downtime. It depends on changeovers. It depends on which products are running. It depends on whether the Tuesday night shift actually gets 7.5 hours of production out of their 8-hour shift or whether they lose 90 minutes to startup, cleanup, and that recurring alarm on Press 4.

The gap between planned production time and actual productive time is the single largest source of hidden capacity in manufacturing. According to a study by the Aberdeen Group, the average manufacturer operates at 65-72% capacity utilization — meaning 28-35% of available production time is consumed by downtime, changeovers, slow cycles, and other losses that are rarely measured accurately.

IIoT platforms close this gap by measuring exactly what happens during every minute of planned production time. Not what is supposed to happen. Not what operators report happened. What actually happened, based on real-time machine data.

Prescriptive Maintenance for Manufacturing: Beyond Prediction — What to Do When Your AI Tells You Something's Wrong

· 9 min read
MachineCDN Team
Industrial IoT Experts

Predictive maintenance tells you that something is going to fail. Prescriptive maintenance tells you what to do about it. That distinction sounds subtle, but in practice it's the difference between a maintenance team that gets alerts they don't know how to act on, and one that receives specific, actionable guidance that prevents failures with minimal disruption.

Securing Industrial MQTT and OT Networks: TLS, Certificates, and Zero-Trust for the Factory Floor [2026]

· 13 min read

The edge gateway sitting on your factory floor is talking to the cloud. It's reading temperature, pressure, and flow data from PLCs over Modbus, packaging it into MQTT messages, and publishing to a broker that might be Azure IoT Hub, AWS IoT Core, or a self-hosted Mosquitto instance. The question isn't whether that data path is valuable — it's whether anyone else is listening.

Industrial MQTT security isn't a theoretical exercise. A compromised edge gateway can inject false telemetry (making operators think everything is fine when it isn't), intercept production data (exposing process parameters to competitors), or pivot into the OT network to reach PLCs directly. This guide covers the practical measures that actually protect these systems.

Thread-Safe Telemetry Pipelines: Building Concurrent IIoT Edge Gateways That Don't Lose Data [2026]

· 17 min read

An edge gateway on a factory floor isn't a REST API handling one request at a time. It's a real-time system juggling multiple competing demands simultaneously: polling a PLC for tag values every second, buffering data locally when the cloud connection drops, transmitting batched telemetry over MQTT, processing incoming configuration commands from the cloud, and monitoring its own health — all at once, on hardware with the computing power of a ten-year-old smartphone.

Get the concurrency wrong, and you don't get a 500 error in your logs. You get silent data loss, corrupted telemetry batches, or — worst case — a watchdog reboot loop that takes your monitoring offline during a critical production run.

This guide covers the architecture patterns that make industrial edge gateways reliable under real-world conditions: concurrent PLC polling, thread-safe buffering, MQTT delivery guarantees, and the store-and-forward patterns that keep data flowing when the network doesn't.

[Figure: Thread-safe edge gateway architecture with concurrent data pipelines]

The Concurrency Challenge in Industrial Edge Gateways

A typical edge gateway has at least three threads running concurrently:

  1. The polling thread — reads tags from PLCs at configured intervals (1-second to 60-second cycles)
  2. The MQTT network thread — manages the broker connection, handles publish/subscribe, reconnection
  3. The main control thread — processes incoming commands, monitors watchdog timers, manages configuration

These threads all share one critical resource: the outgoing data buffer. The polling thread writes telemetry into the buffer. The MQTT thread reads from the buffer and transmits data. When the connection drops, the buffer must hold data without the polling thread stalling. When the connection recovers, the buffer must drain in order without losing or duplicating messages.

This is a classic producer-consumer problem, but with industrial constraints that make textbook solutions insufficient.

Why Standard Queues Fall Short

Your first instinct might be to use a thread-safe queue — a ConcurrentLinkedQueue in Java, a queue.Queue in Python, or a lock-free ring buffer. These work fine for web applications, but industrial edge gateways have constraints that break standard queue implementations:

1. Memory Is Fixed and Finite

Edge gateways run on embedded hardware with 64 MB to 512 MB of RAM — no swap space, no dynamic allocation after startup. An unbounded queue will eventually exhaust memory during a long network outage. A fixed-size queue forces you to choose: block the producer (stalling PLC polling) or drop the oldest data.

2. Network Outages Last Hours, Not Seconds

In a factory, network outages aren't transient blips. A fiber cut, a misconfigured switch, or a power surge on the network infrastructure can take connectivity down for hours. Your buffer needs to hold potentially thousands of telemetry batches — not just a few dozen.

3. Delivery Confirmation Is Asynchronous

MQTT QoS 1 guarantees at-least-once delivery, but the PUBACK confirmation comes back asynchronously — possibly hundreds of milliseconds after the PUBLISH. During that window, you can't release the buffer space (the message might need retransmission), and you can't stall the producer (PLC data keeps flowing).

4. Data Must Survive Process Restarts

If the edge gateway daemon restarts (due to a configuration update, a watchdog trigger, or a power cycle), buffered-but-undelivered data must be recoverable. Purely in-memory queues lose everything.

The Paged Ring Buffer Pattern

The pattern that works in production is a paged ring buffer — a fixed-size memory region divided into pages, with explicit state tracking for each page. Here's how it works:

Memory Layout

At startup, the gateway allocates a single contiguous memory block and divides it into equal-sized pages:

┌─────────┬─────────┬─────────┬─────────┬─────────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ Page 4 │
│ FREE │ FREE │ FREE │ FREE │ FREE │
└─────────┴─────────┴─────────┴─────────┴─────────┘

Each page has its own header tracking:

  • A page number (for logging and debugging)
  • A start_p pointer (beginning of writable space)
  • A write_p pointer (current write position)
  • A read_p pointer (current read position for transmission)
  • A next pointer (linking to the next page in whatever list it's in)

Three Page Lists

Pages move between three linked lists:

  1. Free pages — available for the producer to write into
  2. Used pages — full of data, queued for transmission
  3. Work page — the single page currently being written to

  Producer (Polling Thread)               Consumer (MQTT Thread)
            │
            ▼
      ┌──────────┐                        ┌──────────┐
      │Work Page │───── When full ───────►│Used Pages│──► MQTT Publish
      │(writing) │                        │(queued)  │
      └──────────┘                        └──────────┘
            ▲                                  │
            │            When delivered        │
            │◄─────────────────────────────────┘
      ┌──────────┐
      │Free Pages│
      │(empty)   │
      └──────────┘
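
The page header and the three lists can be sketched in Python. This is a minimal illustration under stated assumptions, not the gateway's actual code: `Page`, `PagePool`, and the `end_p` field are hypothetical names, byte offsets into a `bytearray` stand in for raw pointers, and `collections.deque` replaces the intrusive `next` pointer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class Page:
    number: int   # page number, for logging and debugging
    start_p: int  # offset of the first writable byte in the shared block
    write_p: int  # offset where the next write lands
    read_p: int   # offset of the next byte to transmit
    end_p: int    # one past the last byte of the page (assumed field)


class PagePool:
    def __init__(self, total_bytes: int, page_size: int) -> None:
        # One contiguous block, allocated once at startup.
        self.memory = bytearray(total_bytes)
        self.free: Deque[Page] = deque()    # available to the producer
        self.used: Deque[Page] = deque()    # full, queued for transmission
        self.work: Optional[Page] = None    # the page currently being written
        for n in range(total_bytes // page_size):
            start = n * page_size
            self.free.append(Page(n, start, start, start, start + page_size))
```

Every page starts on the free list with `write_p` and `read_p` equal to `start_p`, which matches the all-FREE layout shown at startup.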

The Producer Path

When the polling thread has a new batch of tag values to store:

  1. Check the work page — if there's no current work page, grab one from the free list
  2. Calculate space — check if the new data fits in the remaining space on the work page
  3. If it fits — write the data (with a size header) and advance write_p
  4. If it doesn't fit — move the work page to the used list, grab a new page (from free, or steal the oldest from used if free is empty), and write there
  5. After writing — check if there's data ready to transmit and kick the consumer

The critical detail: if the free list is empty, the producer steals the oldest used page. This means during extended outages, the buffer wraps around and overwrites the oldest data — exactly the behavior you want. Recent data is more valuable than stale data in industrial monitoring.
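
Here is a rough single-threaded sketch of that producer path, including the steal-the-oldest rule. The names (`RingBuffer`, `_take_page`, `dropped_pages`) and the tiny 64-byte page size are assumptions chosen to make the wrap-around easy to see. The mutex and the "kick the consumer" step are left out.

```python
from collections import deque
from typing import Deque, Optional

PAGE_SIZE = 64  # bytes per page; deliberately tiny for the demo


class Page:
    def __init__(self, number: int) -> None:
        self.number = number
        self.data = bytearray()  # stands in for the bytes between start_p and write_p


class RingBuffer:
    def __init__(self, page_count: int) -> None:
        self.free: Deque[Page] = deque(Page(n) for n in range(page_count))
        self.used: Deque[Page] = deque()
        self.work: Optional[Page] = None
        self.dropped_pages = 0

    def _take_page(self) -> Page:
        if self.free:
            return self.free.popleft()
        # Free list is empty: steal the oldest used page and discard its data.
        page = self.used.popleft()
        page.data.clear()
        self.dropped_pages += 1
        return page

    def write(self, payload: bytes) -> None:
        record = len(payload).to_bytes(4, "big") + payload  # size header + body
        if len(record) > PAGE_SIZE:
            raise ValueError("payload larger than a page")
        if self.work is None:
            self.work = self._take_page()
        if len(self.work.data) + len(record) > PAGE_SIZE:
            # Does not fit: queue the work page and start a fresh one.
            self.used.append(self.work)
            self.work = self._take_page()
        self.work.data += record
```

With three pages and two records per page, the seventh write finds the free list empty, so the oldest queued page is recycled and its two records are lost, while the newer pages stay queued in order.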

The Consumer Path

When the MQTT connection is active and there's data to send:

  1. Check the used page list — if empty, check if the work page has unsent data and promote it
  2. Read the next message from the first used page's read_p position
  3. Publish via MQTT with QoS 1
  4. Set a "packet sent" flag — this prevents sending the next message until the current one is acknowledged
  5. Wait for PUBACK — when the broker confirms receipt, advance read_p
  6. If read_p reaches write_p — the page is fully delivered; move it back to the free list
  7. Repeat — grab the next message from the next used page
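
The consumer steps above can be sketched as a small state machine. This is an illustration only: `publish` is a hypothetical callable that sends one message with QoS 1 and returns its packet ID, a list index stands in for `read_p`, and promoting the work page (step 1) is omitted.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class Page:
    def __init__(self, messages: List[bytes]) -> None:
        self.messages = messages  # stands in for the framed bytes on the page
        self.read_i = 0           # index form of read_p


class Consumer:
    def __init__(self, publish: Callable[[bytes], int]) -> None:
        self.publish = publish
        self.used: Deque[Page] = deque()
        self.free: Deque[Page] = deque()
        self.packet_sent = False
        self.in_flight_id: Optional[int] = None

    def try_send(self) -> None:
        # Only one message may be in flight at a time.
        if self.packet_sent or not self.used:
            return
        page = self.used[0]
        self.in_flight_id = self.publish(page.messages[page.read_i])
        self.packet_sent = True

    def on_puback(self, packet_id: int) -> None:
        if not self.packet_sent or packet_id != self.in_flight_id:
            return  # stale or unexpected acknowledgement; ignore it
        page = self.used[0]
        page.read_i += 1              # advance only after confirmation
        self.packet_sent = False
        if page.read_i == len(page.messages):
            # Page fully delivered: recycle it.
            page.messages = []
            page.read_i = 0
            self.free.append(self.used.popleft())
        self.try_send()
```

Because the read position moves only inside `on_puback`, a message that is published but never acknowledged stays at the head of the queue and is sent again later.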

The Mutex Strategy

The entire buffer is protected by a single mutex. This might seem like a bottleneck, but in practice:

  • Write operations (adding data) take microseconds
  • Read operations (preparing to transmit) take microseconds
  • The actual MQTT transmission happens outside the mutex — only the buffer state management is locked

The mutex is held for a few microseconds at a time, never during network I/O. This keeps the polling thread from ever blocking on network latency.

Polling Thread:               MQTT Thread:
  lock(mutex)                   lock(mutex)
  write data to page            read data from page
  check if page full            mark as sent
  maybe promote page            unlock(mutex)
  trigger send check            ─── MQTT publish ───
  unlock(mutex)                 (outside mutex!)
                                lock(mutex)
                                process PUBACK
                                maybe free page
                                unlock(mutex)
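
A minimal Python sketch of the same locking discipline follows. It assumes a single consumer and simplifies the acknowledgement to a synchronous call. `LockedBuffer` and `deliver_one` are hypothetical names. The point it illustrates is that the lock guards only queue state and is released before the (slow) publish call runs.

```python
import threading
from collections import deque
from typing import Callable, Deque, Optional


class LockedBuffer:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._messages: Deque[bytes] = deque()

    def put(self, message: bytes) -> None:
        with self._lock:                      # producer: brief critical section
            self._messages.append(message)

    def peek(self) -> Optional[bytes]:
        with self._lock:                      # consumer: read state, then release
            return self._messages[0] if self._messages else None

    def ack(self) -> None:
        with self._lock:                      # consumer: advance after delivery
            if self._messages:
                self._messages.popleft()


def deliver_one(buffer: LockedBuffer, publish: Callable[[bytes], None]) -> bool:
    message = buffer.peek()
    if message is None:
        return False
    publish(message)   # network I/O happens here, with no lock held
    buffer.ack()
    return True
```

While `publish` is running, the producer can still call `put` without waiting, which is the property the polling thread depends on.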

Message Framing Inside Pages

Each page holds multiple messages packed sequentially. Each message has a simple header:

┌──────────────┬──────────────┬─────────────────────┐
│ Message ID │ Message Size │ Message Body │
│ (4 bytes) │ (4 bytes) │ (variable) │
└──────────────┴──────────────┴─────────────────────┘

The Message ID field is initially zero. When the MQTT library publishes the message, the gateway fills in the packet ID that the client library assigned to that PUBLISH (in MQTT the sender chooses the packet identifier, not the broker). This is how the consumer tracks which specific message was acknowledged: when the PUBACK callback fires with a packet ID, it can match it to the message at read_p and advance.

This framing makes the buffer self-describing. During recovery after a restart, the gateway can scan page contents by reading size headers sequentially.
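
As a sketch of that framing, the snippet below packs and scans records with Python's `struct` module. The big-endian byte order and the names `frame` and `scan_page` are assumptions; the source only specifies the two 4-byte header fields.

```python
import struct
from typing import List, Tuple

# Message ID (4 bytes) + Message Size (4 bytes); byte order is an assumption.
HEADER = struct.Struct(">II")


def frame(body: bytes, message_id: int = 0) -> bytes:
    """Prefix a message body with its ID and size."""
    return HEADER.pack(message_id, len(body)) + body


def scan_page(page: bytes) -> List[Tuple[int, bytes]]:
    """Walk the size headers and return (message_id, body) pairs."""
    messages = []
    offset = 0
    while offset + HEADER.size <= len(page):
        message_id, size = HEADER.unpack_from(page, offset)
        body_start = offset + HEADER.size
        if body_start + size > len(page):
            break  # truncated record: stop rather than read past the data
        messages.append((message_id, page[body_start:body_start + size]))
        offset = body_start + size
    return messages
```

In a real buffer the scan would be limited to the bytes between `start_p` and `write_p`, so unused space at the end of a page is never interpreted as records.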

Handling Disconnections Gracefully

When the MQTT connection drops, the consumer thread must handle it without corrupting the buffer:

Connection Lost:
1. Set connected = 0
2. Clear "packet sent" flag
3. Do NOT touch any page pointers

That's it. The producer keeps writing — it doesn't know or care about the connection state. The buffer absorbs data normally.

When the connection recovers:

Connection Restored:
1. Set connected = 1
2. Trigger send check (under mutex)
3. Consumer picks up where it left off

The key insight: read_p only advances on a PUBACK, so nothing is lost across a disconnect. If a PUBLISH was in flight when the connection dropped, its PUBACK never arrived. The disconnection handler clears the "packet sent" flag, and when the connection recovers the consumer re-reads the same message from read_p (which was never advanced) and re-publishes it. The broker receives it either for the first time or as a duplicate. QoS 1 is at-least-once and does not deduplicate, so downstream consumers should tolerate repeats, for example by keying on the device serial number and timestamp.
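
The handlers are small enough to sketch in full. The `Link` class below is hypothetical and records every PUBLISH in a list so the re-send after a reconnect is visible; a list index stands in for `read_p`.

```python
from typing import List


class Link:
    def __init__(self, messages: List[bytes]) -> None:
        self.messages = messages
        self.read_i = 0                    # index form of read_p
        self.connected = False
        self.packet_sent = False
        self.published: List[bytes] = []   # every PUBLISH, recorded for the demo

    def send_check(self) -> None:
        if (self.connected and not self.packet_sent
                and self.read_i < len(self.messages)):
            self.published.append(self.messages[self.read_i])
            self.packet_sent = True

    def on_connect(self) -> None:
        self.connected = True
        self.send_check()

    def on_disconnect(self) -> None:
        self.connected = False
        self.packet_sent = False           # read_i is deliberately left alone

    def on_puback(self) -> None:
        if self.packet_sent:
            self.read_i += 1
            self.packet_sent = False
            self.send_check()
```

If the link drops after the first PUBLISH and before its PUBACK, the first message is published a second time on reconnect, which is the at-least-once behavior QoS 1 promises.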

Binary vs. JSON Batch Encoding

The telemetry data written into the buffer can be encoded in two formats, and the choice affects both bandwidth and reliability.

JSON Format

Each batch is a JSON object containing groups of timestamped values:

{
"groups": [
{
"ts": 1709424000,
"device_type": 1017,
"serial_number": 123456,
"values": [
{"id": 80, "values": [725]},
{"id": 81, "values": [680]},
{"id": 82, "values": [285]}
]
}
]
}

Pros: Human-readable, easy to debug, parseable by any language. Cons: 5-8× larger than binary, float precision loss (decimal representation), size estimation is rough.

Binary Format

A compact binary encoding with a header byte (0xF7), followed by big-endian packed groups:

F7                     ← Header
00 00 00 01            ← Number of groups (1)
65 E3 BD 80            ← Timestamp (Unix epoch, 1709424000)
03 F9                  ← Device type (1017)
00 01 E2 40            ← Serial number (123456)
00 00 00 03            ← Number of values (3)
00 50 00 01 02 02 D5   ← Tag 80: status=0, 1 value, 2 bytes, 725
00 51 00 01 02 02 A8   ← Tag 81: status=0, 1 value, 2 bytes, 680
00 52 00 01 02 01 1D   ← Tag 82: status=0, 1 value, 2 bytes, 285

Pros: 5-8× smaller, perfect float fidelity (raw bytes preserved), exact size calculation. Cons: Requires matching decoder on the cloud side, harder to debug without tools.
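
To make the layout concrete, here is a rough encoder for the JSON batch shown earlier. The field widths are inferred from the annotated dump, and the sketch assumes every value is an unsigned 16-bit integer with status 0. `encode_batch` is a hypothetical name, not a machineCDN API.

```python
import struct


def encode_batch(groups):
    """Encode a list of group dicts into the big-endian binary layout."""
    out = bytearray([0xF7])                       # header byte
    out += struct.pack(">I", len(groups))         # number of groups
    for group in groups:
        out += struct.pack(">IHI",
                           group["ts"],            # Unix timestamp
                           group["device_type"],   # 2-byte device type
                           group["serial_number"]) # 4-byte serial number
        out += struct.pack(">I", len(group["values"]))
        for tag in group["values"]:
            values = tag["values"]
            # tag id (2 bytes), status (1), value count (1), bytes per value (1)
            out += struct.pack(">HBBB", tag["id"], 0, len(values), 2)
            for value in values:
                out += struct.pack(">H", value)   # assumed 16-bit unsigned
    return bytes(out)


batch = [{
    "ts": 1709424000,
    "device_type": 1017,
    "serial_number": 123456,
    "values": [
        {"id": 80, "values": [725]},
        {"id": 81, "values": [680]},
        {"id": 82, "values": [285]},
    ],
}]
encoded = encode_batch(batch)
```

For this three-tag group the encoded batch is 40 bytes, and its length is known exactly before encoding, which is what makes precise buffer-space checks possible.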

For gateways communicating over cellular connections (common in remote facilities like water treatment plants, oil wells, or distributed renewable energy sites), binary encoding is essentially mandatory. A gateway polling 100 tags every 10 seconds generates about 260 MB/month in JSON versus 35 MB/month in binary. Typical IoT cellular rates run $0.50-$2.00/MB; even at the low end of $0.50/MB, that's the difference between $130/month and about $17/month per gateway.

The MQTT Watchdog Pattern

MQTT connections can enter a zombie state — technically connected according to the TCP stack, but the broker has stopped responding. This is especially common behind industrial firewalls and NAT devices with aggressive connection timeout policies.

The Problem

The MQTT library reports the connection as alive. The gateway publishes messages. No PUBACK comes back — ever. The buffer fills up because the consumer thinks each message is "in flight" (the packet_sent flag is set). Eventually the buffer wraps and data loss begins.

The Solution: Last-Delivered Timestamp

Track the timestamp of the last successful PUBACK. If more than N seconds have passed since the last acknowledged delivery, and there are messages waiting to be sent, the connection is stale:

monitor_watchdog():
    if connected AND packet_sent:
        elapsed = now - last_delivered_packet_timestamp
        if elapsed > WATCHDOG_THRESHOLD:
            // Force disconnect and reconnect
            force_disconnect()
            // Disconnection handler clears packet_sent
            // Reconnection handler will re-deliver from read_p

A typical threshold is 60 seconds for LAN connections and 120 seconds for cellular. This catches zombie connections that the TCP stack and MQTT keep-alive miss.
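
The staleness check itself is a one-line predicate. The sketch below uses hypothetical names (`LinkState`, `connection_is_stale`) and takes the current time as an argument so it can be driven from a monotonic clock such as `time.monotonic()`.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    connected: bool = False
    packet_sent: bool = False      # a PUBLISH is waiting for its PUBACK
    last_delivered: float = 0.0    # monotonic time of the last PUBACK


def connection_is_stale(state: LinkState, now: float,
                        threshold_s: float = 60.0) -> bool:
    """True when a message is in flight and no PUBACK has arrived in time."""
    return (state.connected
            and state.packet_sent
            and (now - state.last_delivered) > threshold_s)
```

An idle link with nothing in flight never trips the watchdog, no matter how long ago the last delivery was.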

Reconnection with a Fixed Delay

When the watchdog (or a genuine disconnection) triggers a reconnect, use a dedicated thread for the connection attempt. The connect_async call can block for the TCP timeout duration (potentially 30+ seconds), and you don't want that blocking the main loop or the polling thread.

A semaphore controls the reconnection thread:

Main Thread:                     Reconnection Thread:
  Detects need to reconnect        (blocked on semaphore)
  Posts semaphore ───────────────► Wakes up
                                   Calls connect_async()
                                   (may block 30s)
                                   Success or failure
  Waits for "done" ◄────────────── Posts "done" semaphore
  Checks result

The reconnect delay should be fixed and short (5 seconds is typical) for industrial applications, not exponential backoff. In a factory, the network outage either resolves quickly (a transient) or it's a hard failure that needs human intervention. Exponential backoff just delays reconnection after the network recovers.
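
A minimal sketch of the two-semaphore handshake using Python's `threading` module is shown below. `Reconnector` and the `connect` callable are hypothetical: `connect` stands in for whatever blocking call the MQTT library provides, and the fixed retry delay is left to the caller.

```python
import threading
from typing import Callable, Optional


class Reconnector:
    def __init__(self, connect: Callable[[], bool]) -> None:
        self._connect = connect                  # may block for the TCP timeout
        self._request = threading.Semaphore(0)   # main thread -> worker
        self._done = threading.Semaphore(0)      # worker -> main thread
        self.result: Optional[bool] = None
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        while True:
            self._request.acquire()              # sleep until asked to reconnect
            try:
                self.result = self._connect()
            except OSError:
                self.result = False
            self._done.release()                 # signal completion

    def request(self) -> None:
        """Ask the worker to attempt a connection; returns immediately."""
        self._request.release()

    def wait_done(self, timeout: float) -> bool:
        """Wait up to `timeout` seconds for the attempt to finish."""
        return self._done.acquire(timeout=timeout)
```

The main loop can call `wait_done` with a short timeout on each pass, so a connect attempt that hangs for 30 seconds never stalls polling or watchdog handling.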

Batching Strategy: Size vs. Time

Telemetry batches should be finalized and queued for transmission based on whichever threshold hits first: size or time.

Size-Based Finalization

When the accumulated batch data exceeds a configured maximum (typically 4-500 KB for JSON, 50-100 KB for binary), finalize and queue it. This keeps any single MQTT message under the broker's maximum message size. The network MTU is not the constraint here, since TCP segments large messages transparently.

Time-Based Finalization

When the batch has been collecting data for more than a configured timeout (typically 30-60 seconds), finalize it regardless of size. This ensures that even slowly-changing tags get transmitted within a bounded time window.

The Interaction Between Batching and Buffering

Batching and buffering are separate concerns that interact:

PLC Tags ──► Batch (collecting) ──► Buffer Page (queued) ──► MQTT (transmitted)

             Tag reads accumulate   When batch finalizes,    Pages are transmitted
             in the batch structure the encoded batch goes   one at a time with
                                    into the ring buffer     PUBACK confirmation

A batch contains one or more "groups" — each group is a set of tag values read at the same timestamp. Multiple polling cycles might go into a single batch before it's finalized by size or time. The finalized batch then goes into the ring buffer as a single message.
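
The size-or-time rule can be sketched as a small accumulator. The `Batch` class and its thresholds are hypothetical, and groups are assumed to arrive already encoded as bytes.

```python
from typing import List, Optional


class Batch:
    def __init__(self, max_bytes: int, max_age_s: float) -> None:
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.groups: List[bytes] = []
        self.size = 0
        self.started_at: Optional[float] = None

    def add_group(self, encoded: bytes, now: float) -> None:
        if self.started_at is None:
            self.started_at = now          # the age clock starts with the first group
        self.groups.append(encoded)
        self.size += len(encoded)

    def should_finalize(self, now: float) -> bool:
        if not self.groups:
            return False                   # never send an empty batch
        too_big = self.size >= self.max_bytes
        too_old = (now - self.started_at) >= self.max_age_s
        return too_big or too_old          # whichever threshold hits first

    def finalize(self) -> bytes:
        payload = b"".join(self.groups)    # this becomes one ring-buffer message
        self.groups, self.size, self.started_at = [], 0, None
        return payload
```

The polling loop would call `should_finalize` after each group and hand the result of `finalize` to the ring buffer as a single message.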

Dependent Tag Reads and Atomic Groups

In many PLC configurations, certain tags are only meaningful when read together. For example:

  • Alarm word tags — a uint16 register where each bit represents a different alarm. You read the alarm word, then extract the individual bits. If the alarm word changes, you need to read and deliver the extracted bits atomically with the parent.

  • Machine state transitions — when a "blender running" tag changes from 0 to 1, you might need to immediately read all associated process values (RPM, temperatures, pressures) to capture the startup snapshot.

The architecture handles this through dependent tag chains:

Parent Tag (alarm_word, interval=1s, compare=true)
└── Calculated Tag (alarm_bit_0, shift=0, mask=0x01)
└── Calculated Tag (alarm_bit_1, shift=1, mask=0x01)
└── Dependent Tag (motor_speed, read_on_change=true)
└── Dependent Tag (temperature, read_on_change=true)

When the parent tag changes, the polling thread:

  1. Finalizes the current batch
  2. Recursively reads all dependent tags (forced read, ignoring intervals)
  3. Starts a new batch group with the same timestamp

This ensures that the dependent values are timestamped identically with the trigger event and delivered together.
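
As an illustration of the shift-and-mask step and the shared timestamp, the sketch below builds one group from an alarm word and its dependents. The function names and the `read_dependents` callable are assumptions; in a real gateway that callable would perform the forced PLC reads.

```python
from typing import Callable, Dict


def extract_bit(word: int, shift: int, mask: int = 0x01) -> int:
    """Pull one alarm bit out of a 16-bit alarm word."""
    return (word >> shift) & mask


def build_change_group(ts: int, alarm_word: int,
                       read_dependents: Callable[[], Dict[str, int]]) -> dict:
    """Collect the parent, its calculated bits, and dependents under one timestamp."""
    values = {"alarm_word": alarm_word}
    for shift in (0, 1):
        values[f"alarm_bit_{shift}"] = extract_bit(alarm_word, shift)
    values.update(read_dependents())       # forced read, ignoring intervals
    return {"ts": ts, "values": values}
```

Because everything lands in one group, the cloud side sees the alarm bits and the process snapshot with the same timestamp as the change that triggered them.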

Hourly Full-Read Reset

Change-of-value (COV) filtering dramatically reduces bandwidth, but it means the cloud only sees a tag when the gateway detects a change. It's worth checking how that holds up around transient read errors.

Here's the scenario:

  1. At 10:00:00, tag value = 72.5 → transmitted
  2. At 10:00:01, PLC returns an error for that tag → not transmitted
  3. At 10:00:02, tag value = 73.0 → compared against last successful read (72.5), change detected, transmitted
  4. Even if the value had already changed to 73.0 by 10:00:01 and only the read failed, nothing is lost: the comparison at 10:00:02 is still made against the last successful read (72.5), so the change is caught.

A change is only missed when:

  1. At 10:00:00, tag value = 72.5 → transmitted
  2. The PLC program changes the tag to 73.0 and then back to 72.5 between polling cycles
  3. The gateway never sees 73.0 — it polls at 10:00:00 and 10:00:01 and gets 72.5 both times

For most industrial applications, this sub-second transient is irrelevant. But to guard against drift — where small rounding differences accumulate between the gateway's cached value and the PLC's actual value — a full reset is performed every hour:

Every hour boundary (when the system clock's hour changes):
1. Clear the "read once" flag on every tag
2. Clear all last-known values
3. Force read and transmit every tag regardless of COV

This guarantees that the cloud platform has a complete snapshot of every tag value at least once per hour, even for tags that haven't changed.
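
A minimal sketch of the hourly reset follows. `Tag`, `hour_changed`, and `reset_all_tags` are hypothetical names. One deliberate difference from the description: this version compares the date as well as the hour, so a gap of exactly 24 hours between polls still triggers a reset.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Tag:
    name: str
    read_once: bool = False
    last_value: Optional[float] = None


def hour_changed(previous: datetime, current: datetime) -> bool:
    """True when the two wall-clock times fall in different clock hours."""
    def truncate(t: datetime) -> datetime:
        return t.replace(minute=0, second=0, microsecond=0)
    return truncate(previous) != truncate(current)


def reset_all_tags(tags: List[Tag]) -> None:
    """Forget cached state so the next poll reads and sends every tag."""
    for tag in tags:
        tag.read_once = False
        tag.last_value = None
```

After a reset, the COV check has nothing to compare against, so every tag is treated as changed on the next cycle.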

Putting It All Together: The Polling Loop

Here's the complete polling loop architecture that ties all these patterns together:

main_polling_loop():
    FOREVER:
        now_mono = monotonic_clock()      // for interval timing
        now_wall = unix_timestamp()       // for timestamps and the hourly check

        FOR each configured device:
            // Hourly reset check (wall clock; a monotonic clock has no hour)
            if hour(now_wall) != hour(device.last_poll_wall):
                reset_all_tags(device)
            device.last_poll_wall = now_wall

            // Start a new batch group
            start_group(device.batch, now_wall)

            FOR each tag in device.tags:
                // Skip tags that are not due yet
                if tag.read_once AND elapsed(tag.last_read, now_mono) < tag.interval:
                    continue

                value, status = read_tag(device, tag)

                if status == LINK_ERROR:
                    set_link_state(device, DOWN)
                    break  // Stop reading this device

                set_link_state(device, UP)

                // COV check (a first read always counts as a change)
                changed = NOT tag.read_once
                          OR value != tag.last_value
                          OR status != tag.last_status

                // Update tracking before any skip, so the interval restarts
                tag.last_value = value
                tag.last_status = status
                tag.read_once = true
                tag.last_read = now_mono

                if tag.compare AND NOT changed:
                    continue  // No change, skip

                // Deliver value
                if tag.do_not_batch:
                    deliver_immediately(device, tag, value)
                else:
                    add_to_batch(device.batch, tag, value)

                // Check dependent tags
                if changed AND tag.has_dependents:
                    finalize_batch()
                    read_dependents(device, tag)
                    start_new_group()

            // Finalize batch group
            stop_group(device.batch, output_buffer)
            // ↑ This checks size/time thresholds and may
            //   queue the batch into the ring buffer

        sleep(polling_interval)

Performance Characteristics

On a typical industrial edge gateway (ARM Cortex-A9, 512 MB RAM, Linux):

| Operation | Time | Notes |
|---|---|---|
| Mutex lock/unlock | ~1 µs | Per buffer operation |
| Modbus TCP read (10 registers) | 5-15 ms | Network dependent |
| Modbus RTU read (10 registers) | 20-50 ms | Baud rate dependent (9600-115200) |
| EtherNet/IP tag read | 2-8 ms | CIP overhead |
| JSON batch encoding | 0.5-2 ms | 100 tags |
| Binary batch encoding | 0.1-0.5 ms | 100 tags |
| MQTT publish (QoS 1) | 1-5 ms | LAN broker |
| Buffer page write | 5-20 µs | memcpy only |

The bottleneck is always the PLC protocol reads, not the buffer or transmission logic. A gateway polling 200 Modbus TCP tags in contiguous 10-register reads issues 20 requests, roughly 100-300 ms per cycle at the latencies above, which still leaves headroom within a 1-second polling interval.

For Modbus RTU (serial), the bottleneck shifts to the baud rate. At 9600 baud, a single register read takes ~15 ms including response. Polling 50 registers individually would take 750 ms, too close to a 1-second interval. This is why contiguous register grouping matters: reading 50 consecutive registers in a single request puts about 113 bytes on the wire (8 for the request, 105 for the response), which is roughly 120 ms at 9600 baud, around a 6× improvement.

How machineCDN Implements These Patterns

machineCDN's edge gateway uses exactly these patterns — paged ring buffers with mutex-protected page management, QoS 1 MQTT with PUBACK-based buffer advancement, and both binary and JSON encoding depending on the deployment's bandwidth constraints.

The platform's gateway daemon runs on Linux-based edge hardware (including cellular routers like the Teltonika RUT series) and handles simultaneous Modbus RTU, Modbus TCP, and EtherNet/IP connections to mixed-vendor equipment. The buffer is sized during commissioning based on the expected outage duration — a 64 KB buffer holds roughly 4 hours of data at typical polling rates; a 512 KB buffer extends that to over 24 hours.

The result: plants running machineCDN don't lose telemetry during network outages. When connectivity recovers, the buffered data drains automatically and fills in the gaps in trending charts and analytics — no manual intervention, no missing data points.

Key Takeaways

  1. Use paged ring buffers, not unbounded queues — fixed memory, graceful overflow (oldest data dropped first)
  2. Protect buffer operations with a mutex, but never hold it during network I/O — microsecond lock durations keep producers and consumers non-blocking
  3. Track PUBACK per-message to prevent double-sending and enable reliable buffer advancement
  4. Implement a MQTT watchdog using last-delivery timestamps to catch zombie connections
  5. Batch by size OR time (whichever hits first) to balance bandwidth and latency
  6. Reset all tags hourly to guarantee complete snapshots and prevent drift
  7. Binary encoding saves 5-8× bandwidth with zero precision loss — essential for cellular-connected gateways
  8. Group contiguous Modbus registers into single requests: several times faster than individual reads on RTU

Building a reliable IIoT edge gateway is fundamentally a systems programming challenge. The protocols, the buffering, the concurrency — each one is manageable alone, but getting them all right together, on constrained hardware, with zero tolerance for data loss, is what separates toy prototypes from production infrastructure.


See machineCDN's store-and-forward buffering in action with real factory data. Request a demo to explore the platform.