37 posts tagged with "Edge Computing"

Edge computing for industrial data processing

Industrial Data Normalization: Byte Ordering, Register Formats, and Scaling Factors for IIoT [2026]

· 15 min read

Every IIoT engineer eventually hits the same wall: the PLC says the temperature is 16,742, the HMI shows 167.42°C, and your cloud dashboard displays -8.2×10⁻³⁹. Same data, three different interpretations. The problem isn't the network, the database, or the visualization layer — it's data normalization at the edge.

Getting raw register values from industrial devices into correctly typed, properly scaled, human-readable data points is arguably the most underappreciated challenge in IIoT. This guide covers the byte-level mechanics that trip up engineers daily: endianness, register encoding schemes, floating-point reconstruction, and the scaling math that transforms a raw uint16 into a meaningful process variable.

Data normalization and byte ordering in industrial systems

Why This Is Harder Than It Looks

Modern IT systems have standardized on little-endian byte ordering (x86, ARM in LE mode), IEEE 754 floating point, and UTF-8 strings. Industrial devices come from a different world:

  • Modbus uses big-endian (network byte order) for 16-bit registers, but the ordering of registers within a 32-bit value varies by manufacturer
  • EtherNet/IP uses little-endian internally (Allen-Bradley heritage), but CIP encapsulation follows specific rules per data type
  • PROFINET uses big-endian for I/O data
  • OPC-UA handles byte ordering transparently — one of its few genuinely nice features

When your edge gateway reads data from a Modbus device and publishes it via MQTT to a cloud platform, you're potentially crossing three byte-ordering boundaries. Get any one of them wrong and your data is silently corrupt.

The Modbus Register Map Problem

Modbus organizes data into four register types, each accessed by a different function code:

Address Range      Register Type              Function Code   Access       Data Width
0–65,535           Coils (discrete outputs)   FC 01           Read         1-bit
100,000–165,535    Discrete Inputs            FC 02           Read         1-bit
300,000–365,535    Input Registers            FC 04           Read-only    16-bit
400,000–465,535    Holding Registers          FC 03           Read/Write   16-bit

The address ranges are a convention, not a protocol requirement. Your gateway needs to map addresses to function codes:

  • Addresses 0–65,535 → FC 01 (Read Coils)
  • Addresses 100,000–165,535 → FC 02 (Read Discrete Inputs)
  • Addresses 300,000–365,535 → FC 04 (Read Input Registers)
  • Addresses 400,000–465,535 → FC 03 (Read Holding Registers)

The actual register address sent in the Modbus PDU is the offset within the range. So address 400,100 becomes register 100 using function code 03.
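In code, this mapping is a straightforward range check. A minimal sketch in Python (the helper name and range constants follow the convention described above, not any particular library's API):

def resolve_modbus_address(address: int) -> tuple[int, int]:
    """Map a conventional 6-digit Modbus address to (function code, PDU offset)."""
    if 0 <= address <= 65_535:
        return 1, address                # FC 01: coils
    if 100_000 <= address <= 165_535:
        return 2, address - 100_000      # FC 02: discrete inputs
    if 300_000 <= address <= 365_535:
        return 4, address - 300_000      # FC 04: input registers
    if 400_000 <= address <= 465_535:
        return 3, address - 400_000      # FC 03: holding registers
    raise ValueError(f"address {address} outside all conventional ranges")

So resolve_modbus_address(400_100) yields (3, 100) — function code 03, register offset 100 — matching the example above.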

Why this matters for normalization: A tag configured with address 300,800 means "read input register 800 using FC 04." A tag at address 400,520 means "read holding register 520 using FC 03." If your gateway mixes these up, it reads the wrong register type entirely — and the PLC happily returns whatever lives at that address, with no type error.

Reading Coils vs Registers: Type Coercion

When reading coils (FC 01/02), the response contains bit-packed data — each coil is a single bit. When reading registers (FC 03/04), each register is a 16-bit word.

The tricky part is mapping these raw responses to typed tag values. Consider a tag configured as uint16 that's being read from a coil address. The raw response is a single bit (0 or 1), but the tag expects a 16-bit value. Your gateway must handle this coercion:

Coil response → bool tag:   bit value directly
Coil response → uint8 tag:  cast to uint8
Coil response → uint16 tag: cast to uint16
Coil response → int32 tag:  cast to int32 (effectively 0 or 1)

For register responses, the mapping depends on the element count — how many consecutive registers are combined to form the value:

1 register (elem_count=1):
  → uint16: direct value
  → int16:  interpret as signed
  → uint8:  mask with 0xFF (lower byte)
  → bool:   mask with 0xFF, then boolean

2 registers (elem_count=2):
  → uint32: combine two 16-bit registers
  → int32:  interpret combined value as signed
  → float:  interpret combined value as IEEE 754

The 32-Bit Register Combination Problem

Here's where manufacturers diverge and data corruption begins. A 32-bit value (integer or float) spans two consecutive 16-bit Modbus registers. But which register contains the high word?

Word Order Variants

Big-endian word order (AB CD): Register N contains the high word, register N+1 contains the low word.

Register[N]   = 0x4248    (high word)
Register[N+1] = 0x0000    (low word)
Combined      = 0x42480000
As float      = 50.0

Little-endian word order (CD AB): Register N contains the low word, register N+1 contains the high word.

Register[N]   = 0x0000    (low word)
Register[N+1] = 0x4248    (high word)
Combined      = 0x42480000
As float      = 50.0

Byte-swapped big-endian (BA DC): Each register's bytes are swapped, then combined in big-endian order.

Register[N]   = 0x4842    (swapped high)
Register[N+1] = 0x0000    (swapped low)
Combined      = 0x42480000
As float      = 50.0

Byte-swapped little-endian (DC BA): Each register's bytes are swapped, then combined in little-endian order.

Register[N]   = 0x0000    (swapped low)
Register[N+1] = 0x4842    (swapped high)
Combined      = 0x42480000
As float      = 50.0

All four combinations are found in the wild. Schneider PLCs typically use big-endian word order. Some Siemens devices use byte-swapped variants. Many Chinese-manufactured VFDs (variable frequency drives) use little-endian word order. There is no way to detect the word order automatically — you must know it from the device documentation or determine it empirically.

Practical Detection Technique

When commissioning a new device and the word order isn't documented:

  1. Find a register that should contain a known float value (like a temperature reading you can verify with a handheld thermometer)
  2. Read two consecutive registers and try all four combinations
  3. The one that produces a physically reasonable value is your word order

For example, if the device reads temperature and the registers contain 0x4220 and 0x0000:

  • AB CD: 0x42200000 = 40.0 ← probably correct if room temp
  • CD AB: 0x00004220 ≈ 2.4×10⁻⁴¹ ← nonsense
  • BA DC: 0x20420000 ≈ 1.6×10⁻¹⁹ ← nonsense
  • DC BA: 0x00002042 ≈ 1.1×10⁻⁴¹ ← nonsense

IEEE 754 Floating-Point Reconstruction

Reading a float from Modbus registers requires careful reconstruction. The standard approach:

Given: Register[N] = high_word, Register[N+1] = low_word (big-endian word order)

Step 1: Combine into 32 bits
  uint32 combined = (high_word << 16) | low_word

Step 2: Reinterpret as IEEE 754 float
  float value = *(float*)&combined   // C-style type punning (memcpy into a
                                     // float is the strict-aliasing-safe form)
  // Or use modbus_get_float() from libmodbus

The critical detail: do not cast the integer to float — that performs a numeric conversion. You need to reinterpret the same bit pattern as a float. This is the difference between getting 50.0 (correct) and getting 1112014848.0 (the integer 0x42480000 converted to float).
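In Python, struct makes the distinction explicit — pack the integer's bits, then unpack those same bytes as a float:

import struct

raw = 0x42480000

# Reinterpretation: the same 32-bit pattern, read as an IEEE 754 float
reinterpreted = struct.unpack(">f", struct.pack(">I", raw))[0]
assert reinterpreted == 50.0

# Numeric conversion: the integer's *value* converted to a float
converted = float(raw)
assert converted == 1112014848.0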

Common Float Pitfalls

NaN and Infinity: IEEE 754 reserves certain bit patterns for special values. If your combined registers produce 0x7FC00000, that's NaN. If you see 0x7F800000, that's positive infinity. These often appear when:

  • The sensor is disconnected (NaN)
  • The measurement is out of range (Infinity)
  • The registers are being read during a PLC scan update (race condition producing a half-updated value)

Denormalized numbers: Very small float values (< 1.175×10⁻³⁸) are "denormalized" and may lose precision. In industrial contexts, if you're seeing numbers this small, something is wrong with your byte ordering.

Zero detection: A float value of exactly 0.0 is 0x00000000. But 0x80000000 is negative zero (-0.0). Both compare equal in standard float comparison, but the bit patterns are different. If you're doing bitwise comparison for change detection, be aware of this edge case.

Scaling Factors: From Raw to Engineering Units

Many industrial devices don't transmit floating-point values. Instead, they send raw integers that must be scaled to engineering units. This is especially common with:

  • Temperature transmitters (raw: 0–4000 → scaled: 0–100°C)
  • Pressure sensors (raw: 0–65535 → scaled: 0–250 PSI)
  • Flow meters (raw: counts/second → scaled: gallons/minute)

Linear Scaling

The most common pattern is linear scaling with two coefficients:

engineering_value = (raw_value × k1) / k2

Where k1 and k2 are integer scaling coefficients defined in the tag configuration. This avoids floating-point math on resource-constrained edge devices.

Examples:

  • Temperature: k1=1, k2=10 → raw 1675 becomes 167.5°C
  • Pressure: k1=250, k2=65535 → raw 32768 becomes 125.0 PSI
  • RPM: k1=1, k2=1 → raw value is direct (no scaling)

Important: k2 must never be zero. Always validate configuration before applying scaling — a division-by-zero in an edge gateway's main loop crashes the entire data acquisition pipeline.
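A minimal scaling helper with that validation might look like the following sketch (names are illustrative):

def apply_linear_scaling(raw_value: int, k1: int, k2: int) -> float:
    """Scale a raw register value to engineering units: (raw × k1) / k2."""
    if k2 == 0:
        # Reject bad configuration instead of crashing the acquisition loop.
        raise ValueError("scaling coefficient k2 must be non-zero")
    return (raw_value * k1) / k2

With k1=1, k2=10 a raw reading of 1675 becomes 167.5; with k1=250, k2=65535 a raw 32768 comes out at roughly 125.0 PSI, matching the examples above.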

Bit Extraction (Calculated Tags)

Some devices pack multiple boolean values into a single register. A 16-bit "status word" might contain:

Bit 0: Motor Running
Bit 1: Fault Active
Bit 2: High Temperature
Bit 3: Low Pressure
Bits 4-7: Operating Mode
Bits 8-15: Reserved

Extracting individual values requires bitwise operations:

motor_running = (status_word >> 0) & 0x01    // shift=0, mask=1
fault_active  = (status_word >> 1) & 0x01    // shift=1, mask=1
op_mode       = (status_word >> 4) & 0x0F    // shift=4, mask=15

In a well-designed edge gateway, these "calculated tags" are defined as children of the parent register tag. When the parent register value changes, the gateway automatically recalculates all child tags and delivers their values. This eliminates redundant register reads — you read the status word once and derive multiple data points.
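A sketch of how such calculated tags might be derived from a single parent read — the (shift, mask) triples mirror the status-word layout above; the names are illustrative, not any particular gateway's schema:

# Hypothetical child-tag definitions for the status word above
STATUS_WORD_CHILDREN = [
    ("motor_running", 0, 0x01),
    ("fault_active",  1, 0x01),
    ("high_temp",     2, 0x01),
    ("low_pressure",  3, 0x01),
    ("op_mode",       4, 0x0F),
]

def extract_calculated_tags(status_word: int) -> dict:
    """Derive all child tag values from one parent register read."""
    return {name: (status_word >> shift) & mask
            for name, shift, mask in STATUS_WORD_CHILDREN}

One register read thus fans out into five data points, recalculated only when the parent value changes.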

Dependent Tag Chains

Beyond simple bit extraction, production systems use dependent tag chains: when tag A changes, immediately read tags B, C, and D regardless of their normal polling interval.

Example: When machine_state transitions from 0 (IDLE) to 1 (RUNNING), immediately read:

  • Current speed setpoint
  • Actual motor RPM
  • Material temperature
  • Batch counter

This captures the complete state snapshot at the moment of transition, which is far more valuable than catching each value at its own polling interval (where you might see the new speed 5 seconds after the state change).

The key architectural insight: tag dependencies form a directed acyclic graph. The edge gateway must traverse this graph depth-first on each parent change, reading and delivering dependent tags within the same batch timestamp for temporal coherence.
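A minimal sketch of that traversal, assuming a simple parent → children mapping (tag names are illustrative):

# Dependency edges: parent -> list of dependent tags to read on change
DEPENDENCIES = {
    "machine_state": ["speed_setpoint", "motor_rpm"],
    "motor_rpm": ["batch_counter"],
}

def tags_to_read_on_change(parent: str) -> list:
    """Collect all dependent tags depth-first so reads share one batch timestamp."""
    ordered, seen = [], set()
    def dfs(tag):
        for child in DEPENDENCIES.get(tag, []):
            if child not in seen:      # DAG guard: never visit a tag twice
                seen.add(child)
                ordered.append(child)
                dfs(child)
    dfs(parent)
    return ordered

The returned list is the read plan for one coherent batch; the seen set is what keeps a (mis)configured cycle from looping forever.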

Binary Serialization for Bandwidth Efficiency

Once values are normalized, they need to be serialized for transport. Two common formats:

JSON (Human-Readable)

{
  "groups": [{
    "ts": 1709510400,
    "device_type": 1011,
    "serial_number": 12345,
    "values": [
      {"id": 1, "values": [167.5]},
      {"id": 2, "values": [true]},
      {"id": 3, "values": [1250, 1248, 1251, 1249, 1250, 1252]}
    ]
  }]
}

Binary (Bandwidth-Optimized)

A compact binary format packs the same data into roughly 20–30% of the JSON size:

Byte 0:    0xF7 (frame identifier)
Bytes 1-4: Number of groups (uint32, big-endian)

Per group:
  4 bytes: Timestamp (uint32)
  2 bytes: Device type (uint16)
  4 bytes: Serial number (uint32)
  4 bytes: Number of values (uint32)

Per value:
  2 bytes: Tag ID (uint16)
  1 byte:  Status (0x00 = OK, else error code)
  If status == OK:
    1 byte:  Array size (number of elements)
    1 byte:  Element size (1, 2, or 4 bytes)
    N bytes: Packed values, each big-endian

Value packing examples:

bool:   true   → 0x01                 (1 byte)
bool:   false  → 0x00                 (1 byte)
int16:  55     → 0x00 0x37            (2 bytes, big-endian)
int16:  -55    → 0xFF 0xC9            (2 bytes, two's complement)
uint16: 32768  → 0x80 0x00
int32:  55     → 0x00 0x00 0x00 0x37
float:  1.55   → 0x3F 0xC6 0x66 0x66  (IEEE 754)
float:  -1.55  → 0xBF 0xC6 0x66 0x66

Note the byte ordering in the serialization format: values are packed big-endian (MSB first) regardless of the source device's native byte ordering. The edge gateway normalizes byte order during serialization, so the cloud consumer never needs to worry about endianness.
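A per-value encoder for the layout above can be sketched with Python's struct module (the function name is illustrative, and float packing is omitted for brevity):

import struct

def pack_value(tag_id: int, status: int, values: list,
               elem_size: int, signed: bool = False) -> bytes:
    """Pack one value record per the frame layout above, all big-endian."""
    out = struct.pack(">HB", tag_id, status)
    if status != 0x00:
        return out                     # error status carries no payload
    out += struct.pack(">BB", len(values), elem_size)
    fmt_map = {1: "b", 2: "h", 4: "i"} if signed else {1: "B", 2: "H", 4: "I"}
    for v in values:
        out += struct.pack(">" + fmt_map[elem_size], v)
    return out

For example, int16 -55 on tag 1 packs to 00 01 00 01 02 FF C9 — tag ID, OK status, one element of two bytes, then the two's-complement payload from the table above.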

Register Grouping and Read Optimization

Modbus allows reading up to 125 consecutive registers in a single request (FC 03/04). A naive implementation sends one request per tag — reading 50 tags requires 50 round trips, each with its own Modbus frame overhead and inter-frame delay.

A well-optimized gateway groups tags by:

  1. Same function code — Tags addressed at 400,100 and 300,100 cannot be grouped (different FC)
  2. Contiguous addresses — Tags at addresses 400,100 and 400,101 can be read in one request
  3. Same polling interval — Tags with different intervals should be in separate groups to avoid reading slow-interval tags too frequently
  4. Maximum register count — Cap at ~50 registers per request to stay well within Modbus limits and avoid timeout issues with slower devices

The algorithm: sort tags by address, then walk the sorted list. Start a new group when:

  • The function code changes
  • The address is not contiguous with the previous tag
  • The polling interval differs
  • The accumulated register count exceeds the maximum

After each group read, insert a brief pause (50ms is typical) before the next read. This prevents overwhelming slow Modbus devices that need time between transactions to process their internal scan.
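The grouping walk can be sketched as follows (the tag dict shape and field names are illustrative, not any particular gateway's API):

MAX_REGISTERS_PER_GROUP = 50

def group_tags(tags: list) -> list:
    """Walk tags sorted by (fc, addr); start a new group on any break condition."""
    groups, current, count = [], [], 0
    for tag in sorted(tags, key=lambda t: (t["fc"], t["addr"])):
        start_new = (
            not current
            or tag["fc"] != current[-1]["fc"]                           # FC changed
            or tag["addr"] != current[-1]["addr"] + current[-1]["count"]  # gap
            or tag["interval"] != current[-1]["interval"]               # poll rate
            or count + tag["count"] > MAX_REGISTERS_PER_GROUP           # size cap
        )
        if start_new:
            if current:
                groups.append(current)
            current, count = [], 0
        current.append(tag)
        count += tag["count"]
    if current:
        groups.append(current)
    return groups

Each resulting group maps to one Modbus request: read from the first tag's address for the group's total register count, then slice the response per tag.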

Change Detection and Comparison

For bandwidth-constrained deployments (cellular, satellite, LoRaWAN backhaul), sending every value on every read cycle is wasteful. Implement value comparison:

On each tag read:
  if tag.compare_enabled:
    if new_value == last_value AND status unchanged:
      skip delivery
    else:
      deliver value
      update last_value
  else:
    always deliver

The comparison must be type-aware:

  • Integer types: Direct bitwise comparison (uint_value != last_uint_value)
  • Float types: Bitwise comparison, NOT approximate comparison. In industrial contexts, if the bits didn't change, the value didn't change. Using epsilon-based comparison would miss relevant changes while potentially false-triggering on noise.
  • Boolean types: Direct comparison

Periodic forced delivery: Even with comparison enabled, force-deliver all tag values once per hour. This ensures the cloud state eventually converges with reality, even if a value change was missed during a brief network outage.
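Putting comparison and forced delivery together, a minimal sketch (class and field names are illustrative; raw_value is the integer register representation, in line with the bitwise-comparison advice above):

import time

class ChangeDetector:
    """Per-tag compare-before-send with periodic forced delivery."""
    FORCE_INTERVAL_S = 3600  # force-deliver at least once per hour

    def __init__(self):
        self._last = {}       # tag_id -> (raw_value, status)
        self._sent_at = {}    # tag_id -> timestamp of last delivery

    def should_deliver(self, tag_id, raw_value, status, now=None) -> bool:
        now = time.monotonic() if now is None else now
        unchanged = self._last.get(tag_id) == (raw_value, status)
        stale = now - self._sent_at.get(tag_id, -self.FORCE_INTERVAL_S) >= self.FORCE_INTERVAL_S
        if unchanged and not stale:
            return False      # suppress: same bits, recently delivered
        self._last[tag_id] = (raw_value, status)
        self._sent_at[tag_id] = now
        return True

Because the comparison key includes the status byte, a sensor going from OK to error is always delivered even if the stale value bits happen to match.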

Handling Modbus RTU vs TCP

The normalization logic is identical for Modbus RTU (serial) and Modbus TCP (Ethernet). The differences are all in the transport layer:

Parameter          Modbus RTU                        Modbus TCP
Physical           RS-485 serial                     Ethernet
Connection         Serial port open                  TCP socket connect
Addressing         Slave address (1–247)             IP:port (default 502)
Framing            CRC-16                            MBAP header
Timing             Inter-character timeout matters   TCP handles retransmission
Baud rate          9600–115200 typical               N/A (Ethernet speed)
Response timeout   400ms typical                     Shorter (network dependent)

RTU-Specific Configuration

For Modbus RTU, the serial link parameters must match the device exactly:

Baud rate:        9600 (most common) or 19200, 38400, 115200
Parity:           None, Even, or Odd
Data bits:        8 (almost always)
Stop bits:        1 or 2
Slave address:    1–247
Byte timeout:     50ms (time between bytes in a frame)
Response timeout: 400ms (time to wait for a response)

Critical RTU detail: Always flush the serial buffer before starting a new transaction. Stale bytes in the receive buffer from a previous timed-out response will corrupt the current response parsing. This is the number one cause of intermittent "bad CRC" errors on Modbus RTU links.

Error Handling That Matters

When a Modbus read fails, the error code tells you what went wrong:

errno          Meaning                      Recovery Action
ETIMEDOUT      Device didn't respond        Retry 2x, then mark link DOWN
ECONNRESET     Connection dropped           Close + reconnect
ECONNREFUSED   Device rejected connection   Check IP/port, wait before retry
EPIPE          Broken pipe                  Close + reconnect
EBADF          Bad file descriptor          Socket is dead, full reinit

On any of these errors, the correct response is: flush the connection, close it, mark the device link state as DOWN, and attempt reconnection on the next cycle. Don't try to send more data on a dead connection — it will fail faster than you can log it.

Deliver error status alongside the tag. When a tag read fails, don't silently drop the data point. Deliver the tag ID with a non-zero status code and no value data. This lets the cloud platform distinguish between "the sensor reads 0" and "we couldn't reach the sensor." They're very different situations.

How machineCDN Handles Data Normalization

machineCDN's edge runtime performs all normalization at the device boundary — byte order conversion, type coercion, bit extraction, scaling, and comparison — before data touches the network. The binary serialization format described above is the actual wire format used between edge gateways and the machineCDN cloud, achieving typical compression ratios of 3–5x versus JSON while maintaining full type fidelity.

For plant engineers, this means you configure tags with their register addresses, data types, and scaling factors. The platform handles the byte-level mechanics — you never need to manually swap words, reconstruct floats, or debug endianness issues. Tag values arrive in the cloud as properly typed, correctly scaled engineering units, ready for dashboards, analytics, and alerting.

Checklist: Commissioning a New Device

When connecting a new Modbus device to your IIoT platform:

  1. Identify the register map — Get the manufacturer's documentation. Don't guess addresses.
  2. Determine the word order — Read a known float value and try all four combinations.
  3. Verify function codes — Confirm which registers use FC 03 vs FC 04.
  4. Check the slave address — RTU only; confirm via device configuration panel.
  5. Set appropriate timeouts — 50ms byte timeout, 400ms response timeout for RTU; 2000ms for TCP.
  6. Read one tag at a time first — Validate each tag independently before grouping.
  7. Compare with HMI values — Cross-reference your gateway's readings against the device's local display.
  8. Enable comparison selectively — For status bits and slow-changing values only. Disable for process variables during commissioning.
  9. Monitor for -32 / timeout errors — Persistent errors indicate wiring, addressing, or timing issues.
  10. Document everything — Future you will not remember why tag 0x1A uses elem_count=2 with k1=10 and k2=100.

Conclusion

Data normalization is the unglamorous foundation of every working IIoT system. When it works, nobody notices. When it fails, your dashboards show nonsense and operators lose trust in the platform.

The key principles:

  • Know your byte order — and document it per device
  • Match element size to data type — a 4-byte read on a 2-byte register reads adjacent memory
  • Use bitwise comparison for floats — not epsilon
  • Batch and serialize efficiently — binary beats JSON for bandwidth-constrained links
  • Group contiguous registers — reduce Modbus round trips by 5–10x
  • Always deliver error status — silent data drops are worse than explicit failures

Get these right at the edge, and every layer above — time-series databases, dashboards, ML models, alerting — inherits clean, trustworthy data. Get them wrong, and no amount of cloud processing can fix values that were corrupted before they left the factory floor.

Binary vs JSON Payloads for Industrial MQTT Telemetry: Bandwidth, Encoding Strategies, and When Each Wins [2026]

· 14 min read

Every IIoT platform faces the same fundamental design decision for machine telemetry: do you encode data as human-readable JSON, or pack it into a compact binary format?

The answer affects bandwidth consumption, edge buffer capacity, parsing performance, debugging experience, and how well your system degrades under constrained connectivity. Despite what vendor marketing suggests, neither format universally wins. The engineering tradeoffs are real, and the right choice depends on your deployment constraints.

This article breaks down both approaches with the depth that plant engineers and IIoT architects need to make an informed decision.

Best MQTT Broker for Industrial IoT in 2026: Choosing the Right Message Broker for Manufacturing

· 9 min read
MachineCDN Team
Industrial IoT Experts

MQTT has become the dominant messaging protocol for industrial IoT, and for good reason: it's lightweight, handles unreliable networks gracefully, and scales from a single sensor to millions of devices. But choosing the right MQTT broker for manufacturing is a different problem than choosing one for consumer IoT. Factory floor data has different latency requirements, reliability expectations, and security constraints than smart home sensors or fleet telemetry.

Device Provisioning and Authentication for Industrial IoT Gateways: SAS Tokens, Certificates, and Auto-Reconnection [2026]

· 13 min read

Every industrial edge gateway faces the same fundamental challenge: prove its identity to a cloud platform, establish a secure connection, and keep that connection alive for months or years — all while running on hardware with limited memory, intermittent connectivity, and no IT staff on-site to rotate credentials.

Getting authentication wrong doesn't just mean lost telemetry. It means a factory floor device that silently stops reporting, burning through its local buffer until data is permanently lost. Or worse — an improperly secured device that becomes an entry point into an OT network.

This guide covers the practical reality of device provisioning, from the first boot through ongoing credential management, with patterns drawn from production deployments across thousands of industrial gateways.

IEEE 754 Floating-Point Edge Cases in Industrial Data Pipelines: A Practical Guide [2026]

· 12 min read

If you've ever seen a temperature reading of 3.4028235 × 10³⁸ flash across your monitoring dashboard at 2 AM, you've met IEEE 754's ugly side. Floating-point representation is the lingua franca of analog process data in industrial automation — and it's riddled with traps that can silently corrupt your data pipeline if you don't handle them at the edge.

This guide covers the real-world edge cases that matter when reading float registers from PLCs over Modbus, EtherNet/IP, and other industrial protocols — and how to catch them before they poison your analytics, trigger false alarms, or crash your trending charts.

IEEE 754 floating point data flowing through an industrial data pipeline

Why Floating-Point Matters More in Industrial IoT

In enterprise software, a floating-point rounding error means your bank balance is off by a fraction of a cent. In industrial IoT, a misinterpreted float register can mean:

  • A temperature sensor reading infinity instead of 450°F, triggering an emergency shutdown
  • An OEE calculation returning NaN, breaking every downstream dashboard
  • A pressure reading of -0.0 confusing threshold comparison logic
  • Two 16-bit registers assembled in the wrong byte order, turning 72.5 PSI into 1.6 × 10⁻³⁸

These aren't theoretical problems. They happen on real factory floors, every day, because the gap between PLC register formats and cloud-native data types is wider than most engineers realize.

The Anatomy of a PLC Float

Most modern PLCs store floating-point values as IEEE 754 single-precision (32-bit) numbers. The 32 bits break down as:

┌───────┬──────────┬───────────────────────┐
│ Sign  │ Exponent │ Mantissa              │
│ 1 bit │ 8 bits   │ 23 bits               │
└───────┴──────────┴───────────────────────┘
 Bit 31   Bits 30-23  Bits 22-0

This gives you a range of roughly ±1.18 × 10⁻³⁸ to ±3.40 × 10³⁸, with about 7 decimal digits of precision. That's plenty for most process variables — but the encoding introduces special values and edge cases that PLC programmers rarely think about.

The Five Dangerous Values

Pattern                   Value           What Causes It
0x7F800000                +Infinity       Division by zero, sensor overflow
0xFF800000                -Infinity       Negative division by zero
0x7FC00000                Quiet NaN       Uninitialized register, invalid operation
0x7FA00000                Signaling NaN   Hardware fault flags in some PLCs
0x00000000 / 0x80000000   +0.0 / -0.0     Legitimate zero, but -0.0 can trip comparisons

Why PLCs Generate These Values

PLC ladder logic and structured text don't always guard against special float values. Common scenarios include:

Uninitialized registers: When a PLC program is downloaded but a tag hasn't been written to yet, many PLCs leave the register at 0x00000000 (zero) — but some leave it at 0xFFFFFFFF (NaN). There's no universal standard here.

Sensor faults: When an analog input card detects a broken wire or over-range condition, some PLCs write a sentinel value (often max positive float or NaN) to the associated tag. Others set a separate status bit and leave the value register frozen at the last good reading.

Division by zero: If your PLC program calculates a rate (e.g., throughput per hour) and the divisor drops to zero during a machine stop, you get infinity. Not every PLC programmer wraps division in a zero-check.

Scaling arithmetic: Converting raw 12-bit ADC counts (0–4095) to engineering units involves multiplication and offset. If the scaling coefficients are misconfigured, you can get results outside the normal range that are still technically valid IEEE 754 floats.

The Byte-Ordering Minefield

Here's where industrial protocols diverge from IT conventions in ways that cause the most data corruption.

Modbus Register Ordering

Modbus transmits data in 16-bit registers. A 32-bit float occupies two consecutive registers. The question is: which register holds the high word?

The Modbus specification says big-endian (high word first), but many PLC vendors violate this:

Standard Modbus (Big-Endian / "ABCD"):
  Register N   = High word (bytes A, B)
  Register N+1 = Low word  (bytes C, D)

Swapped (Little-Endian / "CDAB"):
  Register N   = Low word  (bytes C, D)
  Register N+1 = High word (bytes A, B)

Byte-Swapped ("BADC"):
  Register N   = Byte-swapped high word (B, A)
  Register N+1 = Byte-swapped low word  (D, C)

Full Reverse ("DCBA"):
  Register N   = (D, C)
  Register N+1 = (B, A)

Real-world example: A process temperature of 72.5°F is 0x42910000 in IEEE 754. Here's what you'd read over Modbus depending on the byte order:

Order   Register N   Register N+1   Decoded Value
ABCD    0x4291       0x0000         72.5 ✅
CDAB    0x0000       0x4291         ≈2.4 × 10⁻⁴¹ ❌
BADC    0x9142       0x0000         ≈-1.5 × 10⁻²⁸ ❌
DCBA    0x0000       0x9142         Garbage ❌

The only reliable way to determine byte ordering is to read a known value from the PLC — like a setpoint you can verify — and compare the decoded result against all four orderings.

EtherNet/IP Tag Ordering

EtherNet/IP (CIP) is generally more predictable because it transmits structured data with typed access. When you read a REAL tag from an Allen-Bradley Micro800 or CompactLogix, the CIP layer handles byte ordering transparently. The value arrives in the host's native format through the client library.

However, watch out for array access. When reading a float array starting at a specific index, the start index and element count must match the PLC's memory layout exactly. Requesting tag_name[1] with elem_count=6 reads elements 1 through 6 — the zero-indexed first element is skipped. Getting this wrong doesn't produce an error; it silently gives you shifted values.

Practical Validation Strategies

Layer 1: Raw Register Validation

Before you even try to decode a float, validate the raw bytes:

import struct
import math

def validate_float_register(high_word: int, low_word: int,
                            byte_order: str = "ABCD") -> tuple[float, str]:
    """
    Decode and validate a 32-bit float from two Modbus registers.
    Returns (value, status) where status is 'ok', 'nan', 'inf', or 'denorm'.
    """
    # Assemble bytes based on ordering
    if byte_order == "ABCD":
        raw = struct.pack('>HH', high_word, low_word)
    elif byte_order == "CDAB":
        raw = struct.pack('>HH', low_word, high_word)
    elif byte_order == "BADC":
        raw = struct.pack('>HH',
                          ((high_word & 0xFF) << 8) | (high_word >> 8),
                          ((low_word & 0xFF) << 8) | (low_word >> 8))
    elif byte_order == "DCBA":
        # Register N holds bytes (D, C) and register N+1 holds (B, A), so
        # packing the words little-endian in swapped order restores A B C D.
        raw = struct.pack('<HH', low_word, high_word)
    else:
        raise ValueError(f"Unknown byte order: {byte_order}")

    value = struct.unpack('>f', raw)[0]

    # Check special values
    if math.isnan(value):
        return value, "nan"
    if math.isinf(value):
        return value, "inf"

    # Check denormalized (subnormal) — often indicates garbage data
    raw_int = struct.unpack('>I', raw)[0]
    exponent = (raw_int >> 23) & 0xFF
    if exponent == 0 and (raw_int & 0x7FFFFF) != 0:
        return value, "denorm"

    return value, "ok"

Layer 2: Engineering-Range Clamping

Every process variable has a physically meaningful range. A mold temperature can't be -40,000°F. A flow rate can't be 10 billion GPM. Enforce these ranges at the edge:

RANGE_LIMITS = {
    "mold_temperature_f": (-50.0, 900.0),
    "barrel_pressure_psi": (0.0, 40000.0),
    "screw_rpm": (0.0, 500.0),
    "coolant_flow_gpm": (0.0, 200.0),
}

def clamp_to_range(tag_name: str, value: float) -> tuple[float, bool]:
    """Clamp a value to its engineering range. Returns (clamped_value, was_clamped)."""
    if tag_name not in RANGE_LIMITS:
        return value, False
    low, high = RANGE_LIMITS[tag_name]
    if value < low:
        return low, True
    if value > high:
        return high, True
    return value, False

Layer 3: Rate-of-Change Filtering

A legitimate temperature can't jump from 200°F to 800°F in one polling cycle (typically 1–60 seconds). Rate-of-change filtering catches sensor glitches and transient read errors:

MAX_RATE_OF_CHANGE = {
    "mold_temperature_f": 50.0,      # Max °F per polling cycle
    "barrel_pressure_psi": 2000.0,   # Max PSI per cycle
    "screw_rpm": 100.0,              # Max RPM per cycle
}

def rate_check(tag_name: str, new_value: float,
               last_value: float) -> bool:
    """Returns True if the change rate is within acceptable limits."""
    if tag_name not in MAX_RATE_OF_CHANGE:
        return True
    max_delta = MAX_RATE_OF_CHANGE[tag_name]
    return abs(new_value - last_value) <= max_delta

The 32-Bit Float Reassembly Problem

When your edge gateway reads two 16-bit Modbus registers and needs to assemble them into a 32-bit float, the implementation must handle several non-obvious cases.

Two-Register Float Assembly

The most common approach reads two registers and combines them. But there's a critical subtlety: the function code determines how you interpret the raw words.

For holding registers (function code 3) and input registers (function code 4), each register is a 16-bit unsigned integer. To assemble a float:

Step 1: Read register N → uint16 word_high
Step 2: Read register N+1 → uint16 word_low
Step 3: Combine → uint32 raw = (word_high << 16) | word_low
Step 4: Reinterpret raw as IEEE 754 float

But here's the trap: some Modbus libraries automatically apply byte swapping at the protocol layer (converting from Modbus big-endian to host little-endian), which means your "high word" might already be byte-swapped before you assemble it.

A robust implementation uses the library's native float-extraction function (like modbus_get_float() in libmodbus) rather than manual assembly when possible. When you must assemble manually, test against a known value first.

Handling Mixed-Endian Devices

In real factories, you'll often have devices from multiple vendors on the same Modbus network — each with their own byte-ordering conventions. Your edge gateway must support per-device (or even per-register) byte-order configuration:

devices:
  - name: "Injection_Molding_Press_1"
    protocol: modbus-tcp
    address: "192.168.1.10"
    byte_order: ABCD
    tags:
      - name: barrel_temp_zone1
        register: 40001
        type: float32
        # Inherits device byte_order

  - name: "Chiller_Unit_3"
    protocol: modbus-tcp
    address: "192.168.1.20"
    byte_order: CDAB   # This vendor swaps words
    tags:
      - name: coolant_supply_temp
        register: 30000
        type: float32

Change Detection with Floating-Point Values

One of the most powerful bandwidth optimizations in IIoT edge gateways is change-of-value (COV) detection — only transmitting a value when it actually changes. But floating-point comparison is inherently tricky.

The Naive Approach (Broken)

// DON'T DO THIS
if (new_value != old_value) {
    send(new_value);
}

This fails because:

  • Sensor noise causes sub-LSB fluctuations that produce different float representations
  • NaN ≠ NaN by IEEE 754 rules, so you'd send NaN every single cycle
  • -0.0 == +0.0 by IEEE 754, so you'd miss sign changes that might matter

The Practical Approach

Compare at the raw register level (integer comparison), not the float level. If the uint32 representation of two registers hasn't changed, the float is identical bit-for-bit — no ambiguity:

uint32_t new_raw = ((uint32_t)word_high << 16) | word_low;
uint32_t old_raw = stored_raw_value;

if (new_raw != old_raw) {
    // Value actually changed — decode and transmit
    stored_raw_value = new_raw;
    transmit(decode_float(new_raw));
}

This approach is used in production edge gateways and avoids all the floating-point comparison pitfalls. It's also faster — integer comparison is a single CPU instruction, while float comparison requires FPU operations and NaN handling.

Batching and Precision Preservation

When batching multiple tag values for transmission, format choice matters for float precision.

JSON Serialization Pitfalls

JSON doesn't distinguish between integers and floats, and most JSON serializers will round-trip a float through a decimal representation, potentially losing precision:

Original float: 72.5 (exact in IEEE 754: 0x42910000)
JSON: "72.5" → Deserialized: 72.5 ✅

Original float: 72.3 (NOT exact: 0x4290999A)
JSON: "72.30000305175781" → Deserialized: 72.30000305175781 ✅
Or:   "72.3" → Deserialized as a double: no longer bit-identical
      to the original float (silent mismatch)

For telemetry where exact bit-level reproduction matters (e.g., comparing dashboard values against PLC HMI values), use binary encoding. A well-designed binary telemetry format encodes the tag ID, status, value type, and raw bytes — preserving perfect fidelity with less bandwidth.

A typical binary batch frame looks like:

┌──────────┬───────────┬───────────┬───────────┬────────────┐
│  Batch   │   Group   │  Device   │  Serial   │   Values   │
│  Header  │ Timestamp │   Type    │  Number   │   Array    │
│ (1 byte) │ (4 bytes) │ (2 bytes) │ (4 bytes) │ (variable) │
└──────────┴───────────┴───────────┴───────────┴────────────┘

Each value entry:
┌───────────┬──────────┬──────────┬───────────┬────────────────┐
│  Tag ID   │  Status  │  Count   │ Elem Size │   Raw Values   │
│ (2 bytes) │ (1 byte) │ (1 byte) │ (1 byte)  │ (count × size) │
└───────────┴──────────┴──────────┴───────────┴────────────────┘

This format reduces a typical 100-tag batch from ~5 KB (JSON) to ~600 bytes (binary) — an 8× bandwidth reduction with zero precision loss.
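A sketch of encoding one value entry in this layout, with multi-byte fields big-endian to match the Modbus convention used elsewhere in this post:

```c
#include <stdint.h>
#include <stddef.h>

/* Encode one value entry per the layout above. A sketch; a production
   gateway adds bounds checks and the surrounding batch header. */
size_t encode_value_entry(uint8_t *out, uint16_t tag_id, uint8_t status,
                          const uint8_t *raw, uint8_t count, uint8_t elem_size) {
    size_t n = 0;
    out[n++] = tag_id >> 8;        /* Tag ID, big-endian */
    out[n++] = tag_id & 0xFF;
    out[n++] = status;
    out[n++] = count;
    out[n++] = elem_size;
    for (size_t i = 0; i < (size_t)count * elem_size; i++)
        out[n++] = raw[i];         /* raw value bytes, already big-endian */
    return n;
}
```

Encoding tag 80 with status 0 and the 2-byte value 725 produces the seven bytes `00 50 00 01 02 02 D5` — the same shape as the binary dump shown later in this series.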

Edge Gateway Best Practices

Based on years of deploying edge gateways in plastics, metals, and packaging manufacturing, here are the practices that prevent float-related data quality issues:

1. Validate at the Source

Don't wait until data reaches the cloud to check for NaN and infinity. By then, you've wasted bandwidth transmitting garbage and may have corrupted aggregations. Validate immediately after the register read.
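A minimal edge-side validation sketch using the C99 classification macros; note that rejecting denormals is a policy choice here, not an IEEE 754 requirement:

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Validate a decoded float before it leaves the gateway.
   Rejects NaN, ±infinity, and (by policy) denormals. */
bool value_is_valid(float v) {
    if (isnan(v) || isinf(v)) return false;            /* IEEE 754 specials */
    if (v != 0.0f && fabsf(v) < FLT_MIN) return false; /* denormal */
    return true;
}
```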

2. Separate Value and Status

Every tag read should produce two outputs: the decoded value AND a status code. Status codes distinguish between "value is zero because the sensor reads zero" and "value is zero because the read failed." Most Modbus libraries return error codes — propagate them alongside the values.

3. Configure Byte Order Per Device

Don't hardcode byte ordering. Every industrial device you connect might have different conventions. Your tag configuration should support per-device or per-tag byte-order specification.

4. Use Binary Encoding Over Constrained Links

If your edge gateway communicates over cellular (4G/5G) or satellite, binary encoding pays for itself immediately. The bandwidth savings compound with polling frequency — a gateway polling 200 tags every second generates 17 GB/month in JSON but only 2 GB/month in binary.

5. Hourly Full Reads

Even with change-of-value filtering, perform a full read of all tags at least once per hour. This catches situations where a value changed but the change was lost due to a transient error, and ensures your cloud platform always has a recent snapshot of every tag — even slowly-changing ones.

How machineCDN Handles Float Data

machineCDN's edge infrastructure handles these float challenges at the protocol driver level. The platform supports automatic byte-order detection during device onboarding, validates every register read against configurable engineering ranges, and uses binary telemetry encoding to minimize bandwidth while preserving perfect float fidelity.

For plants running mixed-vendor equipment — which is nearly every plant — machineCDN normalizes all float data into a consistent format before it reaches your dashboards, ensuring that a temperature from a Modbus chiller and a temperature from an EtherNet/IP blender are directly comparable.

Key Takeaways

  1. IEEE 754 special values (NaN, infinity, denormals) appear regularly in PLC data — don't assume every register read produces a valid number
  2. Byte ordering varies by vendor, not by protocol — always verify against a known value
  3. Compare at the raw register level for change detection — never use float equality
  4. Binary encoding preserves precision and saves 8× bandwidth over JSON for telemetry
  5. Validate at the edge, not in the cloud — garbage data should never leave the factory

Getting floating-point handling right at the edge gateway is one of those unglamorous engineering fundamentals that separates reliable IIoT platforms from brittle ones. Your trending charts, alarm logic, and analytics all depend on it.


Want to see how machineCDN handles multi-protocol float data normalization in production? Request a demo to explore the platform with real factory data.

MQTT QoS Levels for Industrial Telemetry: Choosing the Right Delivery Guarantee [2026]

· 11 min read

When an edge gateway publishes a temperature reading from a plastics extruder running at 230°C, does it matter if that message arrives exactly once, at least once, or possibly not at all? The answer depends on what you're doing with the data — and getting it wrong can mean either lost production insights or a network drowning in redundant traffic.

MQTT's Quality of Service (QoS) levels are one of the most misunderstood aspects of industrial IoT deployments. Most engineers default to QoS 1 for everything, which is rarely optimal. This guide breaks down each level with real industrial scenarios, bandwidth math, and patterns that actually work on factory floors where cellular links drop and PLCs generate thousands of data points per second.

Thread-Safe Telemetry Pipelines: Building Concurrent IIoT Edge Gateways That Don't Lose Data [2026]

· 17 min read

An edge gateway on a factory floor isn't a REST API handling one request at a time. It's a real-time system juggling multiple competing demands simultaneously: polling a PLC for tag values every second, buffering data locally when the cloud connection drops, transmitting batched telemetry over MQTT, processing incoming configuration commands from the cloud, and monitoring its own health — all at once, on hardware with the computing power of a ten-year-old smartphone.

Get the concurrency wrong, and you don't get a 500 error in your logs. You get silent data loss, corrupted telemetry batches, or — worst case — a watchdog reboot loop that takes your monitoring offline during a critical production run.

This guide covers the architecture patterns that make industrial edge gateways reliable under real-world conditions: concurrent PLC polling, thread-safe buffering, MQTT delivery guarantees, and the store-and-forward patterns that keep data flowing when the network doesn't.

Thread-safe edge gateway architecture with concurrent data pipelines

The Concurrency Challenge in Industrial Edge Gateways

A typical edge gateway has at least three threads running concurrently:

  1. The polling thread — reads tags from PLCs at configured intervals (1-second to 60-second cycles)
  2. The MQTT network thread — manages the broker connection, handles publish/subscribe, reconnection
  3. The main control thread — processes incoming commands, monitors watchdog timers, manages configuration

These threads all share one critical resource: the outgoing data buffer. The polling thread writes telemetry into the buffer. The MQTT thread reads from the buffer and transmits data. When the connection drops, the buffer must hold data without the polling thread stalling. When the connection recovers, the buffer must drain in order without losing or duplicating messages.

This is a classic producer-consumer problem, but with industrial constraints that make textbook solutions insufficient.

Why Standard Queues Fall Short

Your first instinct might be to use a thread-safe queue — a ConcurrentLinkedQueue in Java, a queue.Queue in Python, or a lock-free ring buffer. These work fine for web applications, but industrial edge gateways have constraints that break standard queue implementations:

1. Memory Is Fixed and Finite

Edge gateways run on embedded hardware with 64 MB to 512 MB of RAM — no swap space, no dynamic allocation after startup. An unbounded queue will eventually exhaust memory during a long network outage. A fixed-size queue forces you to choose: block the producer (stalling PLC polling) or drop the oldest data.

2. Network Outages Last Hours, Not Seconds

In a factory, network outages aren't transient blips. A fiber cut, a misconfigured switch, or a power surge on the network infrastructure can take connectivity down for hours. Your buffer needs to hold potentially thousands of telemetry batches — not just a few dozen.

3. Delivery Confirmation Is Asynchronous

MQTT QoS 1 guarantees at-least-once delivery, but the PUBACK confirmation comes back asynchronously — possibly hundreds of milliseconds after the PUBLISH. During that window, you can't release the buffer space (the message might need retransmission), and you can't stall the producer (PLC data keeps flowing).

4. Data Must Survive Process Restarts

If the edge gateway daemon restarts (due to a configuration update, a watchdog trigger, or a power cycle), buffered-but-undelivered data must be recoverable. Purely in-memory queues lose everything.

The Paged Ring Buffer Pattern

The pattern that works in production is a paged ring buffer — a fixed-size memory region divided into pages, with explicit state tracking for each page. Here's how it works:

Memory Layout

At startup, the gateway allocates a single contiguous memory block and divides it into equal-sized pages:

┌─────────┬─────────┬─────────┬─────────┬─────────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ Page 4 │
│ FREE │ FREE │ FREE │ FREE │ FREE │
└─────────┴─────────┴─────────┴─────────┴─────────┘

Each page has its own header tracking:

  • A page number (for logging and debugging)
  • A start_p pointer (beginning of writable space)
  • A write_p pointer (current write position)
  • A read_p pointer (current read position for transmission)
  • A next pointer (linking to the next page in whatever list it's in)

Three Page Lists

Pages move between three linked lists:

  1. Free pages — available for the producer to write into
  2. Used pages — full of data, queued for transmission
  3. Work page — the single page currently being written to

Producer (Polling Thread)             Consumer (MQTT Thread)
      │
      ▼
┌──────────┐
│Work Page │──── When full ────►┌──────────┐
│(writing) │                    │Used Pages│──► MQTT Publish
└──────────┘                    │(queued)  │
      ▲                         └──────────┘
      │                               │
      │        When delivered         │
      │◄──────────────────────────────┘
┌──────────┐
│Free Pages│
│(empty)   │
└──────────┘

The Producer Path

When the polling thread has a new batch of tag values to store:

  1. Check the work page — if there's no current work page, grab one from the free list
  2. Calculate space — check if the new data fits in the remaining space on the work page
  3. If it fits — write the data (with a size header) and advance write_p
  4. If it doesn't fit — move the work page to the used list, grab a new page (from free, or steal the oldest from used if free is empty), and write there
  5. After writing — check if there's data ready to transmit and kick the consumer

The critical detail: if the free list is empty, the producer steals the oldest used page. This means during extended outages, the buffer wraps around and overwrites the oldest data — exactly the behavior you want. Recent data is more valuable than stale data in industrial monitoring.
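The page header and the producer's page-acquisition step can be sketched as follows — names like `page_t` and `next_work_page` are illustrative, not from a specific product:

```c
#include <stdint.h>
#include <stddef.h>

/* Page header for the paged ring buffer (sketch). */
typedef struct page {
    int          number;    /* for logging and debugging */
    uint8_t     *start_p;   /* beginning of writable space */
    uint8_t     *write_p;   /* current write position */
    uint8_t     *read_p;    /* current read position for transmission */
    struct page *next;      /* link in the free or used list */
} page_t;

/* Producer step 4: grab a fresh work page, stealing the oldest used
   page (dropping its data) if the free list is empty. */
page_t *next_work_page(page_t **free_list, page_t **used_head) {
    page_t *p;
    if (*free_list) {            /* normal path: take a free page */
        p = *free_list;
        *free_list = p->next;
    } else {                     /* outage path: overwrite oldest data */
        p = *used_head;
        *used_head = p->next;
    }
    p->write_p = p->read_p = p->start_p;  /* reset for fresh writes */
    p->next = NULL;
    return p;
}
```

The steal path is what gives the buffer its ring semantics: memory use stays fixed, and overflow discards the oldest pages first.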

The Consumer Path

When the MQTT connection is active and there's data to send:

  1. Check the used page list — if empty, check if the work page has unsent data and promote it
  2. Read the next message from the first used page's read_p position
  3. Publish via MQTT with QoS 1
  4. Set a "packet sent" flag — this prevents sending the next message until the current one is acknowledged
  5. Wait for PUBACK — when the broker confirms receipt, advance read_p
  6. If read_p reaches write_p — the page is fully delivered; move it back to the free list
  7. Repeat — grab the next message from the next used page

The Mutex Strategy

The entire buffer is protected by a single mutex. This might seem like a bottleneck, but in practice:

  • Write operations (adding data) take microseconds
  • Read operations (preparing to transmit) take microseconds
  • The actual MQTT transmission happens outside the mutex — only the buffer state management is locked

The mutex is held for a few microseconds at a time, never during network I/O. This keeps the polling thread from ever blocking on network latency.

Polling Thread:                 MQTT Thread:
  lock(mutex)                     lock(mutex)
  write data to page              read data from page
  check if page full              mark as sent
  maybe promote page              unlock(mutex)
  trigger send check              ─── MQTT publish ───
  unlock(mutex)                     (outside mutex!)
                                  lock(mutex)
                                  process PUBACK
                                  maybe free page
                                  unlock(mutex)

Message Framing Inside Pages

Each page holds multiple messages packed sequentially. Each message has a simple header:

┌──────────────┬──────────────┬─────────────────────┐
│ Message ID │ Message Size │ Message Body │
│ (4 bytes) │ (4 bytes) │ (variable) │
└──────────────┴──────────────┴─────────────────────┘

The Message ID field is initially zero. When the MQTT library publishes the message, it fills in the packet ID assigned by the broker. This is how the consumer tracks which specific message was acknowledged — when the PUBACK callback fires with a packet ID, it can match it to the message at read_p and advance.

This framing makes the buffer self-describing. During recovery after a restart, the gateway can scan page contents by reading size headers sequentially.
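A recovery-scan sketch that walks a page by its size headers; it assumes the 4-byte size field is stored big-endian, which is an assumption — the framing description above doesn't specify header endianness:

```c
#include <stdint.h>
#include <stddef.h>

/* Walk framed messages in a page: 4-byte message ID, 4-byte size
   (assumed big-endian), then the body. Stops at a truncated tail. */
size_t count_messages(const uint8_t *page, size_t used_bytes) {
    size_t off = 0, count = 0;
    while (off + 8 <= used_bytes) {
        uint32_t size = ((uint32_t)page[off+4] << 24) |
                        ((uint32_t)page[off+5] << 16) |
                        ((uint32_t)page[off+6] << 8)  |
                         (uint32_t)page[off+7];
        if (off + 8 + size > used_bytes) break;  /* truncated message */
        off += 8 + size;
        count++;
    }
    return count;
}
```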

Handling Disconnections Gracefully

When the MQTT connection drops, the consumer thread must handle it without corrupting the buffer:

Connection Lost:
1. Set connected = 0
2. Clear "packet sent" flag
3. Do NOT touch any page pointers

That's it. The producer keeps writing — it doesn't know or care about the connection state. The buffer absorbs data normally.

When the connection recovers:

Connection Restored:
1. Set connected = 1
2. Trigger send check (under mutex)
3. Consumer picks up where it left off

The key insight: the "packet sent" flag prevents double-sending. If a PUBLISH was in flight when the connection dropped, the PUBACK never arrived. The flag remains set, but the disconnection handler clears it. When the connection recovers, the consumer re-reads the same message from read_p (which was never advanced) and re-publishes it. The broker either receives a duplicate (handled by QoS 1 dedup) or receives it for the first time.

Binary vs. JSON Batch Encoding

The telemetry data written into the buffer can be encoded in two formats, and the choice affects both bandwidth and reliability.

JSON Format

Each batch is a JSON object containing groups of timestamped values:

{
  "groups": [
    {
      "ts": 1709424000,
      "device_type": 1017,
      "serial_number": 123456,
      "values": [
        {"id": 80, "values": [725]},
        {"id": 81, "values": [680]},
        {"id": 82, "values": [285]}
      ]
    }
  ]
}

Pros: Human-readable, easy to debug, parseable by any language. Cons: 5-8× larger than binary, float precision loss (decimal representation), size estimation is rough.

Binary Format

A compact binary encoding with a header byte (0xF7), followed by big-endian packed groups:

F7                       ← Header
00 00 00 01              ← Number of groups (1)
65 E8 2C 00              ← Timestamp (Unix epoch)
03 F9                    ← Device type (1017)
00 01 E2 40              ← Serial number
00 00 00 03              ← Number of values (3)
00 50 00 01 02 02 D5     ← Tag 80: status=0, 1 value, 2 bytes, 725
00 51 00 01 02 02 A8     ← Tag 81: status=0, 1 value, 2 bytes, 680
00 52 00 01 02 01 1D     ← Tag 82: status=0, 1 value, 2 bytes, 285

Pros: 5-8× smaller, perfect float fidelity (raw bytes preserved), exact size calculation. Cons: Requires matching decoder on the cloud side, harder to debug without tools.

For gateways communicating over cellular connections — common in remote facilities like water treatment plants, oil wells, or distributed renewable energy sites — binary encoding is essentially mandatory. A gateway polling 100 tags every 10 seconds generates about 260 MB/month in JSON versus 35 MB/month in binary. At typical IoT cellular rates ($0.50-$2.00/MB), that's the difference between $130/month and $17/month per gateway.

The MQTT Watchdog Pattern

MQTT connections can enter a zombie state — technically connected according to the TCP stack, but the broker has stopped responding. This is especially common behind industrial firewalls and NAT devices with aggressive connection timeout policies.

The Problem

The MQTT library reports the connection as alive. The gateway publishes messages. No PUBACK comes back — ever. The buffer fills up because the consumer thinks each message is "in flight" (the packet_sent flag is set). Eventually the buffer wraps and data loss begins.

The Solution: Last-Delivered Timestamp

Track the timestamp of the last successful PUBACK. If more than N seconds have passed since the last acknowledged delivery, and there are messages waiting to be sent, the connection is stale:

monitor_watchdog():
    if connected AND packet_sent:
        elapsed = now - last_delivered_packet_timestamp
        if elapsed > WATCHDOG_THRESHOLD:
            // Force disconnect and reconnect
            force_disconnect()
            // Disconnection handler clears packet_sent
            // Reconnection handler will re-deliver from read_p

A typical threshold is 60 seconds for LAN connections and 120 seconds for cellular. This catches zombie connections that the TCP stack and MQTT keep-alive miss.
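The watchdog condition reduces to a pure predicate over a few fields, which makes it trivially testable — struct and field names here are assumptions for illustration:

```c
#include <stdbool.h>
#include <time.h>

/* Zombie-connection watchdog state (sketch). */
typedef struct {
    bool   connected;        /* MQTT client reports connected */
    bool   packet_sent;      /* a PUBLISH is awaiting PUBACK */
    time_t last_delivered;   /* timestamp of the last PUBACK */
    time_t threshold;        /* e.g. 60 s LAN, 120 s cellular */
} mqtt_watchdog_t;

/* True when a publish is in flight but no PUBACK has arrived
   within the threshold — i.e. the connection is stale. */
bool watchdog_should_reconnect(const mqtt_watchdog_t *w, time_t now) {
    return w->connected && w->packet_sent &&
           (now - w->last_delivered) > w->threshold;
}
```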

Reconnection with Backoff

When the watchdog (or a genuine disconnection) triggers a reconnect, use a dedicated thread for the connection attempt. The connect_async call can block for the TCP timeout duration (potentially 30+ seconds), and you don't want that blocking the main loop or the polling thread.

A semaphore controls the reconnection thread:

Main Thread:                 Reconnection Thread:
  Detects need to              (blocked on semaphore)
  reconnect                            │
  Posts semaphore ──────►          Wakes up
                                   Calls connect_async()
                                   (may block 30s)
                                   Success or failure
                                   Posts "done" semaphore
  Waits for "done" ◄──────
  Checks result

The reconnect delay should be fixed and short (5 seconds is typical) for industrial applications, not exponential backoff. In a factory, the network outage either resolves quickly (a transient) or it's a hard failure that needs human intervention. Exponential backoff just delays reconnection after the network recovers.

Batching Strategy: Size vs. Time

Telemetry batches should be finalized and queued for transmission based on whichever threshold hits first: size or time.

Size-Based Finalization

When the accumulated batch data exceeds a configured maximum (typically 400-500 KB for JSON, 50-100 KB for binary), finalize and queue it. This prevents any single MQTT message from being too large for the broker or the network MTU.

Time-Based Finalization

When the batch has been collecting data for more than a configured timeout (typically 30-60 seconds), finalize it regardless of size. This ensures that even slowly-changing tags get transmitted within a bounded time window.
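The whichever-hits-first rule reduces to a single predicate; the limits passed in are illustrative, not from any particular product:

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Finalize the batch when EITHER threshold trips:
   accumulated size, or age since the batch was started. */
bool batch_should_finalize(size_t bytes, time_t started, time_t now,
                           size_t max_bytes, time_t max_age) {
    return bytes >= max_bytes || (now - started) >= max_age;
}
```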

The Interaction Between Batching and Buffering

Batching and buffering are separate concerns that interact:

PLC Tags ──► Batch (collecting) ──► Buffer Page (queued) ──► MQTT (transmitted)

Tag reads accumulate      When batch finalizes,      Pages are transmitted
in the batch structure    the encoded batch goes     one at a time with
                          into the ring buffer       PUBACK confirmation

A batch contains one or more "groups" — each group is a set of tag values read at the same timestamp. Multiple polling cycles might go into a single batch before it's finalized by size or time. The finalized batch then goes into the ring buffer as a single message.

Dependent Tag Reads and Atomic Groups

In many PLC configurations, certain tags are only meaningful when read together. For example:

  • Alarm word tags — a uint16 register where each bit represents a different alarm. You read the alarm word, then extract the individual bits. If the alarm word changes, you need to read and deliver the extracted bits atomically with the parent.

  • Machine state transitions — when a "blender running" tag changes from 0 to 1, you might need to immediately read all associated process values (RPM, temperatures, pressures) to capture the startup snapshot.

The architecture handles this through dependent tag chains:

Parent Tag (alarm_word, interval=1s, compare=true)
├── Calculated Tag (alarm_bit_0, shift=0, mask=0x01)
├── Calculated Tag (alarm_bit_1, shift=1, mask=0x01)
├── Dependent Tag (motor_speed, read_on_change=true)
└── Dependent Tag (temperature, read_on_change=true)

When the parent tag changes, the polling thread:

  1. Finalizes the current batch
  2. Recursively reads all dependent tags (forced read, ignoring intervals)
  3. Starts a new batch group with the same timestamp

This ensures that the dependent values are timestamped identically with the trigger event and delivered together.
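The shift/mask extraction for a calculated tag is a one-liner, sketched here for clarity:

```c
#include <stdint.h>

/* Derive a calculated-tag value (e.g. an alarm bit) from its parent
   register using the shift and mask from the tag configuration. */
uint16_t calc_tag(uint16_t parent, unsigned shift, uint16_t mask) {
    return (uint16_t)((parent >> shift) & mask);
}
```

With `alarm_word = 0x0002`, bit 0 (`shift=0, mask=0x01`) evaluates to 0 and bit 1 (`shift=1, mask=0x01`) evaluates to 1 — each calculated tag then goes through the same COV comparison as any directly polled tag.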

Hourly Full-Read Reset

Change-of-value (COV) filtering dramatically reduces bandwidth, but it introduces a subtle failure mode: if a value changes during a transient read error, the gateway might never know it changed.

Here's the scenario:

  1. At 10:00:00, tag value = 72.5 → transmitted
  2. At 10:00:01, the PLC returns a read error for that tag → not transmitted
  3. At 10:00:02, tag value = 73.0 → compared against the last successful read (72.5), change detected, transmitted

Because the comparison is always against the last successfully read value, COV survives transient read errors. The real problem is a change the gateway never observes:

  1. At 10:00:00, tag value = 72.5 → transmitted
  2. The PLC program changes the tag to 73.0 and then back to 72.5 between polling cycles
  3. The gateway never sees 73.0 — it polls at 10:00:00 and 10:00:01 and gets 72.5 both times

For most industrial applications, this sub-second transient is irrelevant. But to guard against drift — where small rounding differences accumulate between the gateway's cached value and the PLC's actual value — a full reset is performed every hour:

Every hour boundary (when the system clock's hour changes):
1. Clear the "read once" flag on every tag
2. Clear all last-known values
3. Force read and transmit every tag regardless of COV

This guarantees that the cloud platform has a complete snapshot of every tag value at least once per hour, even for tags that haven't changed.

Putting It All Together: The Polling Loop

Here's the complete polling loop architecture that ties all these patterns together:

main_polling_loop():
    FOREVER:
        current_time = monotonic_clock()

        FOR each configured device:
            // Hourly reset check
            if hour(current_time) != hour(last_poll_time):
                reset_all_tags(device)

            // Start a new batch group
            start_group(device.batch, unix_timestamp())

            FOR each tag in device.tags:
                // Check if this tag needs reading now
                if not tag.read_once OR elapsed(tag.last_read) >= tag.interval:

                    value, status = read_tag(device, tag)

                    if status == LINK_ERROR:
                        set_link_state(device, DOWN)
                        break  // Stop reading this device

                    set_link_state(device, UP)

                    // COV check
                    if tag.compare AND tag.read_once:
                        if value == tag.last_value AND status == tag.last_status:
                            continue  // No change, skip

                    // Deliver value
                    if tag.do_not_batch:
                        deliver_immediately(device, tag, value)
                    else:
                        add_to_batch(device.batch, tag, value)

                    // Check dependent tags
                    if value_changed AND tag.has_dependents:
                        finalize_batch()
                        read_dependents(device, tag)
                        start_new_group()

                    // Update tracking
                    tag.last_value = value
                    tag.last_status = status
                    tag.read_once = true
                    tag.last_read = current_time

            // Finalize batch group
            stop_group(device.batch, output_buffer)
            // ↑ This checks size/time thresholds and may
            //   queue the batch into the ring buffer

        sleep(polling_interval)

Performance Characteristics

On a typical industrial edge gateway (ARM Cortex-A9, 512 MB RAM, Linux):

| Operation | Time | Notes |
| --- | --- | --- |
| Mutex lock/unlock | ~1 µs | Per buffer operation |
| Modbus TCP read (10 registers) | 5-15 ms | Network dependent |
| Modbus RTU read (10 registers) | 20-50 ms | Baud rate dependent (9600-115200) |
| EtherNet/IP tag read | 2-8 ms | CIP overhead |
| JSON batch encoding | 0.5-2 ms | 100 tags |
| Binary batch encoding | 0.1-0.5 ms | 100 tags |
| MQTT publish (QoS 1) | 1-5 ms | LAN broker |
| Buffer page write | 5-20 µs | memcpy only |

The bottleneck is always the PLC protocol reads, not the buffer or transmission logic. A gateway polling 200 Modbus TCP tags can complete a full cycle in under 200 ms, leaving plenty of headroom for a 1-second polling interval.

For Modbus RTU (serial), the bottleneck shifts to the baud rate. At 9600 baud, a single register read takes ~15 ms including response. Polling 50 registers individually would take 750 ms — too close to a 1-second interval. This is why contiguous register grouping matters: reading 50 consecutive registers in a single request takes about 50 ms, a 15× improvement.
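A sketch of the grouping step: given sorted register addresses, merge adjacent ones into contiguous ranges so each range becomes a single request. The function name is illustrative, and a real gateway would also cap each range at the Modbus limit of 125 registers per read:

```c
#include <stddef.h>

/* Merge sorted register addresses into contiguous (start, length)
   ranges. Caller provides output arrays sized for the worst case
   (one range per address). Returns the number of ranges. */
size_t group_registers(const int *addrs, size_t n, int *starts, int *lens) {
    if (n == 0) return 0;
    size_t ranges = 0;
    starts[0] = addrs[0];
    lens[0] = 1;
    for (size_t i = 1; i < n; i++) {
        if (addrs[i] == addrs[i-1] + 1) {
            lens[ranges]++;            /* extend the current range */
        } else {
            ranges++;                  /* gap: start a new range */
            starts[ranges] = addrs[i];
            lens[ranges] = 1;
        }
    }
    return ranges + 1;
}
```

Addresses {1, 2, 3, 10, 11, 20} collapse into three requests — (1, len 3), (10, len 2), (20, len 1) — instead of six individual reads.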

How machineCDN Implements These Patterns

machineCDN's edge gateway uses exactly these patterns — paged ring buffers with mutex-protected page management, QoS 1 MQTT with PUBACK-based buffer advancement, and both binary and JSON encoding depending on the deployment's bandwidth constraints.

The platform's gateway daemon runs on Linux-based edge hardware (including cellular routers like the Teltonika RUT series) and handles simultaneous Modbus RTU, Modbus TCP, and EtherNet/IP connections to mixed-vendor equipment. The buffer is sized during commissioning based on the expected outage duration — a 64 KB buffer holds roughly 4 hours of data at typical polling rates; a 512 KB buffer extends that to over 24 hours.

The result: plants running machineCDN don't lose telemetry during network outages. When connectivity recovers, the buffered data drains automatically and fills in the gaps in trending charts and analytics — no manual intervention, no missing data points.

Key Takeaways

  1. Use paged ring buffers, not unbounded queues — fixed memory, graceful overflow (oldest data dropped first)
  2. Protect buffer operations with a mutex, but never hold it during network I/O — microsecond lock durations keep producers and consumers non-blocking
  3. Track PUBACK per-message to prevent double-sending and enable reliable buffer advancement
  4. Implement an MQTT watchdog using last-delivery timestamps to catch zombie connections
  5. Batch by size OR time (whichever hits first) to balance bandwidth and latency
  6. Reset all tags hourly to guarantee complete snapshots and prevent drift
  7. Binary encoding saves 5-8× bandwidth with zero precision loss — essential for cellular-connected gateways
  8. Group contiguous Modbus registers into single requests — 15× faster than individual reads on RTU

Building a reliable IIoT edge gateway is fundamentally a systems programming challenge. The protocols, the buffering, the concurrency — each one is manageable alone, but getting them all right together, on constrained hardware, with zero tolerance for data loss, is what separates toy prototypes from production infrastructure.


See machineCDN's store-and-forward buffering in action with real factory data. Request a demo to explore the platform.

Batched vs. Immediate Telemetry Delivery: When to Use Each in Industrial Monitoring [2026]

· 11 min read

Every industrial IoT edge gateway faces a fundamental architectural decision for every data point it collects: ship it now, or hold it and ship a batch later?

Get this wrong and you either drown your MQTT broker in tiny messages or you miss a critical alarm because it was sitting in a buffer when the compressor caught fire. This guide covers the engineering behind both approaches, the real-world trade-offs, and a framework for deciding which to use where.

Calculated Tags in Industrial IoT: Deriving Boolean Alarms from Raw PLC Registers [2026]

· 9 min read

If you've ever tried to monitor 32 individual alarm conditions from a PLC, you've probably discovered an uncomfortable truth: polling each one as a separate tag creates a nightmarish amount of bus traffic. The solution — calculated tags — is one of the most powerful yet underexplained patterns in industrial data acquisition.

This guide breaks down exactly how calculated tags work, why they matter for alarm systems, and how to implement them efficiently at the edge.

Cloud Connection Watchdogs for IIoT Edge Gateways: Designing Self-Healing MQTT Pipelines [2026]

· 12 min read

The edge gateway powering your factory floor monitoring has exactly one job that matters: get data from PLCs to the cloud. Everything else — protocol translation, tag mapping, batch encoding — is just preparation for that moment when bits leave the gateway and travel to your cloud backend.

And that's exactly where things break. MQTT connections go stale. TLS certificates expire silently. Cloud endpoints restart for maintenance. Cellular modems drop carrier. The gateway's connection looks alive — the TCP socket is open, the MQTT client reports "connected" — but nothing is actually getting delivered.

This is the silent failure problem, and it kills more IIoT deployments than any protocol misconfiguration ever will. This guide covers how to design watchdog systems that detect, diagnose, and automatically recover from every flavor of connectivity failure.

Why MQTT Connections Fail Silently

To understand why watchdogs are necessary, you need to understand what MQTT's keep-alive mechanism does and — more importantly — what it doesn't do.

MQTT keep-alive is a bi-directional ping. The client sends a PINGREQ, the broker responds with PINGRESP. If the broker doesn't hear from the client within 1.5× the keep-alive interval, it considers the client dead and closes the session. If the client doesn't get a PINGRESP, it knows the connection is lost.

Sounds robust, right? Here's where it falls apart:

The Half-Open Connection Problem

TCP connections can enter a "half-open" state where one side thinks the connection is alive, but the other side has already dropped it. This happens when a NAT gateway times out the session, a cellular modem roams to a new tower, or a firewall silently drops the route. The MQTT client's operating system still shows the socket as ESTABLISHED. The keep-alive PINGREQ gets queued in the kernel's send buffer — and sits there, never actually reaching the wire.

The Zombie Session Problem

The gateway reconnects after an outage and gets a new TCP session, but the broker still has the old session's resources allocated. Depending on the clean session flag and broker implementation, you might end up with duplicate subscriptions, missed messages on the command channel, or a broker that refuses the new connection because the old client ID is still "active."

The Token Expiration Problem

Cloud IoT platforms (Azure IoT Hub, AWS IoT Core, Google Cloud IoT) use SAS tokens or JWT tokens for authentication. These tokens have expiration timestamps. When a token expires, the MQTT connection stays open until the next reconnection attempt — which then fails with an authentication error. If your reconnection logic doesn't refresh the token before retrying, you'll loop forever: connect → auth failure → reconnect → auth failure.

The Backpressure Problem

The MQTT client library reports "connected," publishes succeed (they return a message ID), but the broker is under load and takes 30 seconds to acknowledge the publish. Your QoS 1 messages pile up in the client's outbound queue. Eventually the client's memory is exhausted, publishes start failing, but the connection is technically alive.

Designing a Proper Watchdog

A production-grade edge watchdog doesn't just check "am I connected?" It monitors three independent health signals:

Signal 1: Connection State

Track the MQTT on_connect and on_disconnect callbacks. Maintain a state machine:

```
States:
  DISCONNECTED → CONNECTING → CONNECTED → DISCONNECTING → DISCONNECTED

Transitions:
  DISCONNECTED + config_available     → CONNECTING    (initiate async connect)
  CONNECTING   + on_connect(status=0) → CONNECTED
  CONNECTING   + on_connect(status≠0) → DISCONNECTED  (log error, wait backoff)
  CONNECTED    + on_disconnect        → DISCONNECTING → DISCONNECTED
```

The key detail: initiate MQTT connections asynchronously in a dedicated thread. A blocking mqtt_connect() call in the main data collection loop will halt PLC reads during the TCP handshake — which on a cellular link with 2-second RTT means 2 seconds of missed data. Use a semaphore or signal to coordinate: the connection thread posts "I'm ready" when it finishes, and the main loop picks it up on the next cycle.
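As a concrete illustration, here's a minimal Python sketch of this state machine (class and method names are illustrative, not any client library's API; the transient DISCONNECTING state is collapsed for brevity):

```python
from enum import Enum, auto

class ConnState(Enum):
    DISCONNECTED = auto()
    CONNECTING = auto()
    CONNECTED = auto()

class ConnectionTracker:
    """Drives connection state from the MQTT client's callbacks."""
    def __init__(self):
        self.state = ConnState.DISCONNECTED

    def start_connect(self):
        # Only leave DISCONNECTED when idle and config is available
        if self.state is ConnState.DISCONNECTED:
            self.state = ConnState.CONNECTING

    def on_connect(self, status):
        # status 0 means the broker accepted the connection
        self.state = ConnState.CONNECTED if status == 0 else ConnState.DISCONNECTED

    def on_disconnect(self):
        # DISCONNECTING collapses straight into DISCONNECTED here
        self.state = ConnState.DISCONNECTED
```

Wire `on_connect`/`on_disconnect` into your client library's callbacks; the watchdog reads `state` to decide whether a reconnect job is needed.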

Signal 2: Delivery Confirmation

This is the critical signal that catches silent failures. Track the timestamp of the last successfully delivered message (acknowledged by the broker, not just sent by the client).

For QoS 1: the on_publish callback fires when the broker acknowledges receipt with a PUBACK. Record this timestamp every time it fires.

```
Last Delivery Tracking:
  on_publish(packet_id) → last_delivery_timestamp = now()

Watchdog Check (every main loop cycle):
  if (now() - last_delivery_timestamp > WATCHDOG_TIMEOUT):
      trigger_reconnection()
```

What's the right watchdog timeout? It depends on your data rate:

| Data Rate | Suggested Timeout | Rationale |
|---|---|---|
| Every 1s | 30–60s | 30 missed deliveries before alert |
| Every 5s | 60–120s | 12–24 missed deliveries |
| Every 30s | 120–300s | 4–10 missed deliveries |

The timeout should be significantly longer than your maximum expected inter-delivery interval. If your batch timeout is 30 seconds, a 120-second watchdog timeout gives you 4 batch cycles of tolerance before concluding something is wrong.
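The delivery-confirmation check can be sketched in a few lines of Python. The injectable clock is an assumption made for testability; `on_publish` is meant to be wired to a QoS 1 PUBACK callback (delivery, not transmission):

```python
import time

class DeliveryWatchdog:
    """Fires when no PUBACK has been seen for `timeout_s` seconds."""
    def __init__(self, timeout_s=120.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_delivery = clock()

    def on_publish(self, packet_id):
        # Wire this to the broker-acknowledgement callback, not the send path
        self.last_delivery = self.clock()

    def expired(self):
        return self.clock() - self.last_delivery > self.timeout_s
```

The main loop calls `expired()` once per cycle and posts a reconnection job when it returns true.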

Signal 3: Token/Certificate Validity

Before attempting reconnection, check the authentication material:

```
Token Check:
  if (token_expiration_timestamp ≠ 0):
      if (current_time > token_expiration_timestamp):
          log("WARNING: Cloud auth token may be expired")
      else:
          log("Token valid until {expiration_time}")
```

If your deployment uses SAS tokens with expiration timestamps, parse the se= (signature expiry) parameter from the connection string at startup. Log a warning when the token is approaching expiry. Some platforms provide token refresh mechanisms; others require a redeployment. Either way, knowing the token is expired before the first reconnection attempt saves you from debugging phantom connection failures at 3 AM.
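A sketch of extracting the `se=` parameter from an Azure-style `SharedAccessSignature` token (the function names are illustrative; the token layout follows Azure IoT Hub's SAS convention):

```python
import time
from urllib.parse import parse_qs

def sas_expiry(sas_token):
    """Return the `se=` (signature expiry) Unix timestamp, or 0 if absent.
    Token shape: 'SharedAccessSignature sr=...&sig=...&se=1700000000'"""
    _, _, params = sas_token.partition(" ")
    fields = parse_qs(params)
    return int(fields.get("se", ["0"])[0])

def check_token(sas_token, now=None):
    """Classify the token before a reconnection attempt."""
    expiry = sas_expiry(sas_token)
    if expiry == 0:
        return "no-expiry"
    now = time.time() if now is None else now
    return "expired" if now > expiry else "valid"
```

Run this at startup and before every reconnection attempt, and log the result so an expired credential shows up in the first screen of the log, not after an hour of TLS debugging.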

Buffer-Aware Recovery: Don't Lose Data During Outages

The watchdog triggers a reconnection. But what happens to the data that was collected while the connection was down?

This is where most IIoT platforms quietly drop data. The naïve approach: if the MQTT publish call fails, discard the message and move on. This means any network outage, no matter how brief, creates a permanent gap in your historical data.

A proper store-and-forward buffer works like this:

Page-Based Buffer Architecture

Instead of a simple FIFO queue, divide a fixed memory region into pages. Each page holds multiple messages packed sequentially. Three page lists manage the lifecycle:

  • Free Pages: Empty, available for new data
  • Work Page: Currently being filled with new messages
  • Used Pages: Full pages waiting for delivery

```
Data Flow:
  PLC Read → Batch Encoder → Work Page (append)
  Work Page Full → Move to Used Pages queue

MQTT Connected:
  Used Pages front → Send first message → Wait for PUBACK
  PUBACK received → Advance read pointer
  Page fully delivered → Move to Free Pages

MQTT Disconnected:
  Used Pages continue accumulating
  Work Page continues filling
  If Free Pages exhausted → Reclaim oldest Used Page (overflow warning)
```

Why Pages, Not Individual Messages

Individual message queuing has per-message overhead that becomes significant at high data rates: pointer storage, allocation/deallocation, fragmentation. A page-based buffer pre-allocates a contiguous memory block (typically 1–2 MB on embedded edge hardware) and manages it as fixed-size pages. No dynamic allocation after startup. No fragmentation. Predictable memory footprint.

The overflow behavior is also better. When the buffer is full and the connection is still down, you sacrifice the oldest complete page — losing, say, 60 seconds of data from 10 minutes ago rather than randomly dropping individual messages from different time periods. The resulting data gap is clean and contiguous, which is much easier for downstream analytics to handle than scattered missing points.
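A simplified Python model of the page lifecycle (a real implementation would pre-allocate contiguous byte buffers rather than Python lists; the sizes here are toy values):

```python
from collections import deque

class PageBuffer:
    """Store-and-forward buffer built from fixed-size pages.
    On overflow, the oldest full page is reclaimed, so data loss
    is a single clean, contiguous gap."""
    def __init__(self, num_pages=4, page_size=3):
        self.page_size = page_size
        self.free = num_pages - 1   # one page is always the work page
        self.work = []              # page currently being filled
        self.used = deque()         # full pages awaiting delivery
        self.overflows = 0

    def append(self, msg):
        self.work.append(msg)
        if len(self.work) == self.page_size:  # work page full: rotate
            if self.free == 0:                # out of pages: drop oldest
                self.used.popleft()
                self.overflows += 1
            else:
                self.free -= 1
            self.used.append(self.work)
            self.work = []

    def pop_delivered_page(self):
        """Call once every message on the front page has been PUBACKed."""
        page = self.used.popleft()
        self.free += 1
        return page
```

Note that `append` never allocates a new page after startup; it only moves pages between the three lists, which keeps the memory footprint fixed.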

Disconnect Recovery Sequence

When the MQTT on_disconnect callback fires:

  1. Mark connection as down immediately — the buffer stops trying to send
  2. Reset "packet in flight" flag — the pending PUBACK will never arrive
  3. Continue accepting data from PLC reads into the buffer
  4. Do NOT flush or clear the buffer — all unsent data stays queued

When on_connect fires after reconnection:

  1. Mark connection as up
  2. Begin draining Used Pages from the front of the queue
  3. Send first queued message, wait for PUBACK, then send next
  4. Simultaneously accept new data into the Work Page

This "catch-up" phase is important to handle correctly. New real-time data is still flowing into the buffer while old data is being drained. The buffer must handle concurrent writes (from the PLC reading thread) and reads (for MQTT delivery) safely. Mutex protection on the page list operations is essential.
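A minimal sketch of that concurrency contract, assuming one producer (the PLC-read thread) and one consumer (MQTT delivery), with a flat list standing in for the page queue:

```python
import threading

class UplinkState:
    """Coordinates producer and consumer sides of the buffer.
    The lock guards the queue; the flags mirror the recovery sequence."""
    def __init__(self):
        self.lock = threading.Lock()
        self.connected = False
        self.packet_in_flight = False
        self.queue = []                    # stands in for the Used Pages list

    def on_disconnect(self):
        with self.lock:
            self.connected = False
            self.packet_in_flight = False  # pending PUBACK will never arrive
            # the queue is deliberately NOT cleared: unsent data stays buffered

    def on_connect(self):
        with self.lock:
            self.connected = True

    def buffer(self, msg):                 # called from the PLC thread
        with self.lock:
            self.queue.append(msg)

    def next_to_send(self):                # called from the MQTT thread
        with self.lock:
            if self.connected and not self.packet_in_flight and self.queue:
                self.packet_in_flight = True
                return self.queue[0]
            return None

    def on_puback(self):
        with self.lock:
            self.queue.pop(0)
            self.packet_in_flight = False
```

The single in-flight message keeps ordering strict; draining old data and accepting new data interleave naturally because both paths take the same lock for only a few instructions.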

Async Connection Threads: The Pattern That Saves You

Network operations block. DNS resolution blocks. TCP handshakes block. TLS negotiation blocks. On a cellular connection with packet loss, a single connection attempt can take 5–30 seconds.

If your edge gateway has a single thread doing both PLC reads and MQTT connections, that's 5–30 seconds of missed PLC data every time the connection drops. For an injection molding machine with a 15-second cycle, you could miss an entire shot.

The solution is a dedicated connection thread:

```
Main Thread:
  loop:
      read_plc_tags()
      encode_and_buffer()
      dispatch_command_queue()
      check_watchdog()
      if watchdog_triggered:
          post_job_to_connection_thread()
      sleep(1s)

Connection Thread:
  loop:
      wait_for_job()                  // blocks on semaphore
      destroy_old_connection()
      create_new_mqtt_client()
      configure_tls()
      set_callbacks()
      mqtt_connect_async(host, port)
      signal_job_complete()           // post semaphore
```

Two semaphores coordinate this:

  • Job semaphore: Main thread posts to trigger reconnection, connection thread waits on it
  • Completion semaphore: Connection thread posts when done, main thread checks (non-blocking) before posting next job

Critical detail: check that the connection thread isn't already running before posting a new job. If the main thread fires the watchdog timeout every 120 seconds but the last reconnection attempt is still in progress (stuck in a 90-second TLS handshake), you'll get overlapping connection attempts that corrupt the MQTT client state.
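The two-semaphore pattern, including the overlap guard, can be sketched like this (the `connect_fn` callable stands in for the destroy/recreate/TLS/connect sequence):

```python
import threading

class ConnectionWorker:
    """Dedicated reconnection thread; the main loop posts jobs and never blocks."""
    def __init__(self, connect_fn):
        self.connect_fn = connect_fn
        self.job = threading.Semaphore(0)    # main thread posts work here
        self.idle = threading.Semaphore(1)   # held while a job is running
        threading.Thread(target=self._run, daemon=True).start()

    def request_reconnect(self):
        # Refuse overlapping attempts: only post if the worker is idle
        if self.idle.acquire(blocking=False):
            self.job.release()
            return True
        return False

    def _run(self):
        while True:
            self.job.acquire()        # block until the watchdog posts a job
            try:
                self.connect_fn()     # may take 5-30 s; main loop keeps running
            finally:
                self.idle.release()
```

The non-blocking `idle` acquire is what prevents a watchdog firing every 120 seconds from stacking connection attempts on top of a slow TLS handshake.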

Reconnection Backoff Strategy

When the cloud endpoint is genuinely down (maintenance window, region outage), aggressive reconnection attempts waste cellular data and CPU cycles. But when it's a transient network glitch, you want to reconnect immediately.

The right approach combines fixed-interval reconnect with watchdog escalation:

```
Reconnect Timing:
  Attempt 1: Immediate (transient glitch)
  Attempt 2: 5 seconds
  Attempt 3: 5 seconds (cap at 5s for constant backoff)

Watchdog escalation:
  if no successful delivery in 120 seconds despite "connected" state:
      force full reconnection (destroy + recreate client)
```

Why not exponential backoff? In industrial settings, the most common failure mode is a brief network interruption — a cell tower handoff, a router reboot, a firewall session timeout. These resolve in 5–15 seconds. Exponential backoff would delay your reconnection to 30s, 60s, 120s, 240s... meaning you could be offline for 4+ minutes after a 2-second glitch. Constant 5-second retry with watchdog escalation provides faster recovery for the common case while still preventing connection storms during genuine outages.
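The combined policy fits in a few lines; the thresholds (5-second retry, 120-second escalation) match the values above, and the function name is illustrative:

```python
def reconnect_plan(attempt, seconds_since_delivery, connected):
    """Return (delay_s, rebuild_client) for the next recovery action.
    Constant 5 s retry after an immediate first attempt; full client
    rebuild when 'connected' but silent past the watchdog timeout."""
    if connected and seconds_since_delivery > 120:
        return 0, True            # escalation: destroy + recreate the client
    delay = 0 if attempt == 1 else 5
    return delay, False
```

Keeping the policy in one pure function makes it trivial to unit-test against the failure scenarios below without a broker in the loop.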

Device Status Broadcasting

Your edge gateway should periodically broadcast its own health status via MQTT. This serves two purposes: it validates the delivery pipeline end-to-end, and it gives the cloud platform visibility into the gateway fleet's health.

A well-designed status message includes:

  • System uptime (OS level — how long since last reboot)
  • Daemon uptime (application level — how long since last restart)
  • Connected device inventory (PLC types, serial numbers, link states)
  • Token expiration timestamp (proactive alerting for credential rotation)
  • Buffer utilization (how close to overflow)
  • Software version + build hash (for fleet management and OTA targeting)
  • Per-device tag counts and last-read timestamps (stale data detection)

Send a compact status on every connection establishment, and a detailed status periodically (every 5–10 minutes). The compact status acts as a "birth certificate" — the cloud platform immediately knows which gateway just came online and what equipment it's managing.
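A sketch of the two payload shapes (all field names are illustrative, not any platform's schema):

```python
import json
import time

def build_status(detailed, gateway):
    """Build the status payload. The compact form is the 'birth certificate'
    sent on every (re)connection; the detailed form goes out periodically."""
    msg = {
        "ts": int(time.time()),
        "version": gateway["version"],
        "daemon_uptime_s": gateway["daemon_uptime_s"],
    }
    if detailed:
        msg["system_uptime_s"] = gateway["system_uptime_s"]
        msg["token_expiry"] = gateway["token_expiry"]
        msg["buffer_used_pct"] = gateway["buffer_used_pct"]
        msg["devices"] = gateway["devices"]   # PLC inventory + link states
    return json.dumps(msg)
```

Because the status message travels the same QoS 1 path as process data, its PUBACK also feeds the delivery watchdog, so even an idle gateway exercises the full pipeline.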

Real-World Failure Scenarios and How the Watchdog Handles Them

Scenario 1: Cellular Modem Roaming

  • Symptom: TCP connection goes half-open. MQTT client thinks it's connected. Publishes queue up in the OS buffer.
  • Detection: Watchdog timeout fires — no PUBACK received in 120 seconds despite continuous publishes.
  • Recovery: Force reconnection. Buffer holds all unsent data. Reconnect on the new cell tower, drain buffer.
  • Data loss: Zero (buffer sized for a 2-minute outage).

Scenario 2: Cloud Platform Maintenance Window

  • Symptom: MQTT broker goes offline. Client receives a disconnection callback.
  • Detection: Immediate — on_disconnect fires.
  • Recovery: 5-second reconnect attempts. Buffer accumulates data. Connection succeeds when maintenance ends.
  • Data loss: Zero if the maintenance window is shorter than buffer capacity (typically 10–30 minutes at normal data rates).

Scenario 3: SAS Token Expiration

  • Symptom: Connection drops. Reconnection attempts fail with an authentication error.
  • Detection: Watchdog notices repeated connection failures. The token timestamp check confirms expiration.
  • Recovery: Log a critical alert. Wait for token refresh (manual or automated). Reconnect with the new token.
  • Data loss: Depends on token refresh time. The buffer provides a bridge.

Scenario 4: PLC Goes Offline

  • Symptom: Tag reads start returning errors. Gateway loses link state to the PLC.
  • Detection: Link state monitoring fires immediately. The error is delivered to the cloud as a priority (unbatched) event.
  • Recovery: Gateway continues attempting PLC reads. When the PLC comes back, link state is restored and reads resume.
  • MQTT impact: None — the cloud connection is independent of PLC connections. The two failures are handled by separate watchdog systems.

Monitoring Your Watchdog (Yes, You Need to Watch the Watcher)

The watchdog itself needs observability:

  1. Log every watchdog trigger with reason (no PUBACK, connection timeout, token expiry)
  2. Count reconnection attempts per hour — a spike indicates infrastructure instability
  3. Track buffer high-water marks — if the buffer repeatedly approaches capacity, your connectivity is too unreliable for the data rate
  4. Alert on repeated authentication failures — this is almost always a credential rotation issue

Platforms like machineCDN build this entire watchdog system into the edge agent — monitoring cloud connections, managing store-and-forward buffers, handling reconnection with awareness of both the MQTT transport state and the buffer delivery state. The result is a self-healing data pipeline where network outages create brief delays in cloud delivery but never cause data loss.

Implementation Checklist

Before deploying your edge gateway to production, verify:

  • Watchdog timer runs independently of MQTT callback threads
  • Connection establishment is fully asynchronous (dedicated thread)
  • Buffer survives connection loss (no flush on disconnect)
  • Buffer overflow discards oldest data, not newest
  • Token/certificate expiration is checked before reconnection
  • Reconnection doesn't overlap with in-progress connection attempts
  • Device status is broadcast on every successful reconnection
  • Buffer drain and new data accept can operate concurrently
  • All watchdog events are logged with timestamps for post-mortem analysis
  • PLC read loop continues uninterrupted during reconnection

The unsexy truth about industrial IoT reliability is that it's not about the protocol choice or the cloud platform. It's about what happens in the 120 seconds after your connection drops. Get the watchdog right, and a 10-minute network outage is invisible to your operators. Get it wrong, and a 2-second glitch creates a permanent hole in your production data.

Build the self-healing pipeline. Your 3 AM self will thank you.