
MQTT QoS Levels for Industrial Telemetry: Choosing the Right Delivery Guarantee [2026]


When an edge gateway publishes a temperature reading from a plastics extruder running at 230°C, does it matter if that message arrives exactly once, at least once, or possibly not at all? The answer depends on what you're doing with the data — and getting it wrong can mean either lost production insights or a network drowning in redundant traffic.

MQTT's Quality of Service (QoS) levels are one of the most misunderstood aspects of industrial IoT deployments. Most engineers default to QoS 1 for everything, which is rarely optimal. This guide breaks down each level with real industrial scenarios, bandwidth math, and patterns that actually work on factory floors where cellular links drop and PLCs generate thousands of data points per second.

The Three QoS Levels — What Actually Happens on the Wire

QoS 0: Fire and Forget

The publisher sends a PUBLISH packet. That's it. No acknowledgment, no retry, no confirmation. The message either arrives or it doesn't.

Wire sequence:

Publisher → Broker: PUBLISH (QoS 0, topic, payload)

One packet. One direction. Zero overhead beyond the MQTT fixed header.

When this makes sense industrially:

  • High-frequency sensor data where individual samples don't matter — vibration waveforms at 1 kHz, temperature readings every 100ms. Missing one out of 600 readings per minute is statistically irrelevant.
  • Supplementary status data like CPU temperature of the gateway itself, memory utilization, or WiFi signal strength.
  • Real-time display feeds where stale data is worse than missing data — showing a live gauge on a dashboard where the next value arrives in 500ms anyway.

The bandwidth math: A QoS 0 PUBLISH with a 20-byte topic and 50-byte payload takes 74 bytes on the wire (2-byte fixed header + 2-byte topic length field + 20-byte topic + 50-byte payload; QoS 0 carries no packet ID). QoS 1 adds a 2-byte packet ID to the PUBLISH plus a 4-byte PUBACK response. That seems trivial, but at 1,000 messages per second the PUBACKs alone add 4 KB/s of upstream traffic, which matters on metered cellular connections.
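The arithmetic above can be sketched as a quick calculation. This is a sketch, not a protocol implementation: it assumes MQTT v3.1.1 framing with a 1-byte remaining-length field (valid for packets under 128 bytes of body), and the 1,000 msg/s rate is the example figure from the text.

```python
# Sketch: per-message wire cost by QoS level, excluding TCP/IP framing.

def publish_bytes(topic_len: int, payload_len: int, qos: int) -> int:
    """Bytes on the wire for one PUBLISH packet."""
    packet_id = 2 if qos > 0 else 0        # QoS 0 carries no packet ID
    return 2 + 2 + topic_len + packet_id + payload_len

def ack_overhead(qos: int) -> int:
    """Acknowledgment bytes per message: PUBACK (4 B) for QoS 1,
    PUBREC + PUBREL + PUBCOMP (3 x 4 B) for QoS 2."""
    return {0: 0, 1: 4, 2: 12}[qos]

msgs_per_sec = 1000
for qos in (0, 1, 2):
    total = publish_bytes(20, 50, qos) + ack_overhead(qos)
    print(f"QoS {qos}: {total} B/msg, "
          f"{ack_overhead(qos) * msgs_per_sec / 1024:.1f} KB/s of ack traffic")
```

The same helper makes it easy to rerun the numbers for your own topic lengths and payload sizes.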

QoS 1: At Least Once

The publisher sends a PUBLISH, the broker acknowledges with PUBACK. If the publisher doesn't receive PUBACK within a timeout, it retransmits with the DUP flag set.

Wire sequence:

Publisher → Broker: PUBLISH (QoS 1, packetId=42, topic, payload)
Broker → Publisher: PUBACK (packetId=42)

This is the workhorse of industrial MQTT. The "at least once" guarantee means the message will definitely arrive — but it might arrive twice if the PUBACK is lost and the publisher retransmits.

The duplicate problem in practice:

Consider a gateway publishing a batch of telemetry containing 15 tag values with a timestamp. The broker receives it and stores it, then sends PUBACK. But the PUBACK is lost (cellular dropout, TCP reset, whatever). The gateway retransmits. Now the backend has two identical batches with the same timestamp.

For time-series data, this is usually harmless — you insert the same values at the same timestamp, and either your database deduplicates or you get two rows that average out to the same value. But for event-driven data like alarm state changes or production counters, duplicates can cause real problems:

  • An alarm "cleared" event received twice might incorrectly decrement an active alarm count
  • A production piece count increment received twice inflates your OEE calculation
  • A "machine started" event received twice could trigger duplicate notifications

Industrial pattern — idempotent message design:

Structure your payloads so that receiving the same message twice produces the same result as receiving it once:

{
  "ts": 1709500800,
  "device_serial": 22091017,
  "device_type": 1010,
  "groups": [{
    "values": [
      {"id": 42, "values": [231.5]},
      {"id": 43, "values": [true]}
    ]
  }]
}

Each message carries absolute state (temperature IS 231.5°C, alarm IS active) rather than deltas (temperature CHANGED BY +2°C). The backend performs upserts keyed on timestamp + device + tag ID. Two identical messages = same database state.
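A minimal sketch of that upsert pattern, assuming the payload shape shown above. The in-memory dict stands in for a real time-series table; the composite key (timestamp, device serial, tag ID) makes redelivery a no-op.

```python
# Timestamp-keyed upsert: ingesting the same QoS 1 batch twice
# leaves the store in exactly the same state as ingesting it once.

store: dict[tuple[int, int, int], list] = {}

def ingest(msg: dict) -> None:
    """Upsert every tag value in the batch; duplicates overwrite in place."""
    for group in msg["groups"]:
        for tag in group["values"]:
            store[(msg["ts"], msg["device_serial"], tag["id"])] = tag["values"]

batch = {
    "ts": 1709500800,
    "device_serial": 22091017,
    "device_type": 1010,
    "groups": [{"values": [{"id": 42, "values": [231.5]},
                           {"id": 43, "values": [True]}]}],
}
ingest(batch)
ingest(batch)          # simulated QoS 1 redelivery
print(len(store))      # prints 2 -- same state as a single delivery
```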

QoS 2: Exactly Once

Four-packet handshake: PUBLISH → PUBREC → PUBREL → PUBCOMP.

Wire sequence:

Publisher → Broker: PUBLISH (QoS 2, packetId=42, topic, payload)
Broker → Publisher: PUBREC (packetId=42)
Publisher → Broker: PUBREL (packetId=42)
Broker → Publisher: PUBCOMP (packetId=42)

Four packets instead of two. Double the round trips. And critically, the broker must maintain session state for each in-flight QoS 2 message until the PUBCOMP arrives.

Why almost nobody uses QoS 2 in industrial telemetry:

  1. Latency: On a link with 150ms one-way delay, the four-packet handshake costs two full round trips (600ms) before that packet ID can be reused. Packet IDs are 16-bit, so at most 65,535 messages can ever be in flight, and realistic gateway windows are far smaller; throughput craters.

  2. Broker memory: Every in-flight QoS 2 message requires the broker to hold state. Multiply by thousands of devices, each publishing hundreds of tags — the broker's memory footprint explodes.

  3. Diminishing returns: If your payload is idempotent (as it should be), QoS 1 with deduplication at the application layer gives you effectively-once delivery at half the network cost.

The one case for QoS 2: Billing and metering data where regulatory compliance demands provable exactly-once delivery and you cannot implement application-layer deduplication. Even then, consider whether QoS 1 + dedup isn't simpler.

The Publish Confirmation Pipeline

Understanding how publish confirmations interact with your edge gateway's buffer is where theory meets production reality.

A well-designed gateway doesn't block waiting for each PUBACK. Instead, it maintains a pipeline:

Pipelined delivery pattern:

Time →
Gateway: [PUB id=1] [PUB id=2] [PUB id=3] [PUB id=4]
Broker: [PUBACK 1] [PUBACK 2] [PUBACK 3] [PUBACK 4]

The gateway tracks which packet IDs are in-flight. When a PUBACK arrives, it marks that packet as delivered and advances its read pointer. This is critical for store-and-forward buffers — you can't free a buffer page until you confirm every message on it has been acknowledged.
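The tracking described above can be sketched as a small bookkeeping class. Class and method names are illustrative, not from any real gateway firmware; the point is that a buffer page is freed only once every message on it has been acknowledged.

```python
# Pipelined confirmation tracking: packet IDs map to the buffer page
# holding the message; a page is recycled only when its last message
# has been PUBACKed.

class ConfirmationPipeline:
    def __init__(self):
        self.inflight: dict[int, int] = {}        # packet_id -> page number
        self.pending_per_page: dict[int, int] = {}
        self.freed_pages: list[int] = []

    def on_publish(self, packet_id: int, page: int) -> None:
        self.inflight[packet_id] = page
        self.pending_per_page[page] = self.pending_per_page.get(page, 0) + 1

    def on_puback(self, packet_id: int) -> None:
        page = self.inflight.pop(packet_id)
        self.pending_per_page[page] -= 1
        if self.pending_per_page[page] == 0:      # last message on the page
            del self.pending_per_page[page]
            self.freed_pages.append(page)         # safe to recycle now

pipe = ConfirmationPipeline()
for pid in (1, 2, 3):
    pipe.on_publish(pid, page=0)
pipe.on_puback(1)
pipe.on_puback(3)
print(pipe.freed_pages)   # prints [] -- id=2 still unconfirmed
pipe.on_puback(2)
print(pipe.freed_pages)   # prints [0]
```

Note that PUBACKs can arrive out of order (3 before 2 here) without ever freeing a page prematurely.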

What happens during disconnection:

Time →
Gateway: [PUB id=5] [PUB id=6] ← connection drops
... 45 seconds of buffering ...
[reconnect] [PUB id=7 (buffered)] [PUB id=8 (buffered)] [PUB id=9 (live)]
Broker: [PUBACK 7] [PUBACK 8] [PUBACK 9]

Messages published during the outage (id=5 and id=6) were never acknowledged. Were they received? Maybe. The gateway should retransmit them, but now you're potentially sending duplicates. The defense is the same: idempotent payloads with timestamp-based deduplication.

Bandwidth Impact by QoS Level

Here's real math for a typical industrial deployment — a gateway monitoring 50 tags on a plastics injection molding machine, publishing every 5 seconds:

| Metric                           | QoS 0 | QoS 1   | QoS 2   |
| -------------------------------- | ----- | ------- | ------- |
| Packets per publish cycle        | 1     | 2       | 4       |
| Bytes per cycle (500 B payload)  | 524   | 528     | 536     |
| Hourly overhead (720 cycles)     | 0 B   | 2,880 B | 8,640 B |
| Monthly overhead (cellular)      | 0 MB  | 2.1 MB  | 6.2 MB  |

For a single device, the difference is negligible. But scale to 500 devices on metered cellular, and you're looking at 0 vs. 1 GB vs. 3 GB of monthly overhead — just for acknowledgment packets.
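A quick check of those fleet-scale figures, using the example deployment's numbers (5-second cycle, so 720 cycles per hour, over a 30-day month and 500 devices):

```python
# Acknowledgment overhead per cycle (0 / 4 / 12 bytes for QoS 0/1/2),
# scaled to a month of 5-second publish cycles across a 500-device fleet.

devices = 500
cycles_per_month = 720 * 24 * 30       # 720 cycles/hour, 24 h, 30 days

results = {}
for qos, ack_bytes in {0: 0, 1: 4, 2: 12}.items():
    results[qos] = ack_bytes * cycles_per_month * devices / 1e9
    print(f"QoS {qos}: {results[qos]:.1f} GB/month of pure ack traffic")
```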

Binary vs. JSON Payloads and QoS Interaction

The QoS level interacts heavily with your payload encoding strategy. At QoS 1, every retransmission resends the full payload. If your batch is a 4 KB JSON blob, that retransmission costs 4 KB. If you've encoded the same data as a compact binary format — perhaps 800 bytes — retransmission costs 800 bytes.

Binary batch structure (conceptual):

[0xF7]                 // magic byte (1 byte)
[number_of_groups]     // uint32 big-endian (4 bytes)
  per group:
  [timestamp]          // uint32 (4 bytes)
  [device_type]        // uint16 (2 bytes)
  [device_serial]      // uint32 (4 bytes)
  [number_of_values]   // uint32 (4 bytes)
    per value:
    [tag_id]           // uint16 (2 bytes)
    [status]           // uint8 (1 byte)
    [values_count]     // uint8 (1 byte)
    [value_size]       // uint8 (1 byte)
    [value_data]       // N bytes

A batch of 50 uint16 tag values in this format is roughly 360 bytes. The same data in JSON: {"groups":[{"ts":1709500800,"device_type":1010,"serial_number":22091017,"values":[{"id":1,"values":[4523]},{"id":2,"values":[8891]},...]}]} — easily 2,000+ bytes.

At QoS 1, where retransmissions are expected during connectivity hiccups, binary encoding effectively gives you 5x more retransmission budget for the same bandwidth cost.
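An encoder for the conceptual layout above can be sketched with Python's struct module. The field widths follow the diagram; the exact framing of any real format is an assumption here, and the single-group, single-sample-per-tag shape is chosen to match the 50-tag example.

```python
import struct

# Sketch of the conceptual binary batch layout: big-endian fields,
# one group, one uint16 sample per tag.

def encode_batch(ts: int, dev_type: int, serial: int,
                 values: dict[int, int]) -> bytes:
    out = struct.pack(">BI", 0xF7, 1)               # magic, number_of_groups
    out += struct.pack(">IHII", ts, dev_type, serial, len(values))
    for tag_id, v in values.items():
        data = struct.pack(">H", v)                 # one uint16 sample
        out += struct.pack(">HBBB", tag_id, 0, 1, len(data)) + data
    return out

tags = {i: 4000 + i for i in range(1, 51)}          # 50 uint16 tag values
packed = encode_batch(1709500800, 1010, 22091017, tags)
print(len(packed))   # prints 369 -- "roughly 360 bytes" as above
```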

Practical Patterns for Factory Deployments

Pattern 1: Split QoS by Data Priority

Don't use the same QoS for everything. Segment your data:

  • QoS 0: Continuous process data (temperature, pressure, flow rate at 1-second intervals)
  • QoS 1: Alarm state changes, production events, batch completions
  • QoS 1: Periodic status reports (gateway health, link state, diagnostic data)

This keeps your high-frequency telemetry flowing without congesting the acknowledgment pipeline, while ensuring critical events are guaranteed to arrive.

Pattern 2: Batch Before Publish

Instead of publishing each tag value individually (50 publishes per cycle × QoS 1 = 100 packets), batch all values from a single read cycle into one message (1 publish × QoS 1 = 2 packets). This reduces packet overhead by 98% while maintaining the same delivery guarantee.

A well-designed edge gateway collects values over a configurable time window (typically 5-60 seconds), groups them by timestamp, and publishes one consolidated batch. If that batch exceeds a size threshold (say, 500 KB), it's split into pages that are published and confirmed individually.
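The batch-then-page flow can be sketched as follows. The 4 KB limit, JSON encoding, and sample shape are illustrative (the text's 500 KB threshold would work the same way); the point is that each page stays under the limit so it can be published and confirmed on its own.

```python
import json

# Accumulate one window of tag reads, then split the batch into pages
# that each fit under a size limit.

def paginate(samples: list, limit: int = 4096) -> list:
    pages, current = [], []
    for s in samples:
        candidate = current + [s]
        if len(json.dumps(candidate).encode()) > limit and current:
            pages.append(json.dumps(current).encode())  # close this page
            current = [s]
        else:
            current = candidate
    if current:
        pages.append(json.dumps(current).encode())
    return pages

# 10 seconds of 1 Hz reads from 50 tags = 500 samples in one window
window = [{"ts": 1709500800 + t, "id": i, "v": i * 10}
          for t in range(10) for i in range(50)]
pages = paginate(window)
print(len(pages), max(len(p) for p in pages))
```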

Pattern 3: Store-and-Forward with QoS 1

The most robust pattern for unreliable connections:

  1. Gateway reads PLC tags and writes values into a local ring buffer
  2. Buffer is divided into pages (e.g., 4 KB each)
  3. When connected, the gateway publishes the oldest unconfirmed page at QoS 1
  4. On PUBACK, it marks the page as delivered and advances to the next
  5. On disconnect, it continues buffering locally
  6. On reconnect, it resumes from the last unconfirmed page

The buffer size determines your survivable outage window. A 512 KB buffer with 4 KB pages and 10-second batch intervals holds roughly 21 minutes of data. Increase to 2 MB and you're covered for nearly 90 minutes.
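Under the assumptions in the list above (4 KB pages, one page per 10-second batch, oldest data dropped when full), the buffer can be sketched as an in-memory stand-in for the flash ring buffer a real gateway would use; names here are illustrative.

```python
from collections import deque

# Paged store-and-forward buffer: publish the oldest unconfirmed page
# first, recycle a page only after its PUBACK, drop oldest when full.

class PagedBuffer:
    def __init__(self, total_kb: int = 512, page_kb: int = 4):
        self.capacity = total_kb // page_kb          # pages that fit
        self.pages = deque()

    def append(self, page: bytes) -> None:
        if len(self.pages) == self.capacity:         # buffer full:
            self.pages.popleft()                     # overwrite oldest page
        self.pages.append(page)

    def oldest_unconfirmed(self):
        return self.pages[0] if self.pages else None

    def confirm(self) -> None:
        self.pages.popleft()                         # PUBACKed: recycle page

buf = PagedBuffer()
seconds_of_data = buf.capacity * 10                  # one page per 10 s batch
print(round(seconds_of_data / 60, 1))                # prints 21.3 (minutes)
```

Swapping in `PagedBuffer(total_kb=2048)` gives 512 pages, or roughly 85 minutes at the same batch rate, matching the "nearly 90 minutes" figure above.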

Pattern 4: Connection Watchdog with PUBACK Monitoring

Track the timestamp of the last successful PUBACK. If no confirmation arrives within N seconds (e.g., 60), force a reconnect. This catches silent connection failures where TCP thinks the link is alive but the broker has dropped the session.

if (now - last_puback_time > watchdog_timeout) {
    // Force disconnect and reconnect
    disconnect();
    reconnect_with_backoff();
}

MQTT v3.1.1 vs v5.0 QoS Considerations

Most industrial deployments still run MQTT v3.1.1 (protocol version 4), which is battle-tested and universally supported. MQTT v5.0 adds some relevant QoS-adjacent features:

  • Message Expiry Interval: Automatically discards messages that sit in the broker queue too long — useful for time-sensitive process data that's meaningless if delayed.
  • Topic Aliases: Reduces per-message overhead by replacing long topic strings with short integers — significant when publishing thousands of messages.
  • Flow Control: Allows the receiver to control how many QoS 1/2 messages the sender can have in-flight simultaneously, preventing buffer overflow.

However, v5.0 adoption in embedded gateways and industrial brokers is still limited. Design for v3.1.1 compatibility and treat v5.0 features as optimizations.

How machineCDN Handles Delivery Guarantees

machineCDN's edge gateway firmware uses QoS 1 for all telemetry delivery, combined with a paged ring buffer that provides store-and-forward resilience during connectivity drops. The binary encoding format keeps retransmission costs minimal, and the publish confirmation pipeline tracks per-packet delivery status so that buffer pages are only recycled after every message on the page has been acknowledged.

The gateway monitors PUBACK latency as a connection health indicator — if confirmations stop arriving despite an apparently active TCP connection, it triggers a forced reconnect. This catches the "half-open connection" failure mode that plagues cellular deployments, where the TCP stack believes the connection is alive long after the network path has failed.

For alarm state changes and link-state transitions, the gateway bypasses the batch pipeline entirely and publishes immediately at QoS 1, ensuring that critical events reach the cloud without waiting for the next batch window.

Key Takeaways

  1. QoS 1 is the industrial default — it guarantees delivery without the overhead of QoS 2
  2. Make your payloads idempotent — design for "at least once" by using absolute values and timestamp-keyed upserts
  3. Binary encoding multiplies your retransmission budget — 5x smaller payloads mean 5x more affordable retransmissions
  4. Batch aggressively — one batched publish at QoS 1 beats fifty individual publishes
  5. Monitor PUBACKs as a health signal — the gap between publish and confirmation reveals connection quality before total failure
  6. Split QoS by data type — high-frequency process data tolerates QoS 0; events and alarms need QoS 1

The right QoS strategy isn't about choosing a number — it's about understanding your data's tolerance for loss, duplication, and latency, then engineering the pipeline to match.