Events, Messages, and Delivery
The question that actually determines your architecture.
Isolation — each subscriber gets independent progress tracking
Error Handling — platform-managed retries and dead-letter queues
Load Leveling — elastic scaling without partition ceilings
Ordering — guaranteed FIFO (First In, First Out) per entity key
Replay — re-read history for recovery or analytics
Exactly-Once — no duplicates, no loss
These map to four delivery classes: D0 Stream (replay-only) · D1 Brokered Event (needs reliability) · D2 Brokered + FIFO (needs ordering) · D3 Command (directed intent).
A Catalog Gate checkpoint (4 fields for streams, full gate for broker flows) prevents teams from defaulting to whatever they used last time. Pilot with 3 teams, classify highest-incident flows, then mandate for all new flows.
There's a phrase that quietly spreads through engineering teams:
"Events go to Event Hubs. Messages go to Service Bus or RabbitMQ."
It sounds clean. It sounds architectural. And the technology names reinforce it: Azure Event Hubs — surely events go here. AMQP (Advanced Message Queuing Protocol) — surely messages go here.
But events are messages. The word "message" doesn't mean "command." It means "envelope." The same trap exists everywhere — the industry labels create false categories that don't exist at the design level.
The Problem Nobody Talks About
Most systems don't fail because of bad code.
They fail because of bad assumptions about delivery.
At some point, every system runs into the same reality:
- Traffic spikes unexpectedly
- Consumers fall behind
- Failures require retries
- Ordering suddenly matters
- Someone asks, "Can we replay this?"
And that's when the cracks show.
Because the real question was never "Is this an event or a message?"
The real question was always: "What does this need to survive in production?"
Events, Commands, and Messages
A message is the envelope. What's inside determines the meaning:
→ Event = fact: "this already happened" — PaymentCompleted, UserRegistered
→ Command = intent: "please do this" — ProcessPayment, CreateUser
Events are messages. Commands are messages. Both travel through the same infrastructure. The distinction is what's inside the envelope, not which pipe carries it. Production frameworks are built on exactly this model; NServiceBus defines it explicitly: "A message is the unit of communication. There are two types of messages: commands and events."
Commands change state. Events describe state changes. Treating them the same breaks ownership, idempotency, and debuggability.
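The envelope-versus-contents distinction fits in a few lines. A hedged sketch in Python; the `Message` shape and its field names are illustrative, not any framework's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical envelope: events and commands travel in the same shape.
@dataclass
class Message:
    kind: str          # "event" (fact) or "command" (intent)
    type: str          # e.g. "PaymentCompleted" or "ProcessPayment"
    body: dict
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# An event states a fact that already happened...
event = Message(kind="event", type="PaymentCompleted", body={"payment_id": "p-1"})
# ...a command directs a specific handler to act.
command = Message(kind="command", type="ProcessPayment", body={"payment_id": "p-2"})

assert event.kind != command.kind      # different meaning...
assert type(event) is type(command)    # ...same envelope
```

Same envelope, same pipe; only the meaning of the contents differs.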
Where things go wrong is this assumption: that events are low-stakes notifications while commands carry the critical work, so the label alone tells you which guarantees a flow needs.
That's not true.
An event can be mission-critical. A command can be low importance. The name doesn't determine the guarantees.
Even Microsoft's official documentation reinforces this confusion: their own messaging services comparison files each product under either "events" or "messages", as if the product name settled the delivery guarantees.
Six Delivery Properties
What actually matters isn't what you call it — it's what your system needs to guarantee. Reliability lives in the transport, not in the label. Every flow can be classified by which of these six properties it requires:
P1 Isolation — Each logical subscriber gets isolated, durable progress tracking. Both Kafka consumer groups and Service Bus subscriptions provide this. The differentiator is how each subscriber scales internally — that's Load Leveling.
P2 Error Handling — Brokers provide retries + DLQ (Dead Letter Queue) natively and consistently. Kafka has platform-level DLQ in Connect and error handlers in Streams — but custom consumers still build their own. The real difference: consistency and discoverability across all consumers.
P3 Load Leveling — Competing consumers can be spun up to drain spikes. Elastic, not bounded by partition count.
P4 Ordering — Guaranteed FIFO for an entity key. Preserves invariants. Implies Isolation.
P5 Replay — Re-read history for reprocessing, forensics, or recovery. Essential for analytics and audit.
P6 Exactly-Once — Processed exactly once: no duplicates, no loss. Kafka supports EOS natively (idempotent producers + transactional API). Service Bus offers bounded-window deterministic dedup via MessageId — guaranteed within the window, but not true EOS. For financial flows, ensure consumer idempotency regardless of transport.
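Whatever the transport guarantees, the last line of defense is consumer idempotency. A minimal sketch, with an in-memory set standing in for the durable dedup store (a DB unique constraint, Redis, etc.) a real system would use:

```python
# Hedged sketch: consumer-side idempotency keyed on a message ID.
# In production, `processed_ids` would be a durable store; an
# in-memory set stands in here.
processed_ids = set()
ledger = []  # stand-in for the external system we write to

def handle_payment(message_id: str, amount: int) -> bool:
    """Apply the payment once; redeliveries become no-ops."""
    if message_id in processed_ids:
        return False            # duplicate delivery: skip the side effect
    ledger.append(amount)       # the non-idempotent external write
    processed_ids.add(message_id)
    return True

# At-least-once delivery may hand us the same message twice:
assert handle_payment("msg-42", 100) is True
assert handle_payment("msg-42", 100) is False   # dedup catches the retry
assert ledger == [100]                           # written exactly once
```

The check-then-write here is not atomic; a real implementation makes the dedup check and the external write a single transaction or uses a unique-key constraint.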
Two Worlds: Streams vs Brokers
Streams (Kafka, Event Hubs, Kinesis, Pub/Sub)
Built for scale. Great at high-throughput ingestion, real-time pipelines, analytics, and replay.
But parallelism is bounded by partition count. Within a consumer group, one active consumer per partition. Extra consumers sit idle. If your consumer is slow, you can't just "add more workers" like you can with a broker.
Streams support retry and error handling patterns, but at the application level, not the platform level. Each team builds their own — differently, with different quality.
Streams guarantee delivery to the log. They delegate what happens after.
Brokers (Service Bus, SQS/SNS, RabbitMQ)
Built for control. Per-consumer queues, built-in retries and DLQs, competing consumers for elastic scaling, and optional strict ordering.
Slower per-message than streams, but you can add workers to drain spikes without a partition-based ceiling (Service Bus supports up to 1,000 concurrent connections per entity).
Failures are isolated per subscription. Abandoned subscriptions can fill up and hit topic-level quotas. TTLs, DLQ policies, and depth alerts are still necessary.
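What the broker manages natively can be pictured as a small loop: bounded redeliveries, then dead-lettering with a reason code. A hedged sketch; the names and the retry count are illustrative, not any broker's API:

```python
# Illustrative model of broker-managed retries + DLQ.
MAX_DELIVERIES = 3
dead_letter_queue = []

def deliver(message: dict, handler) -> str:
    """Attempt delivery up to MAX_DELIVERIES times, then dead-letter."""
    for _ in range(MAX_DELIVERIES):
        try:
            handler(message)
            return "completed"
        except Exception as exc:
            last_error = str(exc)   # broker records the failure reason
    dead_letter_queue.append(
        {"message": message, "reason": last_error, "deliveries": MAX_DELIVERIES}
    )
    return "dead-lettered"

def always_fails(msg):
    raise ValueError("downstream timeout")

assert deliver({"id": 1}, always_fails) == "dead-lettered"
assert dead_letter_queue[0]["reason"] == "downstream timeout"
```

The point of the sketch: this loop exists once, in the platform, with one observable DLQ format, instead of being rebuilt differently by every consumer team.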
When Streams Are the Right Default
This framework is not "always use a broker." There are architectures where streams are architecturally superior:
- Event sourcing — the append-only log IS the source of truth. Log compaction retains the latest value per key indefinitely.
- CQRS read-model projection — replay rebuilds projections, ordering preserves consistency
- High-throughput ingestion — millions of events/sec for clickstream, IoT, metrics
- Log-based architectures — the log as the central nervous system (Jay Kreps' "The Log")
- Analytics and data pipelines — data lakes, ML pipelines, real-time dashboards
For organizations with mature Kafka operations, stream-first with selective broker bridging for consumers that need error handling, load leveling, or ordering is an equally valid architecture.
The Partition Math Problem
Two throughput models determine whether your consumers keep up:
Sequential Model
One event at a time per partition. Use this model when each event requires an independent external call (API, database write, enrichment lookup).
Batched Model
Process multiple events per pass. Use this when events can be processed in bulk (batch inserts, aggregations, analytics writes, ML inference batches). Batch Size = events processed concurrently per pass, not max.poll.records.
When to use which model
Sequential (per-event)
- External API calls per event
- Database writes with per-row logic
- Enrichment lookups
- Payment processing
- Any work that can't be batched
Batched (bulk processing)
- Batch database inserts
- Aggregation / windowed analytics
- ML inference batches
- Data lake writes (Parquet/Avro)
- Log/metric forwarding
The partition count remains the hard parallelism ceiling in both models. Batching improves throughput per partition but doesn't remove the limit. With a broker, you can add workers freely regardless of model.
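The math above is simple enough to write down. A sketch of both throughput models, using the numbers from this section:

```python
# Partition math: partitions are the hard parallelism ceiling.
# Batching raises per-partition throughput but not the ceiling.
def sequential_throughput(partitions: int, seconds_per_event: float) -> float:
    """Max events/sec when each partition handles one event at a time."""
    return partitions / seconds_per_event

def batched_throughput(partitions: int, batch_size: int,
                       seconds_per_batch: float) -> float:
    """Max events/sec when each partition processes batch_size events per pass."""
    return partitions * batch_size / seconds_per_batch

# A 500ms consumer on 10 partitions tops out at 20 events/sec;
# 32 partitions lift the ceiling to 64/sec.
assert sequential_throughput(10, 0.5) == 20.0
assert sequential_throughput(32, 0.5) == 64.0
# A 2s consumer on 10 partitions caps at 5 events/sec, whatever
# rate the producer publishes at.
assert sequential_throughput(10, 2.0) == 5.0
# Batched: 32 partitions, 100-event micro-batches at ~50ms/batch.
assert batched_throughput(32, 100, 0.05) == 64000.0
```

If the producer publishes faster than the applicable ceiling, lag grows without bound until something changes: partitions, batch size, or architecture.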
Options when consumers can't keep up:
- Add partitions: 10 → 32 partitions takes a 500ms consumer from 20/sec to 64/sec. But Event Hubs Standard caps at 32 (immutable), Premium goes up to 1,024, Kafka roughly 4,000/broker. Counts can't be decreased once increased.
- Accept the lag: If the consumer doesn't need real-time and spikes are temporary, the backlog drains during off-peak. Valid for analytics and reporting. Not valid for payments.
- Bridge to a broker: Push to a broker queue in ~200ms instead of processing inline for 2s, then drain with unlimited competing workers (Service Bus supports 1,000 connections). But now you've built a bridge from stream to broker.
The Bad Consumer Problem
The rule: one slow consumer should never force a producer to change its architecture.
What Consumer D wants
Move the producer to Service Bus queues so they can use competing consumers for their slow enrichment calls.
Problem: This hurts Consumers A, B, and C who are built for stream-speed processing.
What Consumer D should do
Receive the event from the stream in milliseconds, push it to their own queue, and process at their own pace with competing workers.
Result: Producer doesn't change. Other consumers don't notice.
It's not just load leveling. A consumer might bridge to a broker because they need per-key ordering (FIFO) that the stream only provides within a partition. Or because they need platform-managed DLQ with reason codes instead of building their own. Or because they need exactly-once processing guarantees for financial transactions. Each of these is a valid reason to bridge — error handling, load leveling, ordering, exactly-once.
The contradiction resolved: One consumer bridging to their own queue for any of these reasons = fine. Five teams independently building five different bridges for five different delivery needs = anti-pattern. If multiple consumers need broker-grade properties, either centralize the bridge as a platform service, or reconsider whether the producer should be on a broker in the first place.
flowchart LR
P["Producer - 50 events/sec"] --> EH["Event Hub - 10 Partitions"]
EH --> CG1["Consumer Group A - Payment Service - 100ms avg"]
EH --> CG2["Consumer Group B - Analytics - 50ms avg"]
EH --> CG3["Consumer Group C - Invoice Enrichment - 2s avg"]
CG3 --> Q["Service Bus Queue - enqueue in 200ms"]
Q --> W1["Worker 1"]
Q --> W2["Worker 2"]
Q --> W3["Worker 3"]
Q --> WN["Worker N"]
style P fill:#1a1000,stroke:#fbbf24,color:#f5f5fa
style EH fill:#0a1a2e,stroke:#22d3ee,color:#f5f5fa
style CG1 fill:#0d1520,stroke:#34d399,color:#f5f5fa
style CG2 fill:#0d1520,stroke:#34d399,color:#f5f5fa
style CG3 fill:#2d1010,stroke:#ef4444,color:#f5f5fa
style Q fill:#1a1040,stroke:#a78bfa,color:#f5f5fa
style WN fill:#0d1520,stroke:#34d399,color:#f5f5fa
Consumer Groups A and B keep up fine. Consumer Group C takes 2 seconds per event and falls behind at 5 events/sec max (10 partitions / 2s). The fix: enqueue to a Service Bus queue in ~200ms, then drain with unlimited competing workers. But you've just built the bridge yourself.
Where Systems Actually Break
It works... until:
- Consumer lag grows. No platform-managed backpressure. The partition ceiling blocks scaling.
- No platform DLQ. Each team builds custom retry logic with different quality.
- Cross-partition ordering isn't guaranteed. Teams discover this after the bug ships.
The Real Tradeoffs
Neither approach is free. The question is where the complexity lives and what you get in return.
A Better Way to Think About It
Stop choosing based on "event vs message." Start choosing based on what your flow actually needs:
Quick Reference
Only need Replay?
Event Hubs / Kafka / Kinesis / Pub/Sub
Need Error Handling or Load Leveling?
SB Topic / SNS+SQS / Pub/Sub
Need Ordering + Error Handling or Scaling?
SB Sessions / SQS FIFO or Kafka partition-key (if error handling/scaling not needed)
Directed intent: "do this"?
SB Queue / SQS / Pub/Sub
Need reliability AND replay?
Broker for delivery, mirror to stream for analytics (or vice versa)
Internal Concurrency: The Hidden Broker
There's a common response to the partition math problem: "Just add concurrency inside the consumer."
The consumer receives events, pushes them into an internal queue, and a worker pool processes them in parallel. It works. But you're now responsible for internal queuing, concurrency control, offset coordination (can't commit until all in-flight messages complete), retry paths, and ordering preservation.
Kafka Streams and Apache Flink provide framework-level solutions for internal concurrency and state management within the stream ecosystem — but these are stream processing frameworks with their own complexity, not the raw consumer model.
This is not wrong. It's a valid pattern. But it should be a conscious decision.
Use internal concurrency when:
- Ordering does NOT matter
- Processing is parallelizable
- High throughput needed, can't add partitions
- Analytics, enrichment, fan-out workloads
Avoid it when:
- Per-key ordering matters (payments, entity state)
- Financial correctness is required
- Retries must be consistent and observable
- You need platform-level DLQ and monitoring
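One reason the pattern is subtle: offset coordination. When workers complete out of order, you may only commit the highest contiguous completed offset, or a crash silently skips in-flight messages. A sketch; the helper name is this example's own:

```python
# With an internal worker pool, completions arrive out of order.
# The safe commit point is the highest offset whose predecessors
# have ALL completed.
def safe_commit_offset(completed: set, last_committed: int) -> int:
    """Advance past contiguous completed offsets only."""
    offset = last_committed
    while offset + 1 in completed:
        offset += 1
    return offset

# Workers finished offsets 0, 1, 2, 4, 5; offset 3 is still in flight.
completed = {0, 1, 2, 4, 5}
assert safe_commit_offset(completed, -1) == 2   # cannot skip past 3
completed.add(3)                                # 3 finally completes
assert safe_commit_offset(completed, -1) == 5   # now the whole run commits
```

This is exactly the bookkeeping a broker's per-message acknowledgement makes unnecessary, and what frameworks like Kafka Streams manage for you inside the stream ecosystem.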
You Don't Eliminate Complexity — You Relocate It
Stream-based systems push complexity into consumers: internal queues, retry logic, concurrency control, offset management. Broker-based systems push complexity into platform infrastructure: queue configuration, DLQ policies, subscription management.
The architectural decision is not about avoiding complexity. It's about where that complexity should live.
The D0-D3 Taxonomy
Detailed Classification
These delivery properties map to four practical classes. The D0-D3 classification is a decision framework, not a strict industry standard. Equivalent patterns can be implemented using different technologies depending on platform and team maturity.
Decision Tree
Step 0: Does this need to be async at all? If the producer can wait and the operation completes in <500ms, consider a synchronous call. Not everything needs a message.
flowchart TD
START["🔵 New Message"] --> Q1{"Fact or Intent?"}
Q1 -->|"FACT (Event)"| GATE["📋 Catalog Gate"]
Q1 -->|"INTENT (Command)"| D3["🟣 D3: Command\nSB Queue / Topic"]
GATE --> Q2{"Ordering\nneeded?"}
Q2 -->|Yes| Q3{"Also need error\nhandling or scaling?"}
Q2 -->|No| Q4{"Error handling\nor scaling needed?"}
Q3 -->|Yes| D2["🟡 D2: Brokered FIFO\nSB Sessions"]
Q3 -->|No| KFK["🔵 Kafka\npartition-key routing"]
Q4 -->|Yes| D1["🟣 D1: Brokered Event\nSB Subscriptions"]
Q4 -->|No| Q5{"Only Replay\nneeded?"}
Q5 -->|Yes| D0["🔵 D0: Stream\nEvent Hubs / Kafka"]
Q5 -->|No| RE["🔴 Revisit\nRequirements"]
D1 --> MIR{"Also need\nReplay?"}
D2 --> MIR
MIR -->|Yes| MIRROR["🟢 Mirror\nSB → EH"]
MIR -->|No| DONE["✅ Done"]
classDef start fill:#1a1a2e,stroke:#6e6e8a,color:#f5f5fa
classDef gate fill:#1a1a2e,stroke:#6e6e8a,color:#a8a8c0
classDef question fill:#1a1040,stroke:#a78bfa,color:#f5f5fa
classDef resultPink fill:#2d1040,stroke:#f472b6,color:#f5f5fa
classDef resultAmber fill:#1a1000,stroke:#fbbf24,color:#f5f5fa
classDef resultCyan fill:#0a1a2e,stroke:#22d3ee,color:#f5f5fa
classDef resultGreen fill:#0d1520,stroke:#34d399,color:#f5f5fa
classDef resultRed fill:#2d1010,stroke:#ef4444,color:#f5f5fa
classDef done fill:#1a1a2e,stroke:#6e6e8a,color:#a8a8c0
class START start
class GATE gate
class Q1,Q2,Q3,Q4,Q5,MIR question
class D3 resultPink
class D2 resultAmber
class D1 resultPink
class D0,KFK resultCyan
class MIRROR resultGreen
class RE resultRed
class DONE done
For stream-first organizations, invert: start at D0 and bridge to brokers where error handling or scaling is needed.
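The tree also reduces to a few conditionals. A sketch of the same classification; the parameter names are this example's own, and the "also need replay → mirror" follow-up is left out:

```python
# The D0-D3 decision tree as a function (framework taxonomy, not a standard).
def classify(is_fact: bool, ordering: bool, error_or_scaling: bool,
             replay: bool) -> str:
    if not is_fact:
        return "D3: Command (broker queue/topic)"       # intent, not fact
    if ordering:
        return ("D2: Brokered FIFO (sessions)" if error_or_scaling
                else "Kafka partition-key routing")     # ordering + replay natively
    if error_or_scaling:
        return "D1: Brokered Event (subscriptions)"
    if replay:
        return "D0: Stream (Event Hubs / Kafka)"
    return "Revisit requirements"                       # maybe a sync call

assert classify(False, False, False, False).startswith("D3")
assert classify(True, True, True, False).startswith("D2")
assert classify(True, True, False, True) == "Kafka partition-key routing"
assert classify(True, False, True, False).startswith("D1")
assert classify(True, False, False, True).startswith("D0")
assert classify(True, False, False, False) == "Revisit requirements"
```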
Or walk through it step by step:
- What type of content does it carry?
- Record delivery requirements: does this flow need FIFO ordering per entity key?
- If it does: do you also need managed error handling or elastic load leveling?
- If it doesn't: does this flow need managed error handling or elastic load leveling?
D3 — Command
Service Bus Queue or Topic. Directed intent with guaranteed handling.
D2 — Brokered + FIFO
Service Bus Topic + Sessions. Per-key ordering with managed error handling.
Also need replay? → Mirror SB to Event Hubs for analytics.
Kafka — Partition-Key Routing
Ordering + Replay natively in one system. No mirroring needed. Ideal for event sourcing and CQRS.
D1 — Brokered Event
Service Bus Topic + Subscriptions. Managed DLQ, elastic scaling, subscriber isolation.
Also need replay? → Mirror SB to Event Hubs for analytics.
D0 — Stream
Event Hubs or Kafka. Replay-first. Consumer group isolation native. Best for analytics, telemetry, event sourcing.
Revisit Requirements
No delivery properties identified. Does this need to be a message at all? Consider a synchronous call.
For stream-first organizations, invert: start with D0 and bridge to brokers where error handling or load leveling is needed.
PaymentFailed, AccountSuspended, and InvoiceOverdue are all facts — but many subscribers treat them as implicit triggers. This is normal. Classify by the publisher's intent, not the subscriber's reaction. If the publisher broadcasts a fact without caring who acts on it, it's an event. If it directs a specific handler to do a specific thing, it's a command. When genuinely ambiguous, resolve it during the Catalog Gate.
The Catalog Gate
A lightweight governance checkpoint recorded before first publish. It forces teams to make explicit, defensible decisions rather than defaulting to whatever the last project used.
D0 fast path: For D0 flows, require only 4 fields: P1-P6 assessment (confirm No except P5), data owner, volume, on-call team. Full gate applies to D1-D3.
Exactly-Once note: For any flow with financial impact, exactly-once semantics must be addressed explicitly. Neither Kafka EOS nor Service Bus dedup eliminates the need for idempotent consumer logic when writing to external systems.
Re-evaluate when new consumers onboard, when incident patterns change, or annually. Consumer populations shift.
Adoption: Phase 1: Pilot with 3 teams on new flows. Phase 2: Classify top 10-20 highest-incident flows, migrate top 3-5. Phase 3: Mandate for all new flows. Requires a platform team of 2-4 engineers.
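A gate record for the D0 fast path might look like this. The record shape and field names are this sketch's assumptions; the article fixes only which four fields are required, and P5 = Replay follows from D0 being replay-only:

```python
from dataclasses import dataclass

# Hypothetical Catalog Gate record for the D0 fast path.
@dataclass
class CatalogGateRecord:
    flow_name: str
    properties: dict      # P1-P6 -> bool (assumed order: Isolation ... Exactly-Once)
    data_owner: str
    volume_per_hour: int
    on_call_team: str

def d0_fast_path_eligible(record: CatalogGateRecord) -> bool:
    """D0 fast path: every delivery property is No except P5 (Replay)."""
    return all(v == (k == "P5") for k, v in record.properties.items())

clickstream = CatalogGateRecord(
    flow_name="clickstream",
    properties={"P1": False, "P2": False, "P3": False,
                "P4": False, "P5": True, "P6": False},
    data_owner="data-eng", volume_per_hour=2_400_000, on_call_team="data-eng",
)
assert d0_fast_path_eligible(clickstream)
```

Any flow that fails this check falls through to the full gate for D1-D3.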
Hybrid Patterns: Combining Streams and Brokers
Most production systems need both. The question is which direction the data flows.
Pattern A: Broker-First with Stream Mirroring
Producer publishes to a broker. A platform-managed bridge mirrors to a stream for analytics and replay.
flowchart LR
PROD["Producer"] --> SB["Service Bus\n(Authoritative)"]
SB --> C1["Consumer A\nError Handling + Scaling"]
SB --> C2["Consumer B\nFIFO Ordering"]
SB --> BRIDGE["Platform Bridge\n(Managed)"]
BRIDGE --> EH["Event Hubs\n(Replay + Analytics)"]
EH --> AN["Analytics"]
EH --> DL["Data Lake"]
style SB fill:#1a1040,stroke:#a78bfa,color:#f5f5fa
style EH fill:#0a1a2e,stroke:#22d3ee,color:#f5f5fa
style BRIDGE fill:#0d1520,stroke:#34d399,color:#f5f5fa
Pattern B: Stream-First with Broker Bridging
Producer publishes to a stream. Consumers that need broker-grade delivery bridge to their own queues.
flowchart LR
PROD["Producer"] --> EH["Event Hubs / Kafka\n(Authoritative)"]
EH --> CGA["Consumer A\nAnalytics - Direct"]
EH --> CGB["Consumer B\nML Pipeline - Direct"]
EH --> CGC["Consumer C\nNeeds Reliability"]
CGC --> Q["Own SB Queue"]
Q --> W1["Worker 1"]
Q --> W2["Worker N"]
style EH fill:#0a1a2e,stroke:#22d3ee,color:#f5f5fa
style Q fill:#1a1040,stroke:#a78bfa,color:#f5f5fa
style CGA fill:#0d1520,stroke:#34d399,color:#f5f5fa
style CGB fill:#0d1520,stroke:#34d399,color:#f5f5fa
style CGC fill:#2d1010,stroke:#ef4444,color:#f5f5fa
What This Looks Like in Real Life
Payments Platform (D2/D3 Hybrid)
A 6-person payments team processing ~12,000 transactions/hour across 4 partner banks.
ProcessPayment commands route through Service Bus queues (D3) with strict per-merchant FIFO (D2 sessions).
PaymentCompleted events mirror to Event Hubs for the fraud-detection ML pipeline.
Before the framework: 3 consumer teams each built custom retry logic; incident triage averaged 45 minutes because DLQ formats were inconsistent.
After: platform-managed DLQ with standardized reason codes reduced incident triage from 45 min to 8 min and eliminated 2 duplicate-payment bugs per quarter.
Exactly-once is addressed explicitly — idempotent consumers with MessageId-based dedup guard every write regardless of transport.
Master Data Management (D2)
An MDM hub serving 14 downstream systems with ~800 entity-change events/min (customer, product, pricing). Entity updates require per-key ordering — a price change arriving before the product-create event breaks 3 downstream projections. Before: each team consumed from a shared Kafka topic and built their own ordering reconciliation; 3 separate consumer implementations totalled ~6 FTE-months/quarter in maintenance. After: consolidated into 1 shared Service Bus topic with sessions keyed by entity ID; saved 2 FTE-months/quarter and cut ordering-related incidents from ~5/month to 0. The same events mirror to a Kafka topic for the BI team's Spark-based reporting pipeline and the data-science team's feature store.
Clickstream Analytics (D0)
A 3-person data-engineering team ingesting ~2.4 million clickstream events/hour from web and mobile into a 32-partition Event Hubs namespace. No ordering or delivery guarantees needed — a dropped event means one fewer row in a Parquet aggregate, not a business error. Consumers write directly to a data lake (ADLS Gen2) in 100-event micro-batches at ~50ms/batch. Before: the team tried routing through Service Bus "for safety," which added 120ms p99 latency per event and cost an extra $1,400/mo in Premium messaging units. After: pure D0 stream path cut per-event latency by 58% and saved $16,800/year in infrastructure costs with zero increase in data-quality incidents.
Anti-Patterns This Prevents
- Stream-only by default: Pushes isolation, error handling, and scaling burdens to every consumer. Partition caps limit surge draining. Each team rebuilds DLQ differently.
- Ad-hoc per-team bridges: Everyone rebuilds the same bridge differently. Uneven ops quality. Added lag. Duplicated infrastructure cost.
- "Replay is our reliability story": Replay supports recovery and reprocessing, but does not replace delivery guarantees such as per-subscriber durability, ordering, or managed retries. You can need both replay and managed error handling. They are orthogonal.
Platform Limits to Know
The hard numbers referenced throughout this framework:
| Limit | Value |
| --- | --- |
| Event Hubs Standard partitions | 32 per event hub (immutable after creation) |
| Event Hubs Premium partitions | up to 1,024 |
| Kafka partitions | ~4,000 per broker |
| Service Bus concurrent connections | 1,000 per entity |
Final Thought
Events and messages don't determine your architecture.
Delivery guarantees do.
Further Reading
- Enterprise Integration Patterns — Gregor Hohpe & Bobby Woolf. The canonical text defining Event Message, Command Message, Document Message.
- Event Message Pattern (EIP) — The original pattern establishing events as a type of message.
- The Log: What Every Software Engineer Should Know — Jay Kreps. The foundational essay on log-based architectures.
- Don't Let EDA Buzzwords Fool You — Oskar Dudycz. How EDA terminology gets misused.
- Messaging Anti-Patterns in EDA — Ben Morris. Common mistakes in message design.
- Azure Event Hubs Features — Microsoft Learn. Partition model, consumer groups, throughput.
- Event Hubs Quotas and Limits — Microsoft Learn.
- Service Bus Quotas and Limits — Microsoft Learn.
- How to Choose Kafka Partitions — Confluent.
- Event-Driven Architecture — Confluent. Comprehensive intro from the Kafka company.
- Queue-Based Load Leveling Pattern — Azure Architecture Center.
- Competing Consumers Pattern — Azure Architecture Center.
- Next-Gen Event-Driven Architectures — arXiv. Benchmarking Kafka at 1.2M messages/sec.
- NServiceBus: Messages, Events, and Commands — The clearest industry definition: "A message is the unit of communication. There are two types: commands and events."
- Choose between Azure messaging services — Microsoft's comparison table that reinforces the false dichotomy this framework challenges.