
Events, Messages, and Delivery

The question that actually determines your architecture.

TL;DR: Stop choosing technology by label ("event" vs "message"). Evaluate every flow against six delivery properties:

Isolation — each subscriber gets independent progress tracking
Error Handling — platform-managed retries and dead-letter queues
Load Leveling — elastic scaling without partition ceilings
Ordering — guaranteed FIFO (First In, First Out) per entity key
Replay — re-read history for recovery or analytics
Exactly-Once — no duplicates, no loss

These map to four delivery classes: D0 Stream (replay-only) · D1 Brokered Event (needs reliability) · D2 Brokered + FIFO (needs ordering) · D3 Command (directed intent).

A Catalog Gate checkpoint (4 fields for streams, full gate for broker flows) prevents teams from defaulting to whatever they used last time. Pilot with 3 teams, classify highest-incident flows, then mandate for all new flows.


There's a phrase that quietly spreads through engineering teams:

"Events go to Kafka or Event Hubs.
Messages go to Service Bus or RabbitMQ."

It sounds clean. It sounds architectural. And the technology names reinforce it: Azure Event Hubs — surely events go here. AMQP (Advanced Message Queuing Protocol) — surely messages go here.

But events are messages. The word "message" doesn't mean "command." It means "envelope." The same trap exists everywhere — the industry labels create false categories that don't exist at the design level.

The Problem Nobody Talks About

Most systems don't fail because of bad code.

They fail because of bad assumptions about delivery.

At some point, every system runs into the same reality:

  • Traffic spikes unexpectedly
  • Consumers fall behind
  • Failures require retries
  • Ordering suddenly matters
  • Someone asks, "Can we replay this?"

And that's when the cracks show.

Because the real question was never "Is this an event or a message?"

The real question was always: "What does this need to survive in production?"

Events, Commands, and Messages

A message is the envelope. What's inside determines the meaning:

Message (the envelope)
  → Event = fact: "this already happened" — PaymentCompleted, UserRegistered
  → Command = intent: "please do this" — ProcessPayment, CreateUser

Events are messages. Commands are messages. Both travel through the same infrastructure. The distinction is what's inside the envelope, not which pipe carries it. Production frameworks model it exactly this way; NServiceBus defines it explicitly: "A message is the unit of communication. There are two types of messages: commands and events."

Commands change state. Events describe state changes. Treating them the same breaks ownership, idempotency, and debuggability.
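
To make the envelope concrete, here's a minimal sketch. The Message shape and its field names are illustrative, not any framework's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal
import uuid

# The envelope: transport-level metadata shared by every message.
# "kind" describes what's inside -- not which pipe carries it.
@dataclass
class Message:
    body: dict
    kind: Literal["event", "command"]
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# An event is a fact, named in past tense: it already happened.
payment_completed = Message(
    kind="event", body={"type": "PaymentCompleted", "orderId": "A-1"}
)

# A command is an intent, named imperatively: please do this.
process_payment = Message(
    kind="command", body={"type": "ProcessPayment", "orderId": "A-1"}
)
```

Same envelope, same infrastructure; only the contents differ.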

Where things go wrong is this assumption:

"Events are lightweight, so they should go on streams."

That's not true.

An event can be mission-critical. A command can be low importance. The name doesn't determine the guarantees.

Even Microsoft's official documentation reinforces this confusion. Their messaging services comparison categorizes their own products like this:

Service     | Type               | Use Case
Event Grid  | Event distribution | React to status changes
Event Hubs  | Event streaming    | Telemetry and data streaming
Service Bus | Message            | Order processing and financial transactions

The problem: This table teaches every Azure developer that "events" and "messages" are different categories routed to different services. Service Bus gets the label "Message" as if events aren't messages. An event describing a completed payment is still a message. It might need Service Bus for reliable delivery. But this table says it should go to Event Hubs because it's an "event." That framing leads to architectures that break under real-world pressure.

Six Delivery Properties

What actually matters isn't what you call it — it's what your system needs to guarantee. Reliability lives in the transport, not in the label. Every flow can be classified by which of these six properties it requires:

P1
Per-Subscriber Isolation

Each logical subscriber gets isolated, durable progress tracking. Both Kafka consumer groups and Service Bus subscriptions provide this. The differentiator is how each subscriber scales internally — that's Load Leveling.

P2
Managed Error Handling

Brokers provide retries + DLQ (Dead Letter Queue) natively and consistently. Kafka has platform-level DLQ in Connect and error handlers in Streams — but custom consumers still build their own. The real difference: consistency and discoverability across all consumers.

P3
Load Leveling

Competing consumers can be spun up to drain spikes. Elastic, not bounded by partition count.

P4
Per-Key Ordering (FIFO)

Guaranteed FIFO for an entity key. Preserves invariants. Implies Isolation.
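
A sketch of how per-key FIFO falls out of per-partition FIFO under key-based routing. The hashing scheme here is illustrative (Kafka's default partitioner uses murmur2, for example); the property that matters is stability:

```python
import hashlib

def partition_for(key: str, partition_count: int) -> int:
    """Stable key -> partition mapping: every message for the same
    entity key lands on the same partition, so per-partition FIFO
    becomes per-key FIFO."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

# All updates for merchant-42 hit one partition and stay ordered...
p = partition_for("merchant-42", 10)
assert all(partition_for("merchant-42", 10) == p for _ in range(100))
# ...but ordering ACROSS different keys is not guaranteed.
```

Note the corollary: changing the partition count changes the mapping, which is one reason partition counts are so hard to revisit later.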

P5
Replay

Re-read history for reprocessing, forensics, or recovery. Essential for analytics and audit.

P6
Exactly-Once Semantics (EOS)

Processed exactly once — no duplicates, no loss. Kafka supports EOS natively (idempotent producers + transactional API). Service Bus offers bounded-window deterministic dedup via MessageId — guaranteed within the window, but not true EOS. For financial flows, ensure consumer idempotency regardless of transport.
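
The bounded-window dedup idea can be sketched in a few lines. This is an in-process illustration of the concept, not the Service Bus implementation:

```python
from collections import OrderedDict

class DedupWindow:
    """Bounded-window duplicate detection keyed by MessageId.
    Guaranteed only within the window -- once an id is evicted,
    a late duplicate slips through. True exactly-once still needs
    idempotent writes downstream."""
    def __init__(self, max_entries: int = 10_000):
        self.seen: OrderedDict = OrderedDict()
        self.max_entries = max_entries

    def first_time(self, message_id: str) -> bool:
        if message_id in self.seen:
            return False                      # duplicate: skip processing
        self.seen[message_id] = None
        if len(self.seen) > self.max_entries:
            self.seen.popitem(last=False)     # evict oldest: window is bounded
        return True

window = DedupWindow(max_entries=3)
processed = [m for m in ["a", "b", "a", "c"] if window.first_time(m)]
# "a" is delivered twice but processed once -> ["a", "b", "c"]
```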

Two Worlds: Streams vs Brokers

Streams (Kafka, Event Hubs, Kinesis, Pub/Sub)

Built for scale. Great at high-throughput ingestion, real-time pipelines, analytics, and replay.

But parallelism is bounded by partition count. Within a consumer group, one active consumer per partition. Extra consumers sit idle. If your consumer is slow, you can't just "add more workers" like you can with a broker.
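
Why extra consumers sit idle, in miniature. Assignment is simplified to round-robin here (real assignor strategies differ), but the ceiling is the same:

```python
def assign(partitions: int, consumers: list) -> dict:
    """Round-robin partition assignment within one consumer group --
    a simplification of real assignors, enough to show the ceiling."""
    out = {c: [] for c in consumers}
    for p in range(partitions):
        out[consumers[p % len(consumers)]].append(p)
    return out

# Four partitions, six consumers in one group: two get nothing.
a = assign(4, ["c1", "c2", "c3", "c4", "c5", "c6"])
assert a["c5"] == [] and a["c6"] == []
# Scaling past the partition count adds no throughput at all.
```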

Streams support retry and error handling patterns, but at the application level, not the platform level. Each team builds their own — differently, with different quality.

Streams guarantee delivery to the log. They delegate what happens after.

Brokers (Service Bus, SQS/SNS, RabbitMQ)

Built for control. Per-consumer queues, built-in retries and DLQs, competing consumers for elastic scaling, and optional strict ordering.

Slower per-message than streams, but you can add workers to drain spikes without a partition-based ceiling (Service Bus supports up to 1,000 concurrent connections per entity).

Failures are isolated per subscription. Abandoned subscriptions can fill up and hit topic-level quotas. TTLs, DLQ policies, and depth alerts are still necessary.

When Streams Are the Right Default

This framework is not "always use a broker." There are architectures where streams are architecturally superior:

  • Event sourcing — the append-only log IS the source of truth. Log compaction retains the latest value per key indefinitely.
  • CQRS read-model projection — replay rebuilds projections, ordering preserves consistency
  • High-throughput ingestion — millions of events/sec for clickstream, IoT, metrics
  • Log-based architectures — the log as the central nervous system (Jay Kreps' "The Log")
  • Analytics and data pipelines — data lakes, ML pipelines, real-time dashboards

For organizations with mature Kafka operations, stream-first with selective broker bridging for consumers that need error handling, load leveling, or ordering is an equally valid architecture.

The decision is not "brokers vs streams." It's "where does the complexity live, and can your team operate it?"

The Partition Math Problem

Two throughput models determine whether your consumers keep up:

Sequential Model

Max events/sec = Partitions / Processing Time (seconds)
One event at a time per partition. Use this model when each event requires an independent external call (API, database write, enrichment lookup).
Partitions | Per-Event Time | Max Throughput | Ingest | Status
10         | 100 ms         | 100/sec        | 50/sec | Healthy
10         | 200 ms         | 50/sec         | 50/sec | Tight
10         | 500 ms         | 20/sec         | 50/sec | Falling behind
10         | 2 s            | 5/sec          | 50/sec | 10x behind
32         | 500 ms         | 64/sec         | 50/sec | Healthy
32         | 2 s            | 16/sec         | 50/sec | Still behind

Batched Model

Max events/sec = (Partitions x Batch Size) / Time per Batch
Process multiple events per pass. Use this when events can be processed in bulk (batch inserts, aggregations, analytics writes, ML inference batches). Batch Size = events processed concurrently per pass, not max.poll.records.
Partitions | Batch Processed | Max Throughput | Ingest | Status
10         | 50 in 200 ms    | 2,500/sec      | 50/sec | 50x headroom
10         | 100 in 500 ms   | 2,000/sec      | 50/sec | 40x headroom
10         | 20 in 1 s       | 200/sec        | 50/sec | 4x headroom
32         | 100 in 500 ms   | 6,400/sec      | 50/sec | 128x headroom
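
Both formulas are simple enough to check directly; this sketch reproduces rows from the tables above:

```python
def sequential_max(partitions: int, processing_seconds: float) -> float:
    """Sequential model: one in-flight event per partition, so the
    ceiling is partitions / per-event processing time."""
    return partitions / processing_seconds

def batched_max(partitions: int, batch_size: int, batch_seconds: float) -> float:
    """Batched model: each partition processes batch_size events per
    pass, so the ceiling is (partitions * batch_size) / time per batch."""
    return partitions * batch_size / batch_seconds

# Rows from the tables above:
assert sequential_max(10, 2.0) == 5.0      # 10x behind a 50/sec ingest
assert sequential_max(32, 0.5) == 64.0     # healthy
assert batched_max(10, 50, 0.2) == 2500.0  # 50x headroom
```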

When to use which model

Sequential (per-event)

  • External API calls per event
  • Database writes with per-row logic
  • Enrichment lookups
  • Payment processing
  • Any work that can't be batched

Batched (bulk processing)

  • Batch database inserts
  • Aggregation / windowed analytics
  • ML inference batches
  • Data lake writes (Parquet/Avro)
  • Log/metric forwarding

The partition count remains the hard parallelism ceiling in both models. Batching improves throughput per partition but doesn't remove the limit. With a broker, you can add workers freely regardless of model.

Options when consumers can't keep up:

1
Increase partitions

10 → 32 partitions: a 500ms consumer goes from 20/sec to 64/sec. But: EH Standard caps at 32 (immutable), Premium up to 1,024, Kafka ~4,000/broker. Can't decrease once added.

2
Accept eventual catch-up

If the consumer doesn't need real-time and spikes are temporary, the backlog drains during off-peak. Valid for analytics and reporting. Not valid for payments.

3
Queue-based load leveling

Push to a broker queue in ~200ms instead of processing inline for 2s. Drain with unlimited competing workers (SB supports 1,000 connections). But now you've built a bridge from stream to broker.

The Bad Consumer Problem

The scenario: 3 consumer groups are happy on a 50-partition Event Hub. Consumer D shows up needing 5-second enrichment calls. They ask the producer to switch to Service Bus.

Don't do it. One slow consumer should not force a producer to change its architecture.

What Consumer D wants

Move the producer to Service Bus queues so they can use competing consumers for their slow enrichment calls.

Problem: This hurts Consumers A, B, and C who are built for stream-speed processing.

What Consumer D should do

Receive the event from the stream in milliseconds, push it to their own queue, and process at their own pace with competing workers.

Result: Producer doesn't change. Other consumers don't notice.

The principle: Brokers are not a crutch for bad consumer design. The consumer who can't keep up owns the fix. The Catalog Gate captures consumer processing characteristics so this gets caught early.

The nuance: Optimize for the majority, not the slowest. But if most consumers need broker-grade delivery (managed error handling, elastic scaling, FIFO ordering, or delivery guarantees), the system should be broker-first.

It's not just load leveling. A consumer might bridge to a broker because they need per-key ordering (FIFO) that the stream only provides within a partition. Or because they need platform-managed DLQ with reason codes instead of building their own. Or because they need exactly-once processing guarantees for financial transactions. Each of these is a valid reason to bridge — error handling, load leveling, ordering, exactly-once.

The contradiction resolved: One consumer bridging to their own queue for any of these reasons = fine. Five teams independently building five different bridges for five different delivery needs = anti-pattern. If multiple consumers need broker-grade properties, either centralize the bridge as a platform service, or reconsider whether the producer should be on a broker in the first place.
flowchart LR
P[Producer - 50 events/sec] --> EH[Event Hub - 10 Partitions]
EH --> CG1[Consumer Group A - Payment Service - 100ms avg]
EH --> CG2[Consumer Group B - Analytics - 50ms avg]
EH --> CG3[Consumer Group C - Invoice Enrichment - 2s avg]
CG3 --> Q[Service Bus Queue - enqueue in 200ms]
Q --> W1[Worker 1]
Q --> W2[Worker 2]
Q --> W3[Worker 3]
Q --> WN[Worker N]
style P fill:#1a1000,stroke:#fbbf24,color:#f5f5fa
style EH fill:#0a1a2e,stroke:#22d3ee,color:#f5f5fa
style CG1 fill:#0d1520,stroke:#34d399,color:#f5f5fa
style CG2 fill:#0d1520,stroke:#34d399,color:#f5f5fa
style CG3 fill:#2d1010,stroke:#ef4444,color:#f5f5fa
style Q fill:#1a1040,stroke:#a78bfa,color:#f5f5fa
style WN fill:#0d1520,stroke:#34d399,color:#f5f5fa

Consumer Groups A and B keep up fine. Consumer Group C takes 2 seconds per event and falls behind at 5 events/sec max (10 partitions / 2s). The fix: enqueue to a Service Bus queue in ~200ms, then drain with unlimited competing workers. But you've just built the bridge yourself.
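
The consumer-owned bridge, sketched with in-memory stand-ins: queue.Queue plays the consumer's own Service Bus queue, plain threads play the competing workers. All names are illustrative.

```python
import queue
import threading

work_q = queue.Queue()        # stand-in for the consumer-owned queue
done = []
done_lock = threading.Lock()

def on_stream_event(event: dict) -> None:
    """Stream handler: enqueue in milliseconds instead of doing the
    slow enrichment inline. The producer and the other consumer
    groups never notice."""
    work_q.put(event)

def worker() -> None:
    """Competing consumer: drains at its own pace. Add more workers
    to drain faster -- no partition ceiling applies here."""
    while True:
        event = work_q.get()
        if event is None:                 # shutdown sentinel
            break
        # ...the slow (2-second) enrichment call would go here...
        with done_lock:
            done.append(event["id"])
        work_q.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for i in range(20):                       # 20 events arrive from the stream
    on_stream_event({"id": i})

work_q.join()                             # wait until the backlog is drained
for _ in workers:
    work_q.put(None)
for w in workers:
    w.join()
```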

Where Systems Actually Break

"All events go to Event Hubs."

It works... until:

!
A downstream system can't keep up

Consumer lag grows. No platform-managed backpressure. Partition ceiling blocks scaling.

!
A failure needs retries

No platform DLQ. Each team builds custom retry logic with different quality.

!
Ordering suddenly matters

Cross-partition ordering isn't guaranteed. Teams discover this after the bug ships.

And suddenly the "simple architecture" becomes a distributed collection of inconsistent reliability systems.

The Real Tradeoffs

Neither approach is free. The question is where the complexity lives and what you get in return:

Error handling
  Streams: App-level. Kafka Connect DLQ exists; plain consumers build their own.
  Brokers: Platform DLQ + reason codes, consistent across all consumers.

Scaling
  Streams: Partition-bounded. Batching helps but the ceiling remains.
  Brokers: Competing consumers scale freely (up to 1,000 connections).

Throughput
  Streams: Millions of events/sec. Unmatched for ingestion.
  Brokers: Lower per-message throughput. Not built for the firehose.

Replay
  Streams: Native. Configurable retention + log compaction.
  Brokers: Consume-and-delete. Gone once processed.

Ordering
  Streams: Per-partition (native). Cross-partition requires careful key routing.
  Brokers: Sessions (SB) enforce per-key FIFO with platform guarantees.

Monitoring
  Streams: Consumer lag is the primary signal. Each team instruments differently.
  Brokers: Consistent: DLQ depth, active messages, subscription health.

Cost at scale
  Streams: Lower per-event cost at high volume.
  Brokers: Higher per-message. SB Premium ~$668/mo per messaging unit.

Operational burden
  Streams: Each consumer team owns reliability. Divergent implementations.
  Brokers: Platform enforces consistency, but broker infra has its own operational cost.

The complexity doesn't disappear. With streams, it lives in every consumer. With brokers, it lives in platform infrastructure. The question is which one your team can operate reliably.

A Better Way to Think About It

Stop choosing based on "event vs message." Start choosing based on what your flow actually needs:

Quick Reference

D0 Stream

Only need Replay?

Event Hubs / Kafka / Kinesis / Pub/Sub

D1 Brokered Event

Need Error Handling or Load Leveling?

SB Topic / SNS+SQS / Pub/Sub

D2 Brokered + FIFO

Need Ordering + Error Handling or Scaling?

SB Sessions / SQS FIFO or Kafka partition-key (if error handling/scaling not needed)

D3 Command

Directed intent: "do this"?

SB Queue / SQS / Pub/Sub

Hybrid

Need reliability AND replay?

Broker for delivery, mirror to stream for analytics (or vice versa)

Streams and brokers are both valid defaults depending on your consumers, your team's operational maturity, and your cost constraints. The framework helps you choose — it doesn't prescribe.

Internal Concurrency: The Hidden Broker

There's a common response to the partition math problem: "Just add concurrency inside the consumer."

The consumer receives events, pushes them into an internal queue, and a worker pool processes them in parallel. It works. But you're now responsible for internal queuing, concurrency control, offset coordination (can't commit until all in-flight messages complete), retry paths, and ordering preservation.

If you are adding internal queues, worker pools, retry logic, and offset coordination inside your consumer, you are no longer just consuming a stream — you are rebuilding capabilities that brokers provide natively.

Kafka Streams and Apache Flink provide framework-level solutions for internal concurrency and state management within the stream ecosystem — but these are stream processing frameworks with their own complexity, not the raw consumer model.

This is not wrong. It's a valid pattern. But it should be a conscious decision.

Use internal concurrency when:

  • Ordering does NOT matter
  • Processing is parallelizable
  • High throughput needed, can't add partitions
  • Analytics, enrichment, fan-out workloads

Avoid it when:

  • Per-key ordering matters (payments, entity state)
  • Financial correctness is required
  • Retries must be consistent and observable
  • You need platform-level DLQ and monitoring
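
The trickiest responsibility in that list is offset coordination: with a worker pool, offsets complete out of order, and you may only commit the highest contiguous one. A minimal sketch of that bookkeeping:

```python
import heapq

class OffsetTracker:
    """Commit only the highest contiguous offset. With a worker pool,
    offset n+5 may finish before n -- but committing past n would lose
    n if the consumer crashes before it completes."""
    def __init__(self, start: int = 0):
        self.committed = start - 1     # highest safely committable offset
        self._done = []                # min-heap of completed offsets

    def complete(self, offset: int) -> None:
        heapq.heappush(self._done, offset)
        # Advance the commit point while there are no gaps.
        while self._done and self._done[0] == self.committed + 1:
            self.committed = heapq.heappop(self._done)

t = OffsetTracker()
for off in [2, 0, 3]:
    t.complete(off)
assert t.committed == 0   # 1 still in flight; can't commit 2 or 3 yet
t.complete(1)
assert t.committed == 3   # gap closed; safe to commit through 3
```

Brokers with per-message acknowledgement make this bookkeeping unnecessary; with a raw stream consumer, every team writes some version of it.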

You Don't Eliminate Complexity — You Relocate It

Stream-based systems push complexity into consumers: internal queues, retry logic, concurrency control, offset management. Broker-based systems push complexity into platform infrastructure: queue configuration, DLQ policies, subscription management.

The architectural decision is not about avoiding complexity. It's about where that complexity should live.

With streams, every team builds their own reliability model. With brokers, the platform enforces one. Neither is free. But one is consistent.

The D0-D3 Taxonomy

Detailed Classification

These delivery properties map to four practical classes. The D0-D3 classification is a decision framework, not a strict industry standard. Equivalent patterns can be implemented using different technologies depending on platform and team maturity.

D0 — Stream
  When it fits: Replay (P5) is primary. P1 (subscriber isolation) is native via consumer groups. Per-partition ordering available. Does not provide consistent platform-managed error handling (P2) or elastic load leveling beyond partition count (P3).
  Tech: Event Hubs / Kafka / Kinesis / Pub/Sub
  Properties: P5 Replay

D1 — Brokered Event
  When it fits: Needs managed error handling with DLQ (P2) or elastic load leveling (P3). P1 alone doesn't require a broker. Test: if dropping 0.1% during a spike = no incident, stay on D0. If a single drop = missed payment, use D1. For financial/regulatory flows, classify by consequence of failure, not frequency.
  Tech: SB Topic / SNS+SQS / Pub/Sub
  Properties: P1 Durability / P2 Error Handling / P3 Load Leveling

D2 — Brokered + FIFO
  When it fits: Needs guaranteed per-key ordering (P4), which implies per-subscriber durability (P1).
  Tech: SB Sessions / SQS FIFO
  Properties: P4 FIFO Ordering (+P1 Durability)

D3 — Command / Request
  When it fits: Instructional intent. Needs durability (P1), managed errors (P2), and load leveling (P3) by default.
  Tech: SB Queue / SQS / Pub/Sub
  Properties: P1 Durability / P2 Error Handling / P3 Load Leveling

Decision Tree

Step 0: Does this need to be async at all? If the producer can wait and the operation completes in <500ms, consider a synchronous call. Not everything needs a message.

flowchart TD
START["🔵 New Message"] --> Q1{"Fact or Intent?"}
Q1 -->|"FACT (Event)"| GATE["📋 Catalog Gate"]
Q1 -->|"INTENT (Command)"| D3["🟣 D3: Command\nSB Queue / Topic"]
GATE --> Q2{"Ordering\nneeded?"}
Q2 -->|Yes| Q3{"Also need error\nhandling or scaling?"}
Q2 -->|No| Q4{"Error handling\nor scaling needed?"}
Q3 -->|Yes| D2["🟡 D2: Brokered FIFO\nSB Sessions"]
Q3 -->|No| KFK["🔵 Kafka\npartition-key routing"]
Q4 -->|Yes| D1["🟣 D1: Brokered Event\nSB Subscriptions"]
Q4 -->|No| Q5{"Only Replay\nneeded?"}
Q5 -->|Yes| D0["🔵 D0: Stream\nEvent Hubs / Kafka"]
Q5 -->|No| RE["🔴 Revisit\nRequirements"]
D1 --> MIR{"Also need\nReplay?"}
D2 --> MIR
MIR -->|Yes| MIRROR["🟢 Mirror\nSB → EH"]
MIR -->|No| DONE["✅ Done"]

classDef start fill:#1a1a2e,stroke:#6e6e8a,color:#f5f5fa
classDef gate fill:#1a1a2e,stroke:#6e6e8a,color:#a8a8c0
classDef question fill:#1a1040,stroke:#a78bfa,color:#f5f5fa
classDef resultPink fill:#2d1040,stroke:#f472b6,color:#f5f5fa
classDef resultAmber fill:#1a1000,stroke:#fbbf24,color:#f5f5fa
classDef resultCyan fill:#0a1a2e,stroke:#22d3ee,color:#f5f5fa
classDef resultGreen fill:#0d1520,stroke:#34d399,color:#f5f5fa
classDef resultRed fill:#2d1010,stroke:#ef4444,color:#f5f5fa
classDef done fill:#1a1a2e,stroke:#6e6e8a,color:#a8a8c0

class START start
class GATE gate
class Q1,Q2,Q3,Q4,Q5,MIR question
class D3 resultPink
class D2 resultAmber
class D1 resultPink
class D0,KFK resultCyan
class MIRROR resultGreen
class RE resultRed
class DONE done

For stream-first organizations, invert: start at D0 and bridge to brokers where error handling or scaling is needed.

Or step through it one question at a time:

New Message

What type of content does it carry?

Catalog Gate

Record delivery requirements. Does this flow need FIFO ordering per entity key?

Ordering needed

Do you also need managed error handling or elastic load leveling?

No ordering needed

Does this flow need managed error handling or elastic load leveling?

Only Replay needed?

D3 — Command

Service Bus Queue or Topic. Directed intent with guaranteed handling.

D2 — Brokered + FIFO

Service Bus Topic + Sessions. Per-key ordering with managed error handling.

Also need replay? → Mirror SB to Event Hubs for analytics.

Kafka — Partition-Key Routing

Ordering + Replay natively in one system. No mirroring needed. Ideal for event sourcing and CQRS.

D1 — Brokered Event

Service Bus Topic + Subscriptions. Managed DLQ, elastic scaling, subscriber isolation.

Also need replay? → Mirror SB to Event Hubs for analytics.

D0 — Stream

Event Hubs or Kafka. Replay-first. Consumer group isolation native. Best for analytics, telemetry, event sourcing.

Revisit Requirements

No delivery properties identified. Does this need to be a message at all? Consider a synchronous call.


Consumer uplift rule: When consumers have divergent requirements, the flow's classification is determined by the most demanding consumer — unless that consumer is a minority that should bridge independently. Heuristic: ≤20% need uplift → minority bridges. ≥50% need error handling, load leveling, or ordering → flow is broker-first.

Ordering + Replay without mirroring: If the consumer that needs ordering is also the consumer that needs replay (event sourcing, CQRS), Kafka with partition-key routing provides both natively — no mirroring required.

A note on ambiguous messages: Not every message is cleanly a fact or a command. PaymentFailed, AccountSuspended, and InvoiceOverdue are all facts — but many subscribers treat them as implicit triggers. This is normal. Classify by the publisher's intent, not the subscriber's reaction. If the publisher broadcasts a fact without caring who acts on it, it's an event. If it directs a specific handler to do a specific thing, it's a command. When genuinely ambiguous, resolve it during the Catalog Gate.
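
The uplift heuristic as a tiny function. The function name and the gray-zone label are mine; the 20%/50% thresholds come from the rule above:

```python
def flow_placement(total_consumers: int, needing_uplift: int) -> str:
    """Classify a flow by the share of consumers needing broker-grade
    delivery (error handling, load leveling, or ordering)."""
    share = needing_uplift / total_consumers
    if share <= 0.20:
        return "minority-bridges"  # each uplift consumer bridges to its own queue
    if share >= 0.50:
        return "broker-first"      # the flow itself moves to a broker
    return "judgment-call"         # gray zone: decide at the Catalog Gate

assert flow_placement(10, 2) == "minority-bridges"
assert flow_placement(10, 6) == "broker-first"
```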

The Catalog Gate

A lightweight governance checkpoint recorded before first publish. It forces teams to make explicit, defensible decisions rather than defaulting to whatever the last project used.

What the Catalog Gate records: P1-P6 requirements, data owner and on-call team, known consumers, SLA tier (predefined, not open-ended), ordering key (if FIFO needed), expected volume, replay needs (Y/N), failure posture, the existing topic/queue this replaces, team familiarity with the platform, existing monitoring/alerting coverage, estimated monthly infrastructure cost delta, and schema registry or contract version strategy (if mirroring). This prevents "habit picks" and "surprise — consumers can't keep up."

D0 fast path: For D0 flows, require only 4 fields: P1-P6 assessment (confirm No except P5), data owner, volume, on-call team. Full gate applies to D1-D3.
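
A possible shape for the D0 fast-path record. Field names and the validation rule are hypothetical; adapt to your catalog tooling:

```python
from dataclasses import dataclass

@dataclass
class D0GateRecord:
    """The four required fast-path fields for a D0 flow."""
    properties: dict             # P1-P6 assessment: all False except P5 (replay)
    data_owner: str
    expected_volume_per_sec: int
    on_call_team: str

    def valid(self) -> bool:
        # Anything beyond replay means the flow isn't D0: full gate applies.
        required = {k for k, v in self.properties.items() if v}
        return required == {"P5"}

rec = D0GateRecord(
    properties={"P1": False, "P2": False, "P3": False,
                "P4": False, "P5": True, "P6": False},
    data_owner="data-eng",
    expected_volume_per_sec=700,
    on_call_team="data-eng-oncall",
)
assert rec.valid()
```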

Exactly-Once note: For any flow with financial impact, exactly-once semantics must be addressed explicitly. Neither Kafka EOS nor Service Bus dedup eliminates the need for idempotent consumer logic when writing to external systems.

Re-evaluate when new consumers onboard, when incident patterns change, or annually. Consumer populations shift.

Brownfield? Most teams have existing infrastructure. If you have 200 Kafka topics and identify 40 that should be on brokers by P1-P4 analysis: audit existing flows, classify with P1-P6, prioritize by incident history and operational pain, migrate highest-pain topics first, deprecate custom retry infrastructure as topics move. Run the Catalog Gate retroactively during the audit.

Adoption: Phase 1: Pilot with 3 teams on new flows. Phase 2: Classify top 10-20 highest-incident flows, migrate top 3-5. Phase 3: Mandate for all new flows. Requires a platform team of 2-4 engineers.


Hybrid Patterns: Combining Streams and Brokers

Most production systems need both. The question is which direction the data flows.

Pattern A: Broker-First with Stream Mirroring

Producer publishes to a broker. A platform-managed bridge mirrors to a stream for analytics and replay.

flowchart LR
    PROD["Producer"] --> SB["Service Bus\n(Authoritative)"]
    SB --> C1["Consumer A\nError Handling + Scaling"]
    SB --> C2["Consumer B\nFIFO Ordering"]
    SB --> BRIDGE["Platform Bridge\n(Managed)"]
    BRIDGE --> EH["Event Hubs\n(Replay + Analytics)"]
    EH --> AN["Analytics"]
    EH --> DL["Data Lake"]
    style SB fill:#1a1040,stroke:#a78bfa,color:#f5f5fa
    style EH fill:#0a1a2e,stroke:#22d3ee,color:#f5f5fa
    style BRIDGE fill:#0d1520,stroke:#34d399,color:#f5f5fa

Pattern B: Stream-First with Broker Bridging

Producer publishes to a stream. Consumers that need broker-grade delivery bridge to their own queues.

flowchart LR
    PROD["Producer"] --> EH["Event Hubs / Kafka\n(Authoritative)"]
    EH --> CGA["Consumer A\nAnalytics - Direct"]
    EH --> CGB["Consumer B\nML Pipeline - Direct"]
    EH --> CGC["Consumer C\nNeeds Reliability"]
    CGC --> Q["Own SB Queue"]
    Q --> W1["Worker 1"]
    Q --> W2["Worker N"]
    style EH fill:#0a1a2e,stroke:#22d3ee,color:#f5f5fa
    style Q fill:#1a1040,stroke:#a78bfa,color:#f5f5fa
    style CGA fill:#0d1520,stroke:#34d399,color:#f5f5fa
    style CGB fill:#0d1520,stroke:#34d399,color:#f5f5fa
    style CGC fill:#2d1010,stroke:#ef4444,color:#f5f5fa

Tradeoffs by direction

Who owns the bridge
  Broker → Stream: One platform team. Centralized.
  Stream → Broker: Each consumer team that needs it. Decentralized.

Ordering
  Broker → Stream: Broker preserves ordering. Stream mirror may reorder across partitions.
  Stream → Broker: Stream preserves per-partition order. Broker bridge maintains it if single consumer.

Replay
  Broker → Stream: Stream side has full replay. Broker side is consume-and-delete.
  Stream → Broker: Stream is authoritative with native replay. Broker side is consume-and-delete.

Throughput
  Broker → Stream: Broker is the bottleneck. Lower ingestion ceiling.
  Stream → Broker: Stream handles high throughput natively. Broker bridge only processes what it needs.

Failure impact
  Broker → Stream: Bridge down = stream goes stale. Broker consumers unaffected.
  Stream → Broker: Bridge down = reliability consumers lose their queue feed. Stream consumers unaffected.

Best for
  Broker → Stream: Majority of consumers need reliability. Analytics is secondary.
  Stream → Broker: Majority of consumers are high-throughput analytics. Few need reliability.

Both directions require infrastructure (a Function, Logic App, or Event Hubs Capture). The difference is who owns the bridge, how many times it gets built, and whether ordering survives the crossing. If multiple consumers independently build their own bridges, you've hit the "consumer-managed brokering" anti-pattern.

What This Looks Like in Real Life

Payments Platform (D2/D3 Hybrid)

A 6-person payments team processing ~12,000 transactions/hour across 4 partner banks. ProcessPayment commands route through Service Bus queues (D3) with strict per-merchant FIFO (D2 sessions). PaymentCompleted events mirror to Event Hubs for the fraud-detection ML pipeline. Before the framework: 3 consumer teams each built custom retry logic; incident triage averaged 45 minutes because DLQ formats were inconsistent. After: platform-managed DLQ with standardized reason codes reduced incident triage from 45 min to 8 min and eliminated 2 duplicate-payment bugs per quarter. Exactly-once is addressed explicitly — idempotent consumers with MessageId-based dedup guard every write regardless of transport.

Master Data Management (D2)

An MDM hub serving 14 downstream systems with ~800 entity-change events/min (customer, product, pricing). Entity updates require per-key ordering — a price change arriving before the product-create event breaks 3 downstream projections. Before: each team consumed from a shared Kafka topic and built their own ordering reconciliation; 3 separate consumer implementations totalled ~6 FTE-months/quarter in maintenance. After: consolidated into 1 shared Service Bus topic with sessions keyed by entity ID; saved 2 FTE-months/quarter and cut ordering-related incidents from ~5/month to 0. The same events mirror to a Kafka topic for the BI team's Spark-based reporting pipeline and the data-science team's feature store.

Clickstream Analytics (D0)

A 3-person data-engineering team ingesting ~2.4 million clickstream events/hour from web and mobile into a 32-partition Event Hubs namespace. No ordering or delivery guarantees needed — a dropped event means one fewer row in a Parquet aggregate, not a business error. Consumers write directly to a data lake (ADLS Gen2) in 100-event micro-batches at ~50ms/batch. Before: the team tried routing through Service Bus "for safety," which added 120ms p99 latency per event and cost an extra $1,400/mo in Premium messaging units. After: pure D0 stream path cut per-event latency by 58% and saved $16,800/year in infrastructure costs with zero increase in data-quality incidents.

Anti-Patterns This Prevents

"All cross-domain events go to Event Hubs"

Pushes isolation, error handling, and scaling burdens to every consumer. Partition caps limit surge draining. Each team rebuilds DLQ differently.

"Consumer-managed brokering" (EH to each team's SB)

Everyone rebuilds the same bridge differently. Uneven ops quality. Added lag. Duplicated infrastructure cost.

"Replay replaces reliability"

Replay supports recovery and reprocessing, but does not replace delivery guarantees such as per-subscriber durability, ordering, or managed retries. You can need both replay and managed error handling. They are orthogonal.

Platform Limits to Know

Platform              | Parallelism Limit                          | Notes
Event Hubs Standard   | 32 partitions max (immutable)              | Cannot change after creation
Event Hubs Premium    | Up to 1,024 partitions                     | Can increase, never decrease. 200 per PU namespace limit.
Apache Kafka          | ~4,000 per broker (soft), 200K per cluster | No hard limit, but performance degrades. 10 partitions per topic is a safe default.
AWS Kinesis           | 500 shards per stream (soft limit)         | Each shard: 1 MB/s in, 2 MB/s out. Resharding is possible but operationally heavy.
GCP Pub/Sub           | No partition ceiling                       | Scales automatically. Throughput limited by publisher/subscriber quotas, not partitions.
Service Bus Queue/Sub | 1,000 concurrent connections per entity    | No partition ceiling; scale by adding competing consumers.
AWS SQS               | Unlimited consumers                        | Standard: nearly unlimited throughput. FIFO: 300 msg/s (3,000 with batching).

Consumer Configuration Reference

Runtime                      | Default Concurrency         | How to Change
Azure Functions (SB trigger) | 16 x CPU cores per instance | Set maxConcurrentCalls in host.json
Azure Functions (EH trigger) | 1 per partition             | Set maxEventBatchSize and prefetchCount in host.json

Final Thought

Events and messages don't determine your architecture.

Delivery guarantees do.

Further Reading