Message replay and poison queue handling for Dynamics 365 integrations

How to design Dynamics 365 integrations that survive transient failures and handle poison messages — dead-letter queues, replay tooling, idempotency, and the operational rhythms.

Updated 2026-08-13

Every integration eventually fails: the receiver is down, a payload is malformed, a downstream system is busy. A robust Dynamics 365 integration architecture handles these failures gracefully: failures are isolated, messages are retried, poison messages are quarantined for human review, and replay machinery restores sync afterwards. Without it, transient failures cascade into data divergence and operational scrambling.

Failure modes.

Transient — temporary network issue, brief downstream slowness, momentary rate limit. Retry succeeds.
Persistent — receiver down for hours, schema mismatch, authentication expired. Retry continues to fail.
Poison — message data malformed in a way that no retry will help.

Each requires a different response.

Retry strategies.

Fixed delay retry — fixed N seconds between attempts.
Exponential backoff — delay doubles each attempt.
Exponential with jitter — backoff with random variance to prevent thundering herd.
Bounded retry count — give up after N attempts.

Exponential with jitter is the modern default; bounded count prevents infinite loops.
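
As a rough illustration of exponential backoff with jitter and a bounded retry count, here is a minimal Python sketch. The names `TransientError`, `backoff_delay`, and `call_with_retry`, and the default base, cap, and attempt limit, are assumptions made for this example; they do not come from any Dynamics 365 or Azure SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap, base * (2 ** attempt))  # doubles each attempt, capped
    return rng() * ceiling                     # random value in [0, ceiling)


def call_with_retry(operation, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Run `operation`; retry on TransientError up to `max_attempts` total tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # bounded: give up so the caller can dead-letter the message
            sleep(backoff_delay(attempt, rng=rng))
```

Passing `sleep` and `rng` as parameters keeps the function testable without real waiting. Any exception other than `TransientError` propagates immediately, which is usually what you want for poison messages.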

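Dead-letter queues (DLQ). When retries are exhausted, the message moves to a dead-letter queue:

Azure Service Bus — dead-letter subqueue per queue and per topic subscription.
Azure Event Grid — dead-letter to a blob storage container, when configured on the event subscription.
Azure Storage Queue — no built-in DLQ; by convention a separate "poison" queue, which Azure Functions queue triggers populate automatically.
Webhook — no dead-letter; once the sender stops retrying, the message is lost.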

For critical integrations, never use webhooks alone — the lack of DLQ is operationally unacceptable.
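
The mechanics of "retries are exhausted, so the message is dead-lettered" can be sketched in memory. This is an illustrative simulation, not broker code: the dictionary fields (`body`, `delivery_count`, `dead_letter_reason`) and the limit of 3 are assumptions chosen for the example.

```python
from collections import deque

MAX_DELIVERY_COUNT = 3  # assumed limit for this sketch; Service Bus queues default to 10


def drain(queue, handler, dead_letters, max_delivery_count=MAX_DELIVERY_COUNT):
    """Process every message; park a message once its delivery count is exhausted."""
    while queue:
        message = queue.popleft()
        message["delivery_count"] = message.get("delivery_count", 0) + 1
        try:
            handler(message["body"])
        except Exception as exc:
            if message["delivery_count"] >= max_delivery_count:
                # Keep the reason with the message so triage does not start blind.
                message["dead_letter_reason"] = f"{type(exc).__name__}: {exc}"
                dead_letters.append(message)
            else:
                queue.append(message)  # redeliver later
```

A real broker would also wait between deliveries. What matters here is that the failed message leaves the main queue, keeps its failure reason, and no longer blocks the messages behind it.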

DLQ inspection.

Service Bus Explorer — see messages in DLQ, inspect content.
Azure CLI / PowerShell — programmatic access.
Custom DLQ dashboard — Power BI report over DLQ metrics.

Without inspection, messages pile up unseen.

Resubmission from DLQ.

Manual — operator inspects, fixes issue, resubmits message to main queue.
Automated — DLQ processor with logic to retry messages whose conditions have cleared.
Bulk resubmit — after an outage, resubmit all DLQ messages.

The choice depends on volume and on how much judgment each message needs. For small volumes, manual resubmission is fine; for high volumes, automation is needed, with safeguards.
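
One possible shape for an automated DLQ processor with safeguards is sketched below. The `condition_cleared` callback, the `resubmit_count` field, and the cap of two resubmissions are hypothetical choices for illustration only.

```python
def resubmit(dead_letters, main_queue, condition_cleared, max_resubmits=2):
    """Move eligible messages back to the main queue; return those left behind."""
    remaining = []
    for message in dead_letters:
        attempts = message.get("resubmit_count", 0)
        if attempts < max_resubmits and condition_cleared(message):
            # Fresh copy: reset the delivery count, keep a resubmit audit trail.
            main_queue.append(
                {**message, "delivery_count": 0, "resubmit_count": attempts + 1}
            )
        else:
            remaining.append(message)  # stays in the DLQ for manual review
    return remaining
```

The cap prevents a message from cycling between the main queue and the DLQ indefinitely, and anything the predicate does not recognise stays put for an operator to look at.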

Idempotency. A foundational requirement. Receivers must handle the same message multiple times:

Idempotency key — unique per business event.
Deduplication at receiver — check if already processed; skip if so.
State-based logic — set state to X, don't increment counter twice.

Without idempotency, retries cause data corruption. Build idempotency into every receiver from day one.
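
A minimal sketch of a receiver that combines deduplication with state-based logic might look like this. `InvoiceReceiver` and its message fields are invented for the example, and the in-memory set stands in for what would need to be a durable store in practice.

```python
class InvoiceReceiver:
    """Hypothetical receiver: safe to call with the same message more than once."""

    def __init__(self):
        self.processed_keys = set()  # stand-in for a durable deduplication store
        self.status = {}             # invoice_id -> current status

    def handle(self, message):
        """Apply the message once; return False if it was already processed."""
        key = message["idempotency_key"]
        if key in self.processed_keys:
            return False  # duplicate delivery: skip
        # State-based: set the status to a value rather than incrementing anything,
        # so even a missed duplicate check would not corrupt the record.
        self.status[message["invoice_id"]] = message["new_status"]
        self.processed_keys.add(key)
        return True
```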

Idempotency keys.

Business event ID — order ID, invoice number.
Source-system primary key — record ID from external system.
Composite key — combination of source ID and operation.

The key should be deterministic — same business event always produces same key, regardless of how many times the message is sent.
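
One way to derive a deterministic composite key is to hash the normalised parts, as in the sketch below. The function name, the choice of SHA-256, and the normalisation rules (trimming, lower-casing the system and operation names) are assumptions for illustration; the right rules depend on the source system.

```python
import hashlib


def idempotency_key(source_system, record_id, operation):
    """Same business event in, same key out, however often the message is sent."""
    parts = [
        source_system.strip().lower(),
        str(record_id).strip(),
        operation.strip().lower(),
    ]
    # The unit separator stops ("ab", "c") from colliding with ("a", "bc").
    raw = "\x1f".join(parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Nothing time-based or random goes into the key. A timestamp or a fresh GUID generated at send time would give every retry a different key and defeat deduplication.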

Message correlation. For complex flows:

Correlation ID — identifies a multi-message conversation.
Sequence number — orders messages within a conversation.
Saga ID — identifies a longer-running process.

When something goes wrong, correlation lets you reconstruct the conversation.
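
As a sketch of that reconstruction, the following groups log entries by correlation ID, orders them by sequence number, and reports gaps. The field names and the assumption that sequences start at 1 are illustrative.

```python
from collections import defaultdict


def conversations(log_entries):
    """Group entries by correlation ID and order each group by sequence number."""
    grouped = defaultdict(list)
    for entry in log_entries:
        grouped[entry["correlation_id"]].append(entry)
    return {
        cid: sorted(entries, key=lambda e: e["sequence"])
        for cid, entries in grouped.items()
    }


def missing_sequences(entries):
    """Sequence numbers absent from a conversation (assumes numbering starts at 1)."""
    seen = {e["sequence"] for e in entries}
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

A gap in the sequence is often the quickest pointer to the message that was lost or dead-lettered.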

Outbox pattern. A specific reliability pattern:

Application writes the business change AND an outbox row in the same local database transaction.
Separate process reads outbox and publishes to messaging.
Even if messaging is down, outbox holds the messages until published.

For Dynamics 365 plug-ins emitting external events, an outbox table in Dataverse + a worker reading it is a reliable pattern.
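
The outbox steps can be sketched with SQLite standing in for the real store. The table names, the columns, and the `publish` callback are all assumptions for this example; in Dataverse the outbox would be a custom table and the worker a separate process.

```python
import json
import sqlite3


def create_schema(db):
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    db.execute(
        "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT, published INTEGER DEFAULT 0)"
    )


def place_order(db, order_id):
    """Write the business row and the outbox row in one transaction."""
    event = {"event": "order_placed", "order_id": order_id}
    with db:  # both inserts commit together, or neither does
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def publish_pending(db, publish):
    """Worker step: publish unpublished rows in order; return how many were sent."""
    rows = db.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    sent = 0
    for seq, payload in rows:
        try:
            publish(json.loads(payload))
        except ConnectionError:
            break  # messaging is down: rows stay pending until the next run
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
        sent += 1
    return sent
```

If the worker crashes after publishing but before marking the row, the event is published again on the next run. The outbox therefore gives at-least-once delivery, which is why the receiver still needs to be idempotent.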

Inbox pattern. The receiver's mirror:

Receiver writes message to local inbox table on receipt.
Processing logic checks inbox before acting.
Idempotency via inbox deduplication.

Outbox + inbox = at-least-once delivery with deduplication = exactly-once business outcome.
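
A minimal inbox sketch, again with SQLite as a stand-in, shows how a deliberately non-idempotent operation (adding to a balance) becomes safe under redelivery. The table layout and message fields are hypothetical.

```python
import sqlite3


def create_inbox(db):
    db.execute(
        "CREATE TABLE inbox (message_id TEXT PRIMARY KEY,"
        " received_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")


def receive(db, message):
    """Record the message and apply its effect atomically; skip duplicates."""
    try:
        with db:  # inbox row and business update commit or roll back together
            db.execute("INSERT INTO inbox (message_id) VALUES (?)", (message["id"],))
            db.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (message["amount"], message["account"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary key clash: this message was already processed
```

Putting the inbox insert and the business update in one transaction is the point of the design: a crash between the two cannot leave a message recorded but not applied, or applied but not recorded.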

Schema versioning. Messages have schemas that change:

Backward compatible — add fields; old consumers ignore them.
Forward compatible — handle missing fields gracefully.
Breaking change — version the message type.

Most integrations should aim for backward compatibility; breaking changes need coordinated deployment.
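
A tolerant consumer along these lines might be sketched as follows. The `schema_version` field, the contact fields, and the version numbers are invented for illustration.

```python
SUPPORTED_VERSIONS = (1, 2)  # assumed versions for this sketch


def parse_contact(message):
    """Read only the fields we know; ignore extras; default what is missing."""
    version = message.get("schema_version", 1)
    if version not in SUPPORTED_VERSIONS:
        # A breaking change arrives as a new version and is rejected explicitly
        # instead of being half-understood.
        raise ValueError(f"unsupported schema_version: {version}")
    return {
        "contact_id": message["contact_id"],  # required in every version
        "email": message.get("email"),        # optional: may be absent
        # Hypothetical field added in version 2; older senders omit it.
        "preferred_language": message.get("preferred_language", "en"),
    }
```

Unknown fields are never an error here, so a sender can add fields without coordinating a deployment with every consumer.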

Poison message identification. Distinguishing poison from transient is hard:

Persistent failure across retries with same error — likely poison.
Same error across multiple messages — likely systemic issue, not poison.
Manual review — operator looks at the message, identifies the problem.

A DLQ inspection workflow should include a triage step that sorts messages into these categories.
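
The heuristics above could be turned into a first-pass triage like the sketch below. The field names (`error` for the final error, `attempt_errors` for the error seen on each attempt) and the threshold of three messages are assumptions; the output is a suggestion for an operator, not a verdict.

```python
from collections import Counter


def triage(dead_letters, systemic_threshold=3):
    """Suggest a category per message ID: systemic, likely_poison, manual_review."""
    error_counts = Counter(m["error"] for m in dead_letters)
    result = {}
    for m in dead_letters:
        attempts = m["attempt_errors"]
        if error_counts[m["error"]] >= systemic_threshold:
            # Many messages share this error: suspect the system, not the message.
            result[m["id"]] = "systemic"
        elif len(attempts) > 1 and len(set(attempts)) == 1:
            # This message alone failed the same way on every attempt.
            result[m["id"]] = "likely_poison"
        else:
            result[m["id"]] = "manual_review"
    return result
```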

Replay tooling.

Replay from DLQ — resubmit individual messages.
Replay from event store — for Event Hubs or storage-backed events, replay from a timestamp.
Source system replay — re-extract and re-emit from the source.

For long-running outages, replaying from source is sometimes the only practical recovery.
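
Replaying from a timestamp can be sketched against an in-memory event log. The `enqueued_at` field, the `handler` callback, and the five-minute safety margin are illustrative assumptions; a real event store would be queried by offset or enqueue time through its own API.

```python
from datetime import timedelta


def replay_since(event_log, outage_start, handler, safety_margin=timedelta(minutes=5)):
    """Re-deliver, in order, every stored event enqueued at or after the cut-off."""
    cutoff = outage_start - safety_margin  # overlap slightly; receivers deduplicate
    replayed = 0
    for event in sorted(event_log, key=lambda e: e["enqueued_at"]):
        if event["enqueued_at"] >= cutoff:
            handler(event)
            replayed += 1
    return replayed
```

Starting a little before the known outage start is deliberate. It re-sends some events that were already processed, which is harmless only if the receiver is idempotent.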

Monitoring and alerting.

DLQ depth and age — alert when the DLQ grows past a threshold or holds messages older than N hours.
Retry rate — high retry rate signals systemic issue.
End-to-end latency — track from source emission to destination processing.
Failed plug-in executions — Dataverse async job failures.

Without monitoring, integration health is invisible until something breaks user-visibly.
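
A DLQ check of this kind might be sketched as below. The `dead_lettered_at` field and the default thresholds (4 hours, 100 messages) are placeholders to be tuned per integration, and the returned strings would feed whatever alerting channel is already in use.

```python
from datetime import timedelta


def dlq_alerts(dead_letters, now, max_age_hours=4, max_depth=100):
    """Return alert texts for a DLQ that is too deep or holds stale messages."""
    alerts = []
    if len(dead_letters) > max_depth:
        alerts.append(f"DLQ depth {len(dead_letters)} exceeds {max_depth}")
    max_age = timedelta(hours=max_age_hours)
    stale = [m for m in dead_letters if now - m["dead_lettered_at"] > max_age]
    if stale:
        oldest = min(m["dead_lettered_at"] for m in stale)
        alerts.append(
            f"{len(stale)} message(s) older than {max_age_hours}h; "
            f"oldest dead-lettered at {oldest.isoformat()}"
        )
    return alerts
```

Alerting on age as well as depth catches the quiet case where a handful of messages sit unreviewed for days without ever tripping a depth threshold.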

Common pitfalls.

No idempotency. Retries cause duplicates; data corruption.
No DLQ. Failed messages disappear; integration silently incomplete.
DLQ ignored. Messages pile up; no one reviews; eventual cleanup loses business data.
Schema drift. Sender updates schema; receivers break; cascade of failures.
No replay capability. Outage recovery requires manual data backfill.
Unbounded or over-generous retry counts. Messages retry indefinitely; resources leak.

Operational rhythm.

Daily — DLQ depth check.
Weekly — DLQ message triage; resubmit or discard.
Monthly — review monitoring alerts and patterns; improve resilience.
After incidents — postmortem; improvements deployed.

Strategic positioning. Integration reliability is an engineering discipline, not a feature. The architectural patterns (at-least-once delivery, idempotency, DLQ, replay) apply across messaging platforms and integration scenarios. Get them right once and reuse them across integrations. Teams that invest in reliable integration architecture have boring integration days; teams that don't have constant fire drills. The investment is small relative to the ongoing operational cost of unreliable integrations, and the savings compound.
