Message replay and poison queue handling for Dynamics 365 integrations
How to design Dynamics 365 integrations that survive transient failures and handle poison messages — dead-letter queues, replay tooling, idempotency, and the operational rhythms.
Every integration eventually fails: the receiver is down, the payload is malformed, or the downstream system is busy. A robust Dynamics 365 integration architecture handles these failures gracefully: failures are isolated, messages are retried, poison messages are quarantined for human review, and replay machinery restores sync. Without it, transient failures cascade into data divergence and operational scrambling.
Failure modes.
- Transient — temporary network issue, brief downstream slowness, momentary rate limit. Retry succeeds.
- Persistent — receiver down for hours, schema mismatch, authentication expired. Retry continues to fail.
- Poison — message data malformed in a way that no retry will help.
Each requires a different response.
Retry strategies.
- Fixed delay retry — fixed N seconds between attempts.
- Exponential backoff — delay doubles each attempt.
- Exponential with jitter — backoff with random variance to prevent thundering herd.
- Bounded retry count — give up after N attempts.
Exponential with jitter is the modern default; bounded count prevents infinite loops.
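The retry strategies above can be sketched as a small helper. This is a minimal illustration, not a prescribed implementation: `TransientError`, `backoff_delay`, and `call_with_retry` are hypothetical names, the "full jitter" variant (a uniform delay between zero and the exponential ceiling) is one of several jitter schemes, and the base, cap, and attempt limits are placeholder values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter delay for a 0-based attempt: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def call_with_retry(operation, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Run operation(); retry on TransientError up to max_attempts, then re-raise."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # bounded: the caller can now dead-letter the message
            sleep(backoff_delay(attempt, base, cap))
```

Only errors the caller has classified as transient are retried here; anything else propagates immediately, which keeps poison messages from consuming the retry budget.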
Dead-letter queues (DLQ). When retries exhaust, the message goes to a dead-letter queue:
- Azure Service Bus — dead-letter subqueue per queue and per topic subscription.
- Azure Event Grid — dead-letter to a blob storage container.
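- Azure Storage Queue — no built-in dead-lettering; the consumer moves failed messages to a separate "poison" queue (the Azure Functions queue trigger does this automatically).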
- Webhook — no dead-letter; message lost.
For critical integrations, never use webhooks alone — the lack of DLQ is operationally unacceptable.
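To make the dead-lettering behaviour concrete, here is an in-memory sketch of a queue that moves a message to a dead-letter queue once its delivery count is exhausted. It is loosely modelled on how brokers with a maximum delivery count behave and does not use any Azure SDK; `Envelope`, `QueueWithDlq`, and the limit of 3 are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Envelope:
    body: dict
    delivery_count: int = 0
    dead_letter_reason: str = ""


@dataclass
class QueueWithDlq:
    max_delivery_count: int = 3  # illustrative limit, not a platform default
    main: deque = field(default_factory=deque)
    dead_letter: deque = field(default_factory=deque)

    def send(self, body):
        self.main.append(Envelope(body))

    def process_next(self, handler):
        """Deliver one message; requeue on failure, dead-letter once the count is exhausted."""
        if not self.main:
            return False
        envelope = self.main.popleft()
        envelope.delivery_count += 1
        try:
            handler(envelope.body)
        except Exception as exc:
            if envelope.delivery_count >= self.max_delivery_count:
                # Keep the failure reason with the message so triage can read it later.
                envelope.dead_letter_reason = f"{type(exc).__name__}: {exc}"
                self.dead_letter.append(envelope)
            else:
                self.main.append(envelope)
        return True
```

A plain webhook has no equivalent of the `dead_letter` deque, which is why a failed delivery there simply disappears.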
DLQ inspection.
- Service Bus Explorer — see messages in DLQ, inspect content.
- Azure CLI / PowerShell — programmatic access.
- Custom DLQ dashboard — Power BI report over DLQ metrics.
Without inspection, messages pile up unseen.
Resubmission from DLQ.
- Manual — operator inspects, fixes issue, resubmits message to main queue.
- Automated — DLQ processor with logic to retry messages whose conditions have cleared.
- Bulk resubmit — after an outage, resubmit all DLQ messages.
The choice depends on volume and on how much judgment each message needs. For small volumes, manual resubmission is fine; high volumes need automation, with safeguards against resubmitting messages that will only fail again.
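An automated DLQ processor with a basic safeguard might look like the sketch below. The message shape (a dict with an optional `resubmit_count`), the `is_cleared` predicate, and the cap of two resubmissions are all assumptions for illustration; a real processor would read from and write to the actual broker.

```python
def resubmit_dead_letters(dead_letters, send, is_cleared, max_resubmits=2):
    """Resubmit dead-lettered messages whose blocking condition has cleared.

    Returns (resubmitted, held); held messages stay in the DLQ for manual review.
    """
    resubmitted, held = [], []
    for message in dead_letters:
        attempts = message.get("resubmit_count", 0)
        # Safeguard: never loop a message between the DLQ and the main queue forever.
        if attempts >= max_resubmits or not is_cleared(message):
            held.append(message)
            continue
        send({**message, "resubmit_count": attempts + 1})
        resubmitted.append(message)
    return resubmitted, held
```

Carrying the resubmission count on the message itself means the cap still holds if the processor restarts between runs.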
Idempotency. A foundational requirement. Receivers must handle the same message multiple times:
- Idempotency key — unique per business event.
- Deduplication at receiver — check if already processed; skip if so.
- State-based logic — set state to X, don't increment counter twice.
Without idempotency, retries cause data corruption. Build idempotency into every receiver from day one.
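The three ideas above fit in a few lines. This sketch keeps processed keys in memory purely for illustration; a real receiver would persist them durably, ideally in the same transaction as the business change. `IdempotentReceiver` and the message fields are hypothetical.

```python
class IdempotentReceiver:
    """Applies each business event at most once, keyed by its idempotency key."""

    def __init__(self):
        self.processed_keys = set()  # in production: a durable store
        self.order_status = {}

    def handle(self, message):
        key = message["idempotency_key"]
        if key in self.processed_keys:
            return "skipped"  # duplicate delivery: already applied
        # State-based update: setting a status is safe to repeat, unlike incrementing a counter.
        self.order_status[message["order_id"]] = message["status"]
        self.processed_keys.add(key)
        return "applied"
```

Delivering the same message twice leaves `order_status` unchanged after the first call, which is the property retries and DLQ resubmission depend on.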
Idempotency keys.
- Business event ID — order ID, invoice number.
- Source-system primary key — record ID from external system.
- Composite key — combination of source ID and operation.
The key should be deterministic — same business event always produces same key, regardless of how many times the message is sent.
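One way to get a deterministic composite key is a name-based UUID (version 5) over the source system, record ID, and operation. The namespace string and field choice below are assumptions; any stable hash of the same inputs would serve.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
INTEGRATION_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "integration.example.invalid")


def idempotency_key(source_system, record_id, operation):
    """Derive the same key every time the same business event is emitted."""
    name = f"{source_system}|{record_id}|{operation}"
    return str(uuid.uuid5(INTEGRATION_NAMESPACE, name))
```

Because the key depends only on the business event, a retry, a DLQ resubmission, and a source-system replay all produce the same value. A random GUID generated at send time would not.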
Message correlation. For complex flows:
- Correlation ID — identifies a multi-message conversation.
- Sequence number — orders messages within a conversation.
- Saga ID — identifies a longer-running process.
When something goes wrong, correlation lets you reconstruct the conversation.
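As a sketch of what reconstruction can look like, the helper below groups logged messages by correlation ID, orders them by sequence number, and reports gaps. The log entry fields and the assumption that sequences start at 1 are illustrative only.

```python
from collections import defaultdict


def reconstruct_conversations(log_entries):
    """Group entries by correlation_id and order each group by sequence number."""
    conversations = defaultdict(list)
    for entry in log_entries:
        conversations[entry["correlation_id"]].append(entry)
    for entries in conversations.values():
        entries.sort(key=lambda e: e["sequence"])
    return dict(conversations)


def missing_sequences(entries):
    """Return sequence numbers absent from a conversation (assumes numbering starts at 1)."""
    seen = {e["sequence"] for e in entries}
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

A gap in the sequence usually points at a message that was dropped or is still sitting in a dead-letter queue.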
Outbox pattern. A specific reliability pattern:
- Application writes to local DB transaction AND outbox table.
- Separate process reads outbox and publishes to messaging.
- Even if messaging is down, outbox holds the messages until published.
For Dynamics 365 plug-ins emitting external events, an outbox table in Dataverse + a worker reading it is a reliable pattern.
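The outbox mechanics can be shown with SQLite standing in for the application database. This is a sketch under assumptions: the table names, the `OrderSaved` event, and the `publish` callable are hypothetical, and in Dataverse the outbox would be a custom table rather than a SQL table.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
    )


def save_order(conn, order_id, status):
    """Write the business row and the outbox row in one transaction."""
    with conn:  # commits both inserts together, or rolls both back
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, status))
        event = {"type": "OrderSaved", "order_id": order_id, "status": status}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def publish_pending(conn, publish):
    """Publish unpublished rows in order; stop at the first failure so rows stay pending."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            publish(json.loads(payload))
        except Exception:
            break  # broker unavailable: leave the row for the next run
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the worker crashes after publishing but before marking the row, the event is published again on the next run. The outbox therefore gives at-least-once delivery, and the receiver still needs to deduplicate.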
Inbox pattern. The receiver's mirror:
- Receiver writes message to local inbox table on receipt.
- Processing logic checks inbox before acting.
- Idempotency via inbox deduplication.
Outbox + inbox = at-least-once delivery with deduplication = exactly-once business outcome.
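The receiving side can be sketched the same way, again with SQLite standing in for the receiver's database. The inbox insert and the business update share one transaction, so a redelivered message trips the primary key and changes nothing. The table names and message fields are assumptions, and the balance increment is chosen deliberately because it is not safe to repeat on its own.

```python
import sqlite3


def create_inbox(conn):
    conn.execute("CREATE TABLE inbox (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")


def receive(conn, message):
    """Apply a message once; a redelivery hits the inbox primary key and is skipped."""
    try:
        with conn:  # inbox insert and business update commit or roll back together
            conn.execute("INSERT INTO inbox (message_id) VALUES (?)", (message["id"],))
            cursor = conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (message["amount"], message["account"]),
            )
            if cursor.rowcount == 0:
                conn.execute(
                    "INSERT INTO balances (account, amount) VALUES (?, ?)",
                    (message["account"], message["amount"]),
                )
    except sqlite3.IntegrityError:
        return False  # already in the inbox: duplicate delivery
    return True
```

Delivering the same message twice increments the balance once, which is the "exactly-once business outcome" built on top of at-least-once delivery.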
Schema versioning. Messages have schemas that change:
- Backward compatible — add fields; old consumers ignore them.
- Forward compatible — handle missing fields gracefully.
- Breaking change — version the message type.
Most integrations should aim for backward compatibility; breaking changes need coordinated deployment.
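A consumer that follows these rules might look like the sketch below. The `schema_version` field, the v1 and v2 payload shapes, and the default currency are all hypothetical; the point is that unknown fields are ignored, missing optional fields get defaults, and a breaking change is routed by an explicit version.

```python
def parse_order_event(raw):
    """Normalise an order event across schema versions into one internal shape."""
    version = raw.get("schema_version", 1)  # assumption: unversioned messages are v1
    if version == 1:
        return {
            "order_id": raw["order_id"],
            "total": raw["total"],
            # Field added later in v1: tolerate its absence (default is an assumption).
            "currency": raw.get("currency", "USD"),
        }
    if version == 2:
        # Hypothetical breaking change: total became a nested amount object.
        return {
            "order_id": raw["order_id"],
            "total": raw["amount"]["value"],
            "currency": raw["amount"]["currency"],
        }
    raise ValueError(f"unsupported schema_version: {version}")
```

Raising on an unknown version sends the message to the retry and dead-letter path, where it is visible, rather than letting the consumer guess at the payload.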
Poison message identification. Distinguishing poison from transient is hard:
- Persistent failure across retries with same error — likely poison.
- Same error across multiple messages — likely systemic issue, not poison.
- Manual review — operator looks at the message, identifies the problem.
DLQ inspection workflow includes triage to categorise messages.
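A first-pass triage heuristic along these lines can be automated before an operator looks at anything. In this sketch the message shape and the threshold of three are assumptions; it only labels candidates and does not replace manual review.

```python
from collections import Counter


def triage(dead_letters, systemic_threshold=3):
    """Label each dead letter by how many other dead letters share its error text."""
    error_counts = Counter(message["error"] for message in dead_letters)
    labels = {}
    for message in dead_letters:
        shared = error_counts[message["error"]] >= systemic_threshold
        # Many messages with one error suggests the receiver or environment, not the data.
        labels[message["id"]] = "systemic" if shared else "likely_poison"
    return labels
```

Messages labelled `systemic` are good candidates for bulk resubmission once the underlying issue is fixed, while `likely_poison` messages need someone to inspect the payload.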
Replay tooling.
- Replay from DLQ — resubmit individual messages.
- Replay from event store — for Event Hub or storage-backed events, replay from a timestamp.
- Source system replay — re-extract and re-emit from the source.
For long-running outages, replaying from source is sometimes the only practical recovery.
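Replay from a timestamp reduces to filtering stored events and feeding them back through the normal handler in order. The sketch below uses an in-memory list as the event store; the `emitted_at` field and the function name are assumptions, and a real store would be queried by offset or enqueue time instead.

```python
from datetime import datetime, timezone


def replay_since(events, since, handler):
    """Re-deliver every stored event emitted at or after `since`, oldest first."""
    selected = sorted(
        (event for event in events if event["emitted_at"] >= since),
        key=lambda event: event["emitted_at"],
    )
    for event in selected:
        handler(event)  # the handler must be idempotent: some events were already processed
    return len(selected)
```

For example, `replay_since(store, datetime(2024, 5, 1, tzinfo=timezone.utc), receiver)` would re-deliver everything from the start of that day. Replay is only safe when the receiver deduplicates, because the window almost always overlaps messages that were processed successfully before the outage.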
Monitoring and alerting.
- DLQ depth and age — alert when the DLQ grows past a threshold or holds messages older than N hours.
- Retry rate — high retry rate signals systemic issue.
- End-to-end latency — track from source emission to destination processing.
- Failed plug-in executions — Dataverse async job failures.
Without monitoring, integration health is invisible until something breaks user-visibly.
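The DLQ check is the simplest of these to automate. The sketch below evaluates depth and oldest-message age against thresholds; the `dead_lettered_at` field and the default limits are placeholders, and in practice the inputs would come from broker metrics rather than a list.

```python
from datetime import datetime, timedelta, timezone


def dlq_alerts(dead_letters, now, max_age=timedelta(hours=4), max_depth=100):
    """Return human-readable alerts for a DLQ that is too deep or too stale."""
    alerts = []
    if len(dead_letters) > max_depth:
        alerts.append(f"DLQ depth {len(dead_letters)} exceeds {max_depth}")
    if dead_letters:
        oldest = min(message["dead_lettered_at"] for message in dead_letters)
        age = now - oldest
        if age > max_age:
            alerts.append(f"oldest DLQ message has waited {age}")
    return alerts
```

Alerting on age as well as depth catches the quiet case where a single message has been sitting unreviewed for days.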
Common pitfalls.
- No idempotency. Retries cause duplicates; data corruption.
- No DLQ. Failed messages disappear; integration silently incomplete.
- DLQ ignored. Messages pile up; no one reviews; eventual cleanup loses business data.
- Schema drift. Sender updates schema; receivers break; cascade of failures.
- No replay capability. Outage recovery requires manual data backfill.
- Unbounded or overly optimistic retry counts. Messages retry forever; resources leak.
Operational rhythm.
- Daily — DLQ depth check.
- Weekly — DLQ message triage; resubmit or discard.
- Monthly — review monitoring alerts and patterns; improve resilience.
- After incidents — postmortem; improvements deployed.
Strategic positioning. Integration reliability is an engineering discipline, not a feature. The architectural patterns (at-least-once delivery, idempotency, DLQ, replay) apply across messaging platforms and integration scenarios. Get them right once and reuse them across integrations. Teams that invest in reliable integration architecture have boring integration days; teams that don't have constant fire drills. The investment is small relative to the ongoing operational cost, and the savings compound.