Message replay and poison queue handling for Dynamics 365 integrations

How to design Dynamics 365 integrations that survive transient failures and handle poison messages — dead-letter queues, replay tooling, idempotency, and the operational rhythms.

Updated 2026-08-13

Every integration eventually fails: the receiver is down, the payload is malformed, or the downstream system is busy. A robust Dynamics 365 integration architecture handles these failures gracefully: failures are isolated, messages are retried, poison messages are quarantined for human review, and replay machinery restores sync. Without it, transient failures cascade into data divergence and operational scrambling.

Failure modes.

  • Transient — temporary network issue, brief downstream slowness, momentary rate limit. Retry succeeds.
  • Persistent — receiver down for hours, schema mismatch, authentication expired. Retry continues to fail.
  • Poison — message data malformed in a way that no retry will help.

Each requires a different response.
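
As a minimal sketch of that routing, the code below maps an exception to one of the three failure modes. The exception classes, the FailureMode enum, and the classify helper are hypothetical names invented for illustration; they are not part of any Dynamics 365 or Azure SDK, and the mapping itself is an assumption to adjust per system.

```python
from enum import Enum


class FailureMode(Enum):
    TRANSIENT = "transient"    # retry soon
    PERSISTENT = "persistent"  # pause and alert; retrying alone will not help
    POISON = "poison"          # quarantine for human review


# Hypothetical exception types a receiver might raise.
class RateLimited(Exception): ...
class AuthExpired(Exception): ...
class MalformedPayload(Exception): ...


def classify(error: Exception) -> FailureMode:
    """Map an exception to a failure mode (assumed mapping)."""
    if isinstance(error, (TimeoutError, ConnectionError, RateLimited)):
        return FailureMode.TRANSIENT
    if isinstance(error, (MalformedPayload, ValueError)):
        return FailureMode.POISON
    # Anything else: expired credentials, schema mismatch, long outage.
    return FailureMode.PERSISTENT
```

Defaulting unknown errors to PERSISTENT is a deliberately cautious choice here: it stops blind retries without discarding the message.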

Retry strategies.

  • Fixed delay retry — fixed N seconds between attempts.
  • Exponential backoff — delay doubles each attempt.
  • Exponential with jitter — backoff with random variance to prevent thundering herd.
  • Bounded retry count — give up after N attempts.

Exponential with jitter is the modern default; bounded count prevents infinite loops.
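
A sketch of bounded exponential backoff with jitter, assuming the "full jitter" variant (a random delay between zero and the exponential ceiling). The function names, base delay, cap, and attempt limit are illustrative assumptions. In practice only transient errors should be retried; this sketch retries on any exception to stay short.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(operation, max_attempts: int = 5, sleep=time.sleep):
    """Run operation(); retry on exception up to max_attempts, then re-raise."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # bounded: give up so the caller can dead-letter
            sleep(backoff_delay(attempt))
```

The sleep parameter is injectable so the retry loop can be tested without real waiting.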

Dead-letter queues (DLQ). When retries exhaust, the message goes to a dead-letter queue:

  • Azure Service Bus — dead-letter subqueue per queue and per topic subscription.
  • Azure Event Grid — dead-letter to a blob storage container.
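  • Azure Storage Queue — no built-in DLQ; a separate "poison" queue by convention (Azure Functions queue triggers move repeatedly failing messages to <queue-name>-poison).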
  • Webhook — no dead-letter; message lost.

For critical integrations, never use webhooks alone — the lack of DLQ is operationally unacceptable.
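
The dead-lettering rule can be modelled in memory as follows. This is a conceptual sketch, not the Azure SDK: the Envelope class, the drain function, and the limit of three deliveries are assumptions. Azure Service Bus applies a comparable rule through its maximum delivery count setting.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Envelope:
    body: dict
    delivery_count: int = 0
    dead_letter_reason: str = ""


def drain(queue: deque, dlq: list, handler, max_deliveries: int = 3) -> None:
    """Process the queue; move messages that keep failing to the DLQ."""
    while queue:
        msg = queue.popleft()
        msg.delivery_count += 1
        try:
            handler(msg.body)
        except Exception as exc:
            if msg.delivery_count >= max_deliveries:
                msg.dead_letter_reason = f"{type(exc).__name__}: {exc}"
                dlq.append(msg)    # quarantined, not lost
            else:
                queue.append(msg)  # redeliver later
```

Recording the failure reason on the envelope is what makes later triage possible.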

DLQ inspection.

  • Service Bus Explorer — see messages in DLQ, inspect content.
  • Azure CLI / PowerShell — programmatic access.
  • Custom DLQ dashboard — Power BI report over DLQ metrics.

Without inspection, messages pile up unseen.

Resubmission from DLQ.

  • Manual — operator inspects, fixes issue, resubmits message to main queue.
  • Automated — DLQ processor with logic to retry messages whose conditions have cleared.
  • Bulk resubmit — after an outage, resubmit all DLQ messages.

The choice depends on volume and on how much judgment each message requires. For small volumes, manual resubmission is fine; for high volumes, automation is needed, with safeguards.
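
One possible shape for an automated or bulk resubmit with safeguards is sketched below. The resubmit function, the should_retry predicate, and the batch cap are hypothetical; plain lists stand in for the DLQ and the main queue.

```python
def resubmit(dlq: list, main_queue: list, should_retry, max_batch: int = 100) -> int:
    """Move eligible messages from dlq back to main_queue; return the count moved.

    should_retry(message) -> bool decides whether the original failure
    condition has cleared. max_batch caps one run so a bad resubmit
    cannot flood the receiver.
    """
    moved = 0
    remaining = []
    for message in dlq:
        if moved < max_batch and should_retry(message):
            main_queue.append(message)
            moved += 1
        else:
            remaining.append(message)
    dlq[:] = remaining  # keep everything that was not resubmitted
    return moved
```

For example, after an outage an operator might call resubmit with a predicate that selects only messages whose recorded reason is a timeout, leaving malformed messages in the DLQ for manual review.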

Idempotency. A foundational requirement. Receivers must handle the same message multiple times:

  • Idempotency key — unique per business event.
  • Deduplication at receiver — check if already processed; skip if so.
  • State-based logic — set state to X, don't increment counter twice.

Without idempotency, retries cause data corruption. Build idempotency into every receiver from day one.
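
A minimal sketch of an idempotent receiver, combining deduplication with a state-based update. The AccountReceiver class and the message field names are assumptions; a real receiver would keep the processed keys in a durable store, not in memory.

```python
class AccountReceiver:
    """Applies each message at most once, keyed by its idempotency key."""

    def __init__(self):
        self.processed = set()  # in production: a durable store
        self.status = {}        # account id -> status

    def handle(self, message: dict) -> bool:
        key = message["idempotency_key"]
        if key in self.processed:
            return False        # duplicate delivery: skip
        # State-based update: setting a value is safe to repeat,
        # unlike incrementing a counter.
        self.status[message["account_id"]] = message["status"]
        self.processed.add(key)
        return True
```

Delivering the same message twice leaves the state unchanged, and the second call returns False.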

Idempotency keys.

  • Business event ID — order ID, invoice number.
  • Source-system primary key — record ID from external system.
  • Composite key — combination of source ID and operation.

The key should be deterministic: the same business event always produces the same key, regardless of how many times the message is sent. Avoid inputs that change per send, such as timestamps or freshly generated GUIDs.
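
A sketch of a composite key derived only from fields that identify the business event. The function name and the choice of SHA-256 are assumptions; any stable derivation would serve.

```python
import hashlib


def idempotency_key(source_system: str, record_id: str, operation: str) -> str:
    """Derive a stable key from fields that identify the business event."""
    # Assumes the fields never contain "|"; otherwise use an unambiguous encoding.
    raw = "|".join((source_system, record_id, operation))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Calling the function twice with the same arguments returns the same key, while changing the operation (for example "create" to "update") yields a different one.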

Message correlation. For complex flows:

  • Correlation ID — identifies a multi-message conversation.
  • Sequence number — orders messages within a conversation.
  • Saga ID — identifies a longer-running process.

When something goes wrong, correlation lets you reconstruct the conversation.
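
As an illustration, the sketch below rebuilds conversations from a flat message log and reports gaps. The field names correlation_id and sequence, and the assumption that sequences start at 1, are hypothetical.

```python
from collections import defaultdict


def conversations(log: list) -> dict:
    """Group logged messages by correlation_id, ordered by sequence number."""
    grouped = defaultdict(list)
    for entry in log:
        grouped[entry["correlation_id"]].append(entry)
    return {
        cid: sorted(entries, key=lambda e: e["sequence"])
        for cid, entries in grouped.items()
    }


def missing_sequences(entries: list) -> list:
    """Return sequence numbers absent between 1 and the highest one seen."""
    seen = {e["sequence"] for e in entries}
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

A gap in the sequence points at the message to look for in the DLQ.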

Outbox pattern. A specific reliability pattern:

  • Application writes the business change AND an outbox row in the same local DB transaction.
  • Separate process reads outbox and publishes to messaging.
  • Even if messaging is down, outbox holds the messages until published.

For Dynamics 365 plug-ins emitting external events, an outbox table in Dataverse + a worker reading it is a reliable pattern.
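
The pattern can be sketched with SQLite standing in for the local database. This is not Dataverse code: the table names, columns, and the publish callback are assumptions chosen to keep the example self-contained.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
db.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
)


def create_order(order_id: str, total: float) -> None:
    """Write the business row and its event in one transaction."""
    with db:  # commits both inserts together, or rolls both back
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        event = json.dumps({"type": "order_created", "order_id": order_id})
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))


def publish_pending(publish) -> int:
    """Relay unpublished rows; a row is marked only after publish() succeeds."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            publish(json.loads(payload))
        except Exception:
            break  # broker unavailable: leave the rest for the next run
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Because a row is marked published only after the publish call returns, a crash between the two steps causes a duplicate send, not a lost one. That is at-least-once delivery, which is why receivers still need idempotency.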

Inbox pattern. The receiver's mirror:

  • Receiver writes message to local inbox table on receipt.
  • Processing logic checks inbox before acting.
  • Idempotency via inbox deduplication.

Outbox + inbox = at-least-once delivery with deduplication = an effectively exactly-once business outcome.
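
A sketch of inbox deduplication, again with SQLite as a stand-in. The inbox and stock tables and the receive function are hypothetical. The primary key on message_id is what rejects a duplicate, and the business update shares the transaction with the inbox insert.

```python
import sqlite3

inbox_db = sqlite3.connect(":memory:")
inbox_db.execute("CREATE TABLE inbox (message_id TEXT PRIMARY KEY)")
inbox_db.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER NOT NULL)")
inbox_db.execute("INSERT INTO stock VALUES ('SKU-1', 0)")
inbox_db.commit()


def receive(message_id: str, sku: str, delta: int) -> bool:
    """Apply a stock adjustment once per message_id; return False for duplicates."""
    try:
        with inbox_db:  # inbox insert and business update commit together
            inbox_db.execute("INSERT INTO inbox VALUES (?)", (message_id,))
            inbox_db.execute(
                "UPDATE stock SET qty = qty + ? WHERE sku = ?", (delta, sku)
            )
    except sqlite3.IntegrityError:
        return False  # primary key clash: already processed
    return True
```

The increment here is not naturally idempotent, so the inbox check is the only thing preventing a double count on redelivery.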

Schema versioning. Messages have schemas that change:

  • Backward compatible — add fields; old consumers ignore them.
  • Forward compatible — handle missing fields gracefully.
  • Breaking change — version the message type.

Most integrations should aim for backward compatibility; breaking changes need coordinated deployment.
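
A sketch of a tolerant reader that ignores unknown fields, defaults missing optional ones, and dispatches on an explicit version for breaking changes. The schema_version field, the contact message shape, and the v2 layout are all invented for illustration.

```python
def parse_contact(message: dict) -> dict:
    """Read a contact message, tolerating added and missing optional fields."""
    version = message.get("schema_version", 1)
    if version == 1:
        return {
            "contact_id": message["contact_id"],  # required in every version
            "email": message.get("email"),        # optional: may be missing
        }
    if version == 2:
        # Hypothetical breaking change: v2 nests the email under "channels".
        return {
            "contact_id": message["contact_id"],
            "email": message.get("channels", {}).get("email"),
        }
    raise ValueError(f"unsupported schema_version: {version}")
```

Rejecting an unknown version loudly is a design choice: the message lands in the DLQ with a clear reason instead of being half-processed.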

Poison message identification. Distinguishing poison from transient is hard:

  • Persistent failure across retries with same error — likely poison.
  • Same error across multiple messages — likely systemic issue, not poison.
  • Manual review — operator looks at the message, identifies the problem.

A DLQ inspection workflow should therefore include a triage step that categorises each message.
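
The heuristics above could be encoded roughly as follows. This is a sketch under stated assumptions: each dead-lettered entry is assumed to carry an id and a list of the error strings recorded on each failed attempt, and the threshold of five messages is arbitrary.

```python
from collections import Counter


def triage(dlq: list, systemic_threshold: int = 5) -> dict:
    """Label each dead-lettered message 'systemic', 'poison', or 'review'.

    Each entry is assumed to carry 'id' and 'errors', the list of error
    strings recorded on each failed attempt.
    """
    last_errors = Counter(m["errors"][-1] for m in dlq)
    labels = {}
    for m in dlq:
        errors = m["errors"]
        if last_errors[errors[-1]] >= systemic_threshold:
            labels[m["id"]] = "systemic"  # many messages share the error
        elif len(errors) > 1 and len(set(errors)) == 1:
            labels[m["id"]] = "poison"    # same error on every retry
        else:
            labels[m["id"]] = "review"    # unclear: needs an operator
    return labels
```

The labels are a starting point for the operator, not a verdict; anything marked "review" still needs a person to look at it.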

Replay tooling.

  • Replay from DLQ — resubmit individual messages.
  • Replay from event store — for Event Hub or storage-backed events, replay from a timestamp.
  • Source system replay — re-extract and re-emit from the source.

For long-running outages, replaying from source is sometimes the only practical recovery.
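
Replay from a timestamp can be sketched against an in-memory list standing in for the event store. The replay_since function and the event shape are assumptions; a real store such as Event Hubs would be read through its own client.

```python
from datetime import datetime, timezone


def replay_since(event_store: list, since: datetime, handler) -> int:
    """Re-deliver stored events with timestamp >= since, oldest first."""
    selected = sorted(
        (e for e in event_store if e["timestamp"] >= since),
        key=lambda e: e["timestamp"],
    )
    for event in selected:
        handler(event)  # the handler must be idempotent: replays repeat events
    return len(selected)
```

Choosing a start time slightly before the outage began is safe only because the receiver deduplicates.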

Monitoring and alerting.

  • DLQ depth and age — alert when the DLQ grows past a threshold or holds messages older than N hours.
  • Retry rate — high retry rate signals systemic issue.
  • End-to-end latency — track from source emission to destination processing.
  • Failed plug-in executions — Dataverse async job failures.

Without monitoring, integration health is invisible until something breaks user-visibly.
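
A sketch of a DLQ alert check over message enqueue times. The dlq_alerts function and both thresholds (four hours, fifty messages) are assumed values for illustration, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def dlq_alerts(enqueued_times: list, now: datetime,
               max_age: timedelta = timedelta(hours=4),
               max_depth: int = 50) -> list:
    """Return alert strings for a DLQ given each message's enqueue time."""
    alerts = []
    if len(enqueued_times) > max_depth:
        alerts.append(f"DLQ depth {len(enqueued_times)} exceeds {max_depth}")
    if enqueued_times:
        oldest = now - min(enqueued_times)
        if oldest > max_age:
            alerts.append(f"oldest DLQ message is {oldest} old")
    return alerts
```

Checking age as well as depth catches the quiet case where a single message sits unreviewed for days.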

Common pitfalls.

  • No idempotency. Retries cause duplicates; data corruption.
  • No DLQ. Failed messages disappear; integration silently incomplete.
  • DLQ ignored. Messages pile up; no one reviews; eventual cleanup loses business data.
  • Schema drift. Sender updates schema; receivers break; cascade of failures.
  • No replay capability. Outage recovery requires manual data backfill.
  • Optimistic retry counts. Retries forever; resource leak.

Operational rhythm.

  • Daily — DLQ depth check.
  • Weekly — DLQ message triage; resubmit or discard.
  • Monthly — review monitoring alerts and patterns; improve resilience.
  • After incidents — postmortem; improvements deployed.

Strategic positioning. Integration reliability is an engineering discipline, not a feature. The architectural patterns (at-least-once delivery, idempotency, DLQ, replay) apply across messaging platforms and integration scenarios. Get them right once and reuse them across integrations. Teams that invest in reliable integration architecture have boring integration days; teams that don't have constant fire drills. The investment is small relative to the ongoing operational cost, and the savings compound.
