Retry policies with Azure services for Dynamics 365 integrations

How to implement retry policies in Azure-based Dynamics 365 integrations — exponential backoff, idempotency, circuit-breaker integration, and the patterns that handle transient failures gracefully.

Updated 2026-11-12

Network blips, rate limiting, downstream slowness — transient failures are inevitable in distributed systems. Retry policies absorb these failures by trying again with appropriate backoff. Implemented well, they make integrations resilient; implemented badly, they amplify failures into incidents.

Transient vs permanent failures.

  • Transient — temporary; succeeds on retry. Network glitch, rate limit, brief downstream outage.
  • Permanent — won't succeed however many retries. Schema mismatch, auth failure, business validation failure.

Retry only transient failures; permanent failures need different handling (alert, dead-letter, manual remediation).

Failure classification. Decide per error type:

  • HTTP 5xx — usually transient.
  • HTTP 429 Too Many Requests — definitely transient; respect Retry-After.
  • HTTP 408 Request Timeout — transient.
  • Other HTTP 4xx — usually permanent (validation, auth).
  • Connection errors — transient.
  • Specific application errors — depends.

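The classifier is critical. Treat a permanent failure as transient and every retry is wasted (and never ends if the count is unbounded); treat a transient failure as permanent and requests fail that would have succeeded on retry.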
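As a minimal sketch of such a classifier, the list above can be written as two small predicates. The `TransientClassifier` name and the choice of exception types are illustrative assumptions rather than part of any SDK; application-specific errors would need their own rules.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;

public static class TransientClassifier
{
    // Status codes: 408, 429 and 5xx are retried; every other code is treated as permanent.
    public static bool IsTransient(HttpStatusCode status)
    {
        int code = (int)status;
        return code == 408 || code == 429 || (code >= 500 && code <= 599);
    }

    // Exceptions: connection-level errors and timeouts are retried; anything else is not.
    public static bool IsTransient(Exception ex) =>
        ex is HttpRequestException || ex is SocketException || ex is TimeoutException;
}
```

The "usually" in the list still applies: some 5xx responses will never succeed on retry, so a bounded retry count remains necessary even with a classifier in place.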

Backoff strategies.

  • Fixed delay — wait N seconds, retry.
  • Linear backoff — N, 2N, 3N seconds.
  • Exponential backoff — N, 2N, 4N, 8N seconds.
  • Exponential with jitter — exponential + random variance.

Exponential with jitter is the modern default; it prevents thundering-herd retries.

Why jitter.

  • Without jitter, all retries synchronise.
  • Many clients failing simultaneously retry at the same moment.
  • Spike overwhelms the recovering downstream.

Random jitter spreads retries over time, so the downstream recovers more smoothly.
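The delay calculation itself is small. The sketch below assumes a 1-based attempt number, an upper cap on the delay, and additive jitter (a random extra wait on top of the exponential delay); the `Backoff` class and its parameter names are hypothetical.

```csharp
using System;

public static class Backoff
{
    // Delay before retry number `attempt` (1-based): baseDelay * 2^(attempt - 1), capped at maxDelay.
    public static TimeSpan Exponential(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    // Adds a uniformly random extra wait in [0, maxJitter) so clients do not retry in lockstep.
    public static TimeSpan WithJitter(TimeSpan delay, TimeSpan maxJitter, Random rng) =>
        delay + TimeSpan.FromMilliseconds(rng.NextDouble() * maxJitter.TotalMilliseconds);
}
```

With a one-second base, attempts 1 to 4 wait 1, 2, 4 and 8 seconds before jitter is added. The cap keeps late attempts from waiting for an unreasonably long time.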

Max retries.

  • Bounded — give up after N attempts.
  • Unbounded — retry forever (rare; usually wrong).

Bounded count protects against permanent failures masquerading as transient.

Retry-After header. When the server says how long to wait:

  • The HTTP response includes Retry-After as a number of seconds (Retry-After: 30) or as an absolute HTTP date.
  • Honour it; don't retry sooner.
  • Common in 429 (rate limit) responses.

Ignoring Retry-After can extend the throttling.
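In .NET the header is exposed as `HttpResponseMessage.Headers.RetryAfter`, which carries either a delta or a date. The following is a sketch of turning it into a wait time and combining it with the locally computed backoff; the `RetryAfterDelay` helper is a hypothetical name, and the current time is passed in so the logic stays testable.

```csharp
using System;
using System.Net.Http.Headers;

public static class RetryAfterDelay
{
    // Returns the wait requested by the server, or null when no usable header is present.
    public static TimeSpan? GetDelay(RetryConditionHeaderValue header, DateTimeOffset now)
    {
        if (header == null) return null;
        if (header.Delta.HasValue) return header.Delta.Value;   // "Retry-After: 30"
        if (header.Date.HasValue)                                // absolute HTTP date
        {
            TimeSpan wait = header.Date.Value - now;
            return wait > TimeSpan.Zero ? wait : TimeSpan.Zero;
        }
        return null;
    }

    // Never retry sooner than the server asked: take the longer of the two delays.
    public static TimeSpan ChooseDelay(TimeSpan backoff, TimeSpan? retryAfter) =>
        (retryAfter.HasValue && retryAfter.Value > backoff) ? retryAfter.Value : backoff;
}
```

A caller would typically pass `response.Headers.RetryAfter` and `DateTimeOffset.UtcNow`, then sleep for the value returned by `ChooseDelay`.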

Implementation libraries.

  • Polly (.NET) — declarative retry policies.
  • Resilience4j (Java) — similar.
  • axios-retry (Node) — for the axios HTTP client.
  • Azure SDK — built-in retry on many operations.

Choose a mature library; rolling your own retry is error-prone.

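Polly example.

using System;
using System.Net.Http;
using Polly; // Polly NuGet package, v7-style API

var jitterer = Random.Shared; // thread-safe shared instance (.NET 6+)

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutException>()
    // PostAsync returns 5xx/429/408 as responses instead of throwing, so check the status too.
    .OrResult<HttpResponseMessage>(r =>
        (int)r.StatusCode >= 500 || (int)r.StatusCode == 429 || (int)r.StatusCode == 408)
    .WaitAndRetryAsync(
        retryCount: 5,
        // Waits of 2, 4, 8, 16 and 32 seconds, each plus up to one second of jitter.
        sleepDurationProvider: attempt =>
            TimeSpan.FromSeconds(Math.Pow(2, attempt)) +
            TimeSpan.FromMilliseconds(jitterer.Next(0, 1000))
    );

var response = await retryPolicy.ExecuteAsync(() =>
    client.PostAsync(url, content));

Five retries (six attempts in total) with exponential backoff and jitter; handles common transient cases. HttpClient does not throw on 5xx or 429 responses, so a policy that handles only exceptions would never retry them.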

Idempotency. Critical for retries:

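  • The original attempt may have succeeded even though the caller saw a failure (for example, the response was lost to a timeout).
  • The retry then causes duplicate processing on the receiving side.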
  • Idempotency means same input → same outcome, regardless of count.

Patterns:

  • Idempotency-Key header — receiver dedupes by key.
  • Natural keys — operation tied to unique business identifier.
  • State-based — set state idempotently.

Without idempotency, retries can create duplicate records and corrupt data.
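The Idempotency-Key pattern can be sketched on the receiving side as a lookup before execution. This version keeps results in memory and serialises all calls behind one lock, both simplifying assumptions; a real receiver would use a durable store with an expiry. The `IdempotentReceiver` class is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed class IdempotentReceiver
{
    private readonly Dictionary<string, string> _completed = new Dictionary<string, string>();
    private readonly object _gate = new object();

    // Runs the operation once per key; a repeated key replays the stored result instead.
    public string Handle(string idempotencyKey, Func<string> operation)
    {
        lock (_gate)
        {
            if (_completed.TryGetValue(idempotencyKey, out string previous))
                return previous;                  // duplicate delivery: no second execution

            string result = operation();          // if this throws, nothing is stored,
            _completed[idempotencyKey] = result;  // so a later retry runs it again
            return result;
        }
    }
}
```

The sender must reuse the same key for every retry of one logical operation; generating a fresh key per attempt defeats the deduplication.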

Retry + circuit breaker. Combined patterns:

  • Retry handles individual call failures.
  • Circuit breaker handles sustained failure patterns.

Wrap retry inside a circuit breaker:

  • Retries happen only while the circuit is closed.
  • When total failures exceed the threshold, the circuit opens and retries stop.
  • After a cool-down, a trial call (half-open state) closes the circuit again if the downstream has recovered.

Covered in [[circuit-breakers-in-dynamics-365-integrations]].
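To make the interaction concrete, here is a deliberately small sketch in which every attempt of a retry loop first asks a breaker for permission. `SimpleCircuitBreaker` and `GuardedRetry` are hypothetical names; the breaker counts consecutive failures, is not thread-safe, and the loop omits the backoff delay for brevity. A library such as Polly provides production-grade equivalents.

```csharp
using System;

public sealed class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private readonly Func<DateTimeOffset> _clock;
    private int _consecutiveFailures;
    private DateTimeOffset? _openedAt;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan openDuration, Func<DateTimeOffset> clock)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
        _clock = clock;
    }

    // A call may go out when the circuit is closed, or when the cool-down has elapsed (half-open trial).
    public bool AllowRequest() =>
        _openedAt == null || _clock() - _openedAt.Value >= _openDuration;

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        _openedAt = null;                 // trial succeeded: close the circuit
    }

    public void RecordFailure()
    {
        _consecutiveFailures++;
        if (_consecutiveFailures >= _failureThreshold)
            _openedAt = _clock();         // open, or re-open after a failed trial
    }
}

public static class GuardedRetry
{
    // Retries up to maxAttempts times, but stops as soon as the breaker refuses a call.
    public static bool TryExecute(SimpleCircuitBreaker breaker, int maxAttempts, Func<bool> attempt)
    {
        for (int i = 0; i < maxAttempts; i++)
        {
            if (!breaker.AllowRequest()) return false;   // circuit open: stop retrying
            if (attempt()) { breaker.RecordSuccess(); return true; }
            breaker.RecordFailure();
        }
        return false;
    }
}
```

With a threshold of three, a call that keeps failing is attempted three times even if five attempts are allowed, because the third failure opens the circuit.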

Per-service retry policies.

  • Dataverse — handle 429s, 5xx; respect Retry-After.
  • Service Bus — built into SDK; configurable.
  • Azure Storage — built-in retries; can override.
  • External APIs — per-API characteristics.

Each service may need its own tuned policy.

Retry budget. Limit total retry effort:

  • Per-request retry count (5 attempts).
  • Per-time-window total retries (1000 retries per minute across all requests).

Without budget, retry storms exhaust resources.
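A per-time-window budget can be sketched as a shared counter that each caller consults before retrying. The version below uses a fixed window and an injected clock, both assumptions made for simplicity; `RetryBudget` is a hypothetical class, and a sliding window or token bucket would smooth the boundary between windows.

```csharp
using System;

public sealed class RetryBudget
{
    private readonly int _maxRetriesPerWindow;
    private readonly TimeSpan _window;
    private readonly Func<DateTimeOffset> _clock;
    private readonly object _gate = new object();
    private DateTimeOffset _windowStart;
    private int _used;

    public RetryBudget(int maxRetriesPerWindow, TimeSpan window, Func<DateTimeOffset> clock)
    {
        _maxRetriesPerWindow = maxRetriesPerWindow;
        _window = window;
        _clock = clock;
        _windowStart = clock();
    }

    // Returns true if one more retry fits in the current window; false means give up now.
    public bool TryAcquire()
    {
        lock (_gate)
        {
            DateTimeOffset now = _clock();
            if (now - _windowStart >= _window)
            {
                _windowStart = now;       // start a new window with a fresh budget
                _used = 0;
            }
            if (_used >= _maxRetriesPerWindow) return false;
            _used++;
            return true;
        }
    }
}
```

A retry loop would call `TryAcquire()` before each retry (not before the first attempt) and treat `false` the same way as an exhausted retry count.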

Retry observability.

  • Retry count metric — how often retries happen.
  • Success-after-retry rate — what % succeed.
  • Failures after exhausted retries — what truly fails.

These metrics reveal integration health beyond just success/failure.
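As an illustration, the three metrics can be derived from one call per finished operation. The `RetryMetrics` class is a hypothetical in-memory counter; in practice the values would be emitted to whatever telemetry system the integration already uses. Failures that were never retried (permanent errors) are deliberately left out of this sketch.

```csharp
using System.Threading;

public sealed class RetryMetrics
{
    private long _retries;
    private long _succeededAfterRetry;
    private long _exhausted;

    public long Retries => Interlocked.Read(ref _retries);
    public long Exhausted => Interlocked.Read(ref _exhausted);

    // Call once per operation, when it finally succeeds or gives up.
    public void RecordOutcome(int attemptsUsed, bool succeeded)
    {
        if (attemptsUsed <= 1) return;                       // no retry happened
        Interlocked.Add(ref _retries, attemptsUsed - 1);
        if (succeeded) Interlocked.Increment(ref _succeededAfterRetry);
        else Interlocked.Increment(ref _exhausted);
    }

    // Share of retried operations that eventually succeeded (0 when nothing was retried).
    public double SuccessAfterRetryRate
    {
        get
        {
            long ok = Interlocked.Read(ref _succeededAfterRetry);
            long total = ok + Interlocked.Read(ref _exhausted);
            return total == 0 ? 0.0 : (double)ok / total;
        }
    }
}
```

A falling success-after-retry rate suggests that failures classified as transient are in fact not recovering, which is worth an alert before retries are exhausted en masse.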

Distributed retry.

  • Multiple consumers each retry.
  • Aggregate retry traffic can be significant.
  • Coordination via a shared rate limiter helps.

At high scale, limiting retries per client isn't enough.

Dead-letter on exhaustion. When all retries fail:

  • Message → dead-letter queue.
  • Alert raised.
  • Operator inspects.
  • Manual resubmission or remediation.

Closes the loop on retry-failed messages.
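The flow above can be sketched as a bounded retry loop that hands the message to a dead-letter collection once it gives up. Everything here is illustrative: `RetryThenDeadLetter` is a hypothetical helper, the in-memory collection stands in for a real dead-letter queue, and the caller supplies both the backoff function and the transient-error classifier.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class RetryThenDeadLetter
{
    // Returns true if the handler succeeded; otherwise the message is added to deadLetters.
    public static async Task<bool> ProcessAsync<T>(
        T message,
        Func<T, Task> handler,
        int maxAttempts,                      // assumed to be at least 1
        Func<int, TimeSpan> delayForAttempt,  // e.g. exponential backoff with jitter
        Func<Exception, bool> isTransient,
        ICollection<(T Message, Exception LastError)> deadLetters)
    {
        Exception lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                await handler(message);
                return true;
            }
            catch (Exception ex)
            {
                lastError = ex;
                if (!isTransient(ex)) break;  // permanent: retrying cannot help
            }
            if (attempt < maxAttempts)
                await Task.Delay(delayForAttempt(attempt));
        }
        deadLetters.Add((message, lastError)); // an alert would be raised here as well
        return false;
    }
}
```

Keeping the last exception next to the message gives the operator the context needed to decide between resubmission and remediation.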

Common pitfalls.

  • No retry. First transient failure becomes user-visible.
  • Naive retry. No backoff; tight retry loop hammers downstream.
  • Retrying permanent failures. The retries never succeed; they waste resources until the limit is reached, or forever if the count is unbounded.
  • Idempotency forgotten. Retries cause duplicates.
  • Throttling ignored. Retrying sooner than Retry-After allows keeps the client throttled for longer.
  • No timeout on individual attempts. Stuck retry; resource leak.
  • Retry on wrong layer. Application retries when SDK already retried; effective retry count multiplies.

Best practices.

  • Use library — Polly, Resilience4j, etc.
  • Classify errors — transient vs permanent.
  • Exponential + jitter.
  • Bounded count.
  • Timeout per attempt.
  • Observable metrics.
  • Dead-letter on exhaustion.
  • Idempotent receivers.

Strategic positioning. Retry policies are foundational reliability infrastructure. Mature integrations have a deliberate retry policy for each integration; immature ones have ad-hoc retries or none. The investment is modest: using libraries, classifying errors, building monitoring. The benefit is integrations that survive transient issues without operational intervention.

For architects:

  • Default to library-based retry.
  • Document policy per integration.
  • Monitor retry metrics.
  • Combine with circuit breaker, dead-letter, idempotency.

Production-grade integration architecture has all these patterns working together. Skipping any one weakens the whole. Invest in the foundation; benefit through every integration thereafter.
