Retry policies with Azure services for Dynamics 365 integrations

How to implement retry policies in Azure-based Dynamics 365 integrations — exponential backoff, idempotency, circuit-breaker integration, and the patterns that handle transient failures gracefully.

Updated 2026-11-12

Network blips, rate limiting, downstream slowness — transient failures are inevitable in distributed systems. Retry policies absorb these failures by trying again with appropriate backoff. Implemented well, they make integrations resilient; implemented badly, they amplify failures into incidents.

Transient vs permanent failures.

Transient — temporary; succeeds on retry. Network glitch, rate limit, brief downstream outage.
Permanent — won't succeed however many retries. Schema mismatch, auth failure, business validation failure.

Retry only transients; permanent failures need different handling (alert, dead-letter, manual).

Failure classification. Decide per error type:

HTTP 5xx — usually transient.
HTTP 429 Too Many Requests — definitely transient; respect Retry-After.
HTTP 4xx — usually permanent (validation, auth).
HTTP 408 Request Timeout — transient.
Connection errors — transient.
Specific application errors — depends.

The classifier is critical; misclassifying causes infinite retry of permanent failures.

Backoff strategies.

Fixed delay — wait N seconds, retry.
Linear backoff — N, 2N, 3N seconds.
Exponential backoff — N, 2N, 4N, 8N seconds.
Exponential with jitter — exponential + random variance.

Exponential with jitter is modern default; prevents thundering herd.

Why jitter.

Without jitter, all retries synchronise.
Many clients failing simultaneously retry at the same moment.
Spike overwhelms the recovering downstream.

Random jitter spreads retries; smoother recovery.

Max retries.

Bounded — give up after N attempts.
Unbounded — retry forever (rare; usually wrong).

Bounded count protects against permanent failures masquerading as transient.

Retry-After header. When server says how long to wait:

HTTP response includes Retry-After: 30 (30 seconds) or absolute time.
Honour it; don't retry sooner.
Common in 429 (rate limit) responses.

Ignoring Retry-After can extend the throttling.

Implementation libraries.

Polly (.NET) — declarative retry policies.
Resilience4j (Java) — similar.
AxiosRetry (Node) — for axios HTTP client.
Azure SDK — built-in retry on many operations.

Choose mature library; rolling your own retry is error-prone.

Polly example.

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutException>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: attempt => 
            TimeSpan.FromSeconds(Math.Pow(2, attempt)) +
            TimeSpan.FromMilliseconds(jitterer.Next(0, 1000))
    );

await retryPolicy.ExecuteAsync(async () => 
    await client.PostAsync(url, content));

Five retries with exponential backoff and jitter; handles common transient cases.

Idempotency. Critical for retries:

Retry might succeed after thinking it failed.
Duplicate processing on receiving side.
Idempotency means same input → same outcome, regardless of count.

Patterns:

Idempotency-Key header — receiver dedupes by key.
Natural keys — operation tied to unique business identifier.
State-based — set state idempotently.

Without idempotency, retries cause data corruption.

Retry + circuit breaker. Combined patterns:

Retry handles individual call failures.
Circuit breaker handles sustained failure patterns.

Wrap retry inside circuit breaker:

Retries happen within circuit-closed state.
When too many failures total, circuit opens; retries stop.
Periodic test to re-open if downstream recovered.

Covered in [[circuit-breakers-in-dynamics-365-integrations]].

Per-service retry policies.

Dataverse — handle 429s, 5xx; respect Retry-After.
Service Bus — built into SDK; configurable.
Azure Storage — built-in retries; can override.
External APIs — per-API characteristics.

Each service may need tuned policy.

Retry budget. Limit total retry effort:

Per-request retry count (5 attempts).
Per-time-window total retries (1000 retries per minute across all requests).

Without budget, retry storms exhaust resources.

Retry observability.

Retry count metric — how often retries happen.
Success-after-retry rate — what % succeed.
Failures after exhausted retries — what truly fails.

These metrics reveal integration health beyond just success/failure.

Distributed retry.

Multiple consumers each retry.
Aggregate retry traffic significant.
Coordination via shared rate limiter helpful.

For high-scale, single-client retry isn't enough.

Dead-letter on exhaustion. When all retries fail:

Message → dead-letter queue.
Alert raised.
Operator inspects.
Manual resubmission or remediation.

Closes the loop on retry-failed messages.

Common pitfalls.

No retry. First transient failure becomes user-visible.
Naive retry. No backoff; tight retry loop hammers downstream.
Retry permanent failures. Forever; resource exhaustion.
Idempotency forgotten. Retries cause duplicates.
Throttling ignored. Retries faster than allowed; permanently throttled.
No timeout on individual attempts. Stuck retry; resource leak.
Retry on wrong layer. Application retries when SDK already retried; effective retry count multiplies.

Best practices.

Use library — Polly, Resilience4j, etc.
Classify errors — transient vs permanent.
Exponential + jitter.
Bounded count.
Timeout per attempt.
Observable metrics.
Dead-letter on exhaustion.
Idempotent receivers.

Strategic positioning. Retry policies are foundational reliability infrastructure. Mature integrations have considered retry policies per integration; immature ones have ad-hoc or no retries. The investment is modest — using libraries, classifying errors, building monitoring. The benefit is integrations that survive transient issues without operational intervention.

For architects:

Default to library-based retry.
Document policy per integration.
Monitor retry metrics.
Combine with circuit breaker, dead-letter, idempotency.

Production-grade integration architecture has all these patterns working together. Skipping any one weakens the whole. Invest in the foundation; benefit through every integration thereafter.

Related guides

← All guides Glossary →