Retry policies with Azure services for Dynamics 365 integrations
How to implement retry policies in Azure-based Dynamics 365 integrations — exponential backoff, idempotency, circuit-breaker integration, and the patterns that handle transient failures gracefully.
Network blips, rate limiting, downstream slowness — transient failures are inevitable in distributed systems. Retry policies absorb these failures by trying again with appropriate backoff. Implemented well, they make integrations resilient; implemented badly, they amplify failures into incidents.
Transient vs permanent failures.
- Transient — temporary; succeeds on retry. Network glitch, rate limit, brief downstream outage.
- Permanent — won't succeed however many retries. Schema mismatch, auth failure, business validation failure.
Retry only transients; permanent failures need different handling (alert, dead-letter, manual).
Failure classification. Decide per error type:
- HTTP 5xx — usually transient.
- HTTP 429 Too Many Requests — definitely transient; respect Retry-After.
- HTTP 4xx — usually permanent (validation, auth).
- HTTP 408 Request Timeout — transient.
- Connection errors — transient.
- Specific application errors — depends.
The classifier is critical; misclassifying causes infinite retry of permanent failures.
Backoff strategies.
- Fixed delay — wait N seconds, retry.
- Linear backoff — N, 2N, 3N seconds.
- Exponential backoff — N, 2N, 4N, 8N seconds.
- Exponential with jitter — exponential + random variance.
Exponential with jitter is modern default; prevents thundering herd.
Why jitter.
- Without jitter, all retries synchronise.
- Many clients failing simultaneously retry at the same moment.
- Spike overwhelms the recovering downstream.
Random jitter spreads retries; smoother recovery.
Max retries.
- Bounded — give up after N attempts.
- Unbounded — retry forever (rare; usually wrong).
Bounded count protects against permanent failures masquerading as transient.
Retry-After header. When server says how long to wait:
- HTTP response includes
Retry-After: 30(30 seconds) or absolute time. - Honour it; don't retry sooner.
- Common in 429 (rate limit) responses.
Ignoring Retry-After can extend the throttling.
Implementation libraries.
- Polly (.NET) — declarative retry policies.
- Resilience4j (Java) — similar.
- AxiosRetry (Node) — for axios HTTP client.
- Azure SDK — built-in retry on many operations.
Choose mature library; rolling your own retry is error-prone.
Polly example.
var retryPolicy = Policy
.Handle<HttpRequestException>()
.Or<TimeoutException>()
.WaitAndRetryAsync(
retryCount: 5,
sleepDurationProvider: attempt =>
TimeSpan.FromSeconds(Math.Pow(2, attempt)) +
TimeSpan.FromMilliseconds(jitterer.Next(0, 1000))
);
await retryPolicy.ExecuteAsync(async () =>
await client.PostAsync(url, content));
Five retries with exponential backoff and jitter; handles common transient cases.
Idempotency. Critical for retries:
- Retry might succeed after thinking it failed.
- Duplicate processing on receiving side.
- Idempotency means same input → same outcome, regardless of count.
Patterns:
- Idempotency-Key header — receiver dedupes by key.
- Natural keys — operation tied to unique business identifier.
- State-based — set state idempotently.
Without idempotency, retries cause data corruption.
Retry + circuit breaker. Combined patterns:
- Retry handles individual call failures.
- Circuit breaker handles sustained failure patterns.
Wrap retry inside circuit breaker:
- Retries happen within circuit-closed state.
- When too many failures total, circuit opens; retries stop.
- Periodic test to re-open if downstream recovered.
Covered in [[circuit-breakers-in-dynamics-365-integrations]].
Per-service retry policies.
- Dataverse — handle 429s, 5xx; respect Retry-After.
- Service Bus — built into SDK; configurable.
- Azure Storage — built-in retries; can override.
- External APIs — per-API characteristics.
Each service may need tuned policy.
Retry budget. Limit total retry effort:
- Per-request retry count (5 attempts).
- Per-time-window total retries (1000 retries per minute across all requests).
Without budget, retry storms exhaust resources.
Retry observability.
- Retry count metric — how often retries happen.
- Success-after-retry rate — what % succeed.
- Failures after exhausted retries — what truly fails.
These metrics reveal integration health beyond just success/failure.
Distributed retry.
- Multiple consumers each retry.
- Aggregate retry traffic significant.
- Coordination via shared rate limiter helpful.
For high-scale, single-client retry isn't enough.
Dead-letter on exhaustion. When all retries fail:
- Message → dead-letter queue.
- Alert raised.
- Operator inspects.
- Manual resubmission or remediation.
Closes the loop on retry-failed messages.
Common pitfalls.
- No retry. First transient failure becomes user-visible.
- Naive retry. No backoff; tight retry loop hammers downstream.
- Retry permanent failures. Forever; resource exhaustion.
- Idempotency forgotten. Retries cause duplicates.
- Throttling ignored. Retries faster than allowed; permanently throttled.
- No timeout on individual attempts. Stuck retry; resource leak.
- Retry on wrong layer. Application retries when SDK already retried; effective retry count multiplies.
Best practices.
- Use library — Polly, Resilience4j, etc.
- Classify errors — transient vs permanent.
- Exponential + jitter.
- Bounded count.
- Timeout per attempt.
- Observable metrics.
- Dead-letter on exhaustion.
- Idempotent receivers.
Strategic positioning. Retry policies are foundational reliability infrastructure. Mature integrations have considered retry policies per integration; immature ones have ad-hoc or no retries. The investment is modest — using libraries, classifying errors, building monitoring. The benefit is integrations that survive transient issues without operational intervention.
For architects:
- Default to library-based retry.
- Document policy per integration.
- Monitor retry metrics.
- Combine with circuit breaker, dead-letter, idempotency.
Production-grade integration architecture has all these patterns working together. Skipping any one weakens the whole. Invest in the foundation; benefit through every integration thereafter.
Related guides
- Circuit breakers in Dynamics 365 integrationsHow the circuit-breaker pattern protects Dynamics 365 integrations from cascading failures — implementation in Azure Functions, Logic Apps, and Dataverse plug-ins, with operational tuning.
- Integration testing patterns for Dynamics 365How to test integrations for Dynamics 365 — unit, contract, integration, end-to-end tests, and the patterns that catch regressions before production.
- Observability for Dynamics 365 integrationsHow to build observability into Dynamics 365 integrations — logs, metrics, traces, correlation IDs, and the patterns that make production integration health visible.
- OpenTelemetry with Dynamics 365 integrationsHow OpenTelemetry standardises observability in Dynamics 365 architectures — instrumentation, exporters, distributed tracing, and the path to vendor-neutral observability.
- Integrating Dynamics 365 with Azure servicesHow Dynamics 365 plugs into Azure — Service Bus, Logic Apps, Functions, API Management, Event Grid, and the iPaaS patterns that actually work.