API Backoff & Retry

Dealing With API limits, disruption, and outages

Written by Openbridge Support
Updated over 4 months ago

When data is sourced from external or third-party APIs, requests are inherently vulnerable to API delays and issues. If these APIs experience prolonged disruptions, data processing can be delayed despite our retry efforts. For example, a daily sales report you expected by 8 AM may arrive hours, or even days, late because of upstream API issues.

Openbridge employs various request and retry strategies to minimize disruptions to data operations due to upstream API issues.

Dealing with API limits, disruption, and outages

The ability to request data depends on the rules and limits set by the upstream API. It is tempting to assume APIs work without errors or limits, but that does not reflect reality: APIs restrict how frequently, and at what scope, data can be requested. APIs can also become temporarily unavailable, have bugs, mangle outputs, or return undocumented outputs.

Here are common types of issues that cause requests to fail:

  1. Transient Failures: Network hiccups, temporary server overloads, or short-lived service interruptions can cause an API request to fail.

  2. Rate Limiting: Many APIs impose rate limits to prevent any single consumer from overloading the system. If you make requests too quickly, you might hit these limits and receive an error response.

  3. Resource Contention: In scenarios where multiple systems or users try to access a shared resource, contention can cause temporary failures.

While these are some common causes, they are not the only ones.

Implementing A Retry Strategy

If a request experiences an issue, Openbridge uses a retry/backoff/queue system to handle the scenarios described above. For example, if a report creation request to the Reporting API fails due to capacity limits, the request is added to a queue and retried later.

This system minimizes "hard" failures for requests: a report still gets generated on a retry once capacity permits or, in the case of a service disruption, once systems become available again.

Backoff & Retry Strategies Used By Openbridge

The following are strategies employed by Openbridge (a simplified sketch follows the list):

  1. Exponential Backoff: This is a specific kind of backoff strategy where the time between retries increases exponentially. This approach ensures that if the API is experiencing issues, it isn't overwhelmed with a flood of retries quickly. Instead, retries are spread out, giving the system a better chance to recover.

  2. Intelligent Retries: Not all errors are worth retrying. For example, a 404 Not Found error might not be resolvable with a retry, whereas a 503 Service Unavailable error might be. Therefore, the retry strategy attempts to adjust based on the responses from the upstream API. For more insight into various types of errors and how to interpret the logs we share with you, please see How To Use Healthchecks.

  3. Max Retries: It's essential to limit the number of retries to prevent endless loops in cases where the error is permanent. After reaching the maximum number of retries, customer intervention may be required, given the persistent nature of the failure. (See "Limits To Retry Queues" below)
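
To make these strategies concrete, here is a minimal Python sketch of how exponential backoff, intelligent retries, and a max-retry limit can work together. It is illustrative only, not Openbridge's actual implementation, and the set of retryable status codes is an assumption.

    import time
    import requests

    # Status codes generally worth retrying; errors like 404 Not Found are treated as permanent.
    RETRYABLE_STATUS = {429, 500, 502, 503, 504}

    def fetch_with_retries(url, max_retries=5, base_delay=1.0):
        """GET a URL with exponential backoff, retrying only retryable errors."""
        for attempt in range(max_retries + 1):
            response = requests.get(url, timeout=30)
            if response.status_code < 400:
                return response                          # success
            if response.status_code not in RETRYABLE_STATUS or attempt == max_retries:
                response.raise_for_status()              # permanent error or retries exhausted
            time.sleep(base_delay * (2 ** attempt))      # wait 1s, 2s, 4s, 8s, ...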

Request Jitter

In addition to retries, Openbridge also applies scheduling jitter to request timing, as sketched after the list below.

  1. Scheduling Jitter: Adding randomness (jitter) to the request intervals can prevent many requests from retrying simultaneously, which could create a "thundering herd" problem. For instance, if an API has a limit of 60 requests an hour and many requests start their retries at the same intervals, they might all hit the API simultaneously. Jitter helps spread out these retries to reduce peak load. The jitter concept applies to all requests, not just retry requests.
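
The snippet below sketches one common way to add jitter, the "full jitter" approach of picking a random delay up to the exponential backoff ceiling. It is an illustration of the concept, not necessarily the exact algorithm Openbridge uses.

    import random

    def jittered_delay(base_delay, attempt, max_delay=300):
        """'Full jitter': pick a random delay between 0 and the exponential backoff ceiling."""
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        return random.uniform(0, ceiling)

    # Ten clients that would otherwise retry in lockstep now spread out over the window.
    for client in range(10):
        print(f"client {client} waits {jittered_delay(1.0, 3):.1f}s")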

Limits To Retry Queues

Requests get added to the retry queue when the upstream API exceeds its capacity due to high demand or is limited because of service degradation or outages. As these requests are retried each hour, a percentage of the queue is successfully processed.

However, because an API can have a fixed limit of requests per hour, extended capacity issues cause the number of retry requests to grow faster than the API will allow retry (or any) requests to be processed. This can leave hundreds of outstanding requests in a retry queue waiting for capacity.

API Request Backlogs: Thundering Herd Problem

The "thundering herd" problem refers to a volume of requests that exceed and overwhelm an API. For instance, an Accounting API has a static rate limit of 10 requests per minute per account. There are 5 applications, each submitting 10 requests in the same period. The API will be instantly swamped, causing many requests to fail.

If all these applications retry their requests at the same interval, the next minute there will be 90 requests waiting (50 new requests plus the 40 that failed in the prior minute). The minute after that, there will be 130 requests (50 new plus the 80 that failed before).

This surge repeats, and the unbounded growth of the API request backlog leads to a cycle of consistent overloads.
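
A short Python loop reproduces this arithmetic using the figures above (50 new requests per minute against a limit of 10); the numbers are taken from the example and are purely illustrative:

    # Assumed figures from the example above: 50 new requests per minute, API processes 10.
    new_per_minute, processed_per_minute, backlog = 50, 10, 0

    for minute in range(1, 4):
        waiting = backlog + new_per_minute
        backlog = waiting - processed_per_minute
        print(f"minute {minute}: {waiting} waiting, {backlog} carry over")
    # minute 1: 50 waiting, 40 carry over
    # minute 2: 90 waiting, 80 carry over
    # minute 3: 130 waiting, 120 carry over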

To address the issue of unbounded queue growth, items in queues must expire. Typically, if a request has been retried several times without success, it is flagged and removed from the queue. Without this type of MAX retry limit, a system will keep trying to process a problematic request indefinitely.

This can lead to an infinite loop in which a retry "spams" the upstream API repeatedly with failing requests. Spamming an API this way can lead the upstream provider to terminate API access, suspend accounts, or further restrict API capacity.


Customer Intervention: Data Request Tasks

While retry mechanisms are essential for robustness and reliability, there are times when a request expires after multiple failed retry attempts. In these cases, users can create data tasks to execute a retry attempt manually.

Monitoring

To identify target dates for a data request task, you can check your destination for dates with gaps in the data.
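
For illustration, the hypothetical Python sketch below compares the dates present in a destination against an expected date range; the dates and variable names are placeholders, not an Openbridge API.

    from datetime import date, timedelta

    # Hypothetical values: dates already present in your destination table.
    loaded_dates = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)}

    start, end = date(2024, 1, 1), date(2024, 1, 5)
    expected = {start + timedelta(days=d) for d in range((end - start).days + 1)}

    missing = sorted(expected - loaded_dates)
    print(missing)  # gap dates -> candidates for a manual data request task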

References

For more context on how to identify when your account may be throttled:
