Data pipelines rely on Application Programming Interfaces (APIs) to operate. Unfortunately, even for engineers, APIs can be mysterious, obtuse, and confusing. On the non-technical side, people often assume that because a company provides an "API," it represents an unlimited, instantaneous, and perfectly defined source of information that always works without incident.

It is easy to understand why: vendors promote APIs as a panacea for unlocking data silos. However, the small print that contains the nuance, caveats, exclusions, constraints, limits, and omissions for the APIs they offer is rarely promoted.

Note: While this document focuses primarily on data sources, destinations like Snowflake, Redshift, BigQuery, Azure, and others also have APIs. Compared to data source APIs, destination APIs tend to be stable and slow-changing.

All APIs Have Limits

While you may want "live", real-time data, the source system defines when data is available and how frequently it can be requested.

For example, Amazon states that you can request a given report only once a day, or that it only makes the data available on a daily basis. Another example is how frequently you can call an API. Amazon allows 0.0222 requests per second against the Reporting API. That works out to approximately 1.33 requests per minute, or about 80 per hour. As such, there are no "live", instantaneous, and continuous data feeds for this data source, as it is not designed for that use case.
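The arithmetic behind those figures can be sketched in a few lines of Python. The 0.0222 requests-per-second rate is the figure quoted above; the function name is illustrative only and not part of any Amazon library.

```python
def request_budget(rate_per_second: float) -> dict:
    """Convert a per-second rate limit into more intuitive budgets."""
    return {
        "per_minute": rate_per_second * 60,
        "per_hour": rate_per_second * 3600,
        # Average spacing needed between requests to stay within the limit.
        "seconds_between_requests": 1 / rate_per_second,
    }


budget = request_budget(0.0222)
print(f"{budget['per_minute']:.2f} requests per minute")
print(f"{budget['per_hour']:.0f} requests per hour")
print(f"one request every {budget['seconds_between_requests']:.0f} seconds")
```

Read the other way around, the limit means roughly one request every 45 seconds, which is why a continuous feed is not realistic for this source.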

Amazon Data Source Example

Let's use the Amazon retail Selling Partner API (SP-API) Order tracking report as an example. A typical workflow for the API would look like this:

First Request: A request is made to the API asking for a report to be generated according to the request parameters. Let's assume there are no errors, and Amazon acknowledges the request.
Second Request: Reports are not generated instantaneously by Amazon. After the first request, a second request is made 60 seconds later, asking Amazon if the previously requested report is ready. Again, assuming there are no errors and Amazon does not respond that the "report is not ready," the data is returned.

The example above is the "sunny day" scenario (i.e., no errors or delays): 2 API requests are needed to generate the Order tracking report. If there is a delay in rendering, Amazon might take 10 minutes to create the report, requiring a third, fourth, or fifth request to poll whether the report is finally ready. In this case, one report request can consume the available API capacity for a processing window.
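The request-then-poll workflow above can be sketched as follows. This is a minimal illustration, not the SP-API client: `FakeReportClient`, `create_report`, `get_report`, and `fetch_report` are hypothetical names, and the fake client simply pretends the report becomes ready after a set number of polls. The 60-second poll interval is the one from the example.

```python
import time
from typing import Callable, Optional


class FakeReportClient:
    """Hypothetical stand-in for a real API client, for illustration only."""

    def __init__(self, ready_after_polls: int = 1):
        self.ready_after_polls = ready_after_polls
        self.polls = 0
        self.requests_made = 0  # every call counts against the rate limit

    def create_report(self, report_type: str) -> str:
        self.requests_made += 1
        return f"{report_type}-001"  # made-up report id

    def get_report(self, report_id: str) -> Optional[str]:
        self.requests_made += 1
        self.polls += 1
        if self.polls >= self.ready_after_polls:
            return "report data"
        return None  # "report is not ready"


def fetch_report(
    client: FakeReportClient,
    report_type: str,
    poll_interval: float = 60,
    max_polls: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Request a report, then poll until it is ready or max_polls is reached."""
    report_id = client.create_report(report_type)  # first request
    for _ in range(max_polls):
        sleep(poll_interval)  # wait before asking whether the report is ready
        data = client.get_report(report_id)  # second, third, ... request
        if data is not None:
            return data
    raise TimeoutError(f"{report_id} not ready after {max_polls} polls")


def no_wait(seconds: float) -> None:
    """Skip real waiting so the demonstration runs instantly."""


sunny = FakeReportClient(ready_after_polls=1)
fetch_report(sunny, "order-tracking", sleep=no_wait)
print(sunny.requests_made)  # 2

slow = FakeReportClient(ready_after_polls=4)
fetch_report(slow, "order-tracking", sleep=no_wait)
print(slow.requests_made)  # 5
```

The `max_polls` cap is a design choice in this sketch: it keeps a slow report from silently consuming the whole request budget.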

What would happen if you attempted parallel API requests for the All Listings Report, Referral Fee Preview Report, Returns Report, and Amazon Fulfilled Shipments Report? Another four reports would require anywhere from 8 to 20+ additional API requests. This means running all five reports in parallel would require 10 to 25+ API requests. Requesting 5 reports could consume 30-50% of available API capacity.

The practical, real-world performance expectation, in this case, would be that all five reports may be completed in about 30 minutes, assuming there are no processing delays with the Amazon response time.
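To put those request counts in context, the sketch below compares them against an hourly budget. It assumes the budget is the roughly 80 requests per hour implied by the 0.0222 requests-per-second limit, and that requests are paced one after another at that rate; the function names are illustrative.

```python
RATE_PER_SECOND = 0.0222  # Reporting API limit quoted earlier
HOURLY_BUDGET = 80        # assumed budget: about 0.0222 * 3600 requests


def capacity_share(requests_used: int, budget: int = HOURLY_BUDGET) -> float:
    """Fraction of the hourly request budget consumed."""
    return requests_used / budget


def minimum_minutes(requests_used: int, rate: float = RATE_PER_SECOND) -> float:
    """Shortest time, in minutes, to issue the requests at the allowed rate."""
    return requests_used / rate / 60


for n in (10, 25, 40):
    print(
        f"{n} requests: {capacity_share(n):.1%} of the hourly budget, "
        f"at least {minimum_minutes(n):.1f} minutes"
    )
```

Under these assumptions, 25 requests is about 31% of the hourly budget and 40 requests is 50%, and 25 requests cannot be issued in much less than 19 minutes before any report generation time is counted.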

In addition to API limits (how many requests can occur per X period), further limits are based on the resource being requested. For example, Amazon offers a "Remote Fulfillment Eligibility" report and states that this report "can take up to 24 hours to generate". If this report were polled in the same manner as Order tracking, there would be well over a thousand pointless API calls every day, at a frequency Amazon expressly advises against for this report type.

The key takeaway is that each API defines well-bounded limits, cadences, frequencies, and timings for data availability and delivery. While you may want something "live", in "real-time", or updated "every 5 minutes", many APIs do not support that mode of operation.

APIs Decide What Data Is Available

As we stated in the "Remote Fulfillment Eligibility" report example, just because you can call an API does not mean that an API will return "new" or "updated" data.

For example, keeping with our SP-API use case, Amazon's Fee Preview report states that data is updated at least once every 72 hours. If you request this report every 5 minutes, Amazon will either block the requests or return the same data repeatedly (or nothing at all).

In another SP-API example, Amazon states the Referral Fee Discounts Report "may be up to 24 hours old," and you should not "request this report more than once within 24 hours."

So while you can technically make an API call, the underlying data availability makes it pointless to do so. You are burning up your Reporting API limits making API calls that will never return "new" or "updated" data, or worse, have your requests flagged as violating terms of service.

The critical takeaway is that API calls must be made when the data is known to be "settled." Settled data represents a collection of trusted facts about the information being supplied by a source. Only the source defines what is settled, as it is the system owner and is responsible for provisioning the data.

In essence, data is settled when the source says it is ready to be requested and used. For example, Amazon settles "Referral Fee Discounts" every 24 hours. They are saying, "Hey, you can rely on this information, but it takes us a while to reconcile it internally every 24 hours. Only request it once a day."

While you may want Referral Fee Discounts in "real-time" or updated every hour, the availability of the data does not align with that use case. API calls must be closely aligned with data availability.
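One way to align calls with data availability is to record a minimum request interval per report and check it before calling. The sketch below is a minimal illustration: the dictionary keys and `is_due` are hypothetical names, and treating the Fee Preview "at least once every 72 hours" update cadence as a 72-hour request interval is an assumption made here, not Amazon guidance.

```python
from datetime import datetime, timedelta
from typing import Optional

# Minimum spacing between requests, based on the cadences quoted above.
# Keys are illustrative labels, not official report type identifiers.
MIN_REQUEST_INTERVAL = {
    "referral_fee_discounts": timedelta(hours=24),
    "fee_preview": timedelta(hours=72),  # assumption: request once per update window
}


def is_due(report: str, last_requested: Optional[datetime], now: datetime) -> bool:
    """Return True only if enough time has passed for new data to be settled."""
    if last_requested is None:
        return True  # never requested before
    return now - last_requested >= MIN_REQUEST_INTERVAL[report]


last = datetime(2024, 1, 1, 6, 0)  # made-up timestamp of the previous request

print(is_due("referral_fee_discounts", last, datetime(2024, 1, 1, 7, 0)))  # False
print(is_due("referral_fee_discounts", last, datetime(2024, 1, 2, 6, 0)))  # True
print(is_due("fee_preview", last, datetime(2024, 1, 2, 6, 0)))             # False
```

A scheduler that skips calls when `is_due` returns False spends no request budget on data that cannot have changed.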

API Abuse (Intentional or Unintentional) Has Consequences

API providers set expectations on use and code of conduct, covering the previously described limits, throttles, and request cadence. Openbridge always operates within the published and observed boundaries for an API. However, if you have a tool, system, or vendor that is over-aggressive, misuses the API, or attempts to bypass the guardrails set by the provider, the consequence can be degraded service or termination of API access for all parties, including Openbridge.

For example, Google will block the ability to make API requests if it determines there is over-aggressive usage:

Quota Error: The number of recent reporting API requests failing by server error is too high. You are temporarily blocked from the reporting API for at least an hour. Please send fewer server errors in the future to avoid being blocked.

Google will prohibit any future calls for at least an hour. Google is not unique, as many API providers will employ similar restraints on poorly behaved applications.

If you have other applications using the same API, review their usage carefully, as it can prevent applications like Openbridge from working without incident (see https://docs.openbridge.com/en/articles/4895388-amazon-seller-central-data-feed-limits-errors-and-constraints for examples of this).

Be Smart, Follow The API Guidance

Most API providers offer detailed guidance on how to make reasonable, balanced use of their APIs. For example, Facebook states that "when the API limit has been reached, stop making API calls. Continuing to make calls will continue to increase your call count, which will increase the time before calls will be successful again."

Facebook, like Amazon and Google, suggests that requests be spread out evenly to avoid traffic spikes. Facebook states that API calls exceeding your limits are still counted against those limits, which creates a nasty loop.
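Both pieces of guidance, spreading requests evenly and pausing once a limit is hit, can be sketched in a few lines. This is a generic illustration rather than any provider's client: `RateLimited`, `spread_evenly`, and `call_with_backoff` are hypothetical names, and the 60-second starting delay with doubling is an assumed policy, not a published value.

```python
import time
from typing import Callable, List


class RateLimited(Exception):
    """Hypothetical error raised when the provider reports a limit was reached."""


def spread_evenly(request_count: int, window_seconds: float) -> List[float]:
    """Start offsets (in seconds) that space requests evenly across a window."""
    spacing = window_seconds / request_count
    return [i * spacing for i in range(request_count)]


def call_with_backoff(
    call: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Stop calling when limited, wait, and retry with a growing delay."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up rather than keep adding to the call count
            sleep(delay)  # make no calls at all while waiting
            delay *= 2    # wait longer after each consecutive limit error


# Demonstration with a fake call that is limited twice, then succeeds.
attempts = {"count": 0}
waits: List[float] = []


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimited()
    return "ok"


result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)            # ok [60.0, 120.0]
print(spread_evenly(4, 3600))   # [0.0, 900.0, 1800.0, 2700.0]
```

Passing `sleep` as a parameter keeps the example testable; in real use the default `time.sleep` would apply the waits.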

This reinforces our previous Amazon examples, where requests are aligned with neither API boundaries nor data availability. Not understanding the behavior of the API in question will ensure consistent failure.

Openbridge carefully models requests and operational strategies to ensure we operate under a "good neighbor" API usage policy. A good neighbor policy means that we will only consume the API capacity needed to get the task accomplished, ensuring capacity for other applications that may also need to consume the same API.

API Data vs. Reporting User Interfaces

It is not uncommon for people to compare the outputs of an API to what they see in the provider's user interface. Unfortunately, while the comparison seems plausible, it is not a productive testing methodology, for the following reasons:

Data flows differently internally. The user interfaces of Facebook, Google, Amazon, and others have direct access to data in ways that external API users do not. This includes how data may flow and update within those internal systems. As a result, data provided via API may be limited, delayed, or not even available.
User interfaces "package" data. All interfaces make choices on how to package and display information. For example, let's say the Amazon Ads interface shows overall performance of 1M impressions delivered over the past 30 days. The data from the API shows 1.5M impressions. Why? The reporting interface is only displaying data for "active" campaigns. However, there are 500K impressions for "inactive" campaigns not shown in the interface. This can give you the false impression that the 1.5M number from the API is "wrong."
Data can change over time. It is relatively common for specific systems, especially advertising platforms, to update prior dates with new or updated data. This can lead to a slight drift between UI and API outputs. For example, Amazon Advertising reported five sales on 10/1 for the "Ever Fun" campaign. However, on 10/21, Amazon updated the sales attributed to "Ever Fun" for 10/1 to seven sales. The window in which this back updating occurs is called a "lookback" period. To pick up the change, an API call would need to re-request the 10/1 data. However, some APIs limit the ability to perform this type of lookback process.
You have to do some math. User interfaces often include calculated metrics that are not available in the API. Why? As a general rule, it is better to have the core data than the calculations. Any calculated output should be based on the core data, using your preferred formulas. For example, Amazon shows ROAS (return on ad spend) in the interface. However, the API supplies a column called "attributedSales14d" and another called "cost." Combined, these two data points give you what you need to calculate ROAS (ROAS = attributedSales14d divided by cost).
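Two of the points above, lookback periods and calculated metrics, can be sketched briefly. The sketch assumes the standard definition of ROAS (attributed sales divided by ad spend) and uses a hypothetical `cost` field for spend; the function names, the sample numbers, the year, and the 20-day lookback window are all made up for illustration.

```python
from datetime import date, timedelta
from typing import List, Optional


def roas(attributed_sales: float, cost: float) -> Optional[float]:
    """Return on ad spend: attributed sales divided by ad spend."""
    if cost == 0:
        return None  # undefined when nothing was spent
    return attributed_sales / cost


def lookback_dates(today: date, lookback_days: int) -> List[date]:
    """Prior dates to re-request so late updates are picked up, oldest first."""
    return [today - timedelta(days=d) for d in range(lookback_days, 0, -1)]


# Made-up API row; only the calculation is the point here.
row = {"attributedSales14d": 450.0, "cost": 150.0}
print(roas(row["attributedSales14d"], row["cost"]))  # 3.0

# On 10/21, a 20-day lookback reaches back to 10/1 (the year is arbitrary).
dates = lookback_dates(date(2023, 10, 21), 20)
print(dates[0], dates[-1])  # 2023-10-01 2023-10-20
```

Keeping the calculation in your own code means the formula stays explicit and can be changed without waiting on the provider's interface.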

Understanding Data Source APIs