Once you have activated a data pipeline, it can take 24-48 hours for the first sync to occur. In some cases, the sync may start within a couple of hours or take as much as a couple of weeks (e.g., Amazon Pay Settlement Reports) due to how the source system generates data.
Types of Data Pipeline Schedules
There are four primary types of schedules:
Daily: Daily schedules request data based on an offset. The offset ensures our requests align with data availability. For example, for Amazon, we use a -1 offset: on 1/6/2021, we ask for data for 1/5/2021. Daily schedules are commonly used for reporting and insights APIs.
Hourly: Hourly schedules request data each hour. For example, 1 AM to 2 AM, 2 AM to 3 AM, and so forth. Hourly schedules are typical for transactional systems such as orders or shipping data APIs.
Lookback: Lookback schedules recollect data for prior dates. These schedules often run in combination with go-forward daily processes. Some data sources will update previous dates with new data. For example, on 10/1/2021, Amazon updated impression counts for 9/1/2021. Lookbacks are common in ad platforms, which support changes to performance attribution metrics (sales, impressions, clicks).
Historical: The purpose of historical schedules is to recreate, by date, a snapshot of data in a source. Historical schedules run as secondary processes to minimize impacts on "go forward" daily, hourly, or lookback schedules. As such, they are optimized to run as long-running background schedules, carefully requesting prior dates as API capacity permits. For example, running historical schedules for Amazon retail requires close to 10,000 API requests for 12 months of data. At an API limit of 60 requests an hour, that is roughly 167 hours (about a week) of requests even if every request slot were used for history. Because historical requests only use the capacity left over by go-forward schedules, this can take close to 4 weeks to complete. As a result, background processes can take many days or weeks to finish.
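The four schedule types above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Openbridge implements scheduling: the function names, the 30-day lookback window, and the `share` parameter (the fraction of API capacity left for historical requests) are all hypothetical.

```python
from datetime import date, datetime, timedelta


def daily_request_date(run_date: date, offset_days: int = -1) -> date:
    """Daily: the date requested on run_date, given an offset in days."""
    return run_date + timedelta(days=offset_days)


def hourly_windows(first_start: datetime, count: int) -> list[tuple[datetime, datetime]]:
    """Hourly: consecutive one-hour (start, end) windows, e.g. 1 AM to 2 AM, 2 AM to 3 AM."""
    return [
        (first_start + timedelta(hours=h), first_start + timedelta(hours=h + 1))
        for h in range(count)
    ]


def lookback_dates(run_date: date, window_days: int) -> list[date]:
    """Lookback: prior dates to re-collect, oldest first (window size is an assumption)."""
    return [run_date - timedelta(days=d) for d in range(window_days, 0, -1)]


def backfill_days(total_requests: int, limit_per_hour: int, share: float = 1.0) -> float:
    """Historical: days needed when only `share` of the hourly limit is available."""
    return total_requests / (limit_per_hour * share) / 24


print(daily_request_date(date(2021, 1, 6)))             # 2021-01-05
print(lookback_dates(date(2021, 10, 1), 30)[0])         # 2021-09-01
print(round(backfill_days(10_000, 60), 1))              # 6.9
print(round(backfill_days(10_000, 60, share=0.25), 1))  # 27.8
```

With the full 60 requests an hour, 10,000 requests take about a week. If only a quarter of that capacity were free for historical requests (an illustrative figure), the same backfill would take close to four weeks.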
How Schedules Work: Data Source and Destination
Several factors impact a data pipeline schedule, influencing the time sync jobs begin and how often they occur. Here are five areas that shape the timing of data pipeline syncs from source to destination:
Limits & Throttles
All source systems have API limits, throttles, and data availability constraints. Data destinations have loading limits and throttles, which may have cost implications. For example, Snowflake charges for compute time, so if data loads every hour, Snowflake will charge you for that time.
Openbridge employs a conservative API request strategy, ensuring we operate within published rate limits and throttling constraints. We carefully model each endpoint and how best to request and re-request data to minimize our consumption of API rate limits.
As a result, Openbridge data pipeline scheduling aligns closely with each system's policies and rules. For example, Amazon Seller Central allows only one daily sync for Referral Fees reports. However, in other cases, Amazon Seller Central will allow hourly syncs for Order transactions.
All sources and destinations also limit concurrency, meaning how many requests can occur at once. For example, Amazon will reject five report requests made during the same hour when its API restrictions allow only one request at a time each hour.
Ultimately, limits are set by the source or destination, not Openbridge. As a result, each data source uniquely sets and controls the frequency a data pipeline process can occur.
Openbridge will dynamically set a schedule when you create a data pipeline, based on the data source and the number of data pipelines you already have for that source. For example, suppose you already have five data pipelines for an Amazon Advertising profile and you add a sixth. We will dynamically set its schedule because all six pipelines need to operate within well-defined data source API limits. If we did not dynamically set timing, all six pipelines could run in a manner that violates the limits imposed by Amazon, resulting in failed syncs.
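One simple way to picture dynamic scheduling is to spread pipeline start times evenly across a window so they do not all hit the source at once. The sketch below is a hypothetical illustration only; `stagger_offsets` and the 60-minute window are assumptions, not the method Openbridge uses.

```python
def stagger_offsets(pipeline_count: int, window_minutes: int = 60) -> list[int]:
    """Spread pipeline start times evenly across a window (minutes past the hour)."""
    return [i * window_minutes // pipeline_count for i in range(pipeline_count)]


# Six pipelines for one profile, each starting ten minutes apart.
print(stagger_offsets(6))  # [0, 10, 20, 30, 40, 50]
```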
When Openbridge initializes a data pipeline job, we have no precise control over when the process completes or when data is loaded. The completion of the job, including the time to load extracted data into your destination, is a function of two external factors:
(1) the data source response to our requests for data
(2) the availability of the target data destination
In the case of (1), the source system may delay processing requests, experience a system outage, have authorization issues, or report that an account is temporarily blocked. In these cases, Openbridge will "re-queue" requests to be processed later.
In the case of (2), different destinations have different loading capacities. For example, Redshift may limit the number of concurrent system processes. If you use five Redshift processes for data analytics, our requests to load data are placed in Redshift's queue. As a result, the actual loading time is delayed because our ability to write data depends on the destination's capacity and availability.
As with (1), Openbridge will queue requests to a destination to absorb the backpressure created when the destination limits load capacity.
In both (1) and (2), this can cause delays of minutes, hours, or, in the worst case, days.
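The re-queue behavior described above can be sketched as a simple retry queue. This is a minimal illustration, not Openbridge's implementation: `drain`, `try_send`, and `max_attempts` are hypothetical names, and a real system would wait between attempts rather than retry immediately.

```python
from collections import deque
from typing import Callable


def drain(items: list[str], try_send: Callable[[str], bool], max_attempts: int = 5):
    """Process items in order; re-queue any that fail until max_attempts is reached."""
    queue = deque((item, 1) for item in items)
    done, failed = [], []
    while queue:
        item, attempt = queue.popleft()
        if try_send(item):
            done.append(item)
        elif attempt < max_attempts:
            # Source or destination was not ready: put the request back for later.
            queue.append((item, attempt + 1))
        else:
            failed.append(item)
    return done, failed
```

A request that fails once is simply completed later than its neighbors, which is why a busy source or destination shows up as a delay and not as missing data.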
Different sources make data available at varying intervals. Data availability typically aligns with when a source system has "settled" the data. Settled data means the source system has packaged the data and made it available for delivery via the API. For example, when running a data pipeline for an Amazon Retail report on 10/20, we ask for the "settled" data from 10/19. This means we are requesting a fully settled day of data. If we instead request data for 10/20 on 10/20 itself, Amazon will reject the request or possibly deliver incorrect or partial results. Why? Amazon has not yet settled the data for 10/20, so asking for an incomplete or partial day often returns erroneous data. As a result, our data pipeline schedules reflect requests we know the source system can deliver.
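The settled-data rule can be expressed as a small guard that only allows requests for days the source has had time to finalize. This is a hedged sketch: `is_settled` and the one-day settle offset are assumptions for illustration, and the year in the example is arbitrary.

```python
from datetime import date, timedelta


def is_settled(requested_day: date, today: date, settle_offset_days: int = 1) -> bool:
    """True if requested_day is at least settle_offset_days before today."""
    return requested_day <= today - timedelta(days=settle_offset_days)


today = date(2021, 10, 20)  # hypothetical run date
print(is_settled(date(2021, 10, 19), today))  # True: a fully settled day
print(is_settled(date(2021, 10, 20), today))  # False: the day is still in progress
```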
Errors during a data pipeline can occur for many reasons. For example:
Authorization and credential issues
API limits and throttles
Connection failures and errors
Source or destination outages
Openbridge will check for the required user permissions and account configuration during activation and afterward. If permissions are insufficient, changed after pipeline setup, or the source isn't configured correctly, this can cause a pipeline to fail.
In addition to previously described API limits, there may be undocumented limits or throttles. For example, while the Amazon API states you can make 60 requests an hour, some reports can only be requested once daily.
It is not uncommon for customers to have another app or tool exhausting API capacity. In these cases, our ability to collect data on your behalf will be impacted. For example, Amazon may report NO_DATA or CANCELLED in response to our API request (see https://docs.openbridge.com/en/articles/4895388-amazon-seller-central-data-feed-limits-errors-and-constraints). You should review any other apps making connections to the data source API.
When we call for daily data, we request data for the previous day. For example, on 1/6/2021, we ask for data for 1/5/2021, which aligns with data availability. Facebook, for instance, states that "most metrics will update once every 24 hours". If we call a data source more aggressively, it will often return NULL or empty values for metrics. When this occurs, we cannot determine whether a NULL or empty value means there was no activity or that the source system has not yet settled the data. Supplying unsettled data can disrupt reporting by providing inaccurate or misleading data.