APIs are not unlimited resources; every source system places restrictions on how much data can be requested and how often. Accordingly, connectors are engineered to work within those restrictions while optimizing data availability and quality. Several factors shape a data pipeline schedule, affecting when sync jobs start and how frequently they run.
Here are the five critical factors that determine the timing of data pipeline syncs from the source to the destination:
1. Limits & Throttles
2. Scheduling
3. Constraints
4. Availability
5. Errors
Limits & Throttles
All source systems have defined API limits, throttles, and data availability restrictions. Similarly, data destinations have loading limits and throttles that could have cost implications. For instance, Snowflake bills for CPU time, so hourly data loads will add to your charges.
Openbridge adopts a cautious API request strategy, ensuring we comply with established rate limits and throttling constraints. We meticulously model how each endpoint handles requests and re-requests for data, minimizing how much of each API's rate limit we consume.
Hence, Openbridge's data pipeline scheduling aligns strictly with each system's policies and rules. For example, Amazon Seller Central allows only one daily sync for Referral Fees reports, whereas hourly syncs are permissible for Order transactions.
All sources and destinations have concurrency limits, i.e., the number of simultaneous requests. For example, Amazon will reject multiple report requests within the same hour since API restrictions allow only one request per hour.
Ultimately, the frequency of a data pipeline process is determined and controlled by each unique data source, not Openbridge.
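As a rough illustration of how a per-report limit gates a sync, here is a minimal Python sketch. It is not Openbridge's implementation. The report names and the MIN_INTERVAL table are hypothetical and simply encode the two Amazon Seller Central examples above (one daily sync for Referral Fees, hourly syncs for Orders).

```python
from datetime import datetime, timedelta

# Hypothetical minimum spacing between requests, based on the examples above.
MIN_INTERVAL = {
    "referral_fees": timedelta(days=1),  # one sync per day
    "orders": timedelta(hours=1),        # one sync per hour
}

def can_request(report_type, last_request, now):
    """Return True if enough time has passed since the last request."""
    if last_request is None:
        # No earlier request on record, so nothing to wait for.
        return True
    return now - last_request >= MIN_INTERVAL[report_type]

last = datetime(2021, 1, 5, 8, 0)
print(can_request("orders", last, datetime(2021, 1, 5, 8, 30)))         # False
print(can_request("orders", last, datetime(2021, 1, 5, 9, 0)))          # True
print(can_request("referral_fees", last, datetime(2021, 1, 5, 20, 0)))  # False
```

A scheduler built this way never sends a request the source is known to reject, which is why the source's policy, not the scheduler, sets the sync frequency.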
Scheduling
When you create a new data pipeline, Openbridge dynamically establishes its schedule based on the data source and the number of pipelines that already exist for it. For example, if you already have 5 data pipelines for an Amazon Advertising profile, we'll set a dynamic schedule when you add a sixth pipeline, ensuring compliance with the data source API limits. Otherwise, running all six pipelines simultaneously might breach Amazon's limits, leading to failed syncs.
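One simple way to picture a dynamic schedule is to stagger pipeline start times across a window so they never run at the same moment. The Python sketch below is only an assumption about how such spacing could work. The function name staggered_offsets and the one-hour window are hypothetical and do not describe Openbridge's actual scheduler.

```python
from datetime import timedelta

def staggered_offsets(pipeline_count, window=timedelta(hours=1)):
    """Spread pipeline start times evenly across a scheduling window."""
    if pipeline_count < 1:
        raise ValueError("pipeline_count must be at least 1")
    step = window / pipeline_count
    # Pipeline i starts i steps after the beginning of the window.
    return [step * i for i in range(pipeline_count)]

# Six pipelines in a one-hour window start ten minutes apart.
for number, offset in enumerate(staggered_offsets(6), start=1):
    print(f"pipeline {number}: starts {offset} into the window")
```

Adding a sixth pipeline changes the spacing for all six, which is why the schedule depends on how many pipelines already exist.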
Constraints
When Openbridge initiates a data pipeline job, we don't have direct control over the completion time or when data is loaded. The job completion, including the time to load extracted data into your destination, depends on two external factors:
1. The data source's response to our requests for data
2. The availability of the target data destination
In the case of data source delays, system outages, authorization issues, or temporary account blocks, Openbridge will re-queue the requests for later processing.
Various destinations have different loading capacities. For instance, if you're using multiple Redshift processes for data analytics, our data loading requests get queued within Redshift's system. This results in delayed loading time as our ability to write data depends on the destination's capacity and availability.
Under such circumstances, Openbridge queues its requests to the destination to manage backpressure from the destination's load capacity limits.
Delays due to these reasons can range from a few minutes to several days in the worst cases.
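Re-queuing for later processing is often implemented with a growing delay between attempts. The Python sketch below shows that general pattern (exponential backoff with a cap) under stated assumptions. BASE_DELAY, MAX_DELAY, and the function names are hypothetical, and the source does not say which retry policy Openbridge uses.

```python
import heapq
from datetime import datetime, timedelta

BASE_DELAY = timedelta(minutes=5)  # assumed delay before the first retry
MAX_DELAY = timedelta(hours=24)    # assumed upper bound on any single delay

def retry_delay(attempt):
    """Delay before retry number `attempt` (1-based), doubling up to the cap."""
    return min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)

def requeue(queue, job_id, attempt, now):
    """Push the job back onto a time-ordered queue and return its next run time."""
    run_at = now + retry_delay(attempt)
    heapq.heappush(queue, (run_at, job_id, attempt))
    return run_at

queue = []
now = datetime(2021, 1, 6, 9, 0)
print(requeue(queue, "orders-sync", 1, now))   # 2021-01-06 09:05:00
print(requeue(queue, "retail-sync", 4, now))   # 2021-01-06 09:40:00
print(heapq.heappop(queue)[1])                 # orders-sync
```

With a policy like this, a short outage costs minutes, while a prolonged block at the source or destination can stretch the total delay to days.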
Availability
Different sources provide data at various intervals, typically when the data has "settled" or been packaged for delivery via the API. For instance, when requesting data for an Amazon Retail report on 10/20, we ask for the "settled" data from 10/19, which ensures a complete day of data. If we request the same day's data, Amazon might reject the request or deliver incorrect or incomplete results, as the data may not have settled. Therefore, our data pipeline schedules reflect requests we know the source system can fulfill.
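The date arithmetic behind a settled-data request is small enough to show directly. In this Python sketch the function name settled_report_date and the settle_days parameter are hypothetical, and the year is arbitrary because the example above gives only a month and day.

```python
from datetime import date, timedelta

def settled_report_date(run_date, settle_days=1):
    """Return the most recent date whose data should have settled by run_date."""
    return run_date - timedelta(days=settle_days)

# A job running on 10/20 asks for the data from 10/19.
print(settled_report_date(date(2021, 10, 20)))  # 2021-10-19
```

A source that needs longer to settle could be modeled by raising settle_days for that source.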
Errors
Errors during a data pipeline can occur for various reasons, such as:
1. Authorization and credential issues
2. API limits and throttles
3. Connection failures and errors
4. Source or destination outages
Openbridge verifies user permissions and account configuration during and after activation. Insufficient permissions, changes made after pipeline setup, or improper source configuration can cause a pipeline to fail.
Besides documented API limits, there might be undocumented limits or throttles. For example, although the Amazon API states you can make 60 requests an hour, some reports can only be requested once daily.
Customers often have another app or tool that exhausts API capacity, which can impact our ability to collect data on your behalf. For instance, Amazon may report NO_DATA or CANCELLED in response to our API request. You should review any other apps connecting to the data source API (see https://docs.openbridge.com/en/articles/4895388-amazon-seller-central-data-feed-limits-errors-and-constraints).
When we request daily data, we ask for the previous day's data. For example, on 1/6/2021, we request data for 1/5/2021, which aligns with data availability. Data sources often return NULL or empty values for metrics if we call them more aggressively. In these cases, we cannot determine whether a NULL or empty value reflects no activity or data that hasn't settled. Supplying unsettled data can disrupt reporting by providing inaccurate or misleading data.
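To make the handling of these responses concrete, here is a small Python sketch that maps a report status to a follow-up action. Only NO_DATA and CANCELLED come from the text above. The "DONE" status, the action names, and the decision to re-queue are assumptions for illustration, not documented Openbridge or Amazon behavior.

```python
# Statuses named in the text above, treated here as worth retrying later.
RETRYABLE_STATUSES = {"NO_DATA", "CANCELLED"}

def next_action(status):
    """Map a report status to a hypothetical pipeline action."""
    if status in RETRYABLE_STATUSES:
        return "requeue"      # try again once API capacity may have freed up
    if status == "DONE":      # assumed name for a successful report
        return "load"
    return "investigate"      # anything unrecognized needs a closer look

for status in ("NO_DATA", "CANCELLED", "DONE", "UNKNOWN"):
    print(status, "->", next_action(status))
```

If the same report keeps landing in the retry path, that is a hint another app may be consuming the shared API capacity.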
Additional Information For Data Pipeline Scheduling
For additional information on scheduling, timing, and automation, see our article Understanding Data Pipeline Scheduling.