Understanding Data Pipeline Scheduling

Once you've initiated a data pipeline, the initial sync may take 24-48 hours. However, this timeline can vary significantly depending on how quickly the source system generates data. For instance, syncing Amazon Pay Settlement Reports could take several weeks.

Data Pipeline Schedules Classification

There are four primary classifications of schedules:

1. Daily (Recurring)

2. Hourly (Recurring)

3. Lookback (Recurring)

4. Historical

Daily Schedules

Daily schedules request data using a pre-defined offset that aligns with data availability. For instance, for Amazon we use a -1 offset: on 1/6/2021, we request the data for 1/5/2021. This type of schedule is predominantly used for reporting and insights APIs.
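
As a minimal sketch of the offset arithmetic described above, the snippet below derives the report date from the request date. The function name and its parameters are hypothetical and used for illustration only; they are not part of any product API.

```python
from datetime import date, timedelta


def report_date_for(request_date: date, offset_days: int = -1) -> date:
    """Return the data date to request, given the day the job runs.

    offset_days is the pre-defined offset (assumed here to be -1,
    as in the Amazon example above).
    """
    return request_date + timedelta(days=offset_days)


# A job running on 1/6/2021 with a -1 offset asks for 1/5/2021.
print(report_date_for(date(2021, 1, 6)))  # 2021-01-05
```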

Hourly Schedules

Hourly schedules involve data requests every hour (e.g., from 1 AM to 2 AM, 2 AM to 3 AM, and so on). This schedule type is particularly common for transactional systems like orders or shipping data APIs and is often denoted as "real-time" or "near real-time."
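
The hourly windows described above can be sketched as consecutive one-hour intervals. This is an illustration under assumed names (hourly_windows is hypothetical), not a description of how any specific connector is implemented.

```python
from datetime import datetime, timedelta


def hourly_windows(start: datetime, count: int) -> list[tuple[datetime, datetime]]:
    """Return `count` consecutive (window_start, window_end) pairs, one hour each."""
    return [
        (start + timedelta(hours=i), start + timedelta(hours=i + 1))
        for i in range(count)
    ]


# Two windows starting at 1 AM: 1 AM to 2 AM, then 2 AM to 3 AM.
for window_start, window_end in hourly_windows(datetime(2021, 1, 6, 1), 2):
    print(window_start.strftime("%H:%M"), "to", window_end.strftime("%H:%M"))
```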

Lookback Schedules

Lookback schedules re-request data for previous dates and often run alongside the standard daily processes. Some data sources update past dates with new data; for example, Amazon may update impression counts for an earlier date. Lookbacks are common for ad platforms, which permit changes to performance attribution metrics such as sales, impressions, and clicks.
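
To illustrate, the sketch below lists the earlier dates a lookback run might re-request alongside the daily job. The window length (lookback_days) and the function name are assumptions made for this example; actual lookback windows vary by data source.

```python
from datetime import date, timedelta


def lookback_dates(request_date: date, lookback_days: int, offset_days: int = -1) -> list[date]:
    """Return the dates preceding the daily report date, newest first.

    lookback_days is a hypothetical window length, not a documented setting.
    """
    daily_report_date = request_date + timedelta(days=offset_days)
    return [daily_report_date - timedelta(days=i) for i in range(1, lookback_days + 1)]


# On 1/6/2021 the daily job requests 1/5/2021; a 3-day lookback re-requests
# the three dates before it.
print([d.isoformat() for d in lookback_dates(date(2021, 1, 6), 3)])
# ['2021-01-04', '2021-01-03', '2021-01-02']
```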

Historical Schedules & Requests

Historical schedules aim to reconstruct a by-date snapshot of data from a source. For example, a request on July 1 for data from the past 180 days would create daily API requests for each of those 180 days. If the connector manages five reports, this results in 900 reports (180 x 5). Because a single report can require more than one API call, the total number of API requests needed to recreate 180 days of data could surpass 3,000.
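
The report count in this example can be reproduced with a short sketch that expands a historical request into one entry per report per day. The report names and the year are placeholders assumed for illustration; the source example only specifies July 1, 180 days, and five reports.

```python
from datetime import date, timedelta
from itertools import product


def historical_plan(request_date: date, days_back: int, report_names: list[str]):
    """Return one (report_name, report_date) pair per report per past day."""
    dates = [request_date - timedelta(days=i) for i in range(1, days_back + 1)]
    return list(product(report_names, dates))


# Hypothetical report names; a connector managing five reports.
reports = ["report_a", "report_b", "report_c", "report_d", "report_e"]
plan = historical_plan(date(2021, 7, 1), 180, reports)
print(len(plan))  # 900
```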

Because these historical requests can be demanding, they're scheduled to run as separate processes to minimize interference with daily, hourly, or lookback schedules. These jobs are optimized to run as long-running background tasks that request past dates as API capacity allows. For example, recreating a year's worth of data for Amazon retail reports could require almost 10,000 API requests and might take up to 4 weeks to complete, assuming an API limit of 60 requests per hour.

Because each daily snapshot of historical data must be reconstructed, these requests can often take days or even weeks to complete.
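
As a rough check on the figures above: at 60 requests per hour, 10,000 requests need about 167 hours, or roughly 7 days, if the historical job could use the entire limit. The sketch below shows that arithmetic. The capacity_share parameter is a hypothetical illustration of how sharing API capacity with recurring schedules stretches the estimate; it is not a documented setting.

```python
def estimate_days(total_requests: int, requests_per_hour: int, capacity_share: float = 1.0) -> float:
    """Estimate how many days a backfill takes at a given hourly rate limit.

    capacity_share is the assumed fraction of the limit available to the
    historical job (1.0 means the full limit).
    """
    if requests_per_hour <= 0 or not 0 < capacity_share <= 1:
        raise ValueError("rate limit must be positive and share must be in (0, 1]")
    hours = total_requests / (requests_per_hour * capacity_share)
    return hours / 24


# Full limit available: roughly one week.
print(round(estimate_days(10_000, 60), 1))        # 6.9
# Assumed quarter of the limit available: roughly four weeks.
print(round(estimate_days(10_000, 60, 0.25), 1))  # 27.8
```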

Data Workflow Jobs

There are two primary types of jobs: Recurring and Historical. Recurring jobs run at a specific interval, typically hourly or daily. Recurring jobs also include LOOKBACK tasks, which re-request prior dates.

The Request Date is the date the process is scheduled to run.
The Report Date reflects the "data date" we are asking the source system to provide.
The Execution Schedule reflects the time when a request is expected to run or has previously run.
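
To make the relationship between these three fields concrete, here is a minimal sketch of a job record. The class name, field names, and job type labels are hypothetical and chosen for illustration; they do not reflect the actual job list schema.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class JobRecord:
    job_type: str             # hypothetical labels: "RECURRING" or "HISTORICAL"
    request_date: date        # the date the process is scheduled to run
    report_date: date         # the "data date" asked of the source system
    execution_time: datetime  # when the request is expected to run or has run


# A daily recurring job with a -1 offset: it runs on 1/6/2021 for 1/5/2021 data.
job = JobRecord("RECURRING", date(2021, 1, 6), date(2021, 1, 5), datetime(2021, 1, 6, 3, 0))
print((job.request_date - job.report_date).days)  # 1
```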

Recurring Request, Report, and Execution Schedules are set dynamically and throttled based on the upstream API rate limits.

Historical requests are those initiated by the end user, typically a one-time request for a specific date or range of dates. Historical data request tasks also show up in the jobs list. For more information on these processes, see Historical Data Request Tasks.

Further Information on Schedules & Timing

For a deeper understanding of Data Source and Destination automation and timing, please refer to our article, Key Considerations For Data Source and Destination Automation Timing.

For API limits and why we use dynamic scheduling, see the articles Amazon Selling Partner API Data Feed Limits, Errors, And Constraints and How To Use Healthchecks.