Checking The Status Of Your Data Pipeline

Openbridge is continually processing requests, retrying after API errors, and executing rate-limit backoff strategies mandated by the upstream API. As a result, data pipelines are in a continuous state of operation, especially pipelines that leverage lookbacks.

Data Pipeline Workflows

Upstream APIs rarely guarantee a service level. As a result, data availability, or velocity, can vary significantly depending on API errors, limits, changes, or service degradation. Different data sources also have different cadences: some are hourly or daily, others weekly, bi-weekly, or monthly. This timing depends on the source system, so when data lands at your destination can vary.

Please look at these two articles, which cover these topics in-depth:

Checking For Data In Your Destination

The primary way to check the status of a data pipeline is to understand what is currently resident at your destination. Below is a collection of diagnostic queries you can run to determine the current state of the data in your destination.

Running Diagnostics

Below is a simple diagnostic query. The query provides a high-level overview of what is resident in your destination for a given data feed. You will want to adjust the FROM clause to your target table or view. The example uses the sp_orders_report_master view for reference.

SELECT COUNT(*) AS count, ob_date, MIN(ob_modified_date) as load
FROM `sp_orders_report_master`
WHERE ob_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY ob_date
ORDER BY ob_date DESC;

Example: Checking the sp_orders_report_master view

Below is the output of the query for sp_orders_report_master.

Looking at the results, you can see a column called ob_date, or the "report" date. This is the report date Openbridge requested from the source system. Think of it like this: "Amazon, provide the report date 2023-05-17 for Order Tracking report(s) for Seller X." The response from Amazon to that request is loaded into sp_orders_report_master. As a result, the ob_date would be 2023-05-17, the report date from Amazon.

Next, the load column reflects when data started to load into your destination for the report date. In our example query, we are looking at sp_orders_report_master. Since this report is DAILY, the settled, complete report date of 2023-05-17 would be requested on 2023-05-18. You can see an example of the data loading cadence below.

count  | ob_date    | load
115539 | 2023-05-17 | 2023-05-18T02:07:14
116955 | 2023-05-16 | 2023-05-17T02:09:22
134830 | 2023-05-15 | 2023-05-16T02:09:09
116634 | 2023-05-14 | 2023-05-15T02:10:26
110896 | 2023-05-13 | 2023-05-14T02:08:10
118933 | 2023-05-12 | 2023-05-13T02:06:08
123419 | 2023-05-11 | 2023-05-12T02:10:08
123023 | 2023-05-10 | 2023-05-11T02:16:37
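If a single report date looks suspect, you can drill into it and list every distinct load event recorded for that date. This is a minimal sketch using the same sp_orders_report_master view; the filter date is the example date from above.

-- List every distinct load event for one report date (example date from above)
SELECT DISTINCT ob_modified_date AS load_event
FROM `sp_orders_report_master`
WHERE ob_date = '2023-05-17'
ORDER BY load_event;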

You can also see the last load for each reporting date by adjusting MIN to MAX in your query:
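For reference, the adjusted query is identical to the one above with MAX in place of MIN, shown here against the same sp_orders_report_master view:

SELECT COUNT(*) AS count, ob_date, MAX(ob_modified_date) AS load
FROM `sp_orders_report_master`
WHERE ob_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY ob_date
ORDER BY ob_date DESC;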

count  | ob_date    | load
115539 | 2023-05-17 | 2023-05-18T14:54:01
116955 | 2023-05-16 | 2023-05-17T14:54:45
134830 | 2023-05-15 | 2023-05-16T14:42:38
116634 | 2023-05-14 | 2023-05-15T14:40:07
110896 | 2023-05-13 | 2023-05-14T14:42:07
118933 | 2023-05-12 | 2023-05-13T14:54:22
123419 | 2023-05-11 | 2023-05-12T14:46:54
123023 | 2023-05-10 | 2023-05-11T14:52:54

IMPORTANT: Variable Timing

First and last load times can change depending on the data source limits, errors, or outages. If an API is delayed in supplying a report, we queue a request to try again later. As a result, the first load for a report can be 24-48 hours from a report date. This is normal behavior.
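If you want to measure this lag directly, one approach, sketched below against the same sp_orders_report_master view and mirroring the casts used elsewhere in this article, is to compute the number of days between the report date and the first load:

-- Days between the report date and the first load for that date
SELECT
  ob_date,
  MIN(ob_modified_date) AS first_load,
  DATE_DIFF(DATE(TIMESTAMP(MIN(ob_modified_date))), DATE(TIMESTAMP(ob_date)), DAY) AS days_to_first_load
FROM `sp_orders_report_master`
WHERE ob_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY ob_date
ORDER BY ob_date DESC;

A value of 1 reflects the normal next-day load for a DAILY report; larger values reflect retries, queued requests, or source-side availability offsets.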

To see why this occurs, see the reference docs:

Example: Bi-Weekly Reports

Here is another example using an infrequent bi-weekly report with variable report dates. This is not a function of "missing dates" but of the upstream system only making data available infrequently:

count   | ob_date    | load
1608    | 2023-06-08 | 2023-06-09T07:15:12
2524    | 2023-06-06 | 2023-06-07T02:16:35
13213   | 2023-06-01 | 2023-06-02T02:18:05
1148608 | 2023-05-31 | 2023-06-01T02:19:56
1709    | 2023-05-25 | 2023-05-26T07:15:25
1839    | 2023-05-23 | 2023-05-24T02:17:29

The cadence of the upstream source dictates the frequency and timing of data loads.

Understanding Data Source LOOKBACKS

Data sources may require LOOKBACK windows to re-request older dates for updated information. For example, on June 15th, Amazon back-updated the report data for May 20th. If May 20th is not re-requested, the data for that date may not be accurate or complete.

In a modified version of the query, we can include the first and latest load for a report date in the same result set:

SELECT COUNT(*) AS count, ob_date, MIN(ob_modified_date) as first_load, MAX(ob_modified_date) as latest_load
FROM `sp_vendor_traffic_master`
WHERE ob_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY ob_date
ORDER BY ob_date DESC;

Example: sp_vendor_traffic_master

As a general rule, any data source with LOOKBACKS will exhibit the behavior we detail below for sp_vendor_traffic.

Most Vendor Retail Analytics reports have an offset in report availability, typically 3-7 days from the end of the reporting period. For example, Amazon will make the 5/20 Traffic Report available on 5/24, 5/25, or later. This is the latency of the source system making data available. As a result, there will always be a latency of 3-7 days between the report date and the first load.

This data also employs LOOKBACKS. When reviewing LOOKBACK timing, the first_load column reflects the oldest load date and the latest_load the most recent. In most cases, these will mirror each other. This is a function of LOOKBACKS running.

In our table below, the 2023-06-05 ob_date shows that the latest load was on 2023-06-09. However, if you look at ob_date 2023-05-24, the latest update was 2023-06-07. This means the source system supplied updated information for 2023-05-24 on 2023-06-07.

Not every prior date may get updated. For example, 2023-06-05 may not have data loaded past 2023-06-09 if the source system does not supply updates. As a result, if you are looking at this data in July, the 2023-06-05 report may still have a latest_load of 2023-06-09.

count | ob_date    | first_load          | latest_load
49    | 2023-06-05 | 2023-06-09T03:58:13 | 2023-06-09T03:58:13
47    | 2023-06-04 | 2023-06-09T09:34:21 | 2023-06-09T09:34:21
46    | 2023-06-03 | 2023-06-09T12:15:21 | 2023-06-09T12:15:21
46    | 2023-06-02 | 2023-06-09T13:36:36 | 2023-06-09T13:36:36
46    | 2023-06-01 | 2023-06-08T13:38:25 | 2023-06-08T13:38:25
48    | 2023-05-31 | 2023-06-07T14:11:56 | 2023-06-07T14:11:56
47    | 2023-05-30 | 2023-06-06T13:38:49 | 2023-06-06T13:38:49
46    | 2023-05-29 | 2023-06-05T13:40:05 | 2023-06-05T13:40:05
46    | 2023-05-28 | 2023-06-04T13:37:19 | 2023-06-04T13:37:19
45    | 2023-05-27 | 2023-06-03T13:37:50 | 2023-06-03T13:37:50
45    | 2023-05-26 | 2023-06-02T13:42:50 | 2023-06-02T13:42:50
48    | 2023-05-25 | 2023-06-08T20:00:10 | 2023-06-08T20:00:10
48    | 2023-05-24 | 2023-06-07T20:00:57 | 2023-06-07T20:00:57
43    | 2023-05-23 | 2023-06-06T20:03:21 | 2023-06-06T20:03:21
44    | 2023-05-22 | 2023-06-05T20:01:21 | 2023-06-05T20:01:21
46    | 2023-05-21 | 2023-06-04T20:01:34 | 2023-06-04T20:01:34
45    | 2023-05-20 | 2023-06-03T20:01:37 | 2023-06-03T20:01:37
47    | 2023-05-19 | 2023-06-02T20:15:21 | 2023-06-02T20:15:21
46    | 2023-05-18 | 2023-06-01T20:03:35 | 2023-06-01T20:03:35
47    | 2023-05-17 | 2023-05-31T20:02:06 | 2023-05-31T20:02:06
46    | 2023-05-16 | 2023-05-30T20:01:20 | 2023-05-30T20:01:20
44    | 2023-05-15 | 2023-05-22T13:36:44 | 2023-05-22T13:36:44
45    | 2023-05-14 | 2023-05-21T13:37:14 | 2023-05-21T13:37:14
48    | 2023-05-13 | 2023-05-20T13:37:59 | 2023-05-20T13:37:59
46    | 2023-05-12 | 2023-05-26T20:04:40 | 2023-05-26T20:04:40
46    | 2023-05-11 | 2023-05-25T20:54:40 | 2023-05-25T20:54:40
46    | 2023-05-10 | 2023-05-17T13:36:31 | 2023-05-17T13:36:31
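To make the LOOKBACK pattern easier to scan, you can extend the query to show how many days separate the report date from the latest load. This is a sketch against the same sp_vendor_traffic_master view, again mirroring the casts used in the by-hour queries below; a large gap on an older report date indicates that a LOOKBACK re-requested and reloaded that date.

-- Days between the report date and the most recent load for that date
SELECT
  ob_date,
  MIN(ob_modified_date) AS first_load,
  MAX(ob_modified_date) AS latest_load,
  DATE_DIFF(DATE(TIMESTAMP(MAX(ob_modified_date))), DATE(TIMESTAMP(ob_date)), DAY) AS days_to_latest_load
FROM `sp_vendor_traffic_master`
WHERE ob_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY ob_date
ORDER BY ob_date DESC;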

Additional Diagnostics: By Hour

If you want a finer-grained diagnostic, you can inspect by the hour. This can be especially useful for hourly-grain data feeds.

Example Query By Hour

The query retrieves data from a target table, calculates day- and hour-level information from the ob_date and ob_modified_date columns, and provides the minimum and maximum ob_modified_date values for each combination of transaction_day, transaction_hour, and ob_date within the past 14 days.

SELECT
  DATE_TRUNC(DATE(ob_date), DAY) AS transaction_day,
  TIMESTAMP_TRUNC(TIMESTAMP(ob_modified_date), HOUR) AS transaction_hour,
  MIN(ob_modified_date) AS initial_load_date,
  MAX(ob_modified_date) AS latest_load_date
FROM `sp_orders_master`
WHERE DATE(TIMESTAMP(ob_date)) >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
GROUP BY transaction_day, transaction_hour, ob_date
ORDER BY transaction_day DESC;

This query uses ORDER BY transaction_day DESC because the data reflects daily reports. The results show that multiple load events occur each day, at different hourly intervals:

transaction_day | transaction_hour        | initial_load_date   | latest_load_date
2023-06-08      | 2023-06-09 02:00:00 UTC | 2023-06-09T02:09:26 | 2023-06-09T02:10:40
2023-06-08      | 2023-06-09 14:00:00 UTC | 2023-06-09T14:38:38 | 2023-06-09T14:40:23
2023-06-07      | 2023-06-08 02:00:00 UTC | 2023-06-08T02:04:41 | 2023-06-08T02:10:13
2023-06-07      | 2023-06-08 14:00:00 UTC | 2023-06-08T14:40:05 | 2023-06-08T14:40:29
2023-06-06      | 2023-06-07 02:00:00 UTC | 2023-06-07T02:15:21 | 2023-06-07T02:22:05
2023-06-06      | 2023-06-07 14:00:00 UTC | 2023-06-07T14:40:44 | 2023-06-07T14:56:51
2023-06-05      | 2023-06-06 14:00:00 UTC | 2023-06-06T14:56:10 | 2023-06-06T14:57:20
2023-06-05      | 2023-06-06 02:00:00 UTC | 2023-06-06T02:08:22 | 2023-06-06T02:23:35

If you know your data source is hourly, you can adjust the results with ORDER BY transaction_hour DESC. Here is an example of that query with a slight modification to the ORDER BY and the target table:

SELECT
  DATE_TRUNC(DATE(ob_date), DAY) AS transaction_day,
  TIMESTAMP_TRUNC(TIMESTAMP(ob_modified_date), HOUR) AS transaction_hour,
  MIN(ob_modified_date) AS initial_load_date,
  MAX(ob_modified_date) AS latest_load_date
FROM `sp_orders_master`
WHERE DATE(TIMESTAMP(ob_date)) >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
GROUP BY transaction_day, transaction_hour, ob_date
ORDER BY transaction_hour DESC;

This query will return results for sp_orders_master like this:

transaction_day | transaction_hour        | initial_load_date   | latest_load_date
2023-06-08      | 2023-06-09 16:00:00 UTC | 2023-06-09T16:20:35 | 2023-06-09T16:31:12
2023-06-09      | 2023-06-09 16:00:00 UTC | 2023-06-09T16:20:50 | 2023-06-09T16:38:06
2023-06-09      | 2023-06-09 15:00:00 UTC | 2023-06-09T15:21:48 | 2023-06-09T15:35:50
2023-06-08      | 2023-06-09 15:00:00 UTC | 2023-06-09T15:27:20 | 2023-06-09T15:27:20
2023-06-09      | 2023-06-09 14:00:00 UTC | 2023-06-09T14:21:35 | 2023-06-09T14:27:35
2023-06-08      | 2023-06-09 14:00:00 UTC | 2023-06-09T14:26:35 | 2023-06-09T14:26:35
2023-06-08      | 2023-06-09 13:00:00 UTC | 2023-06-09T13:23:19 | 2023-06-09T13:23:19
2023-06-09      | 2023-06-09 13:00:00 UTC | 2023-06-09T13:21:22 | 2023-06-09T13:27:23
2023-06-08      | 2023-06-09 12:00:00 UTC | 2023-06-09T12:23:36 | 2023-06-09T12:23:36
2023-06-09      | 2023-06-09 12:00:00 UTC | 2023-06-09T12:20:06 | 2023-06-09T12:27:37
2023-06-08      | 2023-06-09 11:00:00 UTC | 2023-06-09T11:21:50 | 2023-06-09T11:21:50
2023-06-09      | 2023-06-09 11:00:00 UTC | 2023-06-09T11:20:05 | 2023-06-09T11:30:50
2023-06-08      | 2023-06-09 10:00:00 UTC | 2023-06-09T10:28:05 | 2023-06-09T10:28:05
2023-06-09      | 2023-06-09 10:00:00 UTC | 2023-06-09T10:20:35 | 2023-06-09T10:40:04
2023-06-08      | 2023-06-09 09:00:00 UTC | 2023-06-09T09:20:07 | 2023-06-09T09:20:08

Support Requests

Providing objective, measurable examples is critical to troubleshooting. If you feel data is delayed or missing, please provide the relevant diagnostics detailed in this document with a support request. Providing guidance or support without objective and quantifiable measurements will be slow, time-consuming, and limited.
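If you need a quick, compact freshness check to include with your request, a minimal sketch like the one below (using the same sp_orders_report_master view and columns referenced earlier) returns the most recent report date and the most recent load event in a single row:

-- Most recent report date and most recent load event in the destination
SELECT
  MAX(ob_date) AS latest_report_date,
  MAX(ob_modified_date) AS latest_load
FROM `sp_orders_report_master`;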

This document details how to objectively quantify what is resident in your destination against the known behavior of various source systems.

When you submit a support request, please supply your analysis and the outputs from the diagnostic queries.
