Healthchecks reflect a transactional log of data pipelines' health and well-being. These checks are typically used to ensure that a system is functioning correctly and to identify and diagnose any issues that may be present.
Healthchecks can include a wide range of tests, including:
Checks of the system's availability, made by sending a simple request to the server.
Checks to see if the system can perform specific tasks or functions correctly.
Checks to determine a system's ability to handle a high workload or large amounts of data.
Checks to determine if systems have properly authorized access to targeted resources.
Checks for the system's ability to handle unexpected or undocumented conditions, such as bugs in APIs or non-standard behaviors.
Healthchecks reflect automated system activities that run at various intervals to ensure the system is healthy and performing well. They also help customers identify potential issues at data sources or destinations that need their attention.
Understanding Your Healthchecks
The healthchecks reflect a system activity log, an in-depth collection of transactional events. Events are not summarized or aggregated into reports.
Two fundamental types of healthcheck errors are reported: blocker and transient.
What is a blocker error?
A blocker error refers to a permanent or critical problem with a service (data source or destination). For example, an upstream API no longer makes the requested data available, or an API refuses to authorize a request with the provided credentials. Another example is when a resource like an Amazon Seller account or Shopify store is deleted. When a blocker error is present in healthchecks, it requires the customer's immediate attention.
What is a transient error?
Unlike blocker errors, transient errors typically reflect a temporary issue. For example, the Amazon Selling Partner API may be unavailable (overloaded or down), close a connection, or behave unexpectedly or in an undocumented manner. In most cases, transient errors are corrected by reattempting a request, using lookback windows, or using retry queues that hold requests until an API becomes available or is working as expected. As a result, transient errors are temporary and will "self-correct." Also see Key Considerations For Data Source and Destination Automation Timing.
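The distinction between the two error types can be sketched as a small classifier. This is a heuristic illustration only: the `classify` function and the code-to-type mapping are assumptions derived from the "Worker error code" examples listed in this document, not a description of how the system actually categorizes errors.

```python
import re

# Assumed mapping, taken only from the reference errors in this document.
# Real APIs may use the same status code for both kinds of problem.
BLOCKER_CODES = {400, 401, 402, 403}
TRANSIENT_CODES = {429, 500}

_CODE_PATTERN = re.compile(r"Worker error code (\d{3})")


def classify(message: str) -> str:
    """Return 'blocker', 'transient', or 'unknown' for a healthcheck message."""
    match = _CODE_PATTERN.search(message)
    if match is None:
        return "unknown"  # no worker error code to inspect
    code = int(match.group(1))
    if code in BLOCKER_CODES:
        return "blocker"
    if code in TRANSIENT_CODES:
        return "transient"
    return "unknown"


for msg in (
    "Worker error code 403: Forbidden",
    "Worker error code 429: Too Many Requests",
    "ConnectionDone: Connection was closed cleanly",
):
    print(classify(msg), "<-", msg)
```

Messages without a worker error code fall through to "unknown" here; a fuller version would also need to match on message text.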
How To Identify Blocker Errors
There are two classes of blocker errors: those for data sources and those for data destinations.
Blocker Errors: Data Sources
Most blocker errors found in data sources are authorization or permission issues. For example, the system calls the Amazon Selling Partner API for order reports, and the API responds that the prior authorization is invalid. While less common, there can be issues where a data source's resource, like a Facebook Page, YouTube Channel, or Amazon Advertiser, no longer exists.
Below is a typical collection of blocker errors returned by data sources:
"Not authorized to access scope 1234567890"
"Invalid API key or access token (unrecognized login or wrong password)"
"Worker error code 403: The customer account can't be accessed because it is not yet enabled or has been deactivated."
"Worker error code 402: Unavailable Shop"
"Worker error code 403: Forbidden"
"Worker error code 403: Reports are unavailable for this entity."
"Worker error code 400: Page access token not found"
"Worker error code 401: [API] Invalid API key or access token (unrecognized login or wrong password)"
"Worker error code 403: The caller does not have permission"
"An application does not have permission for this action."
"Worker error code 400: The account has been deleted"
"Object with ID 'act_1234567890' does not exist, cannot be loaded due to missing permissions, or does not support this operation"
"The request has an invalid grant parameter: refresh_token. User may have revoked or didn't grant the permission."
Other blocking errors from sources relate to requests that exceed the available historical window:
"Report date is too far in the past. Reports are only available for 60 days."
"'startDate' cannot be older than '90' days"
Fixing Data Source Blocking Errors
In the case of a permission error, the best course of action is to attempt reauthentication or refresh the authorization. It is possible someone who manages the account inadvertently removed your permissions.
It is also possible that a resource was deleted or suspended. For example, an Amazon Advertising profile may have been deleted by someone on the account or suspended by Amazon. In those cases, a pipeline cannot be reactivated, as there is no longer an upstream resource (Amazon Ads profile) to collect data from. If there is a suspicion the resource has been deleted or suspended, like an Amazon Advertiser profile, the best step is to pause the pipeline and investigate further with Amazon.
See the doc Amazon Selling Partner API Data Feed Limits, Errors, And Constraints, which details the behavior of Amazon APIs for blocker and transient errors.
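For the historical-window errors listed earlier in this section, one way to avoid the blocker is to check a requested start date against the source's limit before sending the request. A minimal sketch, assuming a hypothetical `clamp_start_date` helper; the 60-day value is taken from the example message purely for illustration, and actual limits vary by API and report type:

```python
from datetime import date, timedelta


def clamp_start_date(requested: date, max_age_days: int, today: date) -> date:
    """Return the earliest start date the source is assumed to accept.

    `max_age_days` is a hypothetical per-source limit (for example 60 or 90);
    it is not read from any real API.
    """
    oldest_allowed = today - timedelta(days=max_age_days)
    # A request older than the window would be rejected, so move it forward.
    return max(requested, oldest_allowed)


today = date(2024, 3, 31)
print(clamp_start_date(date(2023, 1, 1), 60, today))  # moved to the window edge
print(clamp_start_date(date(2024, 3, 1), 60, today))  # already inside the window
```

Clamping silently shortens the requested range, so a real implementation would likely also record that older data was unavailable.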
Blocker Errors: Data Destination
Blocker errors with data destinations prevent data from loading in most, if not all, cases. This means that while data can still be collected from a source, a blocker in a destination prevents it from loading. It also creates backpressure on the destination. For example, if a data destination has been blocked for 24 hours, millions of records across hundreds of tables can be waiting to load. In addition to the backlog of older data, new data is always in flight. Backpressure can create significant delays as the destination attempts to "catch up" with all the load requests in the system.
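The catch-up delay can be estimated with simple arithmetic: if a backlog of B records is waiting, new records arrive at rate a, and the destination loads at rate r, the backlog clears in roughly B / (r - a), provided r is greater than a. A minimal sketch; the function name and every figure below are illustrative assumptions, not measurements from any real destination:

```python
def catch_up_hours(backlog_records: float, arrival_per_hour: float,
                   load_per_hour: float) -> float:
    """Estimate hours until a previously blocked destination clears its backlog.

    Assumes constant rates. Returns infinity when loading cannot outpace
    the new data that keeps arriving.
    """
    drain_rate = load_per_hour - arrival_per_hour
    if drain_rate <= 0:
        return float("inf")
    return backlog_records / drain_rate


# Hypothetical numbers: blocked for 24 hours while 100,000 records/hour arrive,
# then able to load 150,000 records/hour once unblocked.
backlog = 24 * 100_000
print(catch_up_hours(backlog, 100_000, 150_000))
```

With these made-up rates, a 24-hour outage takes about 48 hours to clear, which is why a short block can cause a much longer delay.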
In a worst-case scenario, data can be lost if the blocking error persists long enough. For example, if a Google BigQuery billing account defaults, Google will schedule all your datasets to "expire." A common cue to this issue is Google reporting "Datasets must have a default expiration time and default partition expiration time of less than 60 days while in sandbox mode." This indicates something is wrong with the customer billing account in Google Cloud. As a result of the expire setting, years' worth of data would be deleted (expired) from BigQuery.
Below is a collection of reference blocker errors for destinations:
"Could not translate host name "XYZ-redshift-001.cvfrr5678qa.eu-west-1.redshift.amazonaws.com" to address: Name or service not known"
"Credential should be scoped to a valid region"
"ERROR: XX000: Disk Full"
"Billing has not been enabled for this project. Enable billing at https://console.cloud.google.com/billing. Datasets must have a default expiration time and default partition expiration time of less than 60 days while in sandbox mode"
"accessDenied: This error returns when you try to access a resource or that you don't have access to. This error also returns when you try to modify a read-only object"
"FATAL: password authentication failed for user"
"FATAL: database "XXXXX" does not exist"
"Connection refused. Check that the hostname and port are correct and that the postmaster accepts TCP/IP connections."
Fixing Data Destination Blocking Errors
Depending on the destination error, fixing any specific issue will be a function of the destination itself. For example, if billing is not enabled for BigQuery, you should log into your Google Cloud account and ensure billing is activated with stable, consistent payment methods. Another example is Redshift "Disk Full" or "Cannot connect to host" errors, which require ensuring there is adequate disk storage or that no firewall rules are blocking a connection.
Customers should reference the vendor documentation for the data destination in question.
How To Identify Transient Errors
Transient errors are temporary or intermittent problems that occur in pipelines. Many factors, such as network congestion, API unavailability, or bugs, can cause these errors. They are not permanent and can usually be resolved by retrying the same request after a short period. For example, Amazon communicates these errors to merchants when an issue or service outage prevents requesting specific reports. Often, these issues are posted in the Amazon user interface:
"The issue causing Business Reports to display inaccurate data has been resolved, and we are working on updating the historical data as quickly as possible to ensure accuracy. While these updates are ongoing, you may continue to see inaccurate sales and order data in Business Reports, or Business Reports may become temporarily unavailable."
While the UI is highly descriptive, the response from the API is less descriptive. The same error from the Business Reports API will state
In cases of transient errors, we will continue to retry requests periodically, attempting to poll the API to see if the issue is resolved. Our approach to handling transient errors is to implement retry logic, which automatically retries the failed request after a short delay, with a maximum number of retries. There's no need for the customer to do anything.
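The retry behavior described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `TransientError`, `retry`, and `flaky_request` are hypothetical names, and the exponentially growing delay is a common choice that the text does not specify.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def retry(request: Callable[[], T], max_retries: int = 5, base_delay: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `request`, retrying on TransientError up to `max_retries` times."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    for attempt in range(max_retries + 1):
        try:
            return request()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(base_delay * 2 ** attempt)  # wait longer after each failure
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky_request() -> str:
    # Fails twice, then succeeds, imitating an API that recovers on its own.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("Worker error code 429: Too Many Requests")
    return "report ready"


# The demo passes a no-op sleep so it does not actually wait.
print(retry(flaky_request, sleep=lambda seconds: None))
```

Capping the number of retries keeps a long outage from generating unbounded traffic against an API that is already struggling.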
Below is a collection of reference transient errors:
"Could not find hostname: s3.amazonaws.com"
"Connection to the other side was lost in a non-clean fashion: Connection lost"
"Worker error code 500: An unexpected error has occurred. Please retry your request later"
"ConnectionDone: Connection was closed cleanly"
"err_msg": "Worker error code 429: Too Many Requests"
"Worker error code 429: Quota exceeded for quota metric 'Queries' and limit 'Queries per day' of service
Lastly, another form of transient error reflects schema issues. These errors typically occur because the upstream data source supplied data that does not align with its documentation, or because the output from the API is malformed.
"Columns not matching rules schema, first mismatching column is ob_marketplace_id != atvpdkikx0_der"
The Openbridge team investigates these types of schema or malformed API output errors. Again, there is no action needed on the part of the customer.
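To illustrate what a schema mismatch like the one above means, the sketch below compares an expected column list against the columns actually received and reports the first position where they differ. The `first_mismatch` helper and the surrounding column names are hypothetical; only the two mismatching names come from the error message, and this is not the system's actual validation logic.

```python
from typing import Optional, Sequence, Tuple

Mismatch = Tuple[int, Optional[str], Optional[str]]


def first_mismatch(expected: Sequence[str], received: Sequence[str]) -> Optional[Mismatch]:
    """Return (position, expected_name, received_name) for the first differing
    column, or None when both lists match exactly."""
    for index in range(max(len(expected), len(received))):
        want = expected[index] if index < len(expected) else None
        got = received[index] if index < len(received) else None
        if want != got:
            return index, want, got
    return None


# Hypothetical column lists for illustration.
expected_columns = ["report_date", "ob_marketplace_id", "sku"]
received_columns = ["report_date", "atvpdkikx0_der", "sku"]
print(first_mismatch(expected_columns, received_columns))
```

A missing or extra trailing column is reported with `None` on the shorter side.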
"Circuit Breakers": Automated Deactivation Of Services
If errors persist long enough, a circuit breaker pattern in the system will detect and stop the flow of requests to a failing service. The system can be turned back on when the service becomes healthy again.
A circuit breaker may pause or deactivate a pipeline. For example, pipelines will be paused if a customer is not current with payments or an error has persisted for an extended period, such as a data destination being in a failed, unresponsive state.
The customer will need to "flip the switch" to reactivate services, which means ensuring that any errors that have gone unresolved for extended periods, like failed destinations, are corrected.
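The circuit breaker idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `CircuitBreaker` class is hypothetical, and it counts consecutive failures rather than measuring how long an error has persisted, which is what the text above describes.

```python
class CircuitBreaker:
    """Pauses a pipeline after repeated failures until someone reactivates it."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold  # assumed value, not from the source
        self.consecutive_failures = 0
        self.paused = False

    def record_success(self) -> None:
        # A healthy response resets the failure streak.
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.paused = True  # stop sending requests to the failing service

    def allow_request(self) -> bool:
        return not self.paused

    def reactivate(self) -> None:
        """The manual 'flip the switch' step, once the underlying error is fixed."""
        self.consecutive_failures = 0
        self.paused = False


breaker = CircuitBreaker(failure_threshold=3)
for _ in range(3):
    breaker.record_failure()
print(breaker.allow_request())  # False: the pipeline is paused
breaker.reactivate()
print(breaker.allow_request())  # True: requests flow again
```

Requiring an explicit `reactivate` call mirrors the manual step described above: the pipeline stays paused until someone confirms the underlying problem is fixed.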