This guide covers Amazon Unified, Multi-Marketplace Seller Accounts. If you have a multi-market Seller account, Amazon may supply data in unexpected and inconsistent ways for each market. For example, when requesting data for the US marketplace, Amazon may return data for CA. As a result, the data may be loaded for each marketplace, creating duplicate entries. For multi-market unified sellers, the outputs from Amazon can be confusing.
Settlement Reports are a prime use case of this behavior. For example, you have US settlement reports with ID 12345
and 54321
. Rather than return the report once for the US ID 12345
and 54321
, Amazon will return the US report ID for all four unified NA markets (US, CA, MX, and BR). This means that when a request is made to Amazon for the CA settlement report, it will return the US one. A request for MX will also return the US one.
Ideally, Amazon would reply with NO_DATA for a market without a report or provide a unified Settlement report by region. However, that is not the current state of their system.
As a result, you need to adjust how you query the data, being mindful of this behavior.
How To Perform De-Duplication
The ob_transaction_id is a unique key per record, so you can use this key to return only one row for a given marketplace_name. In our Settlement Report example, you can get a unique record per settlement report ID using the ob_transaction_id
. In this case, you want Amazon.com as the marketplace, given it reflects the US settlement activity.
SELECT
*
FROM
`mydata.sp_settlements_master`
WHERE
ob_transaction_id IN
(
SELECT
MAX(ob_transaction_id) AS ob_transaction_id
FROM
`mydata.sp_settlements_master`
WHERE
settlement_id IN (12345,54321)
AND order_id IS NOT NULL
GROUP BY
order_id,
ob_index,
ob_modified_date
)
GROUP BY
marketplace_name,
settlement_id;
Creating A "Dedupe View"
This query type can be translated into a view, meaning you only need to define the logic necessary once. Tools like Tableau, Power BI, Data Studio, or anything else can easily connect to the view without fuss.
โ
The given SQL creates a view named sp_settlements_dedupe
from the base table sp_settlements_master
. This view aims to create a deduplicated version of the sp_settlements_master. Deduplication is done based on the combination of order_item_code
, order_id
, and settlement_id
fields. For each unique combination of these fields, the view includes only the row with the most recent ob_modified_date
, effectively providing the latest state of each unique order item within each unique order and settlement.
CREATE OR REPLACE VIEW `sp_settlements_dedupe` AS
SELECT
T.*
FROM
`sp_settlements_master` T
JOIN
(
SELECT
order_item_code,
order_id,
MAX(TIMESTAMP(ob_modified_date)) AS max_modified_date
FROM
`sp_settlements_master`
GROUP BY
order_item_code,
order_id
) SubQuery
ON
T.order_item_code = SubQuery.order_item_code
AND T.order_id = SubQuery.order_id
AND TIMESTAMP(T.ob_modified_date) = SubQuery.max_modified_date;
This view is useful in several ways:
Data Simplification: It provides a simplified data version by removing duplicate entries based on
order_item_code
,order_id
, andsettlement_id
. This can make subsequent analyses and data processing tasks easier and more efficient.Data Currency: By selecting only the rows with the maximum
ob_modified_date
for each unique combination oforder_item_code
,order_id
, andsettlement_id
, the view ensures that the data reflects the most recent state of each order item in each order and settlement.Performance Improvement: By reducing the data to unique entries only, the view can improve the performance of queries run against it, as they would need to process less data than the full table.
Data Consistency: Since the view is updated whenever the underlying table changes, it ensures that users always have access to up-to-date, deduplicated data without needing to rerun the deduplication process manually.
This view would be particularly useful in scenarios where users need to track the latest status of order items, such as inventory management, sales reporting, and order processing tasks.
De-Dupe With Standard Date Formats
Here is another variation that standardizes the various post_date
and posted_date_time
values Amazon uses in NA and EU into a standard format:
CREATE OR REPLACE VIEW `xxxxx.standardized_settlements_view` AS
SELECT
T.*,
-- Standardizing `posted_date`
CASE
WHEN REGEXP_CONTAINS(T.posted_date, r'^\d{2}\.\d{2}\.\d{4}$') THEN
FORMAT_DATE('%Y-%m-%d', PARSE_DATE('%d.%m.%Y', T.posted_date))
ELSE
FORMAT_DATE('%Y-%m-%d', PARSE_DATE('%Y-%m-%d', T.posted_date))
END AS standardized_posted_date,
-- Standardizing `posted_date_time`
CASE
WHEN REGEXP_CONTAINS(T.posted_date_time, r'^\d{2}\.\d{2}\.\d{4} ') THEN
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', PARSE_TIMESTAMP('%d.%m.%Y %H:%M:%S UTC', T.posted_date_time))
ELSE
FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S UTC', T.posted_date_time))
END AS standardized_posted_date_time
FROM
`xxxxx.sp_settlements_master` T
JOIN
(
SELECT
COALESCE(order_item_code, '') AS order_item_code,
order_id,
MAX(TIMESTAMP(ob_modified_date)) AS max_modified_date
FROM
`xxxxx.sp_settlements_master`
GROUP BY
COALESCE(order_item_code, ''),
order_id
) SubQuery
ON
COALESCE(T.order_item_code, '') = SubQuery.order_item_code
AND T.order_id = SubQuery.order_id
AND TIMESTAMP(T.ob_modified_date) = SubQuery.max_modified_date;
Note: This is for reference only; you will need to modify it to suit your environment and desired modeling logic for what you define as a "duplicate".
Example Docs for View Creation
Here is an example guide for creating views in Google BigQuery for reference. The process is similar for destinations like Redshift, Snowflake, Athena, and others.