How To Query Amazon Multi-market Accounts With Data Duplication

This guide covers Amazon Unified, Multi-Marketplace Seller Accounts. If you have a multi-market Seller account, Amazon may supply data in unexpected and inconsistent ways for each market. For example, when requesting data for the US marketplace, Amazon may return data for CA. As a result, the data may be loaded for each marketplace, creating duplicate entries. For multi-market unified sellers, the outputs from Amazon can be confusing.

Settlement Reports are a prime use case of this behavior. For example, you have US settlement reports with ID 12345 and 54321. Rather than return the report once for the US ID 12345 and 54321, Amazon will return the US report ID for all four unified NA markets (US, CA, MX, and BR). This means that when a request is made to Amazon for the CA settlement report, it will return the US one. A request for MX will also return the US one.

Ideally, Amazon would reply with NO_DATA for a market without a report or provide a unified Settlement report by region. However, that is not the current state of their system.

As a result, you need to adjust how you query the data, being mindful of this behavior.

How To Perform De-Duplication

The ob_transaction_id is a unique key per record, so you can use this key to return only one row for a given marketplace_name. In our Settlement Report example, you can get a unique record per settlement report ID using the ob_transaction_id. In this case, you want Amazon.com as the marketplace, given it reflects the US settlement activity.

SELECT 
    *
FROM 
    `mydata.sp_settlements_master` 
WHERE 
    ob_transaction_id IN 
    (
        SELECT 
            MAX(ob_transaction_id) AS ob_transaction_id 
        FROM 
            `mydata.sp_settlements_master` 
        WHERE 
            settlement_id IN (12345,54321) 
            AND order_id IS NOT NULL 
        GROUP BY 
            order_id, 
            ob_index, 
            ob_modified_date
    ) 
GROUP BY 
    marketplace_name, 
    settlement_id;

Creating A "Dedupe View"

This query type can be translated into a view, meaning you only need to define the logic necessary once. Tools like Tableau, Power BI, Data Studio, or anything else can easily connect to the view without fuss.

The given SQL creates a view named sp_settlements_dedupe from the base table sp_settlements_master. This view aims to create a deduplicated version of the sp_settlements_master. Deduplication is done based on the combination of order_item_code, order_id, and settlement_id fields. For each unique combination of these fields, the view includes only the row with the most recent ob_modified_date, effectively providing the latest state of each unique order item within each unique order and settlement.

CREATE OR REPLACE VIEW `sp_settlements_dedupe` AS
SELECT
    T.*
FROM
    `sp_settlements_master` T
JOIN
    (
    SELECT
        order_item_code,
        order_id,
        MAX(TIMESTAMP(ob_modified_date)) AS max_modified_date
    FROM
        `sp_settlements_master`
    GROUP BY
        order_item_code,
        order_id
    ) SubQuery
ON
    T.order_item_code = SubQuery.order_item_code
    AND T.order_id = SubQuery.order_id
    AND TIMESTAMP(T.ob_modified_date) = SubQuery.max_modified_date;

This view is useful in several ways:

Data Simplification: It provides a simplified data version by removing duplicate entries based on order_item_code, order_id, and settlement_id. This can make subsequent analyses and data processing tasks easier and more efficient.
Data Currency: By selecting only the rows with the maximum ob_modified_date for each unique combination of order_item_code, order_id, and settlement_id, the view ensures that the data reflects the most recent state of each order item in each order and settlement.
Performance Improvement: By reducing the data to unique entries only, the view can improve the performance of queries run against it, as they would need to process less data than the full table.
Data Consistency: Since the view is updated whenever the underlying table changes, it ensures that users always have access to up-to-date, deduplicated data without needing to rerun the deduplication process manually.

This view would be particularly useful in scenarios where users need to track the latest status of order items, such as inventory management, sales reporting, and order processing tasks.

De-Dupe With Standard Date Formats

Here is another variation that standardizes the various post_date and posted_date_time values Amazon uses in NA and EU into a standard format:

CREATE OR REPLACE VIEW `xxxxx.standardized_settlements_view` AS
SELECT
    T.*,
    -- Standardizing `posted_date`
    CASE
        WHEN REGEXP_CONTAINS(T.posted_date, r'^\d{2}\.\d{2}\.\d{4}$') THEN 
            FORMAT_DATE('%Y-%m-%d', PARSE_DATE('%d.%m.%Y', T.posted_date))
        ELSE 
            FORMAT_DATE('%Y-%m-%d', PARSE_DATE('%Y-%m-%d', T.posted_date))
    END AS standardized_posted_date,
    -- Standardizing `posted_date_time`
    CASE
        WHEN REGEXP_CONTAINS(T.posted_date_time, r'^\d{2}\.\d{2}\.\d{4} ') THEN 
            FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', PARSE_TIMESTAMP('%d.%m.%Y %H:%M:%S UTC', T.posted_date_time))
        ELSE 
            FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S', PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S UTC', T.posted_date_time))
    END AS standardized_posted_date_time
FROM
    `xxxxx.sp_settlements_master` T
JOIN
    (
    SELECT
        COALESCE(order_item_code, '') AS order_item_code,
        order_id,
        MAX(TIMESTAMP(ob_modified_date)) AS max_modified_date
    FROM
        `xxxxx.sp_settlements_master`
    GROUP BY
        COALESCE(order_item_code, ''),
        order_id
    ) SubQuery
ON
    COALESCE(T.order_item_code, '') = SubQuery.order_item_code
    AND T.order_id = SubQuery.order_id
    AND TIMESTAMP(T.ob_modified_date) = SubQuery.max_modified_date;

Note: This is for reference only; you will need to modify it to suit your environment and desired modeling logic for what you define as a "duplicate".

Example Docs for View Creation

Here is an example guide for creating views in Google BigQuery for reference. The process is similar for destinations like Redshift, Snowflake, Athena, and others.