Handling Apparent Duplicates in Amazon Marketing Stream Data

If it looks like you're receiving duplicate records in your Amazon Marketing Stream (AMS) data, it's important to first inspect the structure and meaning of the data before assuming it's a duplicate. What may appear as duplication often turns out to be valid and intentional behavior due to how the Amazon stream operates.

Common Scenarios That Resemble Duplicates

Similar Metrics Across Different Hours
It’s entirely possible—and expected—that you will see the same/similar performance metrics for the same keyword or ASIN across different hours. For instance:

June 1, 10 AM: impressions = 10, clicks = 3
June 1, 11 AM : impressions = 10, clicks = 3

These are two valid entries for two distinct hours. Even though the metrics are the same, each record corresponds to a different time window. This is especially common for low-traffic keywords, where performance data changes infrequently.

Restatements Within the Same Hour
Amazon Marketing Stream delivers hourly deltas, and it may restate historical data to reflect updates or corrections. In such cases, two records may appear identical (same hour, same keyword, same performance metrics), but they serve different purposes:

One is the initial delta.
The other is a restated update.

These are not true duplicates. To distinguish them, examine the idempotency_id field:

If two records have the same idempotency_id, they are true duplicates.
If they have different idempotency_ids, they are two valid records, even if the metrics are the same.

Using `idempotency_id` for De-duplication

Amazon guarantees that every event in the stream has a unique idempotency_id per logical update. You can use this field to filter out true duplicates or ensure only the latest restatement is retained.

SQL Strategy 1: Keep the Most Recent Record

SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY keyword_id, hour ORDER BY received_at DESC -- or any timestamp field ) AS row_num FROM stream_data ) AS deduped WHERE row_num = 1;

This approach retains only the most recent record per keyword_id and hour.

SQL Strategy 2: Use `idempotency_id` to Remove Exact Duplicates

SELECT DISTINCT ON (idempotency_id) * FROM stream_data;

This removes true duplicates where the idempotency_id is repeated.

SQL Strategy 3: Use a View for Clean Data Access

You can also create a SQL view that always filters to the most recent update:

CREATE OR REPLACE VIEW clean_stream_data AS SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY keyword_id, hour ORDER BY received_at DESC ) AS row_num FROM stream_data ) AS deduped WHERE row_num = 1;

Now you can query clean_stream_data and always get de-duplicated, up-to-date results.

Examples

De-duplicate Examples Using `idempotency_id`: Only Keep Unique Records

This query removes true duplicates by selecting only one record per unique idempotency_id.

SELECT DISTINCT ON (idempotency_id) * FROM "amazon-marketing-stream"."sp-traffic" WHERE DATE(from_iso8601_timestamp(time_window_start)) = DATE '2021-03-24' AND match_type IN ('BROAD', 'PHRASE', 'EXACT');

Keep Only the Latest Record per Keyword and Hour

This query ensures that for each keyword_id and hour, only the most recent restatement is retained using ROW_NUMBER().

SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY keyword_id, time_window_start ORDER BY from_iso8601_timestamp(received_at) DESC ) AS row_num FROM "amazon-marketing-stream"."sp-traffic" WHERE DATE(from_iso8601_timestamp(time_window_start)) = DATE '2021-03-24' AND match_type IN ('BROAD', 'PHRASE', 'EXACT') ) AS ranked WHERE row_num = 1;

Create a View with De-duplicated Keyword Traffic

This view can be reused to always return only the latest hourly keyword traffic, avoiding duplicates automatically.

CREATE OR REPLACE VIEW clean_sp_traffic_keywords AS SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY keyword_id, time_window_start ORDER BY from_iso8601_timestamp(received_at) DESC ) AS row_num FROM "amazon-marketing-stream"."sp-traffic" WHERE match_type IN ('BROAD', 'PHRASE', 'EXACT') ) AS deduped WHERE row_num = 1;

You can now query this view like so:

SELECT * FROM clean_sp_traffic_keywords WHERE DATE(from_iso8601_timestamp(time_window_start)) = DATE '2021-03-24';

Summary

These queries align with Amazon’s traffic model and help ensure that your analytics reflect only the most accurate, up-to-date records per keyword and hour, while filtering out true duplicates using idempotency_id.

By using the idempotency_id and understanding the hourly delta model of Amazon Marketing Stream, you can confidently distinguish between legitimate restatements and actual duplicates—ensuring your reporting is both accurate and reliable.

Handling Apparent Duplicates in Amazon Marketing Stream Data