Understanding Data Duplicates

Deep dive on duplicates, record hashing, de-duplication, and edge cases

A duplicate record is a record with the same values as another record in every field of a given dataset (table). Duplicates can be a confusing topic, given how difficult it is to identify and define a duplicate record by manually "eyeballing" a table.

To help with this topic, we have provided examples highlighting what constitutes a duplicate record and how Openbridge processes these cases.

What are duplicate records?

For example, let's assume we have a table called "customers" with the following columns: "customer_id, first_name, last_name, address, city, state, zip".

Example 1: Exact Duplicates

The following two records in our "customers" table would be considered duplicate records:

customer_id | first_name | last_name | address  | city  | state | zip
---------------------------------------------------------------
1 | John | Smith | 123 Main | Miami | FL | 12345
---------------------------------------------------------------
1 | John | Smith | 123 Main | Miami | FL | 12345

Notice how each column has the same information? These two records would be considered exact duplicates.

Example 2: Almost Duplicates

Using the same example, there is a slight change to the records. Can you spot the difference?

customer_id | first_name | last_name | address  | city  | state | zip
---------------------------------------------------------------
1 | John | Smith | 123 Main | Miami | FL | 12345
---------------------------------------------------------------
1 | John | Smith | 123 Main | Miami | FL | 12346

In the example above, the zip code is now 12346. Given the difference in each record, however slight, they are treated as distinct records. Why? In this case, we assume each to be different as there are no means to determine which is the duplicate. As such, both are preserved.

Identifying Duplicates: Autogenerated Record Hash

Trying to spot these variations can be difficult visually. The zip code example was simple enough, but doing this manually at scale is impossible. This is where a unique identifier can make a difference.

A unique record hash, a "fingerprint" or "checksum," can identify duplicate records. A unique record hash is a value generated based on the values of one or more fields in a record. The hash value is unique to that specific record, and any changes to the record's field values will result in a different hash value.
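
To make this concrete, here is a minimal sketch of how a record hash could be derived in SQL. This is purely illustrative, not Openbridge's actual hashing algorithm; it assumes a warehouse with an md5() function (such as Redshift or PostgreSQL) and uses the example "customers" table:

-- Illustrative sketch only: fingerprint a record by hashing all of its
-- column values. COALESCE handles NULLs, and the '|' separator keeps
-- adjacent values from running together.
SELECT md5(
         COALESCE(CAST(customer_id AS VARCHAR), '') || '|' ||
         COALESCE(first_name, '') || '|' ||
         COALESCE(last_name, '') || '|' ||
         COALESCE(address, '') || '|' ||
         COALESCE(city, '') || '|' ||
         COALESCE(state, '') || '|' ||
         COALESCE(CAST(zip AS VARCHAR), '')
       ) AS record_hash
FROM customers;

Any change to any column value, however small, produces a completely different hash.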

Openbridge includes a unique ID, or record hash, called ob_transaction_id each time we load data. The uniqueness of ob_transaction_id is what makes duplicates identifiable; without it, they could not be detected accurately (or easily).

Example 1: Exact Duplicates Using A Hash

Building on our prior example, we now include the ob_transaction_id hash.

customer_id | first_name | last_name | address  | city  | state | zip    | ob_transaction_id
---------------------------------------------------------------
1 | John | Smith | 123 Main | Miami | FL | 12345 | a1b2c3d4e5f6
---------------------------------------------------------------
1 | John | Smith | 123 Main | Miami | FL | 12345 | a1b2c3d4e5f6

As you can see, the calculated ID a1b2c3d4e5f6 is the same for both rows. This indicates that the two records for customer ID 1 are exact duplicates.

Example 2: Almost Duplicates Using A Hash

As we stated, a duplicate record is a record with the same values for all fields. Spotting what makes a record a duplicate can be difficult. The hash does the heavy lifting for you.

Due to a different value in the zip column, a new record hash is generated. This calculation is based on the values of all columns in the record, and the zip column is a part of that calculation.

customer_id | first_name | last_name | address  | city  | state | zip    | ob_transaction_id
---------------------------------------------------------------
1 | John | Smith | 123 Main | Miami | FL | 12345 | a1b2c3d4e5f6
---------------------------------------------------------------
1 | John | Smith | 123 Main | Miami | FL | 12346 | b1c2d3e4f5g6

Since the zip code differs, the calculated hash for the second record is different. Each record now has its own unique hash: a1b2c3d4e5f6 and b1c2d3e4f5g6.

NOTE: Openbridge will not load a record whose ob_transaction_id matches one already loaded. This avoids loading records calculated to be duplicates. See "Duplicate Caveats" below.

How To Check For Duplicate Transaction IDs

A query using ob_transaction_id can be used to determine the uniqueness of records in a given table. For example, to check for records with the same ob_transaction_id, you can run a simple SQL query like this:

SELECT ob_transaction_id, COUNT(*) 
FROM sp_inventory_ledger_summary_master
GROUP BY ob_transaction_id
HAVING COUNT(ob_transaction_id) > 1

If there are no duplicate transaction IDs, the query will return zero rows (as expected).
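
If the check does return rows, a follow-up query along these lines (a sketch using the same example table) pulls back the full records that share a transaction ID so they can be inspected:

-- Sketch: fetch the complete rows behind any repeated transaction IDs.
SELECT *
FROM sp_inventory_ledger_summary_master
WHERE ob_transaction_id IN (
  SELECT ob_transaction_id
  FROM sp_inventory_ledger_summary_master
  GROUP BY ob_transaction_id
  HAVING COUNT(*) > 1
)
ORDER BY ob_transaction_id;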

Duplicate Caveats

Deduplicating a table based on a unique hash can effectively identify and remove duplicate records, and in most cases it is a reliable mechanism for ensuring the hygiene of the data. However, there are edge cases where it can present challenges. Below we cover the two that most commonly come up: false positives and slowly changing data.

False Positives

In our earlier Example 1, the two records are exact duplicates, and in most cases records that match on every field are. However, in some edge cases, such as transactional data, records may appear to be duplicates but are not. Why does this occur? These cases usually arise when an upstream data source does not supply a unique transactional identifier, and deduplication can then discard essential data. Here is an example of a table with five seemingly duplicate orders, all with the same values in every field:

Order ID | Customer ID | Product ID | Quantity | Order Date
---------------------------------------------------------------
1 | 1234 | 5678 | 1 | 2021-01-01
1 | 1234 | 5678 | 1 | 2021-01-01
1 | 1234 | 5678 | 1 | 2021-01-01
1 | 1234 | 5678 | 1 | 2021-01-01
1 | 1234 | 5678 | 1 | 2021-01-01

In this example, each order has the same customer ID, product ID, quantity, and order date.

This appears to be five duplicate orders. In reality, the upstream data source failed to include an important piece of information: the order ID reflects a single order containing five items.

The data source should supply a unique order_item_id column to indicate that a single order included five items:

Order Item ID | Order ID | Customer ID | Product ID | Quantity | Order Date
---------------------------------------------------------------
1 | 1 | 1234 | 5678 | 2 | 2021-01-01
2 | 1 | 1234 | 5678 | 2 | 2021-01-01
3 | 1 | 1234 | 5678 | 2 | 2021-01-01
4 | 1 | 1234 | 5678 | 2 | 2021-01-01
5 | 1 | 1234 | 5678 | 2 | 2021-01-01

As you can see, the order_item_id provides critical context for the data. It tells us that Order ID 1 includes a basket of 5 items.
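
With order_item_id present, the rows can be rolled up safely rather than discarded. For example, a sketch like this (assuming a hypothetical orders table with the columns shown above) totals the line items and quantity for each order:

-- Sketch: aggregate line items per order instead of treating them as duplicates.
SELECT order_id,
       COUNT(order_item_id) AS items_in_order,
       SUM(quantity) AS total_quantity
FROM orders
GROUP BY order_id;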

If the upstream source omits a unique ID like order_item_id, the data will be deduplicated: rather than loading what appear to be five duplicate orders, the system will keep one and discard the rest.

If you see use cases like this, don't hesitate to get in touch with Openbridge support.

Slowly Changing Data

If the unique hash is generated based on multiple columns, and if any of those columns are updated, the hash value will change. The updated record will be considered a new record instead of a match of the previous one. Here is an example of how changes in a product title over time can manifest in the data. Let's say a company has a product table with the following columns: Product ID, Product Title, Price, and Timestamp.


On Jan 1, 2023, there was a product called Red Widget Max in your listings:

Product ID | Product Title | Price | Timestamp
---------------------------------------------------------------
1 | Red Widget Max | $10 | 2023-01-01

However, on March 1, 2023, a product manager decided to change the product title to Red Widget Ultimate:

Product ID | Product Title | Price | Timestamp
---------------------------------------------------------------
1 | Red Widget Ultimate | $10 | 2023-03-01

This will result in two rows for the same product ID:

Product ID | Product Title | Price | Timestamp
---------------------------------------------------------------
1 | Red Widget Max | $10 | 2023-01-01
1 | Red Widget Ultimate | $10 | 2023-03-01

We are storing the unique snapshot in time for Product ID 1, resulting in two records. This situation is common in cases where an upstream UI allows changes/updates to the data. For example, a product title or campaign name may change over time. Rather than rely upon transitory labels like names or titles in your queries, always leverage unique IDs like a product or campaign ID.
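
To illustrate why, consider filtering the example table by title versus by ID. A title-based filter silently misses the renamed record, while the ID-based filter does not:

-- Fragile: misses the record after the title changed to 'Red Widget Ultimate'.
SELECT * FROM Product_Table WHERE Product_Title = 'Red Widget Max';

-- Stable: returns every snapshot of the product, regardless of renames.
SELECT * FROM Product_Table WHERE Product_ID = 1;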

In cases like this, recency is typically the critical dimension. That means you are only interested in the most recent entry for Product ID 1. Using the timestamp allows you to get the latest entry:

SELECT Product_ID, MAX(Timestamp)
FROM Product_Table
GROUP BY Product_ID

This query will group all records by the Product ID column and then use the MAX function to find the most recent timestamp for each product. The result will be a table with two columns: the Product ID and the most recent timestamp.


You can also filter the query so it returns only the record with the single most recent timestamp in the entire table (across all products). For example:

SELECT Product_ID, Timestamp 
FROM Product_Table
WHERE Timestamp = (SELECT MAX(Timestamp) FROM Product_Table)

You can also use a subquery to join the table with the max timestamps for each product, for example:

SELECT Product_Table.*
FROM Product_Table
JOIN (
  SELECT Product_ID, MAX(Timestamp) AS Max_Timestamp
  FROM Product_Table
  GROUP BY Product_ID
) AS Max_Timestamps
ON Product_Table.Product_ID = Max_Timestamps.Product_ID
AND Product_Table.Timestamp = Max_Timestamps.Max_Timestamp

This query will join the Product_Table with the subquery that returns the max timestamp for each product ID. The join will match the product_id and timestamp of the original table to the max timestamp and product_id of the subquery, returning the most recent records for each product.
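
On warehouses that support window functions (Redshift, BigQuery, Snowflake, and most others), a ROW_NUMBER() approach is an alternative sketch that achieves the same result without the self-join:

-- Sketch: rank each product's records by recency and keep only the newest.
SELECT Product_ID, Product_Title, Price, Timestamp
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY Product_ID
           ORDER BY Timestamp DESC
         ) AS rn
  FROM Product_Table
) ranked
WHERE rn = 1;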

Amazon Vendor Central Example

Amazon Vendor Central data, like advertising data, may drift over time. As a result, you will want to use the most recent record for a given attribute. The query below returns the most recent ob_modified_date for each ASIN, date, and selling program. It can be refined to include other columns or conditions as well.

SELECT asin, ob_date, ob_selling_program, MAX(ob_modified_date) AS max_date
FROM `sp_vendor_sales_sourcing_master`
GROUP BY asin, ob_date, ob_selling_program
ORDER BY ob_date DESC;
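
As one example of such a refinement, a sketch like the following joins the max dates back to the table so the full rows are returned, not just the grouped columns:

-- Sketch: return the complete most-recent row per ASIN, date, and selling program.
SELECT t.*
FROM sp_vendor_sales_sourcing_master t
JOIN (
  SELECT asin, ob_date, ob_selling_program,
         MAX(ob_modified_date) AS max_date
  FROM sp_vendor_sales_sourcing_master
  GROUP BY asin, ob_date, ob_selling_program
) m
  ON t.asin = m.asin
 AND t.ob_date = m.ob_date
 AND t.ob_selling_program = m.ob_selling_program
 AND t.ob_modified_date = m.max_date;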