How to perform data deduplication with SQL in Redshift.

25 Oct 2023

window

Data deduplication is the process of identifying and removing duplicate records from a database. This can be particularly useful in systems with large amounts of data, where duplicate entries can lead to inefficiencies and data integrity issues.

Amazon Redshift is a fully managed data warehousing service that allows you to analyze and query large volumes of data. In this tutorial, we will explore how to perform data deduplication using SQL in Redshift.

Table Structure

Before we dive into the deduplication process, let’s assume that we have a table named orders in our Redshift database, which contains the following columns:

order_id (unique identifier of the order)
customer_id (unique identifier of the customer who placed the order)
order_date (date of the order)
total_amount (total amount of the order)

Identify Duplicate Records

To identify duplicate records in the orders table, we can use the ROW_NUMBER() function along with a window function. The following SQL query demonstrates this approach:

WITH duplicate_orders AS (
    SELECT order_id, customer_id, order_date, total_amount,
           ROW_NUMBER() OVER (PARTITION BY order_id, customer_id, order_date, total_amount ORDER BY order_id) AS row_num
    FROM orders
)
SELECT order_id, customer_id, order_date, total_amount
FROM duplicate_orders
WHERE row_num > 1;

In this query:

The ROW_NUMBER() function assigns a unique number to each row within a partition. We define the partition by the columns (order_id, customer_id, order_date, total_amount) to identify duplicates based on these columns.
The PARTITION BY clause specifies the partitioning columns.
The ORDER BY clause determines the order in which the rows are numbered (in this case, by order_id).

The duplicate_orders table will contain the duplicate rows along with a row_num column indicating the sequence number of each row within the partition.

Remove Duplicate Records

Once we have identified the duplicate records, we can proceed to remove them from the orders table. We can do this by creating a new table, copying the non-duplicate records to it, and then replacing the original table with the new one.

The following SQL queries demonstrate this process:

-- Create a new table without duplicate records
CREATE TABLE deduplicated_orders AS
SELECT order_id, customer_id, order_date, total_amount
FROM orders
WHERE (order_id, customer_id, order_date, total_amount) NOT IN (
    SELECT order_id, customer_id, order_date, total_amount
    FROM duplicate_orders
);

-- Optional: Drop the original table
DROP TABLE orders;

-- Rename the new table to the original table name
ALTER TABLE deduplicated_orders RENAME TO orders;

In these queries:

We create a new table named deduplicated_orders using the CREATE TABLE statement. We only select the non-duplicate records from the orders table by filtering out the rows that exist in the duplicate_orders table.
If you want to replace the original table with the deduplicated table, you can drop the original table using the DROP TABLE statement.
Finally, we rename the deduplicated_orders table to orders using the ALTER TABLE statement.

Conclusion

Performing data deduplication is essential for maintaining data quality and improving query performance. Redshift provides powerful SQL capabilities that make it easy to identify and remove duplicate records from your database. By following the steps outlined in this tutorial, you can effectively perform data deduplication in Redshift.

Remember to backup your data before making any changes to your database.

References: