Delete or Remove Duplicate Rows from a Table Using SQL

In most of the SQL Server forum i found that they are asking to eliminate or delete duplicate rows form their table. The most common reasons are import twice, Front-end application save same data for each refresh etc. Here i want to share with you how we can remove duplicate rows data from a Sql Server database table. One thing keep in mind that for different scenario you need different SQL query or script to remove or eliminate duplicate data. Scenarios like:

1. Table has primary key(PK) but others column data duplicated.
2. Table has no primary key.

With Primary Key:

If your table contains primary key but other column data is duplicated then it will more easy to remove than if the table has no primary key. Since your table has primary key then you can write a query in such a way that first grouping the data(column or column list which is duplicated) then collect the Max or Min ID & then start deletion. For example i have the below table structure & dataset:

Remove_Duplicate_data_from_a _table_SQL_Server

Look at the table, let you want to delete or eliminate data where both col & col2 were same. So the query should be:
SQL Server 2005 & Above:

WITH tbl_online AS (
SELECT *,RANK () OVER (
PARTITION BY COL,COL2 ORDER BY ID ) AS Rnum
FROM Duplicate)
DELETE FROM tbl_online WHERE Rnum > 1

SQL Server 2000:

DELETE FROM DUPLICATE WHERE ID NOT IN
(SELECT MIN(ID) FROM DUPLICATE GROUP BY COL,COL2)

--OR

DELETE FROM DUPLICATE WHERE ID IN
(SELECT MAX(ID) FROM DUPLICATE GROUP BY COL,COL2 HAVING COUNT(*)>1

Note:
* If you want to remove duplicate value for only COL then remove COL2 from group clause.
* For DateTime column, must format first like CONVERT(…) in Group By Clause. Click Here for Format details.

WITHOUT Primary key:
For exactly same data rows just use DISTINCT Like: SELECT DISTINCT * FROM TableName. Otherwise its a bit complex. Since the above table contains 4 columns where you want to remove or delete duplicate rows for COL & COL2 then the solution will be varried from SQL Server 2000 to 2005 & Above.

FOR SQL SERVER 2005 2008 2012:
SQL Server provide us few built in function like RANK(), ROW_NUMBER() which will ease our lives:

WITH tbl_online AS (
SELECT *,ROW_NUMBER () OVER (
PARTITION BY COL,COL2 ORDER BY COL ) AS Rnum
FROM Duplicate)
DELETE FROM tbl_online WHERE Rnum > 1

Note: Since ordering may degrade the query performance, for alternative, read the below post:
http://www.sql-server-performance.com/articles/per/ranking_functions_performance_p1.aspx

SQL Server 2000:
Since we have no built in function so we have to use temporary table. Filter distinct columns first then merge first or top value for duplicate set & store into the temporary table. Now delete the main table & reinsert from temporary table. Before executing the below query remove the ID column first.

SELECT DISTINCT T1.Col,T1.COL2(
SELECT TOP 1 Col3 FROM DUPLICATE
WHERE COL=T1.COL AND COL2=T1.COL2) COL3 INTO #tab2
FROM Duplicate T1
TRUNCATE TABLE DUPLICATE
INSERT INTO DUPLICATE SELECT * FROM #tab2
DROP TABLE #tab2
Posted in Database, SQL Server

Leave a Reply

Your email address will not be published. Required fields are marked *

AlphaOmega Captcha Classica  –  Enter Security Code
     
 

*