Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove Partial Duplicate Rows in SQL Server 2016

I have a data set that has some column where values match, but the rest of the column values do not. I need to delete duplicates where SubCategory of a lower level (Level2, Level3 and Level 4) "IS NOT NULL" but its corresponding "duplicate partner" (grouped by [SubCategory Level 1 ID], [Product Category] and [Product Name]) has the same lower level SubCategory - "IS NULL". Per table below I need to remove ID 2, 4, 6 and 9 (see highlighted in red font).

I've tried Dense_Rank, Rank and Row_Number functions with Partition By but that did not give me the disired output. Maybe I need to use their combination...

Eg.: RowID 1 and 2 are duplicates by [Product Category], [Product Name], [Category Level 1]. "Category Level 1" is just an ID of "Product Category". In need to remove RowID 2 because its corresponding duplicate partner RowID 1 has no "Category Level 3" assigned when RowID 2 has. Same logic applues to RowID 9 and 10, but at this time RowID 9 has "Category Level 2" where Row 10 does not. If both duplicates (RowID 1 and 2) would have "Category Level 3" assigned we would not need to delete any of them

Sample Dataset

IF OBJECT_ID('tempdb..#Category', 'U') IS NOT NULL
  DROP TABLE #Category;
GO

CREATE TABLE #Category 
(
    RowID INT NOT NULL,
    CategoryID INT NOT NULL,
    ProductCategory VARCHAR(100) NOT NULL,
    ProductName VARCHAR(100) NOT NULL,
    [SubCategory Level 1 ID] INT NOT NULL,
    [SubCategory Level 2 ID] INT NULL,
    [SubCategory Level 3 ID] INT NULL,  
    [SubCategory Level 4 ID] INT NULL
);

INSERT INTO #Category (RowID, CategoryID, ProductCategory, ProductName, [SubCategory Level 1 ID], [SubCategory Level 2 ID], [SubCategory Level 3 ID], [SubCategory Level 4 ID])
VALUES 
(1, 111, 'Furniture', 'Table', 200, 111, NULL, NULL),
(2, 234, 'Furniture', 'Table', 200, 234, 123, NULL),
(3, 122, 'Furniture', 'Chair', 200, 122, NULL, NULL),
(4, 122, 'Furniture', 'Chair', 200, 122, 32, NULL),
(5, 12, 'Auto', 'Trucks', 300, 766, 12, NULL),
(6, 3434, 'Auto', 'Trucks', 300, 322, 3434, 333),
(7, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(8, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(9, 300, 'Auto', 'Sedans', 300, 231, NULL, NULL),
(10, 300, 'Auto', 'Sedans', 300, NULL, NULL, NULL),
(11, 300, 'Auto', 'Cabriolet', 300, 456, 688, NULL),
(12, 300, 'Auto', 'Cabriolet', 300, 456, 976, NULL),
(13, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334),
(14, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334);


SELECT * FROM #Category; 
-- ADD YOU CODE HERE TO RETURN the following RowIDs: 2, 4, 6, 9
like image 746
Data Engineer Avatar asked Jun 22 '26 17:06

Data Engineer


1 Answers

this thread helped me a lot with understanding a different method to removing duplicate date. I want to thank the original contributors. I did however notice that the final solution is incomplete. The original poster wanted the results to return RowId's 2,4,6,9 however the ToInclude != 1 filter doesnt allow that. I am adding the code to complete the query by adding a where > 1 filter which will produce the intended result. See the code below:

;with DataToSelect
as
(
    SELECT *,
           DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID] 
                    ORDER BY
                        CASE 
                            WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
                            WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
                            WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
                            END) as [ToInclude]
    FROM #Category
)
SELECT *
FROM 
    DataToSelect
WHERE 
    ToInclude > 1
ORDER BY 
    RowID

This returns:

Results Table of Code

like image 170
Chadwick Avatar answered Jun 24 '26 08:06

Chadwick



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!