Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Insert distinct records

I'm trying to batch some insertion scripts (avoiding duplicates) and I've come across some tables that have no primary key (I know...I didn't create them and I cannot modify them). Basically what I've done is grabbed the rows I need, put them into a temporary table ([TempTable]), and updated some values in them.

Now I need to re-insert DISTINCT TOP values from [TempTable] into [OriginalTable] in batches. To do this, I imagine I would need a column in the temp table (which I've created...let's call it [ValuesInserted]), that specifies which columns were just inserted.

I would do an INSERT statement to put DISTINCT values into the original table, using TOP to batch it.

INSERT INTO [OriginalTable]
SELECT DISTINCT TOP (1000) *
FROM [TempTable]

Then I would UPDATE the temp table to have ValuesInserted set to 1 for the records that were just inserted. This is where I'm stuck:

UPDATE /*TOP (1000) - Doesn't work*/ [TempTable]
SET [ValuesInserted] = 1
???

Then I would DELETE those records from the temp table so that my next INSERT statement (using TOP) will not capture the previous set of records.

DELETE
FROM [TempTable]
WHERE [ValuesInserted] = 1

The main problem I'm having is that just running an UPDATE on just the TOP (1000) rows, doesn't capture all of the records that may have duplicates in [TempTable]. I also cannot perform an INNER JOIN on all columns on two copies of [TempTable] because this is being run on many different tables using dynamic SQL. Basically, the script needs to be generic (not specific to any table), but it should be assumed that there is no primary key.

The following generic sample captures the idea:

Val1    Val2    ValuesInserted
1       1       0
1       2       0
1       3       0
1       4       0
1       5       0
1       6       0
1       7       0
1       8       0
1       9       0
1       1       0             <--Duplicate
2       1       0
2       2       0
2       3       0
2       4       0
2       5       0
2       6       0
2       7       0
2       8       0
2       9       0
2       1       0             <--Duplicate
3       1       0
3       2       0
3       3       0
3       4       0
3       5       0
3       6       0
3       7       0
3       8       0
3       9       0
3       1       0             <--Duplicate
1       2       0             <--Duplicate
1       3       0             <--Duplicate

Doing an UPDATE TOP (5) on this above data set will only update the first 5 records:

Val1    Val2    ValuesInserted
1       1       1             <--Updated
1       2       1             <--Updated
1       3       1             <--Updated
1       4       1             <--Updated
1       5       1             <--Updated
1       6       0
1       7       0
1       8       0
1       9       0
1       1       0             <--Duplicate
2       1       0
2       2       0
2       3       0
2       4       0
2       5       0
2       6       0
2       7       0
2       8       0
2       9       0
2       1       0             <--Duplicate
3       1       0
3       2       0
3       3       0
3       4       0
3       5       0
3       6       0
3       7       0
3       8       0
3       9       0
3       1       0             <--Duplicate
1       2       0             <--Duplicate
1       3       0             <--Duplicate

I need to update any records that match the top 5 records like so:

Val1    Val2    ValuesInserted
1       1       1             <--Updated
1       2       1             <--Updated
1       3       1             <--Updated
1       4       1             <--Updated
1       5       1             <--Updated
1       6       0
1       7       0
1       8       0
1       9       0
1       1       1             <--Updated
2       1       0
2       2       0
2       3       0
2       4       0
2       5       0
2       6       0
2       7       0
2       8       0
2       9       0
2       1       0             <--Duplicate
3       1       0
3       2       0
3       3       0
3       4       0
3       5       0
3       6       0
3       7       0
3       8       0
3       9       0
3       1       0             <--Duplicate
1       2       1             <--Updated
1       3       1             <--Updated

If you can make your idea work on this sample, I can apply it to my specific case.

Am I approaching this completely wrong, or am I missing something? I'm looking for a solution that doesn't hog resources because the script is batched and is running on very large databases on high-impact servers.

The closest topic I could find on this was: Using Distinct in SQL Update. However, the answers given would not work when using TOP.

EDIT: This apparently wasn't clear at the beginning. The first thing I'm doing is grabbing rows from [OriginalTable] and putting them into [TempTable]. These rows are initially unique. However, I perform an update that modifies some of the values, yielding data like the sample above. From there, I need to grab DISTINCT rows and re-insert them into [OriginalTable].

like image 744
Chris H Avatar asked Dec 12 '25 09:12

Chris H


1 Answers

It looks like you're really going out of your way to make this as complicated as possible. I would just remove the duplicates from the temporary table in the first place. Or never INSERT them there, which is even better. Or build an actual ETL solution, perhaps with SSIS.

Those things said, what you're looking for is the OUTPUT clause, which can be added to any INSERT, UPDATE, or DELETE statement:

DECLARE @inserted_ids TABLE (val1, val2)

INSERT INTO dbo.OriginalTable (val1, val2)
OUTPUT INSERTED.val1, INSERTED.val2 INTO @inserted_ids
SELECT DISTINCT TOP 1000 val1, val2
FROM dbo.TempTable

DELETE TT
FROM @inserte_ids II
INNER JOIN dbo.TempTable TT ON
    TT.val1 = II.val1 AND
    TT.val2 = II.val2
like image 76
Tom H Avatar answered Dec 15 '25 01:12

Tom H