Optimize update query

I am looking for suggestions to optimize this query, which already runs for over an hour with about 300,000 rows in the table. We are using a reporting tool that requires the data to be in this shape when it's pulled, so redesigning the table structure is not an option. The table looks like this:

CREATE TABLE [datatable](
    [RowID] [int] IDENTITY(1,1) NOT NULL,
    [CampaignID] [int] NOT NULL,
    [CampaignName] [nvarchar](255) NULL,
    [Category] [nvarchar](255) NOT NULL,
    [PostID] [int] NOT NULL,
    [TopicName] [nvarchar](4000) NULL,
    [TopicFrequency] [int] NULL
)

Data is being constantly added to the table, so I have to periodically update topic frequencies. Here is my current query:

UPDATE  datatable
SET     TopicFrequency = b.TopicFrequency
FROM    datatable INNER JOIN
  (SELECT CampaignID, Category, TopicName, COUNT(DISTINCT PostID) AS TopicFrequency
    FROM datatable GROUP BY CampaignID, Category, TopicName) AS b 
    ON datatable.CampaignID = b.CampaignID 
    AND datatable.Category = b.Category 
    AND datatable.TopicName = b.TopicName

With TopicName being nvarchar(4000), I can't create an index on the field. Looking for ideas. Thanks.

asked Dec 29 '25 by Alex Polkhovsky
1 Answer

The general solution is to split your table into two or more tables, i.e. normalize the data structure. I think two more tables could be introduced: one for Campaigns and one for Topics (a rough sketch follows).
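A minimal sketch of what that could look like (the table and column names below are illustrative, not from the original schema):

-- Hypothetical normalized layout: campaign and topic attributes live in their
-- own tables, and the fact table keeps only integer keys plus per-post data.
CREATE TABLE Campaigns(
    CampaignID   int NOT NULL PRIMARY KEY,
    CampaignName nvarchar(255) NULL
)
GO
CREATE TABLE Topics(
    TopicID   int IDENTITY(1,1) PRIMARY KEY,
    TopicName nvarchar(4000) NULL
)
GO
CREATE TABLE Posts(
    PostID     int NOT NULL,
    CampaignID int NOT NULL REFERENCES Campaigns(CampaignID),
    TopicID    int NOT NULL REFERENCES Topics(TopicID),
    Category   nvarchar(255) NOT NULL
)
GO

TopicFrequency then becomes a simple COUNT(DISTINCT PostID) grouped by CampaignID, Category and TopicID, joining on small integer keys instead of an nvarchar(4000) column.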

BUT

For your present data structure you can create a uniqueidentifier or bigint computed column as a hash of the TopicName field, index it, and join on the hash instead of the string field. Here is an example with bigint:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

-- Folds the 20-byte SHA1 hash of the (upper-cased) input into a single BIGINT
-- so it can be stored in an indexable computed column.
CREATE FUNCTION [dbo].[HashString64SVF](@input NVARCHAR(4000))
RETURNS BIGINT
WITH SCHEMABINDING, RETURNS NULL ON NULL INPUT 
AS 
BEGIN
    RETURN
        CAST(SUBSTRING(HASHBYTES('SHA1', UPPER(@input)), 1, 8) AS BIGINT)   -- bytes 1-8
    ^   CAST(SUBSTRING(HASHBYTES('SHA1', UPPER(@input)), 9, 8) AS BIGINT)   -- bytes 9-16
    ^   CAST(SUBSTRING(HASHBYTES('SHA1', UPPER(@input)), 17, 4) AS BIGINT)  -- bytes 17-20
END
GO
ALTER TABLE datatable ADD TopicNameHash AS dbo.HashString64SVF(TopicName)
GO
CREATE INDEX NewIndexName ON DataTable(TopicNameHash, CampaignID, Category) INCLUDE(PostId)
GO
UPDATE  datatable
SET     TopicFrequency = b.TopicFrequency
FROM    datatable 
JOIN
  (SELECT CampaignID, Category, TopicNameHash, COUNT(DISTINCT PostID) AS TopicFrequency
    FROM datatable GROUP BY CampaignID, Category, TopicNameHash) AS b 
    ON datatable.CampaignID = b.CampaignID 
    AND datatable.Category = b.Category 
    AND datatable.TopicNameHash = b.TopicNameHash

AND

Create a primary key on the RowID column.
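For example (the constraint name here is just an illustration):

-- Clustered primary key on the identity column; gives the optimizer a narrow,
-- unique key to work with during the update.
ALTER TABLE datatable
ADD CONSTRAINT PK_datatable PRIMARY KEY CLUSTERED (RowID)
GO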

AND

Recreate the table in a way like this:

CREATE TABLE [datatable](
    [RowID] [int] IDENTITY(1,1) PRIMARY KEY,
    [CampaignID] [int] NOT NULL,
    [Category] [nvarchar](255) NOT NULL,
    [PostID] [int] NOT NULL,
    --uncomment if needed [TopicNameHash] AS dbo.HashString64SVF(TopicName),
    [TopicFrequency] [int] NULL,
    [CampaignName] [nvarchar](255) NULL,
    [TopicName] [nvarchar](4000) NULL
)

The main reason: if the nullable variable-length columns are at the end of the column list and contain many NULL values, SQL Server can save a little space per row and thus reduce IO.
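If you want to verify the difference, sp_spaceused gives a quick before/after comparison (just a sanity check, not part of the optimization itself):

-- Reports rows, reserved, data and index space for the table
EXEC sp_spaceused 'datatable'
GO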

answered Dec 31 '25 by Oleg Dok