Computational Complexity of SELECT DISTINC(column) FROM table on an indexed column

Question

Question

I'm not a comp sci major so forgive me if I muddle the terminology. What is the computational complexity for calling

 SELECT DISTINCT(column) FROM table

or

SELECT * FROM table GROUP BY column

on a column that IS indexed? Is it proportional to the number of rows or the number of distinct values in the column. I believe that would be O(1)*NUM_DISINCT_COLS vs O(NUM_OF_ROWS)

Background

For example if I have 10 million rows but only 10 distinct values/groups in that column visually you could simply count the last item in each group so the time complexity would be tied to the number of distinct groups and not the number of rows. So the calculation would take the same amount of time for 1 million rows as it would for 100. I believe the complexity would be

O(1)*Number_Of_DISTINCT_ELEMENTS

But in the case of MySQL if I have 10 distinct groups will MySQL still seek through every row, basically calculating a running some of each group, or is it set up in such a way that a group of rows of the same value can be calculated in O(1) time for each distinct column value? If not then I belive it would mean the complexity is

O(NUM_ROWS)

Why Do I Care?

I have a page in my site that lists stats for categories of messages, such as total unread, total messages, etc. I could calculate this information using GROUP BY and SUM() but I was under the impression this will take longer as the number of messages grow so instead I have a table of stats for each category. When a new message is sent or created I increment the total_messages field. When I want to view the states page I simply select a single row

SELECT total_unread_messages FROM stats WHERE category_id = x

instead of calculating those stats live across all messages using GROUP BY and/or DISINCT.

The performance hit either way is not large in my case and so this may seem like a case of "premature optimization", but it would be nice to know when I'm doing something that is or isn't scalable with regard to other options that don't take much time to construct.

Gordon Linoff · Accepted Answer

If you are doing:

select distinct column
from table

And there is an index on column, then MySQL can process this query using a "loose index scan" (described here).

This should allow the engine to read one key from the index and then "jump" to the next key without reading the intermediate keys (which are all identical). This suggests that the operation does not require reading the entire index, so it is, in general, less than O(n) (where n = number of rows in the table).

I doubt that finding the next value requires only one operation. I wouldn't be surprised if the overall complexity were something like O(m * log(n)), where m = number of distinct values.

Computational Complexity of SELECT DISTINC(column) FROM table on an indexed column

Tags:

time-complexity

sql

mysql

group-by

distinct

Usman Mutawakil

1 Answers

Gordon Linoff

Recent Activity

Donate For Us

Computational Complexity of SELECT DISTINC(column) FROM table on an indexed column

Tags:

time-complexity

sql

mysql

group-by

distinct

Usman Mutawakil

1 Answers

Gordon Linoff

Related questions

Recent Activity

Donate For Us