I have a MySQL table that contains millions of entries.
Each entry must be processed at some point by a cron job.
I need to be able to quickly locate unprocessed entries, using an index.
So far, I have used the following approach: I add a nullable, indexed processedOn column that contains the timestamp at which the entry has been processed:
CREATE TABLE Foo (
...
processedOn INT(10) UNSIGNED NULL,
KEY (processedOn)
);
And then retrieve an unprocessed entry using:
SELECT * FROM Foo WHERE processedOn IS NULL LIMIT 1;
Thanks to MySQL's IS NULL optimization, the query is very fast, as long as the number of unprocessed entries if small (which is almost always the case).
This approach is good enough: it does the job, but at the same time I feel like the index is wasted because it's only ever used for WHERE processedOn IS NULL queries, and never for locating a precise value or range of values for this field. So this has an inevitable impact on storage space and INSERT performance, as every single timestamp is indexed for nothing.
Is there a better approach? Ideally the index would just contain pointers to the unprocessed rows, and no pointer to any processed row.
I know I could split this table into 2 tables, but I'd like to keep it in a single table.
What comes to my mind is to create a isProcessed column, with default value = 'N' and you set to 'Y' when processed (at the same time you set the processedOn column). Then create an index on the isProcessed field. When you query (with the where clause WHERE isProcessed = 'N'), it will respond very fast.
UPDATE: ALTERNATIVE with partitioning:
Create your table with partitions and define a field that will have just 2 values 1 or 0. This will create one partition for records with the field = 1 and another for records with field = 0.
create table test (field1 int, field2 int DEFAULT 0)
PARTITION BY LIST(field2) (
PARTITION p0 VALUES IN (0),
PARTITION p1 VALUES IN (1)
);
This way, if you want to query only the records with the field equal to one of the values, just do this:
select * from test partition (p0);
The query above will show only records with field2 = 0. And if you need to query all records together, you just query the table normally:
select * from test;
As far as I was able to understand, this will help you with your need.
I have multiple answers and comments on others' answers.
First, let me assume that the PRIMARY KEY for Foo is id INT UNSIGNED AUTO_INCREMENT (4 bytes) and that the table is Engine=InnoDB.
Indexed Extra column
The index for the extra column would be, per row, the width of the extra column and the PRIMARY KEY, plus a bunch of overhead. With your processedOn, you are talking about 8 bytes (2 INTs). With a simple flag, 5 bytes.
Separate table
This table would have only id for the unprocessed items. It would take extra code to populate it. It's size would stay at some "high-water mark". So, if there were a burst of unprocessed items, it would grow, but not shrink back. (Here's a rare case where OPTIMIZE TABLE is useful.) InnoDB requires a PRIMARY KEY, and id would work perfectly. So, one column, no extra index. It is a lot smaller than the extra index discussed above. Finding something to work on:
$id = SELECT id FROM tbl LIMIT 1; -- don't care which one
process it
DELETE FROM tbl where id = $id
2 PARTITIONs, one processed, one not
No. When you change a row from processed to unprocessed, the row must be removed from one partition and inserted into the other. This is done behind the scenes by your UPDATE ... SET flag = 1. Also, both partitions have the "high-water" issue -- they will grow but not shrink. And the space overhead for partitioning may be as much as the other solutions.
SELECT by PARTITION ... requires 5.6. Without that, you would need an INDEX, so you are back to the index issues.
Continual Scanning
This incurs zero extra disk space. (That's better than you had hoped for, correct?) And it is not too inefficient. Here's how it works. Here is some pseudo-code to put into your cron job. But don't make it a cron job. Instead, let it run all the time. (The reason will become clear, I hope.)
SELECT @a := 0;
Loop:
# Get a clump
SELECT @z := id FROM Foo WHERE id > @a ORDER BY id LIMIT 1000,1;
if no results, Set @z to MAX(id)
# Find something to work on in that clump:
SELECT @id := id FROM Foo
WHERE id > @a
AND id <= @z
AND not-processed
LIMIT 1;
if you found something, process it and set @z := @id
SET @a := @z;
if @a >= MAX(id), set @a := 0; # to start over
SLEEP 2 seconds # or some amount that is a compromise
Go Loop
Notes:
id. (It could be made simpler if there were no gaps.) (If the PK is not AUTO_INCREMENT, it is almost identical.)sleep is to be a 'nice guy'.Selective Index
MariaDB's dynamic columns and MySQL 5.7's JSON can index things, and I think they are "selective". One state would be to have the column empty, the other would be to have the flag set in the dynamic/json column. This will take some research to verify, and may require an upgrade.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With