I am trying to process millions of records from my table (size is about 30 GB) and I am currently doing it using paging (mysql 5.1.36). The query I use in my for loop is
select blobCol from large_table 
where name= 'someKey' and city= 'otherKey' 
order by name
LIMIT <pageNumber*pageSize>, <pageSize>
This works perfectly fine for about 500K records. I have a page size of 5000 that I am using and after page 100, the queries start slowing down dramatically. The first ~80 pages are extracted in a 2-3 seconds but after around page 130, each page takes about 30 seconds to retrieve, at least until page 200. One of my queries has about 900 pages and that would take too long.
The table structure is (type is MyISAM)
    name char(11)
    id int // col1 & col2 is a composite key
    city varchar(80) // indexed
    blobCol longblob
what can i do to speed it up? The explain for the query shows this
select_type: SIMPLE
possible_keys: city
key : city
type: ref
key_len: 242
ref: const
rows: 4293720
Extra: using where; using filesort
In case it helps, the my.cnf for my server (24 GB ram, 2 quad core procs) has these entries
  key_buffer_size = 6144M
  max_connections = 20
  max_allowed_packet = 32M
  table_open_cache = 1024
  sort_buffer_size = 256M
  read_buffer_size = 128M
  read_rnd_buffer_size = 512M
  myisam_sort_buffer_size = 128M
  thread_cache_size = 16
  tmp_table_size = 128M
  max_heap_table_size = 64M
Here is what I did, and reduced the total execution time by a factor of 10.
What I realized form the execution plan of my original query was that it was using filesort for sorting all results and ignoring the indexes. That is a bit of a waste.
My test database: 5 M records, 20 GB size. table structure same as in the question
Instead of getting blobCol directly in the first query, i first get the value of 'name' for beginning of every page. Run this query indefinitely until it returns 0 results. Every time, add the result to a list
SELECT name
FROM my_table
where id = <anyId> // I use the id column for partitioning so I need this here
order by name
limit <pageSize * pageNumber>, 1
Sine page number is not previously known, start with value 0 and keep incrementing until the query returns null. You can also do a select count(*) but that itself might take long and will not help optimize anything. Each query took about 2 seconds to run once the page number exceeded ~60.
For me, the page size was 5000 so I got a list of 'name' strings at position 0, 5001, 10001, 15001 and so on. The number of pages turned out to be 1000 and storing a list of 1000 results in memory is not expensive.
Now, iterate through the list and run this query
SELECT blobCol
FROM my_table
where name >= <pageHeader>
and name < <nextPageHeader>
and city="<any string>"
and id= 1
This will run N times, where N = size of list obtained previously. Since 'name' is the primary key col, and 'city' is also indexed, EXPLAIN shows that this calculation is performed in memory using the index.
Now, each query takes 1 second to run, instead of the original 30-40. So combining the pre-processing time of 2 seconds per page, total time per page is 3-4 seconds instead of 30-40.
If anyone has a better solution or if there is something glaringly wrong with this one, please let me know
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With