I'm looking for some advice on how I might better optimize this query.
For each _piece_detail record that:
_scan record on (zip, zip_4,
zip_delivery_point, serial_number)mailing_groups (through a chain of relationships)first_scan_date_time that is greater than the MIN(scan_date_time) of the related _scan recordslatest_scan_date_time that is less than the MAX(scan_date_time) of
the related _scan recordsI will need to:
_piece_detail.first_scan_date_time to MIN(_scan.scan_date_time)
_piece_detail.latest_scan_date_time to MAX(_scan.scan_date_time)
Since I'm dealing with millions upon millions of records, I am trying to reduce the number of records that I actually have to search through. Here are some facts about the data:
job_id, so it seems to
make the most sense to run through these checks in the order of
_piece_detail.job_id, _piece_detail.piece_id._piece_detail (aside from scan_date_time)._piece_detail records belong to a mailing_group, but we don't know which ones these are until we run
through the full relationship of joins._piece_detail with a mailing_group._scan records per _piece_detail.Now, I am having a hell of a time finding a way to execute this in a decent way. I had originally started with something like this:
UPDATE _piece_detail
    INNER JOIN (
        SELECT _piece_detail.job_id, _piece_detail.piece_id, MIN(_scan.scan_date_time) as first_scan_date_time, MAX(_scan.scan_date_time) as latest_scan_date_time
        FROM _piece_detail
            INNER JOIN _container_quantity 
                ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id 
                AND _piece_detail.job_id = _container_quantity.job_id
            INNER JOIN _container_summary 
                ON _container_quantity.container_id = _container_summary.container_id 
                AND _container_summary.job_id = _container_quantity.job_id
            INNER JOIN _mail_piece_unit 
                ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id 
                AND _container_quantity.job_id = _mail_piece_unit.job_id
            INNER JOIN _header 
                ON _header.job_id = _piece_detail.job_id
            INNER JOIN mailing_groups 
                ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
            INNER JOIN _scan
                ON _scan.zip = _piece_detail.zip 
                AND _scan.zip_4 = _piece_detail.zip_4 
                AND _scan.zip_delivery_point = _piece_detail.zip_delivery_point 
                AND _scan.serial_number = _piece_detail.serial_number 
        GROUP BY _piece_detail.job_id, _piece_detail.piece_id, _scan.zip, _scan.zip_4, _scan.zip_delivery_point, _scan.serial_number
    ) as t1 ON _piece_detail.job_id = t1.job_id AND _piece_detail.piece_id = t1.piece_id 
SET _piece_detail.first_scan_date_time = t1.first_scan_date_time, _piece_detail.latest_scan_date_time = t1.latest_scan_date_time
WHERE _piece_detail.first_scan_date_time < t1.first_scan_date_time 
    OR _piece_detail.latest_scan_date_time > t1.latest_scan_date_time;
I thought that this may have been trying to load too much into memory at once and might not be using the indexes properly.
Then I thought that I might be able to avoid doing that huge joined subquery and add two leftjoin subqueries to get the min/max like so:
UPDATE _piece_detail
    INNER JOIN _container_quantity 
        ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id 
        AND _piece_detail.job_id = _container_quantity.job_id
    INNER JOIN _container_summary 
        ON _container_quantity.container_id = _container_summary.container_id 
        AND _container_summary.job_id = _container_quantity.job_id
    INNER JOIN _mail_piece_unit 
        ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id 
        AND _container_quantity.job_id = _mail_piece_unit.job_id
    INNER JOIN _header 
        ON _header.job_id = _piece_detail.job_id
    INNER JOIN mailing_groups 
        ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
    LEFT JOIN _scan fs ON (fs.zip, fs.zip_4, fs.zip_delivery_point, fs.serial_number) = (
        SELECT zip, zip_4, zip_delivery_point, serial_number
        FROM _scan
        WHERE zip = _piece_detail.zip 
            AND zip_4 = _piece_detail.zip_4 
            AND zip_delivery_point = _piece_detail.zip_delivery_point 
            AND serial_number = _piece_detail.serial_number
        ORDER BY scan_date_time ASC
        LIMIT 1
        )
    LEFT JOIN _scan ls ON (ls.zip, ls.zip_4, ls.zip_delivery_point, ls.serial_number) = (
        SELECT zip, zip_4, zip_delivery_point, serial_number
        FROM _scan
        WHERE zip = _piece_detail.zip 
            AND zip_4 = _piece_detail.zip_4 
            AND zip_delivery_point = _piece_detail.zip_delivery_point 
            AND serial_number = _piece_detail.serial_number
        ORDER BY scan_date_time DESC
        LIMIT 1
        )
SET _piece_detail.first_scan_date_time = fs.scan_date_time, _piece_detail.latest_scan_date_time = ls.scan_date_time
WHERE _piece_detail.first_scan_date_time < fs.scan_date_time 
    OR _piece_detail.latest_scan_date_time > ls.scan_date_time
These are the explains when I convert them to SELECT statements:
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
| id | select_type | table               | type   | possible_keys                                      | key           | key_len | ref                                                                                                                    | rows   | Extra                                        |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>          | ALL    | NULL                                               | NULL          | NULL    | NULL                                                                                                                   | 844161 | NULL                                         |
|  1 | PRIMARY     | _piece_detail       | eq_ref | PRIMARY,first_scan_date_time,latest_scan_date_time | PRIMARY       | 18      | t1.job_id,t1.piece_id                                                                                                  |      1 | Using where                                  |
|  2 | DERIVED     | _header             | index  | PRIMARY                                            | date_prepared | 3       | NULL                                                                                                                   |     87 | Using index; Using temporary; Using filesort |
|  2 | DERIVED     | _piece_detail       | ref    | PRIMARY,cqt_database_id,zip                        | PRIMARY       | 10      | odms._header.job_id                                                                                                    |   9703 | NULL                                         |
|  2 | DERIVED     | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity     | unique        | 14      | odms._header.job_id,odms._piece_detail.cqt_database_id                                                                 |      1 | NULL                                         |
|  2 | DERIVED     | _mail_piece_unit    | eq_ref | PRIMARY,company,job_id_mail_piece_unit             | PRIMARY       | 14      | odms._container_quantity.mpu_id,odms._header.job_id                                                                    |      1 | Using where                                  |
|  2 | DERIVED     | mailing_groups      | eq_ref | PRIMARY                                            | PRIMARY       | 27      | odms._mail_piece_unit.mpu_company                                                                                      |      1 | Using index                                  |
|  2 | DERIVED     | _container_summary  | eq_ref | unique,container_id,job_id_container_summary       | unique        | 14      | odms._header.job_id,odms._container_quantity.container_id                                                              |      1 | Using index                                  |
|  2 | DERIVED     | _scan               | ref    | PRIMARY                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number |      1 | Using index                                  |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
| id | select_type        | table               | type   | possible_keys                                                      | key           | key_len | ref                                                                                                                    | rows      | Extra                                                           |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
|  1 | PRIMARY            | _header             | index  | PRIMARY                                                            | date_prepared | 3       | NULL                                                                                                                   |        87 | Using index                                                     |
|  1 | PRIMARY            | _piece_detail       | ref    | PRIMARY,cqt_database_id,first_scan_date_time,latest_scan_date_time | PRIMARY       | 10      | odms._header.job_id                                                                                                    |      9703 | NULL                                                            |
|  1 | PRIMARY            | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity                     | unique        | 14      | odms._header.job_id,odms._piece_detail.cqt_database_id                                                                 |         1 | NULL                                                            |
|  1 | PRIMARY            | _mail_piece_unit    | eq_ref | PRIMARY,company,job_id_mail_piece_unit                             | PRIMARY       | 14      | odms._container_quantity.mpu_id,odms._header.job_id                                                                    |         1 | Using where                                                     |
|  1 | PRIMARY            | mailing_groups      | eq_ref | PRIMARY                                                            | PRIMARY       | 27      | odms._mail_piece_unit.mpu_company                                                                                      |         1 | Using index                                                     |
|  1 | PRIMARY            | _container_summary  | eq_ref | unique,container_id,job_id_container_summary                       | unique        | 14      | odms._header.job_id,odms._container_quantity.container_id                                                              |         1 | Using index                                                     |
|  1 | PRIMARY            | fs                  | index  | NULL                                                               | updated       | 1       | NULL                                                                                                                   | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
|  1 | PRIMARY            | ls                  | index  | NULL                                                               | updated       | 1       | NULL                                                                                                                   | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
|  3 | DEPENDENT SUBQUERY | _scan               | ref    | PRIMARY                                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number |         1 | Using where; Using index; Using filesort                        |
|  2 | DEPENDENT SUBQUERY | _scan               | ref    | PRIMARY                                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number |         1 | Using where; Using index; Using filesort                        |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
Now, looking at the explains generated by each, I really can't tell which is giving me the best bang for my buck. The first one shows fewer total rows when multiplying the rows column, but the second appears to execute a bit quicker.
Is there anything that I could do to achieve the same results while increasing performance through modifying the query structure?
Disable update of index while doing the bulk updates
ALTER TABLE _piece_detail DISABLE KEYS;
UPDATE ....;
ALTER TABLE _piece_detail ENABLE KEYS;
Refer to the mysql docs : http://dev.mysql.com/doc/refman/5.0/en/alter-table.html
EDIT: After looking at the mysql docs I pointed to, I see the docs specify this for MyISAM table, and is nit clear for other table types. Further solutions here : How to disable index in innodb
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With