Background
I have a MySQL test environment with a table which contains over 200 million rows. On this table have to execute two types of queries;
client_id and a list of sgtins, which can hold up to 
50.000 items, I need to know which sgtins are present in the table.client_id and a list of sgtins, which can hold up to 
50.000 items, I need to fetch the full row. (store, gtin...)The table can grow to 200+ millions record for a single 'client_id'.
Test environment
Xeon E3-1545M / 32GB RAM / SSD. InnoDB buffer pool 24GB. (Production will be a larger server with 192GB RAM)
Table
CREATE TABLE `sgtins` (
  `client_id` INT UNSIGNED NOT NULL,
  `sgtin` varchar(255) NOT NULL,
  `store` varchar(255) NOT NULL,
  `gtin` varchar(255) NOT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  INDEX (`client_id`, `store`, `sgtin`),
  INDEX (`client_id`),
  PRIMARY KEY (`client_id`,`sgtin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Tests
First I generated random sgtin values spread over 10 'client_id's to fill the table with 200 million rows.
I created a benchmark tool which executes various queries I tried. Also I used the explain plan to find out which performance best. This tool will read, for every test, new random data from the data I used to fill the database. To ensure every query is different.
For this post I will use 28 sgtins.
Temp table
CREATE TEMPORARY TABLE sgtins_tmp_table (`sgtin` varchar(255) primary key)
 engine=MEMORY;
Exist query
I use this query for find out if the sgtins exist. Also this is the fastest query I found. For 50K sgtins this query will take between 3 and 9 seconds.
-- cost = 17 for 28 sgtins loaded in the temp table.
SELECT sgtin
FROM sgtins_tmp_table
WHERE EXISTS 
  (SELECT sgtin FROM sgtins 
  WHERE sgtins.client_id = 4 
  AND sgtins.sgtin = sgtins_tmp_table.sgtin);

Select queries
-- cost = 50.60 for 28 sgtins loaded in the temp table. 50K not usable.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins_tmp_table, sgtins
WHERE sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;

-- cost = 64 for 28 sgtins loaded in the temp table.
SELECT sgtins.sgtin, sgtins.store, sgtins.timestamp
FROM sgtins
WHERE sgtins.client_id = 4
AND sgtins.sgtin IN ( SELECT sgtins_tmp_table.sgtin
 FROM sgtins_tmp_table);

-- cost = 50.60 for 28 sgtins loaded in the temp table.
SELECT sgtins_tmp_table.epc, sgtins.store
FROM sgtins_tmp_table, sgtins
WHERE exists (SELECT organization_id, sgtin FROM sgtins WHERE client_id = 4 AND sgtins.sgtin = sgtins_tmp_table.sgtin)
AND sgtins.client_id = 4
AND sgtins_tmp_table.sgtin = sgtins.sgtin;

Summary
The exist query is usable but the selects are to slow. What can I do about it? And any advice is welcome :)
1:- Check Indexes. 2:- There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement 3:- Limit Size of Your Working Data Set. 4:- Only Select Fields You select as Need. 5:- Remove Unnecessary Table and index 6:- Remove OUTER JOINS.
MongoDB is also considered to be the best database for large amounts of text and the best database for large data.
I would write your exists query like this:
SELECT stt.sgtin
FROM sgtins_tmp_table stt
WHERE EXISTS (SELECT 1
              FROM sgtins s
              WHERE s.client_id = 4 AND
                    s.sgtin = stt.sgtin
             );
For this query, you want an index on sgtins(sgtin, client_id).  
i would suggest to rewite your EXISTS SQL as corelated subqueries tends to optimize badly most off the time. 
The suggested query would be to use a INNER JOIN instead. 
SELECT filter.sgtin
FROM (SELECT '<value>' AS sgtin UNION ALL SELECT '<value>' ..) AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4
As most likely this is faster then using a temporary table. 
But your are dealing with 50K values so i would make sense to generate the needed derived table SQL with dynamic SQL directly from the temporary table. 
Also like i suggested in the chat. 
Making a index (sgtins, client_id) would most likely make more sense depending on the data selectivity which is not really clear. 
As that index might make your corelated subquery faster. 
Query
# Maybe also needed to be changed with 50 K 
# SET SESSION max_allowed_packet = ??; 
# needed for GROUP_CONCAT as if defualts to only 1024 
SET SESSION group_concat_max_len = @@max_allowed_packet;
SET @UNION_SQL = NULL;
SELECT
  CONCAT(
       'SELECT '
    ,  GROUP_CONCAT(
          CONCAT("'", sgtins_tmp_table.sgtin,"'", ' AS sgtin')
          SEPARATOR ' UNION ALL SELECT '
       )
  )
FROM
 sgtins_tmp_table
INTO
 @UNION_SQL;
SET @SQL = CONCAT("
SELECT filter.sgtin
FROM (",@UNION_SQL,") AS filter
INNER JOIN sgtins ON filter.sgtin = sgtins.sgtin WHERE sgtins.client_id = 4
");
PREPARE q FROM @SQL;
EXECUTE q;
see demo
Editted because of comments
A more ideal approach would be using a fixed table which you index and use CONNECTION_ID() to separate the search values. 
CREATE TABLE sgtins_filter (
    connection_id INT
  , sgtin varchar(255) NOT NULL
  , INDEX(connection_id, sgtin)
);
Then you can simply join between both tables
SELECT sgtins_filter.sgtin
FROM sgtins_filter
INNER JOIN sgtins
ON
    sgtins_filter.sgtin = sgtins.sgtin
  AND
    sgtins_filter.connection_id = CONNECTION_ID()
  AND 
    sgtins.client_id = 4; 
see demo
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With