Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

select most recent values in very large table

I am an operations guy tasked with pulling data from a very large table. I'm not a DBA and cannot partition it or change the indexing. Table has nearly a billion entries, is not partitioned, and could probably be indexed "better". I need two fields, which we'll call mod_date and obj_id (mod_date is indexed). EDIT: I also add a filter for 'client' which I've blurred out in my screenshot of the explain plan.

My data: Within the group of almost a billion rows, we have fewer than 10,000 obj_id values to query across several years (a few might even be NULL). Some of the <10k obj_ids -- probably between 1,000-2,500 -- have more than 10 million mod_date values each. When the obj_ids have over a few million mod_dates, each obj_id takes several minutes to scan and sort using MAX(mod_date). The full result set takes over 12 hours to query and no one has made it to completion without some "issue" (locked out, unplugged laptop, etc.). Even if we got the first 50 rows returned we'd still need to export to Excel ... it's only going to be about 8,000 rows with 2 columns but we can never make it to the end.

So here is a simplified query I'd use if it were a small table:

select MAX(trunc(mod_date,'dd')) as last_modified_date, obj_id
from my_table
where client = 'client_name'
and obj_type_id = 12
group by obj_id;

Cardinality is 317917582, "Cost" is 12783449

explain plan

The issue: The issue is the speed of the query with such a large unpartitioned table, given the current indexes. All the other answers I've seen about "most recent date" tend to use MAX, possibly in combination with FIRST_VALUE, which seem to require a full scan of all rows in order to sort them and then determine which is the most recent.

I am hoping there is a way to avoid that, to speed up the results. It seems that Oracle (I am using Oracle SQL developer) should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?

Even with such a large table, obj_ids having fewer than 10,000 mod_dates can return the MAX(mod_date) very quickly (seconds or less). The issue we are having is the obj_ids having the most mod_dates (over 10 million) take the longest to scan and sort, when they "should" be the quickest if I could get Oracle to start looking at the most recent first … because it would find a recent date quickly and move on!

like image 280
DevOps Guy Avatar asked Sep 02 '25 15:09

DevOps Guy


2 Answers

First, I'd say its a common misconception that in order to make a query run faster, you need an index (or better indexes). Full table scan makes sense when you're pulling more than 10% of the data (rough estimate, depends on multiblock read count, block size, etc).

My advice is to setup a materialized view (MY_MV or whatever) that simply does the group by query (across all ids). If you need to limit the ids to a 10k subset, just make sure you full scan the table (check explain plan). You can add a full hint if needed (select /*+ full(t) */ .. from big_table t ...)

Then do:

dbms_mview.refresh('MY_MV','C',atomic_refresh=>false);

Thats it. No issues with a client only returning the first x rows and when you go to pull everything it re-runs the entire query (ugh). Full scans are also easier to track in long opts (harder to tell what progress you've made if you are doing nested loops on an index for example).

Once its done, dump entire MV table to a file or whatever you need.

like image 108
tbone Avatar answered Sep 05 '25 05:09

tbone


tbone has it right I think. Or, if you do not have authority to create a materialized view, as he suggests, you might create a shell script on the database server to run your query via SQL*Plus and spool the output to a file. Then, run that script using nohup and you shouldn't need to worry about laptops getting turned off, etc.

But I wanted to explain something about your comment:

Oracle should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?

That would be a horrible way for Oracle to run your query, given the indexes you have listed. Let's step through it...

There is no index on obj_id, so Oracle needs to do a full table scan to make sure it gets all the distinct obj_id values.

So, it starts the FTS and finds obj_id 101. It then says "I need max(mod_date) for 101... ah ha! I have an index!" So, it does a reverse index scan. For each entry in the index, it looks up the row from table and checks it to see if it is obj_id 101. If the obj_id was recently updated, we're good because we find it and stop early. But if the obj_id has not been updated in a long time, we have to read many index entries and, for each, access the table row(s) to perform the check.

In the worst case -- if the obj_id is one of those few you mentioned where max(mod_date) will be NULL, we would use the index to look up EVERY SINGLE ROW in your table that has a non-null mod_date.

Doing so many index lookups would be an awful plan if it did that just once, but you're talking about doing it for several old or never-updated obj_id values.

Anyway, it's all academic. There is no Oracle query plan that will run the query that way. It's for good reason.

Without better indexing, you're just not going to improve upon a single full table scan.

like image 32
Matthew McPeak Avatar answered Sep 05 '25 05:09

Matthew McPeak