I am trying to write a function that groups by some columns in a very large table (millions of rows).  Is there any way to get find_each to work with this, or is it impossible given that I do not want to order by the id column?
The SQL of my query is:
SELECT derivable_type, derivable_id FROM "mytable" GROUP BY derivable_type, derivable_id ORDER BY "mytable"."id" ASC;
The rails find_each automatically adds the ORDER BY clause using a reorder statement.  I have tried changing the SQL to:
SELECT MAX(id) AS "mytable"."id", derivable_type, derivable_id FROM "mytable" GROUP BY derivable_type, derivable_id ORDER BY "mytable"."id" ASC;
but that doesn't work either.  Any ideas other than writing my own find_each function or overriding the private batch_order function in batches.rb?
There are at least two approaches to solve this problem:
I. Use subquery:
# query the table and select id, derivable_type and derivable_id
my_table_ids = MyTable
  .group("derivable_type, derivable_id")
  .select("MAX(id) AS my_table_id, derivable_type, derivable_id")
# use subquery to allow rails to use ORDER BY in find_each 
MyTable
  .where(id: my_table_ids.select('my_table_id'))
  .find_each { |row| do_something(row) } 
II. Write custom find_each function
rows = MyTable
  .group("derivable_type, derivable_id")
  .select("derivable_type, derivable_id")
find_each_grouped(rows, ['derivable_type', 'derivable_id']) do |row| 
  do_something(row)
end
def find_each_grouped(rows, columns, &block)
  offset = 0
  batch_size = 1_000
  loop do
    batch = rows
      .order(columns)
      .offset(offset)
      .limit(limit)
    batch.each(&block)
    break if batch.size < limit
    offset += limit
  end
end
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With