 

Working with large data sets and ruby

Could REALLY use some help here. I'm struggling to display a dashboard built from a large data set.

With around 2k records it averages about 2 seconds.

The query in the MySQL console takes less than 3.5 seconds to return 150k rows. The same query in Ruby takes over 4 minutes from the time the query is issued until all the objects are ready.

Goal: optimize the data handling further before adding a cache server. Working with Ruby 1.9.2, Rails 3.0 and MySQL (mysql2 gem).

Questions:

  • Does working with Hashes hurt performance?
  • Should I first put everything in one primary hash then manipulate the data I need afterwards?
  • Is there anything else I can do to help with performance?

Rows in DB:

  • GasStations and US Census each have @ 150,000 records
  • Person has @ 100,000 records
  • Cars has @ 200,000 records
  • FillUps has @ 2.3 million records

Required for the dashboard (queries are based on time periods: last 24 hours, last week, etc.). All data is returned in JSON format for the JS front end.

  • Gas Stations, with FillUps and US Census data (zip code, Name, City, Population)
  • Top 20 cities with the most fill ups
  • Top 10 cars with Fill Ups
  • Cars grouped by how many times they filled up their tank
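Each of these dashboard items is an aggregate, so the database can do the heavy grouping and Ruby only has to sort and trim the small result. A minimal sketch of the Ruby side, using hypothetical pre-grouped rows (the sample data is made up for illustration):

```ruby
# Hypothetical rows as a GROUP BY city query would return them:
# one row per city, counts already summed by the database.
grouped_rows = [
  { 'city' => 'Austin',  'fill_ups' => 120, 'population' => 950_000 },
  { 'city' => 'Boston',  'fill_ups' => 300, 'population' => 650_000 },
  { 'city' => 'Chicago', 'fill_ups' => 210, 'population' => 2_700_000 }
]

# Top 20 cities by fill ups: sorting a few hundred grouped rows is cheap,
# unlike sorting 100k+ raw fill up records in Ruby.
top_cities = grouped_rows.sort_by { |r| -r['fill_ups'] }.first(20)

puts top_cities.map { |r| r['city'] }.inspect  # => ["Boston", "Chicago", "Austin"]
```

The same sort-and-trim works for the top-10 cars list once the database has grouped by make.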

Code (a 6-month sample; returns @ 100k+ records):

# For simplicity I removed the select clause I had; it trimmed out columns I don't need (updated_at, gas_station.created_at, etc.) instead of returning every column for each table.
@primary_data = FillUp.includes([:car, {:gas_station => :uscensus}]).where('fill_ups.created_at >= ?', 6.months.ago) # This would take @ 4+ minutes

# then tried

@primary_data = FillUp.find_by_sql('some long sql query...') # took longer than before.
# Note for others: the SQL query did some pre-processing for me, which added extra attributes to the result. The query in the DB console took < 4 seconds. Because of these extra attributes the Ruby call took longer, as if ActiveRecord was checking each row to map the attributes.

# then tried

# as seen at http://stackoverflow.com/questions/4456834/ruby-on-rails-storing-and-accessing-large-data-sets
MY_MAP = Hash[ActiveRecord::Base.connection.select_all('SELECT thingone, thingtwo FROM table').map { |one| [one['thingone'], one['thingtwo']] }]
# that took 23 seconds and pre-mapped the additional data I had been processing later, so much faster
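The Hash[...] trick above just builds an in-memory lookup table from key/value pairs; a self-contained sketch of the same pattern with inline rows (no database involved, the column names are the placeholders from the snippet above):

```ruby
# Rows shaped like select_all's result: an array of hashes keyed by column name.
rows = [
  { 'thingone' => 'a', 'thingtwo' => 1 },
  { 'thingone' => 'b', 'thingtwo' => 2 }
]

# Hash[pairs] turns [[key, value], ...] into { key => value, ... },
# so later lookups are O(1) instead of re-querying or re-scanning.
my_map = Hash[rows.map { |one| [one['thingone'], one['thingtwo']] }]

puts my_map['b']  # => 2
```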

# currently using the below, which takes @ 10 seconds
# Although this is faster, the query itself still takes only 3.5 seconds; parsing it into the hashes adds the overhead.
cars = {}
gasstations = {}
cities = {}
filled = {}

client = Mysql2::Client.new(:host => "localhost", :username => "root")
client.query("SELECT sum(fill_ups_grouped_by_car_id) as filled, fill_ups.car_id, cars.make as make, gas_stations.name as name,  ....", :stream => true, :as => :hash).each do |row|
  # this returns fill ups grouped by car, fill_ups.car_id, car make, gas station name, gas station zip, gas station city, city population
  if cities[row['city']]
    cities[row['city']]['fill_ups']  = (cities[row['city']]['fill_ups']  + row['filled'])
  else
    cities[row['city']] = {'fill_ups' => row['filled'], 'population' => row['population']}
  end
  if gasstations[row['name']]
    gasstations[row['name']]['fill_ups'] = (gasstations[row['name']]['fill_ups'] + row['filled'])
  else
    gasstations[row['name']] = {'city' => row['city'], 'zip' => row['zip'], 'fill_ups' => row['filled']}
  end
  if cars[row['make']]
    cars[row['make']] = (cars[row['make']] + row['filled'])
  else
    cars[row['make']] = row['filled']
  end
  if filled[row['filled']]
    filled[row['filled']] = (filled[row['filled']] + 1)
  else
    filled[row['filled']] = 1
  end
end
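The four if/else branches above can be collapsed with default-valued hashes (Hash.new(0) and ||=), which also sidesteps a nil + 1 crash the first time a key is seen. A sketch against a made-up row; in the real code the same block would consume the mysql2 stream:

```ruby
cars   = Hash.new(0)   # an unseen make starts at 0
filled = Hash.new(0)   # an unseen fill up count starts at 0
cities = {}
gasstations = {}

# Stand-in for rows from client.query(...); the values are invented for the example.
rows = [
  { 'city' => 'Boston', 'population' => 650_000, 'name' => 'Gulf',
    'zip' => '02110', 'make' => 'Honda', 'filled' => 3 }
]

rows.each do |row|
  # ||= creates the nested hash once, then every row just adds to it.
  city = (cities[row['city']] ||= { 'fill_ups' => 0, 'population' => row['population'] })
  city['fill_ups'] += row['filled']

  station = (gasstations[row['name']] ||= { 'city' => row['city'], 'zip' => row['zip'], 'fill_ups' => 0 })
  station['fill_ups'] += row['filled']

  cars[row['make']]     += row['filled']
  filled[row['filled']] += 1
end
```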

Have the following models:

class Person < ActiveRecord::Base
  has_many :cars
end

class Car < ActiveRecord::Base
  belongs_to :person
  belongs_to :uscensus, :foreign_key => :zipcode, :primary_key => :zipcode
  has_many :fill_ups
  has_many :gas_stations, :through => :fill_ups
end

class GasStation < ActiveRecord::Base
  belongs_to :uscensus, :foreign_key => :zipcode, :primary_key => :zipcode
  has_many :fill_ups
  has_many :cars, :through => :fill_ups
end

class FillUp < ActiveRecord::Base
  # log of every time a person fills up their gas tank
  belongs_to :car
  belongs_to :gas_station
end

class Uscensus < ActiveRecord::Base
  # basic data about an area, keyed by zip code
end
asked Mar 08 '26 12:03 by pcasa


1 Answer

I don't use RoR, but returning 100k rows for a dashboard is never going to be fast. I strongly suggest building and maintaining summary tables and running GROUP BYs in the database to summarize your data set before presentation.
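To make that concrete: a periodic rollup into a small per-city, per-day table keeps the dashboard queries tiny. A hedged sketch; the table and column names below are hypothetical, not from the question:

```ruby
# Rollup: run periodically (e.g. nightly via cron) to fold raw fill_ups
# into a small per-city, per-day summary table.
ROLLUP_SQL = <<-SQL
  INSERT INTO fill_up_daily_summaries (city, day, fill_ups)
  SELECT u.city, DATE(f.created_at), COUNT(*)
  FROM fill_ups f
  JOIN gas_stations g ON g.id = f.gas_station_id
  JOIN uscensuses u   ON u.zipcode = g.zipcode
  WHERE f.created_at >= CURDATE() - INTERVAL 1 DAY
  GROUP BY u.city, DATE(f.created_at)
SQL

# The dashboard then reads a handful of pre-summed rows
# instead of 100k+ raw fill up records.
TOP_CITIES_SQL = <<-SQL
  SELECT city, SUM(fill_ups) AS fill_ups
  FROM fill_up_daily_summaries
  WHERE day >= CURDATE() - INTERVAL 7 DAY
  GROUP BY city
  ORDER BY fill_ups DESC
  LIMIT 20
SQL
```

Each SQL string would be executed with client.query or ActiveRecord::Base.connection.execute; the summary table needs a migration of its own.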

answered Mar 11 '26 04:03 by Joshua Martell


