I have a very large dataset that's organized like this:
users = [
  {
    username: "Bill",
    gender: "Male",
    details: {
      city: "NY"
    }
  },
  {
    username: "Mary",
    gender: "Female",
    details: {
      city: "LA"
    }
  }
]
I need a quick way to search for multiple records by multiple values from multiple keys.
I have a list of dot-separated keys:
keys = ["gender", "details.city"]
I need to do something like this (written in pseudocode):
my_users = users.any? {|user|
  keys.each do |key|
    user.key == "NY"
  end
}
I know this is not going to work. One reason is that my keys are dot-separated, so I would either have to split each one into an array of keys, as in ['gender'] and ['details', 'city'], or convert the user hash into an object that supports dot access, with a method like:
require "json"
require "ostruct"
def to_o
  JSON.parse to_json, object_class: OpenStruct
end
I hope this method works the way you want:
def search(users, keys, value)
  users.select do |user|
    keys.any? do |key|
      user.dig(*key.split('.').map(&:to_sym)) == value
    end
  end
end

search(users, keys, 'NY')
#=> [{ :username => "Bill", :gender => "Male", :details => { :city => "NY" } }]
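If each key needs its own value, rather than one value checked against every key, a variant could take a hash of conditions instead. This is only a sketch: `search_by` and its `match:` parameter are hypothetical names, not part of the method above, and it assumes symbol keys and hash-only nesting as in the question's data.

```ruby
# Sample data in the same shape as the question's dataset.
users = [
  { username: "Bill", gender: "Male",   details: { city: "NY" } },
  { username: "Mary", gender: "Female", details: { city: "LA" } }
]

# Hypothetical helper: conditions maps a dot-separated key to the wanted value.
# match: :all requires every condition to hold, match: :any requires at least one.
def search_by(users, conditions, match: :all)
  users.select do |user|
    # One boolean per condition: does the value at this key path equal the wanted value?
    hits = conditions.map do |key, value|
      user.dig(*key.split(".").map(&:to_sym)) == value
    end
    match == :any ? hits.any? : hits.all?
  end
end

both   = search_by(users, { "gender" => "Male", "details.city" => "NY" })
either = search_by(users, { "gender" => "Female", "details.city" => "NY" }, match: :any)
```

Here `both` should contain only Bill's record, while `either` should contain Bill's (city matches) and Mary's (gender matches).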
For linear searching, demir's solution is a good one.
For the "must be quick" angle, you may find that an O(n) scan through your users array is too slow. To alleviate this, you may want to create an index:
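require "set"

class Index
  def initialize(dataset)
    @index = make_index(dataset)
  end

  # Returns every record that matches at least one of the given conditions.
  # Conditions with no indexed match contribute nothing instead of raising.
  def find(conditions = {})
    conditions.inject(Set.new) { |o, e| o | @index.fetch(e.join("."), []) }.to_a
  end

  private

  # Flattens a record into strings such as "gender.Male" or "details.city.NY".
  def make_keys(record, prefix = [])
    record.flat_map do |key, val|
      case val
      when Hash
        # Carry the full prefix down so nesting deeper than two levels works.
        make_keys val, prefix + [key]
      else
        (prefix + [key, val]).join(".")
      end
    end
  end

  # Maps each flattened "key.value" string to the records that contain it.
  def make_index(dataset)
    dataset.each_with_object({}) do |record, index|
      make_keys(record).each { |key| (index[key] ||= []) << record }
    end
  end
end
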
index = Index.new(users)
p index.find("gender" => "Male", "details.city" => "NY")
# => [{:username=>"Bill", :gender=>"Male", :details=>{:city=>"NY"}}]
Building the index takes O(n) time and extra memory, but it only has to happen once. After that, each condition in a search is a single O(1) hash lookup, plus the cost of collecting the matching records. If you perform a lot of searches after setting up the dataset once, something like this might be an option.
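Note that `find` above returns records matching any one of the conditions. If every condition must hold, one option is to intersect the per-condition lookups instead of taking their union. The following is a stripped-down, standalone sketch of that idea; `flatten_keys`, `build_index`, and `find_all` are hypothetical names, and it stores array positions rather than the records themselves so that the intersection compares integers.

```ruby
require "set"

users = [
  { username: "Bill", gender: "Male",   details: { city: "NY" } },
  { username: "Mary", gender: "Female", details: { city: "LA" } },
  { username: "John", gender: "Male",   details: { city: "LA" } }
]

# Flatten a record into "path.value" strings, e.g. "details.city.NY".
def flatten_keys(record, prefix = [])
  record.flat_map do |key, val|
    if val.is_a?(Hash)
      flatten_keys(val, prefix + [key])
    else
      (prefix + [key, val]).join(".")
    end
  end
end

# Hypothetical helper: maps each "path.value" string to a Set of record positions.
def build_index(dataset)
  index = {}
  dataset.each_with_index do |record, i|
    flatten_keys(record).each { |key| (index[key] ||= Set.new) << i }
  end
  index
end

# Hypothetical helper: returns the records that satisfy every condition.
def find_all(dataset, index, conditions)
  return [] if conditions.empty?

  positions = conditions
    .map { |key, value| index.fetch("#{key}.#{value}", Set.new) }
    .reduce(:&) # set intersection across all conditions
  positions.sort.map { |i| dataset[i] }
end

index = build_index(users)
males_in_la = find_all(users, index, { "gender" => "Male", "details.city" => "LA" })
```

With this sample data, `males_in_la` should contain only John's record, since Bill is in NY and Mary is not male.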