I'm working on a scraper that goes through websites and parses specific parts of them in Sidekiq workers. Imagine the scraper visits a page containing 10 elements I'm interested in, and each of them is queued in Sidekiq. At the moment I pass the element's source code as an argument, which is later loaded into Nokogiri. My question: is it a good idea to pass a huge string as an argument to a Sidekiq worker? The string is always between 77,000 and 80,000 characters long. Or should I store it in a temporary table and look up the specific record before loading it into Nokogiri?
I would recommend storing the string on S3 (or any other object store) and passing the returned URL to the job, which can then fetch the string and process it.
This way even a small Redis server can support many concurrent Sidekiq jobs without running out of RAM.
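A minimal sketch of this idea, using only the standard library. `BlobStore` is a hypothetical in-memory stand-in for S3 or a temporary table, and `ParseElementJob` is an illustrative plain Ruby class rather than a real Sidekiq worker; in a real app it would include the Sidekiq worker module and be enqueued with `perform_async(key)`.

```ruby
require 'securerandom'
require 'json'

# Hypothetical in-memory stand-in for S3 or a temporary table.
# A real store would persist the blob outside the Ruby process.
class BlobStore
  @blobs = {}

  class << self
    # Save the payload and return a short key that is cheap to enqueue.
    def put(content)
      key = SecureRandom.uuid
      @blobs[key] = content
      key
    end

    # Look up a previously stored payload (nil if it is missing).
    def get(key)
      @blobs[key]
    end

    # Remove the payload once the job no longer needs it.
    def delete(key)
      @blobs.delete(key)
    end
  end
end

# Illustrative job class; plain Ruby so the sketch runs without Sidekiq.
class ParseElementJob
  def perform(key)
    html = BlobStore.get(key)
    return 0 if html.nil? # blob missing or already cleaned up

    result = html.length # placeholder for the real Nokogiri parsing step
    BlobStore.delete(key) # clean up only after the work succeeded
    result
  end
end

html = '<div>' + ('x' * 80_000) + '</div>'
key  = BlobStore.put(html)

# Only the key would travel through Redis, not the 80k string.
small_payload = JSON.generate('args' => [key])
large_payload = JSON.generate('args' => [html])

puts small_payload.bytesize
puts large_payload.bytesize
puts ParseElementJob.new.perform(key)
```

Deleting the blob only after the work succeeds means a retried job can still find its input; the trade-off is that blobs from permanently failed jobs need a separate cleanup, for example an expiry policy on the store.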
As others have commented, it's best to keep your worker params as small as possible. Pass the minimum data your worker needs to accomplish its task. If you're using Sidekiq, you may also need to consider memory usage. See: sidekiq memory usage reset
Storing large string objects may become a memory problem, depending on concurrency. You can get a rough idea of a string's memory size in Ruby:
require 'securerandom'
require 'objspace'
str = SecureRandom.hex(40000) # generate a random 80,000-character string
ObjectSpace.memsize_of(str) #=> 80041 (bytes, roughly 80 KB, so well under 1 MB for your example)
UPDATE:
If you want to check the memory size of non-string data like a hash, you could use something like this (it measures the hash's string representation, so treat it as a rough estimate):
hash = { key: str }
ObjectSpace.memsize_of(hash.to_s) #=> 131112
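Since Sidekiq serializes job arguments to JSON before pushing them to Redis, the byte size of the serialized payload is arguably a closer estimate of what each queued job costs in Redis than the in-process object size. A rough sketch, assuming a simplified job hash (the `ParseElementJob` class name is hypothetical, and a real Sidekiq payload carries additional fields):

```ruby
require 'json'
require 'securerandom'

str = SecureRandom.hex(40_000) # random 80,000-character string

# Simplified, hand-built approximation of a job payload.
# This is an illustration, not output produced by Sidekiq itself.
job = { 'class' => 'ParseElementJob', 'args' => [str] }

payload = JSON.generate(job)

# bytesize counts bytes rather than characters, which matters
# when the HTML contains multibyte (non-ASCII) characters.
puts payload.bytesize
```

Multiplying this figure by the number of jobs you expect to sit in the queue at once gives a ballpark for the Redis memory the arguments alone would need.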