I'm building a multithreaded web crawler.
I launch a thread that gets the first n href links and parses some data. It should then add those links to a Visited list that other threads can access, and add the data to a global map that will be printed when the program is done. The thread then launches n new threads that all do the same thing.
How can I set up a global list of Visited sites that all threads can access, and a global map that all threads can also write to?
You can't share data between processes. That doesn't mean that you can't share information.
The usual way is to use a dedicated process (a server) in charge of this job: maintaining state, in your case the list of visited links.
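As a minimal sketch of that idea, assuming an Agent is an acceptable stand-in for the server process: the module name CrawlState and its functions are hypothetical, not something from the question. One process owns both the visited set and the results map, and every crawler process talks to it by message.

```elixir
defmodule CrawlState do
  # Hypothetical module: a single Agent process owns the visited set and the results map.
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{visited: MapSet.new(), results: %{}} end, name: __MODULE__)
  end

  # Marks `url` as visited. Returns true only for the first caller, because the
  # check and the update run inside the Agent as one step.
  def claim(url) do
    Agent.get_and_update(__MODULE__, fn state ->
      if MapSet.member?(state.visited, url) do
        {false, state}
      else
        {true, %{state | visited: MapSet.put(state.visited, url)}}
      end
    end)
  end

  # Stores the parsed data for `url` in the shared results map.
  def put_result(url, data) do
    Agent.update(__MODULE__, fn state ->
      %{state | results: Map.put(state.results, url, data)}
    end)
  end

  # Returns the whole results map, e.g. to print it when the crawl is done.
  def results, do: Agent.get(__MODULE__, & &1.results)
end
```

A crawler process would call CrawlState.claim(url) before fetching and skip the link when it gets false back. Because all requests are serialized through one process, two crawlers cannot both claim the same link.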
Another way is to use ETS (or Mnesia, the database built upon ETS), which is designed to share information between processes.
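A rough sketch of the ETS route, assuming a single public named table is enough for both the visited check and the collected data. The module name VisitedTable and the table name :visited_links are made up for illustration.

```elixir
defmodule VisitedTable do
  # Hypothetical table name
  @table :visited_links

  # Creates a public named table. The calling process owns it, so call this from
  # a long-lived process: the table is deleted when its owner exits.
  def create do
    :ets.new(@table, [:set, :public, :named_table, write_concurrency: true])
  end

  # :ets.insert_new/2 is atomic, so it returns true only for the first process
  # that inserts a given key.
  def claim(url), do: :ets.insert_new(@table, {url, nil})

  # Overwrites the placeholder with the parsed data for `url`.
  def put_data(url, data), do: :ets.insert(@table, {url, data})

  # Dumps the table into a map, e.g. to print it at the end of the crawl.
  def all, do: Map.new(:ets.tab2list(@table))
end
```

Unlike the server approach, crawler processes read and write the table directly, so there is no single process acting as a bottleneck.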
Just to clarify, Erlang/Elixir uses processes rather than threads.
Given a list of elements, a generic approach:
- A processed list is saved to ETS, DETS, Mnesia or some DB.
- Each Task checks the processed list so the Task is not unnecessarily repeated.
- Once all the tasks have returned or yielded, update the processed list in the DB.
- Tasks which crash or time out could be handled differently.
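The steps above could be sketched roughly as follows. This is one possible reading, not the answerer's exact code: the module Batch and the fetch argument are hypothetical, the processed list is passed in as a MapSet instead of being loaded from a DB, and Task.Supervisor.async_stream_nolink/4 is used so that a crashing task does not take the caller down.

```elixir
defmodule Batch do
  # `fetch` is a hypothetical one-argument function standing in for the real
  # download-and-parse step. `processed` is a MapSet of items already handled
  # (in a real crawler it could be loaded from ETS, DETS, Mnesia or a DB).
  def run(items, processed, fetch) do
    {:ok, sup} = Task.Supervisor.start_link()

    # Skip anything already processed so no Task is repeated unnecessarily.
    todo =
      items
      |> Enum.uniq()
      |> Enum.reject(&MapSet.member?(processed, &1))

    # Results come back in input order by default, so they can be zipped with `todo`.
    results =
      sup
      |> Task.Supervisor.async_stream_nolink(todo, fetch, timeout: 5_000, on_timeout: :kill_task)
      |> Enum.zip(todo)

    Supervisor.stop(sup)

    # Successful tasks yield {:ok, value}; crashed or timed-out tasks yield {:exit, reason}.
    ok = for {{:ok, data}, item} <- results, into: %{}, do: {item, data}
    failed = for {{:exit, _reason}, item} <- results, do: item

    # Only successful items are added to the processed set; the failed ones are
    # returned separately so they can be retried or logged.
    {ok, failed, MapSet.union(processed, MapSet.new(Map.keys(ok)))}
  end
end
```

The caller would persist the returned processed set and decide what to do with the failed items, for example retrying them in a later batch.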