Correct way to do multithreaded computations in SBCL

Question

Context

I need to do computations using multi-threading. I use SBCL and portability is not a concern. I am aware that bordeaux-threads and lparallel exist but I want to implement something at the relatively low level provided by the specific SBCL threading implementation. I need maximal speed, even at the expense of readability/programming effort.

Example of computation intensive operation

We can define a sufficiently computation-intensive function that will benefit from multi-threading.

(defun intensive-sqrt (x)
  "Dummy calculation for intensive algorithm.
Approx 50 ms for 1e6 iterations."
  (let ((y x))
    (dotimes (it 1000000 t)
      (if (> y 1.01d0)
          (setf y (sqrt y))
          (setf y (* y y y))))
    y))

Mapping each computation to a thread and execute

Given a list of argument-lists llarg and a function fun, we want to compute nthreads results and return the list of results res-list. Here is what I came up with using the resources I found (see below).

(defmacro splice-arglist-help (fun arglist)
  "Helper macro.
   Splices a list 'arglist' (arg1 arg2 ...) into the function call of 'fun'
   Returns (funcall fun arg1 arg2 ...)"
  `(funcall ,fun ,@arglist))

(defun splice-arglist (fun arglist)
  (eval `(splice-arglist-help ,fun ,arglist)))

(defun maplist-fun-multi (fun llarg nthreads)
  "Maps 'fun' over list of argument lists 'llarg' using multithreading.
   Breaks up llarg and feeds it to each thread.
   Appends all the result lists at the end."
  (let ((thread-list nil)
        (res-list nil))
    ;; Create and run threads
    (dotimes (it nthreads t)
      (let ((larg-temp (elt llarg it)))
        (setf thread-list (append thread-list
                                  (list (sb-thread:make-thread
                                         (lambda ()
                                           (splice-arglist fun larg-temp))))))))
    ;; Join threads
    ;; Threads are joined in order, not optimal for speed.
    ;; Should be joined when finished ?
    (dotimes (it (list-length thread-list) t)
      (setf res-list (append res-list (list (sb-thread:join-thread (elt thread-list it))))))
    res-list))

nthreads does not necessarily match the length of llarg, but I avoid the extra book-keeping just for the example simplicity's sake. I also omitted the various declare used for optimization.

We can test the multi-threading and compare timings using :

(defparameter *test-args-sqrt-long* nil)
(dotimes (it 10000 t)
  (push (list (+ 3d0 it)) *test-args-sqrt-long*))

(time (intensive-sqrt 5d0))
(time (maplist-fun-multi #'intensive-sqrt *test-args-sqrt-long* 100))

The number of threads is quite high. I think the optimum would be to use as many threads as the CPU has, but I noticed the performance drop-off is barely noticeable in terms of time/operations. Doing more operations would involve breaking up the input lists into smaller pieces.

The above code outputs, on a 2 cores/4 threads machine :

Evaluation took:
  0.029 seconds of real time
  0.015625 seconds of total run time (0.015625 user, 0.000000 system)
  55.17% CPU
  71,972,879 processor cycles
  22,151,168 bytes consed

Evaluation took:
  1.415 seconds of real time
  4.703125 seconds of total run time (4.437500 user, 0.265625 system)
  [ Run times consist of 0.205 seconds GC time, and 4.499 seconds non-GC time. ]
  332.37% CPU
  3,530,632,834 processor cycles
  2,215,345,584 bytes consed

What's bugging me

The example I've given works very well and is robust (ie results don't get mixed up between threads, and I experience no crash). The speed gain is also there and the computations do use several cores/threads on the machines I've tested this code on. But there are a few things that I'd like an opinion/help on :

The use of the argument list llarg and larg-temp. Is this really necessary ? Is there any way to avoid manipulating potentially huge lists ?
Threads are joined in the order in which they are stored in the thread-list. I imagine this would not be optimal if operations each took a different time to complete. Is there a way to join each thread when it is finished, instead of waiting ?

The answers should be in the resources I already found, but I find the more advanced stuff hard to grapple with.

Resources found so far

http://www.sbcl.org/manual/#Threading
http://cl-cookbook.sourceforge.net/process.html
https://lispcookbook.github.io/cl-cookbook/process.html

Svante · Accepted Answer

Stylistic issues

The splice-arglist helpers are not needed at all (so I'll also skip details in them). Use apply in your thread function instead:
```
(lambda ()
  (apply fun larg-temp))
```
You don't need to (and should not) index into a list, because that is O(n) for each lookup—your loops are quadratic. Use dolist for simple side-effective loops, or loop when you have e. g. parallel iteration:
```
(loop :repeat nthreads
      :for args :in llarg
      :collect (sb-thread:make-thread (lambda () (apply fun args))))
```
For going over a list while creating a new list of the same length where each element is calculated from the corresponding element in the source list, use mapcar:
```
(mapcar #'sb-thread:join-thread threads)
```

Your function thus becomes:

(defun map-args-parallel (fun arglists nthreads)
  (let ((threads (loop :repeat nthreads
                       :for args :in arglists
                       :collect (sb-thread:make-thread
                                 (lambda ()
                                   (apply fun args))))))
    (mapcar #'sb-thread:join-thread threads)))

Performance

You are right that one usually creates only as many threads as ca. the number of cores available. If you test performance by always creating n threads, then joining them, then going to the next batch, you will indeed have not much difference in performance. That is because the inefficiency lies in creating the threads. A thread is about as resource intensive as a process.

What one usually does is to create a thread pool where the threads do not get joined, but instead reused. For that, you need some other mechanism to communicate arguments and results, e. g. channels (e. g. from chanl).

Note however that e. g. lparallel already provides a pmap function, and it does things right. The purpose of such wrapper libraries is not only to give the user (programmer) a nice interface, but also to think really hard about the problems and optimize sensibly. I am quite confident that pmap will be significantly faster than your attempt.

Correct way to do multithreaded computations in SBCL

Tags:

multithreading

common-lisp

sbcl

Context

Example of computation intensive operation

Mapping each computation to a thread and execute

What's bugging me

Resources found so far

Frank Jefferson

1 Answers

Stylistic issues

Performance

Svante

Recent Activity

Donate For Us

Correct way to do multithreaded computations in SBCL

Tags:

multithreading

common-lisp

sbcl

Context

Example of computation intensive operation

Mapping each computation to a thread and execute

What's bugging me

Resources found so far

Frank Jefferson

1 Answers

Stylistic issues

Performance

Svante

Related questions

Recent Activity

Donate For Us