How to speed up Python code for running on a powerful machine? [closed]

Question

I've completed writing a multiclass classification algorithm that uses boosted classifiers. One of the main calculations consists of weighted least squares regression. The main libraries I've used include:

statsmodels (for regression)
numpy (pretty much everywhere)
scikit-image (for extracting HoG features of images)

I've developed the algorithm in Python, using Anaconda's Spyder.

I now need to use the algorithm to start training classification models. So I'll be passing approximately 7000-10000 images to this algorithm, each about 50x100, all in gray scale.

Now I've been told that a powerful machine is available in order to speed up the training process. And they asked me "am I using GPU?" And a few other questions.

To be honest I have no experience in CUDA/GPU, etc. I've only ever heard of them. I didn't develop my code with any such thing in mind. In fact I had the (ignorant) impression that a good machine will automatically run my code faster than a mediocre one, without my having to do anything about it. (Apart from obviously writing regular code efficiently in terms of loops, O(n), etc).

Is it still possible for my code to be sped up simply by virtue of running on a high-performance computer? Or do I need to modify it to make use of a parallel-processing machine?

DrV · Accepted Answer

The comments and Moj's answer give a lot of good advice. I have some experience in signal/image processing with Python, and have banged my head against the performance wall repeatedly, so I just want to share a few thoughts about making things faster in general. Maybe these help in figuring out possible solutions for slow algorithms.

Where is the time spent?

Let us assume that you have a great algorithm which is just too slow. The first step is to profile it to see where the time is spent. Sometimes the time is spent doing trivial things in a stupid way. It may be in your own code, or it may even be in the library code. For example, if you want to run a 2D Gaussian filter with a largish kernel, direct convolution is very slow, and even FFT may be slow. Approximating the filter with computationally cheap successive sliding averages may speed things up by a factor of 10 or 100 in some cases and give results which are close enough.
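As a rough illustration of the profiling step, here is a minimal sketch using the standard-library cProfile module. The two row-sum functions and the `profile_call` helper are made up for this example; they only stand in for whatever your own code does.

```python
import cProfile
import io
import pstats

import numpy as np


def slow_row_sums(image):
    """Deliberately slow: sums each row with a pure-Python loop."""
    totals = []
    for row in image:
        total = 0.0
        for value in row:
            total += value
        totals.append(total)
    return totals


def fast_row_sums(image):
    """Same result, computed inside NumPy's compiled code."""
    return image.sum(axis=1)


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((100, 50))  # stand-in for one 50x100 grayscale image
    for func in (slow_row_sums, fast_row_sums):
        _, report = profile_call(func, image)
        print(func.__name__)
        print(report)
```

For a whole script, running `python -m cProfile -s cumulative your_script.py` gives the same kind of report without touching the code (`your_script.py` is a placeholder name).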

If a lot of time is spent in some module/library code, you should check whether the algorithm itself is slow, or whether there is something slow in the library. Python is a great programming language, but it is slow at pure number crunching, which means most great libraries have compiled binary code doing the heavy lifting. On the other hand, if you can find suitable libraries, the penalty for using Python in signal/image processing is often negligible. Thus, rewriting the whole program in C does not usually help much.

Writing a good algorithm even in C is not always trivial, and sometimes the performance may vary a lot depending on things like the CPU cache. If the data is in the CPU cache, it can be fetched very fast; if it is not, the algorithm is much slower. This may introduce non-linear steps into the processing time depending on the data size. (Most people know this from virtual memory swapping, where it is more visible.) Because of this, it may be faster to solve 100 problems with 100 000 points each than 1 problem with 10 000 000 points.

One thing to check is the precision used in the calculation. In some cases float32 is as good as float64 but much faster. In many cases there is no speed difference.
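A quick way to check this for your own data is to time the same operation in both precisions and look at how far the results drift apart. This is only a sketch: the random image stack and the `mean_image` function are placeholders for your real data and your real calculation, and the timings depend entirely on the machine and the operation.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
images64 = rng.random((500, 100, 50))    # float64 is NumPy's default
images32 = images64.astype(np.float32)   # same values, half the memory


def mean_image(stack):
    """Average a stack of images along the first axis."""
    return stack.mean(axis=0)


if __name__ == "__main__":
    for name, stack in (("float64", images64), ("float32", images32)):
        seconds = timeit.timeit(lambda: mean_image(stack), number=20)
        print(f"{name}: {stack.nbytes / 1e6:.0f} MB, {seconds:.3f} s for 20 runs")
    # Check how much the lower precision changes the result.
    error = np.abs(mean_image(images64) - mean_image(images32)).max()
    print(f"largest absolute difference: {error:.2e}")
```

If the difference is far below what your classifier cares about and the float32 run is faster, the switch is worth considering; if the timings are the same, nothing is gained.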

Multi-threading

Python - did I mention? - is a great programming language, but one of its shortcomings is that in its basic form it runs a single thread. So, no matter how many cores you have in your system, the wall clock time stays the same: one of the cores is at 100 %, and the others spend their time idling. Making things parallel, with multiple threads or processes, may improve your performance by a factor of, e.g., 3 on a 4-core machine.

It is usually a very good idea if you can split your problem into small independent parts. It helps with many performance bottlenecks.

And do not expect technology to come to the rescue. If the code is not written to be parallel, it is very difficult for a machine to make it parallel.
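To give an idea of what "splitting into independent parts" can look like, here is a minimal sketch that uses worker processes from the standard library's concurrent.futures. The function names are hypothetical, and `describe_chunk` just computes a mean gray level as a stand-in for real per-image work such as feature extraction.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous, nearly equal parts."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks


def describe_chunk(images):
    """Placeholder per-image work: here just the mean gray level."""
    return [float(img.mean()) for img in images]


def describe_all(images, workers=4):
    """Process independent chunks in separate worker processes."""
    chunks = split_into_chunks(images, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(describe_chunk, chunks)
        # Flatten the per-chunk lists back into one list, in order.
        return [value for chunk in results for value in chunk]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = [rng.random((100, 50)) for _ in range(400)]
    print(len(describe_all(images)))
```

Each chunk has to be copied to its worker process, so this only pays off when the work per image is much heavier than the copy. Processes are used here instead of threads because pure-Python code in standard CPython does not run on several cores at once within one process.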

GPUs

Your machine may have a great GPU with maybe 1536 number-hungry cores ready to crunch everything you toss at them. The bad news is that writing GPU code is a bit different from writing CPU code. There are some slightly generic APIs around (CUDA, OpenCL), but if you are not accustomed to writing parallel code for GPUs, prepare for a steepish learning curve. On the other hand, it is likely someone has already written the library you need, and then you only need to hook into that.

With GPUs the sheer number-crunching power is impressive, almost frightening. We may talk about 3 TFLOPS (3 x 10^12 single-precision floating-point ops per second). The problem there is how to get the data to the GPU cores, because the memory bandwidth will become the limiting factor. This means that even though using GPUs is a good idea in many cases, there are a lot of cases where there is no gain.
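A back-of-the-envelope calculation shows why the memory bandwidth matters. The 3 TFLOPS figure is the one from above; the 200 GB/s memory bandwidth is purely an assumption for illustration, so substitute the number from your own GPU's data sheet.

```python
peak_flops = 3e12          # from the text: 3 TFLOPS, single precision
memory_bandwidth = 200e9   # ASSUMPTION: 200 GB/s, check your GPU's spec
bytes_per_value = 4        # one float32

# How many float32 values can be streamed from GPU memory per second.
values_per_second = memory_bandwidth / bytes_per_value

# How many operations must be done per loaded value to keep the cores busy.
flops_per_value = peak_flops / values_per_second

print(f"{values_per_second:.1e} values/s can be read from GPU memory")
print(f"{flops_per_value:.0f} operations per value needed to saturate the cores")
```

With these assumed numbers the GPU would need about 60 operations on every value it loads to stay busy. An algorithm that does only a few operations per pixel is limited by memory, not by arithmetic, and copying the data from the host to the GPU in the first place is slower still.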

Typically, if you are performing a lot of local operations on the image, the operations are easy to make parallel, and they fit a GPU well. If you are doing global operations, the situation is a bit more complicated. An FFT requires information from all over the image, and thus the standard algorithm does not work well with GPUs. (There are GPU-based algorithms for FFTs, and they sometimes make things much faster.)

Also, beware that making your algorithms run on a GPU binds you to that GPU. The portability of your code across OSes or machines suffers.

Buy some performance

Also, one important thing to consider is whether you need to run your algorithm once, once in a while, or in real time. Sometimes the solution is as easy as buying time on a larger computer. For a dollar or two an hour you can buy time on quite fast machines with a lot of resources. It is simpler and often cheaper than you would think. GPU capacity can also be bought easily for a similar price.

One possibly slightly under-advertised property of some cloud services is that in some cases the IO speed of the virtual machines is extremely good compared to physical machines. The difference comes from the fact that there are no spinning platters with the average penalty of half-revolution per data seek. This may be important with data-intensive applications, especially if you work with a large number of files and access them in a non-linear way.
