
Multiplying very large 2D arrays in Python

I have to multiply very large 2D arrays in Python around 100 times. Each matrix has 32000x32000 elements.

I'm using np.dot(X,Y), but each multiplication takes a very long time. Below is an example of my code:

import numpy as np

X = None
for i in range(100):
    if X is None:
        # First iteration: initialize the running product.
        X = generate_large_2darray()
    else:
        # Later iterations: multiply the running product by a new matrix.
        Y = generate_large_2darray()
        X = np.dot(X, Y)

Is there a much faster method?

Update

Here is a screenshot showing the htop interface. My Python script is using only one core. Also, after 3h25m only 4 multiplications have been completed.

[screenshot of htop output]

Update 2

I tried running:

import numpy.distutils.system_info as info
info.get_info('atlas')

but I got:

/home/francescof/.local/lib/python2.7/site-packages/numpy/distutils/system_info.py:564: UserWarning: Specified path /home/apy/atlas/lib is invalid.
  warnings.warn('Specified path %s is invalid.' % d)
{}

So I think ATLAS is not configured correctly.

For blas, by contrast, I just get {}, with no warnings or errors.

f_ficarola asked Oct 22 '25 00:10


2 Answers

As suggested by ali_m, using a BLAS library can speed up the operations. However, the problem on my system was a badly configured numpy. Here is the solution:

1) make sure you have all the required libraries (you can use ATLAS, OpenBLAS, etc.). I chose ATLAS in my case since it is directly supported in Ubuntu.

sudo apt-get install libatlas3gf-base libatlas-base-dev libatlas-dev

2) remove any previous numpy installations, e.g., pypm uninstall numpy (if you installed it using ActivePython)

3) reinstall numpy using pip: pip install numpy

4) make sure your atlas is correctly linked:

import numpy.distutils.system_info as info
info.get_info('atlas')

ATLAS version 3.8.4 built by buildd on Sat Sep 10 23:12:12 UTC 2011:
   UNAME    : Linux crested 2.6.24-29-server #1 SMP Wed Aug 10 15:58:57 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
   INSTFLG  : -1 0 -a 1
   ARCHDEFS : -DATL_OS_Linux -DATL_ARCH_HAMMER -DATL_CPUMHZ=1993 -DATL_USE64BITS -DATL_GAS_x8664
   F2CDEFS  : -DAdd_ -DF77_INTEGER=int -DStringSunStyle
   CACHEEDGE: 393216
   F77      : gfortran, version GNU Fortran (Ubuntu/Linaro 4.6.1-9ubuntu2) 4.6.1
   F77FLAGS : -fomit-frame-pointer -mfpmath=387 -O2 -falign-loops=4 -Wa,--noexecstack -fPIC -m64
   SMC      : gcc, version gcc (Ubuntu/Linaro 4.6.1-9ubuntu2) 4.6.1
   SMCFLAGS : -fomit-frame-pointer -mfpmath=387 -O2 -falign-loops=4 -Wa,--noexecstack -fPIC -m64
   SKC      : gcc, version gcc (Ubuntu/Linaro 4.6.1-9ubuntu2) 4.6.1
   SKCFLAGS : -fomit-frame-pointer -mfpmath=387 -O2 -falign-loops=4 -Wa,--noexecstack -fPIC -m64
{'libraries': ['lapack', 'f77blas', 'cblas', 'atlas'], 'library_dirs': ['/usr/lib/atlas-base/atlas', '/usr/lib/atlas-base'], 'define_macros': [('ATLAS_INFO', '"\\"3.8.4\\""')], 'language': 'f77', 'include_dirs': ['/usr/include/atlas']}
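
Once numpy is relinked, a quick way to see whether it helped is to time a small product before committing to the full-size one. The following is only a sketch under my own assumptions: the `time_dot` helper, the 500x500 size, and the repeat count are hypothetical choices, and `np.show_config()` is swapped in here as a general way to print numpy's build configuration rather than something taken from the steps above.

```python
import time
import numpy as np

def time_dot(n=500, repeats=3, seed=0):
    """Return (best wall-clock seconds, result shape) for an n x n np.dot.

    Hypothetical helper for illustration; sizes are arbitrary assumptions.
    """
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    b = rng.random((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        c = np.dot(a, b)
        best = min(best, time.perf_counter() - start)
    return best, c.shape

# Print the BLAS/LAPACK information numpy was built with.
np.show_config()

elapsed, shape = time_dot()
print("500x500 np.dot took %.4f s" % elapsed)
```

Running this before and after reinstalling numpy gives a rough comparison. Watching htop during the run also shows whether more than one core is being used.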
f_ficarola answered Oct 24 '25 14:10


Matrix multiplication is always expensive: the standard algorithm takes around O(n^3) operations. Performing this operation in NumPy is probably the fastest way to deal with it, short of writing your own matrix multiplier in a compiled language that is "closer to the metal" (like C), and even that would probably still be slower. I think you are doing this operation in the best way, but you must realize that a 32000x32000 matrix is very large to be performing any operations on, let alone matrix multiplication.
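
To put rough numbers on that O(n^3) cost, here is a back-of-the-envelope sketch. It assumes the standard dense algorithm and 8-byte float64 elements (numpy's default float type; the question does not state the dtype), and `matmul_cost` is a hypothetical helper name.

```python
def matmul_cost(n, bytes_per_element=8):
    """Estimate the cost of one dense n x n by n x n product.

    Assumes the standard algorithm and float64 storage (8 bytes per element).
    """
    flops = 2 * n ** 3                      # n^3 multiplies plus roughly n^3 additions
    bytes_per_matrix = n * n * bytes_per_element
    return flops, bytes_per_matrix

flops, nbytes = matmul_cost(32000)
print("floating-point operations per product: %.3e" % flops)
print("memory per float64 matrix: %.2f GB" % (nbytes / 1e9))
```

Under these assumptions each product needs about 6.6e13 floating-point operations, and each matrix occupies about 8.2 GB, so X, Y, and the result together need roughly 25 GB of memory.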

That was the bad news, but here is the good news. I don't know what type of data you are working with, but there can be, and often are, symmetries in the matrices in question which can greatly simplify the calculation. If your data is not entirely random there may be hope, but you will have to look into the actual structure of the matrices you are working with. I suggest reading about some of the "special matrices" to see if your data might fall into one of those categories. Any information you find on the category your data falls into should also discuss or cite efficient algorithms for managing expensive operations.
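
As one illustration of how structure can pay off, suppose one factor happened to be diagonal. Nothing in the question says this is the case; it is purely a hypothetical example, with a tiny size chosen for readability. Multiplying by a diagonal matrix reduces to scaling columns, which is O(n^2) work instead of O(n^3).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                       # tiny size, for illustration only
X = rng.random((n, n))
d = rng.random(n)           # diagonal entries of a hypothetical diagonal Y

# General approach: build the full diagonal matrix and multiply, O(n^3) work.
full = np.dot(X, np.diag(d))

# Structured approach: scale column j of X by d[j] via broadcasting, O(n^2) work.
fast = X * d

print(np.allclose(full, fast))
```

Other special cases (sparse, banded, triangular, and so on) have their own specialized routines, so the structure of your data determines which shortcut, if any, applies.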

user2645976 answered Oct 24 '25 15:10