I've been trying to use F2PY to interface an optimized Fortran code for vector and matrix multiplication with Python. To obtain a performance comparison useful for my purposes, I perform the same product inside a loop 100,000 times. A pure Fortran executable takes 2.4 s (ifort), while with F2PY it takes approximately 11 s. Just for reference, with NumPy it takes approximately 20 s. I ask both the Fortran and the Python sides to print the elapsed time around the loop, and with F2PY they both report 11 s, so the code is not losing time passing arrays. I tried to work out whether it is related to the way NumPy arrays are stored, but I can't identify the problem. Do you have any idea? Thanks in advance.
Fortran main program:
program Main
   implicit none
   save
   integer :: seed, i, j, k
   integer, parameter :: states = 15
   integer, parameter :: tessere = 400
   real, dimension(tessere,states,states) :: matrix
   real, dimension(states) :: vector
   real :: start, finish
   real :: prod(tessere)

   do i = 1, tessere
      do j = 1, states
         do k = 1, states
            matrix(i,j,k) = i + j + k
         end do
      end do
   end do

   do i = 1, states
      vector(i) = i
   end do

   call doubleSum(vector, vector, matrix, states, tessere, prod)
end program
Fortran subroutine:
subroutine doubleSum(ket, bra, M, states, tessere, prod)
   implicit none
   integer :: its, j, k, t
   integer :: states
   integer :: tessere
   real, dimension(tessere,states,states) :: M
   real, dimension(states) :: ket
   real, dimension(states) :: bra
   real, dimension(tessere) :: prod
   real, dimension(tessere,states) :: ctmp
   real :: start, finish

   call cpu_time(start)
   do t = 1, 100000
      ! ctmp(its,k) = sum over j of M(its,k,j) * ket(j)
      ctmp = 0.d0
      do k = 1, states
         do j = 1, states
            do its = 1, tessere
               ctmp(its,k) = ctmp(its,k) + M(its,k,j) * ket(j)
            end do
         end do
      end do
      ! prod(its) = sum over k of bra(k) * ctmp(its,k)
      do its = 1, tessere
         prod(its) = dot_product(bra, ctmp(its,:))
      end do
   end do
   call cpu_time(finish)

   print '("Time = ",f6.3," seconds.")', finish - start
end subroutine
Python script:
import numpy as np
import time
import cicloS

M = np.random.rand(400, 15, 15)
ket = np.random.rand(15)
#M = np.asfortranarray(M)
#ket = np.asfortranarray(ket)

start = time.time()
prod = cicloS.doublesum(ket, ket, M)
end = time.time()
print(end - start)
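(For reference, the pure-NumPy version behind the ~20 s figure is not shown; an equivalent double sum, prod[i] = sum over k, j of ket[k]*M[i,k,j]*ket[j], can be sketched with np.einsum. This is an illustration, not necessarily the exact NumPy code that was timed:)

import numpy as np
import time

M = np.random.rand(400, 15, 15)
ket = np.random.rand(15)

start = time.time()
for _ in range(100000):
    # prod[i] = sum_{k,j} ket[k] * M[i,k,j] * ket[j]
    prod = np.einsum('k,ikj,j->i', ket, M, ket)
print(time.time() - start)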
.pyf file (generated with f2py and edited):
! -*- f90 -*-
! Note: the context of this file is case sensitive.
python module cicloS
    interface
        subroutine doublesum(ket,bra,m,states,tessere,prod)
            real dimension(states),intent(in) :: ket
            real dimension(states),depend(states),intent(in) :: bra
            real dimension(tessere,states,states),depend(states,states),intent(in) :: m
            integer, optional,check(len(ket)>=states),depend(ket) :: states=len(ket)
            integer, optional,check(shape(m,0)==tessere),depend(m) :: tessere=shape(m,0)
            real dimension(tessere),intent(out) :: prod
        end subroutine doublesum
    end interface
end python module cicloS
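(The exact build command is not shown in the question; with the subroutine saved as sub.f90, a module like this is typically compiled with

python -m numpy.f2py -c cicloS.pyf sub.f90

which produces the cicloS extension module imported above.)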
The OP has indicated that the observed execution time difference between the standalone and F2PY-compiled versions of the code was due to different compilers and compiler flags being used.
In order to obtain consistent results, and thereby answer the question, it is necessary to ensure that F2PY uses the desired 1) compiler and 2) compiler flags.
A list of Fortran compilers available to F2PY on the target system can be displayed by executing e.g. python -m numpy.f2py -c --help-fcompiler. On my system, this produces (truncated):
Fortran compilers found:
--fcompiler=gnu95 GNU Fortran 95 compiler (7)
--fcompiler=intelem Intel Fortran Compiler for 64-bit apps (19.0.1.144)
You can instruct F2PY which of the available Fortran compilers to use by adding an appropriate --fcompiler flag to your compile command. To use ifort, for example (assuming you have already created and edited a cicloS.pyf file):
python -m numpy.f2py --fcompiler=intelem -c cicloS.pyf sub.f90
Note that the output from --help-fcompiler in the previous step also displays the default compiler flags (see e.g. compiler_f90) that F2PY defines for each available compiler. Again on my system, this was (truncated and simplified to the most relevant flags):

gnu95:   -O3 -funroll-loops
intelem: -O3 -xSSE4.2 -axCORE-AVX2,COMMON-AVX512

You can then specify preferred optimisation flags for F2PY with the --opt flag in your compile command (see also --f90flags in the documentation), which now becomes e.g.:
python -m numpy.f2py --fcompiler=intelem --opt='-O1' -c cicloS.pyf sub.f90
Compiling a standalone executable with ifort -O1 sub.f90 main.f90 -o main, and building the F2PY module with the --opt='-O1' command above, I get the following output:
./main
Time = 5.359 seconds.
python test.py
Time = 5.297 seconds.
5.316878795623779
Then, compiling a standalone executable with ifort -O3 sub.f90 main.f90 -o main, and building the F2PY module with the default flags (the first --fcompiler=intelem command, which applies -O3), I get these results:
./main
Time = 1.297 seconds.
python test.py
Time = 1.219 seconds.
1.209657907485962
This shows similar performance for the standalone and F2PY versions, and illustrates the influence of the compiler flags.
Although not the cause of the slowdown you observe, do note that F2PY is forced to make temporary copies of the arrays M and ket in your Python example, for two reasons:

1. The M that you pass to cicloS.doublesum() is a default NumPy array with C ordering (row-major). Since Fortran uses column-major ordering, F2PY makes a copy of the array. The commented-out np.asfortranarray() calls in your script would correct this part of the problem.
2. There is a mismatch between the real kinds on the Python side (the default 64-bit, double-precision float) and the Fortran side (real gives default precision, likely a 32-bit float), so copies are again made to account for this; this applies to ket as well.

You can get a notification whenever an array copy is made by adding a -DF2PY_REPORT_ON_ARRAY_COPY=1 flag (also in the documentation) to your F2PY compile command. In your case, array copies can be avoided completely by changing the dtype of your M and ket arrays in Python (i.e. M = np.asfortranarray(M, dtype=np.float32) and ket = np.asfortranarray(ket, dtype=np.float32)), or alternatively by defining the real variables in your Fortran code with the appropriate kind (e.g. add use, intrinsic :: iso_fortran_env, only : real64 to your subroutine and main program, and declare the reals with real(kind=real64)).
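Putting both fixes together, a copy-free version of the Python driver might look like this (a sketch that keeps the Fortran side at its default 32-bit real kind):

import numpy as np
import time
import cicloS

# Fortran-ordered, single-precision arrays: F2PY can pass these
# to the subroutine without making temporary copies
M = np.asfortranarray(np.random.rand(400, 15, 15), dtype=np.float32)
ket = np.asfortranarray(np.random.rand(15), dtype=np.float32)

start = time.time()
prod = cicloS.doublesum(ket, ket, M)
print(time.time() - start)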