I am looking for most efficient IDL code to replace the IDL matrix multiply (#) operator for a specific, diagonally-oriented (not diagonal, or diagonally-symmetric) matrix with 3 distinct values: unity on the diagonal; unity plus a delta to the right of the diagonal; unity minus the same delta to the left.
IDL (fixed; non-negotiable; sorry); image smear on shutter-less CCD imaging system.
a 1024x1024 matrix, "EMatrix," with unity on the diagonal; (1-delta) to the left of the diagonal; (1+delta) to the right; delta = 0.044.
another 1024x1024 matrix, Image
2014-09-16: See update below
Larger problem statement (of which the matrix multiply seems to be only the slowest part, and optimizing the whole routine would not hurt):
Section 9.3.1.2 (PDF page 47; internal page 34), and other documents in that same directory (sorry as a newbie I can only post two links)
That is now (2014-09-26) about an order of magnitude faster than the IDL # operator for a 1024x1024 matrix.
The naive operation is O(n^3) and performs about a billion (2^30) double-precision multiplies and about the same number of additions; Wikipedia further tells me Strassen's algorithm cuts that down to O(n^2.807), or ~282M+ multiplies for n=1024.
Breaking it down for a simple 3x3 case, say image and EMatrix are
   image        EMatrix
[ 0  1  2 ]   [ 1  p  p ]
[ 3  4  5 ] # [ m  1  p ]
[ 6  7  8 ]   [ m  m  1 ]
where p represents 1+delta (1.044) and m represents 1-delta (0.956).
Because of the repetition of m's and p's, there should be a simplification available: looking at the middle column for image, the result for the three rows should be
[1,4,7] . [1,p,p] = m*(0)   + 1*1 + p*(4+7)
[1,4,7] . [m,1,p] = m*(1)   + 1*4 + p*(7)
[1,4,7] . [m,m,1] = m*(1+4) + 1*7 + p*(0)
where . represents the dot (inner?) product i.e. [1,4,7].[a,b,c] = (1a + 4b + 7c)
Based on that, here's what I did so far:
The middle term is just the column itself, and the sums multiplied by m and p look a lot like cumulative sums of contiguous sections of the column, possibly reversed (for m), shifted one and with the first element set to zero.
i.e. for m:
; shift image down one:
imgminusShift01 = shift(image,0,1)
; zero the top row:
imgminusShift01[*,0] = 0
; Make a cumulative sum:
imgminusCum = total( imageShift01, 2, /cumulative)
For p, imgplusShift01Cum follows essentially the same path but with a rotate(...,7) before and after to flip things up and down.
Once we have those three matrices (the original image, with its offspring imgminusShift01Cum and imgplusShift01Cum), the desired result is
m * imgminusShift01Cum    +   1 * image   +   p * imgplusShift01Cum
To do the shifts and rotates, I use indices, shifted and rotated themselves and stored in a common block for subsequent calls which saves another 10-20%.
2014-09-16: See update below
And that gives a speedup of 5+.
I was expecting a bit more, because I think I am down to 3M multiplies and 2M additions, so maybe the memory allocation is the expensive part and I should be doing something like REPLICATE_INPLACE (my IDL is rusty - I have not done much since 6.4 and early 7.0).
Or maybe IDL's matrix multiplication is better than theory?
Alternate approaches and other thoughts:
Can we do something with the fact that the EMatrix is equal to unity plus a matrix with zeros on the diagonal and +/- delta in theupper and lower triangles?
By summing over columns I am accessing the data sequentially the "wrong" way, but would it actually save time to transpose Image first (it's only 8MB)?
Obviously choosing another language, getting a GPU array to help, or writing a DLM, would be other options, but let's keep this to IDL for now.
advTHANKSance (yes I am that old;-),
Brian Carcich 2014-07-20
I think we got it; my first pass was too complicated, with all the rotations.
To use Diego's notation, but transposed, I am looking for
[K] = [IMG] # [E]
Since that multiplies columns of [IMG] by rows of [E], there is no interaction between columns of [IMG], so for analysis we only need to look at one column of [IMG], the dot (inner) products of which, with the rows of [E], become one column of the result [K]. Expanding that idea to one column of a 3x3 matrix with elements x, y and z:
[Kx]   [x]   [1    1+d  1+d]
[Ky] = [y] # [1-d  1    1+d]
[Kz] = [z]   [1-d  1-d  1  ]
Looking specifically at element Ky above (one element of [K], corresponding to y in [IMG], broken down to a formula using only scalars):
Ky = x * (1-d)  +  y * 1  + z * (1+d)
Generalizing for any y in a column of any length:
Ky = sumx * (1-d)  +  y * 1  + sumz * (1+d)
Where scalars sumx and sumz are sums of all values above and below, respectively, y in that column of [IMG]. N.B. sumx and sumz are values specific to element y.
Rearranging:
Ky = (sumx + y + sumz) - d * (sumx - sumz)
Ky = tot - d * (sumx - sumz)
where
tot = (sumx + y + sumz) 
i.e. tot is the sum of all values in the column (e.g. in IDL: tot = total(IMG,2)).
So to this point I have basically duplicated Diego's work; the rest of this analysis converts that last equation for Ky into a form suitable for speedy evaluation in IDL.
Solving the tot equation for sumz:
sumz = tot - (y + sumx)
Substituting back into Ky:
Ky = tot - (sumx - (tot - (y + sumx)))
Ky = tot - ((2 * sumx) + y - tot)
Ky = tot + (tot - ((2 * sumx) + y)
Using sumxy to represent the sum of all values in the column from the top down to, and including, y (IDL: [SUMXY] = total([IMG],2,/CUMULATIVE))
sumxy = sumx + y
and
sumx = sumxy - y
Substituting back into Ky:
Ky = tot + (tot - ((2 * (sumxy - y)) + y)
Ky = tot + (tot + y - (2 * sumxy))
So if we can evaluate tot and sumxy for every element of [IMG], i.e. if we can evaluate the matrices [TOT] and [SUMXY], and we already have [IMG] as the matrix version of y, then it is a simple linear combination of those matrices.
In IDL, these are simply:
[SUMXY] = TOTAL([IMG],2,/CUMULATIVE)
[TOT] = [SUMXY][*,N-1] # REPLICATE(1D0,1,N)
I.e. [TOT] is the last row of [SUMXY], duplicated to form a matrix of N rows.
And the final code looks like this:
function lorri_ematrix_multiply,NxN,EMatrix
  NROWS = (size(NxN,/DIM))[1]
  SUMXY = TOTAL(NxN,2,/CUMULATIVE)
  TOT   = SUMXY[*,NROWS-1] # REPLICATE(1,NROWS,1d0)
  RETURN, TOT + ((EMatrix[1,0] - 1d0) * (TOT + NxN - (2d0 * SUMXY)))
end
which on our system is just shy of an order of magnitude faster than [IMG] # [E].
N.B. delta = (EMatrix[1,0] - 1d0)
Woo hoo!
I go straight on with math notation since I think that could be more clear than explaining by words:
      [ +1 +d +d +d +d ]   [ 1 0 0 0 0 ]       [ 0 1 1 1 1 ]        [ 0 0 0 0 0 ]
      [ -d +1 +d +d +d ]   [ 0 1 0 0 0 ]       [ 0 0 1 1 1 ]        [ 1 0 0 0 0 ]
[E] = [ -d -d +1 +d +d ] = [ 0 0 1 0 0 ] + d * [ 0 0 0 1 1 ] - d *  [ 1 1 0 0 0 ]
      [ -d -d -d +1 +d ] | [ 0 0 0 1 0 ]       [ 0 0 0 0 1 ]        [ 1 1 1 0 0 ]
      [ -d -d -d -d +1 ] | [ 0 0 0 0 1 ]       [ 0 0 0 0 0 ]        [ 1 1 1 1 0 ]
                         |
                         | [ 1 0 0 0 0 ]       [ 1 1 1 1 1 ]        [ 1 0 0 0 0 ]
                         | [ 0 1 0 0 0 ]       [ 0 1 1 1 1 ]        [ 1 1 0 0 0 ]
                         = [ 0 0 1 0 0 ] + d * [ 0 0 1 1 1 ] - d *  [ 1 1 1 0 0 ]
                         | [ 0 0 0 1 0 ]       [ 0 0 0 1 1 ]        [ 1 1 1 1 0 ]
                         | [ 0 0 0 0 1 ]       [ 0 0 0 0 1 ]        [ 1 1 1 1 1 ]
                         |     [ID]                [UT]                 [LT]
                         |
                         = [ID] + d * [UT] - d * [LT]
==>
[Img] # [E] = [E]##[Img] = [Img] + d * [UT] ## [Img] - d * [LT] ## [Img]
Now let's observe what is [LT] ## [Img]:
so an efficient way to compute it is:
TOTAL(Image, 2, /CUMULATIVE)
Analogous but a bit different is the result of [UT] ## [Img]:
so [UT] ## [Img] = REVERSE(TOTAL(REVERSE(Image,2), 2, /CUMULATIVE),2)
Then:
[Img] # [E] = [E]##[Img] = Image + d * (REVERSE(TOTAL(REVERSE(Image,2), 2, /CUMULATIVE),2) - TOTAL(Image, 2, /CUMULATIVE))
Note that we see that in the end the results on each column are coming only from data of that same column.
Now let's look to [K] = [UT] ## [Img] - [LT] ## [Img] and see how it looks like. If for each generic column we name the column elements r(1), r(2), r(3), .... ,r(i), ..., r(n) we can see that the corresponding [K] column elements R(i)
looks like this:
when n is even (that is the case of 1024)
row 1     => R(1) = +r(1)                 -r(1) -r(2) .... -r(n-1) -r(n) = -SUM(j=2, n, r(n))
row 2     => R(1) = +r(1) +r(2)           -r(1) -r(2) .... -r(n-1)       = -SUM(j=3, n-1, r(n))
row 3     => R(1) = +r(1) +r(2) +r(3)     -r(1) -r(2) .... -r(n-2)       = -SUM(j=4, n-2, r(n))
    :        :       :     :     :    :    :     :     :    :               :
row i (i < n/2) 
          => R(1) = +r(1) ... +r(i)       -r(1) -r(2) .... -r(n-i+1)     = -SUM(j=i+1, n-i+1, r(n))
    :        :       :     :     :    :    :     :     :    :               :
row n/2   => R(1) = +r(1) ... +r(n/2)     -r(1) -r(2) .... -r(n/2+1)     = -r(n/2+1) 
row n/2+1 => R(1) = +r(1) ... +r(n/2+1)   -r(1) -r(2) .... -r(n/2)       = +r(n/2+1) 
    :        :       :     :     :    :    :     :     :    :               :
row i (i > n/2)
          => R(1) = +r(1) ... + r(i)      -r(1) -r(2) .... -r(n-i+1)     = +SUM(j=n-i+2, i, r(n))
                                                                         = -R(n-i+1)
    :        :       :     :     :    :    :     :     :    :               :
row n     => R(1) = +r(1) ... + r(n)      -r(1)                          = +SUM(j=2, n, r(n))
                                                                         = -R(1)
when n is odd
  It is similar but R((n+1)/2) will be all 0. I will not go in detail with this.
What is important is that the matrix [K] = [UT] ## [Img] - [LT] ## [Img] is anti-symmetrical with respect to its horizontal halving line.
This means that we could calculate values only in half matrix (let's say the top part) and then populate the lower part by mirroring and changing the sign.
Note that efficient calculation of the top part can be done starting from R(n/2) = r(n/2+1) and going up reducing the index (R(n/2 -1), R(n/2 -2), R(n/2 -3)...) each time using R(i) = R(i+1) - r(i+1) - r(n-i+1) that can be well rewritten as R(i-1) = R(i) - r(i) - r(n-i+2).
As a matter of calculation involved this about halves the calculation done but as a matter of actual speed it needs to be tested in order to see if implementation with explicit operations are as quick as the internal implementations of built-in functions as TOTAL(/CUMULATIVE) and similar. There are good probabilities that it is quicker since we can here avoid also TRANSPOSE and/or REVERSE.
Let us know how it goes with a bit of profiling!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With