Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to optimize Matrix initialization and transposition to run faster using C

The dimension of this matrix is 40000*40000. I was supposed to consider spatial and temporal locality for program but I have no idea to optimize this code. It cost about 50+ seconds in my computer which is not acceptable for our group.The size of block is 500 now. Could someone help me to improve this code?

void      InitializeMatrixRowwise(){
    int i,j,ii,jj;
    double x;
    x = 0.0;
    for (i = 0; i < DIMENSION; i += BLOCKSIZE)
    {
        for (j = 0; j < DIMENSION; j += BLOCKSIZE)
        {
            for (ii = i; ii < i+BLOCKSIZE && ii < DIMENSION; ii++)
            {
                for (jj = j; jj < j+BLOCKSIZE && jj < DIMENSION; jj++)
                {
                    if (ii >= jj)
                    {
                        Matrix[ii][jj] = x++;
                    }
                    else
                        Matrix[ii][jj] = 1.0;
                 }
             }
         }
     }
 }




void        TransposeMatrixRowwise(){
int column,row,i,j;
double temp;
for (row = 0; row < DIMENSION; row += BLOCKSIZE)
{
    for (column = 0; column < DIMENSION; column += BLOCKSIZE)
    {
        for (i = row; i < row + BLOCKSIZE && i < DIMENSION; i++)
        {
            for (j = column; j < column + BLOCKSIZE && j < DIMENSION; j++)
            {
                if (i > j)
                {
                    temp = Matrix[i][j];
                    Matrix[i][j] = Matrix[j][i];
                    Matrix[j][i] = temp;
                 }
             }
         }
     }
 }
 }
like image 341
Cecilia Ren Avatar asked Dec 10 '25 01:12

Cecilia Ren


1 Answers

Your transpose function seems like it might be more complex than necessary and therefore perhaps slower than necessary. However, I created two versions of the code with timing inserted on the 'full size' (40k x 40k array, with 500 x 500 blocks), one using your transpose function and one using this much simpler algorithm:

static void TransposeMatrixRowwise(void)
{
    for (int row = 0; row < DIMENSION; row++)
    {
        for (int col = row + 1; col < DIMENSION; col++)
        {
            double temp = Matrix[row][col];
            Matrix[row][col] = Matrix[col][row];
            Matrix[col][row] = temp;
        }
    }
}

This looks much simpler; it has only two nested loops instead of four, but the timing turns out to be dramatically worse — 31.5s vs 14.7s.

# Simple transpose
# Count    = 7
# Sum(x1)  =  220.87
# Sum(x2)  = 6979.00
# Mean     =   31.55
# Std Dev  =    1.27 (sample)
# Variance =    1.61 (sample)
# Min      =   30.41
# Max      =   33.54

# Complex transpose
# Count    = 7
# Sum(x1)  =  102.81
# Sum(x2)  = 1514.00
# Mean     =   14.69
# Std Dev  =    0.82 (sample)
# Variance =    0.68 (sample)
# Min      =   13.59
# Max      =   16.21

The reason for the performance difference is almost certainly due to locality of reference. The more complex algorithm is working with two separate blocks of memory at a time, whereas the simpler algorithm is ranging over far more memory, leading to many more page misses, and the slower performance.

Thus, while you might be able to tune the transpose algorithm using different block sizes (it needn't be the same block size as was used to generate the matrices), there is little doubt based on these measurements that the more complex algorithm is more efficient.

I also did a check at 1/10th scale — 4k x 4k matrix, 50 x 50 block size — to ensure that the output from the transposition was the same (about 152 MiB of text). I didn't save the data at full scale with more than 100 times as much data. The times at 1/10th scale were dramatically better — less than 1/100th time — for both versions at the 1/10th scale:

< Initialization: 0.068667
< Transposition: 0.063927
---
> Initialization: 0.081022
> Transposition: 0.039169
4005c4005
< Print transposition: 3.901960
---
> Print transposition: 4.040136

JFTR: Testing on a 2016 MacBook Pro running macOS High Sierra 10.13.1 with 2.7 GHz Intel Core i7 CPU and 16 GB 2133 MHz LPDDR3 RAM. The compiler was GCC 7.2.0 (home-built). There was a browser running (but mostly inactive) and music playing in the background, so the machine wasn't idle, but I don't think those will dramatically affect the numbers.

like image 95
Jonathan Leffler Avatar answered Dec 11 '25 15:12

Jonathan Leffler



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!