I was wondering if there is a way of speeding up (maybe via vectorization?) the conditional filling of huge sparse matrices (e.g. ~ 1e10 x 1e10). Here's the sample code where I have a nested loop, and I fill in a sparse matrix only if a certain condition is met:
% We are given the following cell arrays of the same size:
% all_arrays_1
% all_arrays_2
% all_mapping_arrays
N = 1e10; 
% The number of nnz (non-zeros) is unknown until the loop finishes
huge_sparse_matrix = sparse([],[],[],N,N);
n_iterations = numel(all_arrays_1);
for iteration=1:n_iterations
  array_1 = all_arrays_1{iteration};
  array_2 = all_arrays_2{iteration};
  mapping_array = all_mapping_arrays{iteration};
  n_elements_in_array_1 = numel(array_1);
  n_elements_in_array_2 = numel(array_2);
  for element_1 = 1:n_elements_in_array_1
    element_2     = mapping_array(element_1);
    % Sanity check:
    if element_2 <= n_elements_in_array_2
       item_1 = array_1(element_1);
       item_2 = array_2(element_2);      
       huge_sparse_matrix(item_1,item_2) = 1;
    end     
  end   
end
I am struggling to vectorize the nested loop. As far as I understand the filling a sparse matrix element by element is very slow when the number of entries to fill is large (~100M). I need to work with a sparse matrix since it has dimensions in the 10,000M x 10,000M range. However, this way of filling a sparse matrix in MATLAB is very slow.
Edits:
I have updated the names of the variables to reflect their nature better. There are no function calls.
Addendum:
This code builds the matrix adjacency for a huge graph. The variable all_mapping_arrays holds mapping arrays (~ adjacency relationship) between nodes of the graph in a local representation, which is why I need array_1 and array_2 to map the adjacency to a global representation.
I think it will be the incremental update of the sparse matrix, rather than the loop based conditional that will be slowing things down.
When you add a new entry to a sparse matrix via something like A(i,j) = 1 it typically requires that the whole matrix data structure is re-packed. The is an expensive operation. If you're interested, MATLAB uses a CCS data structure (compressed column storage) internally, which is described under the Data Structure section here. Note the statement:
This scheme is not effcient for manipulating matrices one element at a time
Generally, it's far better (faster) to accumulate the non-zero entries in the matrix as a set of triplets and then make a single call to sparse. For example (warning - brain compiled code!!):
% Inputs:
% N 
% prev_array and  next_array
% n_labels_prev and n_labels_next
% mapping
% allocate space for matrix entries as a set of "triplets"
ii = zeros(N,1);
jj = zeros(N,1);
xx = zeros(N,1);
nn = 0;
for next_label_ix = 1:n_labels_next
    prev_label     = mapping(next_label_ix);
    if prev_label <= n_labels_prev
        prev_global_label = prev_array(prev_label);
        next_global_label = next_array(next_label_ix);   
        % reallocate triplets on demand
        if (nn + 1 > length(ii))
            ii = [ii; zeros(N,1)];
            jj = [jj; zeros(N,1)];
            xx = [xx; zeros(N,1)];
        end   
        % append a new triplet and increment counter
        ii(nn + 1) = next_global_label; % row index
        jj(nn + 1) = prev_global_label; % col index
        xx(nn + 1) = 1.0;               % coefficient
        nn = nn + 1;
    end     
end
% we may have over-alloacted our triplets, so trim the arrays
% based on our final counter
ii = ii(1:nn);
jj = jj(1:nn);
xx = xx(1:nn);
% just make a single call to "sparse" to pack the triplet data
% as a sparse matrix object
sp_graph_adj_global = sparse(ii,jj,xx,N,N); 
I'm allocating in chunks of N entries at a time. Assuming that you know alot about the structure of your matrix you might be able to use a better value here.
Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With