Find max size sub-matrix with non-zero element

Question

I have one matrix which row is gene name and col is sample name. The element of this matrix is 0 or 1 (1 means gene expressed in this sample, while 0 is not expressed). I want to find the max subset of gene expressed in max subset samples. In other words, I want to find largest matrix with only 1s by rearrange the rows and columns.

such as:

mat <- matrix(c(1,0,1,1,1,0,1,0,1),nrow = 3,byrow = T)
mat
      [,1] [,2] [,3]
[1,]    1    0    1
[2,]    1    1    0
[3,]    1    0    1
###first swap column2 and column3
mat1 <- mat
mat1[,2] <- mat[,3]
mat1[,3] <- mat[,2]
mat1
      [,1] [,2] [,3]
[1,]    1    1    0
[2,]    1    0    1
[3,]    1    1    0
###then swap row2 and row3
mat2 <- mat1
mat2[2,] <- mat1[3,]
mat2[3,] <- mat1[2,]
mat2
      [,1] [,2] [,3]
[1,]    1    1    0
[2,]    1    1    0
[3,]    1    0    1
###then the up-left is wanted matrix

jay.sf · Accepted Answer

See, how this generalizes,

mat[order(matrixStats::rowSums2(matrixStats::rowCumsums(mat))), 
    order(-matrixStats::colSums2(matrixStats::colCumsums(mat)))]
#      [,1] [,2] [,3]
# [1,]    1    1    0
# [2,]    1    1    0
# [3,]    1    0    1

However, in following matrix, it seems rather the negative rowsums,

set.seed(42)
mat <- matrix(rbinom(25, 1, .5), 5, 5)
mat[order(-matrixStats::rowSums2(matrixStats::rowCumsums(mat))), 
    order(-matrixStats::colSums2(matrixStats::colCumsums(mat)))]
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    1    0    1
# [2,]    1    1    1    1    0
# [3,]    1    1    1    0    0
# [4,]    1    1    0    1    0
# [5,]    0    0    0    1    1

but this might be a start.

You probably can't do without mathematical proof, if no method exists yet.

Cary Swoveland · Answer

The problem

I understand we are given a matrix M with S rows, corresponding to samples, and G columns, corresponding to genes, such that each element of the matrix, m_s_g equals 1 if gene g is in sample s, else it equals 0, for all s = 0,1,...,S-1 and g = 0,1,...,G-1.

We wish to identify collections of rows and columns of M that form a matrix of all ones such that the size of that matrix is maximal, size presumably being defined as the product of the numbers of rows and columns.

Method of solution

I propose solving S problems, where for each n = 0,1,...,S-1, the number of samples selected is fixed as n. For each fixed n maximizing the size of a resultant matrix of all ones (t) is equivalent maximizing the number of columns (genes) selected (c), since t = n*c. One then accepts the solution for which t is largest.

Optimization problem

A linear programming model is problem in which variables are to be assigned non-negative real values in such a way that a given linear function of those variables is maximized (or minimized), subject to inequality constraints. Each such constraint is expressed as a given linear function of the variables whose value must be less than or equal to a given value. Additional constraints may replace "less than or equal" with "equal".

An integer linear programming model is a linear programming model with the addition requirement that some or all of the variables are restricted to taking on integer values.

One type of integer linear programming model is where some or all variables are restricted to taking on the value zero or one. Such variables are called binary variables. Binary variables are often used to enforce logical requirements. Below I propose solving a series of linear programs with binary variables.

Optimization packages (some accessible online, some free) are available to solve all of these types of linear problems. Note, however, that imposition of requirements that some or all variables be restricted to integer values makes the problem much harder and more time-consuming to solve. A very large linear programming model might solve in less than a second whereas a linear program of comparable size with all binary variables might require years to solve.

Optimization model

As discussed earlier, n below equals a constant equal to the number of samples to be selected. One model is to be optimized for each n = 0,1,...,S-1.

First define the binary variables:

x_s = 1 if sample s is selected, else 0, for all s = 0,1,...,S-1
y_g = 1 if gene g is selected, else 0, for all g = 0,1,...,G-1

Now state the linear programming model with binary variables.

max y_g summed over g = 0,1,...,G-1

subject to:

x_s = 0 or 1 for all s = 0,1,...,S-1

y_g = 0 or 1 for all g = 0,1,...,G-1

x_s summed over s = 0,1,...,S-1 <= n

x_s + y_g <= 1 for all s and g for which m_s_g = 0

Notes

The last group of constraints prevent both the row corresponding to sample s from being selected (x_s = 1) and the column corresponding to gene g from being selected (y_g = 1) when m_s_g = 0. For such a pair s and g either sample s is selected and gene g is not or vice-versa, or neither is selected.

The quest for optimality should force

x_s summed over s = 0,1,...,S-1

to equal n, but if the left side is less than n in the optimal solution that is not a problem. I expressed the constraint as an inequality to hopefully reduce solution time.

The reason I felt obliged to solve a sequence of problems, each with the number species fixed is because the size of a resultant matrix equals the product of the numbers of its rows and columns, a non-linear function of the model's variables. Suppose, for example, I had introduced (linear) constraints

nrows = y_s summed over s = 0,1,...,S-1

ncols = y_g summed over g = 0,1,...,G-1

the objective would become

max nrows*ncols

which is non-linear, destroying the linear structure of the model.

Solving the optimization problem may not be convenient but brute-force enumeration of all combinations of samples and genes may not be feasible, and anything short of that would be a heuristic (or ad hoc) algorithm that may or may not produce an optimal or even "close-to-optimal" solution.

Some heuristic algorithms produce bounds that give the maximum difference between the value of the solution produced by the heuristic and the optimal value. The problem with heuristics that do not produce such bounds (or do not produce "tight" bounds) is that one never knows when they have produced a "good" solution or a "bad" solution.

Find max size sub-matrix with non-zero element

Tags:

algorithm

r

matrix

igraph

wutao

2 Answers

jay.sf

Cary Swoveland

Recent Activity

Donate For Us

Find max size sub-matrix with non-zero element

Tags:

algorithm

r

matrix

igraph

wutao

2 Answers

jay.sf

Cary Swoveland

Related questions

Recent Activity

Donate For Us