I have one matrix which row is gene name and col is sample name. The element of this matrix is 0 or 1 (1 means gene expressed in this sample, while 0 is not expressed). I want to find the max subset of gene expressed in max subset samples. In other words, I want to find largest matrix with only 1s by rearrange the rows and columns.
such as:
mat <- matrix(c(1,0,1,1,1,0,1,0,1),nrow = 3,byrow = T)
mat
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 1 1 0
[3,] 1 0 1
###first swap column2 and column3
mat1 <- mat
mat1[,2] <- mat[,3]
mat1[,3] <- mat[,2]
mat1
[,1] [,2] [,3]
[1,] 1 1 0
[2,] 1 0 1
[3,] 1 1 0
###then swap row2 and row3
mat2 <- mat1
mat2[2,] <- mat1[3,]
mat2[3,] <- mat1[2,]
mat2
[,1] [,2] [,3]
[1,] 1 1 0
[2,] 1 1 0
[3,] 1 0 1
###then the up-left is wanted matrix
See, how this generalizes,
mat[order(matrixStats::rowSums2(matrixStats::rowCumsums(mat))),
order(-matrixStats::colSums2(matrixStats::colCumsums(mat)))]
# [,1] [,2] [,3]
# [1,] 1 1 0
# [2,] 1 1 0
# [3,] 1 0 1
However, in following matrix, it seems rather the negative rowsums,
set.seed(42)
mat <- matrix(rbinom(25, 1, .5), 5, 5)
mat[order(-matrixStats::rowSums2(matrixStats::rowCumsums(mat))),
order(-matrixStats::colSums2(matrixStats::colCumsums(mat)))]
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 1 1 0 1
# [2,] 1 1 1 1 0
# [3,] 1 1 1 0 0
# [4,] 1 1 0 1 0
# [5,] 0 0 0 1 1
but this might be a start.
You probably can't do without mathematical proof, if no method exists yet.
The problem
I understand we are given a matrix M with S rows, corresponding to samples, and G columns, corresponding to genes, such that each element of the matrix, msg equals 1 if gene g is in sample s, else it equals 0, for all s = 0,1,...,S-1 and g = 0,1,...,G-1.
We wish to identify collections of rows and columns of M that form a matrix of all ones such that the size of that matrix is maximal, size presumably being defined as the product of the numbers of rows and columns.
Method of solution
I propose solving S problems, where for each n = 0,1,...,S-1, the number of samples selected is fixed as n. For each fixed n maximizing the size of a resultant matrix of all ones (t) is equivalent maximizing the number of columns (genes) selected (c), since t = n*c. One then accepts the solution for which t is largest.
Optimization problem
A linear programming model is problem in which variables are to be assigned non-negative real values in such a way that a given linear function of those variables is maximized (or minimized), subject to inequality constraints. Each such constraint is expressed as a given linear function of the variables whose value must be less than or equal to a given value. Additional constraints may replace "less than or equal" with "equal".
An integer linear programming model is a linear programming model with the addition requirement that some or all of the variables are restricted to taking on integer values.
One type of integer linear programming model is where some or all variables are restricted to taking on the value zero or one. Such variables are called binary variables. Binary variables are often used to enforce logical requirements. Below I propose solving a series of linear programs with binary variables.
Optimization packages (some accessible online, some free) are available to solve all of these types of linear problems. Note, however, that imposition of requirements that some or all variables be restricted to integer values makes the problem much harder and more time-consuming to solve. A very large linear programming model might solve in less than a second whereas a linear program of comparable size with all binary variables might require years to solve.
Optimization model
As discussed earlier, n below equals a constant equal to the number of samples to be selected. One model is to be optimized for each n = 0,1,...,S-1.
First define the binary variables:
Now state the linear programming model with binary variables.
max yg summed over g = 0,1,...,G-1
subject to:
xs = 0 or 1 for all s = 0,1,...,S-1
yg = 0 or 1 for all g = 0,1,...,G-1
xs summed over s = 0,1,...,S-1 <= n
xs + yg <= 1 for all s and g for which msg = 0
Notes
The last group of constraints prevent both the row corresponding to sample s from being selected (xs = 1) and the column corresponding to gene g from being selected (yg = 1) when msg = 0. For such a pair s and g either sample s is selected and gene g is not or vice-versa, or neither is selected.
The quest for optimality should force
xs summed over s = 0,1,...,S-1
to equal n, but if the left side is less than n in the optimal solution that is not a problem. I expressed the constraint as an inequality to hopefully reduce solution time.
The reason I felt obliged to solve a sequence of problems, each with the number species fixed is because the size of a resultant matrix equals the product of the numbers of its rows and columns, a non-linear function of the model's variables. Suppose, for example, I had introduced (linear) constraints
nrows = ys summed over s = 0,1,...,S-1
ncols = yg summed over g = 0,1,...,G-1
the objective would become
max nrows*ncols
which is non-linear, destroying the linear structure of the model.
Solving the optimization problem may not be convenient but brute-force enumeration of all combinations of samples and genes may not be feasible, and anything short of that would be a heuristic (or ad hoc) algorithm that may or may not produce an optimal or even "close-to-optimal" solution.
Some heuristic algorithms produce bounds that give the maximum difference between the value of the solution produced by the heuristic and the optimal value. The problem with heuristics that do not produce such bounds (or do not produce "tight" bounds) is that one never knows when they have produced a "good" solution or a "bad" solution.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With