Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

apply a function with two arguments, rowwise on a data frame

Tags:

r

apply

I have a reference and query sequences:

ref_seq <- "ATTT"
df <- data.frame(V1=c("AATT", "TTTT", "GGTT"))

I would like to return the mismatch positions in the sequence for each query compared to reference:

seqdiff <- function(seq1, seq2) {
  seq <- strsplit(c(seq1, seq2), split= '')
  mismatches <- which(seq[[1]] != seq[[2]])
  return(mismatches)
}
    
apply(X=df, MARGIN=2, function(x) seqdiff(x, ref_seq))

#      V1
# [1,]  1
# [2,]  2

Expected Output:

#      V1
# [1,]  2
# [2,]  1
# [3,]  1 2
like image 294
Paolo Lorenzini Avatar asked Oct 24 '25 14:10

Paolo Lorenzini


1 Answers

Given these are likely nucleotide sequences, you might consider the adist function. This can be used in other cases to determine the minimal weighted number of insertions, deletions, and substitutions needed to transform one string into another. This allows for counts to be computed in transformation, as well as the transformation sequence in the "trafos" attribute (M = match, I = insertion, D = deletion, S = Substitution).

df$trafos <- attr(adist(df$V1, ref_seq, counts = TRUE), "trafos")
df$substitution <- gregexpr("S", df$trafos)
df

    V1 trafos substitution
1 AATT   MSMM            2
2 TTTT   SMMM            1
3 GGTT   SSMM         1, 2

As noted in the comments by @AndreWildberg, different contexts may require different strategies. Other packages such as through Bioconductor or bedtoolsr may be more appropriate depending on need.

If using adist, you can also specify the "cost" argument, to apply costs for insertions, deletions, or substitutions when computing string distance.

In the case of ref_seq <- "GAAA" for example, you could apply costs for insertions and deletions, which would give a different (and probably more desirable) answer in this case.

df$trafos <- attr(adist(df$V1, 
                        ref_seq, 
                        counts = TRUE, 
                        costs = list(ins = 1, del = 1, sub = 0)), 
                  "trafos")
df$substitution <- gregexpr("S", df$trafos)
df

    V1 trafos substitution
1 AATT   SMSS      1, 3, 4
2 TTTT   SSSS   1, 2, 3, 4
3 GGTT   MSSS      2, 3, 4

In case there is interest in Bioconductor (Biostrings) to look at alignment and mismatches between two sequences, I'm adding a quick example.

library(Biostrings)
library(pwalign)

alg <- pairwiseAlignment("AATT", "GAAA")
mismatchTable(alg)

  PatternId PatternStart PatternEnd PatternSubstring PatternQuality SubjectStart SubjectEnd SubjectSubstring SubjectQuality
1         1            1          1                A              7            1          1                G              7
2         1            3          3                T              7            3          3                A              7
3         1            4          4                T              7            4          4                A              7
like image 68
Ben Avatar answered Oct 27 '25 03:10

Ben



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!