
Sliding window algorithm to analyze values of fasta segments

Tags: r, fasta

I have two segments from a random fasta file:

1 Segment1 AAGGTTCC
2 Segment2 CCTTGGAA

I have another random data set containing the energy values of the dinucleotides:

 AA -1.0
 AG -2.0
 GG -1.5
 GT -1.7
 TT -1.2
 TC -1.8
 CC -1.4
 CT -2.5
 TG -2.1
 GA -2.3

Here, I want to compare the dinucleotides of the two fasta segments against the given energy values using a sliding window algorithm. The energy output for Segment1 should be the average over all dinucleotides taken in overlapping windows of width 2, i.e. {(-1.0) + (-2.0) + (-1.5) + (-1.7) + (-1.2) + (-1.8) + (-1.4)}/7. The sum is -10.6, so the average is -10.6/7 ≈ -1.51. The same computation should be performed for Segment2, preferably using for loops and if/else statements.

08BKS09 asked Dec 13 '25 19:12



1 Answer

Here is another way, using tidytext. Its character shingles tokenizer splits each segment into overlapping dinucleotides, which is the sliding window you are looking for.
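The pipeline that follows refers to df1 and df2 without defining them. A minimal sketch of how they might be built is shown here; the column names ID, Segment, Dinu and Value are assumptions inferred from the pipeline and its printed result, not something the question specifies.

```r
# Hypothetical inputs: column names are inferred from the pipeline,
# not given in the question.
df1 <- data.frame(
  ID      = c("Segment1", "Segment2"),
  Segment = c("AAGGTTCC", "CCTTGGAA"),
  stringsAsFactors = FALSE
)

# Dinucleotide energy values, copied from the question.
df2 <- data.frame(
  Dinu  = c("AA", "AG", "GG", "GT", "TT", "TC", "CC", "CT", "TG", "GA"),
  Value = c(-1.0, -2.0, -1.5, -1.7, -1.2, -1.8, -1.4, -2.5, -2.1, -2.3),
  stringsAsFactors = FALSE
)
```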

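library(tidytext)
library(dplyr)

# df1 is expected to hold the segments (columns ID, Segment),
# df2 the energy values (columns Dinu, Value)
df <- df1 %>%
  # split each Segment into overlapping 2-character shingles, stored in Dinu
  unnest_character_shingles("Dinu", "Segment", n = 2L, to_lower = FALSE, drop = FALSE) %>%
  # look up the energy value of each dinucleotide
  left_join(df2, by = "Dinu") %>%
  # average the energy values per segment
  group_by(ID, Segment) %>%
  summarize(mean = mean(Value))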

Which gives the result:

> df
# A tibble: 2 x 3
# Groups:   ID [2]
  ID       Segment   mean
  <chr>    <chr>    <dbl>
1 Segment1 AAGGTTCC -1.51
2 Segment2 CCTTGGAA -1.71

The results are stored in the df object. For example, mean(df$mean) returns the average of the mean column across both segments.
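Since the question asked for a for loop with if/else, the same sliding window can also be sketched in base R. This is only an illustration under a few assumptions: the function name window_mean and the named vector energy are made up here, and windows without a known energy value are skipped with a warning, which the question does not specify.

```r
# Data copied from the question.
segments <- c(Segment1 = "AAGGTTCC", Segment2 = "CCTTGGAA")
energy <- c(AA = -1.0, AG = -2.0, GG = -1.5, GT = -1.7, TT = -1.2,
            TC = -1.8, CC = -1.4, CT = -2.5, TG = -2.1, GA = -2.3)

# Hypothetical helper: mean energy over all overlapping windows of a segment.
window_mean <- function(segment, energy, width = 2L) {
  n_windows <- nchar(segment) - width + 1L
  if (n_windows < 1L) return(NA_real_)  # segment shorter than one window

  total <- 0
  counted <- 0L
  for (i in seq_len(n_windows)) {
    dinu <- substr(segment, i, i + width - 1L)  # window starting at position i
    if (dinu %in% names(energy)) {
      total <- total + energy[[dinu]]
      counted <- counted + 1L
    } else {
      # Assumption: unknown dinucleotides are skipped rather than treated as 0.
      warning("no energy value for ", dinu, "; window skipped")
    }
  }
  if (counted == 0L) NA_real_ else total / counted
}

result <- sapply(segments, window_mean, energy = energy)
print(round(result, 2))
```

For the two segments above this gives -1.51 and -1.71, matching the tidytext result. Skipping unknown windows changes the denominator, so if every window must count, returning NA for the whole segment may be the safer choice.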

