df is a test dataframe in which I want to retain only the rows which contain the | character in the Hits column and the rows which contain the hits (i.e. Hit1, Hit2 etc.) and also the blank cells in the Hits column.
df <- data.frame(
Hits = c("# test114", "# ID|987129470", "# 13135", "Hit1", "", "Hit2", "Hit3", "", "# test739", "# ID|6971324987", "# 83771", "Hit1", "Hit2", "", "Hit3"),
Category1 = c(NA, NA, NA, 0.001, 0.001, 0.002, 0.003, 0.003, NA, NA, NA, 0.023, 0.341, 0.341, 0.569),
Category2 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 100, 95, 95, 97),
Category3 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 98, 97, 97, 92)
)
df looks like this

The output should look something like this

Since I want to retain only the rows having the | character and the ones not having the # character and the blank cells, maybe pattern matching could help. But I can't seem to figure out how to implement it in the code. Any suggestions?
Please note that in the original dataframe, the hits can have different names. Some hits don't start with the term 'Hit'
We could use grep with a regex pattern - "^[^#]+$|\\| (Hits column with no # from start (^) to end ($) or (|) the | character - escaped as it is metacharacter)
subset(df, grepl("^[^#]+$|\\|", Hits)|Hits == "")
Hits Category1 Category2 Category3
2 # ID|987129470 NA NA NA
4 Hit1 0.001 100 100
5 0.001 100 100
6 Hit2 0.002 99 99
7 Hit3 0.003 98 98
8 0.003 98 98
10 # ID|6971324987 NA NA NA
12 Hit1 0.023 100 98
13 Hit2 0.341 95 97
14 0.341 95 97
15 Hit3 0.569 97 92
Or do this in single grepl
> subset(df, grepl("^$|^[^#]+$|\\|", Hits))
Hits Category1 Category2 Category3
2 # ID|987129470 NA NA NA
4 Hit1 0.001 100 100
5 0.001 100 100
6 Hit2 0.002 99 99
7 Hit3 0.003 98 98
8 0.003 98 98
10 # ID|6971324987 NA NA NA
12 Hit1 0.023 100 98
13 Hit2 0.341 95 97
14 0.341 95 97
15 Hit3 0.569 97 92
or in dplyr
library(dplyr)
library(stringr)
df %>%
filter(str_detect(Hits, "^[^#]+$|\\|")|Hits == "")
Hits Category1 Category2 Category3
1 # ID|987129470 NA NA NA
2 Hit1 0.001 100 100
3 0.001 100 100
4 Hit2 0.002 99 99
5 Hit3 0.003 98 98
6 0.003 98 98
7 # ID|6971324987 NA NA NA
8 Hit1 0.023 100 98
9 Hit2 0.341 95 97
10 0.341 95 97
11 Hit3 0.569 97 92
Similar to akruns solution but using no regex solution:
library(dplyr)
library(stringr)
df %>%
filter(str_detect(Hits, 'ID|Hit'))
Hits Category1 Category2 Category3
1 # ID|987129470 NA NA NA
2 Hit1 0.001 100 100
3 Hit1 0.001 100 100
4 Hit2 0.002 99 99
5 Hit3 0.003 98 98
6 Hit3 0.003 98 98
7 # ID|6971324987 NA NA NA
8 Hit1 0.023 100 98
9 Hit2 0.341 95 97
10 Hit2 0.341 95 97
11 Hit3 0.569 97 92
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With