Convert a data frame into a 0/1 presence matrix in R

I have a data frame such as:

Cluster sequence_name
1       species1
1       species1
1       species2
1       species3
1       species3
1       gene1
1       gene2
2       species4
2       species5
2       species5
2       species3
2       gene3
2       gene4

and I would like to get a matrix from it such as:

           gene1  gene2 gene3 gene4
species5   0      0     1     1
species4   0      0     1     1
species1   1      1     0     0
species2   1      1     0     0
species3   1      1     1     1

where 1 means that the gene is present for speciesX, and 0 means it is not present. "Present" means that speciesX occurs in the same cluster as geneX. For example, gene1 is in cluster 1 together with species1, species2 and species3. In contrast, species4 and species5 are not present in cluster 1.

As you can also see, there are duplicates: within the same cluster, a species can be represented several times. Thank you for your help.

The real data looks like:

cluster_names seq_names  
1             AP_000401.1  
1             NP_039001.1  
1             Canis_lupus  
1             Canis_familiaris
2             YP_0090909.1
2             Mustela_putorius
2             Mustela_furo
2             YP_0909200.1

....

...

Names starting with AP, NP and other two-letter prefixes are genes, and the Genus_species names are the species.

In response to Denis:

Here is a head of the real data:

cluster_names  seq_names
1   scf7180005155889:2745-3053(-):Drosophia_melanogaster
1   IDBA_scaffold_72878:85-225:292707-293006(+):Orussu_sp
1   scaffold_3615:40850-41320(-):Canis_lupus
1   scaffold_8697:754-1209(-):homo_sapiens
1   scf7180005155889:72-1908(-):homo_sapiens
1   YP_003969716.1
1   NP_003986717.1
2   scaffold_17536:2745-3053(-):Drosophia_melanogaster
2   scf7180005155889:2000-8900(-):Drosophia_melanogaster
2   scaffold_8697:754-1209(-):homo_sapiens
2   YP_003956764.1
2   YP_004894416.1
2   YP_008958968.1

and the output I should get is:

[image: expected output]

In response to Denis:

> df <- read.table(text = "Cluster sequence_name
+ 1       :Drosophia_melanogaster
+                  1       scf7180005155889:2745-3053(-):Drosophila_melanogaster
+                  1       scf7180005155889:2745-3053(-):Orussu_sp
+                  1       scf7180005155889:2745-3053(-):Canis_lupus
+                  1       scf7180005155889:72-1908(-):Homo_sapiens
+                  1       scf7180005155889:2745-3053(-):Homo_sapiens
+                  1       YP_003970075.1
+                  1       YP_005070075.1
+                  2       scf7180005155889:72-1908(-):Drosophila_melanogaster
+                  2       scf7180005155889:72-1908(-):Drosophila_melanogaster
+                  2       scf7180005155889:72-1908(-):Homo_sapiens
+                  2       YP_039970075.1
+                  2       NP_003900075.1",header = T)
> df <- setDT(df)
> species <- df[grep("[0-9]+\\([+-]\\):[A-z ]+",sequence_name)]
> species[,sequence_name := str_extract(sequence_name,"(?<=[0-9]\\([+-]\\):)[A-z ]+")]
> genes <- df[grep("[0-9]+\\.1",sequence_name)]
> genes[,sequence_name :=sequence_name]
> plouf <- merge(genes,species,by = "Cluster",allow.cartesian=TRUE)
> result <- dcast(plouf,sequence_name.y~sequence_name.x,fun.aggregate = length)
Using 'sequence_name.y' as value column. Use 'value.var' to override
> row.names(result)<-result$sequence_name.y
> result$sequence_name.y<- NULL
> result
   NP_003900075.1 YP_003970075.1 YP_005070075.1 YP_039970075.1
1:              0              1              1              0
2:              2              1              1              2
3:              1              2              2              1
4:              0              1              1              0
darwin98 asked Dec 29 '25 05:12


2 Answers

library(data.table)
library(stringr)
df <- setDT(df)

I will use data.table here. The idea is to create two data tables, one with the genes and one with the species:

species <- df[grep("species",sequence_name)]
species[,sequence_name := str_extract(sequence_name,"(?<=:)[a-z0-9]+$")]
genes <- df[grep("gene",sequence_name)]

> species
   Cluster sequence_name
1:       1      species1
2:       1      species2
3:       1      species3
4:       2      species4
5:       2      species5
6:       2      species3
> genes
   Cluster sequence_name
1:       1         gene1
2:       1         gene2
3:       2         gene3
4:       2         gene4

Merge them by cluster. allow.cartesian=TRUE is needed because Cluster is not a unique identifier in either table, so every gene gets paired with every species of its cluster:

plouf <- merge(genes,species,by = "Cluster",allow.cartesian=TRUE)

    Cluster sequence_name.x sequence_name.y
 1:       1           gene1        species1
 2:       1           gene1        species2
 3:       1           gene1        species3
 4:       1           gene2        species1
 5:       1           gene2        species2
 6:       1           gene2        species3
 7:       2           gene3        species4
 8:       2           gene3        species5
 9:       2           gene3        species3
10:       2           gene4        species4
11:       2           gene4        species5
12:       2           gene4        species3

Then obtaining your result is just a matter of reshaping to wide format while counting the number of occurrences, which you can do with dcast:

result <- dcast(plouf,sequence_name.y~sequence_name.x,fun.aggregate = length)


   sequence_name.y gene1 gene2 gene3 gene4
1:        species1     1     1     0     0
2:        species2     1     1     0     0
3:        species3     1     1     1     1
4:        species4     0     0     1     1
5:        species5     0     0     1     1

Et voilà. I will leave it to experienced dplyr users to propose an equivalent or improved solution with dplyr.

The data:

df <- read.table(text = "Cluster sequence_name
1       Scaffold_1:species1
                 1       Scaffold_2:species2
                 1       Scaffold_3:species3
                 1       gene1
                 1       gene2
                 2       Scaffold_4:species4
                 2       Scaffold_5:species5
                 2       Scaffold_6:species3
                 2       gene3
                 2       gene4",header = T)

With the real data you show:

df <- read.table(text ="cluster_names  seq_names
                 1   scf7180005155889:2745-3053(-):Drosophia_melanogaster
                 1   scaffold_2484:292707-293006(+):Orussu_sp
                 1   scaffold_3615:40850-41320(-):Canis_lupus
                 1   scaffold_8697:754-1209(-):homo_sapiens
                 1   scf7180005155889:72-1908(-):homo_sapiens
                 1   YP_003969716.1
                 1   NP_003986717.1
                 2   scaffold_17536:2745-3053(-):Drosophia_melanogaster
                 2   scf7180005155889:2000-8900(-):Drosophia_melanogaster
                 2   scaffold_8697:754-1209(-):homo_sapiens
                 2   YP_003956764.1
                 2   YP_004894416.1
                 2   YP_008958968.1",header = T)

After converting this df with setDT(df) as before, replace the step that creates the two data tables with:

species <- df[grep("[0-9]+\\([+-]\\):[A-z ]+",seq_names)]
species[,sequence_name := str_extract(seq_names,"(?<=[0-9]\\([+-]\\):)[A-z ]+")]
genes <- df[grep("[0-9]+\\.1",seq_names)]
genes[,sequence_name :=seq_names]

Here "[0-9]+\\.1" assumes that every gene identifier ends in .1 and that no species description contains a period. To extract the species, I assume that its name always follows (+): or (-): after the coordinates.

But that is a regex problem, and it should be a separate question if you have trouble with it. Your question here was how to shape the data to obtain your result. I answered with the steps that work on the example data: creating the genes and species data tables using regex, merging them, and reshaping the result.
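To make those assumptions concrete, here is a minimal base R sketch that runs the two patterns on three identifiers copied from the question. The variable names and the stricter pattern `strict_pat` are illustrative assumptions, not part of the original answer. Note that `[A-z]` matches the underscore in `Canis_lupus` only because `_` lies between `Z` and `a` in ASCII; the range also lets a few punctuation characters through.

```r
# Sample identifiers copied from the question's real data
x <- c("scaffold_3615:40850-41320(-):Canis_lupus",
       "scf7180005155889:72-1908(-):homo_sapiens",
       "YP_003969716.1")

# Same patterns as in the answer; perl = TRUE is required for the look-behind
species_pat <- "(?<=[0-9]\\([+-]\\):)[A-z ]+"
gene_pat    <- "[0-9]+\\.1"

is_species <- grepl(species_pat, x, perl = TRUE)
is_gene    <- grepl(gene_pat, x, perl = TRUE)

# regmatches() returns only the substrings that matched
species_names <- regmatches(x, regexpr(species_pat, x, perl = TRUE))

# Hypothetical stricter variant: explicit character class, anchored at the end
strict_pat   <- "(?<=[0-9]\\([+-]\\):)[A-Za-z_]+$"
strict_names <- regmatches(x, regexpr(strict_pat, x, perl = TRUE))

print(data.frame(x, is_species, is_gene))
print(species_names)
```

If some gene identifiers carry a version other than .1, the gene pattern would need to be loosened accordingly.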

The rest works:

plouf <- merge(genes,species,by = "cluster_names",allow.cartesian=TRUE)
result <- dcast(plouf,sequence_name.y~sequence_name.x,fun.aggregate = length)
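With repeated species inside a cluster, as in the real data, fun.aggregate = length returns counts above 1 instead of the 0/1 values the question asks for, and a data.table cannot carry row names. The following is a minimal sketch of one way to handle both, under two assumptions that are not in the original answer: every row that is not a gene is a species, and the species name is whatever follows the last colon. The column names `gene` and `species` are chosen for illustration.

```r
library(data.table)

# Small sample in the shape of the question's data;
# homo_sapiens appears twice in cluster 1
df <- data.table(
  cluster_names = c(1, 1, 1, 1, 2, 2),
  seq_names = c("scaffold_8697:754-1209(-):homo_sapiens",
                "scf7180005155889:72-1908(-):homo_sapiens",
                "scaffold_3615:40850-41320(-):Canis_lupus",
                "YP_003969716.1",
                "scaffold_8697:754-1209(-):homo_sapiens",
                "YP_003956764.1")
)

# Assumption: gene identifiers end in ".1"; everything else is a species
is_gene <- grepl("[0-9]+\\.1$", df$seq_names)

# Keep one row per (cluster, name) so a repeated species counts once
genes   <- unique(df[is_gene, .(cluster_names, gene = seq_names)])
species <- unique(df[!is_gene, .(cluster_names,
                                 species = sub(".*:", "", seq_names))])

pairs <- merge(genes, species, by = "cluster_names", allow.cartesian = TRUE)

counts <- dcast(pairs, species ~ gene, value.var = "cluster_names",
                fun.aggregate = length)

# Turn the species column into row names, then clamp any count to 0/1
presence <- as.matrix(counts, rownames = "species")
presence <- (presence > 0) + 0L
print(presence)
```

Clamping after the reshape also covers the case where the same gene and species co-occur in more than one cluster, which deduplication alone would not.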
denis answered Dec 31 '25 18:12


Using tidyverse:

# data
df1 <- read.table(text = "Cluster sequence_name
1       species1
                  1       species1
                  1       species2
                  1       species3
                  1       species3
                  1       gene1
                  1       gene2
                  2       species4
                  2       species5
                  2       species5
                  2       species3
                  2       gene3
                  2       gene4", header = TRUE, stringsAsFactors = FALSE)

# so that we know which row is species
species <- paste("species", 1:5, sep = "")
#[1] "species1" "species2" "species3" "species4" "species5"

library(tidyverse)

res <- reduce(split(df1, df1$sequence_name %in% species), left_join, by = "Cluster") %>% 
  unique() %>% 
  spread(key = "sequence_name.x", value = "Cluster") %>% 
  mutate_if(is.numeric, ~ as.numeric(!is.na(.)))

res
#   sequence_name.y gene1 gene2 gene3 gene4
# 1        species1     1     1     0     0
# 2        species2     1     1     0     0
# 3        species3     1     1     1     1
# 4        species4     0     0     1     1
# 5        species5     0     0     1     1
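The same idea can be sketched with the newer tidyr reshaping verb pivot_wider(), which supersedes spread(). This is only an illustrative variant, assuming tidyr 1.1.0 or later (for the scalar values_fill) and assuming species rows can be recognised by the "species" prefix; the names `species_tbl`, `genes_tbl` and `present` are made up for the example.

```r
library(dplyr)
library(tidyr)

# Same example data as above, built directly as a data frame
df1 <- data.frame(
  Cluster = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  sequence_name = c("species1", "species1", "species2", "species3", "species3",
                    "gene1", "gene2",
                    "species4", "species5", "species5", "species3",
                    "gene3", "gene4"),
  stringsAsFactors = FALSE
)

# Assumption: species rows are those whose name starts with "species"
is_species <- grepl("^species", df1$sequence_name)

species_tbl <- df1[is_species, ] %>% distinct() %>% rename(species = sequence_name)
genes_tbl   <- df1[!is_species, ] %>% distinct() %>% rename(gene = sequence_name)

# Pair every species with every gene of the same cluster
# (recent dplyr versions warn about the many-to-many join, which is intended here)
res <- inner_join(species_tbl, genes_tbl, by = "Cluster") %>%
  distinct(species, gene) %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = gene, values_from = present, values_fill = 0L)

print(res)
```

Dropping Cluster before reshaping means a species that shares a gene through several clusters still ends up with a single 1 in that cell.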
zx8754 answered Dec 31 '25 18:12