R - difference between 2 sets in data frame

Tags: r, set-difference, strsplit

I have two factor columns and I want to create a third column that tells me what the second one has that the first does not. It's very similar to this post, but I'm having trouble going from a df to using the setdiff() function.
For example:

library(dplyr)
y1 <- c("a.b.","a.","b.c.d.")
y2 <- c("a.b.c.","a.b.","b.c.d.")
df <- data.frame(y1,y2)

Column y1 has a.b. and column y2 has a.b.c., so I want a third column to return c. or just c.

> df
      y1     y2  col3
1   a.b.  a.b.c.  c.
2     a.    a.b.  b.
3 b.c.d.  b.c.d.

I think it should be a combination of strsplit and setdiff, but I can't get it to work.

I've tried converting the factor columns to character and then applying strsplit() to the results, but the output seems a bit weird to me. It seems to have created a list within a list, which makes it difficult to pass to setdiff().

#convert factor to character
df <- df %>% mutate_if(is.factor, as.character)
lapply(df$y1,function(x)(strsplit(x,split = "[.]")))

> lapply(df$y1,function(x)(strsplit(x,split = "[.]")))
[[1]]
[[1]][[1]]
[1] "a" "b"


[[2]]
[[2]][[1]]
[1] "a"


[[3]]
[[3]][[1]]
[1] "b" "c" "d"

asked Apr 18 '18 01:04

jmich738

2 Answers

Update

There was an issue when the difference had more than one element: it created an additional row. To overcome that, we paste all the elements of each difference together. This also saves us the unlist step.

df$col3 <- mapply(function(x, y) paste0(setdiff(y, x), collapse = ""),
   strsplit(as.character(df$y1), "\\."), strsplit(as.character(df$y2), "\\."))
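
As a minimal, self-contained sketch of the updated approach: the data below is the two-element example from the comments, stringsAsFactors = TRUE is set explicitly so the columns are factors as in the question, and parts1 and parts2 are only illustrative names.

```r
# Example data where the first row differs by two elements ("c" and "e")
y1 <- c("a.b.", "a.", "b.c.d.")
y2 <- c("a.b.c.e.", "a.b.", "b.c.d.")
df <- data.frame(y1, y2, stringsAsFactors = TRUE)

# strsplit() is vectorised: it returns one character vector per row
parts1 <- strsplit(as.character(df$y1), ".", fixed = TRUE)
parts2 <- strsplit(as.character(df$y2), ".", fixed = TRUE)

# For each row, keep what y2 has that y1 lacks and collapse it to one string
df$col3 <- mapply(function(x, y) paste0(setdiff(y, x), collapse = ""),
                  parts1, parts2)

df$col3
# [1] "ce" "b"  ""
```

With collapse = "" the two extra elements run together as "ce"; passing collapse = "." instead would keep the original separator.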

Original Answer

We can use mapply: split both columns on "." using strsplit, and then take the difference between them using setdiff.

df$col3 <- mapply(function(x, y) setdiff(y, x),
       strsplit(as.character(df$y1), "\\."), strsplit(as.character(df$y2), "\\."))

df
#     y1     y2 col3
#1   a.b. a.b.c.    c
#2     a.   a.b.    b
#3 b.c.d. b.c.d.

If we don't want col3 as a list, we can unlist it. One issue, however, is that unlist drops the character(0) value. To retain that value we need to perform an additional check on it. Taken from here.

unlist(lapply(df$col3,function(x) if(identical(x,character(0))) ' ' else x))

#[1] "c" "b" " "
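
The dropped element can be seen in isolation with a small sketch. The list below is a hand-built stand-in for col3 (an assumption, not the actual column), and lengths() is used here as an alternative way to find the empty entries.

```r
# Hand-built stand-in for the list column produced above
col3 <- list("c", "b", character(0))

# unlist() silently drops the zero-length element, so one row goes missing
n_before <- length(unlist(col3))
n_before
# [1] 2

# Replace the empty elements with a placeholder first, then flatten
col3[lengths(col3) == 0] <- " "
flat <- unlist(col3)
flat
# [1] "c" "b" " "
```

This only yields one value per row when every non-empty element has length 1, which is why the update above pastes each difference together instead.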

answered Nov 15 '22 09:11

Ronak Shah

You can also use purrr::map2:

library(dplyr)
library(purrr)

df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(col3 = map2(strsplit(y2, "\\."), strsplit(y1, "\\."), setdiff))
#      y1     y2 col3
#1   a.b. a.b.c.    c
#2     a.   a.b.    b
#3 b.c.d. b.c.d.

Explanation: Convert factors to character vectors, use setdiff on the "."-split columns y2 and y1. Note that col3 is a list.

Update

It appears that unnest drops the zero-length character entries from the list. So to convert col3 from a list to a character vector you can do:

df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(col3 = map2(strsplit(y2, "\\."), strsplit(y1, "\\."), setdiff)) %>%
    rowwise() %>%
    mutate(col3 = paste(col3, collapse = "."))
## A tibble: 3 x 3
#  y1     y2     col3
#  <chr>  <chr>  <chr>
#1 a.b.   a.b.c. c
#2 a.     a.b.   b
#3 b.c.d. b.c.d. ""

The idea here is to string-concatenate col3 entries (if there are multiple); using rowwise() ensures row-wise paste.
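
If a plain character vector is all that is needed, one possible variation (not what the answer above uses) is purrr::map2_chr, which returns a character vector directly and so avoids the rowwise() step. A minimal sketch, assuming purrr is installed and the columns are already character:

```r
library(purrr)

y1 <- c("a.b.", "a.", "b.c.d.")
y2 <- c("a.b.c.e.", "a.b.", "b.c.d.")

# For each pair of split rows, paste the set difference into a single string
col3 <- map2_chr(
  strsplit(y2, ".", fixed = TRUE),
  strsplit(y1, ".", fixed = TRUE),
  function(a, b) paste(setdiff(a, b), collapse = ".")
)

col3
# [1] "c.e" "b"   ""
```

Because paste() with collapse always returns exactly one string, even for an empty difference, map2_chr gets one value per row.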

For the updated sample data from your comment:

y1 <- c("a.b.","a.","b.c.d.")
y2 <- c("a.b.c.e.","a.b.","b.c.d.")
df <- data.frame(y1,y2)
df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(col3 = map2(strsplit(y2, "\\."), strsplit(y1, "\\."), setdiff)) %>%
    rowwise() %>%
    mutate(col3 = paste(col3, collapse = "."))
## A tibble: 3 x 3
#  y1     y2       col3
#  <chr>  <chr>    <chr>
#1 a.b.   a.b.c.e. c.e
#2 a.     a.b.     b
#3 b.c.d. b.c.d.   ""

answered Nov 15 '22 10:11

Maurits Evers
