Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

left_join produces NAs when key has spaces

Tags:

r

dplyr

I'm getting an unexpected pattern of NAs from a left join. The data come from this week's Tidy Tuesday.

library(tidyverse)

breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv') %>%
  select(Breed, `Affectionate With Family`)

# A tibble: 195 × 2
   Breed                         `Affectionate With Family`
   <chr>                                              <dbl>
 1 Retrievers (Labrador)                                  5
 2 French Bulldogs                                        5
 3 German Shepherd Dogs                                   5
 4 Retrievers (Golden)                                    5
 5 Bulldogs                                               4
 6 Poodles                                                5
 7 Beagles                                                3
 8 Rottweilers                                            5
 9 Pointers (German Shorthaired)                          5
10 Dachshunds                                             5     

breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv') %>%
  select(Breed, `Rank 2013`)

# A tibble: 195 × 2
   Breed                         `2013 Rank`
   <chr>                               <dbl>
 1 Retrievers (Labrador)                   1
 2 French Bulldogs                        11
 3 German Shepherd Dogs                    2
 4 Retrievers (Golden)                     3
 5 Bulldogs                                5
 6 Poodles                                 8
 7 Beagles                                 4
 8 Rottweilers                             9
 9 Pointers (German Shorthaired)          13
10 Dachshunds                             10  

The rows where Breed has spaces (e.g., Retrievers (Labrador)) results in NAs, despite having data in both tables:

breed_rank_all %>%
  left_join(breed_traits, by = "Breed")

# A tibble: 195 × 3
   Breed                         `2013 Rank` `Affectionate With Family`
   <chr>                               <dbl>                      <dbl>
 1 Retrievers (Labrador)                   1                         NA
 2 French Bulldogs                        11                         NA
 3 German Shepherd Dogs                    2                         NA
 4 Retrievers (Golden)                     3                         NA
 5 Bulldogs                                5                          4
 6 Poodles                                 8                          5
 7 Beagles                                 4                          3
 8 Rottweilers                             9                          5
 9 Pointers (German Shorthaired)          13                         NA
10 Dachshunds                             10                          5

I checked for extra white spaces, but that was not it. I also tried removing spaces in Breed and then joining, but no.

Something is not equivalent here:

breed_rank_all$Breed[1]
[1] "Retrievers (Labrador)"

breed_traits$Breed[1]
[1] "Retrievers (Labrador)"

breed_rank_all$Breed[1] == breed_traits$Breed[1]
[1] FALSE

Update

difference is c2 a0 in place of 20??

iconv(breed_rank_all$Breed[1], toRaw = TRUE)
[[1]]
 [1] 52 65 74 72 69 65 76 65 72 73 20 28 4c 61 62 72 61 64 6f 72 29

> iconv(breed_traits$Breed[1], toRaw = TRUE)
[[1]]
 [1] 52 65 74 72 69 65 76 65 72 73 c2 a0 28 4c 61 62 72 61 64 6f 72 29

Using stringi::stri_enc_toascii

> stringi::stri_enc_toascii(breed_traits$Breed[1])
[1] "Retrievers\032(Labrador)"

> stringi::stri_enc_toascii(breed_rank_all$Breed[1])
[1] "Retrievers (Labrador)"

This seems to fix it:

breed_traits <- breed_traits %>%
  mutate(Breed = stringi::stri_enc_toascii(Breed),
         Breed = gsub("\\\032", " ", Breed)) 
like image 967
Eric Green Avatar asked Oct 24 '25 17:10

Eric Green


1 Answers

I found the issue. On a hunch, I investigated the whitespace.

# space that isn't a space (like non-breaking space?)
utf8::utf8_print(breed_traits$Breed[1], utf8 = FALSE)
# [1] "Retrievers\u00a0(Labrador)"
# this is a non-breaking space

You can replace non-breaking spaces with regex.

(replSp = str_replace_all(string = breed_traits$Breed[1],
                pattern = "[[:space:]]",
                replacement = " ")) 
# [1] "Retrievers (Labrador)" 

breed_rank_all$Breed[[1]] == replSp
# [1] TRUE 


As requested...

To replace all non-breaking spaces in a data frame:

breed_traits <- breed_traits %>% 
   mutate(Breed = str_replace_all(string = Breed, 
                                  pattern = "[[:space:]]", 
                                  replacement = " "))
like image 177
Kat Avatar answered Oct 27 '25 06:10

Kat