Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dealing with character variables containing semicolons in CSV files

Tags:

r

readr

I have a file separated by semicolons in which one of the variables of type character contains semicolon inside it. The readr::read_csv2 function splits the contents of those variables that have semicolons into more columns, messing up the formatting of the file.

For example, when using read_csv2 to open the file below, Bill's age column will show jogging, not 41.

File:

name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32

Considering that the original file doesn't contain quotes around the character type variables, how can I import the file so that karate and jogging belong in the hobbies column?

like image 321
Igor Avatar asked Dec 19 '25 21:12

Igor


2 Answers

read.csv()

You can use the read.csv() function. But there would be some warning messages (or use suppressWarnings() to wrap around the read.csv() function). If you wish to avoid warning messages, using the scan() method in the next section.

library(dplyr)

read.csv("./path/to/your/file.csv", sep = ";", 
         col.names = c("name", "hobbies", "age", "X4")) %>% 
  mutate(hobbies = ifelse(is.na(X4), hobbies, paste0(hobbies, ";" ,age)), 
         age = ifelse(is.na(X4), age, X4)) %>% 
  select(-X4)

scan() file

You can first scan() the CSV file as a character vector first, then split the string with pattern ; and change it into a dataframe. After that, do some mutate() to identify your target column and remove unnecessary columns. Finally, use the first row as the column name.

library(tidyverse)
library(janitor)

semicolon_file <- scan(file = "./path/to/your/file.csv", character())

semicolon_df <- data.frame(str_split(semicolon_file, ";", simplify = T))

semicolon_df %>% 
  mutate(X4 = na_if(X4, ""), 
         X2 = ifelse(is.na(X4), X2, paste0(X2, ";" ,X3)), 
         X3 = ifelse(is.na(X4), X3, X4)) %>% 
  select(-X4) %>% 
  janitor::row_to_names(row_number = 1)

Output

   name        hobbies age
2   Jon        cooking  38
3  Bill karate;jogging  41
4 Maria        fishing  32
like image 98
benson23 Avatar answered Dec 22 '25 10:12

benson23


Assuming that you have the columns name and age with a single entry per observation and hobbies with possible multiple entries the following approach works:

  1. read in the file line by line instead of treating it as a table:
tmp <- readLines(con <- file("table.csv"))
close(con)
  1. Find the position of the separator in every row. The entry before the first separator is the name the entry after the last is the age:
separator_pos <- gregexpr(";", tmp)
name <- character(length(tmp) - 1)
age <- integer(length(tmp) - 1)
hobbies <- vector(length=length(tmp) - 1, "list")
  1. fill the three elements using a for loop:
# the first line are the colnames
for(line in 2:length(tmp)){
     # from the beginning of the row to the first";"
     name[line-1] <- strtrim(tmp[line], separator_pos[[line]][1] -1) 
     # between the first ";" and the last ";".
     # Every ";" is a different elemet of the list
     hobbies[line-1] <- strsplit(substr(tmp[line], separator_pos[[line]][1] +1,  
                                        separator_pos[[line]][length(separator_pos[[line]])]-1),";")
     #after the last ";", must be an integer
     age[line-1] <- as.integer(substr(tmp[line],separator_pos[[line]][length(separator_pos[[line]])]+1, 
                               nchar(tmp[line])))
} 
  1. Create a separate matrix to hold the hobbies and fill it rowwise:
hobbies_matrix <- matrix(NA_character_, nrow = length(hobbies), ncol = max(lengths(hobbies)))
for(line in 1:length(hobbies)) 
    hobbies_matrix[line,1:length(hobbies[[line]])] <-  hobbies[[line]]   

  1. Add all variable to a data.frame:
df <- data.frame(name = name, hobbies = hobbies_matrix, age = age)
> df
   name hobbies.1 hobbies.2 age
1   Jon   cooking      <NA>  38
2  Bill    karate   jogging  41
3 Maria   fishing      <NA>  32
like image 23
Grada Gukovic Avatar answered Dec 22 '25 12:12

Grada Gukovic



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!