Dealing with character variables containing semicolons in CSV files

Question

I have a file separated by semicolons in which one of the variables of type character contains semicolon inside it. The readr::read_csv2 function splits the contents of those variables that have semicolons into more columns, messing up the formatting of the file.

For example, when using read_csv2 to open the file below, Bill's age column will show jogging, not 41.

File:

name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32

Considering that the original file doesn't contain quotes around the character type variables, how can I import the file so that karate and jogging belong in the hobbies column?

benson23 · Accepted Answer

read.csv()

You can use the read.csv() function. But there would be some warning messages (or use suppressWarnings() to wrap around the read.csv() function). If you wish to avoid warning messages, using the scan() method in the next section.

library(dplyr)

read.csv("./path/to/your/file.csv", sep = ";", 
         col.names = c("name", "hobbies", "age", "X4")) %>% 
  mutate(hobbies = ifelse(is.na(X4), hobbies, paste0(hobbies, ";" ,age)), 
         age = ifelse(is.na(X4), age, X4)) %>% 
  select(-X4)

scan() file

You can first scan() the CSV file as a character vector first, then split the string with pattern ; and change it into a dataframe. After that, do some mutate() to identify your target column and remove unnecessary columns. Finally, use the first row as the column name.

library(tidyverse)
library(janitor)

semicolon_file <- scan(file = "./path/to/your/file.csv", character())

semicolon_df <- data.frame(str_split(semicolon_file, ";", simplify = T))

semicolon_df %>% 
  mutate(X4 = na_if(X4, ""), 
         X2 = ifelse(is.na(X4), X2, paste0(X2, ";" ,X3)), 
         X3 = ifelse(is.na(X4), X3, X4)) %>% 
  select(-X4) %>% 
  janitor::row_to_names(row_number = 1)

Output

   name        hobbies age
2   Jon        cooking  38
3  Bill karate;jogging  41
4 Maria        fishing  32

Grada Gukovic · Answer

Assuming that you have the columns name and age with a single entry per observation and hobbies with possible multiple entries the following approach works:

read in the file line by line instead of treating it as a table:

tmp <- readLines(con <- file("table.csv"))
close(con)

Find the position of the separator in every row. The entry before the first separator is the name the entry after the last is the age:

separator_pos <- gregexpr(";", tmp)
name <- character(length(tmp) - 1)
age <- integer(length(tmp) - 1)
hobbies <- vector(length=length(tmp) - 1, "list")

fill the three elements using a for loop:

# the first line are the colnames
for(line in 2:length(tmp)){
     # from the beginning of the row to the first";"
     name[line-1] <- strtrim(tmp[line], separator_pos[[line]][1] -1) 
     # between the first ";" and the last ";".
     # Every ";" is a different elemet of the list
     hobbies[line-1] <- strsplit(substr(tmp[line], separator_pos[[line]][1] +1,  
                                        separator_pos[[line]][length(separator_pos[[line]])]-1),";")
     #after the last ";", must be an integer
     age[line-1] <- as.integer(substr(tmp[line],separator_pos[[line]][length(separator_pos[[line]])]+1, 
                               nchar(tmp[line])))
}

Create a separate matrix to hold the hobbies and fill it rowwise:

hobbies_matrix <- matrix(NA_character_, nrow = length(hobbies), ncol = max(lengths(hobbies)))
for(line in 1:length(hobbies)) 
    hobbies_matrix[line,1:length(hobbies[[line]])] <-  hobbies[[line]]

Add all variable to a data.frame:

df <- data.frame(name = name, hobbies = hobbies_matrix, age = age)
> df
   name hobbies.1 hobbies.2 age
1   Jon   cooking      <NA>  38
2  Bill    karate   jogging  41
3 Maria   fishing      <NA>  32

Dealing with character variables containing semicolons in CSV files

Tags:

r

readr

Igor

2 Answers

read.csv()

scan() file

Output

benson23

Grada Gukovic

Recent Activity

Donate For Us

Dealing with character variables containing semicolons in CSV files

Tags:

r

readr

Igor

2 Answers

read.csv()

scan() file

Output

benson23

Grada Gukovic

Related questions

Recent Activity

Donate For Us