I have a file separated by semicolons in which one of the variables of type character contains semicolon inside it. The readr::read_csv2 function splits the contents of those variables that have semicolons into more columns, messing up the formatting of the file.
For example, when using read_csv2 to open the file below, Bill's age column will show jogging, not 41.
File:
name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32
Considering that the original file doesn't contain quotes around the character type variables, how can I import the file so that karate and jogging belong in the hobbies column?
You can use the read.csv() function. But there would be some warning messages (or use suppressWarnings() to wrap around the read.csv() function). If you wish to avoid warning messages, using the scan() method in the next section.
library(dplyr)
read.csv("./path/to/your/file.csv", sep = ";",
col.names = c("name", "hobbies", "age", "X4")) %>%
mutate(hobbies = ifelse(is.na(X4), hobbies, paste0(hobbies, ";" ,age)),
age = ifelse(is.na(X4), age, X4)) %>%
select(-X4)
You can first scan() the CSV file as a character vector first, then split the string with pattern ; and change it into a dataframe. After that, do some mutate() to identify your target column and remove unnecessary columns. Finally, use the first row as the column name.
library(tidyverse)
library(janitor)
semicolon_file <- scan(file = "./path/to/your/file.csv", character())
semicolon_df <- data.frame(str_split(semicolon_file, ";", simplify = T))
semicolon_df %>%
mutate(X4 = na_if(X4, ""),
X2 = ifelse(is.na(X4), X2, paste0(X2, ";" ,X3)),
X3 = ifelse(is.na(X4), X3, X4)) %>%
select(-X4) %>%
janitor::row_to_names(row_number = 1)
name hobbies age
2 Jon cooking 38
3 Bill karate;jogging 41
4 Maria fishing 32
Assuming that you have the columns name and age with a single entry per observation and hobbies with possible multiple entries the following approach works:
tmp <- readLines(con <- file("table.csv"))
close(con)
separator_pos <- gregexpr(";", tmp)
name <- character(length(tmp) - 1)
age <- integer(length(tmp) - 1)
hobbies <- vector(length=length(tmp) - 1, "list")
# the first line are the colnames
for(line in 2:length(tmp)){
# from the beginning of the row to the first";"
name[line-1] <- strtrim(tmp[line], separator_pos[[line]][1] -1)
# between the first ";" and the last ";".
# Every ";" is a different elemet of the list
hobbies[line-1] <- strsplit(substr(tmp[line], separator_pos[[line]][1] +1,
separator_pos[[line]][length(separator_pos[[line]])]-1),";")
#after the last ";", must be an integer
age[line-1] <- as.integer(substr(tmp[line],separator_pos[[line]][length(separator_pos[[line]])]+1,
nchar(tmp[line])))
}
hobbies_matrix <- matrix(NA_character_, nrow = length(hobbies), ncol = max(lengths(hobbies)))
for(line in 1:length(hobbies))
hobbies_matrix[line,1:length(hobbies[[line]])] <- hobbies[[line]]
df <- data.frame(name = name, hobbies = hobbies_matrix, age = age)
> df
name hobbies.1 hobbies.2 age
1 Jon cooking <NA> 38
2 Bill karate jogging 41
3 Maria fishing <NA> 32
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With