Update: I should have been clearer that I was trying to check out the Enhanced functionality in reshaping using data.tables https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html. Updated the title.
I have this data set with two sets of variables - Credit_Risk_Capital and Name_concentration. They are calculated per 2 methodologies - New and Old. When I melt them using the data.table package, the variable names default to 1 and 2. How can I change them to just Credit_Risk_Capital and Name_Concentration.
Here is the data set
df <-data.table (id = c(1:100),Credit_risk_Capital_old= rnorm(100, mean = 400, sd = 60),
NameConcentration_old= rnorm(100, mean = 100, sd = 10),
Credit_risk_Capital_New =rnorm(100, mean = 200, sd = 10),
NameConcentration_New = rnorm(100, mean = 40, sd = 10))
old <- c('Credit_risk_Capital_old','NameConcentration_old')
new<-c('Credit_risk_Capital_New','NameConcentration_New')
t1<-melt(df, measure.vars = list(old,new), variable.name = "CapitalChargeType",value.name = c("old","new"))
Now instead of the elements in the CapitalChargeType Column getting tagged as 1's and 2's, I want them to be changed to Credit_risk_Capital and NameConcentration. I can obviously change them in a subsequent step using a 'match' function, but is there anyway I can do it within melt itself.
The issue here is that melt()
doesn't know how to name the variables in case of more than one measure variable. So, it resorts to simply numbering the variables.
David has pointed out that there is a feature request. However, I will show two workarounds and compare them (plus the tidyr
answer) in terms of speed.
melt()
all measure variables (which keeps the variable names), create new variable names, and dcast()
the temporary result again to end up with two value columns. This recast approach is also being used by austensen.library(data.table) # CRAN version 1.10.4 used
# melt all measure variables
long <- melt(df, id.vars = "id")
# split variables names
long[, c("CapitalChargeType", "age") :=
tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)]
dcast(long, id + CapitalChargeType ~ age)
id CapitalChargeType New old 1: 1 Credit_risk_Capital 204.85227 327.57606 2: 1 NameConcentration 34.20043 104.14524 3: 2 Credit_risk_Capital 206.96769 416.64575 4: 2 NameConcentration 30.46721 95.25282 5: 3 Credit_risk_Capital 201.85514 465.06647 --- 196: 98 NameConcentration 45.38833 90.34097 197: 99 Credit_risk_Capital 203.53625 458.37501 198: 99 NameConcentration 40.14643 101.62655 199: 100 Credit_risk_Capital 203.19156 527.26703 200: 100 NameConcentration 30.83511 79.21762
Note that the variable names are split at the last _
before the final old
or New
, resp. This is achieved by using a regular expression with positive look-ahead: "_(?=(New|old)$)"
Here, we pick-up David's suggestion to use the patterns()
function which is equivalent to specify a list of measure variables.
As a side note: The order of the list (or the patterns) determines the order of the value columns:
melt(df, measure.vars = patterns("New$", "old$"))
id variable value1 value2 1: 1 1 204.85227 327.57606 2: 2 1 206.96769 416.64575 3: 3 1 201.85514 465.06647 ...
melt(df, measure.vars = patterns("old$", "New$"))
id variable value1 value2 1: 1 1 327.57606 204.85227 2: 2 1 416.64575 206.96769 3: 3 1 465.06647 201.85514 ...
As already pointed out by the OP, melting with multiple measure variables
long <- melt(df, measure.vars = patterns("old$", "New$"),
variable.name = "CapitalChargeType",
value.name = c("old", "New"))
returns numbers instead of the variable names:
str(long)
Classes ‘data.table’ and 'data.frame': 200 obs. of 4 variables: $ id : int 1 2 3 4 5 6 7 8 9 10 ... $ CapitalChargeType: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ... $ old : num 328 417 465 259 426 ... $ New : num 205 207 202 207 203 ... - attr(*, ".internal.selfref")=<externalptr>
Fortunately, these are factors which can be changed easily by replacing the factor levels with help of the forcats
package:
long[, CapitalChargeType := forcats::lvls_revalue(
CapitalChargeType,
c("Credit_risk_Capital", "NameConcentration"))]
long[order(id)]
id CapitalChargeType old New 1: 1 Credit_risk_Capital 327.57606 204.85227 2: 1 NameConcentration 104.14524 34.20043 3: 2 Credit_risk_Capital 416.64575 206.96769 4: 2 NameConcentration 95.25282 30.46721 5: 3 Credit_risk_Capital 465.06647 201.85514 --- 196: 98 NameConcentration 90.34097 45.38833 197: 99 Credit_risk_Capital 458.37501 203.53625 198: 99 NameConcentration 101.62655 40.14643 199: 100 Credit_risk_Capital 527.26703 203.19156 200: 100 NameConcentration 79.21762 30.83511
Note that melt()
numbers the variables in the order the columns appear in df
.
reshape()
The stats
package of base R has a reshape()
function. Unfortunately, it doesn't accept regular expressions with positive look-ahead. So, the automatic guessing of variable names can't be used. Instead, all relevant parameters have to be specified explicitely:
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
reshape(df, varying = list(old, new), direction = "long",
timevar = "CapitalChargeType",
times = c("Credit_risk_Capital", "NameConcentration"),
v.names = c("old", "New"))
id CapitalChargeType old New 1: 1 Credit_risk_Capital 367.95567 194.93598 2: 2 Credit_risk_Capital 467.98061 215.39663 3: 3 Credit_risk_Capital 363.75586 201.72794 4: 4 Credit_risk_Capital 433.45070 191.64176 5: 5 Credit_risk_Capital 408.55776 193.44071 --- 196: 96 NameConcentration 93.67931 47.85263 197: 97 NameConcentration 101.32361 46.94047 198: 98 NameConcentration 104.80926 33.67270 199: 99 NameConcentration 101.33178 32.28041 200: 100 NameConcentration 85.37136 63.57817
The benchmark includes all 4 approaches discussed so far:
tidyr
with modifications to use resgular expressions with positive look-ahead,recast
,melt()
of multiple value variables, andreshape()
.The benchmark data consist of 100 K rows:
n_rows <- 100L
set.seed(1234L)
df <- data.table(
id = c(1:n_rows),
Credit_risk_Capital_old = rnorm(n_rows, mean = 400, sd = 60),
NameConcentration_old = rnorm(n_rows, mean = 100, sd = 10),
Credit_risk_Capital_New = rnorm(n_rows, mean = 200, sd = 10),
NameConcentration_New = rnorm(n_rows, mean = 40, sd = 10))
For benchmarking, the microbenchmark
package is used:
library(magrittr)
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
microbenchmark::microbenchmark(
tidyr = {
r_tidyr <- df %>%
dplyr::as_data_frame() %>%
tidyr::gather("key", "value", -id) %>%
tidyr::separate(key, c("CapitalChargeType", "age"), sep = "_(?=(New|old)$)") %>%
tidyr::spread(age, value)
},
recast = {
r_recast <- dcast(
melt(df, id.vars = "id")[
, c("CapitalChargeType", "age") :=
tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)],
id + CapitalChargeType ~ age)
},
m2col = {
r_m2col <- melt(df, measure.vars = patterns("New$", "old$"),
variable.name = "CapitalChargeType",
value.name = c("New", "old"))[
, CapitalChargeType := forcats::lvls_revalue(
CapitalChargeType,
c("Credit_risk_Capital", "NameConcentration"))][order(id)]
},
reshape = {
r_reshape <- reshape(df, varying = list(new, old), direction = "long",
timevar = "CapitalChargeType",
times = c("Credit_risk_Capital", "NameConcentration"),
v.names = c("New", "old")
)
},
times = 10L
)
Unit: milliseconds expr min lq mean median uq max neval tidyr 705.20364 789.63010 832.11391 813.08830 825.15259 1091.3188 10 recast 215.35813 223.60715 287.28034 261.23333 338.36813 477.3355 10 m2col 10.28721 11.35237 38.72393 14.46307 23.64113 154.3357 10 reshape 143.75546 171.68592 379.05752 224.13671 269.95301 1730.5892 10
The timings show that melt()
of two columns simultaneously is about 15 times faster than the second fastest, reshape()
. Both recast
variants fall behind because they both require two reshaping operations. The tidyr
solution is particularly slow.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With