
Functionally creating variables using string names

I'm trying to generate a function to create a bunch of columns on a data frame that have the same naming conventions and use the same logic. Unfortunately, I've bumped into some weird behavior when creating the variables, and I am hopeful someone else can explain what's going on here.

df <- data.frame(var1 = c(1,2,3), var2 = c(3,4,5), var3 = c("foo", "bar", "baz"))

DoesNotWork <- function(df, varname){
  df[paste(varname, "_square", sep = "")] <- df[varname]^2
  return(df)
}

dfBad <- DoesNotWork(df, "var1")

dfBad
      var1 var2 var3 var1
  1    1    3  foo    1
  2    2    4  bar    4
  3    3    5  baz    9

dfBad here appears to have two variables called var1, rather than one variable called var1 and one called var1_square as I had hoped.

The function below hacks around this problem by first copying the original variable's values to the new variable name and then performing the operation on the new variable only. This is sort of obnoxious, and I'm not sure what would happen if I needed to use logic from multiple variables.

Works <- function(df, varname){
   df[paste(varname, "_square", sep = "")] <- df[varname]
   df[paste(varname, "_square", sep = "")] <- df[paste(varname, "_square", sep = "")]^2
   return(df)
}

dfGood <- Works(df, "var1")

dfGood
      var1 var2 var3 var1_square
  1    1    3  foo           1
  2    2    4  bar           4
  3    3    5  baz           9

Any guidance here would be greatly appreciated, especially if there's a nicer way to switch between strings for variable names and references to the column-objects.

TuringMachin asked Jun 25 '26 09:06


2 Answers

You're missing the commas.

df <- data.frame(var1 = c(1,2,3), var2 = c(3,4,5), var3 = c("foo", "bar", "baz"))

NowItWorks <- function(df, varname){
  df[,paste(varname, "_square", sep = "")] <- df[,varname]^2
  return(df)
}

NowItWorks(df, "var1")

>  var1 var2 var3 var1_square
 1    1    3  foo           1
 2    2    4  bar           4
 3    3    5  baz           9

EDIT: OK, so my answer above does work, but it does not really explain why your original function fails while the workaround succeeds.

For example, the same function written with multiplication instead of exponentiation works:

MultiplicationWorks <- function(df, varname){
  df[paste(varname, "_square", sep = "")] <- df[varname]*2
  return(df)
}

Every other arithmetic operator works too; only exponentiation fails. If we look at the source code of Ops.data.frame, the operator method for data frames, we see this interesting bit at the bottom:

Ops.data.frame

...
if (.Generic %in% c("+", "-", "*", "/", "%%", "%/%")) {
    names(value) <- cn
    data.frame(value, row.names = rn, check.names = FALSE,
        check.rows = FALSE)
}
else matrix(unlist(value, recursive = FALSE, use.names = FALSE),
    nrow = nr, dimnames = list(rn, cn))
...

Basically this says that if the operator is one of those listed, return a data.frame with the given names; otherwise return a matrix with the given dimnames. For some reason, "^" is the only arithmetic operator not listed. We can confirm this pretty easily:

df <- data.frame(var1 = c(1,2,3), var2 = c(3,4,5), var3 = c("foo", "bar", "baz"))

class(df["var1"]*2)

>[1] "data.frame"

class(df["var1"]^2)

>[1] "matrix"

With exponentiation, and only with exponentiation, the dimnames of the matrix overrule the new column name of your data.frame when you assign it. R is weird. Comically, this means that you could also get your code to work by wrapping an as.data.frame() around the exponentiation part.
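
A minimal sketch of that as.data.frame() workaround, in case it helps. The function name WrappedWorks and the new_name variable are made up for illustration; the rest reuses the data frame from the question. Whether df[varname]^2 gives a matrix or a data frame may depend on your R version, so the wrapper is written to be harmless either way.

```r
df <- data.frame(var1 = c(1, 2, 3), var2 = c(3, 4, 5), var3 = c("foo", "bar", "baz"))

# WrappedWorks is a hypothetical name, not from the original post
WrappedWorks <- function(df, varname) {
  new_name <- paste(varname, "_square", sep = "")
  # as.data.frame() converts a matrix result into a one-column data frame
  # (and leaves an existing data frame as it is), so the assignment stores
  # an ordinary vector column under new_name
  df[new_name] <- as.data.frame(df[varname]^2)
  df
}

res <- WrappedWorks(df, "var1")
names(res)
res$var1_square
```

The new column should come out as a plain numeric vector rather than a one-column matrix, so it prints under its own name.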

If you want to see something really strange using your initial function:

❥ names(dfBad)
[1] "var1"        "var2"        "var3"        "var1_square"
❥ dfBad
  var1 var2 var3 var1
1    1    3  foo    1
2    2    4  bar    4
3    3    5  baz    9
❥ str(dfBad)
'data.frame':   3 obs. of  4 variables:
 $ var1       : num  1 2 3
 $ var2       : num  3 4 5
 $ var3       : Factor w/ 3 levels "bar","baz","foo": 3 1 2
 $ var1_square: num [1:3, 1] 1 4 9
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr "var1"

R knows the column's correct name, but shows you the name of the matrix you stuck into it.
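
If you ever need to track down columns like this, here is a rough sketch. It rebuilds the situation by hand with `$<-` (which stores a matrix as a single column) instead of relying on `^`, so it should not depend on how your R version implements the operator. The variable names m and matrix_cols are just for illustration.

```r
df <- data.frame(var1 = c(1, 2, 3), var2 = c(3, 4, 5))

# Recreate the odd column by hand: a 3 x 1 matrix whose dimnames say "var1"
m <- matrix(df$var1^2, ncol = 1, dimnames = list(NULL, "var1"))
df$var1_square <- m

# Find every column that is secretly a matrix
matrix_cols <- vapply(df, is.matrix, logical(1))
matrix_cols

# Flatten each one back to a plain vector; as.vector() drops dim and dimnames
for (nm in names(df)[matrix_cols]) {
  df[[nm]] <- as.vector(df[[nm]])
}

still_matrix <- vapply(df, is.matrix, logical(1))
```

This assumes the matrix columns have a single column each; a wider matrix would need to be split into several columns instead.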

jed answered Jun 26 '26 22:06


I think you just need to use [[ instead of [. Try this.

ThisWorks <- function(df, varname){
  df[[paste(varname, "_square", sep = "")]] <- df[[varname]]^2
  return(df)
}

The problem is actually in the df[varname]; this returns a data frame with the original column name, and that name is kept when you add the result on. Using [[, or specifying that you want that column by using a comma as @jed suggests, returns a vector with no name.
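
To make the difference concrete, here is a small sketch comparing the three ways of extracting a column, followed by a [[ version that combines two columns, since the question mentions needing logic from multiple variables. AddProduct and the "_times_" naming are hypothetical, made up for this example.

```r
df <- data.frame(var1 = c(1, 2, 3), var2 = c(3, 4, 5), var3 = c("foo", "bar", "baz"))

# Single bracket: a one-column data frame that keeps the name "var1"
single <- df["var1"]
# Double bracket: a bare numeric vector
double <- df[["var1"]]
# Comma form: also a bare vector, because drop = TRUE by default for one column
comma <- df[, "var1"]

# Hypothetical helper combining two columns by name
AddProduct <- function(df, var_a, var_b) {
  new_name <- paste(var_a, "_times_", var_b, sep = "")
  df[[new_name]] <- df[[var_a]] * df[[var_b]]
  df
}

dfProd <- AddProduct(df, "var1", "var2")
names(dfProd)
```

Because [[ hands back plain vectors, the arithmetic result carries no column name of its own, and the new column simply takes the name given on the left-hand side.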

Aaron left Stack Overflow answered Jun 27 '26 00:06