Could someone explain to me why I get different results in the last two lines of my code (the identical() calls) below?
These two objects seem to be identical, but when I use them in an apply function, I run into trouble:
df <- data.frame(a = 1:5, b = 6:2, c = rep(7,5))
df_ab <- df[,c(1,2)]
df_AB <- subset(df, select = c(1,2))
identical(df_ab,df_AB)
[1] TRUE
apply(df_ab,2,function(x) identical(1:5,x))
    a     b 
 TRUE FALSE
apply(df_AB,2,function(x) identical(1:5,x))
    a     b 
FALSE FALSE
The apply() function coerces its first argument to a matrix before calling the function on each column. So your data frames are coerced to matrix objects.  A consequence of that conversion is that as.matrix(df_AB) has non-null rownames, while as.matrix(df_ab) does not:
> str(as.matrix(df_ab))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df_AB))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"
So when apply() extracts a column of df_AB, you get a named vector, which is not identical to an unnamed vector.
apply(df_AB, 2, str)
 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL
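If you would rather keep apply(), one possible workaround (a sketch, not something from the question) is to strip the names inside the function with unname() before comparing. The object names mirror the question; res is just an illustrative name.

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_AB <- subset(df, select = c(1, 2))

# apply() hands each matrix column to the function; unname() drops any
# "names" attribute so that only the values are compared
res <- apply(df_AB, 2, function(x) identical(1:5, unname(x)))
res
```

This only hides the symptom: the data frame is still coerced to a matrix on every call.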
With df[, c(1,2)], the row index i is missing. Contrast that with the subset() function, which selects rows using a logical vector for the value of i. It looks like subsetting a data.frame with a non-missing value for i causes this difference in the row.names attribute:
> str(as.matrix(df[1:5, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df[, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
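To check the logical-vector case directly, the sketch below mimics what subset() is described as doing by passing an all-TRUE logical vector as i. The name all_rows is only for illustration; it is an assumption that this matches subset()'s internal row selector closely enough for the comparison.

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))

# Hypothetical stand-in for the row selector that subset() builds
all_rows <- rep(TRUE, nrow(df))

m_missing_i <- as.matrix(df[, 1:2])          # i missing
m_logical_i <- as.matrix(df[all_rows, 1:2])  # i supplied as a logical vector

# Compare which of the two matrices kept row names
rownames(m_missing_i)
rownames(m_logical_i)
```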
You can see all the gory details of the difference between the data.frames using the .Internal(inspect(x)) function.  You can look at those yourself, if you're interested.
As Roland pointed out in his comments, you can use the .row_names_info() function to see the differences in only the row names.
Notice that when i is missing, the result of .row_names_info() is negative, but it is positive if you subset with a non-missing i.
> .row_names_info(df_ab, type=1)
[1] -5
> .row_names_info(df_AB, type=1)
[1] 5
What these values mean is explained in ?.row_names_info:
type: integer. Currently ‘type = 0’ returns the internal ‘"row.names"’ attribute (possibly ‘NULL’), ‘type = 2’ the number of rows implied by the attribute, and ‘type = 1’ the latter with a negative sign for ‘automatic’ row names.
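Given that, one way to make df_AB behave like df_ab under apply() is to put its row names back into the automatic form. The sketch below assumes that assigning NULL to rownames() of a data frame restores automatic row names, which is how I understand the data frame method to work. Alternatively, as.matrix() takes a rownames.force argument if you only want to drop the row names during coercion. The names df_reset and m are illustrative.

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_AB <- subset(df, select = c(1, 2))

# Option 1: reset the row names on a copy of the data frame
df_reset <- df_AB
rownames(df_reset) <- NULL
.row_names_info(df_reset, type = 1)   # sign shows whether row names are automatic

# Option 2: leave the data frame alone and drop row names while coercing
m <- as.matrix(df_AB, rownames.force = FALSE)
rownames(m)
```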
If you want to compare the values 1:5 with the values in the columns, you should not use apply, since apply transforms the data frames to matrices before the function is applied. Due to the row names in the subset created with subset() (see @Joshua Ulrich's answer), each column becomes a named vector, and the values 1:5 are not identical to a named vector containing the same values.
You should instead use sapply to apply the identical function to the columns. This avoids transforming the data frames to matrices:
> sapply(df_ab, identical, 1:5)
    a     b 
 TRUE FALSE 
> sapply(df_AB, identical, 1:5)
    a     b 
 TRUE FALSE 
As you can see, in both data frames the values in the first column are identical to 1:5.
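A small variation, in case you want the result type checked: vapply() also works column by column on the data frame, without going through a matrix, but makes you declare what each call returns. This is only a sketch of the same comparison; is_1_to_5 is a helper name made up for the example.

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_ab <- df[, c(1, 2)]
df_AB <- subset(df, select = c(1, 2))

# Illustrative helper: TRUE only if the column is exactly the integer vector 1:5
is_1_to_5 <- function(col) identical(col, 1:5)

# logical(1) declares that every call must return a single logical value
res_ab <- vapply(df_ab, is_1_to_5, logical(1))
res_AB <- vapply(df_AB, is_1_to_5, logical(1))

identical(res_ab, res_AB)
```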
In one version (using [) your columns are integers, while in the other version (using subset) your columns are named integers.
apply(df_ab, 2, str)
 int [1:5] 1 2 3 4 5
 int [1:5] 6 5 4 3 2
NULL
apply(df_AB, 2, str)
 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL
Looking at the structure of those two objects before they get submitted to apply shows only one difference: in the rownames, but not a difference that I would have expected to produce the difference you are seeing. I do not see Joshua's current offer of 'subset' as logical indexing as explaining this. Why row.names = c(NA, 5L) produces a named result when extracting with "[" after the as.matrix coercion is as yet unexplained.
> dput(df_AB)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), row.names = c(NA, 5L), class = "data.frame")
> dput(df_ab)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), class = "data.frame", row.names = c(NA, -5L))
I do agree that it is the as.matrix coercion which needs further investigation:
> attributes(df_AB[,1])
NULL
> attributes(df_ab[,1])
NULL
> attributes(as.matrix(df_AB)[,1])
$names
[1] "1" "2" "3" "4" "5"