Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scale rows of data

Tags:

r

scale

dplyr

I have a data frame with customer information in rows and periods (months) in columns. I use this format for clustering purposes. I want to scale the values in the rows. I can do it with the following code, but there are some problems:

  1. The code is too complex for something that should be a simple operation.
  2. The "scale" function returns "NaN" in some cases.
  3. Entering explicit customer names (vars=c("A","B",...) will not work since the real data has thousands of customers.

Here is my sample data and code:

mydata 
  cust P1  P2 P3  P4 P5  P6 P7  P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20
1    A  1 1.0  1 1.0  1 1.0  1 1.0  1 1.0   1 1.0   1 1.0   1 1.0   1 1.0   1 1.0
2    B  5 5.0  5 5.0  5 5.0  5 5.0  5 5.0   5 5.0   5 5.0   5 5.0   5 5.0   5 5.0
3    C  9 9.0  9 9.0  9 9.0  9 9.0  9 9.0   9 9.0   9 9.0   9 9.0   9 9.0   9 9.0
4    D  0 1.0  2 1.0  0 1.0  2 1.0  0 1.0   2 1.0   0 1.0   2 1.0   0 1.0   2 1.0
5    E  4 5.0  6 5.0  4 5.0  6 5.0  4 5.0   6 5.0   4 5.0   6 5.0   4 5.0   6 5.0
6    F  8 9.0 10 9.0  8 9.0 10 9.0  8 9.0  10 9.0   8 9.0  10 9.0   8 9.0  10 9.0
7    G  2 1.5  1 0.5  0 0.5  1 1.5  2 1.5   1 0.5   0 0.5   1 1.5   2 1.5   1 0.5
8    H  6 5.5  5 4.5  4 4.5  5 5.5  6 5.5   5 4.5   4 4.5   5 5.5   6 5.5   5 4.5
9    I 10 9.5  9 8.5  8 8.5  9 9.5 10 9.5   9 8.5   8 8.5   9 9.5  10 9.5   9 8.5

code that I am using:

library(dplyr)
library(tidyr)
# first transpose the data
g_mydata = mydata %>% gather(period,value,-cust)
spr_mydata = g_mydata %>% spread(cust,value)
# then scale the values for each period
sc_mydata = spr_mydata %>% 
      mutate_each_(funs(scale),vars = c("A","B","C","D","E","F","G","H","I") )   
# then transpose again back to original format
g_scdata = sc_mydata %>% gather(cust,value,-period)
scaled_data = g_scdata %>% spread(period,value)

Thanks for any help or suggestions.

like image 641
Paul Avatar asked Dec 18 '25 20:12

Paul


2 Answers

You could always try apply():

sc_mydata = apply(spr_mydata[, -1], 1, scale)

If the NaN's are messing that up, you could transpose spr_mydata and try to run scale() directly:

scale(spr_mydata[-1, ])
like image 88
wmay Avatar answered Dec 20 '25 14:12

wmay


Here is a dplyr way of doing it.

long_data = 
  mydata %>% 
  gather(period, value,-cust)

to_scale = 
  long_data %>%
  group_by(cust) %>%
  summarize(sd = sd(value)) %>%
  filter(sd != 0) %>%
  select(-sd)

flat = 
  long_data %>%
  anti_join(to_scale) %>%
  mutate(value = 0)

wide_scale = 
  long_data %>%
  right_join(to_scale) %>%
  group_by(cust) %>%
  mutate(value = 
           value %>%
           scale %>%
           signif(7)) %>%
  bind_rows(flat) %>%
  spread(period, value)

type = 
  wide_scale %>%
  select(-cust) %>%
  distinct %>%
  mutate(type_ID = 1:n())

customer__type = 
  type %>%
  left_join(wide_scale) %>%
  select(type_ID, cust)
like image 40
bramtayl Avatar answered Dec 20 '25 15:12

bramtayl



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!