Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference between two groups, data processing

I have this dataprocessing problem where I want to count the difference in goals between two teams in different matches. The data looks like this:

    matchId teamId   eventSec
1   2799331   6718  443.55984
2   2799331   6726 1152.62584
3   2799331   6718 2113.82370
4   2799331   6718 2066.28793
5   2799332   6710 1001.35775
6   2799332   6721 2469.32876
7   2799333   6734  744.35563

So one of these lines represents a goal scored by a team, and matchId represents the match, teamId the teams in the match and eventSec the time of the goal. What i want to accomplish is a variable for each row that shows the difference in score, so the first row in this example data should have a goal difference of 1, row two should have goal difference by 0, row three a goal difference by 1, etc.

I want to apply this to a big dataset with alot of different match and teamIds, so I need a pretty general solution to solve this. I've tried different dplyr solutions but gotten stuck a million times and would really like some help.

Thanks alot in advance!

like image 936
shieldgang Avatar asked Jan 24 '26 13:01

shieldgang


2 Answers

I hope I am following your logic correctly, let me know.

library(tidyverse)

df %>%
  group_by(matchId) %>%
  mutate(point = if_else(teamId == first(teamId), 1, -1),
         goal_diff = cumsum(point))

In addition, if you want to get the score for each match, you could add:

group_by(matchId, teamId) %>%
summarise(score = sum(abs(point)))

Output

# A tibble: 7 x 5
# Groups:   matchId [3]
  matchId teamId eventSec point goal_diff
    <int>  <int>    <dbl> <dbl>     <dbl>
1 2799331   6718     444.     1         1
2 2799331   6726    1153.    -1         0
3 2799331   6718    2114.     1         1
4 2799331   6718    2066.     1         2
5 2799332   6710    1001.     1         1
6 2799332   6721    2469.    -1         0
7 2799333   6734     744.     1         1
like image 151
Ben Avatar answered Jan 27 '26 01:01

Ben


Well, if your dataset is not like million rows, then try this:

x=structure(list(matchId = c(2799331L, 2799331L, 2799331L, 2799331L, 
2799332L, 2799332L, 2799333L), teamId = c(6718L, 6726L, 6718L, 
6718L, 6710L, 6721L, 6734L), eventSec = c(443.55984, 1152.62584, 
2113.8237, 2066.28793, 1001.35775, 2469.32876, 744.35563)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7"))

library(dplyr)
x %>% group_by(matchId) %>% mutate(goalD=cumsum(ifelse(teamId==teamId[1],1,-1)))

# A tibble: 7 x 4
# Groups:   matchId [3]
  matchId teamId eventSec goalD
    <int>  <int>    <dbl> <dbl>
1 2799331   6718     444.     1
2 2799331   6726    1153.     0
3 2799331   6718    2114.     1
4 2799331   6718    2066.     2
5 2799332   6710    1001.     1
6 2799332   6721    2469.     0
7 2799333   6734     744.     1
like image 31
StupidWolf Avatar answered Jan 27 '26 02:01

StupidWolf



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!