Consider the following data set,
Name Age Grade
Person1 18 50
Person2 19 100
Person3 20 0
Person4 21 -25
Person5 22 -125
Person6 23 80
Person7 24 -70
If I wanted to group together the students based on the aggregation of their grades, how would I go about it? What I mean is - I want to make groups of students such as the sum of the grades of all the students in that grade is 0 or as close to zero as possible.
So, ideally - the answer to the dataset would be something like -
Group 1 ----> Person1, Person2, Person4, Person5 (because their added grades are 0)
Group 2 ----> Person3 (because the grade is 0)
How I went on about approaching this was -
new_dt = pd.read_csv('../tt.csv')
a = list(itertools.chain.from_iterable(
[[j for i in el for j in i] for el in itertools.combinations(new_dt.values.tolist(),i)]
for i in range(1, len(new_dt)+1)
)
)
out = pd.DataFrame(a)
So, basically I am finding out all the possible combinations using itertools.combinations
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0 Person1 18 50
1 Person2 19 100
2 Person3 20 0
3 Person4 21 -25
4 Person5 22 -125
5 Person6 23 80
6 Person7 24 -70
7 Person1 18 50 Person2 19 100
8 Person1 18 50 Person3 20 0
9 Person1 18 50 Person4 21 -25
10 Person1 18 50 Person5 22 -125
11 Person1 18 50 Person6 23 80
12 Person1 18 50 Person7 24 -70
13 Person2 19 100 Person3 20 0
14 Person2 19 100 Person4 21 -25
15 Person2 19 100 Person5 22 -125
16 Person2 19 100 Person6 23 80
17 Person2 19 100 Person7 24 -70
18 Person3 20 0 Person4 21 -25
19 Person3 20 0 Person5 22 -125
20 Person3 20 0 Person6 23 80
21 Person3 20 0 Person7 24 -70
22 Person4 21 -25 Person5 22 -125
23 Person4 21 -25 Person6 23 80
24 Person4 21 -25 Person7 24 -70
25 Person5 22 -125 Person6 23 80
26 Person5 22 -125 Person7 24 -70
27 Person6 23 80 Person7 24 -70
28 Person1 18 50 Person2 19 100 Person3 20 0
29 Person1 18 50 Person2 19 100 Person4 21 -25
30 Person1 18 50 Person2 19 100 Person5 22 -125
31 Person1 18 50 Person2 19 100 Person6 23 80
32 Person1 18 50 Person2 19 100 Person7 24 -70
33 Person1 18 50 Person3 20 0 Person4 21 -25
34 Person1 18 50 Person3 20 0 Person5 22 -125
35 Person1 18 50 Person3 20 0 Person6 23 80
36 Person1 18 50 Person3 20 0 Person7 24 -70
37 Person1 18 50 Person4 21 -25 Person5 22 -125
38 Person1 18 50 Person4 21 -25 Person6 23 80
39 Person1 18 50 Person4 21 -25 Person7 24 -70
40 Person1 18 50 Person5 22 -125 Person6 23 80
41 Person1 18 50 Person5 22 -125 Person7 24 -70
42 Person1 18 50 Person6 23 80 Person7 24 -70
43 Person2 19 100 Person3 20 0 Person4 21 -25
44 Person2 19 100 Person3 20 0 Person5 22 -125
45 Person2 19 100 Person3 20 0 Person6 23 80
46 Person2 19 100 Person3 20 0 Person7 24 -70
47 Person2 19 100 Person4 21 -25 Person5 22 -125
48 Person2 19 100 Person4 21 -25 Person6 23 80
49 Person2 19 100 Person4 21 -25 Person7 24 -70
50 Person2 19 100 Person5 22 -125 Person6 23 80
51 Person2 19 100 Person5 22 -125 Person7 24 -70
52 Person2 19 100 Person6 23 80 Person7 24 -70
53 Person3 20 0 Person4 21 -25 Person5 22 -125
54 Person3 20 0 Person4 21 -25 Person6 23 80
55 Person3 20 0 Person4 21 -25 Person7 24 -70
56 Person3 20 0 Person5 22 -125 Person6 23 80
57 Person3 20 0 Person5 22 -125 Person7 24 -70
58 Person3 20 0 Person6 23 80 Person7 24 -70
59 Person4 21 -25 Person5 22 -125 Person6 23 80
60 Person4 21 -25 Person5 22 -125 Person7 24 -70
61 Person4 21 -25 Person6 23 80 Person7 24 -70
62 Person5 22 -125 Person6 23 80 Person7 24 -70
63 Person1 18 50 Person2 19 100 Person3 20 0 Person4 21 -25
64 Person1 18 50 Person2 19 100 Person3 20 0 Person5 22 -125
65 Person1 18 50 Person2 19 100 Person3 20 0 Person6 23 80
66 Person1 18 50 Person2 19 100 Person3 20 0 Person7 24 -70
67 Person1 18 50 Person2 19 100 Person4 21 -25 Person5 22 -125
68 Person1 18 50 Person2 19 100 Person4 21 -25 Person6 23 80
69 Person1 18 50 Person2 19 100 Person4 21 -25 Person7 24 -70
70 Person1 18 50 Person2 19 100 Person5 22 -125 Person6 23 80
71 Person1 18 50 Person2 19 100 Person5 22 -125 Person7 24 -70
72 Person1 18 50 Person2 19 100 Person6 23 80 Person7 24 -70
73 Person1 18 50 Person3 20 0 Person4 21 -25 Person5 22 -125
74 Person1 18 50 Person3 20 0 Person4 21 -25 Person6 23 80
75 Person1 18 50 Person3 20 0 Person4 21 -25 Person7 24 -70
76 Person1 18 50 Person3 20 0 Person5 22 -125 Person6 23 80
77 Person1 18 50 Person3 20 0 Person5 22 -125 Person7 24 -70
78 Person1 18 50 Person3 20 0 Person6 23 80 Person7 24 -70
79 Person1 18 50 Person4 21 -25 Person5 22 -125 Person6 23 80
80 Person1 18 50 Person4 21 -25 Person5 22 -125 Person7 24 -70
81 Person1 18 50 Person4 21 -25 Person6 23 80 Person7 24 -70
82 Person1 18 50 Person5 22 -125 Person6 23 80 Person7 24 -70
83 Person2 19 100 Person3 20 0 Person4 21 -25 Person5 22 -125
84 Person2 19 100 Person3 20 0 Person4 21 -25 Person6 23 80
85 Person2 19 100 Person3 20 0 Person4 21 -25 Person7 24 -70
86 Person2 19 100 Person3 20 0 Person5 22 -125 Person6 23 80
87 Person2 19 100 Person3 20 0 Person5 22 -125 Person7 24 -70
88 Person2 19 100 Person3 20 0 Person6 23 80 Person7 24 -70
89 Person2 19 100 Person4 21 -25 Person5 22 -125 Person6 23 80
90 Person2 19 100 Person4 21 -25 Person5 22 -125 Person7 24 -70
91 Person2 19 100 Person4 21 -25 Person6 23 80 Person7 24 -70
92 Person2 19 100 Person5 22 -125 Person6 23 80 Person7 24 -70
93 Person3 20 0 Person4 21 -25 Person5 22 -125 Person6 23 80
94 Person3 20 0 Person4 21 -25 Person5 22 -125 Person7 24 -70
95 Person3 20 0 Person4 21 -25 Person6 23 80 Person7 24 -70
96 Person3 20 0 Person5 22 -125 Person6 23 80 Person7 24 -70
97 Person4 21 -25 Person5 22 -125 Person6 23 80 Person7 24 -70
98 Person1 18 50 Person2 19 100 Person3 20 0 Person4 21 -25 Person5 22 -125
99 Person1 18 50 Person2 19 100 Person3 20 0 Person4 21 -25 Person6 23 80
100 Person1 18 50 Person2 19 100 Person3 20 0 Person4 21 -25 Person7 24 -70
101 Person1 18 50 Person2 19 100 Person3 20 0 Person5 22 -125 Person6 23 80
102 Person1 18 50 Person2 19 100 Person3 20 0 Person5 22 -125 Person7 24 -70
103 Person1 18 50 Person2 19 100 Person3 20 0 Person6 23 80 Person7 24 -70
104 Person1 18 50 Person2 19 100 Person4 21 -25 Person5 22 -125 Person6 23 80
105 Person1 18 50 Person2 19 100 Person4 21 -25 Person5 22 -125 Person7 24 -70
106 Person1 18 50 Person2 19 100 Person4 21 -25 Person6 23 80 Person7 24 -70
107 Person1 18 50 Person2 19 100 Person5 22 -125 Person6 23 80 Person7 24 -70
108 Person1 18 50 Person3 20 0 Person4 21 -25 Person5 22 -125 Person6 23 80
109 Person1 18 50 Person3 20 0 Person4 21 -25 Person5 22 -125 Person7 24 -70
110 Person1 18 50 Person3 20 0 Person4 21 -25 Person6 23 80 Person7 24 -70
111 Person1 18 50 Person3 20 0 Person5 22 -125 Person6 23 80 Person7 24 -70
112 Person1 18 50 Person4 21 -25 Person5 22 -125 Person6 23 80 Person7 24 -70
113 Person2 19 100 Person3 20 0 Person4 21 -25 Person5 22 -125 Person6 23 80
114 Person2 19 100 Person3 20 0 Person4 21 -25 Person5 22 -125 Person7 24 -70
115 Person2 19 100 Person3 20 0 Person4 21 -25 Person6 23 80 Person7 24 -70
116 Person2 19 100 Person3 20 0 Person5 22 -125 Person6 23 80 Person7 24 -70
117 Person2 19 100 Person4 21 -25 Person5 22 -125 Person6 23 80 Person7 24 -70
118 Person3 20 0 Person4 21 -25 Person5 22 -125 Person6 23 80 Person7 24 -70
119 Person1 18 50 Person2 19 100 Person3 20 0 Person4 21 -25 Person5 22 -125 Person6 23 80
120 Person1 18 50 Person2 19 100 Person3 20 0 Person4 21 -25 Person5 22 -125 Person7 24 -70
121 Person1 18 50 Person2 19 100 Person3 20 0 Person4 21 -25 Person6 23 80 Person7 24 -70
122 Person1 18 50 Person2 19 100 Person3 20 0 Person5 22 -125 Person6 23 80 Person7 24 -70
123 Person1 18 50 Person2 19 100 Person4 21 -25 Person5 22 -125 Person6 23 80 Person7 24 -70
124 Person1 18 50 Person3 20 0 Person4 21 -25 Person5 22 -125 Person6 23 80 Person7 24 -70
125 Person2 19 100 Person3 20 0 Person4 21 -25 Person5 22 -125 Person6 23 80 Person7 24 -70
126 Person1 18 50 Person2 19 100 Person3 20 0 Person4 21 -25 Person5 22 -125 Person6 23 80 Person7 24 -70
After getting that, all the grades are in every 3rd column, so I'm adding the sums of every 3rd row to that row with
only_grades['Total'] = only_grades.sum(axis=1)
out['Total'] = only_grades['Total']
out.to_csv('../tt_2.csv')
which adds a column "Total" with the following outputs on each line,
Total
50
100
0
-25
-125
80
-70
150
50
25
-75
130
-20
100
75
-25
180
30
-25
-125
80
-70
-150
55
-95
-45
-195
10
150
125
25
230
80
25
-75
130
-20
-100
105
-45
5
-145
60
75
-25
180
30
-50
155
5
55
-95
110
-150
55
-95
-45
-195
10
-70
-220
-15
-115
125
25
230
80
0
205
55
105
-45
160
-100
105
-45
5
-145
60
-20
-170
35
-65
-50
155
5
55
-95
110
30
-120
85
-15
-70
-220
-15
-115
-140
0
205
55
105
-45
160
80
-70
135
35
-20
-170
35
-65
-90
30
-120
85
-15
-40
-140
80
-70
135
35
10
-90
-40
10
I was then wondering, I'd make the groups out of these. What do you think of this approach? And how might you make groups using the "Totals" column so that every single row in the original dataset is considered in making the groups only once, not more.
Your answer seems to work but its complexity is exponential since you do an exhaustive search.
This problem seems to be like the subset sum problem, where you want to find if there is a subset that sums to zero in a larger set (It is a NP problem, so you won't find any perfect solution in a polynomial time. Here is an explanation of NP problems if you do not know that).
Since there are already existing polynomial algorithms to find an approximation of the solution to the subset sum problem, I think adapting your problem to that could lead you to an acceptable solution. Here is the backtracking algorithm I would use.
Now let's adapt your problem :
1) The backtracking problem only applies on positive numbers, so the first step would be to find the minimum grade in your set. Keep it, you will need it several times. A simple for loop will do it.
2) If this minimum grade is < 0, you can add it to your whole set to have positive grades. Again a for loop will do it. If the minimum grade is >= 0, you do not have any subset making 0 other than lone students getting zero as a grade (and you have your solution).
3) Now your problem is the exact same problem that on the video, with n the size of your complete set, and m the number you added to your whole set on step 2) (which is the zero before adding). Do the algorithm, find your subset, and remove it from the set.
4) Note that your algorithm may come to the end of the graph without finding the required subset. So each time you backtrack, keep the subset and its sum, compare it to the best found so far, replace it if it is better.
5) Apply step 3) to your remaining set until you get a sum of your remaining set < m or until it is empty. If not empty, it becomes your last set.
This may not be the best solution, feel free to propose any improvement. This algorithm will prioritize low distance subsets + a last bad subset, so if your need is to have all subsets approximately good this may not be a good solution.
The complexity of the algorithm depends a lot if you find your subsets fast each time in the graph, so ordering your students by grades can improve the time complexity. And it would remove the step 1) since the minimum is then the first element.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With