I have a dataframe with different people. Each row contains attributes which characterize the individual person. Basically I need something like a filter or matching algorithm which weights specific attributes. The dataframe looks like this:
import pandas as pd
df = pd.DataFrame({
    'sex': ['m', 'f', 'm', 'f', 'm', 'f'],
    'food': [0, 0, 1, 3, 4, 3],
    'age': ['young', 'young', 'young', 'old', 'young', 'young'],
    'kitchen': [0, 1, 2, 0, 1, 2],
})
The dataframe df looks like this:
    sex food  age     kitchen
0   m    0    young    0
1   f    0    young    1
2   m    1    young    2
3   f    3    old      0
4   m    4    young    1
5   f    3    young    2
I am looking for an algorithm which groups all people in the dataframe into pairs. My plan is to find pairs of two people based on the following attributes:
One person must have a kitchen (kitchen=1)
It is important that at least one person has a kitchen.  
kitchen=0 --> person has no kitchen
kitchen=1 --> person has a kitchen
kitchen=2 --> person has a kitchen but only in emergency (when there is no other option)
Same food preferences
food=0 --> meat eater
food=1 --> does not matter
food=2 --> vegan
food=3 --> vegetarian
A meat eater (food=0) can be matched with a person who doesn't care about food preferences (food=1) but can't be matched with a vegan or vegetarian. A vegan (food=2) fits best with a vegetarian (food=3) and, if necessary, can go with food=1. And so on...
Similar age
There are nine age groups: 10-18; 18-22; 22-26; 26-29; 29-34; 34-40; 40-45; 45-55 and 55-75. People in the same age group match perfectly. Young age groups do not match very well with older age groups, and neighboring age groups match somewhat better. There is no clearly defined condition. The meaning of "old" and "young" is relative.
The sex doesn't matter. There are many pair combinations possible. Because my actual dataframe is very long (3000 rows), I need to find an automated solution. A solution that gives me the best pairs in a dataframe or dictionary or something else.
I really do not know how to approach this problem. I looked for similar problems on Stack Overflow, but I did not find anything suitable. Mostly it was just too theoretical, and nothing really fit my problem.
My expected output here would be, for example a dictionary (not sure how) or a dataframe which is sorted in a way that every two rows can be seen as one pair.
Background: The goal is to make pairs for some free time activities. I think people in the same or similar age groups share the same interests, so I want to take this into account in my code.
I have added a 'name' column as a key to identify each person.
The approach is to assign scores to the attribute values, and then to use those scores to filter and rank the candidate pairs according to the given conditions.
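For kitchen scores we used: kitchen=0 scores 0, kitchen=1 scores 1, and kitchen=2 scores 0.5.
We check that [kitchen score of record 1] + [kitchen score of record 2] is greater than zero. The following cases arise: if neither person has a kitchen, the sum is 0 and the pair is rejected; in every other combination the sum is positive and the pair is kept.
For food scores we used: food=0 (meat eater) scores -1, food=1 (does not matter) scores 0, and food=2 (vegan) and food=3 (vegetarian) both score 1.
We check that [food score of record 1] * [food score of record 2] is greater than or equal to zero. The following cases arise: a meat eater paired with a vegan or vegetarian gives -1 and the pair is rejected; two people with the same kind of diet give 1, and any pair that includes a "does not matter" person gives 0, so those pairs are kept.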
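The two filter conditions can be sketched as a small standalone function. This is only an illustration: the score values come from the k_scr and f_scr columns in the code further down, while the helper name is_valid_pair and the dictionary names are made up for this example.

```python
# Score mappings as used for the k_scr and f_scr columns
KITCHEN_SCORE = {0: 0.0, 1: 1.0, 2: 0.5}
FOOD_SCORE = {0: -1, 1: 0, 2: 1, 3: 1}

def is_valid_pair(kitchen_a, food_a, kitchen_b, food_b):
    """Hypothetical helper: True if the pair passes both filter conditions."""
    # At least one kitchen (regular or emergency) makes the sum positive
    kitchen_ok = KITCHEN_SCORE[kitchen_a] + KITCHEN_SCORE[kitchen_b] > 0
    # Only meat eater x vegan/vegetarian gives a negative product
    food_ok = FOOD_SCORE[food_a] * FOOD_SCORE[food_b] >= 0
    return kitchen_ok and food_ok

# meat eater without kitchen + "does not matter" with kitchen
print(is_valid_pair(0, 0, 1, 1))
# meat eater without kitchen + vegetarian with kitchen
print(is_valid_pair(0, 0, 1, 3))
# vegan + vegetarian, but neither has a kitchen
print(is_valid_pair(0, 2, 0, 3))
```

Only the first of the three example pairs passes both checks; the second fails on food and the third on kitchen.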
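For scoring age groups, we assigned the following values to the groups:
10-18 = 1, 18-22 = 2, 22-26 = 3, 26-29 = 4, 29-34 = 5, 34-40 = 6, 40-45 = 7, 45-55 = 8, 55-75 = 9
For calculating the Age Score, the following formula has been used:
age_score = round((1 - (abs(Age Group Value of Person 1 - Age Group Value of Person 2) / 10)), 2)
In the above formula, the calculation is done as follows: take the absolute difference between the two age group values, divide it by 10, subtract the result from 1, and round to two decimal places.
The cases will be as follows: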
 round(1 - (abs(2 - 2) / 10), 2) = 1.0 
 round(1 - (abs(8 - 8) / 10), 2) = 1.0 
 round(1 - (abs(2 - 8) / 10), 2) = 0.4 
 round(1 - (abs(1 - 9) / 10), 2) = 0.2 
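As a rough sketch, the same formula can be wrapped in a function that takes the age group labels directly. The group values are the ones used for the a_scr column in the code below; the names AGE_GROUP_VALUE and age_score are chosen here for illustration only.

```python
# Age group values, youngest group first (same numbering as the a_scr column)
AGE_GROUP_VALUE = {
    '10-18': 1, '18-22': 2, '22-26': 3, '26-29': 4, '29-34': 5,
    '34-40': 6, '40-45': 7, '45-55': 8, '55-75': 9,
}

def age_score(group_a, group_b):
    """Score between 0.2 and 1.0; identical groups score 1.0."""
    diff = abs(AGE_GROUP_VALUE[group_a] - AGE_GROUP_VALUE[group_b])
    return round(1 - diff / 10, 2)

print(age_score('18-22', '18-22'))  # same group
print(age_score('18-22', '45-55'))  # six groups apart
print(age_score('10-18', '55-75'))  # youngest vs oldest group
```

The three calls correspond to the first, third and fourth cases listed above.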
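For calculating the Final Score we used:
Final Score = Food Score + Kitchen Score + Age Score
Here the Food Score is the product of the two food scores and the Kitchen Score is the sum of the two kitchen scores, as in the checks above.
Then we sorted the data by Final Score to obtain the best pairs.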
import pandas as pd
import numpy as np
# Creating the DataFrame, here I have added the attribute 'name' for identifying the record.
df = pd.DataFrame({
    'name' : ['jacob', 'mary', 'rick', 'emily', 'sabastein', 'anna', 
              'christina', 'allen', 'jolly', 'rock', 'smith', 'waterman', 
              'mimi', 'katie', 'john', 'rose', 'leonardo', 'cinthy', 'jim', 
              'paul'],
    'sex' : ['m', 'f', 'm', 'f', 'm', 'f', 'f', 'm', 'f', 'm', 'm', 'm', 'f', 
             'f', 'm', 'f', 'm', 'f', 'm', 'm'],
    'food' : [0, 0, 1, 3, 2, 3, 1, 0, 0, 3, 3, 2, 1, 2, 1, 0, 1, 0, 3, 1],
    'age' : ['10-18', '22-26', '29-34', '40-45', '18-22', '34-40', '55-75',
             '45-55', '26-29', '26-29', '18-22', '55-75', '22-26', '45-55', 
             '10-18', '22-26', '40-45', '45-55', '10-18', '29-34'],
    'kitchen' : [0, 1, 2, 0, 1, 2, 2, 1, 0, 0, 1, 0, 1, 1, 1, 0, 2, 0, 2, 1],
})
# Adding a normalized field 'k_scr' for kitchen (2 = emergency only, so it counts half)
df['k_scr'] = df['kitchen'].map({0: 0.0, 1: 1.0, 2: 0.5})
# Adding a normalized field 'f_scr' for food
# (meat eater = -1, does not matter = 0, vegan and vegetarian = 1)
df['f_scr'] = df['food'].map({0: -1, 1: 0, 2: 1, 3: 1})
# Adding a normalized field 'a_scr' for age (groups numbered 1 to 9, youngest first)
age_groups = ['10-18', '18-22', '22-26', '26-29', '29-34',
              '34-40', '40-45', '45-55', '55-75']
df['a_scr'] = df['age'].map({group: value for value, group
                             in enumerate(age_groups, start=1)})
# Printing DataFrame after adding normalized score values
print(df)
commonarr = [] # Empty array for our output
dfarr = np.array(df) # Converting DataFrame to Numpy Array
for i in range(len(dfarr) - 1): # Iterating the Array row
    for j in range(i + 1, len(dfarr)): # Iterating the Array row + 1
        # Check for Food Condition to include relevant records
        if dfarr[i][6] * dfarr[j][6] >= 0: 
            # Check for Kitchen Condition to include relevant records
            if dfarr[i][5] + dfarr[j][5] > 0:
                row = []
                # Appending the names
                row.append(dfarr[i][0])
                row.append(dfarr[j][0])
                # Appending the final score
                row.append((dfarr[i][6] * dfarr[j][6]) +
                           (dfarr[i][5] + dfarr[j][5]) +
                           (round((1 - (abs(dfarr[i][7] -
                                            dfarr[j][7]) / 10)), 2)))
                # Appending the row to the Final Array
                commonarr.append(row)
# Converting Array to DataFrame
ndf = pd.DataFrame(commonarr)
# Sorting the DataFrame on Final Score
ndf = ndf.sort_values(by=[2], ascending=False)
print(ndf)
         name sex  food    age  kitchen  k_scr  f_scr a_scr
0       jacob   m     0  10-18        0    0.0     -1     1
1        mary   f     0  22-26        1    1.0     -1     3
2        rick   m     1  29-34        2    0.5      0     5
3       emily   f     3  40-45        0    0.0      1     7
4   sabastein   m     2  18-22        1    1.0      1     2
5        anna   f     3  34-40        2    0.5      1     6
6   christina   f     1  55-75        2    0.5      0     9
7       allen   m     0  45-55        1    1.0     -1     8
8       jolly   f     0  26-29        0    0.0     -1     4
9        rock   m     3  26-29        0    0.0      1     4
10      smith   m     3  18-22        1    1.0      1     2
11   waterman   m     2  55-75        0    0.0      1     9
12       mimi   f     1  22-26        1    1.0      0     3
13      katie   f     2  45-55        1    1.0      1     8
14       john   m     1  10-18        1    1.0      0     1
15       rose   f     0  22-26        0    0.0     -1     3
16   leonardo   m     1  40-45        2    0.5      0     7
17     cinthy   f     0  45-55        0    0.0     -1     8
18        jim   m     3  10-18        2    0.5      1     1
19       paul   m     1  29-34        1    1.0      0     5
             0          1    2
48   sabastein      smith  4.0
10        mary      allen  3.5
51   sabastein      katie  3.4
102      smith        jim  3.4
54   sabastein        jim  3.4
99       smith      katie  3.4
61        anna      katie  3.3
45   sabastein       anna  3.1
58        anna      smith  3.1
14        mary       rose  3.0
12        mary       mimi  3.0
84       allen     cinthy  3.0
98       smith       mimi  2.9
105   waterman      katie  2.9
11        mary      jolly  2.9
50   sabastein       mimi  2.9
40       emily      katie  2.9
52   sabastein       john  2.9
100      smith       john  2.9
90        rock      smith  2.8
47   sabastein       rock  2.8
0        jacob       mary  2.8
17        mary       paul  2.8
13        mary       john  2.8
119      katie        jim  2.8
116       mimi       paul  2.8
111       mimi       john  2.8
103      smith       paul  2.7
85       allen       paul  2.7
120      katie       paul  2.7
..         ...        ...  ...
This solution has further scope of optimization.
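One gap in the output above is that the same person appears in many candidate pairs. A possible next step, not part of the code above, is a greedy pass over the sorted list that accepts a pair only if neither person has been used yet. The sketch below assumes rows of the form (name, name, score); the function name greedy_pairs is hypothetical, and a greedy choice is not guaranteed to give the best overall pairing.

```python
def greedy_pairs(scored_pairs):
    """Pick disjoint pairs, highest score first (greedy, not guaranteed optimal)."""
    used = set()
    result = []
    for a, b, score in sorted(scored_pairs, key=lambda r: r[2], reverse=True):
        # Accept the pair only if both people are still unmatched
        if a not in used and b not in used:
            result.append((a, b, score))
            used.update((a, b))
    return result

# A few rows copied from the sorted output above
sample = [
    ('sabastein', 'smith', 4.0),
    ('mary', 'allen', 3.5),
    ('sabastein', 'katie', 3.4),
    ('smith', 'jim', 3.4),
    ('anna', 'katie', 3.3),
]
for pair in greedy_pairs(sample):
    print(pair)
```

With the dataframe from the code above, the input could be built with ndf.values.tolist(). If the overall best assignment matters more than simplicity, a maximum weight matching on a graph of candidate pairs (for example max_weight_matching in networkx) would be the alternative to look at.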
This seems like a very interesting problem to me. There are several ways to solve it. I will describe one, and link you to another solution which I feel is somewhat related.
A possible approach could be to create an additional column in your dataframe, containing a 'code' which refers to the given attributes. For example:
    sex  food  age      kitchen   code
0   m    0     young    0         0y0
1   f    0     young    1         0y1
2   m    1     young    2         1y2
3   f    3     old      0         3o0
4   m    4     young    1         4y1
5   f    3     young    2         3y2
This 'code' is made up of abbreviations of your attributes. Since the sex doesn't matter, the first character in the code stands for the 'food', the second one for the 'age' and the third for the 'kitchen'.
4y1 = food 4, age young, kitchen 1.
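A minimal sketch of how such a column could be built with pandas, assuming the age column holds the strings 'young' and 'old' as in the question, so that the first letter is enough to tell them apart:

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['m', 'f', 'm', 'f', 'm', 'f'],
    'food': [0, 0, 1, 3, 4, 3],
    'age': ['young', 'young', 'young', 'old', 'young', 'young'],
    'kitchen': [0, 1, 2, 0, 1, 2],
})

# food digit + first letter of age + kitchen digit, e.g. '4y1'
df['code'] = (df['food'].astype(str)
              + df['age'].str[0]
              + df['kitchen'].astype(str))
print(df['code'].tolist())
```

With the nine real age groups, a single letter would no longer be enough, so the age part of the code would need a different encoding, for example a group number.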
Based on these codes you can come up with a pattern. I recommend working with regular expressions for this. You can then write something like this:
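import re
# Three-character codes: food, age, kitchen
haskitchen = r'\S\S1'
hasnokitchen = r'\S\S0'
# Assumes df already has the 'code' column shown in the table above
codes = df['code'].tolist()
match_kitchen = [c for c in codes if re.fullmatch(haskitchen, c)]
match_nokitchen = [c for c in codes if re.fullmatch(hasnokitchen, c)]
kitchendict = {}
kitchendict["Has kitchen"] = match_kitchen
kitchendict["Has no kitchen"] = match_nokitchen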
Based on this, you can loop over the entries and put them together however you want. There may be a much easier solution and I haven't tested the code, but this is what came to mind. One thing is for sure: use regular expressions for matching.
Well, let's test for the kitchen.
for I in df['kitchen']:
    if I != 0:
        print("Kitchen Found")
    else:
        print("No kitchen")
Okay, now that we know which people have a kitchen, let's find, for each person without a kitchen, someone with a kitchen and similar food preferences. Let's create a variable that tells us how many people have a kitchen (x). Let's also make a people variable for counting people.
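people = 0  # total number of people seen
x = 0       # number of people who have a kitchen
kitchen = df['kitchen'].tolist()
food = df['food'].tolist()
for person, (I, A) in enumerate(zip(kitchen, food)):
    people = people + 1
    if I != 0:
        x = x + 1
        print("Kitchen Found")
    else:
        print("No kitchen")
        # Search the kitchen owners for compatible food preferences
        for other, (K, J) in enumerate(zip(kitchen, food)):
            if K == 0:
                continue  # the partner must have a kitchen
            if A == J:
                print("food match found for person " + str(person) + ": person " + str(other))
            elif A == 0:
                if J == 1:
                    print("food match found for person " + str(person) + ": person " + str(other))
            elif A == 1:
                # food=1 means "does not matter", so anyone fits
                print("food match found for person " + str(person) + ": person " + str(other))
            elif A == 2 or A == 3:
                if J == 2 or J == 3 or J == 1:
                    print("food match found for person " + str(person) + ": person " + str(other))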
I am currently working on the age part and still adjusting some things.