I have the following code:
import json
import pandas as pd
import numpy as np
import random
pd.set_option('expand_frame_repr', False) # To view all the variables in the console
# read data
records = []
with open('./data/data_file.txt', 'r') as file:
    for line in file:
        record = json.loads(line)
        records.append(record)
# construct list of ids
ids = set()
for record in records:
    for w in record['A']:
        ids.add(w['NAME'])
random.seed(1234); sampled_ids = random.sample(ids,50)
When I run this code once in the PyCharm IDE and then immediately afterwards in a Jupyter Notebook, I get different ids sampled in each one. What's going on?
P.S.
I used the semicolon on the last line because I found that if I set the seed on one line and sample on the next, I get different results on each run, even in the same IDE. This is truly mysterious to me. I use Python 3.7.
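The cause of this behaviour lies in the set. A set arranges its elements according to their hash values (the elements of a set must be hashable, i.e. must have a __hash__ method), so its iteration order depends on those hashes. String hashes are randomized for each interpreter process unless PYTHONHASHSEED is fixed, so the same set can iterate in a different order after you start another console. (Not always: integer hashes, for example, are not randomized, but that's another topic.) random.sample picks elements by position, so the same seed selects the same positions but, in a differently ordered set, different elements.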
For example, here are the results from two consoles in the same IDE:
1/A:
arr1 = set('skevboa;gj[pvemoeprnjpdbr ]p')
random.seed(1234)
random.sample(arr1, 3)
Out[47]: ['p', 'k', ']']
random.seed(1234)
random.sample(arr1, 3)
Out[48]: ['p', 'k', ']']
hash('s')
Out[49]: 1861403979552045688
2/A:
arr1 = set('skevboa;gj[pvemoeprnjpdbr ]p')
random.seed(1234)
random.sample(arr1, 3)
Out[29]: [';', 'a', 'b']
random.seed(1234)
random.sample(arr1, 3)
Out[30]: [';', 'a', 'b']
hash('s')
Out[31]: -2409441490032867064
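The difference between the two consoles can be reproduced from a single script by starting fresh interpreters. This is only a sketch: the helper name hash_in_fresh_interpreter is made up for illustration, and it relies on the PYTHONHASHSEED environment variable, which fixes the string hash seed of a process when set to an integer.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, hash_seed=None):
    """Return hash(text) as computed by a newly started Python process.

    Hypothetical helper, for illustration only.
    """
    env = dict(os.environ)
    # Drop any inherited setting so the default (randomized) behaviour applies.
    env.pop("PYTHONHASHSEED", None)
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
        universal_newlines=True,
    )
    return int(result.stdout)


# With a fixed hash seed, two separate processes agree on the hash.
fixed_a = hash_in_fresh_interpreter("s", hash_seed=0)
fixed_b = hash_in_fresh_interpreter("s", hash_seed=0)
print(fixed_a == fixed_b)

# Without it, each process picks its own seed, so the two values
# will almost always differ, just like in the two consoles.
print(hash_in_fresh_interpreter("s"), hash_in_fresh_interpreter("s"))
```

Setting PYTHONHASHSEED in both environments would be another way to make the set order agree, but it has to be set before the interpreter starts, which is not always convenient in an IDE or a notebook.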
Knowing the source of the problem, you can choose a way to solve it. For example, sort the set before sampling:
1/A:
random.seed(1234)
random.sample(sorted(arr1), 3)
Out[50]: ['p', ']', ' ']
2/A:
random.seed(1234)
random.sample(sorted(arr1), 3)
Out[32]: ['p', ']', ' ']
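Applied to the ids from the question, the same idea might look like the sketch below. The function name reproducible_sample and the example ids are invented for illustration. Using a private random.Random instance instead of the global random.seed is my own assumption about what is convenient here, since it keeps the result independent of any other code that touches the global generator.

```python
import random


def reproducible_sample(items, k, seed):
    """Sample k items in a way that does not depend on iteration order.

    Hypothetical helper, for illustration only.
    """
    ordered = sorted(items)      # fixed order, regardless of hashing
    rng = random.Random(seed)    # private generator; global state is untouched
    return rng.sample(ordered, k)


# Made-up ids standing in for the NAME values collected in the question.
ids = {"alpha", "bravo", "charlie", "delta", "echo", "foxtrot"}

first = reproducible_sample(ids, 3, seed=1234)
second = reproducible_sample(ids, 3, seed=1234)
print(first == second)
```

Passing a sorted list also avoids sampling from a set directly, which newer Python versions no longer accept.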