
How to randomly sample from 4 CSV files so that no more than 2 or 3 consecutive rows come from the same file, in Python

Tags: python, random, csv

Hi, I'm very new to Python and am trying to write a program that takes a random sample from CSV files and writes a new file subject to some conditions. What I have done so far is probably overcomplicated and inefficient (though it doesn't need to be efficient).

I have 4 CSV files that contain 264 rows in total, where each full row is unique, though they all share common values in some columns. csv1 = 72 rows, csv2 = 72 rows, csv3 = 60 rows, csv4 = 60 rows. I need to take a random sample of 160 rows, which will make 4 blocks of 40, where in each block 10 rows must come from each CSV file. The tricky part is that no more than 2 or 3 rows from the same CSV file can appear consecutively in the final file.

So far I have managed to take a random sample of 40 rows from each CSV (just using random.sample) and write them to 4 new CSV files. Then I split each of those into 4 files of 10 rows and put them in separate folders (1-4), so I now have 4 folders, each containing 4 CSV files. Now I need to combine these so that rows from the same original CSV file don't appear more than 2 or 3 times in a row and the row order is as random as possible.

This is where I'm completely lost. I presume I should combine the 4 files in each folder (which I can do) and then re-sample or shuffle in a loop until the conditions are met, or something to that effect, but I'm not sure how to proceed, or whether I'm going about this in completely the wrong way. Any help would be greatly appreciated, and I can provide any further details that are necessary.

import random
import shutil

var_start = 1
total_condition_amount_start = 1
while var_start < 5:
    # Read every row of conditionN.csv and keep a random sample of 40
    with open("condition" + str(var_start) + ".csv", "r") as population1:
        conditions1 = [line for line in population1]
    random_selection1 = random.sample(conditions1, 40)
    with open("./temp/40cond" + str(var_start) + ".csv", "w") as temp_output:
        temp_output.write("".join(random_selection1))
    var_start = var_start + 1


# splitter and total_condition_amount are defined elsewhere in my script (not shown here)
while total_condition_amount_start < total_condition_amount:
    # Split the 40-row sample into output_1.csv ... output_4.csv
    with open("./temp/40cond" + str(total_condition_amount_start) + ".csv", "rb") as sample_file:
        splitter.split(sample_file)

    # Move chunk N into folder blockN, named after the CSV it came from
    for folder_no in range(1, 5):
        shutil.move("./temp/output_" + str(folder_no) + ".csv",
                    "./temp/block" + str(folder_no) + "/output_" + str(total_condition_amount_start) + ".csv")

    total_condition_amount_start = total_condition_amount_start + 1
user3207355 asked Nov 27 '25 14:11


1 Answer

You should probably try the built-in csv library: http://docs.python.org/3.3/library/csv.html

That way you can handle each file as a list of dictionaries, which will make your task a lot easier.
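As a rough illustration of the list-of-dictionaries idea, here is a minimal sketch using csv.DictReader. It assumes each file has a header row; the helper name read_rows and the column names id and word are made up for the example, and the in-memory file only stands in for a real one such as condition1.csv.

```python
import csv
import io


def read_rows(csv_file):
    """Return the rows of an open CSV file as a list of dicts keyed by header."""
    return list(csv.DictReader(csv_file))


# Hypothetical in-memory stand-in for open("condition1.csv", newline="")
sample = io.StringIO("id,word\n1,apple\n2,pear\n")
rows = read_rows(sample)
print(rows[0]["word"])  # apple
```

Each row can then be tagged with the file it came from, which makes the "no more than 3 in a row" check straightforward later on.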

from random import randint, sample, choice


def create_random_list(length):
    return [randint(0, 100) for i in range(length)]

# Placeholder data. This should be your list of four initial csv files
# with the 264 rows in total, read with the csv lib
lists = [create_random_list(264) for i in range(4)]

# Take a randomized sample of 40 from each list
lists = [sample(x, 40) for x in lists]

# Track how many items each list has contributed to the current block
lists = [{'data': x, 'full_count': 0} for x in lists]


final = [[] for i in range(4)]
for l in final:
    prev = None
    count = 0
    while len(l) < 40:
        current = choice(lists)

        # Skip this list if it has already supplied 10 items to the block
        # or has just been used 3 times in a row
        if current['full_count'] == 10 or (current is prev and count == 3):
            continue

        total_left = 40 - len(l)
        # Largest number of items any other list still has to supply
        maxx = 0
        for i in lists:
            if i is not current and 10 - i['full_count'] > maxx:
                maxx = 10 - i['full_count']

        max_left = maxx + maxx/3.0

        if maxx > 3 and total_left <= max_left:
            # Make sure that in the future it can still be split into sets of
            # max 3
            continue

        l.append(current['data'].pop())
        current['full_count'] += 1

        # Track the length of the current run from the same list
        if current is prev:
            count += 1
        else:
            count = 1
            prev = current

    # Reset the per-block counters before filling the next block
    for li in lists:
        li['full_count'] = 0
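The shuffle-in-a-loop idea from the question should also be workable for blocks this small, since a fair share of random orderings already satisfy the rule. Below is a minimal sketch of that alternative, not the approach used above. It assumes each block is a dict mapping a source label to that block's 10 rows; the names max_run and shuffle_block, the labels csv1 to csv4, and the demo rows are all illustrative.

```python
import random
from itertools import groupby


def max_run(labels):
    """Length of the longest run of identical consecutive labels."""
    return max((len(list(group)) for _, group in groupby(labels)), default=0)


def shuffle_block(rows_by_source, max_in_a_row=3, max_tries=10000):
    """Shuffle one block until no source appears more than max_in_a_row times in a row.

    rows_by_source maps a source label (e.g. "csv1") to that block's rows.
    Returns a list of (label, row) pairs.
    """
    block = [(label, row) for label, rows in rows_by_source.items() for row in rows]
    for _ in range(max_tries):
        random.shuffle(block)
        if max_run([label for label, _ in block]) <= max_in_a_row:
            return block
    raise RuntimeError("no valid ordering found; try raising max_tries")


# Hypothetical stand-in data: 10 placeholder rows from each of 4 sources
demo = {"csv%d" % n: ["row%d_%d" % (n, i) for i in range(10)] for n in range(1, 5)}
block = shuffle_block(demo)
```

max_run can also be used on its own to check the output of any other method, as long as each row keeps a record of which file it came from.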
JelteF answered Nov 30 '25 02:11


