I have the following strategy for creating dataframes with genomics data:
from hypothesis.extra.pandas import columns, data_frames, column
import hypothesis.strategies as st
def mysort(tp):
key = [-1, tp[1], tp[2], int(1e10)]
return [x for _, x in sorted(zip(key, tp))]
positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
chromosomes = st.sampled_from(elements=["chr{}".format(str(e)) for e in list(range(1, 23)) + "X Y M".split()])
genomics_data = data_frames(columns=columns(["Chromosome", "Start", "End", "Strand"], dtype=int),
rows=st.tuples(chromosomes, positions, positions, strands).map(mysort))
I am not really interested in empty dataframes as they are invalid, and I would also like to produce some really long dfs. How do I change the sizes of the dataframes created for test cases? I.e. min size 1, avg size large?
You can give the data_frames constructor an index argument which has min_size and max_size options:
from hypothesis.extra.pandas import data_frames, columns, range_indexes
import hypothesis.strategies as st
def mysort(tp):
key = [-1, tp[1], tp[2], int(1e10)]
return [x for _, x in sorted(zip(key, tp))]
chromosomes = st.sampled_from(["chr{}".format(str(e)) for e in list(range(1, 23)) + "X Y M".split()])
positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
dfs = data_frames(index=range_indexes(min_size=5), columns=columns("Chromosome Start End Strand".split(), dtype=int), rows=st.tuples(chromosomes, positions, positions, strands).map(mysort))
Produces dfs like:
Chromosome Start End Strand
0 chr11 1411202 8025685 +
1 chr18 902289 5026205 -
2 chr12 5343877 9282475 +
3 chr16 2279196 8294893 -
4 chr14 1365623 6192931 -
5 chr12 4602782 9424442 +
6 chr10 136262 1739408 +
7 chr15 521644 4861939 +
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With