I have a text file with many DNA sequences, each one on a separate line with 20 base pairs. I would like to read the file into a dataframe with each base as its own column without using a for loop or something else that requires an iteration through the entire file, since the file is very large.
I've tried using "" as the delimiter, but that just causes the entire line to be read into one column. I've also tried "." and "\w", neither of which did what I wanted.
For example, for a file that has:
ACGT
CGTA
GTAC
TACG
The dataframe should look like this:
1 2 3 4
1 A C G T
2 C G T A
3 G T A C
4 T A C G
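You can read it as a single column and split it afterwards:
import pandas as pd
# Contents of the input file ("dna.txt" is a placeholder name):
# ATGC
# CTAG
df = pd.read_csv("dna.txt", header=None)
# df
# 0
# 0 ATGC
# 1 CTAG
df[0].str.split('', expand=True)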
Output:
0 1 2 3 4 5
0 A T G C
1 C T A G
which means you get two extra empty columns (0 and 5), one at the front and one at the back. You can drop them easily, for example:
df[0].str.split('', expand=True).iloc[:,1:-1]
gives:
1 2 3 4
0 A T G C
1 C T A G
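To try the split approach without a file on disk, here is a minimal sketch that feeds the same two lines through io.StringIO. The variable names (raw, parts, bases) are only illustrative, and relabelling the rows and columns from 1 is an assumption made to match the layout asked for in the question. It relies on pandas treating the empty pattern as a regular expression, which may vary between pandas versions.

```python
import io

import pandas as pd

# Stand-in for the real file: two 4-base sequences, one per line.
raw = io.StringIO("ATGC\nCTAG\n")

# Read every line into a single column (column label 0).
df = pd.read_csv(raw, header=None)

# Splitting on the empty pattern yields one column per character,
# plus an empty column at each end.
parts = df[0].str.split("", expand=True)

# Drop the two empty edge columns.
bases = parts.iloc[:, 1:-1].copy()

# Optional: label rows and columns from 1, as in the question.
bases.columns = range(1, bases.shape[1] + 1)
bases.index = range(1, len(bases) + 1)

print(bases)
```

The split is vectorised by pandas, so no explicit Python loop over the lines is written, although the work is still done once per row internally.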
You can use pandas.read_fwf instead of pandas.read_csv to accomplish this.
If you have the file named "dna.txt" as below:
ACGT
CGTA
GTAC
TACG
You can execute the following:
import pandas as pd
# One width-1 column per base; use [1] * 20 for lines of 20 bases.
df = pd.read_fwf("dna.txt", header=None, widths=[1] * 4)
print(df)
To output:
0 1 2 3
0 A C G T
1 C G T A
2 G T A C
3 T A C G
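For the 20-base lines described in the question, the same read_fwf call can be sketched with the number of columns taken from the first line instead of being hard-coded. The sample sequences and the names text, n_bases and df20 are made up for illustration, and io.StringIO stands in for the real file; with an actual file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical sample data: two sequences of 20 bases each.
text = "ACGTACGTACGTACGTACGT\nCGTACGTACGTACGTACGTA\n"

# Take the number of columns from the first line (assumes all lines
# have the same length, as stated in the question).
n_bases = len(text.splitlines()[0])

# Every field is exactly one character wide.
df20 = pd.read_fwf(io.StringIO(text), header=None, widths=[1] * n_bases)

print(df20.shape)
print(df20.iloc[:, :5])  # first five base columns
```

Because the parsing is driven by fixed column widths, no separate splitting step is needed after loading.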