I've been working on Python for around 2 months now so I have a OK understanding of it.
My goal is to create a matrix using CSV data, then populating that matrix from the data in the 3rd column of that CSV file.
I came up with this code thus far:
import csv
import csv
def readcsv(csvfile_name):
with open(csvfile_name) as csvfile:
file=csv.reader(csvfile, delimiter=",")
#remove rubbish data in first few rows
skiprows = int(input('Number of rows to skip? '))
for i in range(skiprows):
_ = next(file)
#change strings into integers/floats
for z in file:
z[:2]=map(int, z[:2])
z[2:]=map(float, z[2:])
print(z[:2])
return
After removing the rubbish data with the above code, the data in the CSV file looks like this:
Input:
1 1 51 9 3
1 2 39 4 4
1 3 40 3 9
1 4 60 2 .
1 5 80 2 .
2 1 40 6 .
2 2 28 4 .
2 3 40 2 .
2 4 39 3 .
3 1 10 . .
3 2 20 . .
3 3 30 . .
3 4 40 . .
. . . . .
The output should look like this:
1 2 3 4 . .
1 51 39 40 60
2 40 28 40 39
3 10 20 30 40
.
.
There are about a few thousand rows and columns in this CSV file, however I'm only interested is the first 3 columns of the CSV file. So the first and second columns are basically like co-ordinates for the matrix, and then populating the matrix with data in the 3rd column.
After lots of trial and error, I realised that numpy was the way to go with matrices. This is what I tried thus far with example data:
left_column = [1, 2, 1, 2, 1, 2, 1, 2]
middle_column = [1, 1, 3, 3, 2, 2, 4, 4]
right_column = [1., 5., 3., 7., 2., 6., 4., 8.]
import numpy as np
m = np.zeros((max(left_column), max(middle_column)), dtype=np.float)
for x, y, z in zip(left_column, middle_column, right_column):
x -= 1 # Because the indicies are 1-based
y -= 1 # Need to be 0-based
m[x, y] = z
print(m)
#: array([[ 1., 2., 3., 4.],
#: [ 5., 6., 7., 8.]])
However, it is unrealistic for me to specify all of my data in my script to generate the matrix. I tried using the generators to pull the data out of my CSV file but it didn't work well for me.
I learnt as much numpy as I could, however it appears like it requires my data to already be in matrix form, which it isn't.
You should seriously consider using pandas. It is really ideal for this sort of work. I can't give you an actual solution because I don't have your data, but I would try something like the following:
import pandas as pd
df = pd.read_csv('test.csv', usecols=[0,1,2], names=['A', 'B', 'C'])
pd.pivot_table(df, index='A', columns='B', values='C')
The second line imports the data into a pandas DataFrame object (change the names into something more useful for your application). The pivot table creates the matrix you are looking for, and gracefully handles any missing data.
You can use scipy.sparse.coo_matrix to load this data very conveniently.
Working with your input:
Input:
1 1 51 9 3
1 2 39 4 4
1 3 40 3 9
1 4 60 2 .
1 5 80 2 .
2 1 40 6 .
2 2 28 4 .
2 3 40 2 .
2 4 39 3 .
3 1 10 . .
3 2 20 . .
3 3 30 . .
3 4 40 . .
. . . . .
You could do:
l, c, v = np.loadtxt('test.txt', skiprows=1).T
m = coo_matrix((v, (l-1, c-1)), shape=(l.max(), c.max()))
Then you can convert the coo_matrix to a np.ndarray:
In [9]: m.toarray()
Out[9]:
array([[ 51., 39., 40., 60., 80.],
[ 40., 28., 40., 39., 0.],
[ 10., 20., 30., 40., 0.]])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With