I'm trying to add column headers to the following set of data. As per specifications of the project, I cannot simply modify the file to add those headers manually.
Sample of the data that I'm working with:
38.049133 0.224026 0.05398 -19.11 -20.03
38.352526 0.212491 0.05378 -18.35 -19.19
38.363598 0.210654 0.05401 -20.11 -20.89
54.936819 0.216794 0.20114 -20.94 -21.88
54.534881 0.578615 0.12887 -19.75 -20.66
54.743075 0.508774 0.18331 -20.54 -21.53
54.867240 0.562636 0.13956 -19.95 -20.85
54.856908 0.544031 0.13938 -20.14 -21.03
54.977748 0.501912 0.13923 -20.27 -21.01
54.992762 0.460376 0.12723 -20.24 -20.83
I've created a list of 5 strings to act as the headers for the columns of this DataFrame. Selecting by header does seem to return only that column (e.g. print(df['z']) prints just that one column), but every value in the DataFrame becomes "NaN" once I specify the column titles from the list of strings. When I do not specify columns, the data displays just fine: it shows the sample lines above exactly and detects the columns properly.
Sample of my code:
... imports and whatnot not shown
dataColumns = ['RA', 'DEC', 'z', 'M(g)', 'M(r)']
dataFile = pd.read_csv('file_name', delim_whitespace = True)
df = pd.DataFrame(data = dataFile, columns = dataColumns)
print(df)
Sample output of the above code (it is supposed to display exactly the sample data above but with added column headers):
RA DEC z M(g) M(r)
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
Why does the data print properly when I don't specify the 'columns' parameter for DataFrame, whereas everything displays as NaN once I do specify it?
Any help would be appreciated!
-- paanvaannd
To fix your problem, use this line instead:
df = pd.read_csv('file_name', sep=r'\s+', header=None, names=dataColumns)
pd.read_csv returns a DataFrame, so the above line handles the entire import (i.e. calling pd.DataFrame on the result of pd.read_csv is superfluous). header=None indicates that pandas shouldn't interpret the first line of the file as headers, and names=... lets you specify the column names you'd like to use. Keep a whitespace separator: the sample you posted is separated by spaces, not commas, so sep=r'\s+' is needed (delim_whitespace=True does the same thing, though recent pandas releases deprecate it in favor of sep). The NaN values come from the pd.DataFrame(data = dataFile, columns = dataColumns) call. When data is already a DataFrame, the columns argument selects existing columns by label rather than renaming them. Because the first data row was consumed as the header, the existing labels are strings like '38.049133', none of your five names match, and every requested column is filled with NaN.
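As a minimal sketch, assuming the file is whitespace-separated like the sample in the question, the snippet below reproduces the all-NaN result and then reads the same data with names assigned at import time. The io.StringIO object and the sample variable are stand-ins for the real file and are not part of the original code.

```python
import io

import pandas as pd

# First three rows of the sample from the question.
# io.StringIO stands in for the real file (an assumption for illustration).
sample = """38.049133 0.224026 0.05398 -19.11 -20.03
38.352526 0.212491 0.05378 -18.35 -19.19
38.363598 0.210654 0.05401 -20.11 -20.89
"""
dataColumns = ['RA', 'DEC', 'z', 'M(g)', 'M(r)']

# Reproduce the problem: the first row is consumed as the header, and
# pd.DataFrame(..., columns=...) selects columns by label instead of renaming.
raw = pd.read_csv(io.StringIO(sample), sep=r'\s+')
broken = pd.DataFrame(data=raw, columns=dataColumns)
print(broken)  # every value is NaN because no existing label matches

# Assign the names while reading, and keep the first row as data.
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None, names=dataColumns)
print(df)
```

If the DataFrame has already been read with header=None, assigning df.columns = dataColumns renames the columns in place and leaves the values untouched.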