I have some badly formatted data. Specifically, I have numeric columns in which some elements contain spurious text (e.g. "8 meters" instead of "8"). I want to use readtable to read in the data, make the necessary fixes, and then convert the column to Float64 so that it behaves correctly (comparisons, etc.).
There seems to have been a macro called @transform that would do the conversion, but it has been removed. How do I do this now?
My best solution at the moment is to clean up the data, write it out as a CSV file, and then re-read it using readtable with eltypes specified. But that is horrible.
What else can I do?
There is no need to round-trip the data through a CSV file. You can change or update the DataFrame directly.
using DataFrames
# Let's make up some data
df=DataFrame(A=rand(5),B=["8", "9 meters", "4.5", "3m", "12.0"])
# And then make a function to clean the data
function fixdata(arr)
    result = DataArray(Float64, length(arr))
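    # Digits, optionally followed by a single decimal point and more digits
    # (\.? rather than \.* so that input like "8..5" cannot produce an unparseable match)
    reg = r"[0-9]+\.?[0-9]*"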
    for i = 1:length(arr)
        m = match(reg, arr[i])
        if m === nothing
            result[i] = NA
        else
            result[i] = float64(m.match)
        end
    end
    result
end
# Then just apply the function to the column to clean the data
# and then replace the column with the cleaned data.
df[:B] = fixdata(df[:B])
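The code above targets the older DataFrames API that the question uses (readtable, DataArray, NA, float64), none of which exist in current Julia releases. As a rough sketch of the same cleaning step for a recent Julia version, the following uses only Base: `missing` stands in for NA and `parse(Float64, ...)` for `float64`. The name `fixdata_modern` is hypothetical, chosen here for illustration, and the sample input with the extra "n/a" entry is made up.

```julia
# Sketch for current Julia releases, using only Base (no packages required).
# `fixdata_modern` is a hypothetical name, not part of any library.
function fixdata_modern(arr)
    # Union{Missing, Float64} lets the vector hold both numbers and missing values.
    result = Vector{Union{Missing, Float64}}(missing, length(arr))
    # Digits, optionally followed by a single decimal point and more digits.
    reg = r"[0-9]+\.?[0-9]*"
    for (i, s) in enumerate(arr)
        m = match(reg, s)
        # Entries without any number stay `missing`.
        if m !== nothing
            result[i] = parse(Float64, m.match)
        end
    end
    return result
end

# Made-up sample data; the last entry contains no number at all.
cleaned = fixdata_modern(["8", "9 meters", "4.5", "3m", "12.0", "n/a"])
```

With a recent DataFrames.jl, the cleaned vector can most likely be assigned back with `df.B = fixdata_modern(df.B)`, but check the documentation for the version you have installed.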