I have a large dataset that is a CSV file and I've already cleaned it up quite a bit. However there are instances where I have a column which is not numeric and I want to remove those rows.
So far I thought something like this would work. I am thinking the solution is probably simple. I am not sure whether I can even do something like this. It is a CSV file so I tried to tell awk that its delimited by commas. Also an example of a good value in column one is this: 323870133825187840
awk '/,/$1 != numeric'
Are there any suggestions on this one. I initially thought about doing some kind of a one liner such as, [0-9], etc. I would greatly appreciate it if anyone could help me out with this. For someone out there this is probably like child's play :)
Assuming that you're only dealing with unsigned integer values, you can use either:
awk -F, '$1 ~ /^[[:digit:]]+$/'
or
awk -F, '$1 !~ /[^[:digit:]]/'
Both use the 'implicit print' action. The first checks that $1 only contains digits; the second checks that $1 does not contain a non-digit.
If your numbers are more general (signed, possibly with decimal point, possibly with exponential notation — 6.0221413e+23 (Avogadro's number), for example), then you need a more complex regular expression and will use positive matching, only selecting rows where field 1 matches your regex for a valid number.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With