Here is the code:
from py4j.protocol import Py4JJavaError

def parse_clf_time(s):
    try:
        #return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(int(s[7:11]),month_map[s[3:6]],int(s[0:2]),int(s[12:14]),int(s[15:17]),int(s[18:20]))
        return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
            int(s[7:11]),
            month_map[s[3:6]],
            int(s[0:2]),
            int(s[12:14]),
            int(s[15:17]),
            int(s[18:20])
        )
    except Py4JJavaError as e:
        return "2016-08-11 00:00:01".format(
            int(s[7:11]),
            month_map[s[3:6]],
            int(s[0:2]),
            int(s[12:14]),
            int(s[15:17]),
            int(s[18:20])

u_parse_time = udf(parse_clf_time)
final_df = cleaned_df.select('*', u_parse_time(cleaned_df['timestamp']).cast('timestamp').alias('time')).drop('timestamp')
total_log_entries = final_df.count()
The df may contain bad data, so I want to use a silly try/except to handle it. Please let me know what the best practice is to exclude bad data.
For some unknown reason, I got an error:

So what's wrong with the code? It works in another project in the same environment, so I am pretty sure the error is not coming from the code itself.
Thank you very much, any clue is appreciated.
You are missing a closing ) for return "2016-08-11 00:00:01".format( in the except block.
Also, you didn't import udf:
from pyspark.sql.functions import udf
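
On the "best practice to exclude bad data" part of the question, here is a minimal sketch of one possible approach, not a definitive fix. It assumes that a malformed string fails inside the function with ordinary Python exceptions (ValueError from int(), KeyError from the month lookup, TypeError for a null value) rather than Py4JJavaError, so those are what it catches. It returns None for bad rows so they can be filtered out later. The month_map dictionary is hypothetical, since the question uses it without showing its definition.

```python
# Hypothetical month lookup: the question references month_map but never defines it.
month_map = {
    'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
    'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12,
}

def parse_clf_time(s):
    """Convert a Common Log Format time such as '01/Aug/1995:00:00:01 -0400'
    to 'yyyy-MM-dd HH:mm:ss'. Return None when the input is malformed."""
    try:
        return "{0:04d}-{1:02d}-{2:02d} {3:02d}:{4:02d}:{5:02d}".format(
            int(s[7:11]),        # year
            month_map[s[3:6]],   # month abbreviation -> number
            int(s[0:2]),         # day
            int(s[12:14]),       # hour
            int(s[15:17]),       # minute
            int(s[18:20]),       # second
        )
    except (KeyError, ValueError, TypeError):
        # Bad or missing input: mark the row as unparseable instead of
        # inventing a placeholder date.
        return None

# Wiring into Spark (commented out so the snippet runs without a cluster;
# cleaned_df is the DataFrame from the question):
#
# from pyspark.sql.functions import udf
# u_parse_time = udf(parse_clf_time)
# final_df = cleaned_df.select(
#     '*', u_parse_time(cleaned_df['timestamp']).cast('timestamp').alias('time')
# ).drop('timestamp')
# good_df = final_df.filter(final_df['time'].isNotNull())  # drop bad rows
```

Returning None instead of a fixed date such as "2016-08-11 00:00:01" keeps bad rows distinguishable from real ones: the cast to timestamp leaves them null, and the isNotNull() filter drops them.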