I'm following the Athena getting started guide and trying to parse my own Cloudfront logs. However, the fields are not being parsed.
I used a small test file, as follows:
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type
2016-02-02  07:57:45    LHR5    5001    86.177.253.38   GET d3g47gpj5mj0b.cloudfront.net    /foo    404 -   Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36   -   -   Error   -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA==    d3g47gpj5mj0b.cloudfront.net    https421    0.076   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error
2016-02-02  07:57:45    LHR5    1158241 86.177.253.38   GET d3g47gpj5mj0b.cloudfront.net    /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo    Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36   -   -   Miss    oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA==    d3g47gpj5mj0b.cloudfront.net    https   419 0.440   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss
And created the table with this SQL:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `Date` DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  os STRING,
  Browser STRING,
  BrowserVersion STRING
  ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
  ) LOCATION 's3://test/athena-csv/'
But no data comes back:

I can see it returns 4 rows, but the first 2 should be excluded because they start with a #, so it's like the regex isn't being parsed correctly.
Am I doing something wrong? Or is the regex wrong (seems unlikely, as it's in the docs, and looks fine to me)?
This is what I ended up with:
CREATE EXTERNAL TABLE logs (
  `date` date,
  `time` string,
  `location` string,
  `bytes` int,
  `request_ip` string,
  `method` string,
  `host` string,
  `uri` string,
  `status` int,
  `referer` string,
  `useragent` string,
  `uri_query` string,
  `cookie` string,
  `edge_type` string,
  `edget_requiest_id` string,
  `host_header` string,
  `cs_protocol` string,
  `cs_bytes` int,
  `time_taken` string,
  `x_forwarded_for` string,
  `ssl_protocol` string,
  `ssl_cipher` string,
  `result_type` string,
  `protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)'
) LOCATION 's3://logs'
Note the double backslashes are intentional.
The format of the cloudfront logs changed at some point to add the protocol. This handles older and newer files.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With