Parsing Multiple Date Formats in Athena

This article looks at how to use Amazon Athena with a column which contains dates in multiple different formats. The sections below work through step-by-step how we get to our solution.

TL;DR; Just show me the answer.

Sample Data for Testing

To start with we need some sample data containing dates in a variety of formats. The following query uses UNION ALL to create a set of test data. We can use this to test our date parsing code on.

SELECT *
FROM
(
  SELECT '2020-01-01 13:01:01' AS MultiDateCol
  UNION ALL
  SELECT '2020/01/02 13:01:02'
  UNION ALL
  SELECT '2020/01/03'
  UNION ALL
  SELECT '04 JAN 2020'
  UNION ALL
  SELECT '05 JAN 2020 13:01:05' 
  UNION ALL
  SELECT '06/01/2020 13:01:06'
  UNION ALL
  SELECT '07-01-2020 13:01:07'
) AS TestData
2 - test data results.png

 

The Date_Parse Function in Athena

Athena provides the date_parse function, this allows you to specify a string containing a date time and parse it using the provided format. The syntax is:

date_parse(string, format) → timestamp
Parses string into a timestamp using format.
— https://prestodb.io/docs/current/functions/datetime.html

Supported Format Strings

The functions in this section use a format string that is compatible with the MySQL date_parse and str_to_date functions. The following table, based on the MySQL manual, describes the format specifiers:
Specifier Description
%a Abbreviated weekday name (Sun .. Sat)
%b Abbreviated month name (Jan .. Dec)
%c Month, numeric (1 .. 12) [4]
%D Day of the month with English suffix (0th, 1st, 2nd, 3rd, ...)
%d Day of the month, numeric (01 .. 31) [4]
%e Day of the month, numeric (1 .. 31) [4]
%f Fraction of second (6 digits for printing: 000000 .. 999000; 1 - 9 digits for parsing: 0 .. 999999999) [1]
%H Hour (00 .. 23)
%h Hour (01 .. 12)
%I Hour (01 .. 12)
%i Minutes, numeric (00 .. 59)
%j Day of year (001 .. 366)
%k Hour (0 .. 23)
%l Hour (1 .. 12)
%M Month name (January .. December)
%m Month, numeric (01 .. 12) [4]
%p AM or PM
%r Time, 12-hour (hh:mm:ss followed by AM or PM)
%S Seconds (00 .. 59)
%s Seconds (00 .. 59)
%T Time, 24-hour (hh:mm:ss)
%U Week (00 .. 53), where Sunday is the first day of the week
%u Week (00 .. 53), where Monday is the first day of the week
%V Week (01 .. 53), where Sunday is the first day of the week; used with %X
%v Week (01 .. 53), where Monday is the first day of the week; used with %x
%W Weekday name (Sunday .. Saturday)
%w Day of the week (0 .. 6), where Sunday is the first day of the week [3]
%X Year for the week where Sunday is the first day of the week, numeric, four digits; used with %V
%x Year for the week, where Monday is the first day of the week, numeric, four digits; used with %v
%Y Year, numeric, four digits
%y Year, numeric (two digits) [2]
%% A literal % character
%x x, for any x not listed above
[1] Timestamp is truncated to milliseconds.
[2] When parsing, two-digit year format assumes range 1970 .. 2069, so “70” will result in year 1970 but “69” will produce 2069.
[3] This specifier is not supported yet. Consider using day_of_week() (it uses 1-7 instead of 0-6).
[4] (1, 2, 3, 4) This specifier does not support 0 as a month or day.
Warning
The following specifiers are not currently supported: %D %U %u %V %w %X
— https://prestodb.io/docs/current/functions/datetime.html

Getting Started with Date_Parse

The following shows how date_parse can be used to convert a single line from our test data.

SELECT 
       date_parse(TestData.MultiDateCol, '%Y-%m-%d %H:%i:%s')  as DateConvertedToTimestamp
FROM
(
  SELECT '2020-01-01 13:01:01' AS MultiDateCol
) AS TestData
3 - test small conversion results.png

Converting Multiple Different Formats

Initial Error

Using the same query as above, but with the full test data, will result in an error.

4 - Initial Errors.png
Your query has the following error(s):
INVALID_FUNCTION_ARGUMENT: Invalid format: "05 JAN 2020 13:01:05" is malformed at " JAN 2020 13:01:05"
This query ran against the "cloudjunkie" database, unless qualified by the query.

This error is due to only one of the dates matching the format provided to date_parse.

Incorporating Try into the Query

Wrapping the date_parse function in try will result in a NULL value being returned if an error occurs. Using try will remove the error and allow the query to execute, but still won't convert the dates in the non-conforming formats (we'll get there though).

The query below also includes an additional column to assist with identifying which dates were successfully converted.

SELECT TestData.FormatId, 
       try(date_parse(TestData.MultiDateCol, '%Y-%m-%d %H:%i:%s')) as DateConvertedToTimestamp
FROM
(
  SELECT '2020-01-01 13:01:01' AS MultiDateCol, 'Format 1' AS FormatId
  UNION ALL
  SELECT '2020/01/02 13:01:02', 'Format 2'
  UNION ALL
  SELECT '2020/01/03', 'Format 3'
  UNION ALL
  SELECT '04 JAN 2020', 'Format 4'
  UNION ALL
  SELECT '05 JAN 2020 13:01:05', 'Format 5'
  UNION ALL
  SELECT '06/01/2020 13:01:06', 'Format 6'
  UNION ALL
  SELECT '07-01-2020 13:01:07', 'Format 7'
) AS TestData
ORDER BY TestData.FormatId
5 - Adding Try.png

Using Coalesce to Support Additional Formats

The coalesce conditional expression returns the first non-null value. Using this in the query will allow us to support multiple try functions with their own date_parse functions.

SELECT TestData.FormatId, 
       Coalesce(
         try(date_parse(TestData.MultiDateCol, '%Y-%m-%d %H:%i:%s')),
         try(date_parse(TestData.MultiDateCol, '%Y/%m/%d %H:%i:%s')),
         try(date_parse(TestData.MultiDateCol, '%Y/%m/%d')),
         try(date_parse(TestData.MultiDateCol, '%d %M %Y')),
         try(date_parse(TestData.MultiDateCol, '%d %M %Y %H:%i:%s')),
         try(date_parse(TestData.MultiDateCol, '%d/%m/%Y %H:%i:%s')),
         try(date_parse(TestData.MultiDateCol, '%d-%m-%Y %H:%i:%s'))
       )  as DateConvertedToTimestamp,
       TestData.MultiDateCol
FROM
(
  SELECT '2020-01-01 13:01:01' AS MultiDateCol, 'Format 1' AS FormatId
  UNION ALL
  SELECT '2020/01/02 13:01:02', 'Format 2'
  UNION ALL
  SELECT '2020/01/03', 'Format 3'
  UNION ALL
  SELECT '04 JAN 2020', 'Format 4'
  UNION ALL
  SELECT '05 JAN 2020 13:01:05', 'Format 5'
  UNION ALL
  SELECT '06/01/2020 13:01:06', 'Format 6'
  UNION ALL
  SELECT '07-01-2020 13:01:07', 'Format 7'
) AS TestData
ORDER BY TestData.FormatId
6 - Converted.png

Using this you can support a wide range of custom date formats without causing errors in your query.