Searching 1.4 Billion Credentials with Amazon Athena

On the 5th of December 2017, a 41GB dump of usernames and passwords was discovered - 4iQ have a post about their discovery on Medium.

4iQ's initial analysis suggests that this includes 385 million new credential pairs. This post looks at how you can discover whether any of your users are included in this dump, as well as how to perform your own analysis.

The Plan

Steps to achieve goal of performing ad hoc queries (produced using draw.io)

Before we can perform efficient ad hoc queries we need to:

  • Flatten the folder structure of the files in this breach compilation
  • Upload to S3
  • Convert to parquet

Tools

The volume of data involved here has led me to look at:

  • Amazon S3 - for storing the flattened files and the converted output
  • AWS Glue - for converting the data to Apache Parquet
  • Amazon Athena - for running ad hoc SQL queries against the data

I've also used a PowerShell script to flatten the directory structure that the files were in originally, and CloudBerry Explorer for uploading to S3.

File Structure

If you download a copy of the dump as it exists in the wild, you'll find it has been optimized for a search based on the username. This is achieved by splitting the dump into 1,981 files in a structure that represents up to the first 3 characters of the username.

e.g. all users starting with A are in a folder called A, with sub-folders representing the second letter.

This allows for quick searches on username but doesn't help much if you're looking for all the users in a given domain. That's a problem for larger corporates and managed service providers, which may have dozens or even hundreds of domain names to search for.

Step 1 - Flatten Files

A nested folder structure causes problems with Athena, which expects a table's files to sit together in one folder. So the first step is to reorganize the files into a flattened structure (i.e. remove the directory hierarchy so that all files reside in a single folder).

The PowerShell script below takes files from G:\breach-compilation\data and moves them to a flattened structure in G:\breach-compilation-flat. A file that starts off at G:\breach-compilation\data\a\l\c is moved to G:\breach-compilation-flat\a-l-c.

# Enumerate every file under the original nested structure
(Get-ChildItem -Path G:\breach-compilation\data -Recurse -File) | % {
    $file = $_
    $sourcePath = $file.FullName
    # Build the new file name by joining the relative path segments with '-'
    # e.g. a\l\c becomes a-l-c
    $newFileName = [System.String]::Join("-", $file.FullName.Replace("G:\breach-compilation\data\", "").Split("\"))
    $destPath = "G:\breach-compilation-flat\$newFileName"
    Move-Item -Path $sourcePath -Destination $destPath -Verbose
}

This produces output of:

powershell output lists files moved

The flattened file structure meets Athena's requirements, so we can proceed with uploading the data to S3.

Step 2 - Upload to S3

Either upload to S3 using a tool like CloudBerry Explorer, or use the AWS Command Line Interface with a command like:

aws s3 sync G:\breach-compilation-flat\ s3://glue-import-bucket/breach-compilation-flat/

The goal is to end up with your flattened files in an S3 bucket.

listing of s3 bucket objects (shown in cloudberry explorer)

Step 3 - Create a Table in Athena for the Flattened Files

Now that we have our files in S3, we need to tell Athena about the structure of the data.

Looking inside a file, we can see the data is in a username:plaintextpassword format (although sometimes the separator is a semi-colon ; rather than a colon :).

0autumn@163.com:plaintextpassword
0autumnapples0@gmail.com:plaintextpassword
0autumneyes0@163.com:plaintextpassword
0autumneyes0@aol.com:plaintextpassword
0auujzvssk1vss4@mail.ru:plaintextpassword
0auw2se02@sneakemail.com:plaintextpassword
0auwk04dvikk3t7@mail.ru:plaintextpassword
0auxfg@hotmail.com:plaintextpassword
0auz@sohu.com:plaintextpassword

To be useful for looking up users at a specific domain, we need to separate this into:

  • localpart - e.g. 0autumn from 0autumn@163.com:plaintextpassword
  • domain - e.g. 163.com from 0autumn@163.com:plaintextpassword
  • password - e.g. plaintextpassword from 0autumn@163.com:plaintextpassword

By splitting the username into localpart and domain we'll have a column which is just the domain name. This will allow us to write queries based on the domain name without worrying about the username. The real value of this is exposed later once we've converted the data into a format more optimized for querying.

To achieve the split I used the following Grok custom patterns:

LOCALPART ([^@]+)
DOMAIN ([^:\;]+)
PASSWORD (.*)

Using the above custom patterns, the format for a single row in the file is:

%{LOCALPART:LocalPart}@%{DOMAIN:Domain}[:\;]%{PASSWORD:Password}

This breaks down into the localpart (everything before the first @ symbol), followed by the domain (everything between that @ symbol and the next colon : or semi-colon ;), followed by the password (everything remaining).

Some parts of the file separate the domain from the password with a colon : while other parts use a semi-colon ; so the Data Definition Language (DDL) statement takes this into account (in case you're wondering where the semi-colon came from).

The final DDL statement to define the table in Athena looks like:

CREATE EXTERNAL TABLE `2017_12_09_breachcompilation`(
  `localpart` string, 
  `domain` string, 
  `password` string
)
ROW FORMAT SERDE 
  'com.amazonaws.glue.serde.GrokSerDe' 
WITH SERDEPROPERTIES ( 
  'input.format'='%{LOCALPART:LocalPart}@%{DOMAIN:Domain}[:\;]%{PASSWORD:Password}', 
  'input.grokCustomPatterns'='LOCALPART ([^@]+)\r\nDOMAIN ([^:\;]+)\r\nPASSWORD (.*)'
) 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://glue-import-bucket/breach-compilation-flat'
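
With the table defined, a quick preview confirms that the Grok pattern is splitting rows into the three columns as expected (this is just an illustrative sanity check):

SELECT localpart, domain, password
FROM "2017_12_09_breachcompilation"
LIMIT 10
;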

At this point we can perform ad hoc queries against the breach files, but it isn't optimized for speed or cost. Amazon Athena pricing is based on the amount of data scanned, and at the moment most queries will result in all of the data being scanned.

The following query will search the data for the number of times a specific password was used, in this case the password being searched for is jimbojambo.

SELECT count(*) as pwcount
FROM "2017_12_09_breachcompilation" 
where password = 'jimbojambo'
;

Screenshot from athena showing that jimbojambo occurs 16 times in the data

The results of this are impressive when you consider that Athena applied our Grok pattern to 41.12GB of data in 31.63 seconds and collated the results, at a cost of roughly USD $0.21.
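
Because the username has been split into localpart and domain columns, we can also run the kind of domain-scoped query that motivated this exercise in the first place. A minimal sketch, using a placeholder domain rather than a real one:

SELECT localpart, password
FROM "2017_12_09_breachcompilation" 
where domain = 'example.com'  -- example.com is a placeholder; substitute a domain you're responsible for
;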

Step 4 - Use Glue to Convert to Parquet

While we can query the files using the Grok SerDe, it's not efficient: each query scans all of the data files, which makes it both slower and more costly than it needs to be.

The trick to optimizing this is to use a columnar based file format that supports compression, such as Apache Parquet. If you'd like to find out more about Parquet then a quick search will yield copious results.

In my mind, the easiest way to convert to Parquet is to use the AWS Glue service. One of the capabilities of this service is to perform Extract, Transform, and Load (ETL) operations.

Defining a Table Within Glue

Before we can perform our conversion, we need to define the table that will be the destination for our data. This is done within Glue by selecting to add a table.

Define the table properties - a table name and the database within glue that this table sits within

Define the location for data stored in this table

Select the data format from the options supported by glue

define the schema for the data stored in this table

Review the selections and finish to create the table
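
If you'd prefer not to click through the wizard, the same table can also be created with a DDL statement in Athena. A rough equivalent is sketched below - the S3 location is an assumption and needs to match wherever you point the Glue job's output, and the passworddumps database is assumed to already exist in Glue:

-- the LOCATION below is assumed; use the path you chose as the table's data store
CREATE EXTERNAL TABLE `passworddumps`.`2017_12_09_breachcompilation_parquet`(
  `localpart` string, 
  `domain` string, 
  `password` string
)
STORED AS PARQUET
LOCATION
  's3://glue-import-bucket/breach-compilation-parquet/'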

Create an IAM Role for the AWS Glue Service

Before we can create the ETL job in Glue, we'll need a service role to allow the AWS Glue service to access resources within our account. If you're already using Glue then you've probably done this - you only need to do it if you haven't already made a service role with the appropriate permissions.

In the console switch to IAM

Select to create a role

create a service role for Glue

Use the Existing AWS Glue Service Role Managed Policy

Review the settings and provide a name for the role

Create a Job within Glue

To have Glue perform the conversion from the flattened raw data that we uploaded, into files stored in the Parquet format, we need to create a Job.

create a glue job for converting our files to parquet

select the data source

select the data target

specify the mappings from the source to the target

review the settings for the job

review the generated code for performing the conversion, then select run job from the buttons at the top

configure the parameters for the job. I increased the DPUs to 40 - the job took ~45 minutes to complete.

Step 5 - Perform Ad Hoc Queries

Once the Glue job has completed, the data is in the Parquet format and can be queried from Athena. The table that we defined in Glue is visible from within Athena.

We can run the same query as before (searching for a specific password) and see the difference.

SELECT count(*) as pwcount
FROM "passworddumps"."2017_12_09_breachcompilation_parquet" 
where password = 'jimbojambo'
;

Screenshot from Athena showing the jimbojambo query against the Parquet table

We're now down to 7.52 seconds and only 10.64GB of data scanned, compared to 31.63 seconds and 41.12GB of data scanned previously.
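
With the data in Parquet, the domain column really comes into its own. For example, to pull every credential pair for a set of domains you look after (the domain names below are just placeholders), something like the following works and only scans the columns it actually needs:

SELECT domain, localpart, password
FROM "passworddumps"."2017_12_09_breachcompilation_parquet" 
where domain IN ('example.com', 'example.org')  -- placeholder domains; substitute your own
ORDER BY domain, localpart
;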