15 - August - 2018

Serverless Big Data Visualisation

Post by Andrew A

A (mostly) serverless log processing, searching and visualisation solution.

Overview

  • Approx. 10 million log entries per day delivered from Akamai
  • Fully searchable
  • Dashboard with logins
  • Archiving of processed logs with expiration policies

Here we will detail a basic setup for a mostly serverless solution for Akamai log processing using AWS services. Akamai's content delivery network is one of the world's largest distributed computing platforms, responsible for serving between 15% and 30% of all web traffic.

Mostly serverless?

The brief stipulated that log delivery would be via FTP, meaning that at least one part of the solution had to utilise the more classical approach of EC2 instances.

This first part focuses on a manual setup using the AWS console; a follow-up article will detail building the same solution with Terraform, defining the infrastructure in code.

At a high level, the records flow through the solution as follows: Akamai delivers logs via FTP to an EC2 instance, the Kinesis Agent forwards them to a Kinesis Firehose delivery stream, a Lambda function transforms each record, and Firehose indexes the results into Elasticsearch (archiving copies to S3), where they can be searched and visualised in Kibana behind Cognito logins.

Elasticsearch

Setting up the Elasticsearch domain

Amazon’s Elasticsearch Service does not support Kinesis Firehose delivery streams (discussed further below) for domains that reside within a virtual private cloud (VPC). The domain will therefore be created in a publicly accessible location, with access restricted through logins via Amazon’s Cognito service.

Setting up Cognito

Cognito provides authentication, authorisation, and user management for applications. Users can sign in directly with a username and password, or through a third party such as Facebook, Amazon, or Google.

We’ll be using the two main components of Cognito, user and identity pools, to configure authentication and authorisation.

  • User pools handle user registration, authentication and account recovery.
  • Identity pools handle authorisation, granting the authenticated users from the user pool access to the requested AWS services.

The User Pool

The account/password attributes and policies can be configured as needed, for example by tailoring the required password length and complexity. Once the initial setup of the pool is complete, we’ll need to create an app client and a domain name for the user pool.
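If you prefer to script this step rather than click through the console, a minimal boto3 sketch of the user pool creation might look like the following; the pool name and password policy values are just example placeholders:

```python
import boto3

cognito_idp = boto3.client("cognito-idp")

# Example only: the pool name and password policy values are placeholders.
user_pool = cognito_idp.create_user_pool(
    PoolName="akamai-logs-users",
    AutoVerifiedAttributes=["email"],
    Policies={
        "PasswordPolicy": {
            "MinimumLength": 12,
            "RequireUppercase": True,
            "RequireLowercase": True,
            "RequireNumbers": True,
            "RequireSymbols": False,
        }
    },
)
user_pool_id = user_pool["UserPool"]["Id"]
```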

Within a Cognito context, an app client is an entity within a user pool that has permission to call the APIs that allow users to register, sign in, and recover forgotten passwords. To call these APIs, we need an app client ID and an optional client secret.

Make sure to select Cognito’s user pool as an identity provider.

Choosing a .amazoncognito.com domain
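The same two console steps, the app client and the hosted .amazoncognito.com domain, can also be sketched with boto3. The client name and domain prefix below are assumptions, and the domain prefix must be globally unique:

```python
import boto3

cognito_idp = boto3.client("cognito-idp")
user_pool_id = "eu-west-1_XXXXXXXXX"   # the pool created in the previous step

# App client: grants permission to call the register/sign-in APIs.
client = cognito_idp.create_user_pool_client(
    UserPoolId=user_pool_id,
    ClientName="kibana-auth",          # placeholder name
    GenerateSecret=True,               # the optional client secret
)
app_client_id = client["UserPoolClient"]["ClientId"]

# Hosted domain used for the login page,
# e.g. example-logs.auth.<region>.amazoncognito.com
cognito_idp.create_user_pool_domain(
    Domain="example-logs",             # must be unique across AWS
    UserPoolId=user_pool_id,
)
```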

The Identity Pool

Enter the user pool’s ID and the app client’s ID from the previous step and give the identity pool a name:

When creating the identity pool, you will be prompted to create IAM roles for the identities:
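For reference, the equivalent boto3 calls look roughly like this. The provider name is built from the user pool ID, and the authenticated role ARN comes from the IAM role created in the prompt above; all IDs and ARNs below are placeholders:

```python
import boto3

cognito_identity = boto3.client("cognito-identity")

region = "eu-west-1"                          # placeholder region
user_pool_id = "eu-west-1_XXXXXXXXX"          # from the user pool step
app_client_id = "xxxxxxxxxxxxxxxxxxxxxxxxxx"  # from the app client step

identity_pool = cognito_identity.create_identity_pool(
    IdentityPoolName="akamai_logs_identities",
    AllowUnauthenticatedIdentities=False,
    CognitoIdentityProviders=[{
        "ProviderName": f"cognito-idp.{region}.amazonaws.com/{user_pool_id}",
        "ClientId": app_client_id,
    }],
)

# Attach the authenticated IAM role created during the console prompt.
cognito_identity.set_identity_pool_roles(
    IdentityPoolId=identity_pool["IdentityPoolId"],
    Roles={"authenticated": "arn:aws:iam::123456789012:role/Cognito_akamai_logs_Auth_Role"},
)
```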

Back in the Elasticsearch domain setup, select “Enable Amazon Cognito for authentication” and choose the pools that were just created in the drop-downs:

Access policies

Copy the Cognito identity pool’s authenticated IAM role ARN and add it to the text box to give Cognito-authenticated users access to Kibana:

Step through the rest of the setup and create the ES domain.
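For completeness, here is a rough boto3 equivalent of those last console steps. The access policy grants the identity pool’s authenticated role HTTP access to the domain, and CognitoOptions wires Kibana up to the pools created above. All ARNs, IDs and instance sizes are placeholders, and note that the role in CognitoOptions is the service role that lets Elasticsearch configure Cognito, not the authenticated user role:

```python
import json

import boto3

es = boto3.client("es")

account_id = "123456789012"   # placeholder account
domain_name = "akamai-logs"
auth_role_arn = f"arn:aws:iam::{account_id}:role/Cognito_akamai_logs_Auth_Role"

# Resource-based policy: only the Cognito-authenticated role may call the domain.
access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": auth_role_arn},
        "Action": "es:ESHttp*",
        "Resource": f"arn:aws:es:eu-west-1:{account_id}:domain/{domain_name}/*",
    }],
}

es.create_elasticsearch_domain(
    DomainName=domain_name,
    ElasticsearchVersion="6.3",
    ElasticsearchClusterConfig={"InstanceType": "t2.small.elasticsearch", "InstanceCount": 1},
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 10},
    AccessPolicies=json.dumps(access_policy),
    CognitoOptions={
        "Enabled": True,
        "UserPoolId": "eu-west-1_XXXXXXXXX",
        "IdentityPoolId": "eu-west-1:11111111-2222-3333-4444-555555555555",
        # Service role that allows Elasticsearch to configure Cognito on your behalf.
        "RoleArn": f"arn:aws:iam::{account_id}:role/CognitoAccessForAmazonES",
    },
)
```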

Clicking the “Kibana” link in the Elasticsearch console’s Overview tab will take you to the login screen, from which you can log in and see the dashboard:

Creating users

Cognito supports federated authentication with services such as Google and GitHub, or any OpenID Connect provider. If self-registration is enabled, users can sign themselves up; alternatively, administrators can create accounts via the user pool interface.
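Creating an account on a user’s behalf is a single call against the user pool API. A sketch, with the username and pool ID as placeholders:

```python
import boto3

cognito_idp = boto3.client("cognito-idp")

# The user receives a temporary password by email and must change it on first login.
cognito_idp.admin_create_user(
    UserPoolId="eu-west-1_XXXXXXXXX",
    Username="dashboard.user@example.com",
    UserAttributes=[{"Name": "email", "Value": "dashboard.user@example.com"}],
    DesiredDeliveryMediums=["EMAIL"],
)
```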

Kinesis Firehose

Amazon’s Kinesis Data Firehose is used as the primary delivery mechanism. Kinesis uses the concept of “delivery streams” which load data, automatically and continuously, to the specified destinations.

Setting up the Delivery Stream

The name of the delivery stream will be used later when setting up the Kinesis Agent.

Choosing the source

The delivery stream’s source will be the Kinesis Agent, which writes records using the “Direct PUT” method, rather than an existing Kinesis data stream.

Processing the records

Elasticsearch won’t be able to directly index the data due to the formatting of the logs. They need to be processed and transformed into a format that Elasticsearch can use.

Kinesis supports the transformation of records being passed to the stream using AWS Lambda functions. One of the provided templates, kinesis-firehose-apachelog-to-json-python, can be used, with some tweaking, to read and process the logs sent from Akamai.
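The transformation Lambda receives a batch of base64-encoded records and must return each one with a recordId, a result, and the re-encoded data. The sketch below follows that contract; the regex is a simplified stand-in for whatever field layout your Akamai log format actually uses, so treat it as an assumption rather than a drop-in parser.

```python
import base64
import json
import re

# Assumed field layout for a W3C-style Akamai log line; adjust the pattern
# to match the log format configured in Log Delivery Service.
LOG_PATTERN = re.compile(
    r"(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<client_ip>\S+)\s+"
    r"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<status>\d+)\s+(?P<bytes>\d+)"
)

def handler(event, context):
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        match = LOG_PATTERN.match(raw)
        if match:
            payload = json.dumps(match.groupdict()) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
            })
        else:
            # Unparseable lines are flagged so Firehose can back them up to S3.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records that fail to parse are marked as ProcessingFailed, which ties in with the backing up of failed documents to S3 described under “Archiving” below.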

Setup the IAM role using the default settings:

Archiving

Firehose supports automatically backing up records to S3 and can deliver either all records or only the failed ones. Failed records can be analysed, fixed and re-imported so that no data is lost or corrupted.
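Pulling the previous steps together, a boto3 sketch of the whole delivery stream, with a Direct PUT source, Lambda processing, an Elasticsearch destination and S3 backup of failed records, might look like this (all ARNs and names are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="akamai-log-stream",   # referenced later by the Kinesis Agent
    DeliveryStreamType="DirectPut",           # records arrive straight from the agent
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "DomainARN": "arn:aws:es:eu-west-1:123456789012:domain/akamai-logs",
        "IndexName": "akamai",
        "TypeName": "logs",
        "IndexRotationPeriod": "OneDay",
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{
                    "ParameterName": "LambdaArn",
                    "ParameterValue": "arn:aws:lambda:eu-west-1:123456789012:function:akamai-log-transform",
                }],
            }],
        },
        # Only records marked as failed are archived to S3;
        # use "AllDocuments" to keep a full copy of everything.
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::akamai-log-archive",
        },
    },
)
```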

FTP and the Kinesis Agent

Akamai transfers gzipped logs via FTP to an intermediate server (the EC2 instance mentioned earlier). The server extracts the gzipped files, and the Kinesis Agent monitors the upload directory for log files. Any new files, or changes to existing files, trigger an upload to the delivery stream configured in “Setting up the Delivery Stream”.

The logs from Akamai are not directly manipulated before uploading; however, the first two lines of each log are comment lines and are excluded.
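The agent is driven by /etc/aws-kinesis/agent.json on the intermediate server. Below is a sketch of the relevant flow, written as a small Python snippet that emits the JSON. The paths and stream name are assumptions, and skipHeaderLines is the agent setting used here to drop those two comment lines (check the agent documentation for your version):

```python
import json

# Sketch of /etc/aws-kinesis/agent.json for the intermediate EC2 instance.
# File paths, region and stream name are assumptions; skipHeaderLines drops
# the two comment lines at the top of each extracted Akamai log.
agent_config = {
    "firehose.endpoint": "firehose.eu-west-1.amazonaws.com",
    "flows": [
        {
            "filePattern": "/var/log/akamai/*.log",
            "deliveryStream": "akamai-log-stream",
            "skipHeaderLines": 2,
        }
    ],
}

with open("/etc/aws-kinesis/agent.json", "w") as f:
    json.dump(agent_config, f, indent=2)
```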

Due to the size of the extracted logs, retention needs to be limited. Log rotation is used to remove logs that are older than 24 hours. The Kinesis Agent does not provide a native way to remove processed logs, so sufficient time must be allowed for logs to be processed before they are removed.
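logrotate is the obvious tool for this, but if you would rather keep everything in one place, a scheduled cleanup script does the same job. A minimal sketch, assuming the extracted logs live in /var/log/akamai and that the agent has long since shipped anything older than a day:

```python
import os
import time

LOG_DIR = "/var/log/akamai"      # assumed extraction directory
MAX_AGE_SECONDS = 24 * 60 * 60   # keep roughly one day of extracted logs

# Remove extracted logs older than 24 hours; run this from cron well after
# the Kinesis Agent has had time to ship their contents.
now = time.time()
for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
        os.remove(path)
```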

From “mostly” to “fully” serverless

Akamai’s Log Delivery Service supports both FTP and email delivery methods.

Amazon’s Simple Email Service (SES) can be configured to receive the logs in emails delivered by Akamai.

The Lambda function created earlier can be used as an action, which SES will run when an email is received:
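A receipt rule can be created with boto3 as well. Note that a Lambda action on its own only receives the message headers, so the sketch below pairs it with an S3 action that stores the raw email first; the rule set, bucket, recipient address and function ARN are all placeholders:

```python
import boto3

ses = boto3.client("ses")

ses.create_receipt_rule(
    RuleSetName="akamai-logs",               # an active receipt rule set
    Rule={
        "Name": "akamai-log-emails",
        "Enabled": True,
        "Recipients": ["logs@example.com"],  # the address Akamai delivers to
        "Actions": [
            # Store the full raw message (including attachments) in S3 ...
            {"S3Action": {"BucketName": "akamai-log-emails"}},
            # ... then invoke the processing Lambda asynchronously.
            {"LambdaAction": {
                "FunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:akamai-email-ingest",
                "InvocationType": "Event",
            }},
        ],
        "ScanEnabled": True,
    },
)
```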

The Lambda function will need to be modified to support extracting attachments and pushing the data to Firehose once it has been transformed. Python’s email and Boto libraries can be used to perform both actions.
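A sketch of that modification, written as a standalone handler for clarity and under the assumptions made in the receipt rule above: the raw email is stored in S3 under its message ID, the attachments are gzipped logs, and the bucket and stream names are placeholders. The lines are pushed raw so that the delivery stream’s transformation Lambda still handles the JSON conversion as before.

```python
import email
import gzip

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

BUCKET = "akamai-log-emails"   # assumed bucket written to by the SES S3 action
STREAM = "akamai-log-stream"   # the delivery stream from earlier

def handler(event, context):
    # The SES receipt rule stores the raw email in S3; the Lambda action
    # passes the message ID, which doubles as the S3 object key.
    message_id = event["Records"][0]["ses"]["mail"]["messageId"]
    raw = s3.get_object(Bucket=BUCKET, Key=message_id)["Body"].read()

    msg = email.message_from_bytes(raw)
    for part in msg.walk():
        if part.get_filename() and part.get_filename().endswith(".gz"):
            log_text = gzip.decompress(part.get_payload(decode=True)).decode("utf-8")
            records = [
                {"Data": (line + "\n").encode("utf-8")}
                for line in log_text.splitlines()
                if line and not line.startswith("#")   # drop Akamai comment lines
            ]
            # PutRecordBatch accepts at most 500 records per call.
            for i in range(0, len(records), 500):
                firehose.put_record_batch(
                    DeliveryStreamName=STREAM, Records=records[i:i + 500]
                )
```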

Wrap Up

Following the above steps will result in a ready-to-use solution that ingests, transforms and stores logs. Akamai’s Log Delivery Service is used in this case but the solution can support any type of data being sent to it. The processing in the Lambda function can be changed to support other formats, such as Apache logs or custom data, simply by changing the regex that’s used to parse the logs.

In the second part of the series, we’ll be looking at the same solution, but implemented in code with Terraform rather than through the AWS console.

Any comments or questions? Let me know below!
