How to Use Athena for Log Analysis with Kinesis Data Firehose and S3

Takahiro Iwasa
3 min read
Athena Firehose Kinesis

Introduction

AWS provides a robust way to query application logs stored in Amazon S3 using Athena. Powered by Presto, Athena offers a scalable and efficient solution for building log analysis environments.

This guide walks you through setting up Kinesis Data Firehose, Amazon S3, and Athena to build an efficient log analysis environment. It does not cover AWS Glue crawlers, which can automate the partitioning tasks described below.

System Architecture

Your application sends log records to a Kinesis Data Firehose delivery stream. Firehose buffers, compresses, and delivers them to an S3 bucket, and Athena queries the resulting objects in place.

Step 1: Create an S3 Bucket

Start by creating an S3 bucket to store log files. Choose a naming convention that aligns with your organization’s standards.
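If you manage infrastructure in code, the bucket can also be created with the AWS SDK for PHP. Here is a minimal sketch; the bucket name and region are placeholders to replace with your own values.

require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'region' => '<AWS_REGION>',
    'version' => 'latest',
]);

// Bucket names are globally unique; replace the placeholder.
$s3->createBucket([
    'Bucket' => '<YOUR_BUCKET>',
    // LocationConstraint is required outside us-east-1.
    'CreateBucketConfiguration' => ['LocationConstraint' => '<AWS_REGION>'],
]);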

Step 2: Configure Kinesis Data Firehose

Custom prefix support, introduced in February 2019, lets you specify Apache Hive-style prefixes for S3 object keys, so MSCK REPAIR TABLE can create the matching partitions in Athena. Follow these steps to configure the delivery stream:

2.1. Create a Delivery Stream

Choose Create delivery stream and give the stream a name.

2.2. Select the Source

Choose Direct PUT or other sources.

Skip the record processing settings.

2.3. Set Destination to S3

Select S3 as the destination.

Configure the prefix and error prefix with the following format:

Prefix: logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/

Error Prefix: error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}

With these prefixes, an object delivered at 2019-08-30 12:00 UTC lands under logs/year=2019/month=08/day=30/hour=12/, a Hive-style key layout that Athena can register as partitions.

2.4. Optimize Buffer Settings

Adjust Buffer size and Buffer interval based on your requirements.

2.5. Enable Compression

Use GZIP compression to reduce storage costs.

2.6. Set IAM Role

Create or select an IAM role for Firehose.
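If you prefer scripting over the console, the same delivery stream can be created with FirehoseClient::createDeliveryStream. Below is a minimal sketch mirroring the settings above; the stream name, bucket ARN, and role ARN are placeholders.

require 'vendor/autoload.php';

use Aws\Firehose\FirehoseClient;

$client = new FirehoseClient([
    'region' => '<AWS_REGION>',
    'version' => 'latest',
]);

// Direct PUT stream with Hive-style prefixes, GZIP compression,
// and buffering of 5 MB or 300 seconds, whichever comes first.
$client->createDeliveryStream([
    'DeliveryStreamName' => '<YOUR_STREAM>',
    'DeliveryStreamType' => 'DirectPut',
    'ExtendedS3DestinationConfiguration' => [
        'RoleARN' => 'arn:aws:iam::<ACCOUNT_ID>:role/<FIREHOSE_ROLE>',
        'BucketARN' => 'arn:aws:s3:::<YOUR_BUCKET>',
        'Prefix' => "logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/",
        'ErrorOutputPrefix' => "error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}",
        'BufferingHints' => [
            'SizeInMBs' => 5,
            'IntervalInSeconds' => 300,
        ],
        'CompressionFormat' => 'GZIP',
    ],
]);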

Step 3: Stream Data Using PHP (Optional)

You can stream log data to Kinesis Data Firehose programmatically. Here’s an example using the AWS SDK for PHP - FirehoseClient#putRecord:

require 'vendor/autoload.php';

use Aws\Firehose\FirehoseClient;

// Instantiate a Firehose client for your region.
$client = new FirehoseClient([
    'region' => '<AWS_REGION>',
    'version' => 'latest',
]);

// Example log entry; adjust the fields to your log structure.
$data = [
    'log_id' => 12345,
    'url' => 'https://example.com',
];

// Append a newline so each record lands as one JSON object
// per line, which Athena's JSON SerDe expects.
$client->putRecord([
    'DeliveryStreamName' => '<YOUR_STREAM>',
    'Record' => [
        'Data' => json_encode($data) . PHP_EOL,
    ],
]);
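
For higher throughput, you can send up to 500 records per call with FirehoseClient::putRecordBatch, for example:

$client->putRecordBatch([
    'DeliveryStreamName' => '<YOUR_STREAM>',
    'Records' => [
        ['Data' => json_encode(['log_id' => 12345, 'url' => 'https://example.com']) . PHP_EOL],
        ['Data' => json_encode(['log_id' => 12346, 'url' => 'https://example.org']) . PHP_EOL],
    ],
]);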

Step 4: Create an Athena Table

Select Create table from S3 bucket data in the Athena console.

Enter a database name, table name, and the S3 path used by Firehose.

Specify JSON as the file format.

Define columns based on your log structure.

Configure partitions (e.g., year/month/day/hour) to improve query performance.
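For reference, the DDL generated by the console looks roughly like the sketch below. The table and column names are assumptions based on the PHP example above; adjust them to your log structure.

CREATE EXTERNAL TABLE logs (
  log_id bigint,
  url string
)
PARTITIONED BY (
  year int,
  month int,
  day int,
  hour int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR_BUCKET>/logs/';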

Load partitions with the following command:

MSCK REPAIR TABLE {TABLE_NAME};
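
MSCK REPAIR TABLE scans the entire prefix and slows down as partitions accumulate. As an alternative, you can add partitions explicitly; a sketch using the assumed table name from above:

ALTER TABLE logs ADD IF NOT EXISTS
PARTITION (year = 2019, month = 8, day = 30, hour = 12)
LOCATION 's3://<YOUR_BUCKET>/logs/year=2019/month=08/day=30/hour=12/';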

Step 5: Querying the Athena Table

You can use SQL to query the log data efficiently. Example query:

SELECT
  *
FROM
  table_name
WHERE
  year = 2019
  AND month = 8
  AND day = 30
LIMIT 10;

Best Practices:

  • Always include partition keys (e.g., year, month, day) in the WHERE clause so Athena prunes partitions instead of scanning the entire bucket.
  • Select only the columns you need and use LIMIT while exploring; Athena charges by the amount of data scanned.

Conclusion

By combining Athena, Kinesis Data Firehose, and S3, you can create a scalable, cost-effective, and highly available log analysis environment. Proper partitioning and efficient querying techniques ensure you keep costs under control.

Happy Coding! 🚀

Takahiro Iwasa

Software Developer at KAKEHASHI Inc.
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Currently building a new prescription data collection platform at KAKEHASHI Inc. Recognized as a Japan AWS Top Engineer, 2020-2023.