How to Use Athena for Log Analysis with Kinesis Data Firehose and S3
Introduction
AWS provides a robust way to query application logs stored in Amazon S3 using Athena. Powered by Presto, Athena offers a scalable and efficient solution for building log analysis environments.
This guide walks you through setting up Kinesis Data Firehose, Amazon S3, and Athena to create an efficient log analysis environment. Note that this guide does not cover AWS Glue Crawler, which can simplify partition management.
Step 1: Create an S3 Bucket
Start by creating an S3 bucket to store log files. Choose a naming convention that aligns with your organization’s standards.
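The console works fine for this, but if you want to script it, a minimal sketch with the AWS SDK for PHP might look like this (the bucket name and region are placeholders):

```php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'region'  => '<AWS_REGION>',
    'version' => 'latest',
]);

// Bucket names must be globally unique.
$s3->createBucket([
    'Bucket' => '<YOUR_BUCKET>',
    // Required when creating a bucket outside us-east-1.
    'CreateBucketConfiguration' => ['LocationConstraint' => '<AWS_REGION>'],
]);
```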
Step 2: Configure Kinesis Data Firehose
Custom prefix support was introduced in February 2019, allowing you to specify Apache Hive-style prefixes for S3 object keys and then use `MSCK REPAIR TABLE` to load the resulting partitions in Athena. Follow these steps to configure the delivery stream:
2.1. Create a Delivery Stream
Press `Create delivery stream` and assign a name.
2.2. Select the Source
Choose `Direct PUT or other sources`. Skip the record processing settings.
2.3. Set Destination to S3
Select `S3` as the destination, then configure the prefix and error prefix with the following format:
| Field | Value |
|---|---|
| Prefix | `logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/` |
| Error Prefix | `error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}` |
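With this prefix, an object delivered at, say, 2019-08-30 12:00 UTC lands under a Hive-style key such as the following (the object name is generated by Firehose; the `.gz` suffix appears once GZIP is enabled in step 2.5):

```
logs/year=2019/month=08/day=30/hour=12/<object-name>.gz
```

Athena's `MSCK REPAIR TABLE` recognizes these `key=value` path segments as partitions.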
2.4. Optimize Buffer Settings
Adjust `Buffer size` and `Buffer interval` based on your requirements: a larger buffer produces fewer, larger objects (which Athena scans more efficiently), while a shorter interval delivers logs with lower latency.
2.5. Enable Compression
Use GZIP compression to reduce storage costs and the amount of data Athena scans per query.
2.6. Set IAM Role
Create or select an IAM role that grants Firehose write access to the destination bucket.
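If you prefer to script the delivery stream instead of clicking through the console, a minimal sketch covering steps 2.1–2.6 with the AWS SDK for PHP might look like this (the stream name, bucket ARN, and role ARN are placeholders):

```php
require 'vendor/autoload.php';

use Aws\Firehose\FirehoseClient;

$firehose = new FirehoseClient([
    'region'  => '<AWS_REGION>',
    'version' => 'latest',
]);

// Mirrors the console settings from steps 2.1-2.6.
$firehose->createDeliveryStream([
    'DeliveryStreamName' => '<YOUR_STREAM>',
    'DeliveryStreamType' => 'DirectPut',
    'ExtendedS3DestinationConfiguration' => [
        'BucketARN' => 'arn:aws:s3:::<YOUR_BUCKET>',
        'RoleARN'   => 'arn:aws:iam::<ACCOUNT_ID>:role/<FIREHOSE_ROLE>',
        'Prefix'    => "logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/",
        'ErrorOutputPrefix' => "error_logs/!{timestamp:'year='yyyy'/month='MM'/day='dd'/hour='HH}/!{firehose:error-output-type}",
        'BufferingHints' => [
            'SizeInMBs'         => 64,  // flush after 64 MB...
            'IntervalInSeconds' => 300, // ...or 5 minutes, whichever comes first
        ],
        'CompressionFormat' => 'GZIP',
    ],
]);
```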
Step 3: Stream Data Using PHP (Optional)
You can stream log data to Kinesis Data Firehose programmatically. Here’s an example using the AWS SDK for PHP’s `FirehoseClient#putRecord`:
```php
require 'vendor/autoload.php';

use Aws\Firehose\FirehoseClient;

$client = new FirehoseClient([
    'region'  => '<AWS_REGION>',
    'version' => 'latest',
]);

// A trailing newline keeps records line-delimited so Athena
// can parse one JSON object per line.
$data = [
    'log_id' => 12345,
    'url'    => 'https://example.com',
];

$client->putRecord([
    'DeliveryStreamName' => '<YOUR_STREAM>',
    'Record' => [
        'Data' => json_encode($data) . PHP_EOL,
    ],
]);
```
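For higher throughput, the SDK also offers `putRecordBatch`, which accepts up to 500 records per call.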
Step 4: Create an Athena Table
- Select `Create table from S3 bucket data` in the Athena console.
- Enter a database name, table name, and the S3 path used by Firehose.
- Specify `JSON` as the file format.
- Define columns based on your log structure.
- Configure partitions (e.g., `year`/`month`/`day`/`hour`) to improve query performance.
- Load partitions with the following command:

```sql
MSCK REPAIR TABLE {TABLE_NAME};
```
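Alternatively, you can define the table with DDL instead of the console wizard. A minimal sketch, assuming the `log_id` and `url` fields from the PHP example above and placeholder database, table, and bucket names:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS {DATABASE}.{TABLE_NAME} (
  log_id bigint,
  url    string
)
PARTITIONED BY (
  year  int,
  month int,
  day   int,
  hour  int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://{YOUR_BUCKET}/logs/';
```

Run `MSCK REPAIR TABLE` again whenever Firehose delivers data under new hourly prefixes, or register them individually with `ALTER TABLE ... ADD PARTITION`.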
Step 5: Querying the Athena Table
You can use SQL to query the log data efficiently. Example query:
```sql
SELECT
  *
FROM
  table_name
WHERE
  year = 2019
  AND month = 8
  AND day = 30
LIMIT 10;
```
Best Practices:
- Always include partition keys in the `WHERE` clause.
- Use the `LIMIT` clause to avoid unnecessary scans.
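You can also run queries programmatically. A minimal sketch using the AWS SDK for PHP’s `AthenaClient` (the database, table, and query-results bucket are placeholders):

```php
require 'vendor/autoload.php';

use Aws\Athena\AthenaClient;

$athena = new AthenaClient([
    'region'  => '<AWS_REGION>',
    'version' => 'latest',
]);

// Athena runs queries asynchronously and writes results
// to the S3 output location.
$result = $athena->startQueryExecution([
    'QueryString' => 'SELECT * FROM {TABLE_NAME} '
        . 'WHERE year = 2019 AND month = 8 AND day = 30 LIMIT 10',
    'QueryExecutionContext' => ['Database' => '{DATABASE}'],
    'ResultConfiguration' => [
        'OutputLocation' => 's3://{QUERY_RESULTS_BUCKET}/',
    ],
]);

// Poll getQueryExecution() with this ID until the state is
// SUCCEEDED, then fetch rows with getQueryResults().
$queryExecutionId = $result['QueryExecutionId'];
```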
Conclusion
By combining Athena, Kinesis Data Firehose, and S3, you can create a scalable, cost-effective, and highly available log analysis environment. Proper partitioning and efficient querying techniques ensure you keep costs under control.
Happy Coding! 🚀