Automating Partitioning in Athena Tables with Partition Projection
Introduction
Amazon Athena Partition Projection, introduced in June 2020, eliminates the need to run MSCK REPAIR TABLE
to add new partitions manually. This blog post explores how you can set up and use Partition Projection effectively.
What is Partition Projection
Partition Projection allows Athena to calculate partition values and locations from a configuration rather than retrieving them from repositories like AWS Glue Data Catalog. This reduces query runtime for highly partitioned tables.
Key highlights from the AWS documentation:
- Efficiency: “Partition projection can reduce the runtime of queries against highly partitioned tables.”
- Automation: “Partition values and locations are calculated from configuration instead of being retrieved manually.”
- Reduced Overhead: “Athena avoids calling
GetPartitions
, leveraging the partition projection configuration instead.”
For more details, see the official AWS documentation.
Setting Up AWS Resources
Below is an example CloudFormation template for implementing Partition Projection.
AWSTemplateFormatVersion: "2010-09-09"
Description: Stack for Athena partition projection sample
Resources:
S3:
Type: AWS::S3::Bucket
Properties:
BucketName: athena-partition-projection-logs
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: AES256
GlueDatabase:
Type: AWS::Glue::Database
Properties:
DatabaseInput:
Name: sample
CatalogId: !Ref AWS::AccountId
GlueTable:
Type: AWS::Glue::Table
Properties:
DatabaseName: !Ref GlueDatabase
CatalogId: !Ref AWS::AccountId
TableInput:
TableType: EXTERNAL_TABLE
Parameters:
classification: json
"projection.enabled": true
"projection.year_month.format": yyyy/MM
"projection.year_month.interval": 1
"projection.year_month.interval.unit": MONTHS
"projection.year_month.range": 2021/09,NOW
"projection.year_month.type": date
"storage.location.template": s3://athena-partition-projection-logs/${year_month}
StorageDescriptor:
Columns:
- Name: id
Type: int
- Name: message
Type: string
Location: !Sub s3://${S3}/
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: false
NumberOfBuckets: 0
SerdeInfo:
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
Parameters:
paths: id,message
StoredAsSubDirectories: false
PartitionKeys:
- Name: year_month
Type: string
Retention: 0
Name: sample_logs
Key Point: GlueTable.TableInput.Parameters
The key point in the configuration is the GlueTable.TableInput.Parameters
section of the GlueTable
definition (lines 27–35). This section enables Partition Projection with the following important properties:
projection.enabled
: Enables partition projection.projection.year_month.format
: Specifies the date format for partitions.projection.year_month.range
: Defines the range for partitions, supportingNOW
with offsets (e.g.,NOW+9HOURS
).
projection.year_month.range
property is treated as UTC. If you intend to use a different timezone in your S3 object path, you need to specify the time difference using a format like NOW+9HOURS
. For more details, please refer to the official documentation.
Deployment
Deploy the above CloudFormation stack with the following command:
aws cloudformation deploy --template-file stack.yml --stack-name athena-partition-projection-sample
Testing the Setup
Upload data to the S3 bucket for testing:
echo '{"id": 1, "message": "hello"}' > 2021-09.json
echo '{"id": 2, "message": "world"}' > 2021-10.json
aws s3 cp 2021-09.json s3://athena-partition-projection-logs/2021/09/
aws s3 cp 2021-10.json s3://athena-partition-projection-logs/2021/10/
Verify that the files are uploaded:
aws s3 ls s3://athena-partition-projection-logs/2021/
Querying Data
Run the following SQL query to fetch data from the 2021/09
partition:
SELECT * FROM "sample"."sample_logs"
WHERE year_month = '2021/09'
LIMIT 10;
Expected result:
1 hello 2021/09
Similarly, query the 2021/10
partition:
SELECT * FROM "sample"."sample_logs"
WHERE year_month = '2021/10'
LIMIT 10;
Expected result:
2 world 2021/10
Cleaning Up
Clean up resources to avoid unnecessary charges:
aws s3 rm --recursive s3://athena-partition-projection-logs
aws cloudformation delete-stack --stack-name athena-partition-projection-sample
Conclusion
Partition Projection simplifies the management of Athena partitions, saving time and reducing complexity. It allows developers to focus on application logic without worrying about partition maintenance. By following the steps above, you can efficiently set up Partition Projection for your projects.
Happy Coding! 🚀