How to Set Up Glue Crawler for Event-Driven Crawling with SQS
Introduction
Glue Crawlers can crawl S3 buckets based on S3 event notifications delivered through SQS. Instead of scanning the entire bucket on every run, the crawler only visits the paths referenced in the queued events, which improves both crawling performance and cost.
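To make the mechanism concrete: whenever an object is created, S3 publishes an event notification message to the SQS queue, and the crawler consumes those messages to decide what to crawl. As a sketch (assuming the queue name used later in this walkthrough, and the documented S3 event record structure), you can peek at a pending message with the AWS CLI without removing it from the queue:

```shell
# Peek at a pending S3 event notification without consuming it
# (visibility timeout of 0 returns the message to the queue immediately).
queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
aws sqs receive-message --queue-url "$queue_url" --visibility-timeout 0 \
  | jq -r '.Messages[0].Body | fromjson
           | .Records[0]
           | {eventName, bucket: .s3.bucket.name, key: .s3.object.key}'
```

This requires an AWS account with the resources below already deployed, so treat it as an illustration of the message flow rather than a step in the setup.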
Creating AWS Resources
Create a CloudFormation template with the following content:
AWSTemplateFormatVersion: '2010-09-09'
Description: Glue crawler test
Resources:
  SqsQueue:
    Type: AWS::SQS::Queue
    Properties:
      SqsManagedSseEnabled: true
      QueueName: glue-crawler-test-queue
  SqsQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref SqsQueue
      PolicyDocument:
        Version: '2008-10-17'
        Id: __default_policy_ID
        Statement:
          - Effect: Allow
            Principal:
              AWS:
                - !Sub arn:aws:iam::${AWS::AccountId}:root
                - !GetAtt IAMRoleGlueCrawler.Arn
            Action: sqs:*
            Resource: !GetAtt SqsQueue.Arn
          - Effect: Allow
            Principal:
              Service: s3.amazonaws.com
            Action: sqs:*
            Resource: !GetAtt SqsQueue.Arn
  S3Bucket:
    Type: AWS::S3::Bucket
    DependsOn: SqsQueuePolicy
    Properties:
      BucketName: !Sub glue-crawler-test-${AWS::AccountId}-${AWS::Region}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      NotificationConfiguration:
        QueueConfigurations:
          - Event: 's3:ObjectCreated:*'
            Queue: !GetAtt SqsQueue.Arn
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
  IAMRoleGlueCrawler:
    Type: AWS::IAM::Role
    Properties:
      Path: /service-role/
      RoleName: glue-crawler-test-service-role
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: glue.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: cw-logs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
        - PolicyName: glue
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - glue:CreateTable
                  - glue:GetDatabase
                  - glue:GetTable
                  - glue:UpdateTable
                Resource: '*'
        - PolicyName: s3
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:ListBucket
                  - s3:PutObject
                Resource: '*'
  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: glue-crawler-test-db
  GlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref GlueDatabase
      TableInput:
        Name: glue-crawler-test-table
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: json
        StorageDescriptor:
          Location: !Sub 's3://${S3Bucket}/'
          Compressed: false
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          SerdeInfo:
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: glue-crawler-test
      Role: !Sub service-role/${IAMRoleGlueCrawler}
      Targets:
        CatalogTargets:
          - DatabaseName: !Ref GlueDatabase
            Tables:
              - !Ref GlueTable
      SchemaChangePolicy:
        UpdateBehavior: UPDATE_IN_DATABASE
        DeleteBehavior: LOG
Replace <YOUR_CFN_BUCKET> with the actual bucket name and deploy the CloudFormation stack with the following commands:
STACK_NAME=glue-crawler-test
aws cloudformation package \
--template-file template.yaml \
--s3-bucket <YOUR_CFN_BUCKET> \
--s3-prefix "$STACK_NAME/$(date +%Y)/$(date +%m)/$(date +%d)/$(date +%H)/$(date +%M)" \
--output-template-file package.template
aws cloudformation deploy \
--stack-name $STACK_NAME \
--template-file package.template \
--capabilities CAPABILITY_NAMED_IAM
Updating Glue Crawler Target Table Setting
CloudFormation does not currently support configuring S3 event notifications for Glue Crawlers, so the target table settings must be updated manually in the console. As AWS has stated:

"CloudFormation Support For S3 Event isn't currently available. S3 Event Crawler's integration with CloudFormation is in scope and in the works. We plan on releasing this coverage to CloudFormation later this year. Thank you for your patience."
1. Select the table and click Edit.
2. Choose Crawl based on events in the Subsequent crawler runs section.
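As an alternative to clicking through the console, the same setting can be applied with the AWS CLI: the UpdateCrawler API accepts a RecrawlPolicy of CRAWL_EVENT_MODE, and a catalog target can carry an EventQueueArn. A sketch, with <REGION> and <ACCOUNT_ID> left for you to substitute:

```shell
# Switch the crawler to event mode and point its catalog target at the queue.
# Equivalent to the console steps above; substitute <REGION> and <ACCOUNT_ID>.
aws glue update-crawler \
  --name glue-crawler-test \
  --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
  --targets '{
    "CatalogTargets": [{
      "DatabaseName": "glue-crawler-test-db",
      "Tables": ["glue-crawler-test-table"],
      "EventQueueArn": "arn:aws:sqs:<REGION>:<ACCOUNT_ID>:glue-crawler-test-queue"
    }]
  }'
```

Note that UpdateCrawler replaces the Targets block wholesale, so the full catalog target must be repeated, not just the new queue ARN.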
Testing
Updating Glue Table Schema by JSON Version 1
Upload a sample JSON file to trigger an S3 event notification:
echo '{"message": "Hello World"}' > sample1.json
aws s3 cp sample1.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/
Start the Glue crawler:
aws glue start-crawler --name glue-crawler-test
Monitor its status until it shows STOPPING:
$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING
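Instead of re-running the command by hand, a small polling loop (a sketch; adjust the interval to taste) can wait until the crawler returns to READY, which indicates the run has fully finished:

```shell
# Poll the crawler state every 10 seconds until the run completes.
# States cycle RUNNING -> STOPPING -> READY.
while true; do
  state=$(aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State')
  echo "crawler state: $state"
  [ "$state" = "READY" ] && break
  sleep 10
done
```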
Verify the updated Glue table schema:
$ aws glue get-table \
--database-name glue-crawler-test-db \
--name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'
[
{
"Name": "message",
"Type": "string"
}
]
Updating Glue Table Schema by JSON Version 2
Repeat the process with a new JSON file version:
echo '{"message": "Hello World", "statusCode": 200}' > sample2.json
aws s3 cp sample2.json s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/
Start the Glue crawler:
aws glue start-crawler --name glue-crawler-test
Monitor its status until it shows STOPPING:
$ aws glue get-crawler --name glue-crawler-test | jq -r '.Crawler.State'
STOPPING
Verify the updated Glue table schema:
aws glue get-table \
--database-name glue-crawler-test-db \
--name glue-crawler-test-table \
| jq '.Table.StorageDescriptor.Columns'
[
{
"Name": "message",
"Type": "string"
},
{
"Name": "statuscode",
"Type": "int"
}
]
Checking SQS Message Count
Verify that no messages are left in the SQS queue:
$ queue_url=$(aws sqs get-queue-url --queue-name glue-crawler-test-queue | jq -r '.QueueUrl')
$ aws sqs get-queue-attributes \
--queue-url $queue_url \
--attribute-names ApproximateNumberOfMessages
{
"Attributes": {
"ApproximateNumberOfMessages": "0"
}
}
Cleaning Up
Remove provisioned AWS resources:
aws s3 rm s3://glue-crawler-test-<ACCOUNT_ID>-<REGION>/ --recursive
aws cloudformation delete-stack --stack-name $STACK_NAME
Conclusion
By configuring Glue Crawlers to use S3 event notifications with SQS, you can avoid full scans, resulting in reduced costs and enhanced performance. This approach is a highly efficient solution for event-driven data pipelines.
Happy Coding! 🚀