Implementing Full-Text Search for a Movies Dataset with AWS CloudSearch
Introduction
AWS CloudSearch, built on Apache Solr, offers robust capabilities for full-text search. In this post, we will explore how to implement full-text search for a movies dataset using AWS CloudSearch.
Setting Up a CloudSearch Domain
Creating a CloudSearch Domain
To begin, create a CloudSearch domain with the following command:
aws cloudsearch create-domain --domain-name searching-movies-data
The command generates a response similar to the following:
{
"DomainStatus": {
"DomainId": "123456789012/searching-movies-data",
"DomainName": "searching-movies-data",
"ARN": "arn:aws:cloudsearch:ap-northeast-1:123456789012:domain/searching-movies-data",
"Created": true,
"Deleted": false,
"DocService": {},
"SearchService": {},
"RequiresIndexDocuments": false,
"Processing": false,
"SearchPartitionCount": 0,
"SearchInstanceCount": 0
}
}
According to the official documentation, it takes about ten minutes to create the endpoints for a new domain.
You can verify the domain creation status by running:
aws cloudsearch describe-domains --domain-name searching-movies-data
When the command output shows Processing: false and the DocService and SearchService endpoints are populated, the domain and its endpoints are fully created and ready to use.
{
"DomainStatusList": [
{
"DomainId": "123456789012/searching-movies-data",
"DomainName": "searching-movies-data",
"ARN": "arn:aws:cloudsearch:ap-northeast-1:123456789012:domain/searching-movies-data",
"Created": true,
"Deleted": false,
"DocService": {
"Endpoint": "doc-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
},
"SearchService": {
"Endpoint": "search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
},
"RequiresIndexDocuments": false,
"Processing": false,
"SearchInstanceType": "search.small",
"SearchPartitionCount": 1,
"SearchInstanceCount": 1,
"Limits": {
"MaximumReplicationCount": 5,
"MaximumPartitionCount": 10
}
}
]
}
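The readiness check can also be scripted. The following is a minimal sketch, assuming the JSON printed by describe-domains has already been parsed into a dict, and assuming a domain counts as ready once Processing is false and both endpoints are present. The helper name is_domain_ready is hypothetical and is not part of the AWS CLI or any SDK.

```python
import json


def is_domain_ready(describe_output: dict, domain_name: str) -> bool:
    """Return True when the named domain has finished processing and has both endpoints."""
    for status in describe_output.get("DomainStatusList", []):
        if status.get("DomainName") != domain_name:
            continue
        has_doc = bool(status.get("DocService", {}).get("Endpoint"))
        has_search = bool(status.get("SearchService", {}).get("Endpoint"))
        # A missing "Processing" key is treated as "still processing" to stay on the safe side.
        return not status.get("Processing", True) and has_doc and has_search
    return False  # the domain is not listed at all


# Trimmed copy of the describe-domains output shown above.
sample = json.loads("""
{
  "DomainStatusList": [
    {
      "DomainName": "searching-movies-data",
      "DocService": {"Endpoint": "doc-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"},
      "SearchService": {"Endpoint": "search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"},
      "Processing": false
    }
  ]
}
""")

print(is_domain_ready(sample, "searching-movies-data"))  # True
```

A polling loop could call this helper on fresh describe-domains output every minute or so until it returns True.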
Updating Access Policies
To enhance security, update the domain’s access policy to allow access only from your IP address:
aws cloudsearch update-service-access-policies \
--domain-name searching-movies-data \
--access-policies '
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": "*",
"Action": ["cloudsearch:*"],
"Condition": {"IpAddress": {"aws:SourceIp": "xxx.xxx.xxx.xxx/32"}}
}
]
}'
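Hand-editing the policy JSON inside a shell string is easy to get wrong. As a rough sketch, the same policy document could be generated from an IP address and then passed to --access-policies. The function name build_access_policy and the example address 203.0.113.10 (a documentation-only address) are assumptions made for illustration.

```python
import ipaddress
import json


def build_access_policy(source_ip: str) -> str:
    """Build a policy document that allows CloudSearch actions from a single IPv4 address."""
    # IPv4Address raises a ValueError subclass when the input is not a valid address.
    address = ipaddress.IPv4Address(source_ip)
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["cloudsearch:*"],
                "Condition": {"IpAddress": {"aws:SourceIp": f"{address}/32"}},
            }
        ],
    }
    return json.dumps(policy)


# 203.0.113.10 is a placeholder; substitute your own public IP address.
print(build_access_policy("203.0.113.10"))
```

Validating the address up front means a typo fails locally instead of producing a policy that silently locks you out.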
Configuring Index Fields
Define the index fields based on the dataset’s structure. This guide uses The Movies Dataset from Kaggle (CC0: Public Domain). The commands below define one index field per column of the dataset:
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name adult --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name belongs_to_collection --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name budget --type double
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name genres --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name homepage --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name id --type int
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name imdb_id --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name original_language --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name original_title --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name overview --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name popularity --type double
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name poster_path --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name production_companies --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name production_countries --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name release_date --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name revenue --type int
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name runtime --type double
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name spoken_languages --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name status --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name tagline --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name title --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name video --type text
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name vote_average --type double
aws cloudsearch define-index-field \
--domain-name searching-movies-data --name vote_count --type int
Once the fields are defined, tell the search domain to start indexing its documents:
aws cloudsearch index-documents --domain-name searching-movies-data
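The define-index-field calls above differ only in field name and type, so they could be generated from a small mapping instead of being typed out. This is one possible sketch: the names FIELDS_BY_TYPE and define_field_commands are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex

DOMAIN = "searching-movies-data"

# Field names and types copied from the commands above, grouped by type.
FIELDS_BY_TYPE = {
    "text": [
        "adult", "belongs_to_collection", "genres", "homepage", "imdb_id",
        "original_language", "original_title", "overview", "poster_path",
        "production_companies", "production_countries", "release_date",
        "spoken_languages", "status", "tagline", "title", "video",
    ],
    "double": ["budget", "popularity", "runtime", "vote_average"],
    "int": ["id", "revenue", "vote_count"],
}


def define_field_commands(domain: str, fields_by_type: dict) -> list:
    """Return one shell-quoted define-index-field command per field."""
    commands = []
    for field_type, names in fields_by_type.items():
        for name in names:
            args = [
                "aws", "cloudsearch", "define-index-field",
                "--domain-name", domain, "--name", name, "--type", field_type,
            ]
            commands.append(shlex.join(args))
    return commands


for command in define_field_commands(DOMAIN, FIELDS_BY_TYPE):
    print(command)
```

The printed lines can be reviewed and then piped to a shell, which keeps the field list in one place if the schema changes.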
Indexing the Dataset
Download the dataset from Kaggle and prepare a sample file from the first 1,000 lines:
head -1000 movies_metadata.csv > sample.csv
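Note that head counts physical lines, not CSV records. If the file contains quoted fields with embedded line breaks (an assumption about the dataset, not something verified here), the sample may hold fewer records than lines and could end mid-record. A record-aware alternative might look like the sketch below, where take_records is a hypothetical helper.

```python
import csv
import io


def take_records(source, limit: int):
    """Return the header row plus up to `limit` data records from a CSV text stream."""
    reader = csv.reader(source)
    header = next(reader)
    rows = []
    for row in reader:
        if len(rows) == limit:
            break
        rows.append(row)
    return header, rows


# Tiny in-memory example: record 1 spans two physical lines.
demo = io.StringIO('id,overview\n1,"line one\nline two"\n2,short\n3,extra\n')
header, rows = take_records(demo, 2)
print(header)     # ['id', 'overview']
print(len(rows))  # 2
```

For a real file, open it with newline="" and an explicit encoding so that the csv module handles the embedded line breaks itself.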
CloudSearch can index data directly from CSV files through the console interface, though the AWS CLI’s aws cloudsearchdomain upload-documents command only accepts JSON or XML formats.
In the CloudSearch console, upload the CSV file via the “Upload documents” option. Once uploaded, verify the document count to ensure successful indexing.
Navigate to Actions > Upload documents.
Choose your CSV file and click Next.
Review the detected fields and click Upload documents.
Once the process completes, you can confirm that 998 records (excluding headers) are successfully indexed.
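For readers who would rather stay on the command line, the CSV can be converted into a JSON document batch first. The sketch below assumes the CloudSearch batch format of "add" operations, each with an id and a fields object. The helper name csv_to_batch, the movie_ ID prefix, and the NUMERIC_FIELDS mapping are illustrative choices, with the numeric types mirroring the index fields defined earlier.

```python
import csv
import io
import json

# Converters for the fields that were indexed as double or int above.
NUMERIC_FIELDS = {
    "budget": float, "popularity": float, "runtime": float, "vote_average": float,
    "id": int, "revenue": int, "vote_count": int,
}


def csv_to_batch(source, id_column: str = "id") -> list:
    """Convert CSV records into a list of CloudSearch 'add' operations."""
    batch = []
    for row in csv.DictReader(source):
        fields = {}
        for name, value in row.items():
            # Skip overflow columns (name is None) and empty or missing values.
            if name is None or not value:
                continue
            convert = NUMERIC_FIELDS.get(name)
            if convert is not None:
                try:
                    value = convert(value)
                except ValueError:
                    continue  # leave out values that do not parse as numbers
            fields[name] = value
        # "movie_" is a hypothetical prefix; any unique, valid document ID works.
        batch.append({"type": "add", "id": f"movie_{row[id_column]}", "fields": fields})
    return batch


demo = io.StringIO("id,title,budget,homepage\n862,Toy Story,30000000,\n")
batch = csv_to_batch(demo)
print(json.dumps(batch, indent=2))
```

Written to a file such as batch.json, the result could then be sent with aws cloudsearchdomain upload-documents, passing the domain's document endpoint as --endpoint-url, application/json as --content-type, and the file as --documents. Large datasets would need to be split into several batches to stay under the service's batch size limit.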
Running Search Queries
Searching for Text
To search for movies with the keyword house in the title and overview fields:
curl --location -g --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q=house&q.options={fields:["title","overview"]}&return=title,overview' | jq .
For more information, refer to the official documentation.
The response will include matching results, such as:
{
"status": {
"rid": "8fDJv8swsgEK1DyD",
"time-ms": 1
},
"hits": {
"found": 26,
"start": 0,
"hit": [
{
"id": "local_file_466",
"fields": {
"overview": "Hip Hop duo Kid & Play return...",
"title": "House Party 3"
}
},
...
]
}
}
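The curl command above relies on -g so that the braces and brackets in q.options pass through unescaped. An alternative is to percent-encode the parameters, as in this sketch. It assumes the service accepts strict JSON (with a quoted "fields" key) for q.options; build_search_url is a hypothetical helper, and the endpoint is the placeholder used throughout this post.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_ENDPOINT = "search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"


def build_search_url(endpoint: str, query: str, search_fields: list, return_fields: list) -> str:
    """Build a percent-encoded search URL for the 2013-01-01 search API."""
    params = {
        "q": query,
        # Compact JSON keeps the encoded URL short.
        "q.options": json.dumps({"fields": search_fields}, separators=(",", ":")),
        "return": ",".join(return_fields),
    }
    return f"https://{endpoint}/2013-01-01/search?{urlencode(params)}"


url = build_search_url(SEARCH_ENDPOINT, "house", ["title", "overview"], ["title", "overview"])
print(url)
# Decoding the query string recovers the original parameter values.
print(parse_qs(urlsplit(url).query))
```

Because every special character is escaped, the resulting URL can be passed to curl without -g, or to any HTTP client.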
Searching for Numbers
For numeric searches, such as movies with a vote_average of 5.0:
curl --location --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:5.0&return=title,overview' | jq .
For more information, refer to the official documentation.
The response will include matching results, such as:
{
"status": {
"rid": "w+Xgv8swwQEK1DyD",
"time-ms": 0
},
"hits": {
"found": 35,
"start": 0,
"hit": [
{
"id": "local_file_144",
"fields": {
"overview": "Far from home...",
"title": "The Amazing Panda Adventure"
}
},
...
]
}
}
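The responses shown so far share one shape: a hits object with a found count and a hit list whose entries carry an id and the requested fields. As an alternative to reading the jq output by eye, a small helper could reduce a parsed response to its essentials. The name summarize_hits is hypothetical, and the sample is the response above with the elided entries left out.

```python
import json


def summarize_hits(response: dict):
    """Return the total match count and (id, title) pairs for the returned hits."""
    hits = response.get("hits", {})
    titles = [
        (hit["id"], hit.get("fields", {}).get("title"))
        for hit in hits.get("hit", [])
    ]
    return hits.get("found", 0), titles


sample_response = json.loads("""
{
  "status": {"rid": "w+Xgv8swwQEK1DyD", "time-ms": 0},
  "hits": {
    "found": 35,
    "start": 0,
    "hit": [
      {
        "id": "local_file_144",
        "fields": {"overview": "Far from home...", "title": "The Amazing Panda Adventure"}
      }
    ]
  }
}
""")

found, titles = summarize_hits(sample_response)
print(found)   # 35
print(titles)  # [('local_file_144', 'The Amazing Panda Adventure')]
```

Note that found is the total number of matches, while the hit list holds only the current page of results.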
Searching for Ranges
To find movies with a vote_average of 7.0 or higher (the square bracket makes the lower bound inclusive):
curl --location -g --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:[7.0,}&return=title,overview' | jq .
For more information, refer to the official documentation.
The response will include matching results, such as:
{
"status": {
"rid": "vJ/vv8swyAEK1DyD",
"time-ms": 2
},
"hits": {
"found": 254,
"start": 0,
"hit": [
{
"id": "local_file_1",
"fields": {
"overview": "Led by Woody, Andy's toys...",
"title": "Toy Story"
}
},
...
]
}
}
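In the structured query syntax, a square bracket marks an inclusive bound and a curly brace marks an exclusive or open-ended one. A small helper can make that choice explicit. This is only a sketch, and range_expression is a hypothetical name, not part of any AWS tooling.

```python
def range_expression(field, lower=None, upper=None, include_lower=True, include_upper=True) -> str:
    """Build a structured-query range such as vote_average:[7.0,}."""
    if lower is None and upper is None:
        raise ValueError("at least one bound is required")
    # An omitted bound is always written with a curly brace.
    open_bracket = "[" if lower is not None and include_lower else "{"
    close_bracket = "]" if upper is not None and include_upper else "}"
    low = "" if lower is None else str(lower)
    high = "" if upper is None else str(upper)
    return f"{field}:{open_bracket}{low},{high}{close_bracket}"


print(range_expression("vote_average", lower=7.0))                       # vote_average:[7.0,}
print(range_expression("vote_average", lower=7.0, include_lower=False))  # vote_average:{7.0,}
print(range_expression("vote_average", 5.0, 7.0, include_upper=False))   # vote_average:[5.0,7.0}
```

The second call shows the expression for strictly greater than 7.0, which differs from the query above only in its opening brace.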
Cleaning Up
Once you are done, delete the CloudSearch domain to avoid unnecessary charges:
aws cloudsearch delete-domain --domain-name searching-movies-data
Conclusion
AWS CloudSearch is a powerful service for implementing simple full-text search functionality. By following this guide, you can easily set up and manage a searchable domain for datasets like a movies catalog. For additional details, consult the AWS CloudSearch Developer Guide.
Happy Coding! 🚀