Implementing Full-Text Search for a Movies Dataset with AWS CloudSearch

Takahiro Iwasa

Nov 28, 2022

Introduction

AWS CloudSearch, built on Apache Solr, offers robust capabilities for full-text search. In this post, we will explore how to implement full-text search for a movies dataset using AWS CloudSearch.

Setting Up a CloudSearch Domain

Creating a CloudSearch Domain

To begin, create a CloudSearch domain with the following command:

aws cloudsearch create-domain --domain-name searching-movies-data

The command generates a response similar to the following:

{
  "DomainStatus": {
    "DomainId": "123456789012/searching-movies-data",
    "DomainName": "searching-movies-data",
    "ARN": "arn:aws:cloudsearch:ap-northeast-1:123456789012:domain/searching-movies-data",
    "Created": true,
    "Deleted": false,
    "DocService": {},
    "SearchService": {},
    "RequiresIndexDocuments": false,
    "Processing": false,
    "SearchPartitionCount": 0,
    "SearchInstanceCount": 0
  }
}

According to the official documentation, it takes about ten minutes to create the endpoints for a new domain.

You can verify the domain creation status by running:

aws cloudsearch describe-domains --domain-names searching-movies-data

The domain is ready to use when the output shows "Processing": false and both the DocService and SearchService endpoints are populated:

{
  "DomainStatusList": [
    {
      "DomainId": "123456789012/searching-movies-data",
      "DomainName": "searching-movies-data",
      "ARN": "arn:aws:cloudsearch:ap-northeast-1:123456789012:domain/searching-movies-data",
      "Created": true,
      "Deleted": false,
      "DocService": {
        "Endpoint": "doc-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
      },
      "SearchService": {
        "Endpoint": "search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com"
      },
      "RequiresIndexDocuments": false,
      "Processing": false,
      "SearchInstanceType": "search.small",
      "SearchPartitionCount": 1,
      "SearchInstanceCount": 1,
      "Limits": {
        "MaximumReplicationCount": 5,
        "MaximumPartitionCount": 10
      }
    }
  ]
}
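The status check above can be automated with a small polling loop. The following is a minimal sketch, assuming the AWS CLI prints booleans as True/False and missing values as None in --output text mode. The function names domain_is_ready and wait_for_domain and the 30-second default interval are illustrative choices, not part of CloudSearch.

```shell
# Succeeds when the reported values indicate a usable domain.
#   $1: Processing flag as printed by --output text (assumed "True"/"False")
#   $2: search endpoint as printed by --output text (assumed "None" when absent)
domain_is_ready() {
  case "$1" in
    False|false) ;;
    *) return 1 ;;
  esac
  [ -n "$2" ] && [ "$2" != "None" ]
}

# Polls describe-domains until the domain is ready, then prints the search endpoint.
#   $1: domain name
#   $2: seconds between checks (hypothetical default: 30)
wait_for_domain() {
  while :; do
    processing=$(aws cloudsearch describe-domains --domain-names "$1" \
      --query 'DomainStatusList[0].Processing' --output text) || return 1
    endpoint=$(aws cloudsearch describe-domains --domain-names "$1" \
      --query 'DomainStatusList[0].SearchService.Endpoint' --output text) || return 1
    if domain_is_ready "$processing" "$endpoint"; then
      printf '%s\n' "$endpoint"
      return 0
    fi
    sleep "${2:-30}"
  done
}

# Usage (requires AWS credentials):
#   wait_for_domain searching-movies-data
```

Checking the endpoint as well as the Processing flag matters here because the create-domain response shown earlier already reports "Processing": false while the endpoints are still empty.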

Updating Access Policies

To enhance security, update the domain’s access policy to allow access only from your IP address:

aws cloudsearch update-service-access-policies \
  --domain-name searching-movies-data \
  --access-policies '
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["cloudsearch:*"],
        "Condition": {"IpAddress": {"aws:SourceIp": "xxx.xxx.xxx.xxx/32"}}
      }
    ]
  }'
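To avoid editing the policy by hand, the JSON can be generated from a CIDR block. This is a sketch only: build_access_policy is a hypothetical helper name, the policy body mirrors the one above, and the usage comment assumes that https://checkip.amazonaws.com returns your public IPv4 address as plain text.

```shell
# Prints an access policy that allows all CloudSearch actions from one CIDR block.
#   $1: CIDR block, for example 203.0.113.10/32 (documentation address)
build_access_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["cloudsearch:*"],
      "Condition": {"IpAddress": {"aws:SourceIp": "$1"}}
    }
  ]
}
EOF
}

# Usage (requires AWS credentials and network access):
#   my_ip=$(curl -s https://checkip.amazonaws.com)
#   aws cloudsearch update-service-access-policies \
#     --domain-name searching-movies-data \
#     --access-policies "$(build_access_policy "$my_ip/32")"
```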

Configuring Index Fields

Define the index fields based on the dataset’s structure. This guide uses The Movies Dataset from Kaggle (CC0: Public Domain). The following commands define one index field for each column of the dataset:

aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name adult --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name belongs_to_collection --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name budget --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name genres --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name homepage --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name id --type int
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name imdb_id --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name original_language --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name original_title --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name overview --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name popularity --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name poster_path --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name production_companies --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name production_countries --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name release_date --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name revenue --type int
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name runtime --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name spoken_languages --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name status --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name tagline --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name title --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name video --type text
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name vote_average --type double
aws cloudsearch define-index-field \
  --domain-name searching-movies-data --name vote_count --type int
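The same field definitions can be written more compactly as a list of name:type pairs expanded by a loop. This is a sketch that mirrors the commands above. FIELDS and print_define_commands are illustrative names, and the function only prints the commands so they can be reviewed before being run.

```shell
# Field name and CloudSearch type pairs, copied from the commands above.
FIELDS="
adult:text belongs_to_collection:text budget:double genres:text
homepage:text id:int imdb_id:text original_language:text
original_title:text overview:text popularity:double poster_path:text
production_companies:text production_countries:text release_date:text
revenue:int runtime:double spoken_languages:text status:text
tagline:text title:text video:text vote_average:double vote_count:int
"

# Prints one define-index-field command per pair.
#   $1: domain name
print_define_commands() {
  for spec in $FIELDS; do
    name=${spec%%:*}   # text before the colon
    type=${spec##*:}   # text after the colon
    printf 'aws cloudsearch define-index-field --domain-name %s --name %s --type %s\n' \
      "$1" "$name" "$type"
  done
}

# Usage: review the output first, then pipe it to sh to execute it.
#   print_define_commands searching-movies-data
#   print_define_commands searching-movies-data | sh
```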

Once done, tell the search domain to start indexing its documents:

aws cloudsearch index-documents --domain-name searching-movies-data

Indexing the Dataset

Download the dataset from Kaggle and prepare a sample file from the first 1,000 lines, including the header line:

head -n 1000 movies_metadata.csv > sample.csv

The CloudSearch console can index CSV files directly, whereas the AWS CLI’s aws cloudsearchdomain upload-documents command accepts only JSON or XML batches.

In the CloudSearch console, upload the CSV file via the “Upload documents” option. Once uploaded, verify the document count to ensure successful indexing.

Navigate to Actions > Upload documents.

Choose your CSV file and click Next.

Review the detected fields and click Upload documents.

Once the process completes, you can confirm that 998 records (excluding headers) are successfully indexed.
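As an alternative to the console upload, documents can be sent from the CLI as a JSON batch, where each entry has a type, an id, and a fields object. The sketch below builds one such entry. json_escape and add_operation are hypothetical helpers, the escaping covers only backslashes and double quotes (not newlines or control characters), and only the title and overview fields are included for brevity.

```shell
# Escapes backslashes and double quotes so a value can sit inside a JSON string.
# Limitation: newlines and other control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Prints one "add" operation for a document batch.
#   $1: document id, $2: title, $3: overview
add_operation() {
  printf '{"type":"add","id":"%s","fields":{"title":"%s","overview":"%s"}}' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

# Usage (requires AWS credentials; the endpoint is the DocService endpoint
# reported by describe-domains):
#   printf '[%s]\n' "$(add_operation movie_1 'Toy Story' 'Led by Woody...')" > batch.json
#   aws cloudsearchdomain upload-documents \
#     --endpoint-url "https://doc-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com" \
#     --content-type application/json \
#     --documents batch.json
```

For a full CSV file, a proper CSV parser is a safer choice than shell text processing, because quoted fields can contain commas and line breaks.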

Running Search Queries

Searching for Text

To search for movies with the keyword house in the title and overview fields:

curl --location -g --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q=house&q.options={fields:["title","overview"]}&return=title,overview' | jq .

For more information, refer to the official documentation on searching for text.

The response will include matching results, such as:

{
  "status": {
    "rid": "8fDJv8swsgEK1DyD",
    "time-ms": 1
  },
  "hits": {
    "found": 26,
    "start": 0,
    "hit": [
      {
        "id": "local_file_466",
        "fields": {
          "overview": "Hip Hop duo Kid & Play return...",
          "title": "House Party 3"
        }
      },
      ...
    ]
  }
}
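The request above places braces, brackets, and quotes directly in the URL, which is why curl needs the -g flag. One way to sidestep this is to percent-encode the parameter values before building the URL. The following is a sketch under stated assumptions: urlencode and search_url are illustrative helpers, the encoder handles ASCII input only, and the endpoint is passed in as a bare host name.

```shell
# Percent-encodes a string, leaving only unreserved characters untouched.
# Limitation: intended for ASCII input.
urlencode() {
  s="$1"
  out=""
  while [ -n "$s" ]; do
    c=${s%"${s#?}"}   # first character of the remaining string
    s=${s#?}          # drop that character
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s' "$out"
}

# Builds a search URL for the 2013-01-01 API.
#   $1: search endpoint host, $2: query
#   $3: q.options value (optional), $4: return fields (optional)
search_url() {
  url="https://$1/2013-01-01/search?q=$(urlencode "$2")"
  if [ -n "${3:-}" ]; then url="$url&q.options=$(urlencode "$3")"; fi
  if [ -n "${4:-}" ]; then url="$url&return=$(urlencode "$4")"; fi
  printf '%s\n' "$url"
}

# Usage (requires network access to the domain):
#   curl -s "$(search_url "$SEARCH_ENDPOINT" house '{fields:["title","overview"]}' 'title,overview')" | jq .
```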

Searching for Numbers

To search a numeric field, for example movies with a vote_average of exactly 5.0, use the structured query parser:

curl --location --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:5.0&return=title,overview' | jq .

For more information, refer to the official documentation on searching for numbers.

The response will include matching results, such as:

{
  "status": {
    "rid": "w+Xgv8swwQEK1DyD",
    "time-ms": 0
  },
  "hits": {
    "found": 35,
    "start": 0,
    "hit": [
      {
        "id": "local_file_144",
        "fields": {
          "overview": "Far from home...",
          "title": "The Amazing Panda Adventure"
        }
      },
      ...
    ]
  }
}
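The response above reports 35 matches, but CloudSearch returns only one page of hits per request (10 by default). The size and start request parameters select the page. The helpers below are a small sketch of the arithmetic. page_params and page_count are hypothetical names, and pages are numbered from 1.

```shell
# Prints the start and size parameters for a 1-based page number.
#   $1: page number, $2: hits per page (defaults to 10)
page_params() {
  size="${2:-10}"
  printf 'start=%s&size=%s\n' "$(( ($1 - 1) * size ))" "$size"
}

# Prints how many pages are needed for a given "found" count.
#   $1: value of hits.found, $2: hits per page (defaults to 10)
page_count() {
  size="${2:-10}"
  printf '%s\n' "$(( ($1 + size - 1) / size ))"
}

# Usage: append the result to the query string of a search request, for example
#   ...&q=vote_average:5.0&return=title,overview&$(page_params 2)
```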

Searching for Ranges

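To find movies with a vote_average of 7.0 or higher (the square bracket makes the lower bound inclusive, and the curly brace leaves the upper bound open):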

curl --location -g --request GET 'https://search-searching-movies-data-xxxxxxxxxx.ap-northeast-1.cloudsearch.amazonaws.com/2013-01-01/search?q.parser=structured&q=vote_average:[7.0,}&return=title,overview' | jq .

For more information, refer to the official documentation on searching for a range of values.

The response will include matching results, such as:

{
  "status": {
    "rid": "vJ/vv8swyAEK1DyD",
    "time-ms": 2
  },
  "hits": {
    "found": 254,
    "start": 0,
    "hit": [
      {
        "id": "local_file_1",
        "fields": {
          "overview": "Led by Woody, Andy's toys...",
          "title": "Toy Story"
        }
      },
      ...
    ]
  }
}
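In the structured range syntax, a square bracket includes the bound and a curly brace excludes it or marks it as open-ended. The sketch below assembles range expressions from their parts to make the difference visible. range_query is a hypothetical helper, not a CloudSearch feature.

```shell
# Prints a structured-parser range expression.
#   $1: field name
#   $2: lower bound (empty for open-ended)
#   $3: upper bound (empty for open-ended)
#   $4: "yes" (default) to include the lower bound, anything else to exclude it
#   $5: "yes" (default) to include the upper bound, anything else to exclude it
range_query() {
  open='{'
  close='}'
  # A bound can be inclusive only when it is present.
  if [ -n "$2" ] && [ "${4:-yes}" = yes ]; then open='['; fi
  if [ -n "$3" ] && [ "${5:-yes}" = yes ]; then close=']'; fi
  printf '%s:%s%s,%s%s\n' "$1" "$open" "$2" "$3" "$close"
}

# Examples:
#   range_query vote_average 7.0 ''      -> 7.0 or higher
#   range_query vote_average 7.0 '' no   -> strictly greater than 7.0
#   range_query vote_average 5 7         -> between 5 and 7, both included
```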

Cleaning Up

Once you are done, delete the CloudSearch domain to avoid unnecessary charges:

aws cloudsearch delete-domain --domain-name searching-movies-data

Conclusion

AWS CloudSearch is a powerful service for implementing simple full-text search functionality. By following this guide, you can easily set up and manage a searchable domain for datasets like a movies catalog. For additional details, consult the AWS CloudSearch Developer Guide.

Happy Coding! 🚀