Querying a Data Lake on S3 with Athena & Glue on LocalStack

Key	Value
Environment	LocalStack, AWS
Services	S3, Athena, Glue, CloudFormation
Integrations	AWS CLI, CloudFormation
Categories	Big Data, Analytics, Data Lake
Level	Intermediate
Use Case	Resource Browsers, Big Data Testing
GitHub	Repository link

Introduction

This sample demonstrates how to build a comprehensive data analytics pipeline using Amazon Athena, S3, and Glue Catalog to query large datasets stored in a data lake. Starting with raw COVID-19 datasets from the Registry of Open Data on AWS, you'll deploy a complete analytics infrastructure that enables running standard SQL queries against structured data in S3 buckets. To test this application sample, we will demonstrate how you use LocalStack to deploy the infrastructure on your developer machine and validate big data workflows locally. The demo showcases LocalStack's Resource Browser capabilities for exploring Athena databases and running interactive SQL queries without the cost and complexity of AWS infrastructure.

Note

Initial service startup may take several minutes for dependency installation
Query performance is optimized for development testing, not production-scale analytics
Dataset size is limited to sample COVID-19 data for demonstration purposes

Architecture

The following diagram shows the architecture that this sample application builds and deploys:

S3 Buckets for storing COVID-19 datasets and Athena query results
Glue Data Catalog for metadata management and schema definitions
Athena serverless query service for interactive SQL analytics
CloudFormation for Infrastructure as Code deployment
Multiple data sources: hospital beds, vaccine distribution, and aggregated case data

Prerequisites

localstack CLI with a LOCALSTACK_AUTH_TOKEN
AWS CLI with the awslocal wrapper
make (optional, but recommended for running the sample application)

Note

This sample uses Athena & Glue Data Catalog which requires various dependencies to be lazily downloaded and installed at runtime, which increases the processing time on the first load. To mitigate this, you can pull the Big Data Mono container image with the default dependencies pre-installed.

docker pull localstack/localstack-pro:latest-bigdata

Start the container with IMAGE_NAME=localstack/localstack-pro:latest-bigdata configuration variable to use the pre-installed dependencies.

Installation

To run the sample application, you need to install the required dependencies.

First, clone the repository:

git clone https://github.com/localstack/sample-athena-glue-s3-data-lake-query.git

Then, navigate to the project directory:

cd sample-athena-glue-s3-data-lake-query

No additional installation steps are required as the sample uses CloudFormation templates and AWS CLI commands.

Deployment

Start LocalStack Pro with the LOCALSTACK_AUTH_TOKEN pre-configured:

localstack auth set-token <LOCALSTACK_AUTH_TOKEN>
IMAGE_NAME=localstack/localstack-pro:latest-bigdata localstack start

To deploy the sample application infrastructure, run the following command:

make deploy

Alternatively, you can deploy manually step-by-step.

Create S3 bucket and upload data

awslocal s3 mb s3://covid19-lake
awslocal s3 cp cloudformation-templates/CovidLakeStack.template.json s3://covid19-lake/cfn/CovidLakeStack.template.json
awslocal s3 sync ./covid19-lake-data/ s3://covid19-lake/

Deploy CloudFormation stack

awslocal cloudformation create-stack --stack-name covid-lake-stack --template-url https://covid19-lake.s3.us-east-2.amazonaws.com/cfn/CovidLakeStack.template.json

Verify deployment

awslocal cloudformation describe-stacks --stack-name covid-lake-stack | grep StackStatus

Wait for CREATE_COMPLETE status before proceeding.

Testing

After deployment, you can test the analytics pipeline using the LocalStack Web Application's Athena SQL viewer at https://app.localstack.cloud/inst/default/resources/athena/sql.

Query Examples

Run queries against the covid-19 database in the Glue Data Catalog:

Hospital beds data

SELECT * FROM covid_19.hospital_beds LIMIT 10

Aggregated COVID data by states

SELECT * FROM covid_19.enigma_aggregation_us_states

Moderna vaccine distribution

SELECT * FROM covid_19.cdc_moderna_vaccine_distribution

Integration tests

You can also run automated integration tests:

make test

Use Cases

Resource Browsers

In this sample, LocalStack's Resource Browser provides a web-based interface for interacting with Athena and Glue services without requiring additional tooling or AWS console access.

The Resource Browser allows you to:

Browse Glue Data Catalog databases and tables through the left navigation panel
Execute SQL queries directly in the browser with syntax highlighting and result formatting
View query execution history and rerun previous queries for iterative development

This approach eliminates the need to install and configure local SQL clients or connect to remote AWS services during development.

Big Data Testing

This sample includes patterns for testing big data workflows locally before deploying to production environments. LocalStack enables comprehensive validation of big data components without cloud infrastructure costs.

Key testing scenarios include:

Schema and Metadata Testing:
- Validate Glue Data Catalog table definitions and column mappings
- Test partitioning strategies and data formats (JSON, CSV, Parquet)
- Verify CloudFormation template creates correct database and table structures
Query Testing:
- Execute representative SQL queries against sample datasets
- Validate query execution plans and optimization strategies
- Test different table join patterns and aggregation logic
Integration Testing:
- End-to-end validation from S3 data ingestion through Athena query execution
- Verify S3 bucket policies and access patterns work correctly

LocalStack's isolated environment ensures tests don't interfere with production data while providing realistic AWS service behavior for comprehensive validation.

Troubleshooting

Issue	Resolution
Big Data services taking long to start	Use the pre-built `localstack-pro:latest-bigdata` Docker image to avoid dependency installation
CloudFormation stack creation fails	Verify S3 bucket exists and template is uploaded before creating stack
Athena queries return no results	Check Glue Data Catalog tables are created and S3 data is properly uploaded
Resource Browser not loading	Ensure LocalStack is running and the stack has been created successfully
Query execution timeouts	Reduce query complexity for development testing and review the LocalStack logs for any errors

Summary

This sample application demonstrates how to build, deploy, and test a complete big data analytics pipeline using AWS services and LocalStack. It showcases the following patterns:

Deploying scalable data lake architectures using S3, Athena, and Glue Data Catalog with CloudFormation
Running interactive SQL analytics against large datasets stored in S3 buckets
Using LocalStack's Resource Browser for intuitive data exploration and query development
Implementing comprehensive testing strategies for big data workflows in local environments
Leveraging AWS parity to ensure consistent behavior between development and production environments
Managing metadata and schema evolution through Glue Data Catalog integration

The application provides a foundation for understanding enterprise data analytics patterns and building cost-effective development workflows for AWS big data services.

Name		Name	Last commit message	Last commit date
Latest commit History 91 Commits
.github/workflows		.github/workflows
cloudformation-templates		cloudformation-templates
covid19-lake-data		covid19-lake-data
images		images
tests		tests
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Querying a Data Lake on S3 with Athena & Glue on LocalStack

Introduction

Architecture

Prerequisites

Installation

Deployment

Create S3 bucket and upload data

Deploy CloudFormation stack

Verify deployment

Testing

Query Examples

Hospital beds data

Aggregated COVID data by states

Moderna vaccine distribution

Integration tests

Use Cases

Resource Browsers

Big Data Testing

Troubleshooting

Summary

Learn More

About

Uh oh!

Languages

License

localstack-samples/sample-athena-glue-s3-data-lake-query

Folders and files

Latest commit

History

Repository files navigation

Querying a Data Lake on S3 with Athena & Glue on LocalStack

Introduction

Architecture

Prerequisites

Installation

Deployment

Create S3 bucket and upload data

Deploy CloudFormation stack

Verify deployment

Testing

Query Examples

Hospital beds data

Aggregated COVID data by states

Moderna vaccine distribution

Integration tests

Use Cases

Resource Browsers

Big Data Testing

Troubleshooting

Summary

Learn More

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Languages