The AWS Athena connector for DynamoDB enables you to query data stored in Amazon DynamoDB using Amazon Athena, which is typically used to query structured data in S3 using standard SQL. Since DynamoDB is a NoSQL database, querying it with SQL isn’t straightforward by default. This connector provides a bridge between the two, allowing you to leverage Athena’s SQL-based querying on data stored in DynamoDB.
CDK Construct
The data lake I am building requires that all data be encrypted with KMS and that everything be deployed through CDK. This construct deploys the connector's SAM template and configures all the necessary permissions. The out-of-the-box role was too permissive, so I injected my own.
```typescript
import { aws_sam as sam, Stack } from 'aws-cdk-lib';
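```

Only the import line survived above, so here is a minimal sketch of what such a construct can look like. It deploys the connector from the AWS Serverless Application Repository via `sam.CfnApplication`. The `semanticVersion`, parameter names, and props here are assumptions; verify them against the connector's SAM template for your version, and note that how you inject a custom role depends on what that template exposes.

```typescript
import { aws_sam as sam } from 'aws-cdk-lib';
import { Construct } from 'constructs';

export interface DdbConnectorProps {
  spillBucketName: string; // hypothetical prop: bucket for Athena spill data
}

// Minimal sketch: deploys the Athena DynamoDB connector from the AWS
// Serverless Application Repository. Parameter names and the semantic
// version vary by connector release; verify against the actual template.
export class DdbConnector extends Construct {
  constructor(scope: Construct, id: string, props: DdbConnectorProps) {
    super(scope, id);

    new sam.CfnApplication(this, 'AthenaDdbConnector', {
      location: {
        applicationId:
          'arn:aws:serverlessrepo:us-east-1:292517598671:applications/AthenaDynamoDBConnector',
        semanticVersion: '2023.8.1', // hypothetical version; pin one you have vetted
      },
      parameters: {
        AthenaCatalogName: 'ddb-connector', // also used as the Lambda function name
        SpillBucket: props.spillBucketName,
        SpillPrefix: 'athena-spill',
      },
    });
  }
}
```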
Manual Steps
- You must manually register the connector with Athena
- To use KMS, you must manually add this environment variable to the Lambda:

  ```
  spill_put_request_headers={"x-amz-server-side-encryption" : "aws:kms"}
  ```
Drawbacks
This worked perfectly for small, homogeneous, and relatively flat tables, but you may run into problems if:
- You have nested schemas
  - The connector could not infer nested object schemas well and would fail
- You have a lot of columns
  - I had a flattened table with ~1,000 columns; whenever I ran `SELECT *` the query failed, but queries against specific columns succeeded
  - Under the hood, the connector creates ProjectionExpressions from the selected columns; ~1,000 columns created massive expressions that were rejected by DynamoDB
Alternative for Large Tables
I still use the connector in a few places, but for the really wide table I went with another approach.
I use DDB for operational processes during the day. The reporting data does not need to be real-time and can be rebuilt nightly.
- A nightly Lambda triggers a DDB Table Export to S3 ($0.50/GB); see the first sketch after this list
- When the export drops, another Lambda
  - Unmarshalls the JSON data and stores it in an S3 staging prefix
  - Invokes an Athena query (second sketch below): `INSERT INTO optimized_table SELECT * FROM json_table`
    - `json_table` is a Glue table definition over the staged, unmarshalled JSON data
    - `optimized_table` is a Glue table definition using Parquet, Snappy, and partitioning
    - NOTE: The query also performs some deduping, which I omitted for brevity
- Reporting is done on `optimized_table`, which is super compressed and fast
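A minimal sketch of the nightly trigger, assuming point-in-time recovery is enabled on the table (a requirement for exports) and that `TABLE_ARN` and `EXPORT_BUCKET` are environment variables you supply:

```typescript
import {
  DynamoDBClient,
  ExportTableToPointInTimeCommand,
} from '@aws-sdk/client-dynamodb';

const ddb = new DynamoDBClient({});

// Nightly handler: kick off a DynamoDB table export to S3.
// TABLE_ARN and EXPORT_BUCKET are hypothetical environment variables.
export const handler = async (): Promise<void> => {
  await ddb.send(
    new ExportTableToPointInTimeCommand({
      TableArn: process.env.TABLE_ARN!,
      S3Bucket: process.env.EXPORT_BUCKET!,
      S3Prefix: 'ddb-export/',
      // Each exported line is {"Item": {...}} in DynamoDB attribute-value encoding
      ExportFormat: 'DYNAMODB_JSON',
    }),
  );
};
```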
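And a sketch of the second Lambda's two core steps: unmarshalling an export line into plain JSON, and starting the Athena INSERT. The database name `datalake` and the results bucket are hypothetical.

```typescript
import { unmarshall } from '@aws-sdk/util-dynamodb';
import {
  AthenaClient,
  StartQueryExecutionCommand,
} from '@aws-sdk/client-athena';

// Each line of a DYNAMODB_JSON export looks like {"Item": {"pk": {"S": "abc"}, ...}}.
// unmarshall() converts the attribute-value encoding into plain JSON for staging.
export function unmarshallLine(line: string): Record<string, unknown> {
  return unmarshall(JSON.parse(line).Item);
}

const athena = new AthenaClient({});

// Kick off the INSERT from the JSON staging table into the Parquet table.
export async function rebuildOptimizedTable(): Promise<void> {
  await athena.send(
    new StartQueryExecutionCommand({
      QueryString: 'INSERT INTO optimized_table SELECT * FROM json_table',
      QueryExecutionContext: { Database: 'datalake' }, // hypothetical database
      ResultConfiguration: {
        OutputLocation: 's3://my-athena-results/', // hypothetical results bucket
      },
    }),
  );
}
```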
Other Tips
- If your data is strictly transactional, you may be able to delete the DDB source data to reduce costs
- If you have a lot of data, the Table `Scan` + `BatchDelete` operation may be too slow. In my case, the Lambda was timing out after 15 minutes
  - Solution: Delete and recreate the table using the SDK (see the sketch after this list)
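A sketch of the delete-and-recreate approach, assuming a simple table with a single string partition key; carry over your real key schema, indexes, and billing mode:

```typescript
import {
  DynamoDBClient,
  DeleteTableCommand,
  CreateTableCommand,
  waitUntilTableNotExists,
  waitUntilTableExists,
} from '@aws-sdk/client-dynamodb';

const ddb = new DynamoDBClient({});

// Dropping and recreating the table is much faster than Scan + BatchDelete
// for large tables. The 'pk' key schema here is hypothetical.
export async function recreateTable(tableName: string): Promise<void> {
  await ddb.send(new DeleteTableCommand({ TableName: tableName }));
  await waitUntilTableNotExists(
    { client: ddb, maxWaitTime: 300 },
    { TableName: tableName },
  );

  await ddb.send(
    new CreateTableCommand({
      TableName: tableName,
      AttributeDefinitions: [{ AttributeName: 'pk', AttributeType: 'S' }],
      KeySchema: [{ AttributeName: 'pk', KeyType: 'HASH' }],
      BillingMode: 'PAY_PER_REQUEST',
    }),
  );
  await waitUntilTableExists(
    { client: ddb, maxWaitTime: 300 },
    { TableName: tableName },
  );
}
```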
Troubleshooting
Athena DynamoDB Connector returns 0 rows
Re-run your query with `LIMIT 10`. If the query succeeds, check your Lambda logs; you may be getting a permission denied error when writing to the spill bucket.
Solution: Manually add this environment variable to the Lambda:

```
spill_put_request_headers={"x-amz-server-side-encryption" : "aws:kms"}
```
Athena DynamoDB Connector Error: Invalid ProjectionExpression
Selecting certain fields works:

```sql
SELECT foo, bar FROM "default"."datalake" limit 10;
```

Selecting all fields fails. I am guessing this is because the connector puts all the columns into the ProjectionExpression.

```sql
SELECT * FROM "default"."datalake" limit 10;
```
```
GENERIC_USER_ERROR: Encountered an exception[software.amazon.awssdk.services.dynamodb.model.DynamoDbException] from your LambdaFunction[arn:aws:lambda:{region}:{account}:function:ddb-connector] executed in context[S3SpillLocation{bucket='ddb-connector', key='athena-spill/xxx', directory=true}] with message[Invalid ProjectionExpression: Expression size has exceeded the maximum allowed size; (Service: DynamoDb, Status Code: 400, Request ID: xxx)]
```