While products like S3 in AWS are very well established as great repositories for Data Lakes, the practices and governance applied to these data stores by developers and data managers are typically poor.
Storing data in a lake versus a traditional database is a pretty easy transition to make, but what is often forgotten is that similar to a traditional database, there are good (or Best) practices, and some very bad practices that can quickly come back to haunt you.
Think of the cringe you may have experienced when working on a relational database that is poorly structured, with horrible field naming conventions, missing primary keys, no indexing, etc. In the same way, a Data Lake that is set up poorly will over time turn murky and become swamp-like. Problems really start appearing when fresh eyes try to make sense of the data in order to do some reporting over what may or may not be there.
Without structure, it is common to start your BI project by wading through a swathe of issues that ideally should never have been a problem:
- Where’s the documentation?
- Why are there validation errors in this data?
- Why does the schema change over time?
- What the hell is this file format?
- Who created this data?
- Where did this data even come from?
AAAAARRRRGH! Just me?
While the tools are all there in AWS to make a Data Lake environment easy to manage, until recently there were no established patterns to bring consistency to how an organisation ingests, validates, curates, catalogues and monitors data sent into a Data Lake.
Enter the AWS Accelerated Data Lake
Introduced seven months ago, the ‘AWS Accelerated Data Lake (3x3x3)’ is a packaged Data Lake solution that provides the much-needed glue to keep your S3 Data Lakes easily manageable over time.
The solution packages a range of AWS services together (S3, DynamoDB, Elasticsearch, Lambda, Step Functions, SNS) and enforces standards for what format the data is allowed to be in, how it is stored, cleaned and catalogued (tagged), etc.
Perhaps its biggest selling point is that it is serverless and provides near real-time ingestion, cleansing and availability of curated data for reporting. This on-the-fly, serverless equivalent of ETL (Extract, Transform, Load) means end users can be given access to curated content in minutes.
AWS have provided the full code for the solution on GitHub for you to take and make your own. See: https://github.com/aws-samples/accelerated-data-lake
Improving the Pattern
Out of the box the pattern is very good, but it can be extended for even better results.
We recently completed a project where we needed to back enter millions of records and guarantee all records were processed end-to-end.
One of the unique challenges of a serverless environment is the time limit on a Lambda function (max 15 minutes) and the number of functions that can run concurrently (by default AWS allows 1,000 concurrent executions). While Lambda concurrency can be increased through a support ticket, asking AWS for 1,000,000 concurrent Lambda executions (if that is even possible) to deal with your initial import will probably cost you big $$$.
The solution is to improve the pattern with an SQS queue that kicks off the Accelerated Data Lake for any records thrown in the drop folder (instead of the default S3 trigger). Using the queue, you can process records in batches (say 100 records at a time) and throttle the concurrency of the Lambda function that starts the opening 3x3x3 step function.
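To make the idea concrete, here is a minimal Python sketch of an SQS-driven trigger. It is not the code from the Accelerated Data Lake repository: it assumes the drop bucket sends its S3 event notifications to the queue, and the function names, the STATE_MACHINE_ARN environment variable, the concurrency cap of 50 and the shape of the step function input are all hypothetical.

```python
import json
from urllib.parse import unquote_plus


def configure_throttling(lambda_client, function_name, queue_arn,
                         batch_size=100, max_concurrency=50):
    """Cap parallel runs of the function and feed it from the queue in batches."""
    # Reserved concurrency is a hard ceiling on simultaneous invocations.
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=max_concurrency,
    )
    # SQS batch sizes above 10 need a batching window of at least 1 second.
    lambda_client.create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        BatchSize=batch_size,
        MaximumBatchingWindowInSeconds=5,
        FunctionResponseTypes=["ReportBatchItemFailures"],
    )


def process_batch(event, start_execution):
    """Start one execution per dropped object; report the messages that failed."""
    failures = []
    for message in event.get("Records", []):
        try:
            notification = json.loads(message["body"])
            for s3_record in notification.get("Records", []):
                start_execution({
                    "bucket": s3_record["s3"]["bucket"]["name"],
                    # Object keys arrive URL-encoded in S3 notifications.
                    "key": unquote_plus(s3_record["s3"]["object"]["key"]),
                })
        except Exception:
            # Only the failed message goes back to the queue for a retry.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}


def handler(event, context):
    """Lambda entry point (hypothetical wiring)."""
    import os
    import boto3  # imported here so process_batch can be tested without AWS

    sfn = boto3.client("stepfunctions")
    arn = os.environ["STATE_MACHINE_ARN"]  # assumed environment variable

    def start(payload):
        sfn.start_execution(stateMachineArn=arn, input=json.dumps(payload))

    return process_batch(event, start)
```

Returning the failed message IDs means a bad record does not force the whole batch to be retried, and anything that keeps failing can be sent to a dead-letter queue for inspection.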
In our projects we also separate the data lake buckets for better delineation of the key steps, sending data through the following ordered stages: Drop, Raw, Staging and Curated.
Our improved pattern picks up any data dropped into the ‘drop bucket’ in batches via the SQS queue trigger, creates a complete unaltered copy of the record in the Raw bucket, enriches the data in the Staging bucket, and finally provides end-user consumable versions of the data in the Curated buckets.
Costs of Running
When dealing with records at scale, cost management of the Accelerated Data Lake solution is also very important. Of all the components in the Accelerated Data Lake, DynamoDB, Elasticsearch and Step Functions are the services to be most aware of cost-wise.
Step Functions is priced by the number of state transitions the step function performs per record. Keeping the number of transitions between the expected Start and End states to a minimum will save you a lot of money each month.
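A quick back-of-the-envelope calculation shows why the transition count matters. The sketch below assumes a list price of 0.025 USD per 1,000 state transitions for Standard Workflows (check the current pricing for your region), and the record volume and transition counts are made-up figures for illustration only.

```python
from decimal import Decimal

# Assumed price in USD per 1,000 state transitions; verify against current pricing.
PRICE_PER_1000_TRANSITIONS = Decimal("0.025")


def monthly_transition_cost(records, transitions_per_record,
                            price_per_1000=PRICE_PER_1000_TRANSITIONS):
    """Cost of pushing `records` through a workflow with a fixed transition count."""
    total_transitions = Decimal(records) * Decimal(transitions_per_record)
    return total_transitions * price_per_1000 / Decimal(1000)


# Hypothetical comparison: the same million records through a long and a short path.
long_path = monthly_transition_cost(1_000_000, 12)
short_path = monthly_transition_cost(1_000_000, 7)
print(f"12 transitions per record: ${long_path:.2f}")
print(f" 7 transitions per record: ${short_path:.2f}")
```

Under these assumptions, trimming five transitions from the happy path takes the bill for a million records from 300 USD to 175 USD.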
Since the pattern was created, DynamoDB has added support for ‘On Demand’ billing. Converting to ‘On Demand’ will save you money when you are not ingesting data. It will also help the solution scale when you ingest a lot of data infrequently, such as during the initial import. Provisioned billing will only help if you have a relatively consistent workload – which may be the case for some 3x3x3 workloads.
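Switching a table over is a single API call. The helper below is a small sketch using the boto3 DynamoDB client's describe_table and update_table calls; the helper name and the table name in the usage note are hypothetical, and the client is passed in so the logic can be tried without touching AWS.

```python
def switch_to_on_demand(dynamodb_client, table_name):
    """Move a table to on-demand billing; return True if a change was requested."""
    table = dynamodb_client.describe_table(TableName=table_name)["Table"]
    # BillingModeSummary can be absent on tables that have always been provisioned.
    current = table.get("BillingModeSummary", {}).get("BillingMode", "PROVISIONED")
    if current == "PAY_PER_REQUEST":
        return False  # already on demand, nothing to do
    dynamodb_client.update_table(
        TableName=table_name,
        BillingMode="PAY_PER_REQUEST",
    )
    return True
```

In practice you would call it as switch_to_on_demand(boto3.client("dynamodb"), "my-data-catalogue-table") for each table the solution creates, or set the billing mode in the CloudFormation template instead.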
The Accelerated Data Lake pattern is a really good way to provide near real-time validation, tagging and cataloguing of your data in AWS. If you have lots of data that will need back entering (a million records or more), we recommend extending the pattern with SQS to improve reliability, and applying the cost changes above to make it run a lot cheaper.
The beauty of this solution is that it forces a level of standardisation in how you capture data and apply taxonomy across your Data Lakes. You can monitor any records that fail validation and create great separation between initially entered data (Raw) and final enriched data (Curated).
With each and every record in S3 tagged, you always know where it came from, who owns it, and when it should be used. This tagging can also be used to control who has access to the data within the business, using IAM policies that condition on the recorded tags.
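As an illustration of tag-based access, the sketch below builds an IAM policy document that only allows reading objects carrying a given tag, using the s3:ExistingObjectTag condition key. The bucket name, tag key and tag value are hypothetical; use whatever tags your own ingestion applies.

```python
import json


def tag_scoped_read_policy(bucket, tag_key, tag_value):
    """Build a policy allowing s3:GetObject only on objects tagged key=value."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                "StringEquals": {f"s3:ExistingObjectTag/{tag_key}": tag_value}
            },
        }],
    }


# Hypothetical example: the finance team may only read objects they own.
policy = tag_scoped_read_policy("example-curated-bucket", "owner", "finance")
print(json.dumps(policy, indent=2))
```

The resulting JSON can be attached to a role or group, so access follows the tags on each object rather than a hand-maintained list of prefixes.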
If you are using AWS for hosting your S3 Data Lake, I suggest you check it out!
Thanks to Paul Macey for a great Deep Dive on the solution here: https://www.youtube.com/watch?v=8q-00R22Pzw
If you are looking to embark on a Data Lake project and need help, or otherwise find yourself stuck in the murky swamp like a bad scene out of the NeverEnding Story, drop us a line at Comunet on +61 (8) 8100 1111.
We provide BI, DevOps, CI/CD consultancy and can help you deliver a high performing and maintainable Data Analytics Platform on AWS.