- Okay, so let me go into more detail:
- Step function:
- - takes start and end dates as an input
- - that input gets passed to the first Lambda, which creates a list of dates for one month (e.g. 01-01-2024, 02-01-2024, ..., 31-01-2024) and provides that list as the input to a Map state
- - the Map state takes the input dates (one by one in our case), picks the day/month/year out of each one and passes them into a preconfigured query (Python's f-string can be used, so SELECT * FROM table WHERE day = <day> etc.). NOTE: thinking back now, I actually used CTAS. It let me specify the compression (GZIP) and the S3 location for the data (so it doesn't go to the default results bucket); there's a rough sketch of such a query right after this list
- - the query runs in Athena and writes the output to an S3 bucket. The results are, in our case, an aggregate of one day's worth of data
- - when all Map iterations complete, another Lambda runs that picks the results up from the S3 bucket (a month's worth of daily data) and GZIPs them into one archive
- - we apply lifecycle rules to the resulting files (12 of them per year, after the third Lambda has executed) and they get moved to deep archive after one day
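- As a rough sketch of what that per-day CTAS query can look like when built with an f-string (table, column and bucket names below are made up, and the exact CTAS properties depend on your Athena engine version):

```python
# Hypothetical sketch of the per-day CTAS query builder.
# CTAS lets you choose the output format, compression and S3 location up front.
def build_ctas_query(day: int, month: int, year: int) -> str:
    table_name = f"daily_agg_{year}_{month:02d}_{day:02d}"
    output_location = f"s3://my-archive-bucket/athena-results/{year}/{month:02d}/{day:02d}/"
    return f"""
        CREATE TABLE {table_name}
        WITH (
            format = 'TEXTFILE',
            write_compression = 'GZIP',
            external_location = '{output_location}'
        ) AS
        SELECT *
        FROM source_table
        WHERE year = {year} AND month = {month} AND day = {day}
    """
```
- If year/month/day are partition columns of source_table, the WHERE clause also keeps the scanned data (and cost) down, which ties into the partitioning point further below.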
- First Lambda:
- - if we pass an input to SF as:
- {
- "start_date": "01-01-2020",
- "end_date": "28-08-2024"
- }
- We then parse that input to create the first set of dates that will be passed to the Map state. That list looks something like ["01-01-2020", "02-01-2020", ..., "31-01-2020"], and each entry is then handled by the second Lambda.
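- A minimal sketch of that first Lambda, assuming the DD-MM-YYYY input shown above (function and field names are just illustrative):

```python
from datetime import datetime, timedelta

def lambda_handler(event, context):
    # Parse the Step Function input, e.g. {"start_date": "01-01-2020", "end_date": "28-08-2024"}.
    start = datetime.strptime(event["start_date"], "%d-%m-%Y").date()
    end = datetime.strptime(event["end_date"], "%d-%m-%Y").date()

    # Collect every date of the start month, but never past the overall end date.
    dates = []
    current = start
    while current <= end and (current.month, current.year) == (start.month, start.year):
        dates.append(current.strftime("%d-%m-%Y"))
        current += timedelta(days=1)

    # This list becomes the Map state's input; each entry fans out to the second Lambda.
    return {"dates": dates, "end_date": event["end_date"]}
```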
- Second Lambda:
- - the Lambda runs inside the Map state. It receives one date as an input, which means you'd have n Lambda executions, where n is equal to the number of dates in the input range
- - each date gets passed to the query builder, and the resulting CTAS query runs in Athena
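- Sketched with boto3's Athena client, the second Lambda could look roughly like this (the database and result-bucket names are assumptions, and build_ctas_query is the CTAS sketch from earlier):

```python
import time
import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    # Assumes the Map state passes each date string directly, e.g. "05-01-2020".
    day, month, year = (int(part) for part in event.split("-"))

    query = build_ctas_query(day, month, year)  # the CTAS builder sketched earlier

    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_database"},  # assumed database name
        ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},  # assumed bucket
    )

    # Poll until the query finishes so the Map iteration only completes
    # once the day's result file is actually in S3.
    execution_id = response["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)

    return {"date": event, "query_state": state}
```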
- Third Lambda (outside Map state):
- - aggregates a month's worth of data into one GZIP archive
- - checks whether all dates from the SF input have been processed. If not, the flow goes back to the first Lambda for the next month
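- A rough sketch of that third Lambda (bucket names and prefixes are made up; it assumes one month's daily files fit in the Lambda's memory, which is exactly the sizing point discussed further below):

```python
import io
import tarfile
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Hypothetical names; the prefix holds one month's worth of daily result files.
    bucket = "my-archive-bucket"
    prefix = f"athena-results/{event['year']}/{event['month']:02d}/"  # assumes ints in the event

    # Build a .tar.gz in memory from every daily object under the prefix.
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                info = tarfile.TarInfo(name=obj["Key"].split("/")[-1])
                info.size = len(body)
                archive.addfile(info, io.BytesIO(body))

    # One archive per month; the lifecycle rule then moves these to Deep Archive.
    buffer.seek(0)
    key = f"monthly/{event['year']}-{event['month']:02d}.tar.gz"
    s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
    return {"archive_key": key}
```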
- So, there may be other quirks to this, but bear in mind I wrote this more than 5 years ago as a junior. There are probably better ways to do it, but this is the general approach I used to lower the cost.
- To answer your assumptions:
- > I am assuming, if I will create hundreds of queries to scan data on few hundred GBs per query then also my cost would be on per TB basis which will be 39TB. In my case 200$.
- Yes, that's correct. If your data is partitioned, you can use CTAS queries that select only those partitions, which limits the data scanned when creating the resulting file. A CTAS query creates a table, but also the resulting files with your data. Just remember to drop the tables when you're done.
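- The drop itself can just be another Athena query, e.g. (same assumed names as before; as far as I remember, dropping the CTAS table only removes the metadata and the result files stay in S3, but verify that on your setup):

```python
import boto3

athena = boto3.client("athena")

def drop_ctas_table(table_name: str) -> None:
    # Cleanup after the day's result files have been archived.
    athena.start_query_execution(
        QueryString=f"DROP TABLE IF EXISTS {table_name}",
        QueryExecutionContext={"Database": "my_database"},  # assumed database name
        ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},  # assumed bucket
    )
```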
- > The result file(I am assuming can be create in csv or any format which lambda can read)
- You can specify the file format in the CTAS query. Very useful for this use case.
- > S3 will trigger a lambda function which will zip the objects present in the result file
- That's also a way you can do it. We decided to just pick the data up from S3 and create the archive ourselves, since we knew the file names and their locations.
- > But there will be lot of objects and lambda can't handle archiving all the objects present in result file. If data is chunked for multiple lambdas then how will I monitor which objects archived and which are remaining from multiple lambdas.
- You partition the work so that no single chunk gets too big for a Lambda. You can also use a control DDB table to drive your processing, which requires some design work.
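- One way such a control table could look, sketched with boto3 (the table name and key schema are made up):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
control = dynamodb.Table("archive-control")  # assumed table with partition key "date"

def mark_processed(date_str: str, s3_key: str) -> None:
    # One item per day; the third Lambda can check these to decide whether to loop back.
    control.put_item(Item={"date": date_str, "status": "ARCHIVED", "s3_key": s3_key})

def is_processed(date_str: str) -> bool:
    item = control.get_item(Key={"date": date_str}).get("Item")
    return bool(item) and item.get("status") == "ARCHIVED"
```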
- > Once archived data is present on new bucket then you apply lifecycle policy on that and move the data which will reduce the API calls to Glacier significantly and reduce the cost.
- That's correct. The last thing to do is to apply the lifecycle policy to the relatively small number of resulting files.
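- For reference, this is roughly what that lifecycle rule looks like via boto3 (bucket and prefix are the hypothetical ones from the sketches above; it can just as well be set up in the console or your IaC tool):

```python
import boto3

s3 = boto3.client("s3")

# Move the monthly archives to Glacier Deep Archive one day after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "monthly-archives-to-deep-archive",
                "Status": "Enabled",
                "Filter": {"Prefix": "monthly/"},
                "Transitions": [{"Days": 1, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    },
)
```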