Issue
Current scenario: We are processing 40M records with some Java code and uploading them as CSV files to an S3 bucket.
Future: We want to move this code to AWS. For this, we want a Python script that processes the records and loads them as CSV files into the S3 bucket. Can you suggest the best way to trigger the script and process the data? We want to avoid using EC2 and hosting the Python script on a server; we want this to be a serverless service.
My approach: I thought of doing this with AWS Glue, using a trigger (on-demand or time-based) to start the job, and putting my code in the script inside the job.
Is this a good approach?
Solution
You can use AWS Fargate for that. Since the maximum execution time of a Lambda function is 15 minutes, a long-running job like this is a better fit for Fargate, which can be kicked off from Lambda; a rough sketch of starting a Fargate task from Lambda follows the link below.
More details: https://serverless.com/blog/serverless-application-for-long-running-process-fargate-lambda/
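As a minimal sketch (not the linked article's exact code), a Lambda handler can start a one-off Fargate task with boto3. The cluster name, task definition, and subnet ID here are placeholders you would replace with your own:

```python
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    # Start a one-off Fargate task that runs the CSV-processing container.
    # "processing-cluster", "csv-processor:1" and the subnet ID are placeholders.
    response = ecs.run_task(
        cluster="processing-cluster",
        launchType="FARGATE",
        taskDefinition="csv-processor:1",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
    )
    return response["tasks"][0]["taskArn"]
```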
You can also schedule it with an AWS Event rule (CloudWatch Events / EventBridge); see the sketch below.
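For example, assuming a Lambda function named `csv-processor` already exists, a scheduled rule can be wired up with boto3 roughly like this (the rule name and rate expression are placeholders):

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Run the job once a day; the rule name and schedule are placeholders.
rule_arn = events.put_rule(
    Name="daily-csv-processing",
    ScheduleExpression="rate(1 day)",
    State="ENABLED",
)["RuleArn"]

# Point the rule at the Lambda function and allow EventBridge to invoke it.
function_arn = lambda_client.get_function(
    FunctionName="csv-processor"
)["Configuration"]["FunctionArn"]

events.put_targets(
    Rule="daily-csv-processing",
    Targets=[{"Id": "1", "Arn": function_arn}],
)
lambda_client.add_permission(
    FunctionName="csv-processor",
    StatementId="allow-eventbridge-daily-csv-processing",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)
```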
Look, a solution always exists, but there is good practice and bad practice.
What if I told you that you can do this with just AWS Lambda, an AWS Event rule, and SQS? How would that sound? Interesting?
So in short, yes, you can do that. Track the time consumed in the Lambda function; when it reaches about 14 minutes, send a message to SQS recording the last processed row number, upload the file processed so far to S3, and quit. Use that SQS queue, with a short delay of around 30 seconds on the message, to trigger the same Lambda again and resume from that row number. Once all processing is complete you will have multiple processed files in S3; use another Lambda and SQS queue to consolidate them into one. A sketch of the checkpoint-and-requeue part follows. This isn't bad practice, but it is less good in my opinion. Happy?
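A minimal sketch of that checkpoint-and-requeue pattern, assuming a placeholder queue URL and bucket name, a known total record count, and a hypothetical `process_rows()` generator that yields CSV lines from the source data starting at a given row:

```python
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/csv-processing"  # placeholder
BUCKET = "my-processed-csv-bucket"                                             # placeholder
TOTAL_ROWS = 40_000_000        # assumed known record count from the question
SAFETY_MARGIN_MS = 60_000      # stop ~1 minute before the 15-minute Lambda limit

def handler(event, context):
    # Resume from the row number carried in the SQS message, or start at 0.
    start_row = 0
    if event.get("Records"):
        start_row = json.loads(event["Records"][0]["body"])["next_row"]

    row = start_row
    with open("/tmp/part.csv", "w") as out:
        # process_rows() is a hypothetical generator over the source records.
        for row, csv_line in process_rows(start_row):
            out.write(csv_line + "\n")
            if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
                break

    # Upload the chunk processed in this invocation.
    s3.upload_file("/tmp/part.csv", BUCKET, f"parts/rows-{start_row}-{row}.csv")

    if row < TOTAL_ROWS - 1:
        # Re-queue with the next row number; the delay gives a short pause
        # before the queue triggers this same Lambda again.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"next_row": row + 1}),
            DelaySeconds=30,
        )
```

A second Lambda, triggered once the last chunk is written, would then list the `parts/` prefix and concatenate the files into a single CSV, as the answer describes.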
Answered By - Asfar Irshad