Ram Krishnan

Mar 26, 2021

4 min read

Building a scalable and cost-efficient Data Processing Engine

A majority of businesses today rely on data to drive insights, and it plays a vital role in their decision-making process. To achieve this, organizations pour money into their infrastructure to be able to run data jobs. Some have the expertise to build their platform in-house, whereas others look towards enterprise data platforms such as Databricks, Cloudera, BlueData, SageMaker, etc.

Data Engineering and Machine Learning pipelines are complex and highly iterative in nature, which makes it all the more important to design tech stacks efficiently, with one eye on the prize and the other on price.

The Data Science Tech Stack

The data platform stack discussed here combines multiple tools from the open-source world to build up the data storage, processing, and management components. The entire platform is powered by and deployed on Amazon EC2 instances, which is the only paid service, infrastructure-wise, that the platform uses. Spot instances are used widely to optimize costs.
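For illustration, here is a hedged sketch, using boto3, of how such a spot-heavy worker group might be provisioned. The group name, launch template, subnets, and instance types below are hypothetical placeholders, not the configuration from this platform.

```python
# Sketch: an EC2 Auto Scaling Group that runs almost entirely on spot
# capacity and can scale to zero. Names and subnets are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="data-processing-workers",  # hypothetical name
    MinSize=0,                                   # scale to zero when idle
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",   # placeholder subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "data-worker",  # hypothetical template
                "Version": "$Latest",
            },
            # Diversifying instance types reduces spot interruptions.
            "Overrides": [
                {"InstanceType": "r5.2xlarge"},
                {"InstanceType": "r5a.2xlarge"},
                {"InstanceType": "r4.2xlarge"},
            ],
        },
        "InstancesDistribution": {
            # Everything above the base capacity runs on spot.
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```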

Designing the distributed processing layer

Challenges of running 24x7 Big Data clusters

In our experience, running 24x7 clusters, be it Spark or any other distributed computing layer, to support our data jobs proved challenging for a number of reasons.

Most importantly, all of the above translated to spending massively on under-utilized infrastructure, a problem that a majority of organizations face in the real world.

Moving to an On-demand, Dynamic framework

Building the Data Science Platform in-house gave us an edge when it came to customization and leveraging features from multiple tools to cater to our requirements.

This is how we leveraged those features to build our On-demand and Dynamic Data Processing Platform (a minimal sketch of the flow follows the list):

1. Workflow orchestration through Airflow

2. KubernetesPodOperator from Airflow

3. AWS Auto Scaling Group

4. Execution of workload

5. Decommissioning the machines
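Tying the five steps together, here is a minimal sketch, assuming Airflow 2.x with the cncf.kubernetes provider and boto3 installed: a DAG that scales an Auto Scaling Group up, runs the workload in a pod via KubernetesPodOperator, and scales the group back down. The DAG id, ASG name, namespace, and container image are hypothetical.

```python
# Sketch of steps 1-5: orchestrate with Airflow, run the workload in a
# Kubernetes pod, and manage the machines through an ASG. All names are
# hypothetical placeholders.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

ASG_NAME = "data-processing-workers"  # hypothetical Auto Scaling Group


def scale_asg(desired_capacity: int) -> None:
    """Resize the ASG that backs the Kubernetes worker nodes (steps 3 and 5)."""
    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=desired_capacity,
        HonorCooldown=False,
    )


with DAG(
    dag_id="on_demand_spark_job",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Step 3: spin up worker machines before the job runs.
    scale_up = PythonOperator(
        task_id="scale_up_asg",
        python_callable=scale_asg,
        op_kwargs={"desired_capacity": 5},
    )

    # Steps 2 and 4: run the workload in a pod on the new capacity.
    run_job = KubernetesPodOperator(
        task_id="run_spark_job",
        name="spark-job",
        namespace="data-jobs",
        image="example.registry/spark-job:latest",  # hypothetical image
        cmds=["spark-submit"],
        arguments=["job.py"],
        get_logs=True,
        is_delete_operator_pod=True,
    )

    # Step 5: decommission the machines once the workload finishes.
    scale_down = PythonOperator(
        task_id="scale_down_asg",
        python_callable=scale_asg,
        op_kwargs={"desired_capacity": 0},
        trigger_rule="all_done",  # tear down even if the job fails
    )

    scale_up >> run_job >> scale_down
```

The `trigger_rule="all_done"` on the teardown task is the key design choice here: the machines are decommissioned even when the workload fails, so a broken job never leaves paid capacity running.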

Logging and Monitoring

It's vital that there is a layer that logs usage for this framework. Among other things, it helps in monitoring the short-lived clusters the framework spins up and in attributing infrastructure costs to the jobs that ran on them.
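As an illustration only, a per-run usage record might look like the following sketch; the helper and its field names are assumptions, not the platform's actual implementation.

```python
# Hypothetical sketch: emit one structured usage record per job run,
# capturing how long the machines were held and by which job.
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("usage")


@contextmanager
def track_usage(job_id: str, instance_type: str, node_count: int):
    """Log how long a job held its machines, for cost attribution."""
    start = time.time()
    try:
        yield
    finally:
        logger.info(json.dumps({
            "job_id": job_id,
            "instance_type": instance_type,
            "node_count": node_count,
            "runtime_seconds": round(time.time() - start, 1),
        }))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with track_usage("demo-job", "r5.2xlarge", node_count=5):
        time.sleep(1)  # stand-in for the actual workload
```

Emitting one structured record per run makes it straightforward to aggregate instance-hours by pipeline or team downstream.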

Cloud-agnostic

The framework has been designed in such a way that it doesn't depend on a particular cloud provider. What it requires from a cloud service are a few basic capabilities: APIs to provision and decommission virtual machines, an auto-scaling group equivalent, and discounted transient capacity along the lines of spot instances.
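One way to express that independence, sketched under the assumption that the framework wraps each provider behind a common interface, is an abstraction like the following; the class names are hypothetical.

```python
# Hypothetical sketch of a provider abstraction: the orchestration layer
# only ever sees ComputeProvider, never a specific cloud SDK.
from abc import ABC, abstractmethod


class ComputeProvider(ABC):
    """What the framework needs from any cloud: elastic machine groups."""

    @abstractmethod
    def scale(self, group_name: str, desired_capacity: int) -> None:
        """Grow or shrink a named group of machines."""


class AwsProvider(ComputeProvider):
    def scale(self, group_name: str, desired_capacity: int) -> None:
        import boto3

        boto3.client("autoscaling").set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=desired_capacity,
            HonorCooldown=False,
        )
```

A GCP managed instance group or an Azure VM scale set implementation could slot in behind the same interface without touching the orchestration layer.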

In Conclusion

Stay tuned for our next post on how to set up Data Science developer environments using the same framework.