Building a scalable and cost-efficient Data Processing Engine
Most businesses today rely on data to drive insights, and it plays a vital role in their decision-making. To support this, organizations pour money into infrastructure to run data jobs. Some have the expertise to build their platform in-house, while others turn to enterprise data platforms such as Databricks, Cloudera, BlueData or SageMaker.
Data engineering and machine learning pipelines are complex and highly iterative, which makes it all the more important to design tech stacks efficiently, with one eye on the prize and the other on the price.
The Data Science Tech Stack
The data platform stack discussed here combines multiple open-source tools that make up the data storage, processing and management components. The entire platform is deployed on Amazon EC2 instances, which are the only paid infrastructure service the platform uses. Spot Instances are used widely to optimize costs.
- Containerized Deployments — All components are deployed on K8s clusters as Docker images, with Helm used for packaging and managing the deployments
- Data Processing — The ETL jobs of the data pipeline are powered by Apache Spark. Dask, a lightweight Python library for running distributed jobs on native Python code, has also come in handy in some of our projects (a minimal Dask sketch follows this list)
- Data Storage — HDFS is a reliable distributed file system that serves as the data lake for the platform
- Workflow Orchestrator — Apache Airflow is one of the best workflow managers available and brings in a wide range of features. It has proved to be one of the most integral parts of our platform
- Machine Learning — H2O.ai's PySparkling provides a vast set of machine learning libraries for Spark that are fast, distributed and scalable. Spark also provides some basic ML libraries in Spark MLlib
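To illustrate the Dask item above, here is a minimal sketch of a distributed aggregation; the cluster sizing, file paths and column names are hypothetical.

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Spin up a local Dask cluster that parallelizes work across the machine's cores.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)  # hypothetical sizing
client = Client(cluster)

# Read a partitioned CSV dataset lazily and compute a simple aggregation.
events = dd.read_csv("data/events-*.csv")                  # hypothetical path
avg_by_type = events.groupby("event_type")["value"].mean().compute()
print(avg_by_type)

client.close()
cluster.close()
```

The same code scales out to a multi-node Dask cluster by pointing the Client at a remote scheduler instead of a LocalCluster.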
Designing the distributed processing layer
Challenges of running 24x7 Big Data clusters
In our experience, running 24x7 clusters, be it Spark or any other distributed computing layer, to support our data jobs proved challenging for the following reasons:
- The data pipeline processes data in batches, so the clusters were occupied for only around 8 hours a day, leaving the servers heavily underutilized
- Scheduling was a nightmare: jobs would either get stuck or sit in the queue for long periods. Each job had an SLA, and failing to meet it derailed the entire data pipeline
- Clusters don't scale on their own: whenever heavy or multiple new jobs were introduced to the pipeline, the cluster had to be scaled up with extra resources. The same was true whenever new technologies had to be added to the platform, which limited the scope for experimentation and innovation
Most importantly, all of the above meant we were spending heavily on underutilized infrastructure, a problem that many organizations face in the real world
Moving to an On-demand, Dynamic framework
Building the data science platform in-house gave us an edge when it came to customization and combining features from multiple tools to meet our requirements.
This is how we leveraged those features to build our on-demand, dynamic data processing platform:
1. Workflow orchestration through Airflow
- The platform gets requests from multiple sources to run big data jobs
- This is where Airflow plays a crucial role in scheduling and maintaining a proper queue for the jobs that have been triggered
2. KubernetesPodOperator from Airflow
- Airflow provides the capability to communicate with a K8s cluster and spawn Pods through the KubernetesPodOperator
- Configurations such as the Docker image, tolerations, affinities, commands and arguments are supplied to the operator, and Airflow uses them to request the K8s cluster to create the pod (a sketch of such a DAG follows this list)
3. AWS Auto Scaling Group
- K8s provides a way to link an AWS Auto Scaling group to your K8s cluster
- When the K8s cluster sees pending pods with a particular configuration that existing nodes cannot accommodate, it raises a request to scale up the number of instances
- Once procured, the machine gets added as a worker to the K8s cluster
4. Execution of workload
- The pods that were queued, waiting for a machine, now begin execution
- Providing the right affinities and tolerations ensures that no other pods get assigned to these newly procured machines
- For Spark jobs, a Spark cluster is created locally on the machine and the code executes within that same machine (see the local-mode sketch after this list)
- The framework also supports running jobs through Spark's native Kubernetes scheduler or the Google spark-on-k8s-operator
5. Decommissioning the machines
- Once the pods have completed their tasks, the K8s cluster contacts the AWS Auto Scaling group to scale the machines back down
- This is a crucial step as you don’t want orphan pods lingering even after execution has been completed
- If not taken care of, this leads to unnecessary costs for machines whose work is already complete
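To make steps 1, 2 and 5 concrete, below is a minimal sketch of an Airflow DAG that launches one such job through the KubernetesPodOperator. The DAG name, namespace, image, taints and labels are hypothetical, and exact import paths and parameter names vary a little across Airflow provider versions (newer providers, for example, replace is_delete_operator_pod with on_finish_action).

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="spark_batch_etl",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_etl = KubernetesPodOperator(
        task_id="run_spark_etl",
        name="spark-etl",
        namespace="data-jobs",                        # hypothetical namespace
        image="registry.example.com/spark-etl:1.0",   # hypothetical image
        cmds=["spark-submit"],
        arguments=["--master", "local[*]", "/app/etl_job.py"],
        # Tolerate the taint carried by the auto-scaled node group so the pod
        # is allowed onto the machines procured for these jobs (step 4).
        tolerations=[
            k8s.V1Toleration(
                key="dedicated", operator="Equal",
                value="batch-jobs", effect="NoSchedule",
            )
        ],
        # Target the node label set on that node group, keeping other pods off it.
        node_selector={"workload": "batch-jobs"},
        get_logs=True,
        # Remove the pod once it finishes so no orphan pods keep nodes alive
        # and block the scale-down in step 5.
        is_delete_operator_pod=True,
    )
```

With this in place, a pending pod that no existing node can host is what triggers the scale-up in step 3, and deleting the pod on completion lets the autoscaler drain and terminate the node again.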
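And a minimal sketch, under the same assumptions, of what runs inside that pod for a Spark job: the driver and executors run in local mode on the single machine procured for the task, so no always-on external Spark cluster is needed. The HDFS path is hypothetical.

```python
from pyspark.sql import SparkSession

# local[*] runs Spark entirely inside the pod, using all cores of the
# machine the K8s scheduler placed it on; no standing cluster is required.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("in-pod-spark-etl")
    .getOrCreate()
)

df = spark.read.parquet("hdfs:///datalake/raw/events/")  # hypothetical path
df.groupBy("event_type").count().show()

spark.stop()
```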
Logging and Monitoring
It's vital to have a layer that logs usage of this framework.
It helps in:
- tracking the commissioning and decommissioning of machines
- monitoring usage of resources across jobs and pipelines
- pinpointing time-consuming jobs and opening up opportunities for optimization
- analysing spend over time
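One possible (hypothetical) way to capture part of this is an Airflow callback that emits a structured usage record per task run, which downstream dashboards can aggregate by job, pipeline and instance type; the field names and instance-type tag here are illustrative.

```python
import json
import logging
from datetime import timezone

log = logging.getLogger("usage-tracker")

def record_usage(context):
    """Airflow on_success_callback: emit one usage record per completed task."""
    ti = context["task_instance"]
    record = {
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        "run_id": context["run_id"],
        "start": ti.start_date.astimezone(timezone.utc).isoformat(),
        "end": ti.end_date.astimezone(timezone.utc).isoformat(),
        "duration_seconds": ti.duration,
        "state": ti.state,
        "instance_type": "m5.2xlarge",  # hypothetical: the node group's instance type
    }
    # Ship the record to a log aggregator or cost dashboard of choice.
    log.info(json.dumps(record))
```

The callback can be attached per task or through the DAG's default_args (on_success_callback=record_usage), so every job in the pipeline reports its usage the same way.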
Cloud-agnostic
The framework is designed so that it doesn't depend on a particular cloud provider. These are the capabilities it requires from a cloud service:
- Auto Scaling of machines
- K8s cloud-provider support — the cloud must be one of the providers that K8s can integrate with
- An autoscaler service, such as the Cluster Autoscaler, that lets K8s communicate with the provider's scaling APIs
In Conclusion
- The framework helped us scale our platform and handle multiple workloads at the same time
- It opened up possibilities to experiment with and run different technologies, since every workload is Docker-driven
- Reduced our infrastructure costs by ~60%
- For an ML/AI platform, GPU machines help accelerate workloads and analyses. This framework made it feasible to experiment with and run workloads on GPU machines, which would otherwise have been too costly to run
Stay tuned for our next post on how to set up Data Science developer environments using the same framework