Building a scalable and cost-efficient Data Processing Engine

The Data Science Tech Stack

  • Containerized Deployments — All components are deployed on K8s clusters using Docker images and uses Helm for container management
  • Data Processing — The ETL jobs of the Data Pipeline are powered by Apache Spark. Dask is a light version of python with the ability to run distributed jobs on native python and has come handy in some of our projects.
  • Data Storage — HDFS is a reliable distributed file system which serves as the Data lake for the platform
  • Workflow Orchestrator — Apache Airflow is one of the best workflow managers available and brings in a wide range of features. It has proved to be one of the most integral parts of our platform
  • Machine Learning — provides a vast set of Machine learning libraries for Spark that are fast, distributed and scalable. Spark also provides some basic ML libraries in Spark MLib

Designing the distributed processing layer

Challenges of running 24x7 Big Data clusters

  • The data pipeline processes data in batches and hence the clusters were occupied for only around 8 hours a day — making the servers heavily underutilized
  • Scheduling is a nightmare where multiple jobs either get stuck or were queued for long periods of time. There were SLAs for each job and failing to meet them resulted in derailing the entire data pipeline
  • Clusters don’t account for scale — whenever heavy or multiple new jobs were introduced to the pipeline, the cluster would have to be scaled with an extra resource. The same was the case when new technologies would have to be introduced to the platform which limited the scope for experimentation and innovation

Moving to an On-demand, Dynamic framework

1. Workflow orchestration through Airflow

  • The platform gets requests from multiple sources to run big data jobs
  • This is where Airflow plays a crucial role in scheduling and maintaining a proper queue for the jobs that have been triggered
  • Airflow provides the capability to communicate with a K8s cluster and spawn Pods through the KubernetesPodOperator
  • Configurations like docker image details, tolerations, affinities, commands and arguments are provided, using which Airflow requests the K8s cluster to create the pod
  • K8s provides a feature to link your AWS Scaling group and your K8s cluster
  • When the K8s cluster sees that there is a request to spawn pods with a particular configuration, it raises a request to scale the number of instances
  • Once procured, the machine gets added as a worker to the K8s cluster
  • The pods that were in the queue and waiting for a machine, now begin execution
  • Providing the right affinities and tolerations make sure that there are no other pods that get assigned to these newly procured machines
  • For Spark jobs, a spark cluster gets created internally on the machine and code is executed within the same machine
  • This framework also supports running jobs using the spark k8s scheduler — k8s scheduler, Google-spark-k8s-operator
  • Once the pods have completed the execution of tasks, the K8s cluster contacts the AWS Scaling group to scale down machines
  • This is a crucial step as you don’t want orphan pods lingering even after execution has been completed
  • If not taken care of, this might lead to unnecessary costs incurred for machines for which execution is complete

Logging and Monitoring

  • tracking the commissioning and decommision of machines
  • monitoring usage of resources across jobs and pipelines
  • pinpoint time-consuming jobs and opens up an opportunity for optimization
  • analysing spends over time


  • Auto Scaling of machines
  • K8s Cloud providers — Available cloud providers that K8s can integrate with
  • An auto scaler service that communicates with the provider from K8s

In Conclusion

  • The framework helped us in scaling our platform and handling multiple workloads at the same time
  • It opened up possibilities to experiment and run different technologies as each instance is docker driven
  • Reduced our infrastructure costs by ~60%
  • As an ML-AI platform, GPU machines help accelerate workloads and analyses. This framework opened up the opportunity to experiment and run workloads on GPU machines which were costly to run otherwise




Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Bots, bugs, and betrayal…

Opening an Issue(in GitHub)

Connect Qt to MariaDB

Top 5 Flutter App Deployment Tools

Flutter app Devlopment

Why Cypress Sucks for Real Test Automation? (Part 2: Limitations)

.Net Core Actions, Routes, Handlers, and Methods

Builder Design Pattern

Builder Design Pattern By Majid Ahmaditabar

Did you push software updates into production yesterday?

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Ram Krishnan

Ram Krishnan

More from Medium

Keeping Data Under (Your) Control with Apache Airflow

Data Flow with Apache Nifi in Google Cloud Platform

The D’s of a data pipelines

AWS Glue Streaming checkpoint + Amazon MSK ( Apache Spark, Kafka )