Hi everyone, thank you for joining us today. It's our pleasure to present the elastic story of running Spark on Kubernetes natively at massive scale for Apple. My name is Bowen Lee, and I lead the batch processing and interactive analytics areas of the Apple AIML data platform. My team builds and operates cloud native services: a batch processing service powered by Spark on Kubernetes, an interactive data science service with interactive Spark and Jupyter, and an interactive analytics service powered by Presto and Trino. We serve hundreds of data engineers and scientists every day to improve our AI and ML products like Siri and Apple Search with best-in-class data, analytics, and processing infrastructure. Huichao is an engineer from my team who has been focusing on how to run Spark elastically and cost efficiently on Kubernetes. So here's the agenda today.
We will first talk about the benefits of cloud and our design principles for leveraging those cloud native characteristics, then the architecture of our cloud native Spark on Kubernetes platform and why we need to auto scale our Spark service based on cost saving and elasticity needs. Then Huichao will dive deep into the design of the reactive auto scaling, its productionization, and our learnings and future work.
Sounds good, let's get into it. So why are we moving to cloud? This may not be a new topic, but I want to reiterate our unique perspectives. Cloud and Kubernetes can help solve lots of the problems legacy infrastructure has. For example, it is agile: resources are consumed on demand and users can pay as they go. Second, it is elastic and scalable: we can acquire the resources we need and return them when we are done, which saves us lots of money, and the compute and storage are almost infinitely scalable. Kubernetes also enables us to build services in a container native way with strong resource isolation, so users' workloads don't impact each other.
This supports our multi-tenancy and isolation guarantees. With cloud and Kubernetes we can leverage cutting edge cloud native security techniques to build a privacy-first data infrastructure. Lastly, Kubernetes and the cloud providers took away lots of the heavy lifting from us, which enabled our developers to focus on building and improving business critical batch processing services to achieve higher ROI. So it was a no-brainer a couple of years ago for us to decide to go all in on cloud and Kubernetes.
With the benefits of cloud and Kubernetes in mind, we set a few critical design principles for ourselves when designing the system. First, we want to fully embrace public cloud and the cloud native way of thinking about and building infrastructure. That is quite a mindset change. For example, in the cloud native world, when we want to upgrade our infrastructure we don't have to do any in-place upgrade, which exposes a huge risk to our infrastructure and our users. In the new world we can just spin up a completely new environment, gradually roll traffic over from the old environment to the new one, and instantaneously switch back if there's an issue. That kind of flexibility is a huge win for our DevOps. Second, everything should be containerized for elasticity, agility, and reproducibility. We aim to scale and replicate our infra very fast to cater to business needs, and full containerization enables us to do so. Third, compute and storage have to be fully decoupled so they can scale independently according to business needs. For example, the shuffle data size can vary significantly from Spark job to job, and our Spark service has to handle that in a flexible way rather than with a one-size-fits-all solution. Security, privacy, and user experience: I'm talking about them together since they are related. In our new infrastructure, security and privacy are first-class citizens in the design stack, and we leverage fine-grained techniques like role-based policies to govern our data services. At the same time we still want to make it super easy for users to run Spark jobs while following that governance, so instead of having users run spark-submit directly, we expose a REST API that has exactly the same parameters as spark-submit. That way we can enforce security at the REST API layer while still giving users a very familiar development experience.
Lastly, there is an Apple internal distribution that we decided to use.
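To make the REST submission flow just described a bit more concrete, here is a minimal sketch of what submitting a job through such a gateway could look like. The host name, endpoint path, and payload field names are assumptions for illustration; the real API simply mirrors spark-submit's parameters and enforces auth at the gateway.

```python
# Minimal sketch of submitting a Spark job through a spark-submit-like REST API.
# The gateway URL, endpoint path, and payload field names are hypothetical;
# they are only meant to mirror familiar spark-submit parameters.
import requests

payload = {
    "mainApplicationFile": "s3a://my-bucket/jobs/etl.py",   # primary application resource
    "conf": {
        "spark.executor.instances": "50",
        "spark.executor.memory": "8g",
        "spark.executor.cores": "4",
    },
    "queue": "root.tenant-a",                               # tenant queue to route to
}

resp = requests.post(
    "https://spark-gateway.example.internal/v1/submissions",  # hypothetical endpoint
    json=payload,
    headers={"Authorization": "Bearer <token>"},               # security enforced at the REST layer
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. a submission id to poll for job status
```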
Next I want to present our cloud native elastic Spark architecture. We can start from the data plane, where in the backend we have multiple Kubernetes clusters. We use the Spark Kubernetes operator to submit Spark jobs and manage their lifecycles, and there is a replica set of Spark operators so they can load balance and achieve high availability. Each tenant on this platform has their own resource queues powered by Apache YuniKorn. YuniKorn plays a few key roles here. One is multi-tenancy support and resource quotas: each tenant's queue is fully isolated from the others. Second, YuniKorn queues run all the resource scheduling for Spark workloads, from basic ones like gang scheduling requirements to more advanced scheduling policies like FIFO, priority, or preemption. Lastly, YuniKorn handles elasticity of the queues by independently scaling resources for each tenant. We have many of these Spark clusters in the backend, and the multi-cluster and multi-queue strategy provides us with many folds of elasticity and linear scalability without a single bottleneck.
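As an illustration of how a Spark pod ends up in a tenant's YuniKorn queue, the sketch below shows the kind of pod metadata involved. The label keys follow what the Apache YuniKorn documentation describes (`applicationId` and `queue` labels, plus `schedulerName: yunikorn`), but treat it as an assumption-laden sketch rather than our exact production setup.

```python
# Sketch of the pod-level metadata that ties a Spark pod to a tenant queue
# when scheduling with Apache YuniKorn. Names and values are illustrative.
driver_pod_patch = {
    "metadata": {
        "labels": {
            "applicationId": "spark-etl-job-0001",  # groups driver and executors as one app
            "queue": "root.tenant-a",                # tenant's YuniKorn queue
        },
        # Newer YuniKorn releases also accept annotations such as
        # "yunikorn.apache.org/queue"; check the version you run.
    },
    "spec": {
        "schedulerName": "yunikorn",  # hand scheduling over to YuniKorn instead of default-scheduler
    },
}
```

In practice this metadata would be applied through pod templates or by the operator/gateway on the user's behalf, so users never set it by hand.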
In the control plane, we built our own Spark service gateway, which exposes the REST API I mentioned before. It is itself container native and can be deployed and scaled very easily as a microservice on Kubernetes. When submitting jobs through our REST API, users can specify additional parameters like the queue name, and the gateway will route the job to the underlying queue. On the client side, alongside the REST API we provide a simple, easy-to-use CLI for users to run jobs from the terminal, and a corresponding Airflow operator so users can run scheduled jobs.
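For scheduled jobs, a DAG task simply submits to the gateway with the same spark-submit-style parameters plus the queue name. The platform ships its own Airflow operator; as a generic stand-in, the sketch below uses the Airflow HTTP provider's SimpleHttpOperator against the same hypothetical endpoint, with connection id, path, and payload fields assumed for illustration.

```python
# Hedged sketch of a scheduled submission to the Spark gateway from Airflow.
# SimpleHttpOperator is a generic stand-in for the platform's own operator.
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="nightly_feature_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",   # nightly at 02:00
    catchup=False,
) as dag:
    submit_job = SimpleHttpOperator(
        task_id="submit_spark_etl",
        http_conn_id="spark_gateway",       # Airflow connection pointing at the gateway base URL
        endpoint="v1/submissions",          # hypothetical path
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps({
            "mainApplicationFile": "s3a://my-bucket/jobs/etl.py",
            "conf": {"spark.executor.instances": "50"},
            "queue": "root.tenant-a",       # routed to the tenant's YuniKorn queue
        }),
    )
```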
We also have the data science service, where our data engineers quickly iterate on and build their Spark ETL pipelines and data scientists build and train their models interactively. We aim to share a unified backend for the two Spark services. As you can see, interactive Spark workloads coming from Jupyter notebooks go through their own interactive Spark gateway, and the workloads run on the same infrastructure in the backend. This way we achieved the goal of reusing most of our infra without reinventing the wheel.
Lastly, we closely collaborate with our security and privacy team and our observability team to develop and integrate on those two fronts. Our Spark service has been running in production for a year so far, and it currently supports many business critical workloads for Apple AIML. The deployment scale is massive: we are running hundreds of thousands of vCPUs and hundreds of terabytes of memory, supporting hundreds of thousands of Spark jobs per week. The job scale is also very large: our users' biggest jobs can consume up to thousands of executors and CPUs at the same time and run for hours. We have been very active contributors to the Apache YuniKorn project and have grown committers and PMC members organically from the team. We are also planning to open source some of the components in the stack.
While this has been super successful, we initially operated all the resources statically for users. For example, our YuniKorn queues had a static amount of resources, and we saw a massive opportunity to make the stack more elastic and save cost. Workload patterns can vary from time to time within a week or even during a day, and they also vary quite a bit from use case to use case: from running only scheduled jobs, to mostly ad hoc and interactive jobs, to a mix of both, to occasionally super large-scale backfill jobs. A fixed amount of resources has to account for the maximum usage, which causes waste. So we have been investing heavily in auto scaling Spark on Kubernetes and have achieved great results so far, cutting down cost for our users by as much as 70 to 80 percent on a per-queue basis. Next I'll hand it over to Huichao to talk about how we achieved that and our learnings and roadmap in that direction.
Hi folks, this is Huichao from the AIML data platform at Apple. Now let me walk you through the architecture and the design of the reactive auto scaling feature we delivered recently in our cloud native Spark clusters. First of all, let me talk about the auto scaling cluster's node group layout. As a multi-tenant auto scaling cluster, we provide physical isolation among system components, Spark drivers, and Spark executors.
Each of them is located in its own node group. Here the system components include things such as the node problem detector, ingress controller, Spark Kubernetes operator, YuniKorn, and so on. Also, by mapping different tenants' queues to their dedicated executor node groups, we can isolate the tenants from each other to minimize potential impact, and it also helps us generate cost and usage reports per tenant very easily.
We provide a min capacity setting per queue, so there is an amount of guaranteed machines kept running to support long-running and smaller cadenced workloads. The max capacity setting provides a guardrail for each queue, and workloads will wait in the queue if they exceed the maximum threshold, until the related resources are released and our scheduler can find them.
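A hedged sketch of how per-queue min and max capacity can be expressed in a YuniKorn-style queue configuration is shown below. The structure follows the queues.yaml layout from the Apache YuniKorn documentation (partitions, nested queues, guaranteed and max resources); the tenant names, units, and numbers are made up.

```python
# Sketch of per-queue min/max capacity in a YuniKorn-style queues.yaml,
# generated from a Python dict. Tenant names and resource numbers are made up;
# the structure mirrors the partitions/queues/resources layout in the YuniKorn docs.
import yaml  # PyYAML

queue_config = {
    "partitions": [
        {
            "name": "default",
            "queues": [
                {
                    "name": "root",
                    "queues": [
                        {
                            "name": "tenant-a",
                            "resources": {
                                "guaranteed": {"memory": "512Gi", "vcore": "128"},  # min capacity kept warm
                                "max": {"memory": "4Ti", "vcore": "1000"},          # guardrail; excess work waits in queue
                            },
                        },
                    ],
                },
            ],
        },
    ],
}

print(yaml.dump(queue_config, sort_keys=False))
```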
This is the workflow of how the cluster size changes based on the Spark workloads per node group. When users submit their jobs to our gateway, the gateway service creates the CRD on the corresponding cluster. First, the driver pod is created on the driver node group to make sure the job can always be scheduled, and then the executor pods are created by the Spark operator in the pre-assigned node group and scheduled by YuniKorn. Once the Kubernetes Cluster Autoscaler finds pending pods in a node group, it talks to the cloud provider to scale out a suitable number of nodes in that specific node group, which maps to a YuniKorn resource queue. Vice versa, once it finds that there are idle nodes, it terminates those nodes to save cost.
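The node group placement in this workflow is typically driven by node selectors on the driver and executor pods. The sketch below shows the idea using the shape of the Spark operator's SparkApplication spec (driver and executor nodeSelector fields); the node group label key and values are assumptions about how such a cluster might be labeled.

```python
# Sketch of a SparkApplication spec (Spark on K8s operator CRD) that pins the
# driver and executors to dedicated node groups via nodeSelector. The label
# key "node-group" and its values are assumptions about cluster labeling.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "feature-etl", "namespace": "tenant-a"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "mainApplicationFile": "s3a://my-bucket/jobs/etl.py",
        "sparkVersion": "3.2.1",
        "driver": {
            "cores": 2,
            "memory": "4g",
            "nodeSelector": {"node-group": "spark-drivers"},        # dedicated driver node group
        },
        "executor": {
            "instances": 50,
            "cores": 4,
            "memory": "8g",
            "nodeSelector": {"node-group": "tenant-a-executors"},   # tenant's executor node group
        },
    },
}
```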
Besides this, we also provide some customized scaling controls on our auto scaling clusters. For scale-in control, our setting only applies scale-in to the executor node groups, and the scale-in process is triggered only when there are no running executor pods on a node. We have also enabled the bin packing policy provided by YuniKorn to minimize the number of instances in use. The scheduler's default allocation policy tries to evenly distribute pods across all the nodes; the bin packing policy instead sorts the list of nodes by the amount of available resources, so our scheduler allocates pods onto the already-utilized nodes first and leaves the remaining nodes idle, letting the Cluster Autoscaler trigger scale-in very efficiently. On the right-hand side are EC2 machine utilization dashboards. The top one shows the metrics of a static queue without bin packing: most CPU and memory utilization is only around ten percent, and only a few machines get close to 45 percent. The bottom dashboard shows the metrics after enabling bin packing on the auto scaling cluster: there is a pretty good usage rate on both CPU and memory compared to the massive waste before.
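Two upstream mechanisms commonly used for this kind of control are sketched below: the Cluster Autoscaler's safe-to-evict pod annotation, which keeps a node from being scaled in while an executor pod is running on it, and YuniKorn's bin packing node sort policy. Both keys exist in the respective projects' documentation; whether our production setup uses exactly these knobs is an assumption.

```python
# Hedged sketch of two knobs behind this behavior; values are illustrative.

# 1) Kubernetes Cluster Autoscaler: a pod carrying this annotation blocks
#    scale-in of the node it runs on, so nodes with running executors stay up.
executor_pod_annotations = {
    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
}

# 2) Apache YuniKorn: switch the node sort policy to bin packing so pods are
#    packed onto fewer nodes and idle nodes can be reclaimed by the autoscaler.
yunikorn_partition_config = {
    "partitions": [
        {
            "name": "default",
            "nodesortpolicy": {"type": "binpacking"},
        },
    ],
}
```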
Regarding scale-out control, we provide a scale-out-only policy on the dedicated driver node group, on which all our users always get their driver pods launched, so they can also always check their logs there. We also speed up the scale-out latency by tuning some Spark configurations.
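The talk doesn't name the exact configurations; one plausible set, shown below, is Spark's Kubernetes executor allocation batching, which controls how quickly executor pods are requested and therefore how quickly pending pods appear and trigger a scale-out. The keys are standard Spark configs; the chosen values are illustrative only.

```python
# Hedged example: Spark-on-Kubernetes settings that influence how fast executor
# pods are requested (and thus how fast pending pods trigger a scale-out).
scale_out_tuning = {
    "spark.kubernetes.allocation.batch.size": "20",    # request more executor pods per round (default is 5)
    "spark.kubernetes.allocation.batch.delay": "1s",   # delay between allocation rounds (default is 1s)
}
```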
Now let me talk about our production status with this new feature. So far we have onboarded more than 19 internal teams onto our auto scaling clusters for more than three months, and the average cost saving ranges from around 20 percent to 70 percent.
During the migration we found that all scaling events work as expected, and a machine will not be removed as long as there are running or active Spark pods on it. The scale-out latency is consistent and stays below five minutes; the maximum scale-out range we are talking about here is from 2 to 200 machines. Moreover, the auto scaling feature works with various types of resource usage patterns, such as ad hoc, ETL, and mixed patterns. Meanwhile, we also found that compared to the massive over-provisioning approach before, the runtime of workloads with auto scaling enabled may increase. This is expected and is due to the very good usage rate of CPU and memory compared to the massive waste before. Given this, users need to take it into consideration and optimize their jobs if there is a strict data delivery time requirement.
I know we have covered a lot in this short time, so here are some key takeaways from developing and delivering this new feature on our platform. Physical isolation and the min/max capacity settings are very important for meeting customer requirements; we can leverage node group min and max settings together with YuniKorn resource quota settings to achieve this, and it will help us support budget-based control going forward. Providing the guarantee that scaling has no impact on existing Spark jobs is the most important feature for production jobs; we need to apply customized scale-in control based on the different node group types to provide this guarantee, and in the meantime we also need to enable bin packing to improve its efficiency. Scale-out latency is important for large-scale jobs; by using the dedicated driver node group and the tuned Spark configurations we can keep the scale-out latency as low as possible.
Going forward, there are still lots of areas for improvement to be explored, such as how to support mixed instance types per cluster and how to fully support dynamic allocation. Spot instances are much cheaper than the on-demand instances we are using right now, so it will be another big win if we can support them. With the help of remote shuffle services or similar disaggregated compute and storage architectures, we can then trigger scale-in more aggressively and even scale compute and storage independently with different auto scaling controls.
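As a pointer on the dynamic allocation item, the sketch below shows the standard Spark settings that enable dynamic allocation on Kubernetes using shuffle tracking (no external shuffle service). The numbers are illustrative, and whether these are the exact settings the platform will adopt is an open question.

```python
# Standard Spark configs for dynamic allocation on Kubernetes via shuffle
# tracking (no external shuffle service needed). Values are illustrative.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",  # track shuffle data instead of an external shuffle service
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "200",
    "spark.dynamicAllocation.executorIdleTimeout": "120s",      # release idle executors so nodes can scale in
}
```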
In the future, how to provide a predictive auto scaling feature on the platform is another interesting topic. That's all for today's sharing. Thanks for your time.