Hi everyone, thank you for joining us today. It's our pleasure to present the elastic story of running Spark on Kubernetes natively at massive scale for Apple. My name is Bowen Lee, and I lead the batch processing and interactive analytics areas of the Apple AIML data platform. My team builds and operates cloud native services: a batch processing service powered by Spark on Kubernetes, an interactive data science service with interactive Spark and Jupyter, and an interactive analytics service powered by Presto and Trino. We serve hundreds of data engineers and scientists every day to improve our AI and ML products like Siri and Apple Search with best-in-class data, analytics, and processing infrastructure. Huichao is an engineer from my team who has been focusing on how to run Spark elastically and cost efficiently on Kubernetes. So here's the agenda today.
We will first talk about the benefits of cloud and our design principles for leveraging those cloud native characteristics, then the architecture of our cloud native Spark on Kubernetes platform and why we need to auto scale our Spark service based on cost saving and elasticity needs. Then Huichao will dive deep into the design of the reactive auto scaling, its productionization, and our learnings and future work.
Sounds good, let's get into it. So why are we moving to cloud? This may not be a new topic, but I want to reiterate our unique perspectives. Cloud and Kubernetes can help solve lots of the problems legacy infrastructure has. For example, it is agile: resources are consumed on demand and users can pay as they go. Second, it is elastic and scalable: we can acquire the resources we need and return them when we are done, which saves us lots of money, and the compute and storage are almost infinitely scalable. Kubernetes also enables us to build services in a container native way with strong resource isolation, so users' workloads don't impact each other.
This supports our multi-tenancy and isolation guarantees. With cloud and Kubernetes we can leverage cutting edge cloud native security techniques to build a privacy-first data infrastructure. Lastly, Kubernetes and the cloud providers took away lots of the heavy lifting from us, which enabled our developers to focus on building and improving business critical batch processing services to achieve higher ROI. So it was a no-brainer a couple of years ago for us to decide to go all in on cloud and Kubernetes.
With the benefits of cloud and Kubernetes in mind, we set a few critical design principles for ourselves when designing the system. First, we want to fully embrace public cloud and the cloud native way of thinking about and building infrastructure. That is quite a mindset change. For example, in the cloud native world, when we want to upgrade our infrastructure we don't have to do any in-place upgrade, which exposes a huge risk to our infrastructure and our users. In the new world we can just spin up a completely new environment, gradually roll traffic over from the old environment to the new one, and instantaneously switch back if there's an issue. That kind of flexibility is a huge win for our DevOps. Second, everything should be containerized for elasticity, agility, and reproducibility. We aim to scale and replicate our infra very fast to cater to business needs, and full containerization enables us to do so. Third, compute and storage have to be fully decoupled so they can scale independently according to business needs. For example, the shuffle data size can vary significantly from Spark job to job, and our Spark service has to handle that in a flexible way rather than with a one-size-fits-all solution. Security, privacy, and user experience: I'm talking about them together since they are related. In our new infrastructure, security and privacy are first-class citizens in the design stack, and we leverage fine-grained techniques like role-based policies to govern our data services. At the same time we still want to make it super easy for users to run Spark jobs while following that governance, so instead of having users run spark-submit directly, we expose a REST API that has exactly the same parameters as spark-submit. That way we can enforce security at the REST API layer while still giving users a very familiar development experience.
Lastly, there is an Apple internal distribution that we decided to use.
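To make the REST submission flow just described a bit more concrete, here is a minimal sketch of what submitting a job through such a gateway could look like. The host name, endpoint path, and payload field names are assumptions for illustration; the real API simply mirrors spark-submit's parameters and enforces auth at the gateway.

```python
# Minimal sketch of submitting a Spark job through a spark-submit-like REST API.
# The gateway URL, endpoint path, and payload field names are hypothetical;
# they are only meant to mirror familiar spark-submit parameters.
import requests

payload = {
    "mainApplicationFile": "s3a://my-bucket/jobs/etl.py",   # primary application resource
    "conf": {
        "spark.executor.instances": "50",
        "spark.executor.memory": "8g",
        "spark.executor.cores": "4",
    },
    "queue": "root.tenant-a",                               # tenant queue to route to
}

resp = requests.post(
    "https://spark-gateway.example.internal/v1/submissions",  # hypothetical endpoint
    json=payload,
    headers={"Authorization": "Bearer <token>"},               # security enforced at the REST layer
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. a submission id to poll for job status
```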
Next I want to present our cloud native elastic Spark architecture. We can start from the data plane, where in the backend we have multiple Kubernetes clusters. We use the Spark Kubernetes operator to submit Spark jobs and manage their lifecycles, and there is a replica set of Spark operators so they can load balance and achieve high availability. Each tenant on this platform has their own resource queues powered by Apache YuniKorn. YuniKorn plays a few key roles here. One is multi-tenancy support and resource quotas: each tenant's queue is fully isolated from the others. Second, YuniKorn queues run all the resource scheduling for Spark workloads, from basic ones like gang scheduling requirements to more advanced scheduling policies like FIFO, priority, or preemption. Lastly, YuniKorn handles elasticity of the queues by independently scaling resources for each tenant. We have many of these Spark clusters in the backend, and the multi-cluster and multi-queue strategy provides us with many folds of elasticity and linear scalability without a single bottleneck.
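As an illustration of how a Spark pod ends up in a tenant's YuniKorn queue, the sketch below shows the kind of pod metadata involved. The label keys follow what the Apache YuniKorn documentation describes (`applicationId` and `queue` labels, plus `schedulerName: yunikorn`), but treat it as an assumption-laden sketch rather than our exact production setup.

```python
# Sketch of the pod-level metadata that ties a Spark pod to a tenant queue
# when scheduling with Apache YuniKorn. Names and values are illustrative.
driver_pod_patch = {
    "metadata": {
        "labels": {
            "applicationId": "spark-etl-job-0001",  # groups driver and executors as one app
            "queue": "root.tenant-a",                # tenant's YuniKorn queue
        },
        # Newer YuniKorn releases also accept annotations such as
        # "yunikorn.apache.org/queue"; check the version you run.
    },
    "spec": {
        "schedulerName": "yunikorn",  # hand scheduling over to YuniKorn instead of default-scheduler
    },
}
```

In practice this metadata would be applied through pod templates or by the operator/gateway on the user's behalf, so users never set it by hand.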
In the control plane, we built our own Spark service gateway, which exposes the REST API I mentioned before. It is itself container native and can be deployed and scaled very easily as a microservice on Kubernetes. When submitting jobs through our REST API, users can specify additional parameters like the queue name, and the gateway will route the job to the underlying queue. On the client side, alongside the REST API we provide a simple, easy-to-use CLI for users to run jobs from the terminal, and a corresponding Airflow operator so users can run scheduled jobs.
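For scheduled jobs, a DAG task simply submits to the gateway with the same spark-submit-style parameters plus the queue name. The platform ships its own Airflow operator; as a generic stand-in, the sketch below uses the Airflow HTTP provider's SimpleHttpOperator against the same hypothetical endpoint, with connection id, path, and payload fields assumed for illustration.

```python
# Hedged sketch of a scheduled submission to the Spark gateway from Airflow.
# SimpleHttpOperator is a generic stand-in for the platform's own operator.
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="nightly_feature_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",   # nightly at 02:00
    catchup=False,
) as dag:
    submit_job = SimpleHttpOperator(
        task_id="submit_spark_etl",
        http_conn_id="spark_gateway",       # Airflow connection pointing at the gateway base URL
        endpoint="v1/submissions",          # hypothetical path
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps({
            "mainApplicationFile": "s3a://my-bucket/jobs/etl.py",
            "conf": {"spark.executor.instances": "50"},
            "queue": "root.tenant-a",       # routed to the tenant's YuniKorn queue
        }),
    )
```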
We also have the data science service, where our data engineers quickly iterate on and build their Spark ETL pipelines and data scientists build and train their models interactively. We aim to share a unified backend for the two Spark services. As you can see, interactive Spark workloads coming from Jupyter notebooks go through their own interactive Spark gateway, and the workloads run on the same infrastructure in the backend. This way we achieved the goal of reusing most of our infra without reinventing the wheel.
Lastly, we closely collaborate with our security and privacy team and our observability team to develop and integrate on those two fronts. Our Spark service has been running in production for a year so far, and it currently supports many business critical workloads for Apple AIML. The deployment scale is massive: we are running hundreds of thousands of vCPUs and hundreds of terabytes of memory, supporting hundreds of thousands of Spark jobs per week. The job scale is also very large: our users' biggest jobs can consume up to thousands of executors and CPUs at the same time and run for hours. We have been very active contributors to the Apache YuniKorn project and have grown committers and PMC members organically from the team. We are also planning to open source some of the components in the stack.
While this has been super successful, we initially operated all the resources statically for users. For example, our YuniKorn queues had a static amount of resources, and we saw a massive opportunity to make the stack more elastic and save cost. Workload patterns can vary from time to time within a week or even during a day, and they also vary quite a bit from use case to use case: from running only scheduled jobs, to mostly ad hoc and interactive jobs, to a mix of both, to occasionally super large-scale backfill jobs. A fixed amount of resources has to account for the maximum usage, which causes waste. So we have been investing heavily in auto scaling Spark on Kubernetes and have achieved great results so far, cutting down cost for our users by as much as 70 to 80 percent on a per-queue basis. Next I'll hand it over to Huichao to talk about how we achieved that and our learnings and roadmap in that direction.
Hi folks, this is Huichao from the AIML data platform at Apple. Now let me walk you through the architecture and the design of the reactive auto scaling feature we delivered recently in our cloud native Spark clusters. First of all, let me talk about the auto scaling cluster's node group layout. As a multi-tenant auto scaling cluster, we provide physical isolation among system components, Spark drivers, and Spark executors.
Each of them is located in its own node group. Here the system components include things such as the node problem detector, ingress controller, Spark Kubernetes operator, YuniKorn, and so on. Also, by mapping different tenants' queues to their dedicated executor node groups, we can isolate the tenants from each other to minimize potential impact, and it also helps us generate cost and usage reports per tenant very easily.
We provide a min capacity setting per queue, so there is an amount of guaranteed machines kept running to support long-running and smaller cadenced workloads. The max capacity setting provides a guardrail for each queue, and workloads will wait in the queue if they exceed the maximum threshold, until the related resources are released and our scheduler can find them.
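A hedged sketch of how per-queue min and max capacity can be expressed in a YuniKorn-style queue configuration is shown below. The structure follows the queues.yaml layout from the Apache YuniKorn documentation (partitions, nested queues, guaranteed and max resources); the tenant names, units, and numbers are made up.

```python
# Sketch of per-queue min/max capacity in a YuniKorn-style queues.yaml,
# generated from a Python dict. Tenant names and resource numbers are made up;
# the structure mirrors the partitions/queues/resources layout in the YuniKorn docs.
import yaml  # PyYAML

queue_config = {
    "partitions": [
        {
            "name": "default",
            "queues": [
                {
                    "name": "root",
                    "queues": [
                        {
                            "name": "tenant-a",
                            "resources": {
                                "guaranteed": {"memory": "512Gi", "vcore": "128"},  # min capacity kept warm
                                "max": {"memory": "4Ti", "vcore": "1000"},          # guardrail; excess work waits in queue
                            },
                        },
                    ],
                },
            ],
        },
    ],
}

print(yaml.dump(queue_config, sort_keys=False))
```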
This is the workflow of how the cluster size changes based on the Spark workloads per node group. When users submit their jobs to our gateway, the gateway service creates the CRD on the corresponding cluster. First, the driver pod is created on the driver node group to make sure the job can always be scheduled, and then the executor pods are created by the Spark operator in the pre-assigned node group and scheduled by YuniKorn. Once the Kubernetes Cluster Autoscaler finds pending pods in a node group, it talks to the cloud provider to scale out a suitable number of nodes in that specific node group, which maps to a YuniKorn resource queue. Vice versa, once it finds that there are idle nodes, it terminates those nodes to save cost.
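The node group placement in this workflow is typically driven by node selectors on the driver and executor pods. The sketch below shows the idea using the shape of the Spark operator's SparkApplication spec (driver and executor nodeSelector fields); the node group label key and values are assumptions about how such a cluster might be labeled.

```python
# Sketch of a SparkApplication spec (Spark on K8s operator CRD) that pins the
# driver and executors to dedicated node groups via nodeSelector. The label
# key "node-group" and its values are assumptions about cluster labeling.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "feature-etl", "namespace": "tenant-a"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "mainApplicationFile": "s3a://my-bucket/jobs/etl.py",
        "sparkVersion": "3.2.1",
        "driver": {
            "cores": 2,
            "memory": "4g",
            "nodeSelector": {"node-group": "spark-drivers"},        # dedicated driver node group
        },
        "executor": {
            "instances": 50,
            "cores": 4,
            "memory": "8g",
            "nodeSelector": {"node-group": "tenant-a-executors"},   # tenant's executor node group
        },
    },
}
```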
Besides this, we also provide some customized scaling controls on our auto scaling clusters. For scale-in control, our setting only applies scale-in to the executor node groups, and the scale-in process is triggered only when there are no running executor pods on a node. We have also enabled the bin packing policy provided by YuniKorn to minimize the number of instances in use. The scheduler's default allocation policy tries to evenly distribute pods across all the nodes; the bin packing policy instead sorts the list of nodes by the amount of available resources, so our scheduler allocates pods onto the already-utilized nodes first and leaves the remaining nodes idle, letting the Cluster Autoscaler trigger scale-in very efficiently. On the right-hand side are EC2 machine utilization dashboards. The top one shows the metrics of a static queue without bin packing: most CPU and memory utilization is only around ten percent, and only a few machines get close to 45 percent. The bottom dashboard shows the metrics after enabling bin packing on the auto scaling cluster: there is a pretty good usage rate on both CPU and memory compared to the massive waste before.
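Two upstream mechanisms commonly used for this kind of control are sketched below: the Cluster Autoscaler's safe-to-evict pod annotation, which keeps a node from being scaled in while an executor pod is running on it, and YuniKorn's bin packing node sort policy. Both keys exist in the respective projects' documentation; whether our production setup uses exactly these knobs is an assumption.

```python
# Hedged sketch of two knobs behind this behavior; values are illustrative.

# 1) Kubernetes Cluster Autoscaler: a pod carrying this annotation blocks
#    scale-in of the node it runs on, so nodes with running executors stay up.
executor_pod_annotations = {
    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
}

# 2) Apache YuniKorn: switch the node sort policy to bin packing so pods are
#    packed onto fewer nodes and idle nodes can be reclaimed by the autoscaler.
yunikorn_partition_config = {
    "partitions": [
        {
            "name": "default",
            "nodesortpolicy": {"type": "binpacking"},
        },
    ],
}
```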
Regarding scale-out control, we provide a scale-out-only policy on the dedicated driver node group, on which all our users always get their driver pods launched, so they can also always check their logs there. We also speed up the scale-out latency by tuning some Spark configurations.
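The talk doesn't name the exact configurations; one plausible set, shown below, is Spark's Kubernetes executor allocation batching, which controls how quickly executor pods are requested and therefore how quickly pending pods appear and trigger a scale-out. The keys are standard Spark configs; the chosen values are illustrative only.

```python
# Hedged example: Spark-on-Kubernetes settings that influence how fast executor
# pods are requested (and thus how fast pending pods trigger a scale-out).
scale_out_tuning = {
    "spark.kubernetes.allocation.batch.size": "20",    # request more executor pods per round (default is 5)
    "spark.kubernetes.allocation.batch.delay": "1s",   # delay between allocation rounds (default is 1s)
}
```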
Now let me talk about our production status with this new feature. So far we have onboarded more than 19 internal teams onto our auto scaling clusters for more than three months, and the average cost saving ranges from around 20 percent to 70 percent.
During the migration we found that all scaling events work as expected, and a machine will not be removed as long as there are running or active Spark pods on it. The scale-out latency is consistent and stays below five minutes; the maximum scale-out range we are talking about here is from 2 to 200 machines. Moreover, the auto scaling feature works with various types of resource usage patterns, such as ad hoc, ETL, and mixed patterns. Meanwhile, we also found that compared to the massive over-provisioning approach before, the runtime of workloads with auto scaling enabled may increase. This is expected and is due to the very good usage rate of CPU and memory compared to the massive waste before. Given this, users need to take it into consideration and optimize their jobs if there is a strict data delivery time requirement.
I know we have covered a lot in this short time, so here are some key takeaways from developing and delivering this new feature on our platform. Physical isolation and the min/max capacity settings are very important for meeting customer requirements; we can leverage node group min and max settings together with YuniKorn resource quota settings to achieve this, and it will help us support budget-based control going forward. Providing the guarantee that scaling has no impact on existing Spark jobs is the most important feature for production jobs; we need to apply customized scale-in control based on the different node group types to provide this guarantee, and in the meantime we also need to enable bin packing to improve its efficiency. Scale-out latency is important for large-scale jobs; by using the dedicated driver node group and the tuned Spark configurations we can keep the scale-out latency as low as possible.
Going forward, there are still lots of areas for improvement to be explored, such as how to support mixed instance types per cluster and how to fully support dynamic allocation. Spot instances are much cheaper than the on-demand instances we are using right now, so it will be another big win if we can support them. With the help of remote shuffle services or similar disaggregated compute and storage architectures, we can then trigger scale-in more aggressively and even scale compute and storage independently with different auto scaling controls.
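As a pointer on the dynamic allocation item, the sketch below shows the standard Spark settings that enable dynamic allocation on Kubernetes using shuffle tracking (no external shuffle service). The numbers are illustrative, and whether these are the exact settings the platform will adopt is an open question.

```python
# Standard Spark configs for dynamic allocation on Kubernetes via shuffle
# tracking (no external shuffle service needed). Values are illustrative.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",  # track shuffle data instead of an external shuffle service
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "200",
    "spark.dynamicAllocation.executorIdleTimeout": "120s",      # release idle executors so nodes can scale in
}
```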
In the future, how to provide a predictive auto scaling feature on the platform is another interesting topic. That's all for today's sharing. Thanks for your time.