Use Karpenter to speed up Amazon EMR on EKS autoscaling


Amazon EMR on Amazon EKS is a deployment option for Amazon EMR that allows organizations to run Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, the Spark jobs run on the Amazon EMR runtime for Apache Spark. This increases the performance of your Spark jobs so that they run faster and cost less than open source Apache Spark. Also, you can run Amazon EMR-based Apache Spark applications with other types of applications on the same EKS cluster to improve resource utilization and simplify infrastructure management.

Karpenter was launched at AWS re:Invent 2021 to provide a dynamic, high-performance, open-source cluster autoscaling solution for Kubernetes. It automatically provisions new nodes in response to unschedulable pods. It observes the aggregate resource requests of unscheduled pods and makes decisions to launch new nodes and terminate them, reducing scheduling latencies as well as infrastructure costs.

To configure Karpenter, you create provisioners that define how Karpenter manages unschedulable pods and expires nodes. Although most use cases are addressed with a single provisioner, multiple provisioners are useful in multi-tenant use cases such as isolating nodes for billing, using different node constraints (such as no GPUs for a team), or using different deprovisioning settings. Karpenter launches nodes with minimal compute resources to fit unschedulable pods for efficient binpacking. It works in tandem with the Kubernetes scheduler to bind unschedulable pods to the new nodes that are provisioned. The following diagram illustrates how it works.

This post shows how to integrate Karpenter into your EMR on EKS architecture to achieve faster, capacity-aware autoscaling that speeds up your big data and machine learning (ML) workloads while reducing costs. We run the same workload using both Cluster Autoscaler and Karpenter to observe some of the improvements we discuss in the next section.

Improvements compared to Cluster Autoscaler

Like Karpenter, Kubernetes Cluster Autoscaler (CAS) is designed to add nodes when requests come in to run pods that can't be met by current capacity. Cluster Autoscaler is part of the Kubernetes project, with implementations by major Kubernetes cloud providers. By taking a fresh look at provisioning, Karpenter offers the following improvements:

  • No node group management overhead – Because you have different resource requirements for different Spark workloads, along with other workloads in your EKS cluster, you have to create separate node groups that can meet your requirements, like instance sizes, Availability Zones, and purchase options. This can quickly grow to tens or hundreds of node groups, which adds additional management overhead. Karpenter manages each instance directly, without using additional orchestration mechanisms like node groups, taking a group-less approach by calling the EC2 Fleet API directly to provision nodes. This allows Karpenter to use diverse instance types, Availability Zones, and purchase options by simply creating a single provisioner, as shown in the following figure.
  • Quick retries – If the Amazon Elastic Compute Cloud (Amazon EC2) capacity isn't available, Karpenter can retry in milliseconds instead of minutes. This can be really useful if you're using EC2 Spot Instances and you're unable to get capacity for specific instance types.
  • Designed to handle the full flexibility of the cloud – Karpenter has the ability to efficiently handle the full range of instance types available through AWS. Cluster Autoscaler wasn't originally built with the flexibility to handle hundreds of instance types, Availability Zones, and purchase options. We recommend being as flexible as you can be so that Karpenter can get you the just-in-time capacity you need.
  • Improves the overall node utilization by binpacking – Karpenter batches pending pods and then binpacks them based on the CPU, memory, and GPUs required, taking into account node overhead (for example, daemon set resources required). After the pods are binpacked on the most efficient instance type, Karpenter takes other instance types that are similar to or larger than the most efficient packing and passes the instance type options to an API called EC2 Fleet, following some of the best practices of instance diversification to improve the chances of getting the requested capacity.

Best practices for using Karpenter with EMR on EKS

For general best practices with Karpenter, refer to Karpenter Best Practices. The following are additional things to consider with EMR on EKS:

  • Avoid inter-AZ data transfer costs by either configuring the Karpenter provisioner to launch in a single Availability Zone or using a node selector or affinity and anti-affinity to schedule the driver and the executors of the same job in a single Availability Zone. See the following code:
    nodeSelector:
      topology.kubernetes.io/zone: us-east-1a

  • Cost optimize Spark workloads by using EC2 Spot Instances for executors and On-Demand Instances for the driver, via the node selector with the label karpenter.sh/capacity-type in the pod templates. We recommend using pod templates to specify driver pods to run on On-Demand Instances and executor pods to run on Spot Instances. This allows you to consolidate provisioner specs because you don't need two specs per job type. It also follows the best practice of keeping workload-specific customization in the pod templates and keeping provisioner specs generic enough to support a broader set of use cases.
  • When using EC2 Spot Instances, maximize the instance diversification in the provisioner configuration to adhere to Spot best practices. To select suitable instance types, you can use ec2-instance-selector, a CLI tool and Go library that recommends instance types based on resource criteria like vCPUs and memory, as shown in the example after this list.
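For example, a query for instance types with 4 vCPUs and 16 GiB of memory might look like the following (a hypothetical invocation; check ec2-instance-selector --help for the exact flags available in your version):

# Suggest EC2 instance types that have 4 vCPUs and 16 GiB of memory
ec2-instance-selector --vcpus 4 --memory 16 --region us-east-1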

Solution overview

This post provides an example of how to set up both Cluster Autoscaler and Karpenter in an EKS cluster and compare their autoscaling behavior by running a sample EMR on EKS workload.

The following diagram illustrates the architecture of this solution.

We use the TPC-DS (Transaction Processing Performance Council-Decision Support) benchmark to sequentially run three Spark SQL queries (q70-v2.4, q82-v2.4, q64-v2.4) with a fixed number of 50 executors, against 17.7 billion records, approximately 924 GB of compressed data in Parquet file format. For more details on TPC-DS, refer to the eks-spark-benchmark GitHub repo.

We submit the same job with different Spark driver and executor specs to mimic different jobs, only to observe the autoscaling behavior and binpacking. For production workloads, we recommend you right-size your Spark executors based on the workload characteristics.

The following code is an example Spark configuration that results in pod spec requests of 4 vCPU and 15 GB (10 GB of memory plus 5 GB of memoryOverhead):

--conf spark.executor.instances=50 --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.driver.memoryOverhead=5g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.memoryOverhead=5g
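For reference, Spark on Kubernetes translates this configuration into executor pod resource requests along the lines of the following sketch (approximate; the exact memory request depends on how Spark combines the executor memory and memoryOverhead values):

# Approximate resources section of each executor pod generated from the preceding Spark configuration
resources:
  requests:
    cpu: "4"        # from spark.executor.cores
    memory: 15Gi    # spark.executor.memory (10g) + spark.executor.memoryOverhead (5g)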

We use pod templates to schedule Spark drivers on On-Demand Instances and executors on EC2 Spot Instances (which can save up to 90% over On-Demand Instance prices). Spark's inherent resiliency has the driver launch new executors to replace the ones that fail due to Spot interruptions. See the following code:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  containers:
  - name: spark-kubernetes-executor


apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
  containers:
  - name: spark-kubernetes-driver
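The job configuration shown later in this post references these templates from Amazon S3. If you're adapting the walkthrough to your own environment, you could upload them with commands like the following (the file names and the s3://$S3BUCKET/pod-template/ prefix match the paths used in the sample job; the setup scripts in the repository may already handle this for you):

# Upload the driver and executor pod templates so the Spark job configuration can reference them
aws s3 cp karpenter-driver-pod-template.yaml s3://$S3BUCKET/pod-template/
aws s3 cp karpenter-executor-pod-template.yaml s3://$S3BUCKET/pod-template/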

Prerequisites

We use an AWS Cloud9 IDE to run all the instructions throughout this post.

To create your IDE, run the following commands in AWS CloudShell. The default Region is us-east-1, but you can change it if needed.

# clone the repo
git clone https://github.com/black-mirror-1/karpenter-for-emr-on-eks.git
cd karpenter-for-emr-on-eks
./setup/create-cloud9-ide.sh

Navigate to the AWS Cloud9 IDE using the URL from the output of the script.

Install tools on the AWS Cloud9 IDE

Install the tools required on the AWS Cloud9 environment by running the provided scripts.

Run the following instructions in your AWS Cloud9 environment, not in CloudShell.

  1. Clone the GitHub repository:
    cd ~/environment
    git clone https://github.com/black-mirror-1/karpenter-for-emr-on-eks.git
    cd ~/environment/karpenter-for-emr-on-eks

  2. Set up the required environment variables. Feel free to adjust the following code according to your needs:
    # Install envsubst (from GNU gettext utilities) and bash-completion
    sudo yum -y install jq gettext bash-completion moreutils
    
    # Set up the required environment variables
    export EKSCLUSTER_NAME=aws-blog
    export EKS_VERSION="1.23"
    # get the link to the same version as EKS from here https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html
    export KUBECTL_URL="https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.7/2022-06-29/bin/linux/amd64/kubectl"
    export HELM_VERSION="v3.9.4"
    export KARPENTER_VERSION="v0.18.1"
    # get the latest matching version of the Cluster Autoscaler from here https://github.com/kubernetes/autoscaler/releases
    export CAS_VERSION="v1.23.1"

  3. Install the AWS Cloud9 CLI tools:
    cd ~/environment/karpenter-for-emr-on-eks
    ./setup/c9-install-tools.sh

Provision the infrastructure

We set up the required resources using the provided infrastructure script:

  1. Create the EMR on EKS and Karpenter infrastructure:
    cd ~/environment/karpenter-for-emr-on-eks
    ./setup/create-eks-emr-infra.sh

  2. Validate the setup:
    # Should return resources that are in Running status
    kubectl get nodes
    kubectl get pods -n karpenter
    kubectl get po -n kube-system -l app.kubernetes.io/instance=cluster-autoscaler
    kubectl get po -n prometheus
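You can also confirm that the EMR on EKS virtual cluster was created by listing the virtual clusters in your account (this uses the standard emr-containers CLI command):

# The virtual cluster created by the setup script should appear in the output
aws emr-containers list-virtual-clusters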

Understanding Karpenter configurations

Because the sample workload has driver and executor specs of different sizes, we identified instances from the c5, c5a, c5d, c5ad, c6a, m4, m5, m5a, m5d, m5ad, and m6a families of sizes 2xlarge, 4xlarge, 8xlarge, and 9xlarge for our workload using the amazon-ec2-instance-selector CLI. With CAS, we need to create a total of 12 node groups, as shown in eksctl-config.yaml, but we can define the same constraints in Karpenter with a single provisioner, as shown in the following code:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  provider:
    launchTemplate: {EKSCLUSTER_NAME}-karpenter-launchtemplate
    subnetSelector:
      karpenter.sh/discovery: {EKSCLUSTER_NAME}
  labels:
    app: kspark
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand","spot"]
    - key: "kubernetes.io/arch" 
      operator: In
      values: ["amd64"]
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: [c5, c5a, c5d, c5ad, m5, c6a]
    - key: karpenter.k8s.aws/instance-size
      operator: In
      values: [2xlarge, 4xlarge, 8xlarge, 9xlarge]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["{AWS_REGION}a"]

  limits:
    resources:
      cpu: "2000"

  ttlSecondsAfterEmpty: 30

We have configured both autoscalers to scale down nodes that are empty for 30 seconds, using ttlSecondsAfterEmpty in Karpenter and --scale-down-unneeded-time in CAS.
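For reference, on the Cluster Autoscaler side these settings are passed as container arguments in its deployment; the relevant flags look roughly like the following excerpt (a sketch only; the full deployment is created by the setup script):

# Excerpt of the Cluster Autoscaler container command showing the flags discussed in this post
command:
  - ./cluster-autoscaler
  - --expander=least-waste
  - --scale-down-unneeded-time=30s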

Karpenter by design will try to achieve the most efficient packing of the pods on a node based on the CPU, memory, and GPUs required.

Run a sample workload

To run a sample workload, complete the following steps:

  1. Let's review the AWS Command Line Interface (AWS CLI) command to submit a sample job:
    aws emr-containers start-job-run \
      --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
      --name karpenter-benchmark-${CORES}vcpu-${MEMORY}gb \
      --execution-role-arn $EMR_ROLE_ARN \
      --release-label emr-6.5.0-latest \
      --job-driver '{
      "sparkSubmitJobDriver": {
          "entryPoint": "local:///usr/lib/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar",
          "entryPointArguments":["s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned","s3://'$S3BUCKET'/EMRONEKS_TPCDS-TEST-3T-RESULT-KA","/opt/tpcds-kit/tools","parquet","3000","1","false","q70-v2.4,q82-v2.4,q64-v2.4","true"],
          "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --conf spark.executor.instances=50 --conf spark.driver.cores='$CORES' --conf spark.driver.memory='$EXEC_MEMORY'g --conf spark.executor.cores='$CORES' --conf spark.executor.memory='$EXEC_MEMORY'g"}}' \
      --configuration-overrides '{
        "applicationConfiguration": [
          {
            "classification": "spark-defaults", 
            "properties": {
              "spark.kubernetes.node.selector.app": "kspark",
              "spark.kubernetes.node.selector.topology.kubernetes.io/zone": "'${AWS_REGION}'a",
    
              "spark.kubernetes.container.image":  "'$ECR_URL'/eks-spark-benchmark:emr6.5",
              "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/pod-template/karpenter-driver-pod-template.yaml",
              "spark.kubernetes.executor.podTemplateFile": "s3://'$S3BUCKET'/pod-template/karpenter-executor-pod-template.yaml",
              "spark.network.timeout": "2000s",
              "spark.executor.heartbeatInterval": "300s",
              "spark.kubernetes.executor.limit.cores": "'$CORES'",
              "spark.executor.memoryOverhead": "'$MEMORY_OVERHEAD'G",
              "spark.driver.memoryOverhead": "'$MEMORY_OVERHEAD'G",
              "spark.kubernetes.executor.podNamePrefix": "karpenter-'$CORES'vcpu-'$MEMORY'gb",
              "spark.executor.defaultJavaOptions": "-verbose:gc -XX:+UseG1GC",
              "spark.driver.defaultJavaOptions": "-verbose:gc -XX:+UseG1GC",
    
              "spark.ui.prometheus.enabled":"true",
              "spark.executor.processTreeMetrics.enabled":"true",
              "spark.kubernetes.driver.annotation.prometheus.io/scrape":"true",
              "spark.kubernetes.driver.annotation.prometheus.io/path":"/metrics/executors/prometheus/",
              "spark.kubernetes.driver.annotation.prometheus.io/port":"4040",
              "spark.kubernetes.driver.service.annotation.prometheus.io/scrape":"true",
              "spark.kubernetes.driver.service.annotation.prometheus.io/path":"/metrics/driver/prometheus/",
              "spark.kubernetes.driver.service.annotation.prometheus.io/port":"4040",
              "spark.metrics.conf.*.sink.prometheusServlet.class":"org.apache.spark.metrics.sink.PrometheusServlet",
              "spark.metrics.conf.*.sink.prometheusServlet.path":"/metrics/driver/prometheus/",
              "spark.metrics.conf.master.sink.prometheusServlet.path":"/metrics/master/prometheus/",
              "spark.metrics.conf.applications.sink.prometheusServlet.path":"/metrics/applications/prometheus/"
             }}
        ]}'

  2. Submit 4 jobs with different driver and executor vCPU and memory sizes on Karpenter:
    # the arguments are vCPUs and memory
    export EMRCLUSTER_NAME=${EKSCLUSTER_NAME}-emr
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 4 7
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 8 15
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 4 15
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 8 31 

  3. To monitor the pods' autoscaling status in real time, open a new terminal in the Cloud9 IDE and run the following command (nothing is returned at first):
    watch -n1 "kubectl get pod -n emr-karpenter"

  4. Monitor the EC2 instance and node autoscaling status in a second terminal tab by running the following command (by design, Karpenter schedules in Availability Zone a):
    watch -n1 "kubectl get node --label-columns=node.kubernetes.io/instance-type,karpenter.sh/capacity-type,topology.kubernetes.io/zone,app -l app=kspark"

Compare with Cluster Autoscaler (Optional)

We set up Cluster Autoscaler during the infrastructure setup step with the following configuration:

  • Launch EC2 nodes in Availability Zone b
  • Contain 12 node groups (6 each for On-Demand and Spot)
  • Scale down unneeded nodes after 30 seconds with --scale-down-unneeded-time
  • Use the least-waste expander on CAS, which selects the node group that will have the least idle CPU after scale-up, for binpacking efficiency
  1. Submit 4 jobs with different driver and executor vCPU and memory sizes on CAS:
    # the arguments are vCPUs and memory
    ./sample-workloads/emr6.5-tpcds-ca.sh 4 7
    ./sample-workloads/emr6.5-tpcds-ca.sh 8 15
    ./sample-workloads/emr6.5-tpcds-ca.sh 4 15
    ./sample-workloads/emr6.5-tpcds-ca.sh 8 31

  2. To monitor the pods' autoscaling status in real time, open a new terminal in the Cloud9 IDE and run the following command (nothing is returned at first):
    watch -n1 "kubectl get pod -n emr-ca"

  3. Monitor the EC2 instance and node autoscaling status in a second terminal tab by running the following command (by design, CAS schedules in Availability Zone b):
    watch -n1 "kubectl get node --label-columns=node.kubernetes.io/instance-type,eks.amazonaws.com/capacityType,topology.kubernetes.io/zone,app -l app=caspark"

Observations

The time from pod creation to being scheduled is, on average, lower with Karpenter than with CAS, as shown in the following figure; you can see a noticeable difference when you run large-scale workloads.

As shown in the following figures, as the jobs completed, Karpenter was able to scale down the nodes that were no longer needed within seconds. In contrast, CAS takes minutes, because it sends a signal to the node groups, adding more latency. This in turn helps reduce overall costs by reducing the number of seconds that unneeded EC2 instances are running.
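To see these provisioning and deprovisioning decisions in detail, you can follow the Karpenter controller logs (the label and container names below are typical for the Karpenter Helm chart, but may vary slightly by version):

# Follow the Karpenter controller logs to observe node launch and termination decisions
kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller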

Clean up

To clean up your environment, delete all the resources created, in reverse order, by running the cleanup script:

export EKSCLUSTER_NAME=aws-blog
cd ~/environment/karpenter-for-emr-on-eks
./setup/cleanup.sh

Conclusion

In this post, we showed you how to use Karpenter to simplify EKS node provisioning and speed up autoscaling of EMR on EKS workloads. We encourage you to try Karpenter and provide feedback by creating a GitHub issue.

Further reading


About the Authors

Changbin Gong is a Principal Solutions Architect at Amazon Web Services. He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Changbin enjoys reading, running, and traveling.

Sandeep Palavalasa is a Sr. Specialist Containers SA at Amazon Web Services. He is a software technology leader with over 12 years of experience in building large-scale, distributed software systems. His professional career started with a focus on monitoring and observability, and he has a strong cloud architecture background. He likes working on distributed systems and is excited to talk about microservice architecture design. His current interests are in the areas of container services and serverless technologies.

