Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Short introduction to Python and Scala

Basics (theory):

  • Architecture
  • RDD
  • Transformations and Actions
  • Stages, Tasks, and Dependencies
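
As a rough illustration of the "Transformations and Actions" topic above, the following toy sketch models the idea in plain Python. It is not Spark's API: the `ToyRDD` class and its methods are hypothetical names invented for this example. It only assumes the general principle that transformations are recorded lazily and an action triggers the actual computation.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, data, ops=None):
        self._data = list(data)
        self._ops = ops or []  # pending transformations, not yet executed

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def _run(self):
        # Chain the recorded steps over the source data as lazy iterators.
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger execution of the recorded pipeline.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
pending_calls = len(calls)  # still 0: no transformation has run yet
result = rdd.collect()      # the action runs the whole pipeline
```

Here `square` is not called until `collect()` runs, which mirrors why a chain of transformations costs nothing until an action is invoked. Calling a second action such as `rdd.count()` re-runs the pipeline in this toy model, which is the kind of recomputation that caching is meant to avoid.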

Using the Databricks environment to understand the basics (hands-on workshop):

  • Exercises using the RDD API
  • Basic action and transformation functions
  • PairRDD
  • Join operations
  • Caching strategies
  • Exercises using the DataFrame API
  • Spark SQL
  • DataFrame: select, filter, group, sort
  • UDFs (user-defined functions)
  • Exploring the Dataset API
  • Streaming

Using the AWS environment to understand deployment (hands-on workshop):

  • Basics of AWS Glue
  • Understanding the differences between AWS EMR and AWS Glue
  • Example jobs in both environments
  • Exploring the pros and cons of each

Extra:

  • Introduction to Apache Airflow orchestration

Requirements

  • Programming skills (preferably in Python or Scala)
  • Foundational knowledge of SQL

Duration: 21 hours
