Spark SQL and PySpark 3 using Python 3 - Hands-On with Labs

October 8, 2022

Last updated 8/2022
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 KHz
Language: English | Size: 10.03 GB | Duration: 32h 12m

A comprehensive course on Spark SQL as well as Data Frame APIs using Python 3, with complimentary lab access

What you'll learn
Set up single-node Hadoop and Spark using Docker, locally or on AWS Cloud9
Review ITVersity Labs (exclusively for ITVersity Lab Customers)
All the HDFS commands that are relevant to validating files and folders in HDFS
Quick recap of the Python that is relevant to learning Spark
Ability to use Spark SQL to solve problems using SQL-style syntax
PySpark Data Frame APIs to solve problems using Data Frame-style APIs
Relevance of the Spark Metastore in converting Data Frames into temporary views, so that data in Data Frames can be processed using Spark SQL
Apache Spark Application Development Life Cycle
Apache Spark Application Execution Life Cycle and Spark UI
Set up an SSH Proxy to access Spark Application logs
Deployment Modes of Spark Applications (Cluster and Client)
Passing Application Properties Files and External Dependencies while running Spark Applications
Requirements
Basic programming skills in any programming language
Self-supported lab (instructions provided) or ITVersity lab at additional cost for an appropriate environment
A 64-bit operating system; the minimum memory required depends on the environment you are using
4 GB RAM with access to proper clusters, or 16 GB RAM to set up the environment using Docker

Description
As part of this course, you will learn all the key skills to build Data Engineering pipelines using Spark SQL and Spark Data Frame APIs, with Python as the programming language. This course used to be a CCA 175 Spark and Hadoop Developer course for certification exam preparation. As of 10/31/2021, the exam is sunset and we have renamed the course to Apache Spark 2 and 3 using Python 3, as it covers industry-relevant topics beyond the scope of the certification.

About Data Engineering
Data Engineering is nothing but processing data according to our downstream needs. We need to build different pipelines, such as batch pipelines and streaming pipelines, as part of Data Engineering. All roles related to data processing are consolidated under Data Engineering. Conventionally, they are known as ETL Development, Data Warehouse Development, etc. Apache Spark has evolved into a leading technology for Data Engineering at scale.

I have prepared this course for anyone who would like to transition into a Data Engineer role using PySpark (Python + Spark). I myself am a Data Engineering Solution Architect with proven experience in designing solutions using Apache Spark.

Let us go through the details of what you will be learning in this course. Keep in mind that the course is created with a lot of hands-on tasks, which will give you enough practice using the right tools. There are also tons of tasks and exercises to evaluate yourself. We will provide details about resources or environments to learn Spark SQL and PySpark 3 using Python 3, as well as reference material on GitHub to practice with.

Keep in mind that you can use the cluster at your workplace, set up the environment using the provided instructions, or use ITVersity Lab to take this course.

Setup of Single Node Big Data Cluster
Many of you would like to transition to Big Data from conventional technologies such as Mainframes, Oracle PL/SQL, etc., and you might not have access to Big Data clusters. It is very important for you to set up the environment in the right manner. Don't worry if you do not have a cluster handy; we will guide you through with support via Udemy Q&A.
Set up an Ubuntu-based AWS Cloud9 instance with the right configuration
Ensure Docker is set up
Set up Jupyter Lab and other key components
Set up and validate Hadoop, Hive, YARN, and Spark

Are you feeling a bit overwhelmed about setting up the environment? Don't worry! We will provide complimentary lab access for up to 2 months. Here are the details.
Training using an interactive environment. You will get 2 weeks of lab access to begin with. If you like the environment and acknowledge it by providing a 5* rating and feedback, the lab access will be extended by an additional 6 weeks (2 months).
Feel free to send an email to [email protected] to get complimentary lab access.
Also, if your employer provides a multi-node environment, we will help you set up the material for practice as part of a live session.
On top of Q&A support, we also provide required support via live sessions.

A quick recap of Python
This course requires a decent knowledge of Python. To make sure you understand Spark from a Data Engineering perspective, we added a module to quickly warm up with Python. If you are not familiar with Python, then we suggest you go through our other course, Data Engineering Essentials - Python, SQL, and Spark.

Master required Hadoop Skills to build Data Engineering Applications
As part of this section, you will primarily focus on HDFS commands so that we can copy files into HDFS. The data copied into HDFS will be used as part of building data engineering pipelines using Spark and Hadoop, with Python as the programming language.
Overview of HDFS commands
Copy files into HDFS using the put or copyFromLocal command
Review whether the files are copied properly to HDFS using HDFS commands
Get the size of the files using HDFS commands such as du, df, etc.
Some fundamental concepts related to HDFS, such as block size, replication factor, etc.

Data Engineering using Spark SQL
Let us deep-dive into Spark SQL to understand how it can be used to build Data Engineering pipelines. Spark with SQL gives us the ability to leverage the distributed computing capabilities of Spark, coupled with easy-to-use, developer-friendly SQL-style syntax.
Getting Started with Spark SQL
Basic Transformations using Spark SQL
Managing Tables - Basic DDL and DML in Spark SQL
Managing Tables - DML and Create Partitioned Tables using Spark SQL
Overview of Spark SQL Functions to manipulate strings, dates, null values, etc.
Windowing Functions using Spark SQL for ranking, advanced aggregations, etc.

Data Engineering using Spark Data Frame APIs
Spark Data Frame APIs are an alternative way of building Data Engineering applications at scale, leveraging the distributed computing capabilities of Spark. Data Engineers from application development backgrounds might prefer Data Frame APIs over Spark SQL to build Data Engineering applications.
Data Processing Overview using Spark or PySpark Data Frame APIs
Projecting or selecting data from Spark Data Frames, renaming columns, providing aliases, dropping columns from Data Frames, etc. using PySpark Data Frame APIs
Processing Column Data using Spark or PySpark Data Frame APIs - you will be learning functions to manipulate strings, dates, null values, etc.
Basic Transformations on Spark Data Frames using PySpark Data Frame APIs, such as filtering, aggregations, and sorting, using functions such as filter/where, groupBy with agg, sort or orderBy, etc.
Joining Data Sets on Spark Data Frames using PySpark Data Frame APIs such as join. You will learn inner joins, outer joins, etc. using the right examples.
Windowing Functions on Spark Data Frames using PySpark Data Frame APIs to perform advanced aggregations, ranking, and analytic functions
Spark Metastore Databases and Tables, and integration between Spark SQL and Data Frame APIs

Apache Spark Application Development and Deployment Life Cycle
Once you go through the content related to Spark using a Jupyter-based environment, we will also walk you through how Spark applications are typically developed using Python, deployed, and reviewed.
Set up a Python virtual environment and project for Spark application development using PyCharm
Understand the complete Spark Application Development Life Cycle using PyCharm and Python
Build a zip file for the Spark application, copy it to the environment where it is supposed to run, and run it
Understand how to review the Spark Application Execution Life Cycle

All the demos are given on our state-of-the-art Big Data cluster. You can avail of one-month complimentary lab access by reaching out to [email protected] with a Udemy receipt.

Overview

Section 1: Introduction about Spark SQL and PySpark 3 using Python 3

Lecture 1 Introduction to Spark SQL and PySpark 3 using Python 3

Lecture 2 Curriculum for Spark SQL and Pyspark 3 using Python 3

Lecture 3 Purchasing the Spark SQL and PySpark using Python 3 Course

Lecture 4 Introduction to Udemy Course Landing Page

Lecture 5 Overview of Udemy Course or Video Player

Lecture 6 Adding Notes to Course Lectures

Lecture 7 Using Course Sidebar to move between lectures

Lecture 8 Overview of Support to ITVersity courses on Udemy

Lecture 9 Best Practices to get ITVersity Support using Udemy

Lecture 10 Resources for Spark SQL and Pyspark 3 using Python 3

Lecture 11 Material for Spark SQL and PySpark 3 using Python 3

Lecture 12 Become Part of ITVersity Data Engineering Community

Lecture 13 Rate and Leave Feedback - Spark SQL and PySpark 3 using Python 3

Lecture 14 Udemy for Business Customers - Important Information about labs for practice

Section 2: Using ITVersity Labs for hands-on practice (for ITVersity Lab Customers only)

Lecture 15 Setup Development Environment using VS Code Remote Development Extension Pack

Lecture 16 Review Data Sets Provided as part of Gateway Nodes of Hadoop and Spark Cluster

Lecture 17 Validate HDFS on Multi Node Hadoop and Spark Cluster from Gateway Node

Lecture 18 Validate Hive on Hadoop and Spark Multinode Cluster

Lecture 19 Review Hadoop HDFS and YARN Property Files on Hadoop and Spark Cluster

Lecture 20 Review Hadoop HDFS and YARN Property Files using Visual Studio Code Editor

Lecture 21 Review Hive Property Files on Multinode Hadoop and Spark Cluster

Lecture 22 Review Spark 2 Property Files and Important Properties

Lecture 23 Validate Spark Shell CLI using Spark 2

Lecture 24 Validate Pyspark CLI using Spark 2

Lecture 25 Validate Spark SQL CLI using Spark 2

Lecture 26 Review Spark 3 Property Files and Important Properties

Lecture 27 Validate Spark Shell CLI using Spark 3

Lecture 28 Validate Pyspark CLI using Spark 3

Lecture 29 Validate Spark SQL CLI using Spark 3

Section 3: Setup Hadoop and Spark Single Node Cluster on Windows 11 using Docker

Lecture 30 Prerequisites for Single Node Hadoop and Spark Cluster on Windows

Lecture 31 Overview of Windows System Configuration

Lecture 32 Setup Ubuntu on Windows 11 using wsl

Lecture 33 Setup and Validate Ubuntu VM on Windows using wsl

Lecture 34 Install Docker Desktop on Windows 11 using wsl2

Lecture 35 Overview of Docker Desktop on Windows 11

Lecture 36 Validate Docker Commands using Windows Powershell as well as wsl Ubuntu

Lecture 37 Setup Visual Studio Code IDE on Windows

Lecture 38 Install Visual Studio Code Extension for Remote Development

Lecture 39 Clone GitHub Repository for Pyspark Course using Visual Studio Code

Lecture 40 Launching Terminal using Visual Studio Code and WSL

Lecture 41 Review Docker Compose File to setup Hadoop and Spark Lab

Lecture 42 Start Hadoop and Spark Lab along with Jupyter Lab on Windows 11

Lecture 43 Review the resource utilization of Windows for Hadoop and Spark Lab

Lecture 44 Review Docker Desktop for Hadoop and Spark Lab using Docker

Lecture 45 Overview of Docker Compose Commands to manage Hadoop and Spark Lab

Lecture 46 Validate Hadoop and Spark setup using Docker on Windows

Section 4: Setup Hadoop and Spark Single Node Cluster on AWS Cloud9 using Docker

Lecture 47 Getting Started with AWS Cloud9

Lecture 48 Creating AWS Cloud9 Environment

Lecture 49 Warming up with AWS Cloud9 IDE

Lecture 50 Review Operating System Details on AWS Cloud9

Lecture 51 Overview of EC2 Instance related to AWS Cloud9

Lecture 52 Opening ports for AWS Cloud9 Instance

Lecture 53 Associating Elastic IPs to AWS Cloud9 Instance

Lecture 54 Increase EBS Volume Size of AWS Cloud9 Instance

Lecture 55 Setup Docker Compose on AWS Cloud9 Instance

Lecture 56 Clone GitHub Repository on AWS Cloud9 for the Course Material

Lecture 57 Review Docker Compose File to setup Hadoop and Spark Lab

Lecture 58 Start Hadoop and Spark Lab along with Jupyter Lab on Windows 11

Lecture 59 Overview of Docker Compose Commands to manage Hadoop and Spark Lab

Lecture 60 Validate Hadoop and Spark setup using Docker

Section 5: Python Fundamentals

Lecture 61 Introduction and Setting up Python

Lecture 62 Basic Programming Constructs

Lecture 63 Functions in Python

Lecture 64 Python Collections

Lecture 65 Map Reduce operations on Python Collections

Lecture 66 Setting up Data Sets for Basic I/O Operations

Lecture 67 Basic I/O operations and processing data using Collections
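
The Python recap in Section 5 lists map-reduce style operations on collections. As a rough illustration of that idea (not taken from the course material), the sketch below counts words in a few hard-coded lines using only the standard library. The sample lines and the function name `word_count` are made up for this example.

```python
from collections import Counter
from functools import reduce


def word_count(lines):
    """Count words across lines using map, filter, and reduce."""
    # map: split every line into lowercase words
    words_per_line = map(lambda line: line.lower().split(), lines)
    # filter: drop lines that produced no words (blank lines)
    non_empty = filter(lambda words: len(words) > 0, words_per_line)
    # reduce: merge the per-line counts into a single Counter
    return reduce(lambda acc, words: acc + Counter(words), non_empty, Counter())


if __name__ == "__main__":
    # Hypothetical sample data, standing in for lines read from a file
    sample = ["spark sql and pyspark", "", "Spark data frame apis"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

The same split, filter, and aggregate shape tends to reappear later when data is processed with Spark, which is presumably why the recap includes it.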

Section 6: Overview of Hadoop HDFS Commands

Lecture 68 Getting help or usage

Lecture 69 Listing HDFS Files

Lecture 70 Managing HDFS Directories

Lecture 71 Copying files from local to HDFS

Lecture 72 Copying files from HDFS to local

Lecture 73 Getting File Metadata

Lecture 74 Previewing Data in HDFS File

Lecture 75 HDFS Block Size

Lecture 76 HDFS Replication Factor

Lecture 77 Getting HDFS Storage Usage

Lecture 78 Using HDFS Stat Commands

Lecture 79 HDFS File Permissions

Lecture 80 Overriding Properties

Section 7: Apache Spark 2.x - Data processing - Getting Started

Lecture 81 Introduction

Lecture 82 Review of Setup Steps for Spark Environment

Lecture 83 Using ITVersity labs

Lecture 84 Apache Spark Official Documentation (Very Important)

Lecture 85 Quick Review of Spark APIs

Lecture 86 Spark Modules

Lecture 87 Spark Data Structures - RDDs and Data Frames

Lecture 88 Develop Simple Application

Lecture 89 Apache Spark - Framework

Lecture 90 Create Data Frames from Text Files

Lecture 91 Create Data Frames from Hive Tables

Section 8: Apache Spark using SQL - Getting Started

Lecture 92 Getting Started - Overview

Lecture 93 Overview of Spark Documentation

Lecture 94 Launching and using Spark SQL CLI

Lecture 95 Overview of Spark SQL Properties

Lecture 96 Running OS Commands using Spark SQL

Lecture 97 Understanding Spark Metastore Warehouse Directory

Lecture 98 Managing Spark Metastore Databases using Spark SQL

Lecture 99 Managing Spark Metastore Tables using Spark SQL

Lecture 100 Retrieve Metadata of Spark Metastore Tables using Spark SQL Describe Command

Lecture 101 Role of Spark Metastore or Hive Metastore

Lecture 102 Exercise - Getting Started with Spark SQL

Section 9: Apache Spark using SQL - Basic Transformations using Spark SQL

Lecture 103 Basic Transformations using Spark SQL - Introduction

Lecture 104 Spark SQL - Overview

Lecture 105 Define Problem Statement

Lecture 106 Prepare Spark Metastore Tables for Basic Transformations using Spark SQL

Lecture 107 Projecting Data using Spark SQL Select Clause

Lecture 108 Filtering Data using Spark SQL Where Clause

Lecture 109 Joining Tables using Spark SQL - Inner

Lecture 110 Joining Tables using Spark SQL - Outer

Lecture 111 Aggregating Data using Group By in Spark SQL

Lecture 112 Sorting Data using Order By in Spark SQL

Lecture 113 Conclusion - Final Solution for the problem statement using Spark SQL

Section 10: Apache Spark using SQL - Basic DDL and DML

Lecture 114 Introduction to Basic DDL and DML in Spark SQL

Lecture 115 Create Spark Metastore Tables using Spark SQL Create Statement

Lecture 116 Overview of Data Types used in Spark Metastore Tables

Lecture 117 Adding Comments to Spark Metastore Tables using Spark SQL

Lecture 118 Loading Data from Local File System Into Tables using Spark SQL Load Statement

Lecture 119 Loading Data from HDFS Folders Into Tables using Spark SQL Load Statement

Lecture 120 Difference between Load with Append and Overwrite using Spark SQL Load Statement

Lecture 121 Creating External Spark Metastore Tables using Spark SQL

Lecture 122 Difference between Managed and External Spark Metastore Tables

Lecture 123 Overview of File Formats used in Spark Metastore Tables

Lecture 124 Drop Spark Metastore Tables and Databases using Spark SQL

Lecture 125 Truncating Spark Metastore Tables

Lecture 126 Exercise - Managed Spark Metastore Tables

Section 11: Apache Spark using SQL - DML and Partitioning

Lecture 127 Introduction to DML and Partitioning using Spark SQL on Spark Metastore Tables

Lecture 128 Introduction to Partitioning of Spark Metastore Tables using Spark SQL

Lecture 129 Creating Spark Metastore Tables using Parquet File Format

Lecture 130 Difference between Load and Insert to get data into Spark Metastore Tables

Lecture 131 Inserting Data using Stage Table leveraging Spark SQL

Lecture 132 Creating Spark Metastore Partitioned Tables using Spark SQL

Lecture 133 Adding Partitions to Spark Metastore Tables using Spark SQL

Lecture 134 Loading Data into Spark Metastore Partitioned Tables using Spark SQL

Lecture 135 Inserting Data into Spark Metastore Partitions using Spark SQL Insert Statement

Lecture 136 Using Dynamic Partition Mode while inserting into Spark Partitioned Tables

Lecture 137 Exercise - Partitioned Tables using Spark SQL

Section 12: Apache Spark using SQL - Pre-defined Functions

Lecture 138 Introduction - Overview of Spark SQL Pre-defined Functions

Lecture 139 Overview of Spark SQL Pre-defined Functions

Lecture 140 Validating Spark SQL Functions

Lecture 141 String Manipulation using Spark SQL Functions

Lecture 142 Date Manipulation using Spark SQL Functions

Lecture 143 Overview of Numeric Functions in Spark SQL

Lecture 144 Data Type Conversion using Spark SQL

Lecture 145 Dealing with Nulls using Spark SQL

Lecture 146 Using CASE and WHEN in Spark SQL Queries

Lecture 147 Query Example - Word Count using Spark SQL

Section 13: Apache Spark SQL - Windowing Functions

Lecture 148 Introduction to Windowing Functions in Spark SQL

Lecture 149 Prepare HR Database for Windowing Functions in Spark SQL

Lecture 150 Overview of Windowing Functions using Spark SQL

Lecture 151 Aggregations using Spark SQL Windowing Functions

Lecture 152 Using LEAD or LAG in Spark SQL Windowing Functions

Lecture 153 Getting first and last values using Spark SQL Windowing Functions

Lecture 154 Ranking using Spark SQL Windowing Functions - rank, dense_rank and row_number

Lecture 155 Order of execution of Spark SQL Queries

Lecture 156 Overview of Subqueries in Spark SQL

Lecture 157 Filtering Window Function Results using Spark SQL
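
Section 13 covers ranking with rank, dense_rank and row_number. The sketch below shows how the three differ on tied values. It is an assumption-laden stand-in: it runs the query through Python's built-in sqlite3 module (which needs SQLite 3.25 or later for window functions) rather than Spark, so that it works without a cluster. The query text uses only standard window syntax and is expected, though not verified here, to behave the same way in Spark SQL. The table, column names, and rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (name, dept, salary)
EMPLOYEES = [
    ("Asha", "IT", 9000),
    ("Ben", "IT", 9000),
    ("Chen", "IT", 7000),
    ("Dita", "HR", 5000),
    ("Eli", "HR", 4000),
]

# rank leaves gaps after ties, dense_rank does not, and row_number
# is always unique (name is added as a tie-breaker to keep it stable).
RANKING_QUERY = """
SELECT name, dept, salary,
       rank()       OVER (PARTITION BY dept ORDER BY salary DESC)       AS rnk,
       dense_rank() OVER (PARTITION BY dept ORDER BY salary DESC)       AS drnk,
       row_number() OVER (PARTITION BY dept ORDER BY salary DESC, name) AS rn
FROM employees
ORDER BY dept, rn
"""


def rank_employees(rows):
    """Load rows into an in-memory table and run the ranking query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
        return conn.execute(RANKING_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in rank_employees(EMPLOYEES):
        print(row)
```

With this data, the two IT employees on 9000 share rank 1, and the next IT employee gets rank 3 but dense_rank 2.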

Section 14: Apache Spark using Python - Data Processing Overview

Lecture 158 Starting Spark Context - pyspark

Lecture 159 Overview of Spark Read APIs

Lecture 160 Understanding airlines data

Lecture 161 Inferring Schema using Spark Data Frame APIs

Lecture 162 Previewing Airlines Data using Spark Data Frame APIs

Lecture 163 Overview of Data Frame APIs

Lecture 164 Overview of Functions on Spark Data Frames

Lecture 165 Overview of Spark Write APIs

Section 15: Apache Spark using Python - Processing Column Data

Lecture 166 Overview of Predefined Functions on Spark Data Frame Columns

Lecture 167 Create Dummy Data Frame to explore Functions on Data Frame Columns

Lecture 168 Categories of Predefined Functions used on Spark Data Frame Columns

Lecture 169 Special Functions for Spark Data Frame Columns - col and lit

Lecture 170 Common String Manipulation Functions for Spark Data Frame Columns

Lecture 171 Extracting Strings using substring from Spark Data Frame Columns

Lecture 172 Extracting Strings using split from Spark Data Frame Columns

Lecture 173 Padding Characters around Strings in Spark Data Frame Columns

Lecture 174 Trimming Characters from Strings in Spark Data Frame Columns

Lecture 175 Date and Time Manipulation Functions for Spark Data Frame Columns

Lecture 176 Date and Time Arithmetic on Spark Data Frame Columns

Lecture 177 Using Date and Time Trunc Functions on Spark Data Frame Columns

Lecture 178 Date and Time Extract Functions for Spark Data Frame Columns

Lecture 179 Using to_date and to_timestamp on Spark Data Frame Columns

Lecture 180 Using date_format Function on Spark Data Frame Columns

Lecture 181 Dealing with Unix Timestamp in Spark Data Frame Columns

Lecture 182 Dealing with Nulls in Spark Data Frame Columns

Lecture 183 Using CASE and WHEN on Spark Data Frame Columns

Section 16: Apache Spark using Python - Basic Transformations

Lecture 184 Overview of Basic Transformations on Spark Data Frames

Lecture 185 Spark Data Frames for basic transformations

Lecture 186 Basic Filtering of Data or rows using where from Spark Data Frames

Lecture 187 Filtering Example using dates on Spark Data Frames

Lecture 188 Boolean Operators while filtering from Spark Data Frames

Lecture 189 Using IN Operator or isin Function while filtering from Spark Data Frames

Lecture 190 Using LIKE Operator or like Function while filtering from Spark Data Frames

Lecture 191 Using BETWEEN Operator while filtering from Spark Data Frames

Lecture 192 Dealing with Nulls while Filtering from Spark Data Frames

Lecture 193 Total Aggregations on Spark Data Frames

Lecture 194 Aggregate data using groupBy from Spark Data Frames

Lecture 195 Aggregate data using rollup on Spark Data Frames

Lecture 196 Aggregate data using cube on Spark Data Frames

Lecture 197 Overview of Sorting Spark Data Frames

Lecture 198 Solution - Problem 1 - Get Total Aggregations

Lecture 199 Solution - Problem 2 - Get Total Aggregations By FlightDate

Section 17: Apache Spark using Python - Joining Data Sets

Lecture 200 Prepare Datasets for Joining Spark Data Frames

Lecture 201 Analyze Datasets for Joining Spark Data Frames

Lecture 202 Problem Statements for Joining Spark Data Frames

Lecture 203 Overview of Joins on Spark Data Frames

Lecture 204 Using Inner Joins on Spark Data Frames

Lecture 205 Left or Right Outer Join on Spark Data Frames

Lecture 206 Solution - Get Flight Count Per US Airport using Spark Data Frame APIs

Lecture 207 Solution - Get Flight Count Per US State using Spark Data Frame APIs

Lecture 208 Solution - Get Dormant US Airports using Spark Data Frame APIs

Lecture 209 Solution - Get Origins without master data using Spark Data Frame APIs

Lecture 210 Solution - Get Count of Flights without master data using Spark Data Frame APIs

Lecture 211 Solution - Get Count of Flights per Airport without master data

Lecture 212 Solution - Get Daily Revenue using Spark Data Frame APIs

Lecture 213 Solution - Get Daily Revenue rolled up till Yearly using Spark Data Frame APIs

Section 18: Apache Spark using Python - Spark Metastore

Lecture 214 Overview of APIs to deal with Spark Metastore

Lecture 215 Exploring Spark Catalog

Lecture 216 Creating Spark Metastore Tables using catalog

Lecture 217 Inferring Schema while creating Spark Metastore Tables using Spark Catalog

Lecture 218 Define Schema for Spark Metastore Tables using StructType

Lecture 219 Inserting into Existing Spark Metastore Tables using Spark Data Frame APIs

Lecture 220 Read and Process data from Spark Metastore Tables using Data Frame APIs

Lecture 221 Create Spark Metastore Partitioned Tables using Data Frame APIs

Lecture 222 Saving as Spark Metastore Partitioned Table using Data Frame APIs

Lecture 223 Creating Temporary Views on top of Spark Data Frames

Lecture 224 Using Spark SQL against Temporary Views on Spark Data Frames

Section 19: Getting Started with Semi Structured Data using Spark

Lecture 225 Introduction to Getting Started with Semi Structured Data using Spark

Lecture 226 Create Spark Metastore Table with Special Data Types

Lecture 227 Overview of ARRAY Type in Spark Metastore Table

Lecture 228 Overview of MAP and STRUCT Type in Spark Metastore Table

Lecture 229 Insert Data into Spark Metastore Table with Special Type Columns

Lecture 230 Create Spark Data Frame with Special Data Types

Lecture 231 Create Spark Data Frame with Special Types using Python List

Lecture 232 Insert Spark Data Frame with Special Types into Spark Metastore Table

Lecture 233 Review Data in the JSON File with Special Data Types

Lecture 234 Setup JSON Data Set to explore Spark APIs on Special Data Type Columns

Lecture 235 Read JSON Data with Special Types into Spark Data Frame

Lecture 236 Flatten Array Fields in Spark Data Frames using explode and explode_outer

Lecture 237 Get Size or Length of Array Type Columns in Spark Data Frame

Lecture 238 Concatenate Array Values into Delimited String using Spark APIs

Lecture 239 Convert Delimited Strings from Spark Data Frame Columns to Arrays

Lecture 240 Setup Data Sets to Build Arrays using Spark

Lecture 241 Read JSON Data into Spark Data Frame and Review Aggregate Operations

Lecture 242 Build Arrays from Flattened Rows of Spark Data Frame

Lecture 243 Getting Started with Spark Data Frames with Struct Columns

Lecture 244 Concatenate Struct Column Values in Spark Data Frame

Lecture 245 Filter Data on Struct Column Attributes in Spark Data Frame

Lecture 246 Create Spark Data Frame using Map Type Column

Lecture 247 Project Map Values as Columns using Spark Data Frame APIs

Lecture 248 Conclusion of Getting Started with Semi Structured Data using Spark

Section 20: Process Semi Structured Data using Spark Data Frame APIs

Lecture 249 Introduction to Process Semi Structured Data using Spark Data Frame APIs

Lecture 250 Review the Data Sets to generate denormalized JSON Data using Spark

Lecture 251 Setup JSON Data Sets in HDFS using HDFS Command

Lecture 252 Create Spark Data Frames using Data Frame APIs

Lecture 253 Join Orders and Order Items using Spark Data Frame APIs

Lecture 254 Generate Struct Field for Order Details using Spark

Lecture 255 Generate Array of Struct Field for Order Details using Spark

Lecture 256 Join Data Sets to generate denormalized JSON Data using Spark

Lecture 257 Denormalize Join Results using Spark Data Frame APIs

Lecture 258 Write Denormalized Customer Details to JSON Files using Spark

Lecture 259 Publish JSON Files for downstream applications

Lecture 260 Read Denormalized Data into Spark Data Frame

Lecture 261 Filter Denormalized Data Frame using Spark APIs

Lecture 262 Perform Aggregations on Denormalized Data Frame using Spark

Lecture 263 Flatten Semi Structured Data or Denormalized Data using Spark

Lecture 264 Compute Monthly Customer Revenue using Spark on Denormalized Data

Lecture 265 Conclusion of Processing Semi Structured Data using Spark Data Frame APIs

Section 21: Apache Spark - Application Development Life Cycle

Lecture 266 Setup Virtual Environment and Install Pyspark

Lecture 267 Getting Started with Pycharm

Lecture 268 Passing Run Time Arguments

Lecture 269 Accessing OS Environment Variables

Lecture 270 Getting Started with Spark

Lecture 271 Create Function for Spark Session

Lecture 272 Setup Sample Data

Lecture 273 Read data from files

Lecture 274 Process data using Spark APIs

Lecture 275 Write data to files

Lecture 276 Validating Writing Data to Files

Lecture 277 Productionizing the Code

Lecture 278 Setting up Data for Production Validation

Lecture 279 Running the application using YARN

Lecture 280 Detailed Validation of the Application
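
Section 21 includes lectures on passing run-time arguments and accessing OS environment variables. A minimal sketch of how an application might combine the two is shown below. The variable name `ENVIRON`, the argument order, and the function name `resolve_settings` are assumptions made for illustration, not the course's actual code.

```python
import os
import sys


def resolve_settings(argv, environ):
    """Derive run settings from command-line arguments and environment variables."""
    if len(argv) < 3:
        raise ValueError("usage: app.py <src_dir> <tgt_dir>")
    return {
        # ENVIRON is a hypothetical variable name; default to a dev run
        "env": environ.get("ENVIRON", "DEV"),
        "src_dir": argv[1],
        "tgt_dir": argv[2],
    }


if __name__ == "__main__":
    # Fall back to placeholder paths so the sketch also runs without arguments
    args = sys.argv if len(sys.argv) >= 3 else [sys.argv[0], "/tmp/src", "/tmp/tgt"]
    print(resolve_settings(args, os.environ))
```

Passing `argv` and `environ` in as parameters, instead of reading the globals inside the function, keeps the logic easy to exercise from a notebook or a test.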

Section 22: Spark Application Execution Life Cycle and Spark UI

Lecture 281 Deploying and Monitoring Spark Applications - Introduction

Lecture 282 Overview of Types of Spark Cluster Managers

Lecture 283 Setup EMR Cluster with Hadoop and Spark

Lecture 284 Overall Capacity of Big Data Cluster with Hadoop and Spark

Lecture 285 Understanding YARN Capacity of an Enterprise Cluster

Lecture 286 Overview of Hadoop HDFS and YARN Setup on Multi-node Cluster

Lecture 287 Overview of Spark Setup on top of Hadoop

Lecture 288 Setup Data Set for Word Count application

Lecture 289 Develop Word Count Application

Lecture 290 Review Deployment Process of Spark Application

Lecture 291 Overview of Spark Submit Command

Lecture 292 Switch between Python Versions to run Spark Applications or launch Pyspark CLI

Lecture 293 Switch between Pyspark Versions to run Spark Applications or launch Pyspark CLI

Lecture 294 Review Spark Configuration Properties at Run Time

Lecture 295 Develop Shell Script to run Spark Application

Lecture 296 Run Spark Application and review default executors

Lecture 297 Overview of Spark History Server UI

Section 23: Setup SSH Proxy to access Spark Application logs

Lecture 298 Setup SSH Proxy to access Spark Application logs - Introduction

Lecture 299 Overview of Private and Public ips of servers in the cluster

Lecture 300 Overview of SSH Proxy

Lecture 301 Setup sshuttle on Mac or Linux

Lecture 302 Proxy using sshuttle on Mac or Linux

Lecture 303 Accessing Spark Application logs via SSH Proxy using sshuttle on Mac or Linux

Lecture 304 Side effects of using SSH Proxy to access Spark Application Logs

Lecture 305 Steps to setup SSH Proxy on Windows to access Spark Application Logs

Lecture 306 Setup PuTTY and PuTTYgen on Windows

Lecture 307 Quick Tour of PuTTY on Windows

Lecture 308 Configure Passwordless Login using PuTTYGen Keys on Windows

Lecture 309 Run Spark Application on Gateway Node using PuTTY

Lecture 310 Configure Tunnel to Gateway Node using PuTTY on Windows for SSH Proxy

Lecture 311 Setup Proxy on Windows and validate using Microsoft Edge browser

Lecture 312 Understanding Proxying Network Traffic overcoming Windows Caveats

Lecture 313 Update Hosts file for worker nodes using private ips

Lecture 314 Access Spark Application logs using SSH Proxy

Lecture 315 Overview of performing tasks related to Spark Applications using Mac

Section 24: Deployment Modes of Spark Applications

Lecture 316 Deployment Modes of Spark Applications - Introduction

Lecture 317 Default Execution Master Type for Spark Applications

Lecture 318 Launch Pyspark using local mode

Lecture 319 Running Spark Applications using Local Mode

Lecture 320 Overview of Spark CLI Commands such as Pyspark

Lecture 321 Accessing Local Files using Spark CLI or Spark Applications

Lecture 322 Overview of submitting spark application using client deployment mode

Lecture 323 Overview of submitting spark application using cluster deployment mode

Lecture 324 Review the default logging while submitting Spark Applications

Lecture 325 Changing Spark Application Log Level using custom log4j properties

Lecture 326 Submit Spark Application using client mode with log level info

Lecture 327 Submit Spark Application using cluster mode with log level info

Lecture 328 Submit Spark Applications using SPARK_CONF_DIR with custom properties files

Lecture 329 Submit Spark Applications using Properties File

Section 25: Passing Application Properties Files and External Dependencies

Lecture 330 Passing Application Properties Files and External Dependencies - Introduction

Lecture 331 Steps to pass application properties using JSON

Lecture 332 Setup Working Directory to pass application properties using JSON

Lecture 333 Build the JSON with Application Properties

Lecture 334 Explore APIs to process JSON Data using Pyspark

Lecture 335 Refactor the Spark Application Code to use properties from JSON

Lecture 336 Pass Application Properties to Spark Application using local files in client mode

Lecture 337 Pass Application Properties to Spark Application using local files in cluster mode

Lecture 338 Pass Application Properties to Spark Application using HDFS files

Lecture 339 Steps to pass external Python Libraries using pyfiles

Lecture 340 Create required YAML File to externalize application properties

Lecture 341 Install PyYAML into specific folder and build zip

Lecture 342 Explore APIs to process YAML Data using Pyspark

Lecture 343 Refactor the Spark Application Code to use properties from YAML

Lecture 344 Pass External Dependencies to Spark Application using local files in client mode

Lecture 345 Pass External Dependencies to Spark Apps using local files in cluster mode

Lecture 346 Pass External Dependencies to Spark Application using HDFS files
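
Section 25 describes externalizing application properties into a JSON file. One possible shape for that, sketched under assumptions, is a JSON document keyed by environment name from which the application picks one section at start-up. The file name, the keys, and the function name `load_properties` below are hypothetical, and how the file reaches the driver (local path or HDFS) is not covered here.

```python
import json
import os
import tempfile


def load_properties(path, env):
    """Read a JSON properties file and return the section for one environment."""
    with open(path) as fp:
        all_props = json.load(fp)
    if env not in all_props:
        raise KeyError(f"no properties defined for environment {env!r}")
    return all_props[env]


if __name__ == "__main__":
    # Hypothetical properties, keyed by environment name
    sample = {
        "dev": {"spark.master": "local", "src.dir": "/tmp/src"},
        "prod": {"spark.master": "yarn", "src.dir": "/data/src"},
    }
    with tempfile.TemporaryDirectory() as tmp:
        props_path = os.path.join(tmp, "application.json")
        with open(props_path, "w") as fp:
            json.dump(sample, fp)
        print(load_properties(props_path, "prod"))
```

Failing fast on an unknown environment name is a deliberate choice in this sketch, so a misspelled environment does not silently fall back to another section's settings.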

Who this course is for
Any IT aspirant or professional willing to learn Data Engineering using Apache Spark
Python developers who want to learn Spark to add a key skill for becoming a Data Engineer
Scala-based Data Engineers who would like to learn Spark using Python as the programming language
