Using R for Big Data with Spark

Video description

Data analysts familiar with R will learn to leverage the power of Spark, distributed computing and cloud storage in this course that shows you how to use your R skills in a big data environment.

You'll learn to create Spark clusters on the Amazon Web Services (AWS) platform; perform cluster-based data modeling using Gaussian generalized linear models, binomial generalized linear models, Naive Bayes, and K-means; access data from S3 Spark DataFrames and other sources like CSV, JSON, and HDFS; and perform cluster-based data manipulation with tools like SparkR and SparkSQL. By the end of the course, you'll be able to work with data sets too large to process on a single computer. This hands-on class requires each learner to set up their own extremely low-cost, easily terminated AWS account.

Discover how to use your R skills in a distributed, cloud-based big data cluster environment
Gain hands-on experience setting up Spark clusters on the Amazon Web Services (AWS) cloud platform
Understand how to control a cloud instance on AWS using SSH or PuTTY
Explore basic distributed modeling techniques like GLM, Naive Bayes, and K-means
Learn to do cloud-based data manipulation and processing using SparkR and SparkSQL
Understand how to access data from CSV and JSON files, HDFS, and S3

Manuel Amunategui is a data science practitioner, consultant, teacher, and author with 16+ years of data science experience. A former quantitative analyst for a Wall Street brokerage firm, he now serves as the lead data scientist for Providence Health & Services in Portland, Oregon. In his free time, Manuel does competitive data modeling on Kaggle.com, CrowdANALYTIX.com, Datascience.net, and DrivenData.org.

Introduction

Welcome to the Course 00:04:21

About the Author 00:01:09

Creating Clusters on Amazon Web Services

Creating an AWS Launching Instance 00:09:40

Connecting to AWS Instance using SSH 00:06:19

Connecting to AWS Instance using PuTTY 00:08:37

Starting Spark Clusters Part 1 00:09:02

Starting Spark Clusters Part 2 00:09:55

Terminate Your Clusters 00:00:58

Data and Modeling Basics

Data Basics 00:08:34

Modeling with Gaussian Generalized Linear Models 00:11:19

Modeling with Binomial Generalized Linear Models 00:09:34

Naive Bayes and K-Means Modeling 00:09:14

Data Sources and Data Manipulation

Bigger Data and S3 00:07:27

Accessing S3 Spark Dataframes 00:04:57

SparkR Dataframe Operations 00:11:01

SparkSQL 00:05:16

Various

Brief Look at HDFS 00:11:00

Brief Look at Databricks Community Edition 00:08:20

Conclusion

Wrap Up and Thank You 00:02:02

Self-paced