Using R for Big Data with Spark
Video description
Data analysts familiar with R will learn to leverage the power of Spark, distributed computing and cloud storage in this course that shows you how to use your R skills in a big data environment.
You'll learn to create Spark clusters on the Amazon Web Services (AWS) platform; perform cluster-based data modeling using Gaussian generalized linear models, binomial generalized linear models, Naive Bayes, and K-means; access data from S3 Spark DataFrames and other sources like CSV, JSON, and HDFS; and perform cluster-based data manipulation with tools like SparkR and SparkSQL. By course end, you'll be capable of working with massive data sets that cannot be processed on a single computer. This hands-on class requires each learner to set up their own extremely low-cost, easily terminated AWS account.
Discover how to use your R skills in a big data distributed cloud computing cluster environment
Gain hands-on experience setting up Spark clusters on Amazon's AWS cloud services platform
Understand how to control a cloud instance on AWS using SSH or PuTTY
Explore basic distributed modeling techniques like GLM, Naive Bayes, and K-means
Learn to do cloud-based data manipulation and processing using SparkR and SparkSQL
Understand how to access data from CSV and JSON files, HDFS, and S3
Manuel Amunategui is a data science practitioner, consultant, teacher, and author with 16+ years of data science experience. A former quantitative analyst for a Wall Street brokerage firm, he now serves as the lead data scientist for Providence Health & Services in Portland, Oregon. In his free time, Manuel does competitive data modeling on Kaggle.com, CrowdANALYTIX.com, Datascience.net, and DrivenData.org.
Modeling with Gaussian Generalized Linear Models
00:11:19
Modeling with Binomial Generalized Linear Models
00:09:34
Naive Bayes and K-Means Modeling
00:09:14
Data Sources and Data Manipulation
Bigger Data and S3
00:07:27
Accessing S3 Spark Dataframes
00:04:57
SparkR Dataframe Operations
00:11:01
SparkSQL
00:05:16
Various
Brief Look at HDFS
00:11:00
Brief Look at Databricks Community Edition
00:08:20
Conclusion
Wrap Up and Thank You
00:02:02