Hadoop Fundamentals for Data Scientists

Video description

Get a practical introduction to Hadoop, the framework that made big data and large-scale analytics possible by combining distributed computing techniques with distributed storage. In this video tutorial, hosts Benjamin Bengfort and Jenny Kim discuss the core concepts behind distributed computing and big data, and then show you how to work with a Hadoop cluster and program analytical jobs. You’ll also learn how to use higher-level tools …

Hadoop is a cluster computing technology that has many moving parts, including distributed systems administration, data engineering and warehousing methodologies, software engineering for distributed computing, and large-scale analytics. With this video, you’ll learn how to operationalize analytics over large datasets and rapidly deploy analytical jobs with a variety of toolsets.

Once you’ve completed this video, you’ll understand how different parts of Hadoop combine to form an entire data pipeline managed by teams of data engineers, data programmers, data researchers, and data business people.

Understand the Hadoop architecture and set up a pseudo-distributed development environment
Learn how to develop distributed computations with MapReduce and the Hadoop Distributed File System (HDFS)
Work with Hadoop via the command-line interface
Use the Hadoop Streaming utility to execute MapReduce jobs in Python
Explore data warehousing, higher-order data flows, and other projects in the Hadoop ecosystem
Learn how to use Hive to query and analyze relational data using Hadoop
Use summarization, filtering, and aggregation to move Big Data towards last mile computation
Understand how analytical workflows including iterative machine learning, feature analysis, and data modeling work in a Big Data context
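As a rough illustration of the MapReduce style that the Hadoop Streaming topic above refers to, here is a minimal word count sketch in plain Python. It is not taken from the course's example code. The names `mapper`, `reducer`, and `run_local` are hypothetical, and the whole thing runs in a single process, with `sorted` standing in for Hadoop's shuffle-and-sort phase.

```python
from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(pairs):
    """Sum the counts for each word.

    Assumes the pairs arrive sorted by key, which is what the
    shuffle-and-sort phase guarantees on a real cluster.
    """
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce locally (assumption: no cluster)."""
    return dict(reducer(sorted(mapper(lines))))
```

Calling `run_local(["Hello hadoop", "hello world"])` counts "hello" twice and the other words once. With Hadoop Streaming, the mapper and reducer would instead typically be separate scripts that read lines from standard input and write tab-separated key/value lines to standard output.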

Benjamin Bengfort is a data scientist and programmer in Washington DC who prefers technology to politics but sees the value of data in every domain. Alongside his work teaching, writing, and developing large-scale analytics with a focus on statistical machine learning, he is finishing his PhD at the University of Maryland where he studies machine learning and artificial intelligence.

Jenny Kim, a software engineer in the San Francisco Bay Area, develops, teaches, and writes about big data analytics applications. She specializes in large-scale, distributed computing infrastructures and in machine-learning algorithms that support recommendation systems.

Overview of the Video Course

A Distributed Computing Environment

The Motivation for Hadoop

A Brief History of Hadoop

Understanding the Hadoop Architecture

Setting Up A Pseudo-Distributed Environment

The Distributed File System (HDFS)

Distributed Computing with MapReduce

Word Count - the “Hello, World” of Hadoop!

Computing with Hadoop

How a MapReduce Job Works

Mappers and Reducers in Detail

Working with Hadoop via the Command Line: Starting HDFS and YARN

Working with Hadoop via the Command Line: Loading Data into HDFS

Working with Hadoop via the Command Line: Running a MapReduce Job

How To Use Our GitHub Goodies

Working in Python with Hadoop Streaming

Common MapReduce Tasks

Spark on Hadoop 2

Creating a Spark Application with Python

The Hadoop Ecosystem

Data Warehousing with Hadoop

Higher Order Data Flows

Other Notable Projects

Working with Data on Hive

Introduction to Hive

Interacting with Data via the Hive Console

Creating Databases, Tables, and Schemas for Hive

Loading Data into Hive from HDFS

Querying Data and Performing Aggregations With Hive

Towards Last Mile Computing

Decomposing Large Data Sets to a Computational Space

Linear Regressions

Summarizing Documents with TF-IDF

Classification of Text

Parallel Canopy Clustering

Computing Recommendations via Linear Log-Likelihoods
