This course gives an overview of Big Data, i.e., the storage, retrieval, and processing of big data.
It also focuses on the “technologies”, i.e., the tools and algorithms available for storing and processing Big Data.
It helps students gain an in-depth understanding of, and practical experience with, Apache Spark and the Spark ecosystem, covering Spark RDD, Spark SQL, Spark MLlib, and Spark Streaming.
It helps students perform various “analytics” on different data sets and draw meaningful conclusions.
Syllabus
Unit 1: Introduction to Big Data Analytics
Introduction to Big Data: Types of data - Evolution of big data - Definition of big data - Characteristics of big data - Challenges with big data - Introduction to big data analytics - Technologies that help meet the challenges posed by big data. The big data technology landscape: Introduction to Hadoop - Hadoop architecture - Hadoop Distributed File System (HDFS) - Processing data with Hadoop - The Hadoop ecosystem.
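As a small taste of the “Processing data with Hadoop” topic, the sketch below shows a word-count job written for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts that read lines from stdin and emit tab-separated key/value pairs on stdout. The file names (mapper.py, reducer.py) and the sample data are illustrative, not part of the course material.

```python
#!/usr/bin/env python3
# mapper.py -- emits a (word, 1) pair for every word on every input line.
# Hadoop Streaming feeds input splits to this script via stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts per word. Hadoop sorts the mapper output by key,
# so all occurrences of a word arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is submitted with the hadoop-streaming JAR, passing the two scripts as -mapper and -reducer together with HDFS -input and -output paths; the exact JAR location and options depend on the Hadoop installation.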
Unit 2: PySpark
Introduction to Spark - Why Spark with Python - Spark core concepts - Spark core components - Spark architecture - How Spark works - Environment setup - Spark RDD - Programming with RDDs: Creating RDDs - Common RDD transformations and actions - Key-value pairs - RDD vs DataFrame - Aggregate and group-by operations - Filters - Joins - Programming with DataFrames - Data preprocessing methods - Data exploration - Data manipulation - Machine learning using Spark - Data analysis use cases with real-world applications.
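The topics above map directly onto a few lines of PySpark. Below is a minimal, illustrative sketch, assuming a local Spark installation: it creates an RDD, applies common transformations and an action on key-value pairs, and then demonstrates filter, join, group-by, and aggregation on DataFrames. Column names and sample values are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unit2-sketch").getOrCreate()
sc = spark.sparkContext

# --- RDD: create, transform, act ---
rdd = sc.parallelize(["spark makes big data simple", "big data needs big tools"])
word_counts = (rdd.flatMap(lambda line: line.split())   # transformation: line -> words
                  .map(lambda w: (w, 1))                # key-value pairs
                  .reduceByKey(lambda a, b: a + b))     # transformation: sum per key
print(word_counts.collect())                            # action: triggers execution

# --- DataFrames: filter, join, group by, aggregate ---
sales = spark.createDataFrame(
    [(1, "north", 120.0), (2, "south", 80.0), (1, "north", 45.5)],
    ["product_id", "region", "amount"],
)
products = spark.createDataFrame(
    [(1, "keyboard"), (2, "mouse")],
    ["product_id", "name"],
)

report = (sales.filter(F.col("amount") > 50)                  # filter rows
               .join(products, on="product_id", how="inner")  # join two DataFrames
               .groupBy("region", "name")                     # group by
               .agg(F.sum("amount").alias("total_amount")))   # aggregate
report.show()

spark.stop()
```

Running the script with spark-submit (or inside the pyspark shell) prints the word counts and a small grouped report; the same filter/join/groupBy pattern applies unchanged to cluster-sized data sets.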
NOTE:
Python is a prerequisite for this course.