Big Data Processing with Apache Spark

Author: Manuel Ignacio Franco Galeano  

Publisher: Packt Publishing

Publication year: 2018

E-ISBN: 9781789804522

P-ISBN (Paperback): 89543100949910

Subject: TP301.6 (Algorithm Theory)

Language: ENG



Contents

Chapter 1: Introduction to Spark Distributed Processing

Introduction

Introduction to Spark and Resilient Distributed Datasets

Spark Components

Spark Deployment Modes

Spark Standalone

Apache Mesos

Other Deployment Options

Resilient Distributed Datasets

Python Shell and SparkContext

Parallelized Collections

RDD Creation from External Data Sources

Exercise 1: Basic Interactive Analysis with Python

Operations Supported by the RDD API

Map Transformations

Reduce Action

Working with Key-Value Pairs

Join Transformations

Set Operations

Exercise 2: Map Reduce Operations

Activity 1: Statistical Operations on Books

Self-Contained Python Spark Programs

Introduction to Functional Programming

Exercise 3: Standalone Python Programs

Introduction to SQL, Datasets, and DataFrames

Exercise 4: Downloading the Reduced Version of the MovieLens Dataset

Exercise 5: RDD Operations in DataFrame Objects

Summary

Chapter 2: Introduction to Spark Streaming

Introduction

Introduction to Streaming Architectures

Back-Pressure, Write-Ahead Logging, and Checkpointing

Introduction to Discretized Streams

Consuming Streams from a TCP Socket

TCP Input DStream

Map-Reduce Operations over DStreams

Exercise 6: Building an Event TCP Server

Activity 2: Building a Simple TCP Spark Stream Consumer

Parallel Recovery of State with Checkpointing

Keeping the State in Streaming Applications

Join Operations

Exercise 7: TCP Stream Consumer from Multiple Sources

Activity 3: Consuming Event Data from Three TCP Servers

Windowing Operations

Exercise 8: Distributed Log Server

Introduction to Structured Streaming

Result Table and Output Modes in Structured Streaming

Exercise 9: Writing Random Ratings

Exercise 10: Structured Streaming

Summary

Chapter 3: Spark Streaming Integration with AWS

Introduction

Spark Integration with AWS Services

Prerequisites

AWS Kinesis Data Streams Basic Functionality

Integrating AWS Kinesis and Python

Exercise 11: Listing Existing Streams

Exercise 12: Creating a New Stream

Exercise 13: Deleting an Existing Stream

Exercise 14: Pushing Data to a Stream

AWS S3 Basic Functionality

Creating, Listing, and Deleting AWS S3 Buckets

Exercise 15: Listing Existing Buckets

Exercise 16: Creating a Bucket

Exercise 17: Deleting a Bucket

Kinesis Streams and Spark Streams

Activity 4: AWS and Spark Pipeline

Summary

Chapter 4: Spark Streaming, ML, and Windowing Operations

Introduction

Spark Integration with Machine Learning

The MovieLens Dataset

Introduction to Recommendation Systems and Collaborative Filtering

Exercise 18: Collaborative Filtering and Spark

Exercise 19: Creating a TCP Server that Publishes User Ratings

Exercise 20: Spark Streams Integration with Machine Learning

Activity 5: Experimenting with Windowing Operations

Summary

Appendix A

Index
