Practical Real-time Data Processing and Analytics

Authors: Shilpi Saxena, Saurabh Gupta

Publisher: Packt Publishing

Publication year: 2017

E-ISBN: 9781787289864

P-ISBN(Paperback): 9781787281202

Subject: TP274 data processing, data processing systems; TP39 computer applications

Language: ENG

Description

A practical guide to help you tackle different real-time data processing and analytics problems using the best tools for each scenario

About This Book

• Learn about the various challenges in real-time data processing and use the right tools to overcome them
• This book covers popular tools and frameworks such as Spark, Flink, and Apache Storm to solve all your distributed processing problems
• A practical guide filled with examples, tips, and tricks to help you perform efficient Big Data processing in real-time

Who This Book Is For

If you are a Java developer who would like to be equipped with all the tools required to devise an end-to-end practical solution on real-time data streaming, then this book is for you. Basic knowledge of real-time processing would be helpful, and knowing the fundamentals of Maven, Shell, and Eclipse would be great.

What You Will Learn

• Get an introduction to the established real-time stack
• Understand the key integration of all the components
• Get a thorough understanding of the basic building blocks for real-time solution designing
• Garnish the search and visualization aspects for your real-time solution
• Get conceptually and practically acquainted with real-time analytics
• Be well equipped to apply the knowledge and create your own solutions

In Detail

With the rise of Big Data, there is an increasing need to process large amounts of data continuously, with a shorter turnaround time. Real-time data processing involves continuous input,

Chapter

Chapter 1: Introducing Real-Time Analytics

What is big data?

Big data infrastructure

Real-time analytics – the myth and the reality

Near real-time solution – an architecture that works

NRT – The Storm solution

NRT – The Spark solution

Lambda architecture – analytics possibilities

IOT – thoughts and possibilities

Edge analytics

Cloud – considerations for NRT and IOT

Summary

Chapter 2: Real Time Applications – The Basic Ingredients

The NRT system and its building blocks

Data collection

Stream processing

Analytical layer – serve it to the end user

NRT – high-level system view

NRT – technology view

Event producer

Collection

Broker

Transformation and processing

Storage

Summary

Chapter 3: Understanding and Tailing Data Streams

Understanding data streams

Setting up infrastructure for data ingestion

Apache Kafka

Apache NiFi

Logstash

Fluentd

Flume

Taping data from source to the processor - expectations and caveats

Comparing and choosing what works best for your use case

Do it yourself

Setting up Elasticsearch

Summary

Chapter 4: Setting up the Infrastructure for Storm

Overview of Storm

Storm architecture and its components

Characteristics

Components

Stream grouping

Setting up and configuring Storm

Setting up Zookeeper

Installing

Configuring

Standalone

Cluster

Running

Setting up Apache Storm

Installing

Configuring

Running

Real-time processing job on Storm

Running job

Local

Cluster

Summary

Chapter 5: Configuring Apache Spark and Flink

Setting up and a quick execution of Spark

Building from source

Downloading Spark

Running an example

Setting up and a quick execution of Flink

Build Flink source

Download Flink

Running example

Setting up and a quick execution of Apache Beam

Beam model

Running example

MinimalWordCount example walk through

Balancing in Apache Beam

Summary

Chapter 6: Integrating Storm with a Data Source

RabbitMQ – messaging that works

RabbitMQ exchanges

Direct exchanges

Fanout exchanges

Topic exchanges

Headers exchanges

RabbitMQ setup

RabbitMQ – publish and subscribe

RabbitMQ – integration with Storm

AMQPSpout

PubNub data stream publisher

String together Storm-RMQ-PubNub sensor data topology

Summary

Chapter 7: From Storm to Sink

Setting up and configuring Cassandra

Setting up Cassandra

Configuring Cassandra

Storm and Cassandra topology

Storm and IMDB integration for dimensional data

Integrating the presentation layer with Storm

Setting up Grafana with the Elasticsearch plugin

Downloading Grafana

Configuring Grafana

Installing the Elasticsearch plugin in Grafana

Running Grafana

Adding the Elasticsearch datasource in Grafana

Writing code

Executing code

Visualizing the output on Grafana

Do It Yourself

Summary

Chapter 8: Storm Trident

State retention and the need for Trident

Transactional spout

Opaque transactional Spout

Basic Storm Trident topology

Trident internals

Trident operations

Functions

map and flatMap

peek

Filters

Windowing

Tumbling window

Sliding window

Aggregation

Aggregate

Partition aggregate

Persistence aggregate

Combiner aggregator

Reducer aggregator

Aggregator

Grouping

Merge and joins

DRPC

Do It Yourself

Summary

Chapter 9: Working with Spark

Spark overview

Spark framework and schedulers

Distinct advantages of Spark

When to avoid using Spark

Spark – use cases

Spark architecture - working inside the engine

Spark pragmatic concepts

RDD – the name says it all

Spark 2.x – advent of data frames and datasets

Summary

Chapter 10: Working with Spark Operations

Spark – packaging and API

RDD pragmatic exploration

Transformations

Actions

Shared variables – broadcast variables and accumulators

Broadcast variables

Accumulators

Summary

Chapter 11: Spark Streaming

Spark Streaming concepts

Spark Streaming - introduction and architecture

Packaging structure of Spark Streaming

Spark Streaming APIs

Spark Streaming operations

Connecting Kafka to Spark Streaming

Summary

Chapter 12: Working with Apache Flink

Flink architecture and execution engine

Flink basic components and processes

Integration of source stream to Flink

Integration with Apache Kafka

Example

Integration with RabbitMQ

Running example

Flink processing and computation

DataStream API

DataSet API

Flink persistence

Integration with Cassandra

Running example

FlinkCEP

Pattern API

Detecting pattern

Selecting from patterns

Example

Gelly

Gelly API

Graph representation

Graph creation

Graph transformations

DIY

Summary

Chapter 13: Case Study

Introduction

Data modeling

Tools and frameworks

Setting up the infrastructure

Implementing the case study

Building the data simulator

Hazelcast loader

Building Storm topology

Parser bolt

Check distance and alert bolt

Generate alert Bolt

Elasticsearch Bolt

Complete Topology

Running the case study

Load Hazelcast

Generate Vehicle static value

Deploy topology

Start simulator

Visualization using Kibana

Summary

Index
