Practical Big Data Analytics

Author: Nataraj Dasgupta  

Publisher: Packt Publishing

Publication year: 2018

E-ISBN: 9781783554409

P-ISBN(Paperback): 89543100624370

Subject: TP181 automatic reasoning, machine learning

Language: ENG

Chapter 1: Too Big or Not Too Big

What is big data?

A brief history of data

Dawn of the information age

Dr. Alan Turing and modern computing

The advent of the stored-program computer

From magnetic devices to SSDs

Why we are talking about big data now if data has always existed

Definition of big data

Building blocks of big data analytics

Types of Big Data

Structured

Unstructured

Semi-structured

Sources of big data

The 4Vs of big data

When do you know you have a big data problem and where do you start your search for the big data solution?

Summary

Chapter 2: Big Data Mining for the Masses

What is big data mining?

Big data mining in the enterprise

Building the case for a Big Data strategy

Implementation life cycle

Stakeholders of the solution

Implementing the solution

Technical elements of the big data platform

Selection of the hardware stack

Selection of the software stack

Summary

Chapter 3: The Analytics Toolkit

Components of the Analytics Toolkit

System recommendations

Installing on a laptop or workstation

Installing on the cloud

Installing Hadoop

Installing Oracle VirtualBox

Installing CDH in other environments

Installing Packt Data Science Box

Installing Spark

Installing R

Steps for downloading and installing Microsoft R Open

Installing RStudio

Installing Python

Summary

Chapter 4: Big Data With Hadoop

The fundamentals of Hadoop

The fundamental premise of Hadoop

The core modules of Hadoop

Hadoop Distributed File System - HDFS

Data storage process in HDFS

Hadoop MapReduce

An intuitive introduction to MapReduce

A technical understanding of MapReduce

Block size and number of mappers and reducers

Hadoop YARN

Job scheduling in YARN

Other topics in Hadoop

Encryption

User authentication

Hadoop data storage formats

New features expected in Hadoop 3

The Hadoop ecosystem

Hands-on with CDH

WordCount using Hadoop MapReduce

Analyzing oil import prices with Hive

Joining tables in Hive

Summary

Chapter 5: Big Data Mining with NoSQL

Why NoSQL?

The ACID, BASE, and CAP properties

ACID and SQL

The BASE property of NoSQL

The CAP theorem

The need for NoSQL technologies

Google Bigtable

Amazon Dynamo

NoSQL databases

In-memory databases

Columnar databases

Document-oriented databases

Key-value databases

Graph databases

Other NoSQL types and summary of other types of databases 

Analyzing Nobel Laureates data with MongoDB

JSON format

Installing and using MongoDB

Tracking physician payments with real-world data

Installing kdb+, R, and RStudio

Installing kdb+

Installing R

Installing RStudio

The CMS Open Payments Portal

Downloading the CMS Open Payments data

Creating the Q application

Loading the data

The backend code

Creating the frontend web portal

R Shiny platform for developers

Putting it all together - The CMS Open Payments application

Applications

Summary

Chapter 6: Spark for Big Data Analytics

The advent of Spark

Limitations of Hadoop

Overcoming the limitations of Hadoop

Theoretical concepts in Spark

Resilient distributed datasets

Directed acyclic graphs

SparkContext

Spark DataFrames

Actions and transformations

Spark deployment options

Spark APIs

Core components in Spark

Spark Core

Spark SQL

Spark Streaming

GraphX

MLlib

The architecture of Spark

Spark solutions

Spark practicals

Signing up for Databricks Community Edition

Spark exercise - hands-on with Spark (Databricks)

Summary

Chapter 7: An Introduction to Machine Learning Concepts

What is machine learning?

The evolution of machine learning

Factors that led to the success of machine learning

Machine learning, statistics, and AI

Categories of machine learning

Supervised and unsupervised machine learning

Supervised machine learning

Vehicle Mileage, Number Recognition and other examples

Unsupervised machine learning

Subdividing supervised machine learning

Common terminologies in machine learning

The core concepts in machine learning

Data management steps in machine learning

Pre-processing and feature selection techniques

Centering and scaling

The near-zero variance function

Removing correlated variables

Other common data transformations

Data sampling

Data imputation

The importance of variables

The train, test splits, and cross-validation concepts

Splitting the data into train and test sets

The cross-validation parameter

Creating the model

Leveraging multicore processing in the model

Summary

Chapter 8: Machine Learning Deep Dive

The bias, variance, and regularization properties

The gradient descent and VC Dimension theories

Popular machine learning algorithms

Regression models

Association rules

Confidence

Support

Lift

Decision trees

The Random forest extension

Boosting algorithms

Support vector machines

The K-Means machine learning technique

The neural networks related algorithms

Tutorial - associative rules mining with CMS data

Downloading the data

Writing the R code for Apriori

Shiny (R Code)

Using custom CSS and fonts for the application

Running the application

Summary

Chapter 9: Enterprise Data Science

Enterprise data science overview

A roadmap to enterprise analytics success

Data science solutions in the enterprise

Enterprise data warehouse and data mining

Traditional data warehouse systems

Oracle Exadata, Exalytics, and TimesTen

HP Vertica

Teradata

IBM data warehouse systems (formerly Netezza appliances)

PostgreSQL

Greenplum

SAP Hana

Enterprise and open source NoSQL Databases

Kdb+

MongoDB

Cassandra

Neo4j

Cloud databases

Amazon Redshift, Redshift Spectrum, and Athena databases

Google BigQuery and other cloud services

Azure CosmosDB

GPU databases

Brytlyt

MapD

Other common databases

Enterprise data science – machine learning and AI

The R programming language

Python

OpenCV, Caffe, and others

Spark

Deep learning

H2O and Driverless AI

Datarobot

Command-line tools

Apache MADlib

Machine learning as a service

Enterprise infrastructure solutions

Cloud computing

Virtualization

Containers – Docker, Kubernetes, and Mesos

On-premises hardware

Enterprise Big Data

Tutorial – using RStudio in the cloud

Summary

Chapter 10: Closing Thoughts on Big Data

Corporate big data and data science strategy

Ethical considerations

Silicon Valley and data science

The human factor

Characteristics of successful projects

Summary

Appendix: External Data Science Resources

Big data resources

NoSQL products

Languages and tools

Creating dashboards

Notebooks

Visualization libraries

Courses on R

Courses on machine learning

Machine learning and deep learning links

Web-based machine learning services

Movies

Machine learning books from Packt

Books for leisure reading

Other Books You May Enjoy

Leave a review - let other readers know what you think

Index