Java: Data Science Made Easy

Author: Richard M. Reese; Jennifer L. Reese; Alexey Grigorev

Publisher: Packt Publishing

Publication year: 2017

E-ISBN: 9781788479189

P-ISBN(Paperback): 9781788475655

Subject: TP274 Data processing, data processing systems; TP39 Computer applications

Keyword: Computer applications, data processing, data processing systems

Language: ENG




Description

Data collection, processing, analysis, and more

About This Book

• Your entry ticket to the world of data science with the stability and power of Java
• Explore, analyse, and visualize your data effectively using easy-to-follow examples
• A highly practical course covering a broad set of topics, from the basics of Machine Learning to Deep Learning and Big Data frameworks

Who This Book Is For

This course is meant for Java developers who are comfortable developing applications in Java and now want to enter the world of data science or build intelligent applications. Aspiring data scientists with some understanding of the Java programming language will also find this book very helpful. If you want to build efficient data science applications and bring them into the enterprise environment without changing your existing Java stack, this book is for you!

What You Will Learn

• Understand the key concepts of data science
• Explore the data science ecosystem available in Java
• Work with the Java APIs and techniques used to perform efficient data analysis
• Find out how to approach different machine learning problems with Java
• Process unstructured information such as natural language text or images, and create your own searc
• Learn how to build deep neural networks with DeepLearning4j
• Build data science applications that scale and process large amounts of data
• Deploy data science models to production and evaluate their performance

In Detail

Data s

Chapter

Module 1: Java for Data Science

Chapter 1: Getting Started with Data Science

Problems solved using data science

Understanding the data science problem-solving approach

Using Java to support data science

Acquiring data for an application

The importance and process of cleaning data

Visualizing data to enhance understanding

The use of statistical methods in data science

Machine learning applied to data science

Using neural networks in data science

Deep learning approaches

Performing text analysis

Visual and audio analysis

Improving application performance using parallel techniques

Assembling the pieces

Summary

Chapter 2: Data Acquisition

Understanding the data formats used in data science applications

Overview of CSV data

Overview of spreadsheets

Overview of databases

Overview of PDF files

Overview of JSON

Overview of XML

Overview of streaming data

Overview of audio/video/images in Java

Data acquisition techniques

Using the HttpURLConnection class

Web crawlers in Java

Creating your own web crawler

Using the crawler4j web crawler

Web scraping in Java

Using API calls to access common social media sites

Using OAuth to authenticate users

Handling Twitter

Handling Wikipedia

Handling Flickr

Handling YouTube

Searching by keyword

Summary

Chapter 3: Data Cleaning

Handling data formats

Handling CSV data

Handling spreadsheets

Handling Excel spreadsheets

Handling PDF files

Handling JSON

Using JSON streaming API

Using the JSON tree API

The nitty gritty of cleaning text

Using Java tokenizers to extract words

Java core tokenizers

Third-party tokenizers and libraries

Transforming data into a usable form

Simple text cleaning

Removing stop words

Finding words in text

Finding and replacing text

Data imputation

Subsetting data

Sorting text

Data validation

Validating data types

Validating dates

Validating e-mail addresses

Validating ZIP codes

Validating names

Cleaning images

Changing the contrast of an image

Smoothing an image

Brightening an image

Resizing an image

Converting images to different formats

Summary

Chapter 4: Data Visualization

Understanding plots and graphs

Visual analysis goals

Creating index charts

Creating bar charts

Using country as the category

Using decade as the category

Creating stacked graphs

Creating pie charts

Creating scatter charts

Creating histograms

Creating donut charts

Creating bubble charts

Summary

Chapter 5: Statistical Data Analysis Techniques

Working with mean, mode, and median

Calculating the mean

Using simple Java techniques to find mean

Using Java 8 techniques to find mean

Using Google Guava to find mean

Using Apache Commons to find mean

Calculating the median

Using simple Java techniques to find median

Using Apache Commons to find the median

Calculating the mode

Using ArrayLists to find multiple modes

Using a HashMap to find multiple modes

Using Apache Commons to find multiple modes

Standard deviation

Sample size determination

Hypothesis testing

Regression analysis

Using simple linear regression

Using multiple regression

Summary

Chapter 6: Machine Learning

Supervised learning techniques

Decision trees

Decision tree types

Decision tree libraries

Using a decision tree with a book dataset

Testing the book decision tree

Support vector machines

Using an SVM for camping data

Testing individual instances

Bayesian networks

Using a Bayesian network

Unsupervised machine learning

Association rule learning

Using association rule learning to find buying relationships

Reinforcement learning

Summary

Chapter 7: Neural Networks

Training a neural network

Getting started with neural network architectures

Understanding static neural networks

A basic Java example

Understanding dynamic neural networks

Multilayer perceptron networks

Building the model

Evaluating the model

Predicting other values

Saving and retrieving the model

Learning vector quantization

Self-Organizing Maps

Using a SOM

Displaying the SOM results

Additional network architectures and algorithms

The k-Nearest Neighbors algorithm

Instantaneously trained networks

Spiking neural networks

Cascading neural networks

Holographic associative memory

Backpropagation and neural networks

Summary

Chapter 8: Deep Learning

Deeplearning4j architecture

Acquiring and manipulating data

Reading in a CSV file

Configuring and building a model

Using hyperparameters in ND4J

Instantiating the network model

Training a model

Testing a model

Deep learning and regression analysis

Preparing the data

Setting up the class

Reading and preparing the data

Building the model

Evaluating the model

Restricted Boltzmann Machines

Reconstruction in an RBM

Configuring an RBM

Deep autoencoders

Building an autoencoder in DL4J

Configuring the network

Building and training the network

Saving and retrieving a network

Specialized autoencoders

Convolutional networks

Building the model

Evaluating the model

Recurrent Neural Networks

Summary

Chapter 9: Text Analysis

Implementing named entity recognition

Using OpenNLP to perform NER

Identifying location entities

Classifying text

Word2Vec and Doc2Vec

Classifying text by labels

Classifying text by similarity

Understanding tagging and POS

Using OpenNLP to identify POS

Understanding POS tags

Extracting relationships from sentences

Using OpenNLP to extract relationships

Sentiment analysis

Downloading and extracting the Word2Vec model

Building our model and classifying text

Summary

Chapter 10: Visual and Audio Analysis

Text-to-speech

Using FreeTTS

Getting information about voices

Gathering voice information

Understanding speech recognition

Using CMUSphinx to convert speech to text

Obtaining more detail about the words

Extracting text from an image

Using Tess4j to extract text

Identifying faces

Using OpenCV to detect faces

Classifying visual data

Creating a Neuroph Studio project for classifying visual images

Training the model

Summary

Chapter 11: Mathematical and Parallel Techniques for Data Analysis

Implementing basic matrix operations

Using GPUs with DeepLearning4j

Using map-reduce

Using Apache's Hadoop to perform map-reduce

Writing the map method

Writing the reduce method

Creating and executing a new Hadoop job

Various mathematical libraries

Using the jblas API

Using the Apache Commons math API

Using the ND4J API

Using OpenCL

Using Aparapi

Creating an Aparapi application

Using Aparapi for matrix multiplication

Using Java 8 streams

Understanding Java 8 lambda expressions and streams

Using Java 8 to perform matrix multiplication

Using Java 8 to perform map-reduce

Summary

Chapter 12: Bringing It All Together

Defining the purpose and scope of our application

Understanding the application's architecture

Data acquisition using Twitter

Understanding the TweetHandler class

Extracting data for a sentiment analysis model

Building the sentiment model

Processing the JSON input

Cleaning data to improve our results

Removing stop words

Performing sentiment analysis

Analysing the results

Other optional enhancements

Summary

Module 2: Mastering Java for Data Science

Chapter 1: Data Science Using Java

Data science

Machine learning

Supervised learning

Unsupervised learning

Clustering

Dimensionality reduction

Natural Language Processing

Data science process models

CRISP-DM

A running example

Data science in Java

Data science libraries

Data processing libraries

Math and stats libraries

Machine learning and data mining libraries

Text processing

Summary

Chapter 2: Data Processing Toolbox

Standard Java library

Collections

Input/Output

Reading input data

Writing output data

Streaming API

Extensions to the standard library

Apache Commons

Commons Lang

Commons IO

Commons Collections

Other commons modules

Google Guava

AOL Cyclops React

Accessing data

Text data and CSV

Web and HTML

JSON

Databases

DataFrames

Search engine - preparing data

Summary

Chapter 3: Exploratory Data Analysis

Exploratory data analysis in Java

Search engine datasets

Apache Commons Math

Joinery

Interactive Exploratory Data Analysis in Java

JVM languages

Interactive Java

Joinery shell

Summary

Chapter 4: Supervised Learning - Classification and Regression

Classification

Binary classification models

Smile

JSAT

LIBSVM and LIBLINEAR

Encog

Evaluation

Accuracy

Precision, recall, and F1

ROC and AU ROC (AUC)

Result validation

K-fold cross-validation

Training, validation, and testing

Case study - page prediction

Regression

Machine learning libraries for regression

Smile

JSAT

Other libraries

Evaluation

MSE

MAE

Case study - hardware performance

Summary

Chapter 5: Unsupervised Learning - Clustering and Dimensionality Reduction

Dimensionality reduction

Unsupervised dimensionality reduction

Principal Component Analysis

Truncated SVD

Truncated SVD for categorical and sparse data

Random projection

Cluster analysis

Hierarchical methods

K-means

Choosing K in K-Means

DBSCAN

Clustering for supervised learning

Clusters as features

Clustering as dimensionality reduction

Supervised learning via clustering

Evaluation

Manual evaluation

Supervised evaluation

Unsupervised evaluation

Summary

Chapter 6: Working with Text - Natural Language Processing and Information Retrieval

Natural Language Processing and information retrieval

Vector Space Model - Bag of Words and TF-IDF

Vector space model implementation

Indexing and Apache Lucene

Natural Language Processing tools

Stanford CoreNLP

Customizing Apache Lucene

Machine learning for texts

Unsupervised learning for texts

Latent Semantic Analysis

Text clustering

Word embeddings

Supervised learning for texts

Text classification

Learning to rank for information retrieval

Reranking with Lucene

Summary

Chapter 7: Extreme Gradient Boosting

Gradient Boosting Machines and XGBoost

Installing XGBoost

XGBoost in practice

XGBoost for classification

Parameter tuning

Text features

Feature importance

XGBoost for regression

XGBoost for learning to rank

Summary

Chapter 8: Deep Learning with DeepLearning4J

Neural Networks and DeepLearning4J

ND4J - N-dimensional arrays for Java

Neural networks in DeepLearning4J

Convolutional Neural Networks

Deep learning for cats versus dogs

Reading the data

Creating the model

Monitoring the performance

Data augmentation

Running DeepLearning4J on GPU

Summary

Chapter 9: Scaling Data Science

Apache Hadoop

Hadoop MapReduce

Common Crawl

Apache Spark

Link prediction

Reading the DBLP graph

Extracting features from the graph

Node features

Negative sampling

Edge features

Link Prediction with MLlib and XGBoost

Link suggestion

Summary

Chapter 10: Deploying Data Science Models

Microservices

Spring Boot

Search engine service

Online evaluation

A/B testing

Multi-armed bandits

Summary

Bibliography

Index
