Mastering Java for Data Science

Author: Alexey Grigorev  

Publisher: Packt Publishing

Publication year: 2017

E-ISBN: 9781785887390

P-ISBN(Paperback): 9781782174271

Subject: TP Automation Technology, Computer Technology; TP39 Computer Application

Keyword: Computer applications; automation technology, computer technology

Language: ENG

Description

Use Java to create a diverse range of Data Science applications and bring Data Science into production

About This Book

• An overview of modern Data Science and Machine Learning libraries available in Java
• Coverage of a broad set of topics, going from the basics of Machine Learning to Deep Learning and Big Data frameworks
• Easy-to-follow illustrations and the running example of building a search engine

Who This Book Is For

This book is intended for software engineers who are comfortable with developing Java applications and are familiar with the basic concepts of data science. It will also be useful for data scientists who do not yet know Java but want or need to learn it. If you want to build efficient data science applications and bring them into the enterprise environment without changing the existing stack, this book is for you!

What You Will Learn

• Get a solid understanding of the data processing toolbox available in Java
• Explore the data science ecosystem available in Java
• Find out how to approach different machine learning problems with Java
• Process unstructured information such as natural language text or images
• Create your own search engine
• Get state-of-the-art performance with XGBoost
• Learn how to build deep neural networks with DeepLearning4j
• Build applications that scale and process large amounts of data
• Deploy data science models to production and evaluate their performance

In Detail

Java is the most popular progra

Chapter

Chapter 1: Data Science Using Java

Data science

Machine learning

Supervised learning

Unsupervised learning

Clustering

Dimensionality reduction

Natural Language Processing

Data science process models

CRISP-DM

A running example

Data science in Java

Data science libraries

Data processing libraries

Math and stats libraries

Machine learning and data mining libraries

Text processing

Summary

Chapter 2: Data Processing Toolbox

Standard Java library

Collections

Input/Output

Reading input data

Writing output data

Streaming API

Extensions to the standard library

Apache Commons

Commons Lang

Commons IO

Commons Collections

Other commons modules

Google Guava

AOL Cyclops React

Accessing data

Text data and CSV

Web and HTML

JSON

Databases

DataFrames

Search engine - preparing data

Summary

Chapter 3: Exploratory Data Analysis

Exploratory data analysis in Java

Search engine datasets

Apache Commons Math

Joinery

Interactive Exploratory Data Analysis in Java

JVM languages

Interactive Java

Joinery shell

Summary

Chapter 4: Supervised Learning - Classification and Regression

Classification

Binary classification models

Smile

JSAT

LIBSVM and LIBLINEAR

Encog

Evaluation

Accuracy

Precision, recall, and F1

ROC and AU ROC (AUC)

Result validation

K-fold cross-validation

Training, validation, and testing

Case study - page prediction

Regression

Machine learning libraries for regression

Smile

JSAT

Other libraries

Evaluation

MSE

MAE

Case study - hardware performance

Summary

Chapter 5: Unsupervised Learning - Clustering and Dimensionality Reduction

Dimensionality reduction

Unsupervised dimensionality reduction

Principal Component Analysis

Truncated SVD

Truncated SVD for categorical and sparse data

Random projection

Cluster analysis

Hierarchical methods

K-means

Choosing K in K-Means

DBSCAN

Clustering for supervised learning

Clusters as features

Clustering as dimensionality reduction

Supervised learning via clustering

Evaluation

Manual evaluation

Supervised evaluation

Unsupervised Evaluation

Summary

Chapter 6: Working with Text - Natural Language Processing and Information Retrieval

Natural Language Processing and information retrieval

Vector Space Model - Bag of Words and TF-IDF

Vector space model implementation

Indexing and Apache Lucene

Natural Language Processing tools

Stanford CoreNLP

Customizing Apache Lucene

Machine learning for texts

Unsupervised learning for texts

Latent Semantic Analysis

Text clustering

Word embeddings

Supervised learning for texts 

Text classification

Learning to rank for information retrieval

Reranking with Lucene

Summary

Chapter 7: Extreme Gradient Boosting

Gradient Boosting Machines and XGBoost

Installing XGBoost

XGBoost in practice

XGBoost for classification

Parameter tuning

Text features

Feature importance

XGBoost for regression

XGBoost for learning to rank

Summary

Chapter 8: Deep Learning with DeepLearning4J

Neural Networks and DeepLearning4J

ND4J - N-dimensional arrays for Java

Neural networks in DeepLearning4J

Convolutional Neural Networks

Deep learning for cats versus dogs

Reading the data

Creating the model

Monitoring the performance

Data augmentation

Running DeepLearning4J on GPU

Summary

Chapter 9: Scaling Data Science

Apache Hadoop

Hadoop MapReduce

Common Crawl

Apache Spark

Link prediction

Reading the DBLP graph

Extracting features from the graph

Node features

Negative sampling

Edge features

Link Prediction with MLlib and XGBoost

Link suggestion

Summary

Chapter 10: Deploying Data Science Models

Microservices

Spring Boot

Search engine service

Online evaluation

A/B testing

Multi-armed bandits

Summary

Index
