Python: Real-World Data Science

Author: Dusty Phillips; Fabrizio Romano; Phuong Vo.T.H; Martin Czygan; Robert Layton; Sebastian Raschka

Publisher: Packt Publishing

Publication year: 2016

E-ISBN: 9781786468413

P-ISBN(Paperback): 9781786465160

Subject: TN919.5 data processing systems and equipment; TP274 data processing, data processing systems; TP301.6 algorithm theory; TP31 computer software; TP312 programming languages, algorithmic languages

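Keyword: programming languages and algorithmic languages; automation technology and computer technology; algorithm theory; computer software; data processing and data processing systems; data processing systems and equipment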

Language: ENG

Description

Unleash the power of Python and its robust data science capabilities

About This Book

• Unleash the power of Python 3 objects
• Learn to use powerful Python libraries for effective data processing and analysis
• Harness the power of Python to analyze data and create insightful predictive models
• Unlock deeper insights into machine learning with this vital guide to cutting-edge predictive analytics

Who This Book Is For

Entry-level analysts who want to enter the data science world will find this course very useful for getting acquainted with Python's data science capabilities for real-world data analysis.

What You Will Learn

• Install and set up Python
• Implement objects in Python by creating classes and defining methods
• Get acquainted with NumPy to use it with arrays and array-oriented computing in data analysis
• Create effective visualizations for presenting your data using Matplotlib
• Process and analyze data using the time series capabilities of pandas
• Interact with different kinds of database systems, such as file, disk format, Mongo, and Redis
• Apply data mining concepts to real-world problems
• Compute on big data, including real-time data from the Internet
• Explore how to use different machine learning models to ask different questions of your data

In Detail

The Python: Real-World Data Science course will take you on a journey to become an efficient data science practitioner by thoroughly understanding the key concepts of Python. This lear

Chapter

Introduction and First Steps – Take a Deep Breath

A proper introduction

Enter the Python

About Python

Portability

Coherence

Developer productivity

An extensive library

Software quality

Software integration

Satisfaction and enjoyment

What are the drawbacks?

Who is using Python today?

Setting up the environment

Python 2 versus Python 3 – the great debate

What you need for this course

Installing Python

Installing IPython

Installing additional packages

How you can run a Python program

Running Python scripts

Running the Python interactive shell

Running Python as a service

Running Python as a GUI application

How is Python code organized

How do we use modules and packages

Python's execution model

Names and namespaces

Scopes

Guidelines on how to write good code

The Python culture

A note on the IDEs

Object-oriented Design

Introducing object-oriented

Objects and classes

Specifying attributes and behaviors

Data describes objects

Behaviors are actions

Hiding details and creating the public interface

Composition

Inheritance

Inheritance provides abstraction

Multiple inheritance

Case study

Objects in Python

Creating Python classes

Adding attributes

Making it do something

Talking to yourself

More arguments

Initializing the object

Explaining yourself

Modules and packages

Organizing the modules

Absolute imports

Relative imports

Organizing module contents

Who can access my data?

Third-party libraries

Case study

When Objects Are Alike

Basic inheritance

Extending built-ins

Overriding and super

Multiple inheritance

The diamond problem

Different sets of arguments

Polymorphism

Abstract base classes

Using an abstract base class

Creating an abstract base class

Demystifying the magic

Case study

Expecting the Unexpected

Raising exceptions

Raising an exception

The effects of an exception

Handling exceptions

The exception hierarchy

Defining our own exceptions

Case study

When to Use Object-oriented Programming

Treat objects as objects

Adding behavior to class data with properties

Properties in detail

Decorators – another way to create properties

Deciding when to use properties

Manager objects

Removing duplicate code

In practice

Case study

Python Data Structures

Empty objects

Tuples and named tuples

Named tuples

Dictionaries

Dictionary use cases

Using defaultdict

Counter

Lists

Sorting lists

Sets

Extending built-ins

Queues

FIFO queues

LIFO queues

Priority queues

Case study

Python Object-oriented Shortcuts

Python built-in functions

The len() function

Reversed

Enumerate

File I/O

Placing it in context

An alternative to method overloading

Default arguments

Variable argument lists

Unpacking arguments

Functions are objects too

Using functions as attributes

Callable objects

Case study

Strings and Serialization

Strings

String manipulation

String formatting

Escaping braces

Keyword arguments

Container lookups

Object lookups

Making it look right

Strings are Unicode

Converting bytes to text

Converting text to bytes

Mutable byte strings

Regular expressions

Matching patterns

Matching a selection of characters

Escaping characters

Matching multiple characters

Grouping patterns together

Getting information from regular expressions

Making repeated regular expressions efficient

Serializing objects

Customizing pickles

Serializing web objects

Case study

The Iterator Pattern

Design patterns in brief

Iterators

The iterator protocol

Comprehensions

List comprehensions

Set and dictionary comprehensions

Generator expressions

Generators

Yield items from another iterable

Coroutines

Back to log parsing

Closing coroutines and throwing exceptions

The relationship between coroutines, generators, and functions

Case study

Python Design Patterns I

The decorator pattern

A decorator example

Decorators in Python

The observer pattern

An observer example

The strategy pattern

A strategy example

Strategy in Python

The state pattern

A state example

State versus strategy

State transition as coroutines

The singleton pattern

Singleton implementation

The template pattern

A template example

Python Design Patterns II

The adapter pattern

The facade pattern

The flyweight pattern

The command pattern

The abstract factory pattern

The composite pattern

Testing Object-oriented Programs

Why test?

Test-driven development

Unit testing

Assertion methods

Reducing boilerplate and cleaning up

Organizing and running tests

Ignoring broken tests

Testing with py.test

One way to do setup and cleanup

A completely different way to set up variables

Skipping tests with py.test

Imitating expensive objects

How much testing is enough?

Case study

Implementing it

Concurrency

Threads

The many problems with threads

Shared memory

The global interpreter lock

Thread overhead

Multiprocessing

Multiprocessing pools

Queues

The problems with multiprocessing

Futures

AsyncIO

AsyncIO in action

Reading an AsyncIO future

AsyncIO for networking

Using executors to wrap blocking code

Streams

Executors

Case study

Introducing Data Analysis and Libraries

Data analysis and processing

An overview of the libraries in data analysis

Python libraries in data analysis

NumPy

pandas

Matplotlib

PyMongo

The scikit-learn library

NumPy Arrays and Vectorized Computation

NumPy arrays

Data types

Array creation

Indexing and slicing

Fancy indexing

Numerical operations on arrays

Array functions

Data processing using arrays

Loading and saving data

Saving an array

Loading an array

Linear algebra with NumPy

NumPy random numbers

Data Analysis with pandas

An overview of the pandas package

The pandas data structure

Series

The DataFrame

The essential basic functionality

Reindexing and altering labels

Head and tail

Binary operations

Functional statistics

Function application

Sorting

Indexing and selecting data

Computational tools

Working with missing data

Advanced uses of pandas for data analysis

Hierarchical indexing

The Panel data

Data Visualization

The matplotlib API primer

Line properties

Figures and subplots

Exploring plot types

Scatter plots

Bar plots

Contour plots

Histogram plots

Legends and annotations

Plotting functions with pandas

Additional Python data visualization tools

Bokeh

MayaVi

Time Series

Time series primer

Working with date and time objects

Resampling time series

Downsampling time series data

Upsampling time series data

Timedeltas

Time series plotting

Interacting with Databases

Interacting with data in text format

Reading data from text format

Writing data to text format

Interacting with data in binary format

HDF5

Interacting with data in MongoDB

Interacting with data in Redis

The simple value

List

Set

Ordered set

Data Analysis Application Examples

Data munging

Cleaning data

Filtering

Merging data

Reshaping data

Data aggregation

Grouping data

Getting Started with Data Mining

Introducing data mining

A simple affinity analysis example

What is affinity analysis?

Product recommendations

Loading the dataset with NumPy

Implementing a simple ranking of rules

Ranking to find the best rules

A simple classification example

What is classification?

Loading and preparing the dataset

Implementing the OneR algorithm

Testing the algorithm

Classifying with scikit-learn Estimators

scikit-learn estimators

Nearest neighbors

Distance metrics

Loading the dataset

Moving towards a standard workflow

Running the algorithm

Setting parameters

Preprocessing using pipelines

An example

Standard preprocessing

Putting it all together

Pipelines

Predicting Sports Winners with Decision Trees

Loading the dataset

Collecting the data

Using pandas to load the dataset

Cleaning up the dataset

Extracting new features

Decision trees

Parameters in decision trees

Using decision trees

Sports outcome prediction

Putting it all together

Random forests

How do ensembles work?

Parameters in Random forests

Applying Random forests

Engineering new features

Recommending Movies Using Affinity Analysis

Affinity analysis

Algorithms for affinity analysis

Choosing parameters

The movie recommendation problem

Obtaining the dataset

Loading with pandas

Sparse data formats

The Apriori implementation

The Apriori algorithm

Implementation

Extracting association rules

Evaluation

Extracting Features with Transformers

Feature extraction

Representing reality in models

Common feature patterns

Creating good features

Feature selection

Selecting the best individual features

Feature creation

Creating your own transformer

The transformer API

Implementation details

Unit testing

Putting it all together

Social Media Insight Using Naive Bayes

Disambiguation

Downloading data from a social network

Loading and classifying the dataset

Creating a replicable dataset from Twitter

Text transformers

Bag-of-words

N-grams

Other features

Naive Bayes

Bayes' theorem

Naive Bayes algorithm

How it works

Application

Extracting word counts

Converting dictionaries to a matrix

Training the Naive Bayes classifier

Putting it all together

Evaluation using the F1-score

Getting useful features from models

Discovering Accounts to Follow Using Graph Mining

Loading the dataset

Classifying with an existing model

Getting follower information from Twitter

Building the network

Creating a graph

Creating a similarity graph

Finding subgraphs

Connected components

Optimizing criteria

Beating CAPTCHAs with Neural Networks

Artificial neural networks

An introduction to neural networks

Creating the dataset

Drawing basic CAPTCHAs

Splitting the image into individual letters

Creating a training dataset

Adjusting our training dataset to our methodology

Training and classifying

Back propagation

Predicting words

Improving accuracy using a dictionary

Ranking mechanisms for words

Putting it all together

Authorship Attribution

Attributing documents to authors

Applications and use cases

Attributing authorship

Getting the data

Function words

Counting function words

Classifying with function words

Support vector machines

Classifying with SVMs

Kernels

Character n-grams

Extracting character n-grams

Using the Enron dataset

Accessing the Enron dataset

Creating a dataset loader

Putting it all together

Evaluation

Clustering News Articles

Obtaining news articles

Using a Web API to get data

Reddit as a data source

Getting the data

Extracting text from arbitrary websites

Finding the stories in arbitrary websites

Putting it all together

Grouping news articles

The k-means algorithm

Evaluating the results

Extracting topic information from clusters

Using clustering algorithms as transformers

Clustering ensembles

Evidence accumulation

How it works

Implementation

Online learning

An introduction to online learning

Implementation

Classifying Objects in Images Using Deep Learning

Object classification

Application scenario and goals

Use cases

Deep neural networks

Intuition

Implementation

An introduction to Theano

An introduction to Lasagne

Implementing neural networks with nolearn

GPU optimization

When to use GPUs for computation

Running our code on a GPU

Setting up the environment

Application

Getting the data

Creating the neural network

Putting it all together

Working with Big Data

Big data

Application scenario and goals

MapReduce

Intuition

A word count example

Hadoop MapReduce

Application

Getting the data

Naive Bayes prediction

The mrjob package

Extracting the blog posts

Training Naive Bayes

Putting it all together

Training on Amazon's EMR infrastructure

Next Steps…

Chapter 1 – Getting Started with Data Mining

Scikit-learn tutorials

Extending the IPython Notebook

Chapter 2 – Classifying with scikit-learn Estimators

More complex pipelines

Comparing classifiers

Chapter 3 – Predicting Sports Winners with Decision Trees

More on pandas

Chapter 4 – Recommending Movies Using Affinity Analysis

The Eclat algorithm

Chapter 5 – Extracting Features with Transformers

Vowpal Wabbit

Chapter 6 – Social Media Insight Using Naive Bayes

Natural language processing and part-of-speech tagging

Chapter 7 – Discovering Accounts to Follow Using Graph Mining

More complex algorithms

Chapter 8 – Beating CAPTCHAs with Neural Networks

Deeper networks

Reinforcement learning

Chapter 9 – Authorship Attribution

Local n-grams

Chapter 10 – Clustering News Articles

Real-time clusterings

Chapter 11 – Classifying Objects in Images Using Deep Learning

Keras and Pylearn2

Mahotas

Chapter 12 – Working with Big Data

Courses on Hadoop

Pydoop

Recommendation engine

More resources

Giving Computers the Ability to Learn from Data

How to transform data into knowledge

The three different types of machine learning

Making predictions about the future with supervised learning

Classification for predicting class labels

Regression for predicting continuous outcomes

Solving interactive problems with reinforcement learning

Discovering hidden structures with unsupervised learning

Finding subgroups with clustering

Dimensionality reduction for data compression

An introduction to the basic terminology and notations

A roadmap for building machine learning systems

Preprocessing – getting data into shape

Training and selecting a predictive model

Evaluating models and predicting unseen data instances

Using Python for machine learning

Training Machine Learning Algorithms for Classification

Artificial neurons – a brief glimpse into the early history of machine learning

Implementing a perceptron learning algorithm in Python

Training a perceptron model on the Iris dataset

Adaptive linear neurons and the convergence of learning

Minimizing cost functions with gradient descent

Implementing an Adaptive Linear Neuron in Python

Large scale machine learning and stochastic gradient descent

A Tour of Machine Learning Classifiers Using scikit-learn

Choosing a classification algorithm

First steps with scikit-learn

Training a perceptron via scikit-learn

Modeling class probabilities via logistic regression

Logistic regression intuition and conditional probabilities

Learning the weights of the logistic cost function

Training a logistic regression model with scikit-learn

Tackling overfitting via regularization

Maximum margin classification with support vector machines

Maximum margin intuition

Dealing with the nonlinearly separable case using slack variables

Alternative implementations in scikit-learn

Solving nonlinear problems using a kernel SVM

Using the kernel trick to find separating hyperplanes in higher dimensional space

Decision tree learning

Maximizing information gain – getting the most bang for the buck

Building a decision tree

Combining weak to strong learners via random forests

K-nearest neighbors – a lazy learning algorithm

Building Good Training Sets – Data Preprocessing

Dealing with missing data

Eliminating samples or features with missing values

Imputing missing values

Understanding the scikit-learn estimator API

Handling categorical data

Mapping ordinal features

Encoding class labels

Performing one-hot encoding on nominal features

Partitioning a dataset in training and test sets

Bringing features onto the same scale

Selecting meaningful features

Sparse solutions with L1 regularization

Sequential feature selection algorithms

Assessing feature importance with random forests

Compressing Data via Dimensionality Reduction

Unsupervised dimensionality reduction via principal component analysis

Total and explained variance

Feature transformation

Principal component analysis in scikit-learn

Supervised data compression via linear discriminant analysis

Computing the scatter matrices

Selecting linear discriminants for the new feature subspace

Projecting samples onto the new feature space

LDA via scikit-learn

Using kernel principal component analysis for nonlinear mappings

Kernel functions and the kernel trick

Implementing a kernel principal component analysis in Python

Example 1 – separating half-moon shapes

Example 2 – separating concentric circles

Projecting new data points

Kernel principal component analysis in scikit-learn

Learning Best Practices for Model Evaluation and Hyperparameter Tuning

Streamlining workflows with pipelines

Loading the Breast Cancer Wisconsin dataset

Combining transformers and estimators in a pipeline

Using k-fold cross-validation to assess model performance

The holdout method

K-fold cross-validation

Debugging algorithms with learning and validation curves

Diagnosing bias and variance problems with learning curves

Addressing overfitting and underfitting with validation curves

Fine-tuning machine learning models via grid search

Tuning hyperparameters via grid search

Algorithm selection with nested cross-validation

Looking at different performance evaluation metrics

Reading a confusion matrix

Optimizing the precision and recall of a classification model

Plotting a receiver operating characteristic

The scoring metrics for multiclass classification

Combining Different Models for Ensemble Learning

Learning with ensembles

Implementing a simple majority vote classifier

Combining different algorithms for classification with majority vote

Evaluating and tuning the ensemble classifier

Bagging – building an ensemble of classifiers from bootstrap samples

Leveraging weak learners via adaptive boosting

Predicting Continuous Target Variables with Regression Analysis

Introducing a simple linear regression model

Exploring the Housing Dataset

Visualizing the important characteristics of a dataset

Implementing an ordinary least squares linear regression model

Solving regression for regression parameters with gradient descent

Estimating the coefficient of a regression model via scikit-learn

Fitting a robust regression model using RANSAC

Evaluating the performance of linear regression models

Using regularized methods for regression

Turning a linear regression model into a curve – polynomial regression

Modeling nonlinear relationships in the Housing Dataset

Dealing with nonlinear relationships using random forests

Decision tree regression

Random forest regression

Reflect and Test Yourself! Answers

Module 2: Data Analysis

Chapter 1: Introducing Data Analysis and Libraries

Chapter 2: Object-oriented Design

Chapter 3: Data Analysis with pandas

Chapter 4: Data Visualization

Chapter 5: Time Series

Chapter 6: Interacting with Databases

Chapter 7: Data Analysis Application Examples

Module 3: Data Mining

Chapter 1: Getting Started with Data Mining

Chapter 2: Classifying with scikit-learn Estimators

Chapter 3: Predicting Sports Winners with Decision Trees

Chapter 4: Recommending Movies Using Affinity Analysis

Chapter 5: Extracting Features with Transformers

Chapter 6: Social Media Insight Using Naive Bayes

Chapter 7: Discovering Accounts to Follow Using Graph Mining

Chapter 8: Beating CAPTCHAs with Neural Networks

Chapter 9: Authorship Attribution

Chapter 10: Clustering News Articles

Chapter 11: Classifying Objects in Images Using Deep Learning

Chapter 12: Working with Big Data

Module 4: Machine Learning

Chapter 1: Giving Computers the Ability to Learn from Data

Chapter 2: Training Machine Learning Algorithms for Classification

Chapter 3: A Tour of Machine Learning Classifiers Using scikit-learn

Chapter 4: Building Good Training Sets – Data Preprocessing

Chapter 5: Compressing Data via Dimensionality Reduction

Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Tuning

Chapter 7: Combining Different Models for Ensemble Learning

Chapter 8: Predicting Continuous Target Variables with Regression Analysis
