Python: Real-World Data Science

Author: Dusty Phillips; Fabrizio Romano; Phuong Vo.T.H; Martin Czygan; Robert Layton; Sebastian Raschka

Publisher: Packt Publishing

Publication year: 2016

E-ISBN: 9781786468413

P-ISBN(Paperback): 9781786465160

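Subject: TN919.5 data processing systems and equipment; TP274 data processing, data processing systems; TP301.6 algorithm theory; TP31 computer software; TP312 programming languages, algorithmic languages

Keyword: programming languages, algorithmic languages; automation technology, computer technology; algorithm theory; computer software; data processing, data processing systems; data processing systems and equipment

Language: ENG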

Description

Unleash the power of Python and its robust data science capabilities

About This Book

• Unleash the power of Python 3 objects
• Learn to use powerful Python libraries for effective data processing and analysis
• Harness the power of Python to analyze data and create insightful predictive models
• Unlock deeper insights into machine learning with this vital guide to cutting-edge predictive analytics

Who This Book Is For

Entry-level analysts who want to enter the data science world will find this course very useful for getting acquainted with Python's data science capabilities for real-world data analysis.

What You Will Learn

• Install and set up Python
• Implement objects in Python by creating classes and defining methods
• Get acquainted with NumPy to use it with arrays and array-oriented computing in data analysis
• Create effective visualizations for presenting your data using Matplotlib
• Process and analyze data using the time series capabilities of pandas
• Interact with different kinds of database systems, such as file, disk format, Mongo, and Redis
• Apply data mining concepts to real-world problems
• Compute on big data, including real-time data from the Internet
• Explore how to use different machine learning models to ask different questions of your data

In Detail

The Python: Real-World Data Science course will take you on a journey to become an efficient data science practitioner by thoroughly understanding the key concepts of Python. This lear

Chapter

Cover

Meet Your Course Guide

Table of Contents

Introduction and First Steps – Take a Deep Breath

A proper introduction

Enter the Python

About Python

Portability

Coherence

Developer productivity

An extensive library

Software quality

Software integration

Satisfaction and enjoyment

What are the drawbacks?

Who is using Python today?

Setting up the environment

Python 2 versus Python 3 – the great debate

What you need for this course

Installing Python

Installing IPython

Installing additional packages

How you can run a Python program

Running Python scripts

Running the Python interactive shell

Running Python as a service

Running Python as a GUI application

How is Python code organized

How do we use modules and packages

Python's execution model

Names and namespaces

Scopes

Guidelines on how to write good code

The Python culture

A note on the IDEs

Object-oriented Design

Introducing object-oriented

Objects and classes

Specifying attributes and behaviors

Data describes objects

Behaviors are actions

Hiding details and creating the public interface

Composition

Inheritance

Inheritance provides abstraction

Multiple inheritance

Case study

Objects in Python

Creating Python classes

Adding attributes

Making it do something

Talking to yourself

More arguments

Initializing the object

Explaining yourself

Modules and packages

Organizing the modules

Absolute imports

Relative imports

Organizing module contents

Who can access my data?

Third-party libraries

Case study

When Objects Are Alike

Basic inheritance

Extending built-ins

Overriding and super

Multiple inheritance

The diamond problem

Different sets of arguments

Polymorphism

Abstract base classes

Using an abstract base class

Creating an abstract base class

Demystifying the magic

Case study

Expecting the Unexpected

Raising exceptions

Raising an exception

The effects of an exception

Handling exceptions

The exception hierarchy

Defining our own exceptions

Case study

When to Use Object-oriented Programming

Treat objects as objects

Adding behavior to class data with properties

Properties in detail

Decorators – another way to create properties

Deciding when to use properties

Manager objects

Removing duplicate code

In practice

Case study

Python Data Structures

Empty objects

Tuples and named tuples

Named tuples

Dictionaries

Dictionary use cases

Using defaultdict

Counter

Lists

Sorting lists

Sets

Extending built-ins

Queues

FIFO queues

LIFO queues

Priority queues

Case study

Python Object-oriented Shortcuts

Python built-in functions

The len() function

Reversed

Enumerate

File I/O

Placing it in context

An alternative to method overloading

Default arguments

Variable argument lists

Unpacking arguments

Functions are objects too

Using functions as attributes

Callable objects

Case study

Strings and Serialization

Strings

String manipulation

String formatting

Escaping braces

Keyword arguments

Container lookups

Object lookups

Making it look right

Strings are Unicode

Converting bytes to text

Converting text to bytes

Mutable byte strings

Regular expressions

Matching patterns

Matching a selection of characters

Escaping characters

Matching multiple characters

Grouping patterns together

Getting information from regular expressions

Making repeated regular expressions efficient

Serializing objects

Customizing pickles

Serializing web objects

Case study

The Iterator Pattern

Design patterns in brief

Iterators

The iterator protocol

Comprehensions

List comprehensions

Set and dictionary comprehensions

Generator expressions

Generators

Yield items from another iterable

Coroutines

Back to log parsing

Closing coroutines and throwing exceptions

The relationship between coroutines, generators, and functions

Case study

Python Design Patterns I

The decorator pattern

A decorator example

Decorators in Python

The observer pattern

An observer example

The strategy pattern

A strategy example

Strategy in Python

The state pattern

A state example

State versus strategy

State transition as coroutines

The singleton pattern

Singleton implementation

The template pattern

A template example

Python Design Patterns II

The adapter pattern

The facade pattern

The flyweight pattern

The command pattern

The abstract factory pattern

The composite pattern

Testing Object-oriented Programs

Why test?

Test-driven development

Unit testing

Assertion methods

Reducing boilerplate and cleaning up

Organizing and running tests

Ignoring broken tests

Testing with py.test

One way to do setup and cleanup

A completely different way to set up variables

Skipping tests with py.test

Imitating expensive objects

How much testing is enough?

Case study

Implementing it

Concurrency

Threads

The many problems with threads

Shared memory

The global interpreter lock

Thread overhead

Multiprocessing

Multiprocessing pools

Queues

The problems with multiprocessing

Futures

AsyncIO

AsyncIO in action

Reading an AsyncIO future

AsyncIO for networking

Using executors to wrap blocking code

Streams

Executors

Case study

Introducing Data Analysis and Libraries

Data analysis and processing

An overview of the libraries in data analysis

Python libraries in data analysis

NumPy

pandas

Matplotlib

PyMongo

The scikit-learn library

NumPy Arrays and Vectorized Computation

NumPy arrays

Data types

Array creation

Indexing and slicing

Fancy indexing

Numerical operations on arrays

Array functions

Data processing using arrays

Loading and saving data

Saving an array

Loading an array

Linear algebra with NumPy

NumPy random numbers

Data Analysis with pandas

An overview of the pandas package

The pandas data structure

Series

The DataFrame

The essential basic functionality

Reindexing and altering labels

Head and tail

Binary operations

Functional statistics

Function application

Sorting

Indexing and selecting data

Computational tools

Working with missing data

Advanced uses of pandas for data analysis

Hierarchical indexing

The Panel data

Data Visualization

The matplotlib API primer

Line properties

Figures and subplots

Exploring plot types

Scatter plots

Bar plots

Contour plots

Histogram plots

Legends and annotations

Plotting functions with pandas

Additional Python data visualization tools

Bokeh

MayaVi

Time Series

Time series primer

Working with date and time objects

Resampling time series

Downsampling time series data

Upsampling time series data

Timedeltas

Time series plotting

Interacting with Databases

Interacting with data in text format

Reading data from text format

Writing data to text format

Interacting with data in binary format

HDF5

Interacting with data in MongoDB

Interacting with data in Redis

The simple value

List

Set

Ordered set

Data Analysis Application Examples

Data munging

Cleaning data

Filtering

Merging data

Reshaping data

Data aggregation

Grouping data

Getting Started with Data Mining

Introducing data mining

A simple affinity analysis example

What is affinity analysis?

Product recommendations

Loading the dataset with NumPy

Implementing a simple ranking of rules

Ranking to find the best rules

A simple classification example

What is classification?

Loading and preparing the dataset

Implementing the OneR algorithm

Testing the algorithm

Classifying with scikit-learn Estimators

scikit-learn estimators

Nearest neighbors

Distance metrics

Loading the dataset

Moving towards a standard workflow

Running the algorithm

Setting parameters

Preprocessing using pipelines

An example

Standard preprocessing

Putting it all together

Pipelines

Predicting Sports Winners with Decision Trees

Loading the dataset

Collecting the data

Using pandas to load the dataset

Cleaning up the dataset

Extracting new features

Decision trees

Parameters in decision trees

Using decision trees

Sports outcome prediction

Putting it all together

Random forests

How do ensembles work?

Parameters in Random forests

Applying Random forests

Engineering new features

Recommending Movies Using Affinity Analysis

Affinity analysis

Algorithms for affinity analysis

Choosing parameters

The movie recommendation problem

Obtaining the dataset

Loading with pandas

Sparse data formats

The Apriori implementation

The Apriori algorithm

Implementation

Extracting association rules

Evaluation

Extracting Features with Transformers

Feature extraction

Representing reality in models

Common feature patterns

Creating good features

Feature selection

Selecting the best individual features

Feature creation

Creating your own transformer

The transformer API

Implementation details

Unit testing

Putting it all together

Social Media Insight Using Naive Bayes

Disambiguation

Downloading data from a social network

Loading and classifying the dataset

Creating a replicable dataset from Twitter

Text transformers

Bag-of-words

N-grams

Other features

Naive Bayes

Bayes' theorem

Naive Bayes algorithm

How it works

Application

Extracting word counts

Converting dictionaries to a matrix

Training the Naive Bayes classifier

Putting it all together

Evaluation using the F1-score

Getting useful features from models

Discovering Accounts to Follow Using Graph Mining

Loading the dataset

Classifying with an existing model

Getting follower information from Twitter

Building the network

Creating a graph

Creating a similarity graph

Finding subgraphs

Connected components

Optimizing criteria

Beating CAPTCHAs with Neural Networks

Artificial neural networks

An introduction to neural networks

Creating the dataset

Drawing basic CAPTCHAs

Splitting the image into individual letters

Creating a training dataset

Adjusting our training dataset to our methodology

Training and classifying

Back propagation

Predicting words

Improving accuracy using a dictionary

Ranking mechanisms for words

Putting it all together

Authorship Attribution

Attributing documents to authors

Applications and use cases

Attributing authorship

Getting the data

Function words

Counting function words

Classifying with function words

Support vector machines

Classifying with SVMs

Kernels

Character n-grams

Extracting character n-grams

Using the Enron dataset

Accessing the Enron dataset

Creating a dataset loader

Putting it all together

Evaluation

Clustering News Articles

Obtaining news articles

Using a Web API to get data

Reddit as a data source

Getting the data

Extracting text from arbitrary websites

Finding the stories in arbitrary websites

Putting it all together

Grouping news articles

The k-means algorithm

Evaluating the results

Extracting topic information from clusters

Using clustering algorithms as transformers

Clustering ensembles

Evidence accumulation

How it works

Implementation

Online learning

An introduction to online learning

Implementation

Classifying Objects in Images Using Deep Learning

Object classification

Application scenario and goals

Use cases

Deep neural networks

Intuition

Implementation

An introduction to Theano

An introduction to Lasagne

Implementing neural networks with nolearn

GPU optimization

When to use GPUs for computation

Running our code on a GPU

Setting up the environment

Application

Getting the data

Creating the neural network

Putting it all together

Working with Big Data

Big data

Application scenario and goals

MapReduce

Intuition

A word count example

Hadoop MapReduce

Application

Getting the data

Naive Bayes prediction

The mrjob package

Extracting the blog posts

Training Naive Bayes

Putting it all together

Training on Amazon's EMR infrastructure

Next Steps…

Chapter 1 – Getting Started with Data Mining

Scikit-learn tutorials

Extending the IPython Notebook

Chapter 2 – Classifying with scikit-learn Estimators

More complex pipelines

Comparing classifiers

Chapter 3 – Predicting Sports Winners with Decision Trees