A summary of tools for data science for Python

Author: Dave Kuhlman
Contact: dkuhlman (at) davekuhlman (dot) org
Address:
http://www.davekuhlman.org
Revision: 1.0.1
Date: June 14, 2018
Copyright: Copyright (c) 2018 Dave Kuhlman. All Rights Reserved. This software is subject to the provisions of the MIT License http://www.opensource.org/licenses/mit-license.php.
Abstract: This document attempts to give a survey of data science tools for Python programming, along with brief introductions to help you get started with some of those tools.

1   Introduction and preliminaries

In this document I'll try to describe and summarize some significant tools that are available to Python programmers for data science, numerical processing, statistics, and visualizing numerical data. For each tool or package, I'll also try to give a brief overview of:

All these packages are available in the Anaconda distribution of Python, which makes Anaconda a very good option for data analytics and visualization. See:

It's likely that they are also available at http://pypi.python.org and can be installed with pip. If you plan on doing some exploration (and do not want to use the Anaconda distribution), you will want to consider using virtualenv (https://virtualenv.pypa.io/en/stable/) and, for even more convenience in trying out various packages and configurations, look at virtualenvwrapper (https://virtualenvwrapper.readthedocs.io/en/latest/).

More information:

Many of the examples in this document use the following, somewhat standard, import statements:

import numpy as np
import scipy as sp
import pandas as pd

2   Some helpers

2.1   ipython

IPython is an enhanced interactive Python shell. It has tab completion, gives more convenient access to help for Python modules and objects, enables you to edit and rerun previous commands, and much more.

For more information, see: https://ipython.org.

Anaconda ships with QtConsole that contains IPython for even more convenience.

2.1.1   IPython profiles

If you use IPython, then consider creating a data science profile. Use something like this:

$ ipython profile create datasci

Then, consider putting something like the following in ~/.ipython/profile_datasci/startup/50-config.py:

import sys
import numpy as np
import scipy as sp

def pdir(obj):
    """Print information about obj, including `dir(obj)`."""
    if isinstance(obj, type):
        print('class: {}'.format(obj.__name__))
    else:
        print('instance class name: {}'.format(obj.__class__.__name__))
    if obj.__doc__:
        print('doc string: {}'.format(obj.__doc__))
    else:
        print('doc string: no doc string')
    print(dir(obj))

def read_file_contents(filename):
    with open(filename, 'r') as infile:
        content = infile.read()
    return content

You can have multiple startup files. See the startup/README file in your profile directory.

Also, consider doing some customization in ~/.ipython/profile_datasci/ipython_config.py.

And, in order to use that profile, start IPython with this:

$ ipython --profile=datasci

You can find more help with profiles by running something like the following:

$ ipython help profile

Or, see the following: http://ipython.readthedocs.io/en/stable/config/intro.html#profiles

2.1.2   Getting (interactive) help and docs

Inside the standard Python interactive shell, you can get help on some_object with this:

>>> help(some_object)

Inside the IPython interactive shell, you can use the above, or you can do:

In [9]: import scipy.fftpack
In [10]: scipy.fftpack?
In [11]:
In [11]: from scipy import fftpack
In [12]: fftpack?
In [13]: fftpack.fft?

You can use pydoc to get help at the command line. For example:

$ pydoc numpy.arange

You can also use pydoc to run an HTTP server, and view the documentation in a Web browser. Do the following for help with that:

$ pydoc --help

And, of course, documentation is available for the Scipy suite of tools at: http://www.scipy.org.

2.2   Installing the tools

Unless otherwise noted, each of the tools described in this document can be installed with pip install ... (the standard Python install tool) or, for those who are using the Anaconda Python distribution, with conda install ....
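Before installing anything, it can help to see which of these packages are already present. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_versions is made up for this illustration and is not part of any of the tools discussed here:

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Report on a few of the packages surveyed in this document.
for name, version in installed_versions(["numpy", "scipy", "pandas"]).items():
    print("{:10s} {}".format(name, version or "not installed"))
```

Anything reported as "not installed" can then be added with pip install ... or conda install ....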

2.2.1   pip and virtualenv

If you use pip, I'd recommend using virtualenv, at the least, and even virtualenvwrapper, for extra convenience and flexibility. virtualenv enables you to install Python packages (and therefore, the tools discussed in this document) in a separate environment, separate from your standard Python installation, and without polluting that standard installation. Since that separate installation is in its own directory, you can remove it by simply deleting that directory. virtualenvwrapper extends virtualenv by enabling you to create, manage, and switch between different virtualenv environments easily. For example, you might want to create and switch (1) between one virtualenv for text processing and another for data science or (2) between one installation for Python 2 and another for Python 3. See https://virtualenv.pypa.io/en/stable/ and https://virtualenvwrapper.readthedocs.io/en/latest/.

2.2.2   Anaconda

The Anaconda installation of Python provides most of the tools discussed in this document in the standard Anaconda installation. Additional tools can be installed with conda install ..., and the installation can be kept up-to-date with conda update --all. In the event that you need a Python package that is not provided by Anaconda, you can use pip.

2.2.3   Other Python distributions for data science

For more options on installing Python with a slant toward data science and scientific programming (but much else besides), see: https://www.scipy.org/install.html.

3   Analytics

3.1   Numpy

Help with Numpy:

There are (at least) three aspects to Numpy:

  • Primitive Numpy numeric types or scalars, for example: np.int32, np.int64, np.float32, np.float64, etc. See the following for information on these primitive types and others: https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html.

  • Array objects (instances of np.ndarray) along with ways to deal with them.

  • Operations on Numpy arrays -- For information on these, see the Numpy reference manual: https://docs.scipy.org/doc/numpy/reference/index.html. Here is a quick summary:

    • Array creation routines -- Create arrays of different kinds, e.g. all ones, all zeros, identity, from an existing array, as a copy of an array, etc.

    • Array manipulation routines -- Routines that reshape an array, transpose an array, change the number of dimensions, join arrays (concatenate, stack, etc.), split arrays, tile arrays (create a new array by repeating an array), etc.

    • Binary operations -- Logical binary operations on arrays, packing arrays into bits, bit-shifting operations, etc.

    • String operations

    • C-Types Foreign Function Interface (numpy.ctypeslib)

    • Datetime Support Functions

    • Data type routines

    • Optionally Scipy-accelerated routines (numpy.dual) -- Routines possibly accelerated by Scipy, but available in Numpy if Scipy is not installed. For example, routines for eigenvalues, Fourier transforms, solving linear equations, etc. Use:

      >>> from numpy import dual

    • Mathematical functions with automatic domain (numpy.emath) -- Functions such as sqrt and log that return complex results where the real-valued versions would be undefined.
      
    • Floating point error handling

    • Discrete Fourier Transform (numpy.fft) -- Use:

      >>> from numpy import fft
      

      Or, just:

      >>> np.fft.fft( ... )      # etc.
      
    • Financial functions -- Loan, payment, and interest calculations.

    • Functional programming -- Routines and classes that assist with doing functional programming. For example, np.vectorize creates a "vectorized" function; np.frompyfunc creates a Numpy ufunc. (Note that vectorized functions and universal functions can be applied to arrays. For help with the difference between vectorized and universal functions, see: https://stackoverflow.com/questions/6768245/difference-between-frompyfunc-and-vectorize-in-numpy.)

      Also, remember to look at functools and itertools in the standard Python library: https://docs.python.org/3/library/functional.html

      And, if you need parallelism across multiple CPUs and cores, look at ipyparallel: https://ipyparallel.readthedocs.io/en/latest/

    • Numpy-specific help functions -- Functions for getting information about objects and help with Numpy. (Also, if you are using IPython, the "?" operator gives help with a function or object, for example, enumerate? gives help on the enumerate function.)

    • Indexing routines

    • Input and output -- Routines for saving and loading arrays. (But, you may also want to explore HDF5 and h5py or pytables. Both h5py and pytables are in the Anaconda Python distribution.) Also, routines for formatting arrays as strings, converting arrays to and from strings, etc.

    • Linear algebra (numpy.linalg) -- Routines for the following:

      • Matrix and vector products
      • Decompositions
      • Matrix eigenvalues
      • Norms and other numbers
      • Solving equations and inverting matrices
      • Exceptions
      • Linear algebra on several matrices at once
    • Logic functions -- Functions for performing various tests on elements of Numpy arrays.

    • Masked array operations -- Support for creating and using masked arrays. A masked array is an array with a mask that marks some elements of the array as invalid. You can find some help with masked arrays in this document: http://www.scipy-lectures.org/intro/numpy/numpy.html.

    • Mathematical functions -- Functions for:

      • Trigonometric functions
      • Hyperbolic functions
      • Rounding
      • Sums, products, differences
      • Exponents and logarithms
      • Other special functions
      • Floating point routines
      • Arithmetic operations
      • Handling complex numbers
      • etc
    • Matrix library (numpy.matlib) -- Functions for creating and using matrices, as opposed to numpy.ndarray. Use from numpy import matlib. See this for a bit of help on the differences between arrays and matrices in Numpy: https://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u

    • Miscellaneous routines

    • Padding Arrays

    • Polynomials

    • Random sampling (numpy.random)

    • Set routines

    • Sorting, searching, and counting

    • Statistics

    • Test Support (numpy.testing)

    • Window functions
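A few of the categories above can be illustrated in one short session. This is only a sketch: the arrays, the helper function clip_negative, and the small linear system are made up for illustration, and the calls shown are a tiny sample of what each category offers:

```python
import numpy as np

# Scalar types: every element of an array shares one dtype.
a = np.arange(6, dtype=np.int32)
print(a.dtype)                          # int32

# Array creation and manipulation: reshape, transpose, join.
grid = a.reshape(2, 3)
stacked = np.concatenate([grid, np.ones((1, 3), dtype=np.int32)])
print(grid.T.shape, stacked.shape)      # (3, 2) (3, 3)


# Functional programming: wrap a scalar Python function for use on arrays.
def clip_negative(value):
    """Hypothetical helper: replace non-positive values with 0.0."""
    return value if value > 0 else 0.0


samples = np.array([-1.5, 0.5, 2.0])
# np.vectorize returns a typed array; otypes fixes the output dtype.
clipped = np.vectorize(clip_negative, otypes=[float])(samples)
# np.frompyfunc builds a ufunc whose results are Python objects.
as_objects = np.frompyfunc(clip_negative, 1, 1)(samples)
print(clipped.dtype, as_objects.dtype)  # float64 object

# Masked arrays: invalid entries are ignored by reductions.
masked = np.ma.masked_invalid(np.array([1.0, np.nan, 3.0]))
print(masked.mean())                    # 2.0

# Linear algebra: solve 3x + y = 9 and x + 2y = 8.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)                                # approximately x = 2, y = 3
```

Both np.vectorize and np.frompyfunc still call the Python function once per element, so they are a convenience rather than a way to gain speed.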

3.2   Scipy

Note that Scipy, Numpy, Pandas, Matplotlib, IPython, and Sympy are all under the Scipy umbrella. For information about any of these, see: https://www.scipy.org/.

What is Scipy? (1) It is many things to many people. But more seriously, (2) it is a large collection of functions for performing operations on arrays of numerical data. Think of it this way: Numpy (and Pandas) give you ways to structure and manipulate multi-dimensional arrays of numbers; Scipy gives you many functions that perform operations on those multi-dimensional arrays of numbers.

What kinds of operations? Here are some categories with descriptions:

  • Basic functions
  • Special functions (scipy.special)

3.2.1   Integration (scipy.integrate)

For help with this set of functions, do the following:

>>> from scipy import integrate
>>> help(integrate)

Or, in IPython, do integrate?

Here is the list you will see:

  • Integrating functions, given function object

    • quad -- General purpose integration
    • dblquad -- General purpose double integration
    • tplquad -- General purpose triple integration
    • nquad -- General purpose n-dimensional integration
    • fixed_quad -- Integrate func(x) using Gaussian quadrature of order n
    • quadrature -- Integrate with given tolerance using Gaussian quadrature
    • romberg -- Integrate func using Romberg integration
    • quad_explain -- Print information for use of quad
    • newton_cotes -- Weights and error coefficient for Newton-Cotes integration
    • IntegrationWarning -- Warning on issues during integration
  • Integrating functions, given fixed samples

    • trapz -- Use trapezoidal rule to compute integral.
    • cumtrapz -- Use trapezoidal rule to cumulatively compute integral.
    • simps -- Use Simpson's rule to compute integral from samples.
    • romb -- Use Romberg Integration to compute integral from (2**k + 1) evenly-spaced samples.
  • Solving initial value problems for ODE systems

    The solvers are implemented as individual classes which can be used directly (low-level usage) or through a convenience function.

    • solve_ivp -- Convenient function for ODE integration.
    • RK23 -- Explicit Runge-Kutta solver of order 3(2).
    • RK45 -- Explicit Runge-Kutta solver of order 5(4).
    • Radau -- Implicit Runge-Kutta solver of order 5.
    • BDF -- Implicit multi-step variable order (1 to 5) solver.
    • LSODA -- LSODA solver from ODEPACK Fortran package.
    • OdeSolver -- Base class for ODE solvers.
    • DenseOutput -- Local interpolant for computing a dense output.
    • OdeSolution -- Class which represents a continuous ODE solution.
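Here is a minimal sketch that uses one routine from each of the three groups above. The integrand (the sine function on [0, pi], whose exact integral is 2) and the decay equation are chosen for illustration only, because their exact answers are known:

```python
import numpy as np
from scipy import integrate

# Given a function object: quad returns (value, estimated absolute error).
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# Given fixed samples: romb needs 2**k + 1 evenly spaced samples.
xs = np.linspace(0.0, np.pi, 2**6 + 1)
area_from_samples = integrate.romb(np.sin(xs), dx=xs[1] - xs[0])


# Initial value problem: dy/dt = -y with y(0) = 1; exact solution exp(-t).
def decay(t, y):
    return -y


sol = integrate.solve_ivp(decay, (0.0, 1.0), [1.0],
                          method="RK45", rtol=1e-8, atol=1e-10)
y_end = sol.y[0, -1]       # numerical value of y at t = 1

print(area, area_from_samples)     # both close to 2
print(y_end, np.exp(-1.0))         # the two values should agree closely
```

The method argument of solve_ivp selects one of the solver classes listed above (RK23, RK45, Radau, BDF, or LSODA).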

3.2.2   Optimization (scipy.optimize)

Remember that for each of the following (or any) functions, you can get help in the usual ways: help(some_func) or (in IPython) some_func?.

  • Local Optimization:

    • minimize -- Unified interface for minimizers of multivariate functions
    • minimize_scalar -- Unified interface for minimizers of univariate functions
    • OptimizeResult -- The optimization result returned by some optimizers
    • OptimizeWarning -- The optimization encountered problems
  • General-purpose multivariate methods:

    • fmin -- Nelder-Mead Simplex algorithm
    • fmin_powell -- Powell's (modified) level set method
    • fmin_cg -- Non-linear (Polak-Ribiere) conjugate gradient algorithm
    • fmin_bfgs -- Quasi-Newton method (Broyden-Fletcher-Goldfarb-Shanno)
    • fmin_ncg -- Line-search Newton Conjugate Gradient
  • Constrained multivariate methods:

    • fmin_l_bfgs_b -- Zhu, Byrd, and Nocedal's constrained optimizer
    • fmin_tnc -- Truncated Newton code
    • fmin_cobyla -- Constrained optimization by linear approximation
    • fmin_slsqp -- Minimization using sequential least-squares programming
    • differential_evolution -- stochastic minimization using differential evolution
  • Univariate (scalar) minimization methods:

    • fminbound -- Bounded minimization of a scalar function
    • brent -- 1-D function minimization using Brent method
    • golden -- 1-D function minimization using Golden Section method
  • Equation (Local) Minimizers:

    • leastsq -- Minimize the sum of squares of M equations in N unknowns
    • least_squares -- Feature-rich least-squares minimization.
    • nnls -- Linear least-squares problem with non-negativity constraint
    • lsq_linear -- Linear least-squares problem with bound constraints
  • Global Optimization:

    • basinhopping -- Basinhopping stochastic optimizer
    • brute -- Brute force searching optimizer
    • differential_evolution -- stochastic minimization using differential evolution
  • Rosenbrock function:

    • rosen -- The Rosenbrock function.
    • rosen_der -- The derivative of the Rosenbrock function.
    • rosen_hess -- The Hessian matrix of the Rosenbrock function.
    • rosen_hess_prod -- Product of the Rosenbrock Hessian with a vector.
  • Fitting:

    • curve_fit -- Fit curve to a set of points
  • Root finding -- Scalar functions:

    • brentq -- quadratic interpolation Brent method
    • brenth -- Brent method, modified by Harris with hyperbolic extrapolation
    • ridder -- Ridder's method
    • bisect -- Bisection method
    • newton -- Secant method or Newton's method
  • Fixed point finding:

    • fixed_point -- Single-variable fixed-point solver
  • General nonlinear solvers:

    • root -- Unified interface for nonlinear solvers of multivariate functions
    • fsolve -- Non-linear multi-variable equation solver
    • broyden1 -- Broyden's first method
    • broyden2 -- Broyden's second method
  • Large-scale nonlinear solvers:

    • newton_krylov
    • anderson
  • Simple iterations:

    • excitingmixing
    • linearmixing
    • diagbroyden

    Additional information on the nonlinear solvers can be obtained from the help on scipy.optimize.nonlin.

  • Linear Programming -- General linear programming solver:

    • linprog -- Unified interface for minimizers of linear programming problems

  • The simplex method supports callback functions, such as:

    • linprog_verbose_callback -- Sample callback function for linprog (simplex)

  • Assignment problems:

    • linear_sum_assignment -- Solves the linear-sum assignment problem
  • Utilities:

    • approx_fprime -- Approximate the gradient of a scalar function
    • bracket -- Bracket a minimum, given two starting points
    • check_grad -- Check the supplied derivative using finite differences
    • line_search -- Return a step that satisfies the strong Wolfe conditions
    • show_options -- Show specific options for optimization solvers
    • LbfgsInvHessProduct -- Linear operator for L-BFGS approximate inverse Hessian

3.2.3   Interpolation (scipy.interpolate)

Sub-package for objects used in interpolation.

As listed below, this sub-package contains spline functions and classes, one-dimensional and multi-dimensional (univariate and multivariate) interpolation classes, Lagrange and Taylor polynomial interpolators, and wrappers for FITPACK and DFITPACK functions.

  • Univariate interpolation
    • interp1d
    • BarycentricInterpolator
    • KroghInterpolator
    • PchipInterpolator
    • barycentric_interpolate
    • krogh_interpolate
    • pchip_interpolate
    • Akima1DInterpolator
    • CubicSpline
    • PPoly
    • BPoly
  • Multivariate interpolation
    • Unstructured data:
      • griddata
      • LinearNDInterpolator
      • NearestNDInterpolator
      • CloughTocher2DInterpolator
      • Rbf
      • interp2d
    • For data on a grid:
      • interpn
      • RegularGridInterpolator
      • RectBivariateSpline

See also: scipy.ndimage.map_coordinates

  • Tensor product polynomials:

    • NdPPoly
  • 1-D Splines

    • BSpline
    • make_interp_spline
    • make_lsq_spline
  • Functional interface to FITPACK routines:

    • splrep
    • splprep
    • splev
    • splint
    • sproot
    • spalde
    • splder
    • splantider
    • insert
  • Object-oriented FITPACK interface:

    • UnivariateSpline
    • InterpolatedUnivariateSpline
    • LSQUnivariateSpline
  • 2-D Splines

    • For data on a grid:
      • RectBivariateSpline
      • RectSphereBivariateSpline
    • For unstructured data:
      • BivariateSpline
      • SmoothBivariateSpline
      • SmoothSphereBivariateSpline
      • LSQBivariateSpline
      • LSQSphereBivariateSpline
    • Low-level interface to FITPACK functions:
      • bisplrep
      • bisplev
  • Additional tools

    • lagrange
    • approximate_taylor_polynomial
    • pade

    See also:

    • scipy.ndimage.map_coordinates,
    • scipy.ndimage.spline_filter,
    • scipy.signal.resample,
    • scipy.signal.bspline,
    • scipy.signal.gauss_spline,
    • scipy.signal.qspline1d,
    • scipy.signal.cspline1d,
    • scipy.signal.qspline1d_eval,
    • scipy.signal.cspline1d_eval,
    • scipy.signal.qspline2d,
    • scipy.signal.cspline2d.
  • Functions existing for backward compatibility (should not be used in new code):

    • spleval
    • spline
    • splmake
    • spltopp
    • pchip
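As a rough illustration of how some of these pieces are used, here is a sketch with one univariate class, the functional FITPACK interface, one multivariate function for unstructured data, and one of the additional tools. The sample data (a sine curve, a plane, and three points on a parabola) is invented for the example:

```python
import numpy as np
from scipy import interpolate

# Univariate: interpolate 20 samples of sin(x) with a cubic spline.
x = np.linspace(0.0, 2.0 * np.pi, 20)
y = np.sin(x)
x_new = np.linspace(0.0, 2.0 * np.pi, 200)

spline = interpolate.CubicSpline(x, y)
spline_err = np.max(np.abs(spline(x_new) - np.sin(x_new)))

# Functional FITPACK interface: splrep builds the spline, splev evaluates it.
tck = interpolate.splrep(x, y, s=0)       # s=0 means interpolate exactly
y_fitpack = interpolate.splev(x_new, tck)
fitpack_err = np.max(np.abs(y_fitpack - np.sin(x_new)))

# Multivariate, unstructured data: samples of the plane f(x, y) = 2x + 3y.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [1.0, 1.0], [0.5, 0.5]])
values = 2.0 * points[:, 0] + 3.0 * points[:, 1]
query = np.array([[0.25, 0.5]])
plane_value = interpolate.griddata(points, values, query, method="linear")

# Additional tools: Lagrange polynomial through three points of y = x**2.
poly = interpolate.lagrange([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])

print(spline_err, fitpack_err)   # both small
print(plane_value)               # close to 2.0
print(poly(3.0))                 # close to 9.0
```

Linear interpolation reproduces a plane exactly inside the convex hull of the data, which is why the griddata result matches 2x + 3y at the query point.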

3.2.4   Fourier Transforms (scipy.fftpack)

There is help and a number of examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html.

Here is an example, copied from the documentation in the above link:

import numpy as np
from scipy.fftpack import fft

def test():
    # Number of sample points
    N = 600
    # sample spacing
    T = 1.0 / 800.0
    x = np.linspace(0.0, N * T, N)
    y = np.sin(50.0 * 2.0 * np.pi * x) + 0.5 * np.sin(80.0 * 2.0 * np.pi * x)
    yf = fft(y)
    from scipy.signal.windows import blackman
    w = blackman(N)
    ywf = fft(y * w)
    xf = np.linspace(0.0, 1.0 / (2.0 * T), N // 2)
    import matplotlib.pyplot as plt
    plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(yf[1:N // 2]), '-b')
    plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(ywf[1:N // 2]), '-r')
    plt.legend(['FFT', 'FFT w. window'])
    plt.grid()
    plt.show()

test()

Here is a summary of the Discrete Fourier transforms support in scipy.fftpack:

  • Fast Fourier Transforms (FFTs)
    • fft - Fast (discrete) Fourier Transform (FFT)
    • ifft - Inverse FFT
    • fft2 - Two dimensional FFT
    • ifft2 - Two dimensional inverse FFT
    • fftn - n-dimensional FFT
    • ifftn - n-dimensional inverse FFT
    • rfft - FFT of strictly real-valued sequence
    • irfft - Inverse of rfft
    • dct - Discrete cosine transform
    • idct - Inverse discrete cosine transform
    • dctn - n-dimensional Discrete cosine transform
    • idctn - n-dimensional Inverse discrete cosine transform
    • dst - Discrete sine transform
    • idst - Inverse discrete sine transform
    • dstn - n-dimensional Discrete sine transform
    • idstn - n-dimensional Inverse discrete sine transform
  • Differential and pseudo-differential operators
    • diff - Differentiation and integration of periodic sequences
    • tilbert - Tilbert transform: cs_diff(x,h,h)
    • itilbert - Inverse Tilbert transform: sc_diff(x,h,h)
    • hilbert - Hilbert transform: cs_diff(x,inf,inf)
    • ihilbert - Inverse Hilbert transform: sc_diff(x,inf,inf)
    • cs_diff - cosh/sinh pseudo-derivative of periodic sequences
    • sc_diff - sinh/cosh pseudo-derivative of periodic sequences
    • ss_diff - sinh/sinh pseudo-derivative of periodic sequences
    • cc_diff - cosh/cosh pseudo-derivative of periodic sequences
    • shift - Shift periodic sequences
  • Helper functions
    • fftshift - Shift the zero-frequency component to the center of the spectrum
    • ifftshift - The inverse of fftshift
    • fftfreq - Return the Discrete Fourier Transform sample frequencies
    • rfftfreq - DFT sample frequencies (for usage with rfft, irfft)
    • next_fast_len - Find the optimal length to zero-pad an FFT for speed
  • Convolutions (scipy.fftpack.convolve)
    • convolve
    • convolve_z
    • init_convolution_kernel
    • destroy_convolve_cache
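The helper functions are easiest to understand alongside fft itself. In this sketch, fftfreq supplies the frequency of each FFT bin so that the strongest component of a signal can be read off directly. The signal (a 50 Hz tone plus a weaker 80 Hz tone, sampled at 800 Hz) is made up for illustration:

```python
import numpy as np
from scipy.fftpack import fft, ifft, fftfreq

N = 800                # number of sample points
T = 1.0 / 800.0        # sample spacing in seconds (800 Hz sampling rate)
t = np.arange(N) * T
y = np.sin(2.0 * np.pi * 50.0 * t) + 0.5 * np.sin(2.0 * np.pi * 80.0 * t)

yf = fft(y)
freqs = fftfreq(N, d=T)            # frequency in Hz of each FFT bin

# Keep the non-negative frequencies and scale to per-tone amplitudes.
half = N // 2
amplitudes = 2.0 / N * np.abs(yf[:half])
peak_freq = freqs[np.argmax(amplitudes)]

# ifft inverts fft, up to rounding error.
y_back = ifft(yf).real

print(peak_freq)                   # close to 50.0
print(np.allclose(y_back, y))      # True
```

Because one second of data is sampled, the bins fall on whole numbers of Hz and both tones land exactly on a bin; with other lengths, some energy would leak into neighboring bins.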

3.2.5   Signal Processing (scipy.signal)

Use this module with either of the following:

>>> import scipy.signal
>>> from scipy import signal

Here is a summary:

  • Convolution
    • convolve -- N-dimensional convolution.
    • correlate -- N-dimensional correlation.
    • fftconvolve -- N-dimensional convolution using the FFT.
    • convolve2d -- 2-dimensional convolution (more options).
    • correlate2d -- 2-dimensional correlation (more options).
    • sepfir2d -- Convolve with a 2-D separable FIR filter.
    • choose_conv_method -- Chooses faster of FFT and direct convolution methods.
  • B-splines
    • bspline -- B-spline basis function of order n.
    • cubic -- B-spline basis function of order 3.
    • quadratic -- B-spline basis function of order 2.
    • gauss_spline -- Gaussian approximation to the B-spline basis function.
    • cspline1d -- Coefficients for 1-D cubic (3rd order) B-spline.
    • qspline1d -- Coefficients for 1-D quadratic (2nd order) B-spline.
    • cspline2d -- Coefficients for 2-D cubic (3rd order) B-spline.
    • qspline2d -- Coefficients for 2-D quadratic (2nd order) B-spline.
    • cspline1d_eval -- Evaluate a cubic spline at the given points.
    • qspline1d_eval -- Evaluate a quadratic spline at the given points.
    • spline_filter -- Smoothing spline (cubic) filtering of a rank-2 array.
  • Filtering
    • order_filter -- N-dimensional order filter.
    • medfilt -- N-dimensional median filter.
    • medfilt2d -- 2-dimensional median filter (faster).
    • wiener -- N-dimensional wiener filter.
    • symiirorder1 -- 2nd-order IIR filter (cascade of first-order systems).
    • symiirorder2 -- 4th-order IIR filter (cascade of second-order systems).
    • lfilter -- 1-dimensional FIR and IIR digital linear filtering.
    • lfiltic -- Construct initial conditions for lfilter.
    • lfilter_zi -- Compute an initial state zi for the lfilter function that corresponds to the steady state of the step response.
    • filtfilt -- A forward-backward filter.
    • savgol_filter -- Filter a signal using the Savitzky-Golay filter.
    • deconvolve -- 1-d deconvolution using lfilter.
    • sosfilt -- 1-dimensional IIR digital linear filtering using a second-order sections filter representation.
    • sosfilt_zi -- Compute an initial state zi for the sosfilt function that corresponds to the steady state of the step response.
    • sosfiltfilt -- A forward-backward filter for second-order sections.
    • hilbert -- Compute 1-D analytic signal, using the Hilbert transform.
    • hilbert2 -- Compute 2-D analytic signal, using the Hilbert transform.
    • decimate -- Downsample a signal.
    • detrend -- Remove linear and/or constant trends from data.
    • resample -- Resample using Fourier method.
    • resample_poly -- Resample using polyphase filtering method.
    • upfirdn -- Upsample, apply FIR filter, downsample.
  • Filter design
    • bilinear -- Digital filter from an analog filter using the bilinear transform.
    • findfreqs -- Find array of frequencies for computing filter response.
    • firls -- FIR filter design using least-squares error minimization.
    • firwin -- Windowed FIR filter design, with frequency response defined as pass and stop bands.
    • firwin2 -- Windowed FIR filter design, with arbitrary frequency response.
    • freqs -- Analog filter frequency response from TF coefficients.
    • freqs_zpk -- Analog filter frequency response from ZPK coefficients.
    • freqz -- Digital filter frequency response from TF coefficients.
    • freqz_zpk -- Digital filter frequency response from ZPK coefficients.
    • sosfreqz -- Digital filter frequency response for SOS format filter.
    • group_delay -- Digital filter group delay.
    • iirdesign -- IIR filter design given bands and gains.
    • iirfilter -- IIR filter design given order and critical frequencies.
    • kaiser_atten -- Compute the attenuation of a Kaiser FIR filter, given the number of taps and the transition width at discontinuities in the frequency response.
    • kaiser_beta -- Compute the Kaiser parameter beta, given the desired FIR filter attenuation.
    • kaiserord -- Design a Kaiser window to limit ripple and width of transition region.
    • minimum_phase -- Convert a linear phase FIR filter to minimum phase.
    • savgol_coeffs -- Compute the FIR filter coefficients for a Savitzky-Golay filter.
    • remez -- Optimal FIR filter design.
    • unique_roots -- Unique roots and their multiplicities.
    • residue -- Partial fraction expansion of b(s) / a(s).
    • residuez -- Partial fraction expansion of b(z) / a(z).
    • invres -- Inverse partial fraction expansion for analog filter.
    • invresz -- Inverse partial fraction expansion for digital filter.
    • BadCoefficients -- Warning on badly conditioned filter coefficients
  • Lower-level filter design functions:
    • abcd_normalize -- Check state-space matrices and ensure they are rank-2.
    • band_stop_obj -- Band Stop Objective Function for order minimization.
    • besselap -- Return (z,p,k) for analog prototype of Bessel filter.
    • buttap -- Return (z,p,k) for analog prototype of Butterworth filter.
    • cheb1ap -- Return (z,p,k) for type I Chebyshev filter.
    • cheb2ap -- Return (z,p,k) for type II Chebyshev filter.
    • cmplx_sort -- Sort roots based on magnitude.
    • ellipap -- Return (z,p,k) for analog prototype of elliptic filter.
    • lp2bp -- Transform a lowpass filter prototype to a bandpass filter.
    • lp2bs -- Transform a lowpass filter prototype to a bandstop filter.
    • lp2hp -- Transform a lowpass filter prototype to a highpass filter.
    • lp2lp -- Transform a lowpass filter prototype to a lowpass filter.
    • normalize -- Normalize polynomial representation of a transfer function.
  • Matlab-style IIR filter design
    • butter -- Butterworth
    • buttord
    • cheby1 -- Chebyshev Type I
    • cheb1ord
    • cheby2 -- Chebyshev Type II
    • cheb2ord
    • ellip -- Elliptic (Cauer)
    • ellipord
    • bessel -- Bessel (no order selection available -- try buttord)
    • iirnotch -- Design second-order IIR notch digital filter.
    • iirpeak -- Design second-order IIR peak (resonant) digital filter.
  • Continuous-Time Linear Systems
    • lti -- Continuous-time linear time invariant system base class.
    • StateSpace -- Linear time invariant system in state space form.
    • TransferFunction -- Linear time invariant system in transfer function form.
    • ZerosPolesGain -- Linear time invariant system in zeros, poles, gain form.
    • lsim -- continuous-time simulation of output to linear system.
    • lsim2 -- like lsim, but scipy.integrate.odeint is used.
    • impulse -- impulse response of linear, time-invariant (LTI) system.
    • impulse2 -- like impulse, but scipy.integrate.odeint is used.
    • step -- step response of continuous-time LTI system.
    • step2 -- like step, but scipy.integrate.odeint is used.
    • freqresp -- frequency response of a continuous-time LTI system.
    • bode -- Bode magnitude and phase data (continuous-time LTI).
  • Discrete-Time Linear Systems
    • dlti -- Discrete-time linear time invariant system base class.
    • StateSpace -- Linear time invariant system in state space form.
    • TransferFunction -- Linear time invariant system in transfer function form.
    • ZerosPolesGain -- Linear time invariant system in zeros, poles, gain form.
    • dlsim -- simulation of output to a discrete-time linear system.
    • dimpulse -- impulse response of a discrete-time LTI system.
    • dstep -- step response of a discrete-time LTI system.
    • dfreqresp -- frequency response of a discrete-time LTI system.
    • dbode -- Bode magnitude and phase data (discrete-time LTI).
  • LTI Representations
    • tf2zpk -- transfer function to zero-pole-gain.
    • tf2sos -- transfer function to second-order sections.
    • tf2ss -- transfer function to state-space.
    • zpk2tf -- zero-pole-gain to transfer function.
    • zpk2sos -- zero-pole-gain to second-order sections.
    • zpk2ss -- zero-pole-gain to state-space.
    • ss2tf -- state-space to transfer function.
    • ss2zpk -- state-space to pole-zero-gain.
    • sos2zpk -- second-order sections to zero-pole-gain.
    • sos2tf -- second-order sections to transfer function.
    • cont2discrete -- continuous-time to discrete-time LTI conversion.
    • place_poles -- pole placement.
  • Waveforms
    • chirp -- Frequency swept cosine signal, with several freq functions.
    • gausspulse -- Gaussian modulated sinusoid
    • max_len_seq -- Maximum length sequence
    • sawtooth -- Periodic sawtooth
    • square -- Square wave
    • sweep_poly -- Frequency swept cosine signal; freq is arbitrary polynomial
    • unit_impulse -- Discrete unit impulse
  • Window functions
    • get_window -- Return a window of a given length and type.
    • barthann -- Bartlett-Hann window
    • bartlett -- Bartlett window
    • blackman -- Blackman window
    • blackmanharris -- Minimum 4-term Blackman-Harris window
    • bohman -- Bohman window
    • boxcar -- Boxcar window
    • chebwin -- Dolph-Chebyshev window
    • cosine -- Cosine window
    • exponential -- Exponential window
    • flattop -- Flat top window
    • gaussian -- Gaussian window
    • general_gaussian -- Generalized Gaussian window
    • hamming -- Hamming window
    • hann -- Hann window
    • hanning -- Hann window
    • kaiser -- Kaiser window
    • nuttall -- Nuttall's minimum 4-term Blackman-Harris window
    • parzen -- Parzen window
    • slepian -- Slepian window
    • triang -- Triangular window
    • tukey -- Tukey window
  • Wavelets
    • cascade -- compute scaling function and wavelet from coefficients
    • daub -- return low-pass filter coefficients for Daubechies wavelets
    • morlet -- Complex Morlet wavelet.
    • qmf -- return quadrature mirror filter from low-pass
    • ricker -- return ricker wavelet
    • cwt -- perform continuous wavelet transform
  • Peak finding
    • find_peaks_cwt -- Attempt to find the peaks in the given 1-D array
    • argrelmin -- Calculate the relative minima of data
    • argrelmax -- Calculate the relative maxima of data
    • argrelextrema -- Calculate the relative extrema of data
  • Spectral Analysis
    • periodogram -- Compute a (modified) periodogram
    • welch -- Compute a periodogram using Welch's method
    • csd -- Compute the cross spectral density, using Welch's method
    • coherence -- Compute the magnitude squared coherence, using Welch's method
    • spectrogram -- Compute the spectrogram
    • lombscargle -- Computes the Lomb-Scargle periodogram
    • vectorstrength -- Computes the vector strength
    • stft -- Compute the Short Time Fourier Transform
    • istft -- Compute the Inverse Short Time Fourier Transform
    • check_COLA -- Check the COLA constraint for iSTFT reconstruction
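
As a small illustration of the spectral analysis functions listed above, here is a minimal sketch that estimates the power spectral density of a noisy sine wave with welch. The sampling rate, tone frequency, noise level, and segment length are arbitrary values assumed for this example:

```python
import numpy as np
from scipy import signal

fs = 1000.0                          # sampling rate in Hz (assumed for this example)
t = np.arange(0, 2.0, 1.0 / fs)      # two seconds of samples
rng = np.random.RandomState(0)       # seeded so that the run is repeatable
x = np.sin(2 * np.pi * 50.0 * t) + 0.5 * rng.randn(t.size)

# Welch's method averages periodograms of overlapping segments.
freqs, psd = signal.welch(x, fs=fs, nperseg=500)
peak_freq = freqs[np.argmax(psd)]
print(peak_freq)                     # expected to be close to 50 Hz
```

With nperseg=500 the frequency bins are 2 Hz apart, so the peak should land on or next to the 50 Hz bin.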

3.2.6   Linear Algebra (scipy.linalg)

Use this module with either of the following:

>>> import scipy.linalg
>>> from scipy import linalg

Here is a summary:

  • Basics

    • inv -- Find the inverse of a square matrix
    • solve -- Solve a linear system of equations
    • solve_banded -- Solve a banded linear system
    • solveh_banded -- Solve a Hermitian or symmetric banded system
    • solve_circulant -- Solve a circulant system
    • solve_triangular -- Solve a triangular system
    • solve_toeplitz -- Solve a Toeplitz system
    • det -- Find the determinant of a square matrix
    • norm -- Matrix and vector norm
    • lstsq -- Solve a linear least-squares problem
    • pinv -- Pseudo-inverse (Moore-Penrose) using lstsq
    • pinv2 -- Pseudo-inverse using svd
    • pinvh -- Pseudo-inverse of hermitian matrix
    • kron -- Kronecker product of two arrays
    • tril -- Construct a lower-triangular matrix from a given matrix
    • triu -- Construct an upper-triangular matrix from a given matrix
    • orthogonal_procrustes -- Solve an orthogonal Procrustes problem
    • matrix_balance -- Balance matrix entries with a similarity transformation
    • subspace_angles -- Compute the subspace angles between two matrices
    • LinAlgError -- Generic Python-exception-derived object raised by linalg functions.
  • Eigenvalue Problems

    • eig -- Find the eigenvalues and eigenvectors of a square matrix
    • eigvals -- Find just the eigenvalues of a square matrix
    • eigh -- Find the e-vals and e-vectors of a Hermitian or symmetric matrix
    • eigvalsh -- Find just the eigenvalues of a Hermitian or symmetric matrix
    • eig_banded -- Find the eigenvalues and eigenvectors of a banded matrix
    • eigvals_banded -- Find just the eigenvalues of a banded matrix
    • eigh_tridiagonal -- Find the eigenvalues and eigenvectors of a tridiagonal matrix
    • eigvalsh_tridiagonal -- Find just the eigenvalues of a tridiagonal matrix
  • Decompositions

    • lu -- LU decomposition of a matrix
    • lu_factor -- LU decomposition returning unordered matrix and pivots
    • lu_solve -- Solve Ax=b using back substitution with output of lu_factor
    • svd -- Singular value decomposition of a matrix
    • svdvals -- Singular values of a matrix
    • diagsvd -- Construct matrix of singular values from output of svd
    • orth -- Construct orthonormal basis for the range of A using svd
    • cholesky -- Cholesky decomposition of a matrix
    • cholesky_banded -- Cholesky decomp. of a sym. or Hermitian banded matrix
    • cho_factor -- Cholesky decomposition for use in solving a linear system
    • cho_solve -- Solve previously factored linear system
    • cho_solve_banded -- Solve previously factored banded linear system
    • polar -- Compute the polar decomposition.
    • qr -- QR decomposition of a matrix
    • qr_multiply -- QR decomposition and multiplication by Q
    • qr_update -- Rank k QR update
    • qr_delete -- QR downdate on row or column deletion
    • qr_insert -- QR update on row or column insertion
    • rq -- RQ decomposition of a matrix
    • qz -- QZ decomposition of a pair of matrices
    • ordqz -- QZ decomposition of a pair of matrices with reordering
    • schur -- Schur decomposition of a matrix
    • rsf2csf -- Real to complex Schur form
    • hessenberg -- Hessenberg form of a matrix

    See also: scipy.linalg.interpolative -- Interpolative matrix decompositions

  • Matrix Functions

    • expm -- Matrix exponential
    • logm -- Matrix logarithm
    • cosm -- Matrix cosine
    • sinm -- Matrix sine
    • tanm -- Matrix tangent
    • coshm -- Matrix hyperbolic cosine
    • sinhm -- Matrix hyperbolic sine
    • tanhm -- Matrix hyperbolic tangent
    • signm -- Matrix sign
    • sqrtm -- Matrix square root
    • funm -- Evaluating an arbitrary matrix function
    • expm_frechet -- Frechet derivative of the matrix exponential
    • expm_cond -- Relative condition number of expm in the Frobenius norm
    • fractional_matrix_power -- Fractional matrix power
  • Matrix Equation Solvers

    • solve_sylvester -- Solve the Sylvester matrix equation
    • solve_continuous_are -- Solve the continuous-time algebraic Riccati equation
    • solve_discrete_are -- Solve the discrete-time algebraic Riccati equation
    • solve_continuous_lyapunov -- Solve the continuous-time Lyapunov equation
    • solve_discrete_lyapunov -- Solve the discrete-time Lyapunov equation
  • Sketches and Random Projections

    • clarkson_woodruff_transform -- Applies the Clarkson-Woodruff sketch (a.k.a. CountSketch)
  • Special Matrices

    • block_diag -- Construct a block diagonal matrix from submatrices
    • circulant -- Circulant matrix
    • companion -- Companion matrix
    • dft -- Discrete Fourier transform matrix
    • hadamard -- Hadamard matrix of order 2**n
    • hankel -- Hankel matrix
    • helmert -- Helmert matrix
    • hilbert -- Hilbert matrix
    • invhilbert -- Inverse Hilbert matrix
    • leslie -- Leslie matrix
    • pascal -- Pascal matrix
    • invpascal -- Inverse Pascal matrix
    • toeplitz -- Toeplitz matrix
    • tri -- Construct a matrix filled with ones at and below a given diagonal
  • Low-level routines

    • get_blas_funcs
    • get_lapack_funcs
    • find_best_blas_type
  • See also:

    • scipy.linalg.blas -- Low-level BLAS functions
    • scipy.linalg.lapack -- Low-level LAPACK functions
    • scipy.linalg.cython_blas -- Low-level BLAS functions for Cython
    • scipy.linalg.cython_lapack -- Low-level LAPACK functions for Cython
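
To give a feel for how a few of the functions listed above fit together, here is a minimal sketch that solves a small linear system directly, solves it again through an LU factorization, and computes the eigenvalues of the same symmetric matrix. The matrix and right-hand side are made up for this example:

```python
import numpy as np
from scipy import linalg

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# Solve A x = b directly.
x = linalg.solve(A, b)

# Factor once with LU, then reuse the factorization to solve the same system.
lu, piv = linalg.lu_factor(A)
x_lu = linalg.lu_solve((lu, piv), b)

# A is symmetric, so eigh applies; eigenvalues are returned in ascending order.
evals, evecs = linalg.eigh(A)

print(np.allclose(A @ x, b), np.allclose(x, x_lu))
print(np.allclose(A @ evecs, evecs * evals))
```

Factoring with lu_factor pays off mainly when the same matrix must be solved against several right-hand sides.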

3.2.7   Sparse Eigenvalue Problems with ARPACK

There are examples in the Scipy documentation, here: https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html

And, here is a summary copied from that document:

"ARPACK is a Fortran package which provides routines for quickly finding a few eigenvalues/eigenvectors of large sparse matrices. In order to find these solutions, it requires only left-multiplication by the matrix in question. This operation is performed through a reverse-communication interface. The result of this structure is that ARPACK is able to find eigenvalues and eigenvectors of any linear function mapping a vector to a vector.

"All of the functionality provided in ARPACK is contained within the two high-level interfaces scipy.sparse.linalg.eigs and scipy.sparse.linalg.eigsh. eigs provides interfaces to find the eigenvalues/vectors of real or complex nonsymmetric square matrices, while eigsh provides interfaces for real-symmetric or complex-hermitian matrices."

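As a rough illustration of the eigsh interface described above, the following sketch asks ARPACK for the three largest eigenvalues of a sparse diagonal matrix. A diagonal matrix is assumed here only because its eigenvalues are known in advance, which makes the result easy to check; the size and starting vector are likewise choices made for this example:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

n = 100
# A sparse diagonal matrix whose eigenvalues are known: 1, 2, ..., n.
A = diags([np.arange(1, n + 1, dtype=float)], [0]).tocsr()

# Ask ARPACK for the k=3 eigenvalues of largest magnitude ('LM').
# v0 fixes the starting vector so that the run is repeatable.
vals, vecs = eigsh(A, k=3, which='LM', v0=np.ones(n))
print(np.sort(vals))        # expected to be close to 98, 99, 100
print(vecs.shape)           # one column per eigenvector
```
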
3.2.8   Compressed Sparse Graph Routines (scipy.sparse.csgraph)

There is an example that implements a search for the shortest path between two words (of equal length) in a word ladder (i.e. changing just one letter in each step) in the Scipy documentation. You can find it here: https://docs.scipy.org/doc/scipy/reference/tutorial/csgraph.html.

You can get documentation with the following:

$ pydoc scipy.sparse.csgraph

And, in IPython, do something like this:

In [41]: from scipy.sparse import csgraph
In [42]: csgraph.connected_components?

Here is a summary of the contents:

  • connected_components -- determine connected components of a graph.
  • laplacian -- compute the laplacian of a graph.
  • shortest_path -- compute the shortest path between points on a positive graph.
  • dijkstra -- use Dijkstra's algorithm for shortest path.
  • floyd_warshall -- use the Floyd-Warshall algorithm for shortest path.
  • bellman_ford -- use the Bellman-Ford algorithm for shortest path.
  • johnson -- use Johnson's algorithm for shortest path.
  • breadth_first_order -- compute a breadth-first order of nodes.
  • depth_first_order -- compute a depth-first order of nodes.
  • breadth_first_tree -- construct the breadth-first tree from a given node.
  • depth_first_tree -- construct a depth-first tree from a given node.
  • minimum_spanning_tree -- construct the minimum spanning tree of a graph.
  • reverse_cuthill_mckee -- compute permutation for reverse Cuthill-McKee ordering.
  • maximum_bipartite_matching -- compute permutation to make diagonal zero free.
  • structural_rank -- compute the structural rank of a graph.
  • construct_dist_matrix -- Construct distance matrix from a predecessor matrix.
  • csgraph_from_dense -- Construct a CSR-format sparse graph from a dense matrix.
  • csgraph_from_masked -- Construct a CSR-format graph from a masked array.
  • csgraph_masked_from_dense -- Construct a masked array graph representation from a dense matrix.
  • csgraph_to_dense -- Convert a sparse graph representation to a dense representation.
  • csgraph_to_masked -- Convert a sparse graph representation to a masked array representation.
  • reconstruct_path -- Construct a tree from a graph and a predecessor list.
  • NegativeCycleError -- Exception raised when a negative cycle is found in the graph.
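
Here is a minimal sketch of two of the routines listed above, connected_components and dijkstra, applied to a tiny weighted graph. The graph itself is invented for this example; a zero entry in the matrix is treated as "no edge":

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, dijkstra

# Weighted, undirected graph on 4 nodes; a zero entry means "no edge".
# Node 3 has no edges, so it forms its own component.
dense = np.array([[0, 1, 4, 0],
                  [1, 0, 2, 0],
                  [4, 2, 0, 0],
                  [0, 0, 0, 0]])
graph = csr_matrix(dense)

n_components, labels = connected_components(graph, directed=False)
print(n_components, labels)

# Shortest distances from node 0; unreachable nodes get inf.
dist = dijkstra(graph, directed=False, indices=0)
print(dist)
```

The path from node 0 to node 2 through node 1 (1 + 2) is shorter than the direct edge (4), so the reported distance to node 2 should be 3.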

Note that there are other sparse graph libraries for Python. One is Another Python Graph Library: https://pythonhosted.org/apgl/index.html.

3.2.9   Spatial data structures and algorithms (scipy.spatial)

Provides spatial algorithms and data structures.

Here is an example, copied from the documentation:

import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt

def test():
    points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]])
    tri = Delaunay(points)
    #
    # We can visualize it:
    plt.triplot(points[:, 0], points[:, 1], tri.simplices.copy())
    plt.plot(points[:, 0], points[:, 1], 'o')
    #
    # And add some further decorations:
    for j, p in enumerate(points):
        # label the points
        plt.text(p[0] - 0.03, p[1] + 0.03, j, ha='right')
    for j, s in enumerate(tri.simplices):
        p = points[s].mean(axis=0)
        # label triangles
        plt.text(p[0], p[1], '#%d' % j, ha='center')
    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.show()
    #
    # The structure of the triangulation is encoded in the following way:
    # the simplices attribute contains the indices of the points in the
    # points array that make up the triangle. For instance:
    i = 1
    tri.simplices[i, :]
    points[tri.simplices[i, :]]
    return tri, points

Here is a summary of the contents of scipy.spatial (obtained by doing $ pydoc scipy.spatial):

  • Nearest-neighbor Queries:

    • KDTree -- class for efficient nearest-neighbor queries
    • cKDTree -- class for efficient nearest-neighbor queries (faster impl.)
    • distance -- module containing many different distance measures
    • Rectangle -- Hyperrectangle class. Represents a Cartesian product of intervals.
  • Delaunay Triangulation, Convex Hulls, and Voronoi Diagrams:

    • Delaunay -- compute Delaunay triangulation of input points
    • ConvexHull -- compute a convex hull for input points
    • Voronoi -- compute a Voronoi diagram from input points
    • SphericalVoronoi -- compute a Voronoi diagram from input points on the surface of a sphere
    • HalfspaceIntersection -- compute the intersection points of input halfspaces
  • Plotting Helpers:

    • delaunay_plot_2d -- plot 2-D triangulation
    • convex_hull_plot_2d -- plot 2-D convex hull
    • voronoi_plot_2d -- plot 2-D voronoi diagram
  • Simplex representation:

    The simplices (triangles, tetrahedra, ...) appearing in the Delaunay tessellation (N-dim simplices), convex hull facets, and Voronoi ridges (N-1 dim simplices) are represented in the following scheme:

    tess = Delaunay(points)
    hull = ConvexHull(points)
    voro = Voronoi(points)
    # coordinates of the j-th vertex of the i-th simplex
    tess.points[tess.simplices[i, j], :]        # tessellation element
    hull.points[hull.simplices[i, j], :]        # convex hull facet
    voro.vertices[voro.ridge_vertices[i, j], :] # ridge between Voronoi cells
    

    For Delaunay triangulations and convex hulls, the neighborhood structure of the simplices satisfies the condition:

    tess.neighbors[i,j] is the neighboring simplex of the i-th simplex, opposite to the j-vertex. It is -1 in case of no neighbor.

    Convex hull facets also define a hyperplane equation:

    (hull.equations[i,:-1] * coord).sum() + hull.equations[i,-1] == 0
    

    Similar hyperplane equations for the Delaunay triangulation correspond to the convex hull facets on the corresponding N+1 dimensional paraboloid.

    The Delaunay triangulation objects offer a method for locating the simplex containing a given point, and barycentric coordinate computations.

  • Functions:

    • tsearch
    • distance_matrix
    • minkowski_distance
    • minkowski_distance_p
    • procrustes
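
The following is a minimal sketch of a nearest-neighbor query, a convex hull, and a distance matrix, using classes and functions from the summary above. The point set (the corners of the unit square plus its center) and the query point are assumptions made for this example:

```python
import numpy as np
from scipy.spatial import cKDTree, ConvexHull, distance_matrix

# Four corners of the unit square plus its center.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])

# Nearest-neighbor query: which stored point is closest to (0.9, 0.8)?
tree = cKDTree(points)
dist, idx = tree.query([0.9, 0.8])
print(idx, points[idx])

# The convex hull uses only the corners; the center is interior.
hull = ConvexHull(points)
print(sorted(hull.vertices))

# Pairwise Euclidean distances between all points.
d = distance_matrix(points, points)
print(d.shape)
```
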

3.2.10   Statistics (scipy.stats)

This module contains a large number of probability distributions as well as a growing library of statistical functions.

Each univariate distribution is an instance of a subclass of rv_continuous (rv_discrete for discrete distributions):

  • rv_continuous
  • rv_discrete
  • rv_histogram
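
Because every distribution shares the same base-class interface, the same handful of methods (cdf, ppf, pmf or pdf, rvs) works across all of them. A minimal sketch, with parameter values chosen only for illustration:

```python
from scipy import stats

# "Freeze" a normal distribution with fixed location and scale.
std_normal = stats.norm(loc=0.0, scale=1.0)
print(std_normal.cdf(1.96))      # roughly 0.975
print(std_normal.ppf(0.975))     # roughly 1.96 (ppf is the inverse of cdf)

# Discrete distributions expose pmf instead of pdf.
coin = stats.binom(n=10, p=0.5)
print(coin.pmf(5))               # probability of exactly 5 heads in 10 tosses

# Draw reproducible random samples.
sample = std_normal.rvs(size=1000, random_state=0)
print(sample.mean(), sample.std())
```
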

Here is a summary of the items in scipy.stats:

  • Continuous distributions

    • alpha -- Alpha
    • anglit -- Anglit
    • arcsine -- Arcsine
    • argus -- Argus
    • beta -- Beta
    • betaprime -- Beta Prime
    • bradford -- Bradford
    • burr -- Burr (Type III)
    • burr12 -- Burr (Type XII)
    • cauchy -- Cauchy
    • chi -- Chi
    • chi2 -- Chi-squared
    • cosine -- Cosine
    • crystalball -- Crystalball
    • dgamma -- Double Gamma
    • dweibull -- Double Weibull
    • erlang -- Erlang
    • expon -- Exponential
    • exponnorm -- Exponentially Modified Normal
    • exponweib -- Exponentiated Weibull
    • exponpow -- Exponential Power
    • f -- F (Snedecor F)
    • fatiguelife -- Fatigue Life (Birnbaum-Saunders)
    • fisk -- Fisk
    • foldcauchy -- Folded Cauchy
    • foldnorm -- Folded Normal
    • frechet_r -- Deprecated. Alias for weibull_min
    • frechet_l -- Deprecated. Alias for weibull_max
    • genlogistic -- Generalized Logistic
    • gennorm -- Generalized normal
    • genpareto -- Generalized Pareto
    • genexpon -- Generalized Exponential
    • genextreme -- Generalized Extreme Value
    • gausshyper -- Gauss Hypergeometric
    • gamma -- Gamma
    • gengamma -- Generalized gamma
    • genhalflogistic -- Generalized Half Logistic
    • gilbrat -- Gilbrat
    • gompertz -- Gompertz (Truncated Gumbel)
    • gumbel_r -- Right Sided Gumbel, Log-Weibull, Fisher-Tippett, Extreme Value Type I
    • gumbel_l -- Left Sided Gumbel, etc.
    • halfcauchy -- Half Cauchy
    • halflogistic -- Half Logistic
    • halfnorm -- Half Normal
    • halfgennorm -- Generalized Half Normal
    • hypsecant -- Hyperbolic Secant
    • invgamma -- Inverse Gamma
    • invgauss -- Inverse Gaussian
    • invweibull -- Inverse Weibull
    • johnsonsb -- Johnson SB
    • johnsonsu -- Johnson SU
    • kappa4 -- Kappa 4 parameter
    • kappa3 -- Kappa 3 parameter
    • ksone -- Kolmogorov-Smirnov one-sided (no stats)
    • kstwobign -- Kolmogorov-Smirnov two-sided test for Large N (no stats)
    • laplace -- Laplace
    • levy -- Levy
    • levy_l
    • levy_stable
    • logistic -- Logistic
    • loggamma -- Log-Gamma
    • loglaplace -- Log-Laplace (Log Double Exponential)
    • lognorm -- Log-Normal
    • lomax -- Lomax (Pareto of the second kind)
    • maxwell -- Maxwell
    • mielke -- Mielke's Beta-Kappa
    • nakagami -- Nakagami
    • ncx2 -- Non-central chi-squared
    • ncf -- Non-central F
    • nct -- Non-central Student's T
    • norm -- Normal (Gaussian)
    • pareto -- Pareto
    • pearson3 -- Pearson type III
    • powerlaw -- Power-function
    • powerlognorm -- Power log normal
    • powernorm -- Power normal
    • rdist -- R-distribution
    • reciprocal -- Reciprocal
    • rayleigh -- Rayleigh
    • rice -- Rice
    • recipinvgauss -- Reciprocal Inverse Gaussian
    • semicircular -- Semicircular
    • skewnorm -- Skew normal
    • t -- Student's T
    • trapz -- Trapezoidal
    • triang -- Triangular
    • truncexpon -- Truncated Exponential
    • truncnorm -- Truncated Normal
    • tukeylambda -- Tukey-Lambda
    • uniform -- Uniform
    • vonmises -- Von-Mises (Circular)
    • vonmises_line -- Von-Mises (Line)
    • wald -- Wald
    • weibull_min -- Minimum Weibull (see Frechet)
    • weibull_max -- Maximum Weibull (see Frechet)
    • wrapcauchy -- Wrapped Cauchy
  • Multivariate distributions

    • multivariate_normal -- Multivariate normal distribution
    • matrix_normal -- Matrix normal distribution
    • dirichlet -- Dirichlet
    • wishart -- Wishart
    • invwishart -- Inverse Wishart
    • multinomial -- Multinomial distribution
    • special_ortho_group -- SO(N) group
    • ortho_group -- O(N) group
    • unitary_group -- U(N) group
    • random_correlation -- random correlation matrices
  • Discrete distributions

    • bernoulli -- Bernoulli
    • binom -- Binomial
    • boltzmann -- Boltzmann (Truncated Discrete Exponential)
    • dlaplace -- Discrete Laplacian
    • geom -- Geometric
    • hypergeom -- Hypergeometric
    • logser -- Logarithmic (Log-Series, Series)
    • nbinom -- Negative Binomial
    • planck -- Planck (Discrete Exponential)
    • poisson -- Poisson
    • randint -- Discrete Uniform
    • skellam -- Skellam
    • zipf -- Zipf
  • Statistical functions -- Several of these functions have a similar version in scipy.stats.mstats that works for masked arrays.

    • describe -- Descriptive statistics
    • gmean -- Geometric mean
    • hmean -- Harmonic mean
    • kurtosis -- Fisher or Pearson kurtosis
    • kurtosistest -- Test whether a dataset has normal kurtosis.
    • mode -- Modal value
    • moment -- Central moment
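    • normaltest -- Test whether a sample differs from a normal distribution
    • skew -- Skewness
    • skewtest -- Test whether the skew is different from the normal distribution
    • kstat -- The nth k-statistic
    • kstatvar -- Variance of the k-statistic
    • tmean -- Truncated arithmetic mean
    • tvar -- Truncated variance
    • tmin -- Truncated minimum
    • tmax -- Truncated maximum
    • tstd -- Truncated standard deviation
    • tsem -- Truncated standard error of the mean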
    • variation -- Coefficient of variation
    • find_repeats
    • trim_mean
    • cumfreq
    • itemfreq
    • percentileofscore
    • scoreatpercentile
    • relfreq
    • binned_statistic -- Compute a binned statistic for a set of data.
    • binned_statistic_2d -- Compute a 2-D binned statistic for a set of data.
    • binned_statistic_dd -- Compute a d-D binned statistic for a set of data.
    • obrientransform
    • bayes_mvs
    • mvsdist
    • sem
    • zmap
    • zscore
    • iqr
    • sigmaclip
    • trimboth
    • trim1
    • f_oneway
    • pearsonr
    • spearmanr
    • pointbiserialr
    • kendalltau
    • weightedtau
    • linregress
    • theilslopes
    • ttest_1samp
    • ttest_ind
    • ttest_ind_from_stats
    • ttest_rel
    • kstest
    • chisquare
    • power_divergence
    • ks_2samp
    • mannwhitneyu
    • tiecorrect
    • rankdata
    • ranksums
    • wilcoxon
    • kruskal
    • friedmanchisquare
    • combine_pvalues
    • jarque_bera
    • ansari
    • bartlett
    • levene
    • shapiro
    • anderson
    • anderson_ksamp
    • binom_test
    • fligner
    • median_test
    • mood
    • boxcox
    • boxcox_normmax
    • boxcox_llf
    • entropy
    • wasserstein_distance
    • energy_distance
  • Circular statistical functions

    • circmean
    • circvar
    • circstd
  • Contingency table functions

    • chi2_contingency
    • contingency.expected_freq
    • contingency.margins
    • fisher_exact
  • Plot-tests

    • ppcc_max
    • ppcc_plot
    • probplot
    • boxcox_normplot
  • Masked statistics functions -- Module scipy.stats.mstats contains statistical functions for masked arrays.

    For more information in IPython, do:

    In [1]: from scipy.stats import mstats
    In [2]: mstats?
    

    Or, from the command line do $ pydoc scipy.stats.mstats.

  • Univariate and multivariate kernel density estimation (scipy.stats.kde)

    • gaussian_kde -- Representation of a kernel-density estimate using Gaussian kernels.

      Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.
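
A minimal sketch of gaussian_kde on a unimodal sample follows. The sample size, seed, and evaluation grid are assumptions made for this example; the bandwidth is left to the automatic default:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)   # unimodal sample

# Bandwidth is chosen automatically (Scott's rule by default).
kde = stats.gaussian_kde(data)

grid = np.linspace(-4.0, 4.0, 81)
density = kde(grid)               # estimated PDF evaluated on the grid

# A density should integrate to about 1 over a wide enough interval.
total = kde.integrate_box_1d(-10.0, 10.0)
print(grid[np.argmax(density)], total)
```
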

For many more statistics-related functions, install the software R and the interface package rpy.

3.2.11   Multidimensional image processing (scipy.ndimage)

The module scipy.ndimage contains various functions for multi-dimensional image processing.

For information on these functions, do (for example, in IPython):

In [6]: from scipy import ndimage
In [7]: ndimage?
In [8]: ndimage.convolve?

Or, from the command line, do: $ pydoc scipy.ndimage.convolve.

Here is an example -- It computes the multi-dimensional convolution of a Numpy ndarray:

import numpy as np
from scipy import ndimage


def test():
    a = np.array([[1, 2, 0, 0],
                  [5, 3, 0, 4],
                  [0, 0, 0, 7],
                  [9, 3, 0, 0]])
    k = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0]])
    result = ndimage.convolve(a, k, mode='constant', cval=0.0)
    return result

Here is a summary of the contents of scipy.ndimage:

  • Filters
    • convolve -- Multi-dimensional convolution
    • convolve1d -- 1-D convolution along the given axis
    • correlate -- Multi-dimensional correlation
    • correlate1d -- 1-D correlation along the given axis
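    • gaussian_filter -- Multi-dimensional Gaussian filter
    • gaussian_filter1d -- 1-D Gaussian filter along the given axis
    • gaussian_gradient_magnitude -- Multi-dimensional gradient magnitude using Gaussian derivatives
    • gaussian_laplace -- Multi-dimensional Laplace filter using Gaussian second derivatives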
    • generic_filter -- Multi-dimensional filter using a given function
    • generic_filter1d -- 1-D generic filter along the given axis
    • generic_gradient_magnitude
    • generic_laplace
    • laplace -- n-D Laplace filter based on approximate second derivatives
    • maximum_filter
    • maximum_filter1d
    • median_filter -- Calculates a multi-dimensional median filter
    • minimum_filter
    • minimum_filter1d
    • percentile_filter -- Calculates a multi-dimensional percentile filter
    • prewitt
    • rank_filter -- Calculates a multi-dimensional rank filter
    • sobel
    • uniform_filter -- Multi-dimensional uniform filter
    • uniform_filter1d -- 1-D uniform filter along the given axis
  • Fourier filters
    • fourier_ellipsoid
    • fourier_gaussian
    • fourier_shift
    • fourier_uniform
  • Interpolation
    • affine_transform -- Apply an affine transformation
    • geometric_transform -- Apply an arbitrary geometric transform
    • map_coordinates -- Map input array to new coordinates by interpolation
    • rotate -- Rotate an array
    • shift -- Shift an array
    • spline_filter
    • spline_filter1d
    • zoom -- Zoom an array
  • Measurements
    • center_of_mass -- The center of mass of the values of an array at labels
    • extrema -- Min's and max's of an array at labels, with their positions
    • find_objects -- Find objects in a labeled array
    • histogram -- Histogram of the values of an array, optionally at labels
    • label -- Label features in an array
    • labeled_comprehension
    • maximum
    • maximum_position
    • mean -- Mean of the values of an array at labels
    • median
    • minimum
    • minimum_position
    • standard_deviation -- Standard deviation of an n-D image array
    • sum -- Sum of the values of the array
    • variance -- Variance of the values of an n-D image array
    • watershed_ift
  • Morphology
    • binary_closing
    • binary_dilation
    • binary_erosion
    • binary_fill_holes
    • binary_hit_or_miss
    • binary_opening
    • binary_propagation
    • black_tophat
    • distance_transform_bf
    • distance_transform_cdt
    • distance_transform_edt
    • generate_binary_structure
    • grey_closing
    • grey_dilation
    • grey_erosion
    • grey_opening
    • iterate_structure
    • morphological_gradient
    • morphological_laplace
    • white_tophat
  • Utility
    • imread -- Load an image from a file
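
As an illustration of the measurement functions listed above, here is a minimal sketch that labels the connected regions of a small array and then measures them. The array is invented for this example:

```python
import numpy as np
from scipy import ndimage

a = np.array([[1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1]])

# Label connected regions of nonzero pixels (features are numbered from 1).
labeled, num_features = ndimage.label(a)
print(num_features)

# Center of mass of each labeled region, as (row, column) pairs.
centers = ndimage.center_of_mass(a, labeled, [1, 2])
print(centers)

# Bounding slices of each region.
slices = ndimage.find_objects(labeled)
print(a[slices[0]])
```
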

3.2.12   File IO (scipy.io)

Scipy provides routines to read/write a number of special file formats. Here are some of them:

  • MATLAB® files:
    • loadmat -- Read a MATLAB style mat file (version 4 through 7.1)
    • savemat -- Write a MATLAB style mat file (version 4 through 7.1)
    • whosmat -- List contents of a MATLAB style mat file (version 4 through 7.1)
  • IDL® files:
    • readsav -- Read an IDL 'save' file
  • Matrix Market files:
    • mminfo -- Query matrix info from Matrix Market formatted file
    • mmread -- Read matrix from Matrix Market formatted file
    • mmwrite -- Write matrix to Matrix Market formatted file
  • Unformatted Fortran files:
    • FortranFile -- A file object for unformatted sequential Fortran files
  • Netcdf:
    • netcdf_file -- A file object for NetCDF data
    • netcdf_variable -- A data object for the netcdf module
  • Harwell-Boeing files:
    • hb_read -- read H-B file
    • hb_write -- write H-B file
  • Wav sound files (scipy.io.wavfile):
    • read -- Return the sample rate (in samples/sec) and data from a WAV file.
    • write -- Write a numpy array as a WAV file.
    • WavFileWarning -- Base class for warnings generated by user code.
  • Arff files (scipy.io.arff):
    • loadarff -- Read an arff file.
    • MetaData -- Small container to keep useful information on an ARFF dataset.
    • ArffError -- Base class for I/O related errors.
    • ParseArffError -- Base class for I/O related errors.
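
Here is a minimal sketch of a round trip through a MATLAB style file using savemat, whosmat, and loadmat. The file name and variable name are hypothetical, and the file is written to a temporary directory:

```python
import os
import tempfile

import numpy as np
from scipy import io

a = np.arange(6).reshape(2, 3)

# Write to a temporary directory so that no existing file is touched.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'example.mat')   # hypothetical file name

io.savemat(path, {'a': a})          # keys of the dict become MATLAB variable names
print(io.whosmat(path))             # list variable names, shapes, and types

loaded = io.loadmat(path)           # returns a dict; arrays come back at least 2-D
print(np.array_equal(loaded['a'], a))
```
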

3.3   Pandas

Pandas vs. Numpy -- Pandas raises Numpy data structures to a higher level. In particular, see the DataFrame object.

For documentation on Pandas, see: http://pandas.pydata.org/pandas-docs/stable/. There are tutorials, get-started guides, cookbook docs, and more.

10 Minutes to pandas seems especially helpful, although it does contain a lot more than 10 minutes' worth of material. It gives basic instructions on how to use Pandas data types.

And, be sure to look at the various Pandas tutorials.

There are also cookbooks full of code snippets:

It is perhaps advisable to view Pandas as being just as much about learning techniques for (1) cleaning up your data, (2) exploring and finding significant aspects of your data, and (3) viewing and displaying your data, as it is about performing calculations and analysis on it. Pandas provides such a rich set of techniques for working with your data that you should expect to spend a reasonable amount of time learning to do the tasks you need, rather than quickly learning some small set of functions.

3.3.1   Create Pandas data structures

Here is an example that creates several of the Pandas data structures that are used in the "10 Minutes to pandas" document referenced above:

import numpy as np
import pandas as pd


def make_sample_dataframe():
    """Make sample dates and DataFrame.  Returns (dates, df)."""
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    return dates, df

And, here is an example of the use of the above function:

In [117]: import utils01
In [118]: dates, df = utils01.make_sample_dataframe()
In [119]:
In [119]: dates
Out[119]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [120]:
In [120]: df
Out[120]:
                   A         B         C         D
2013-01-01  0.521515  1.006002 -1.408913 -0.218981
2013-01-02 -0.517541 -0.190499  0.397701  0.895858
2013-01-03  0.068253  0.499286 -1.098401 -1.323183
2013-01-04 -0.086779  0.025269  0.459892  0.588754
2013-01-05  1.384825 -1.141312  0.097294  0.169665
2013-01-06 -0.391738 -0.072600  0.196514  0.799174

3.3.2   View Pandas data structures

View the first and last rows of a DataFrame:

In [34]: df.head(n=2)
Out[34]:
                   A         B         C         D
2013-01-01 -0.557541  1.016474  0.933149 -0.524661
2013-01-02  1.682318 -1.605635 -0.324727  2.057636
In [35]:
In [35]: df.tail(n=3)
Out[35]:
                   A         B         C         D
2013-01-04  0.696414  0.538999  1.131596 -0.960681
2013-01-05 -0.175765 -0.494210  1.111779 -0.670209
2013-01-06 -1.615098  0.018027  0.584815 -1.508152

Get the shape, column (labels), and actual data from a DataFrame:

In [38]: df.shape
Out[38]: (6, 4)
In [39]: df.columns
Out[39]: Index(['A', 'B', 'C', 'D'], dtype='object')
In [40]: df.values
Out[40]:
array([[-0.55754086,  1.01647419,  0.93314867, -0.52466119],
       [ 1.68231758, -1.60563477, -0.32472655,  2.05763649],
       [-0.45481149, -0.09087637, -1.1383327 , -0.7950994 ],
       [ 0.69641379,  0.53899898,  1.13159619, -0.96068123],
       [-0.17576451, -0.49421043,  1.11177912, -0.67020918],
       [-1.61509837,  0.01802738,  0.58481469, -1.50815216]])
In [41]: type(df.values)
Out[41]: numpy.ndarray

Note that df.values returns an ndarray.

3.3.3   Access the contents of a DataFrame

Access a row or range of rows -- Use .iloc with a single index or a slice. Examples:

In [72]: df.iloc[1]
Out[72]:
A    0.721339
B    0.733763
C   -1.153457
D   -1.345582
Name: 2013-01-02 00:00:00, dtype: float64
In [73]: df.iloc[1:2]
Out[73]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
In [74]: df.iloc[1:4]
Out[74]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
2013-01-03  2.047318  0.406103 -1.893892  0.065913
2013-01-04  0.737643 -1.539155  0.410927  0.038682

Access a row or range of rows -- Use .loc with index labels. Examples:

In [64]: df.loc[dates[1]]
Out[64]:
A    0.721339
B    0.733763
C   -1.153457
D   -1.345582
Name: 2013-01-02 00:00:00, dtype: float64
In [65]: df.loc[dates[1]:dates[2]]
Out[65]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
2013-01-03  2.047318  0.406103 -1.893892  0.065913
In [66]: df.loc[dates[1]:dates[1]]
Out[66]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
In [67]: df.loc['2013-01-01']
Out[67]:
A    1.373992
B   -0.080698
C   -0.018425
D   -0.424205
Name: 2013-01-01 00:00:00, dtype: float64
In [68]: df.loc['2013-01-01':'2013-01-03']
Out[68]:
                   A         B         C         D
2013-01-01  1.373992 -0.080698 -0.018425 -0.424205
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
2013-01-03  2.047318  0.406103 -1.893892  0.065913

Notes:

  • dates was used to create the index for df:

    def make_sample_dataframe1():
        """Make sample dates and DataFrame.  Returns (dates, df)."""
        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(
            np.random.randn(6, 4),
            index=dates,
            columns=list('ABCD'))
        return dates, df
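
The examples above also show a difference in slicing: .iloc slices by position and excludes the end position, while .loc slices by label and includes the end label. A minimal sketch, using a small DataFrame built only for this illustration (not the random data shown above):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_position = df.iloc[1:3]                       # rows 1 and 2; position 3 is excluded
by_label = df.loc[dates[1]:dates[3]]             # rows 1, 2, and 3; the end label is included
by_string = df.loc['2013-01-02':'2013-01-04']    # same rows, selected with date strings

print(len(by_position), len(by_label))
print(by_label.equals(by_string))
```
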
    

Access the rows where the content of an item (column) in that row satisfies a condition or test:

In [10]: df.loc[df.B > 0].head()
Out[10]:
    Unnamed: 0         A         B         C         D
    2   2013-01-03  0.986316  1.870495 -1.598345 -2.551315
    5   2013-01-06  1.385534  1.328005  1.741578 -0.409209
    7   2013-01-08 -0.820344  0.318531  0.278434 -0.898119
    9   2013-01-10 -2.342766  0.048417 -0.352930 -0.134832
    20  2013-01-21 -0.567319  1.784550 -0.114723  0.315661

Or:

In [9]: df.loc[df.B.apply(lambda x: x > 0)].head()
Out[9]:
    Unnamed: 0         A         B         C         D
    2   2013-01-03  0.986316  1.870495 -1.598345 -2.551315
    5   2013-01-06  1.385534  1.328005  1.741578 -0.409209
    7   2013-01-08 -0.820344  0.318531  0.278434 -0.898119
    9   2013-01-10 -2.342766  0.048417 -0.352930 -0.134832
    20  2013-01-21 -0.567319  1.784550 -0.114723  0.315661

Notes:

  • The use of .apply() along with lambda (or a named Python function) enables us to select rows based on an arbitrarily complex condition.

  • Also, consider using functools.partial(). The following selects rows where the value in column B is in the range -0.1 to 0.1:

    In [25]: import functools
    In [26]: f = functools.partial(lambda x, y, z: z > x and z < y, -0.1, 0.1)
    In [27]:
    In [27]: df.loc[df.B.apply(f)].head()
    Out[27]:
        Unnamed: 0         A         B         C         D
        9   2013-01-10 -2.342766  0.048417 -0.352930 -0.134832
        27  2013-01-28 -0.673330  0.075427 -0.477715 -0.475463
        33  2013-02-03 -0.776301  0.015220  0.518606 -0.286090
        38  2013-02-08  0.894722  0.005027 -0.763636 -0.150279
        44  2013-02-14 -0.403519 -0.059570  0.929560 -1.065283
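
For a simple range test like the one above, a boolean mask built from whole-column comparisons is a common alternative to .apply(). The sketch below assumes a DataFrame of random values (the fixed seed is only there to make it repeatable) and checks that both approaches select the same rows:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, only so the sketch is repeatable
df = pd.DataFrame(rng.randn(100, 4), columns=list('ABCD'))

# Each comparison produces a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than > and <.
mask = (df.B > -0.1) & (df.B < 0.1)
selected = df.loc[mask]

# The same selection written with .apply(), as in the notes above.
selected_apply = df.loc[df.B.apply(lambda z: -0.1 < z < 0.1)]

print(selected.equals(selected_apply))
```

The mask version avoids calling a Python function once per element, which tends to matter on large columns; .apply() remains the more flexible choice when the condition cannot be written as column arithmetic.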
    

Access a column or several columns -- Use the Python indexing operator ([]), with a column label or iterable of column labels. Or, for a single column, use dot notation. Examples:

In [98]: df['B']
Out[98]:
2013-01-01   -0.080698
2013-01-02    0.733763
2013-01-03    0.406103
2013-01-04   -1.539155
2013-01-05   -0.963585
2013-01-06    0.934215
Freq: D, Name: B, dtype: float64
In [99]: df[['B', 'D']]
Out[99]:
                   B         D
2013-01-01 -0.080698 -0.424205
2013-01-02  0.733763 -1.345582
2013-01-03  0.406103  0.065913
2013-01-04 -1.539155  0.038682
2013-01-05 -0.963585 -0.449162
2013-01-06  0.934215  1.473294
In [100]:
In [100]: df.C
Out[100]:
2013-01-01   -0.018425
2013-01-02   -1.153457
2013-01-03   -1.893892
2013-01-04    0.410927
2013-01-05   -1.627970
2013-01-06    0.240306
Freq: D, Name: C, dtype: float64

Access individual elements by index relative to zero -- Use .iloc[r, c]:

In [42]: df.iloc[0]
Out[42]:
A    1.373992
B   -0.080698
C   -0.018425
D   -0.424205
Name: 2013-01-01 00:00:00, dtype: float64
In [43]: df.iloc[0, 1]
Out[43]: -0.08069801201343964
In [44]: df.iloc[0, 1:3]
Out[44]:
B   -0.080698
C   -0.018425
Name: 2013-01-01 00:00:00, dtype: float64
In [45]: df.iloc[0:4, 1]
Out[45]:
2013-01-01   -0.080698
2013-01-02    0.733763
2013-01-03    0.406103
2013-01-04   -1.539155
Freq: D, Name: B, dtype: float64
In [46]: df.iloc[0:4, 1:-1]
Out[46]:
                   B         C
2013-01-01 -0.080698 -0.018425
2013-01-02  0.733763 -1.153457
2013-01-03  0.406103 -1.893892
2013-01-04 -1.539155  0.410927
In [47]: df.iloc[0:4, 1:]
Out[47]:
                   B         C         D
2013-01-01 -0.080698 -0.018425 -0.424205
2013-01-02  0.733763 -1.153457 -1.345582
2013-01-03  0.406103 -1.893892  0.065913
2013-01-04 -1.539155  0.410927  0.038682
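
The same element can also be reached by label instead of position. The following sketch, again on a randomly filled DataFrame of the same shape, shows .loc[row_label, column_label] along with .at and .iat, which are the pandas accessors meant for a single scalar:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

by_position = df.iloc[0, 1]        # row 0, column 1, by position
by_label = df.loc[dates[0], 'B']   # the same cell, addressed by labels
by_at = df.at[dates[0], 'B']       # .at: label-based access to one scalar
by_iat = df.iat[0, 1]              # .iat: position-based access to one scalar

print(by_position, by_label, by_at, by_iat)
```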

3.3.4   Iterate over a DataFrame

There are several ways to do this. Here are some examples:

import utils01

def test():
        dates, df = utils01.make_sample_dataframe1()
        # iterate over column labels.
        print("*\n* column labels --\n*")
        print([x for x in df])
        # iterate over items
        print("*\n* items --\n*")
        print([x for x in df.head(n=2).items()])
        # iterate over rows
        print("*\n* rows --\n*")
        print([x for x in df.head(n=2).iterrows()])
        # iterate over rows as named tuples.
        print("*\n* named tuples --\n*")
        print([x for x in df.head(n=2).itertuples()])
        # iterate over rows as named tuples returning one column from each tuple.
        print("*\n* column \"B\" from named tuple --\n*")
        print([x.B for x in df.head(n=2).itertuples()])

Here is the output from the above function:

In [67]: test()
*
* column labels --
*
['A', 'B', 'C', 'D']
*
* items --
*
[('A', 2013-01-01   -2.443710
2013-01-02   -1.003475
Freq: D, Name: A, dtype: float64), ('B', 2013-01-01   -0.320540
2013-01-02   -1.020769
Freq: D, Name: B, dtype: float64), ('C', 2013-01-01    0.010302
2013-01-02    0.115615
Freq: D, Name: C, dtype: float64), ('D', 2013-01-01    0.935831
2013-01-02   -0.514601
Freq: D, Name: D, dtype: float64)]
*
* rows --
*
[(Timestamp('2013-01-01 00:00:00', freq='D'), A   -2.443710
B   -0.320540
C    0.010302
D    0.935831
Name: 2013-01-01 00:00:00, dtype: float64), (Timestamp('2013-01-02 00:00:00', freq='D'), A   -1.003475
B   -1.020769
C    0.115615
D   -0.514601
Name: 2013-01-02 00:00:00, dtype: float64)]
*
* named tuples --
*
[Pandas(Index=Timestamp('2013-01-01 00:00:00', freq='D'), A=-2.4437103289150857, B=-0.32054023603910436, C=0.01030189942471081, D=0.9358311337233644), Pandas(Index=Timestamp('2013-01-02 00:00:00', freq='D'), A=-1.0034752077816913, B=-1.0207687970125863, C=0.11561494820245698, D=-0.5146012044818192)]
*
* column "B" from named tuple --
*
[-0.32054023603910436, -1.0207687970125863]

Iterating over a pandas.DataFrame produces the column labels, which can be used to access the columns of the DataFrame. Example:

In [92]: for column in df:
    ...:     print("{}[0]: {:7.3f}".format(column, getattr(df, column)[0]))
    ...:
A[0]:  -0.368
B[0]:   1.122
C[0]:  -0.890
D[0]:   0.076

An easier (and cleaner?) way to access a column would be: df[column].

In contrast, iterating over a pandas.Series produces the items in the Series. The same is true of an index. Example (note that dates is a DatetimeIndex rather than a Series, but it iterates the same way):

In [112]: for date in dates:
     ...:     print('date: {}/{}/{}'.format(date.month, date.day, date.year))
     ...:
date: 1/1/2013
date: 1/2/2013
date: 1/3/2013
date: 1/4/2013
date: 1/5/2013
date: 1/6/2013

Here is a simple bit of code that iterates over each of the items (cells) in a Pandas DataFrame. This function prints out elements column by column:

def show_df(df):
    for idx1, label in enumerate(df):
        print('{}. Column: {}'.format(idx1, label))
        for idx2, item in enumerate(df[label]):
            print('    {}.{}. {:+6.4f}'.format(idx1, idx2, item))

And, here is what the above (function show_df) might display:

In [78]: show_df(df.head(n=2))
0. Column: A
    0.0. +0.9590
    0.1. -3.6568
1. Column: B
    1.0. +1.1409
    1.1. -0.4395
2. Column: C
    2.0. +1.2634
    2.1. -0.3644
3. Column: D
    3.0. +0.0824
    3.1. +1.1789

And, here is a function that prints out elements row by row (i.e. one row after another):

def show_df_by_rows(df):
    columns = df.columns
    for row, index in enumerate(df.index):
        print('{}. Row: {}'.format(row, index))
        for idx, item in enumerate(df.loc[index]):
            print('    {}.{}. {:+6.4f}'.format(idx, columns[idx], item))

Here is a sample printout from the above function:

0. Row: 2013-01-01 00:00:00
        0.A. +0.9590
        1.B. +1.1409
        2.C. +1.2634
        3.D. +0.0824
1. Row: 2013-01-02 00:00:00
        0.A. -3.6568
        1.B. -0.4395
        2.C. -0.3644
        3.D. +1.1789
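
A possible variant of the row-by-row function uses .iterrows(), which yields each index label together with that row as a Series, so no lookup through df.loc is needed. The name show_df_by_rows2 and the small two-column DataFrame are made up for this sketch; the function also returns the lines it prints so they can be inspected:

```python
import pandas as pd

def show_df_by_rows2(df):
    """Hypothetical variant of show_df_by_rows built on .iterrows()."""
    lines = []
    for row, (index, series) in enumerate(df.iterrows()):
        lines.append('{}. Row: {}'.format(row, index))
        # series.items() yields (column label, value) pairs for this row.
        for idx, (label, item) in enumerate(series.items()):
            lines.append('    {}.{}. {:+6.4f}'.format(idx, label, item))
    print('\n'.join(lines))
    return lines

sample = pd.DataFrame(
    {'A': [0.9590, -3.6568], 'B': [1.1409, -0.4395]},
    index=pd.date_range('20130101', periods=2))
lines = show_df_by_rows2(sample)
```

Keep in mind that .iterrows() builds a new Series for every row, so on a DataFrame with mixed column types the values in a row may be converted to a common type.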

You can do something analogous with list comprehensions or generator expressions. For example, consider this code:

def show_dataframe(df):
        generator = ((index, b.items()) for (index, b) in
                                 ((index, df.loc[index]) for index in df.index))
        for date, data in generator:
                print('date: {}'.format(date))
                for col, item in data:
                        print('    col: {}  item: {:12.4f}'.format(col, item))

When we run the above, calling show_dataframe, we might see:

In [90]: show_dataframe(df.tail(2))
date: 2013-01-05 00:00:00
        col: A  item:       0.2175
        col: B  item:       0.1573
        col: C  item:      -0.2240
        col: D  item:       0.2395
date: 2013-01-06 00:00:00
        col: A  item:       0.1440
        col: B  item:      -0.9796
        col: C  item:      -2.2432
        col: D  item:      -0.7182

Notes:

  • In the above example, we produced generator expressions. Note the parentheses around the outer expression and inner expression used to produce generator. If we had used square brackets instead of parentheses, that expression would have produced lists.
  • The function show_dataframe contains a nested loop: the outer loop iterates over the outer generator expression, and within it an inner loop iterates over the (column, value) pairs that .items() produces for each row.

3.3.5   Grouping items in a DataFrame

You can group items in a DataFrame according to some criteria, then process only items in that group. For example:

In [363]: dates, df = utils01.make_sample_dataframe1()
In [364]: df
Out[364]:
                   A         B         C         D
2013-01-01  0.286823 -0.490076  1.876985  0.900970
2013-01-02  0.338896 -0.111205 -1.516295  1.344511
2013-01-03 -1.045215 -0.155277 -0.238831  0.763586
2013-01-04  0.911923  0.383383 -1.838096 -0.233212
2013-01-05 -0.424031 -0.396694 -1.260573  1.912463
2013-01-06  1.198149 -0.729439  1.578052 -1.139293
In [365]: f1 = lambda x: 0 if x < 0.0 else 1
In [366]: df["E"] = [f1(x) for x in df.A]
In [367]: df
Out[367]:
                   A         B         C         D  E
2013-01-01  0.286823 -0.490076  1.876985  0.900970  1
2013-01-02  0.338896 -0.111205 -1.516295  1.344511  1
2013-01-03 -1.045215 -0.155277 -0.238831  0.763586  0
2013-01-04  0.911923  0.383383 -1.838096 -0.233212  1
2013-01-05 -0.424031 -0.396694 -1.260573  1.912463  0
2013-01-06  1.198149 -0.729439  1.578052 -1.139293  1
In [368]: groups = df.groupby("E")
In [369]:
In [369]: len(groups)
Out[369]: 2
In [371]: groups.get_group(0)
Out[371]:
                   A         B         C         D  E
2013-01-03 -1.045215 -0.155277 -0.238831  0.763586  0
2013-01-05 -0.424031 -0.396694 -1.260573  1.912463  0
In [372]:
In [372]: groups.get_group(1)
Out[372]:
                   A         B         C         D  E
2013-01-01  0.286823 -0.490076  1.876985  0.900970  1
2013-01-02  0.338896 -0.111205 -1.516295  1.344511  1
2013-01-04  0.911923  0.383383 -1.838096 -0.233212  1
2013-01-06  1.198149 -0.729439  1.578052 -1.139293  1

Notes:

  • We use the function/lambda f1 to distinguish between values that are less than zero and those that are greater than or equal to zero.
  • We create a list of keys depending on the values in column "A".
  • We create a new column in our DataFrame containing these keys.
  • We group the DataFrame depending on the values in this new column.
  • Next we can determine the number of groups (using len(groups)).
  • And we can access each group individually (with groups.get_group(n)).
  • Notice that all the items in the first group have negative values in column "A", and all the items in the second group have positive values in column "A".

An alternative way to do the above task would pass a function to the .groupby method. That function could assign or select rows in arbitrarily complex ways. For example, the following function could assign items to two groups depending on whether the value in column "A" is negative or positive:

In [33]: def f1(index):
    ...:     return 1 if df.loc[index].A < 0.0 else 0
    ...:
    ...:
In [34]:
In [34]: a = df.groupby(f1)
In [35]:
In [35]: len(a)
Out[35]: 2
In [36]:
In [36]: a.get_group(0)
Out[36]:
                   A         B         C         D  E
2013-01-01  0.823745  1.259863  0.099038  2.401296  0
2013-01-03  1.067624  1.106958  1.616902  0.939021  0
2013-01-04  1.152899  0.190998 -0.062540 -1.786131  0
2013-01-06  0.680271  1.307369 -0.024296 -0.973855  0
In [37]:
In [37]: a.get_group(1)
Out[37]:
                   A         B         C         D  E
2013-01-02 -0.358235 -1.920455 -0.553173  0.580201  1
2013-01-05 -0.226727  0.180529  0.900700 -1.835082  1
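
When the grouping criterion can be written as a column expression, another option is to pass the resulting boolean Series to .groupby directly, without a helper column or a key function. The sketch below uses a small hand-made DataFrame (not the sample data above) so that the per-group results are easy to check by eye:

```python
import pandas as pd

df = pd.DataFrame({'A': [0.5, -1.0, 2.0, -3.0],
                   'B': [1.0, 2.0, 3.0, 4.0]})

# Group on a boolean Series: the key is True where A is negative.
groups = df.groupby(df.A < 0)

# Aggregate one column per group.
mean_b = groups['B'].mean()
sizes = groups.size()

# Or process each group in turn; key is False or True here.
for key, group in groups:
    print('A negative: {}  rows: {}'.format(key, len(group)))
```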

3.3.6   Applying functions to a DataFrame

You can do this in a variety of ways:

  • Element-wise -- Use .map for Series and .applymap for DataFrame:

    In [171]: dates.map(lambda x: x.day)
    Out[171]: Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
    In [172]: df.applymap(lambda x: 0.0 if x < 0.0 else x * 10.0)
    Out[172]:
                       A          B          C         D
    2013-01-01  0.000000  11.222224   0.000000  0.764820
    2013-01-02  8.165304   0.000000   8.425176  0.000000
    2013-01-03  0.000000   7.066568  10.162480  0.000000
    2013-01-04  7.097722   0.000000  10.544352  2.593139
    2013-01-05  0.000000   0.000000  10.031058  6.354610
    2013-01-06  5.629199   1.180783   0.000000  0.000000
    
  • Row-wise and column-wise -- Use one of:

    • df.apply(fn) -- Apply function to each column.
    • df.apply(fn, axis=1) -- Apply function to each row.
  • For functions that take and return a DataFrame or that take and return a Series, use .pipe. Example:

    In [197]: fn = lambda x: np.abs(x)
    In [198]: df.pipe(fn)
    Out[198]:
                       A         B         C         D
    2013-01-01  0.368409  1.122222  0.889764  0.076482
    2013-01-02  0.816530  0.963447  0.842518  1.371106
    2013-01-03  0.164827  0.706657  1.016248  0.474849
    2013-01-04  0.709772  1.695648  1.054435  0.259314
    2013-01-05  0.057673  0.713738  1.003106  0.635461
    2013-01-06  0.562920  0.118078  1.904701  0.149196
    

And, remember that there may be use cases where it is useful to create a "vectorized" function with numpy.vectorize.
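
Here is a small sketch of the column-wise and row-wise forms of .apply, followed by numpy.vectorize. The DataFrame and the function names (value_range, scale) are invented for illustration. Note that numpy.vectorize is a convenience wrapper around a Python loop, not a performance tool:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-4.0, 5.0, 6.0]})

def value_range(series):
    """Spread between the largest and smallest value in a Series."""
    return series.max() - series.min()

col_ranges = df.apply(value_range)           # one result per column
row_ranges = df.apply(value_range, axis=1)   # one result per row

# Wrap a scalar function so that it accepts a whole array.
scale = np.vectorize(lambda x: 0.0 if x < 0.0 else x * 10.0, otypes=[float])
scaled = pd.DataFrame(scale(df.values), index=df.index, columns=df.columns)

print(col_ranges)
print(row_ranges)
print(scaled)
```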

3.3.8   Statistical analysis

You can do preliminary and rudimentary statistical analysis. See: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics.

For more complex work, consider using the Scipy tools.

Examples:

In [65]: df.describe()
Out[65]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.255717 -0.067143  0.211290 -0.127586
std    1.102925  0.651381  0.663725  0.691202
min   -0.746677 -1.277578 -0.445694 -1.101834
25%   -0.415984 -0.110226 -0.142937 -0.473979
50%   -0.111748  0.004162 -0.060588 -0.210746
75%    0.545268  0.374949  0.470344  0.363150
max    2.257601  0.516208  1.357676  0.765088
In [66]:
In [66]: np.mean(df.A)
Out[66]: 0.2557174574376679
In [67]:
In [67]: np.std(df.A, ddof=1)
Out[67]: 1.102925321931004
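
As one example of the Scipy tools, scipy.stats.describe returns several summary statistics in a single call. This is only a sketch on random data, and it assumes that SciPy is installed; the variance it reports uses ddof=1 by default, which matches the default of the pandas .var() method:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.RandomState(0)  # fixed seed, only so the sketch is repeatable
df = pd.DataFrame(rng.randn(6, 4), columns=list('ABCD'))

# The result has the fields: nobs, minmax, mean, variance, skewness, kurtosis.
result = stats.describe(df.A.values)

print(result.nobs, result.mean, result.variance)
```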

4   Visualization and graphing

4.2   Bokeh

See: https://bokeh.pydata.org/en/latest/

Here are Bokeh examples taken from the documentation:

#!/usr/bin/env python

from bokeh.plotting import figure, output_file, show

def test01():
        # prepare some data
        x = [1, 2, 3, 4, 5]
        y = [6, 7, 2, 4, 5]
        # output to static HTML file
        output_file("lines.html")
        # create a new plot with a title and axis labels
        p = figure(title="simple line example", x_axis_label='x', y_axis_label='y')
        # add a line renderer with legend and line thickness
        p.line(x, y, legend_label="Temp.", line_width=2)
        # show the results
        show(p)

def test02():
        # prepare some data
        x = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
        y0 = [i**2 for i in x]
        y1 = [10**i for i in x]
        y2 = [10**(i**2) for i in x]
        # output to static HTML file
        output_file("log_lines.html")
        # create a new plot
        p = figure(
                tools="pan,box_zoom,reset,save",
                y_axis_type="log", y_range=[0.001, 10**11], title="log axis example",
                x_axis_label='sections', y_axis_label='particles'
        )
        # add some renderers
        p.line(x, x, legend_label="y=x")
        p.circle(x, x, legend_label="y=x", fill_color="white", size=8)
        p.line(x, y0, legend_label="y=x^2", line_width=3)
        p.line(x, y1, legend_label="y=10^x", line_color="red")
        p.circle(
                x, y1,
                legend_label="y=10^x",
                fill_color="red", line_color="red",
                size=6)
        p.line(x, y2, legend_label="y=10^x^2", line_color="orange", line_dash="4 4")
        # show the results
        #show(p, browser="firefox")
        show(p)

def main():
        test01()
        test02()

if __name__ == '__main__':
        main()

There are more examples in the Bokeh "Quickstart" document: https://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart

4.3   Altair

See: https://pypi.python.org/pypi/altair

Note that Altair is not in the Anaconda distribution, but is easy to install with pip.

5   Optimization, parallel processing, access to C/C++, etc.

5.1   Numba

See: http://numba.pydata.org/numba-doc/dev/index.html.

And, here is an interesting article related to Numba: https://www.anaconda.com/blog/developer-blog/parallel-python-with-numba-and-parallelaccelerator/.

From the Numba user manual:

Numba is a compiler for Python array and numerical functions
that gives you the power to speed up your applications with high
performance functions written directly in Python.

Numba generates optimized machine code from pure Python code
using the LLVM compiler infrastructure. With a few simple
annotations, array-oriented and math-heavy Python code can be
just-in-time optimized to performance similar as C, C++ and
Fortran, without having to switch languages or Python
interpreters.

Numba’s main features are:

  * on-the-fly code generation (at import time or runtime, at the
    user’s preference)
  * native code generation for the CPU (default) and GPU hardware
  * integration with the Python scientific software stack (thanks
    to Numpy)

Here is some sample test code, copied from the Numba documentation:

# file: numba_test01.py

import numba

@numba.jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

def plain_sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

And, here is an example that calls the two above functions, one optimized by Numba and the other not. Notice the timings. The Numba optimized version is more than two orders of magnitude faster:

In [30]: import numba_test01 as nt
In [31]: a = np.ones((1000, 1200))
In [32]: time nt.plain_sum2d(a)
CPU times: user 621 ms, sys: 0 ns, total: 621 ms
Wall time: 622 ms
Out[32]: 1200000.0
In [33]: time nt.sum2d(a)
CPU times: user 3.68 ms, sys: 0 ns, total: 3.68 ms
Wall time: 3.7 ms
Out[33]: 1200000.0

There is lots more that can be done with Numba in the way of optimizing code. See the docs.

5.2   Dask

The documentation on Dask can be found here: http://dask.pydata.org/en/latest/docs.html.

This summary of Dask is from the Dask documentation:

Dask is a flexible parallel computing library for analytic computing.

Dask is composed of two components:

 1. Dynamic task scheduling optimized for computation. This is similar to
    Airflow, Luigi, Celery, or Make, but optimized for interactive
    computational workloads.
 2. “Big Data” collections like parallel arrays, dataframes, and lists
    that extend common interfaces like NumPy, Pandas, or Python iterators
    to larger-than-memory or distributed environments. These parallel
    collections run on top of the dynamic task schedulers.

If you are beginning to learn Dask, you might want some sample data:

  • The dask tutorial contains a script for generating sample data files. You can find the tutorial repository here: https://github.com/dask/dask-tutorial.

  • And, here is a script that will generate a few HDF5 files. I copied it from the Dask Web site (http://dask.pydata.org/en/latest/examples/dataframe-hdf5.html), and made a few minor modifications:

    #!/usr/bin/env python
    
    """
    synopsis:
        generate sample dask data files.
    usage:
        python generate_dask_data.py <file_prefix>
    options:
        -h, --help
                Display this help.
    """
    
    import sys
    import string
    import random
    import pandas as pd
    import numpy as np
    
    def generate(prefix):
        # dict to keep track of hdf5 filename and each key
        fileKeys = {}
        for i in range(10):
            # randomly pick letter as dataset key
            groupkey = random.choice(list(string.ascii_lowercase))
            # randomly pick a number as hdf5 filename
            filename = prefix + str(np.random.randint(100)) + '.h5'
            # Make a dataframe; 26 rows, 2 columns
            df = pd.DataFrame({'x': np.random.randint(1, 1000, 26),
                              'y': np.random.randint(1, 1000, 26)},
                              index=list(string.ascii_lowercase))
            # Write hdf5 to current directory
            df.to_hdf(filename, key='/' + groupkey, format='table')
            fileKeys[filename] = groupkey
        # prints hdf5 filenames and keys for each
        print(fileKeys)
    
    def main():
        args = sys.argv[1:]
        if len(args) != 1:
            sys.exit(__doc__)
        if args[0] in ('-h', '--help'):
            sys.exit(__doc__)
        prefix = args[0]
        generate(prefix)
    
    if __name__ == '__main__':
        main()
    

I used the above script to build sample data files as follows:

$ ./generate_dask_data.py "data02/sample_"

Then I read these HDF5 files into a Dask DataFrame by using the following:

In [38]: df = dd.read_hdf('./data02/sample_*.h5', key='/*')
In [39]: df
Out[39]:
Dask DataFrame Structure:
                    x      y
npartitions=10
                int64  int64
                  ...    ...
...               ...    ...
                  ...    ...
                  ...    ...
Dask Name: concat, 22 tasks
In [40]:

After which, I can do the following, for example:

In [40]: df.x.mean().compute()
Out[40]: 501.53076923076924

We can also see how our data has been broken down into separate partitions, for example with this function:

def test(df):
    results = []
    for idx in range(df.npartitions):
        mean = df.get_partition(idx).x.mean().compute()
        print('idx: {}  mean: {}'.format(idx, mean))
        results.append((idx, mean))
    return results

Which produces something like the following:

In [10]: test(df)
idx: 0  mean: 473.7692307692308
idx: 1  mean: 436.5769230769231
idx: 2  mean: 501.2692307692308
idx: 3  mean: 565.4230769230769
idx: 4  mean: 516.8846153846154
idx: 5  mean: 501.34615384615387
idx: 6  mean: 531.3076923076923
idx: 7  mean: 428.61538461538464
idx: 8  mean: 565.2307692307693
idx: 9  mean: 494.88461538461536
Out[10]:
[(0, 473.7692307692308),
 (1, 436.5769230769231),
 (2, 501.2692307692308),
 (3, 565.4230769230769),
 (4, 516.8846153846154),
 (5, 501.34615384615387),
 (6, 531.3076923076923),
 (7, 428.61538461538464),
 (8, 565.2307692307693),
 (9, 494.88461538461536)]

5.2.1   Dask for big data

Dask enables you to divide a large data structure or data set, for example, a Pandas DataFrame, into smaller structures, for example, smaller DataFrames, then load those smaller chunks from disk and process them.

Example:

  1. First we'll create a data set, a Pandas DataFrame, that we can divide up into smaller chunks. Here is a Python script that we can use to create a sample CSV (comma separated values) file:

    #!/usr/bin/env python
    
    # file: write_csv.py
    
    """
    synopsis:
        Write sample CSV file from Pandas DataFrame.
    usage:
        python write_csv.py <outfilename> <num_rows>
    example:
        python write_csv.py test_data.csv 200
    """
    
    import sys
    import numpy as np
    import pandas as pd
    
    def make_sample_dataframe(periods):
        """Make sample dates and DataFrame.  Returns (dates, df)."""
        dates = pd.date_range('20130101', periods=periods)
        df = pd.DataFrame(
            np.random.randn(periods, 4),
            index=dates,
            columns=list('ABCD'))
        return dates, df
    
    def create_data(outfilename, count):
        dates, df = make_sample_dataframe(count)
        df.to_csv(outfilename)
    
    def main():
        args = sys.argv[1:]
        if len(args) != 2:
            sys.exit(__doc__)
        outfilename = args[0]
        count = int(args[1])
        create_data(outfilename, count)
    
    if __name__ == '__main__':
        main()
    

    And, from within IPython, we can run it to create a CSV file as follows:

    In [113]: %run write_csv.py tmp2.csv 200
    

    Now, we can read that file to create a Dask DataFrame with the following:

    In [115]: import dask.dataframe as dd
    In [116]: daskdf = dd.read_csv('tmp2.csv')
    
  2. We can look at our data with daskdf.head() and daskdf.tail():

    In [117]: daskdf.head()
    Out[117]:
       Unnamed: 0         A         B         C         D
    0  2013-01-01  1.719008  0.168998 -0.582670 -0.199597
    1  2013-01-02  0.947192  1.449137 -0.701263  0.342353
    2  2013-01-03  1.321397  0.035692  0.147275  1.551782
    3  2013-01-04 -0.286258  0.592772  1.770504  1.752572
    4  2013-01-05  1.695924  0.159782  2.150698 -0.060106
    In [118]: daskdf.tail()
    Out[118]:
      Unnamed: 0         A         B         C         D
    195  2013-07-15  0.303020  0.710051 -0.904407 -0.451793
    196  2013-07-16 -0.703248 -0.973423 -0.830585  0.183094
    197  2013-07-17  0.886046  1.530008  1.319875 -0.318807
    198  2013-07-18  0.021749  2.570984  0.572013  1.249558
    199  2013-07-19 -0.570810 -0.240768  2.203662 -0.014111
    

    Also see the Pandas section "View Pandas data structures" for ways to view structures.

  3. Next, we'll divide it up -- This is an important capability of Dask; it enables us to process DataFrames/arrays that are too large to fit comfortably in memory, or of which we are only interested in sub-slices. In this case, we'll specify a block size (or a partition size) when we read the CSV file and create a Dask DataFrame:

    In [58]: %run write_csv.py tmp3.csv 200
    In [59]:
    In [59]: df3 = dd.read_csv('tmp3.csv', blocksize=600)
    In [60]:
    In [60]: df3.head()
    Out[60]:
       Unnamed: 0         A         B         C         D
    0  2013-01-01  1.907704  0.317188  0.779075  0.327731
    1  2013-01-02 -0.936242 -0.679869 -0.817254 -0.810020
    2  2013-01-03 -1.465717 -0.775163 -0.621830 -0.171773
    3  2013-01-04  0.878534 -0.910678 -0.363762  0.462970
    4  2013-01-05 -0.182779  0.174225 -1.483841 -0.062528
    In [61]: df3.tail()
    Out[61]:
       Unnamed: 0         A         B         C         D
    0  2013-07-15  0.426699 -2.126057 -0.784172  0.780982
    1  2013-07-16 -0.727647 -1.552699  0.750276 -0.788475
    2  2013-07-17  0.452168 -0.525214  0.003892 -0.029953
    3  2013-07-18 -1.135117  0.626181 -0.895456  2.096875
    4  2013-07-19  1.365505 -0.208806  0.115254 -1.210855
    In [62]:
    In [62]: df3.A.mean().compute()
    Out[62]: 0.04365032375682896
    In [63]:
    
  4. And, now, we'll process that data chunk by chunk:

    In [63]: for idx in range(df3.npartitions):
     ...:     data = df3.get_partition(idx)
     ...:     mean = data.A.mean().compute()
     ...:     print('partition: {}  mean: {}'.format(idx, mean))
     ...:
    partition: 0  mean: 0.1307434691610682
    partition: 1  mean: -0.10723637021736673
    partition: 2  mean: 0.47059788011488657
    partition: 3  mean: -0.029706498960742605
    partition: 4  mean: 0.06754303873144374
    partition: 5  mean: 0.1604556981338858
    partition: 6  mean: -0.4161510144675041
    partition: 7  mean: 0.6799116374415602
    partition: 8  mean: 0.6303390153859068
    partition: 9  mean: 0.6517677726166038
    partition: 10  mean: -0.02111769936010994
        o
        o
        o
    In [64]:
    

    Notes:

    • Keep in mind that Dask is capable of "parallelizing" the above operation. It can process multiple partitions in parallel on a multi-core/multi-CPU machine. See the next section for help with that.

5.2.2   Dask for optimized (and parallel) computing

Dask enables you to describe a complex process in terms of an execution graph: a digraph (directed graph) whose nodes are sub-processes. The valuable thing about being able to do so is that Dask can schedule the execution of that larger process so that some sub-processes are executed in parallel. On multi-CPU/multi-core hardware, this can be a big win.

Dask supports parallel processing both on a single machine and on multiple, distributed machines. In what follows, however, I will discuss parallel computation on a single machine.

To learn more about this, you will want to read the following:

Controlling parallelism in Dask requires understanding Dask schedulers, how they are used by Dask, and how to use them.

Note that Dask has default schedulers. If you do nothing to change or set the scheduler, you will be using the default, which is most often what you want. The notes that follow will attempt to help you determine when and under what conditions you might want to use a different scheduler and how to do that.

Also, keep in mind two concepts that are both related to optimization in Dask: (1) Parallelism is what you want when you have multiple tasks and want to speed them up by running/computing them in parallel. (2) Breaking your data and your Dask data collections into chunks is what you want when your data set is very large and will not fit in memory. You should keep in mind that breaking your data into chunks may slow down processing. Here is something that shows some of those differences:

In [57]: df1 = dd.read_csv('tmp5.csv', blocksize=1000000)
In [58]: df2 = dd.read_csv('tmp5.csv', blocksize=8000)
In [59]:
In [59]: df1.npartitions
Out[59]: 1
In [60]: df2.npartitions
Out[60]: 12
In [61]: df1.get_partition(0).size.compute()
Out[61]: 5000
In [62]: df2.get_partition(0).size.compute()
Out[62]: 450
In [63]:
In [63]: time df1.A.mean().compute()
CPU times: user 15.8 ms, sys: 7.5 ms, total: 23.3 ms
Wall time: 22.3 ms
Out[63]: 0.02893067882172706
In [64]: time df2.A.mean().compute()
CPU times: user 167 ms, sys: 9.85 ms, total: 177 ms
Wall time: 164 ms
Out[64]: 0.028930678821727045
In [65]:

Notes:

  • We create df1 with a single partition (or chunk) and df2 with multiple partitions (in this case 12).
  • The size of a single partition of df1 is much larger than the first partition of df2 (5000 vs 450).
  • Computing the mean of a single column of df1 takes significantly less time than the same operation on df2.

Synchronous processing on the local machine -- The default scheduler does that.

Let's figure out how to do that in parallel. For example, we'll try to compute the mean of each of the columns of our dataframe (four columns: "A", "B", "C", and "D") in parallel.

Here are two functions. One computes the mean for each column in our DataFrame, one column after another. The other attempts to use dask.distributed to schedule these four tasks so that they make use of more than one CPU core:

def compute_means_sequential(df):
    """
    Sequentially compute the means of columns of dataframe.

    Args:
        df (dask.dataframe.DataFrame) -- A dataframe containing columns
            A, B, C, and D.

    Return:
        The means
    """
    meanA = df.A.mean().compute()
    meanB = df.B.mean().compute()
    meanC = df.C.mean().compute()
    meanD = df.D.mean().compute()
    return meanA, meanB, meanC, meanD

def compute_means_parallel(client, df):
    """
    Compute in parallel the means of columns of dataframe.

    Args:
        client (dask.distributed.Client) -- The client to schedule
            the computation.
        df (dask.dataframe.DataFrame) -- A dataframe containing columns
            A, B, C, and D.

    Return:
        The means
    """
    # Each submit() returns a future immediately; the work runs on the workers.
    meanA = client.submit(df.A.mean().compute)
    meanB = client.submit(df.B.mean().compute)
    meanC = client.submit(df.C.mean().compute)
    meanD = client.submit(df.D.mean().compute)
    # gather() blocks until all four futures finish and returns their results.
    return tuple(client.gather((meanA, meanB, meanC, meanD)))

You can find a file containing these snippets here: snippets.py.

Here is a test that uses the above on a 2-core machine:

In [17]: time snippets.compute_means_sequential(df1)
CPU times: user 167 ms, sys: 21.3 ms, total: 189 ms
Wall time: 379 ms
Out[17]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)
In [18]: time snippets.compute_means_parallel(client, df1)
CPU times: user 189 ms, sys: 16.9 ms, total: 206 ms
Wall time: 281 ms
Out[18]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)

Here is a test that uses the above on a 4-core machine:

In [15]: time snippets.compute_means_sequential(df1)
CPU times: user 160 ms, sys: 9.5 ms, total: 169 ms
Wall time: 303 ms
Out[15]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)
In [16]:
In [16]: time snippets.compute_means_parallel(client, df1)
CPU times: user 164 ms, sys: 5.03 ms, total: 169 ms
Wall time: 224 ms
Out[16]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)

Notes:

  • In both tests, parallel execution takes measurably less wall time: 281 ms vs. 379 ms on the 2-core machine, and 224 ms vs. 303 ms on the 4-core machine. On a large data structure, this difference might be significant and noticeable.
  • My original test had four calls to print() in each of the above two functions. That partially masked the time difference between calls to these functions.
  • As with any work on optimization, you will need to test with your data, your machine, your configuration, etc. YMMV (your mileage may vary).
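
One way to run this kind of comparison outside of IPython is a small timing harness. The sketch below uses only the Python standard library; the time_call helper and the stand-in workload (plain lists instead of a Dask DataFrame) are assumptions for illustration. It repeats each call several times and keeps the best wall-clock time, which reduces noise from other processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def time_call(func, *args, repeat=5):
    """Return (best wall-clock seconds, last result) over `repeat` runs."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


def means_sequential(columns):
    """Compute the mean of each column, one after another."""
    return tuple(fmean(col) for col in columns)


def means_threaded(columns):
    """Compute the mean of each column using a pool of threads."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return tuple(pool.map(fmean, columns))


# Four stand-in "columns" of 10,000 floats each.
columns = [[float(i * k) for i in range(10_000)] for k in range(1, 5)]

t_seq, r_seq = time_call(means_sequential, columns)
t_thr, r_thr = time_call(means_threaded, columns)
print(f"sequential: {t_seq * 1e3:.2f} ms, threaded: {t_thr * 1e3:.2f} ms")
```

For a small, pure-Python workload like this one, the threaded version is unlikely to be faster; the point of the sketch is the measurement pattern, not the speedup.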

5.3   Cython

See: http://cython.org/.

Cython enables us to write or produce C code while writing code in the style of Python. There's more to it than that, but you get the idea. We can write code that looks a lot like Python code, and then use Cython to turn it into C code.

Cython has another important use: wrapping C functions for use in Python. Because (1) Cython gives us easy access to libraries of compiled C code and (2) it is easy to write functions in Cython that can be called from Python, we can use it to easily "wrap" C functions. In fact, if you look inside some Python packages, for example lxml, you will see wrappers for underlying C code that were produced with Cython; lxml makes calls into the libxml2 XML library provided by http://www.xmlsoft.org.

Here is a bit more description from http://cython.org/:

"Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex). It makes writing C extensions for Python as easy as Python itself.

"Cython gives you the combined power of Python and C to let you
  • write Python code that calls back and forth from and to C or C++ code natively at any point.
  • easily tune readable Python code into plain C performance by adding static type declarations.
  • use combined source code level debugging to find bugs in your Python, Cython and C code.
  • interact efficiently with large data sets, e.g. using multi-dimensional NumPy arrays.
  • quickly build your applications within the large, mature and widely used CPython ecosystem.
  • integrate natively with existing code and data from legacy, low-level or high-performance libraries and applications."

6   Machine learning

6.1   Scikit-Learn

The scikit-learn user guide is here: http://scikit-learn.org/stable/user_guide.html.

EliteDataScience has an introduction to machine learning here: https://elitedatascience.com/learn-machine-learning.

EliteDataScience has provided a Scikit-Learn tutorial here: https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn.

6.2   tensorflow

Question: Is there support for tensorflow in Anaconda? Answer: Yes, but currently, installing it is tricky. For example, see this: https://gist.github.com/johndpope/187b0dd996d16152ace2f842d43e3990

7   Multiprocessing and parallelization

7.2   Dask and Dask schedulers

See: https://dask.pydata.org/

Also see the section on Dask elsewhere in the current document: Dask for optimized (and parallel) computing.

8   Data store -- HDF5, h5py, Pytables, asdf, etc

8.1   HDF5

8.1.1   h5py

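You can store Pandas DataFrames and Dask DataFrames in HDF5 archives. (Note that the to_hdf and read_hdf methods used in the example below are built on PyTables rather than h5py.) You can read about h5py here: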

Also see: https://dask.pydata.org/en/doc-test-build/array-overview.html#construct

Here is an example that saves and retrieves a Dask DataFrame:

In [62]: df1, df2 = snippets.read_csv_files('tmp5.csv')
In [63]: df1.to_hdf('tmp01.hdf5', '/Version1/tmp5')
Out[63]: ['tmp01.hdf5']
In [64]:
In [64]: df1a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5')
In [65]:
In [65]: df1.A.mean().compute()
Out[65]: 0.02893067882172706
In [66]: df1a.A.mean().compute()
Out[66]: 0.02893067882172706

In [68]: df2.to_hdf('tmp01.hdf5', '/Version1/tmp5_2')
Out[68]:
['tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5']
In [69]:
In [69]: df2a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2')
In [70]:
In [70]: df2.npartitions
Out[70]: 12
In [71]: df2a.npartitions
Out[71]: 1
In [72]: df2.B.sum().compute()
Out[72]: -57.04419047235241
In [73]: df2a.B.sum().compute()
Out[73]: -57.04419047235241

Notes:

  • We load a Dask DataFrame (df1), write it to an HDF5 file, then read it back into a separate variable (df1a).

  • We compute the mean of column A of both DataFrames so as to show that the one we wrote to HDF5 and the one we read back in from HDF5 contain the same data.

  • Notice that in the case of df2 and df2a, the read_hdf function did not preserve the chunk size and number of partitions. However, read_hdf has an optional chunksize parameter that enables you to read a DataFrame from HDF5 into multiple partitions with a smaller chunk size. Example:

    In [80]: df2b = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2')
    In [81]: df2b.npartitions
    Out[81]: 1
    In [82]: df2c = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2', chunksize=100)
    In [83]: df2c.npartitions
    Out[83]: 10
    

8.1.2   h5serv

There is also an HTTP server for HDF5 archives, h5serv. It presents a RESTful interface that enables you to add, list, and retrieve data objects from HDF5 archives on a remote machine. The data returned in response to a retrieval request is formatted as JSON.

You can learn more about h5serv here: http://h5serv.readthedocs.io/en/latest/.

And, you can learn about the JSON representation of HDF5 here: http://hdf5-json.readthedocs.io/en/latest/index.html.