================================================
A summary of tools for data science for Python
================================================
:author: Dave Kuhlman
:contact: dkuhlman (at) davekuhlman (dot) org
:address: http://www.davekuhlman.org
:revision: 1.0.1
:date: |date|
.. |date| date:: %B %d, %Y
:Copyright: Copyright (c) 2018 Dave Kuhlman. All Rights Reserved.
This software is subject to the provisions of the MIT License
http://www.opensource.org/licenses/mit-license.php.
:Abstract: This document attempts to give a survey of data science
tools for Python programming, along with brief
introductions to help you get started with some of those
tools.
.. sectnum::
.. contents::
Introduction and preliminaries
================================
In this document I'll try to describe and summarize some significant
tools that are available to Python programmers for data science,
numerical processing, statistics, and visualizing numerical data.
For each tool or package, I'll also try to give a brief overview of:
- What the tool does.
- What to use it for, along with a few use cases.
- How to do a few common things that the tool supports.
- When appropriate, a comparison with other similar tools.
All these packages are available in the Anaconda distribution of
Python, which makes Anaconda a very good option for data analytics
and visualization. See:
- https://docs.anaconda.com/anaconda/
- https://docs.anaconda.com/anaconda/packages/pkg-docs
It's likely that they are also available at http://pypi.python.org
and can be installed with ``pip``. If you plan on doing some
exploration (and do not want to use the Anaconda distribution), you
will want to consider using ``virtualenv``
(https://virtualenv.pypa.io/en/stable/) and, for even more
convenience in trying out various packages and configurations, look
at ``virtualenvwrapper``
(https://virtualenvwrapper.readthedocs.io/en/latest/).
More information:
- There is another summary of Python packages for data science here:
https://elitedatascience.com/r-vs-python-for-data-science.
It also covers tools for the R programming language.
Many of the examples in this document use the somewhat standard
import statements, for example::
import numpy as np
import scipy as sp
import pandas as pd
Some helpers
==============
ipython
-------------
IPython is an enhanced interactive Python shell. It has tab
completion, gives more convenient access to help for Python modules
and objects, enables you to edit and rerun previous commands, and
much more.
For more information, see: https://ipython.org.
Anaconda ships with QtConsole, which embeds IPython for even
more convenience.
IPython profiles
~~~~~~~~~~~~~~~~~~
If you use IPython, then consider creating a data science
profile. Use something like this::
$ ipython profile create datasci
Then, consider putting something like the following in
``~/.ipython/profile_datasci/startup/50-config.py``::
import sys
import numpy as np
import scipy as sp
def pdir(obj):
"""Print information about obj, including `dir(obj)`."""
if isinstance(obj, type):
print('class: {}'.format(obj.__name__))
else:
print('instance class name: {}'.format(obj.__class__.__name__))
if obj.__doc__:
print('doc string: {}'.format(obj.__doc__))
else:
print('doc string: no doc string')
print(dir(obj))
def read_file_contents(filename):
with open(filename, 'r') as infile:
content = infile.read()
return content
You can have multiple startup files. See the ``startup/README``
file in your profile directory.
Also, consider doing some customization in
``~/.ipython/profile_datasci/ipython_config.py``.
And, in order to use that profile, start IPython with this::
$ ipython --profile=datasci
You can find more help with profiles by running something like the
following::
$ ipython help profile
Or, see the following:
http://ipython.readthedocs.io/en/stable/config/intro.html#profiles
Getting (interactive) help and docs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Inside the standard Python interactive shell, you can get help on
``some_object`` with this::
>>> help(some_object)
Inside the IPython interactive shell, you can use the above, or you
can do::
In [9]: import scipy.fftpack
In [10]: scipy.fftpack?
In [11]:
In [11]: from scipy import fftpack
In [12]: fftpack?
In [13]: fftpack.fft?
You can use ``pydoc`` to get help at the command line. For example::
$ pydoc numpy.arange
You can also use ``pydoc`` to run an HTTP server, and view the
documentation in a Web browser. Do the following for help with
that::
$ pydoc --help
And, of course, documentation is available for the Scipy suite of
tools at: http://www.scipy.org.
Installing the tools
----------------------
Unless otherwise noted, each of the tools described in this document
can be installed with ``pip install ...`` (the standard Python
install tool) or, for those who are using the Anaconda Python
distribution, with ``conda install ...``.
``pip`` and ``virtualenv``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you use ``pip``, I'd recommend using ``virtualenv``, at the
least, and even ``virtualenvwrapper``, for extra convenience and
flexibility. ``virtualenv`` enables you to install Python packages
(and therefore, the tools discussed in this document) in a separate
environment, separate from your standard Python installation, and
without polluting that standard installation. Since that separate
installation is in its own directory, you can remove it by simply
deleting that directory. ``virtualenvwrapper`` extends
``virtualenv`` by enabling you to create, manage, and switch between
different ``virtualenv`` environments easily. For example, you
might want to create and switch (1) between one ``virtualenv`` for
text processing and another for data science or (2) between one
installation for Python 2 and another for Python 3. See:
- ``virtualenv`` -- https://pypi.python.org/pypi/virtualenv
- ``virtualenvwrapper`` -- https://virtualenvwrapper.readthedocs.io/en/latest/
Anaconda
~~~~~~~~~~~~~~
The Anaconda installation of Python provides most of the tools
discussed in this document in the standard Anaconda installation.
Additional tools can be installed with ``conda install ...``, and
the installation can be kept up-to-date with ``conda update --all``.
In the event that you need a Python package that is not provided by
Anaconda, you can use ``pip``.
- The Anaconda distribution of Python -- https://continuum.io/
- ``conda``, the package manager for Anaconda --
https://conda.io/docs/index.html
Other Python distributions for data science
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For more options on installing Python with a slant toward data
science and scientific programming (but much else besides), see:
https://www.scipy.org/install.html.
Analytics
===========
Numpy
--------
Help with Numpy:
- See the documentation page: http://www.numpy.org.
- A tutorial: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
- Some lecture notes: http://www.scipy-lectures.org/intro/numpy/numpy.html
There are (at least) two aspects to Numpy:
- Primitive Numpy numeric types or scalars, for example:
``np.int32``, ``np.int64``, ``np.float32``, ``np.float64``, etc.
See the following for information on these primitive types and
others:
https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html.
- Array objects (instances of ``np.ndarray``) along with ways to
deal with them.
- Operations on Numpy arrays -- For information on these, see the
Numpy reference manual:
https://docs.scipy.org/doc/numpy/reference/index.html. Here is a
quick summary:
- Array creation routines -- Create arrays of different kinds,
e.g. all ones, all zeros, identity, from an existing array, as a
copy of an array, etc.
- Array manipulation routines -- Routines that reshape an array,
transpose an array, change the number of dimensions, join arrays
(concatenate, stack, etc.), split arrays, tile arrays (create by
repeating an array), etc.
- Binary operations -- Logical binary operations on arrays,
packing arrays into bits, bit-shifting operations, etc.
- String operations
- C-Types Foreign Function Interface (numpy.ctypeslib)
- Datetime Support Functions
- Data type routines
- Optionally Scipy-accelerated routines (numpy.dual) --
Routines possibly accelerated by Scipy, but available in Numpy
if Scipy is not installed. For example, routines for
eigenvalues, Fourier transforms, solving linear equations, etc.
Use::
>>> from numpy import dual
- Mathematical functions with automatic domain (numpy.emath)
- Floating point error handling
- Discrete Fourier Transform (numpy.fft) -- Use::
>>> from numpy import fft
Or, just::
>>> np.fft.fft( ... ) # etc.
- Financial functions -- Loan, payment, and interest calculations.
- Functional programming -- Routines and classes that assist with
doing functional programming. For example, ``np.vectorize``
creates a "vectorized" function; ``np.frompyfunc`` creates a
Numpy ``ufunc``. (Note that vectorized functions and universal
functions can be applied to arrays. For help with the
difference between vectorized and universal functions, see:
https://stackoverflow.com/questions/6768245/difference-between-frompyfunc-and-vectorize-in-numpy.)
Also, remember to look at ``functools`` and ``itertools`` in the
standard Python library: https://docs.python.org/3/library/functional.html
And, if you need parallelism across multiple CPUs and cores,
look at ``ipyparallel``: https://ipyparallel.readthedocs.io/en/latest/
- Numpy-specific help functions -- Functions for getting
information about objects and help with Numpy. (Also, if you
are using IPython, the "?" operator gives help with a function
or object, for example, ``enumerate?`` gives help on the
``enumerate`` function.)
- Indexing routines
- Input and output -- Routines for saving and loading arrays.
(But, you may also want to explore HDF5 and ``h5py`` or
``pytables``. Both ``h5py`` and ``pytables`` are in the
Anaconda Python distribution.) Also, routines for formatting
arrays as strings, converting arrays to and from strings, etc.
- Linear algebra (numpy.linalg) -- Routines for the following:
- Matrix and vector products
- Decompositions
- Matrix eigenvalues
- Norms and other numbers
- Solving equations and inverting matrices
- Exceptions
- Linear algebra on several matrices at once
- Logic functions -- Functions for performing various tests on
elements of Numpy arrays.
- Masked array operations -- Support for creating and using masked
arrays. A masked array is an array with a mask that marks some
elements of the array as invalid. You can find some help with
masked arrays in this document:
http://www.scipy-lectures.org/intro/numpy/numpy.html.
- Mathematical functions -- Functions for:
- Trigonometric functions
- Hyperbolic functions
- Rounding
- Sums, products, differences
- Exponents and logarithms
- Other special functions
- Floating point routines
- Arithmetic operations
- Handling complex numbers
- etc
- Matrix library (numpy.matlib) -- Functions for creating and
using matrices, as opposed to ``numpy.ndarray``. Use ``from
numpy import matlib``. See this for a bit of help on the
differences between arrays and matrices in Numpy:
https://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u
- Miscellaneous routines
- Padding Arrays
- Polynomials
- Random sampling (numpy.random)
- Set routines
- Sorting, searching, and counting
- Statistics
- Test Support (numpy.testing)
- Window functions
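To make a few of the categories above more concrete, here is a small sketch that touches array creation, array manipulation, linear algebra, masked arrays, and ``np.vectorize``. The variable names and the sample data are my own illustrations, not something taken from the Numpy documentation.

```python
import numpy as np

# Array creation routines.
a = np.arange(6)            # array([0, 1, 2, 3, 4, 5])
z = np.zeros((2, 3))        # 2x3 array of zeros
eye = np.identity(3)        # 3x3 identity matrix

# Array manipulation: reshape, transpose, join.
m = a.reshape(2, 3)         # 2 rows, 3 columns
mt = m.T                    # 3 rows, 2 columns
stacked = np.concatenate([m, z], axis=0)   # shape (4, 3)

# Linear algebra: solve the system coeffs @ x = rhs.
coeffs = np.array([[3.0, 1.0], [1.0, 2.0]])
rhs = np.array([9.0, 8.0])
solution = np.linalg.solve(coeffs, rhs)    # x = 2, y = 3

# Masked arrays: mark the -1 entries as invalid, then average the rest.
readings = np.ma.masked_equal(np.array([1.0, -1.0, 3.0, -1.0, 5.0]), -1.0)
mean_valid = readings.mean()               # mean of 1, 3, and 5

# Functional programming: apply a plain Python function element-wise.
def clip_negative(value):
    """Return value if it is positive, otherwise 0 (illustrative helper)."""
    return value if value > 0 else 0

vclip = np.vectorize(clip_negative)
clipped = vclip(np.array([-2, 3, -1, 4]))
```

Note that ``np.vectorize`` is a convenience, not a speed-up: it still calls the Python function once per element.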
Scipy
-------
Note that Scipy, Numpy, Pandas, Matplotlib,
IPython, and Sympy are all under the Scipy umbrella.
For information about any of these, see: https://www.scipy.org/.
What is Scipy? (1) It is many things to many people. But more
seriously, (2) it is a large collection of functions for performing
operations on arrays of numerical data. Think of it this way: Numpy
(and Pandas) give you ways to structure and manipulate
multi-dimensional arrays of numbers; Scipy gives you many functions
that perform operations on those multi-dimensional arrays of
numbers.
What kinds of operations? Here are some categories with
descriptions:
- Basic functions
- Special functions (scipy.special)
Integration (scipy.integrate)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For help with this set of functions, do the following::
>>> from scipy import integrate
>>> help(integrate)
Or, in IPython, do ``integrate?``
Here is the list you will see:
- Integrating functions, given function object
- quad -- General purpose integration
- dblquad -- General purpose double integration
- tplquad -- General purpose triple integration
- nquad -- General purpose n-dimensional integration
- fixed_quad -- Integrate func(x) using Gaussian quadrature of order n
- quadrature -- Integrate with given tolerance using Gaussian quadrature
- romberg -- Integrate func using Romberg integration
- quad_explain -- Print information for use of quad
- newton_cotes -- Weights and error coefficient for Newton-Cotes integration
- IntegrationWarning -- Warning on issues during integration
- Integrating functions, given fixed samples
- trapz -- Use trapezoidal rule to compute integral.
- cumtrapz -- Use trapezoidal rule to cumulatively compute integral.
- simps -- Use Simpson's rule to compute integral from samples.
- romb -- Use Romberg Integration to compute integral from (2**k + 1) evenly-spaced samples.
- Solving initial value problems for ODE systems
The solvers are implemented as individual classes which can be used directly
(low-level usage) or through a convenience function.
- solve_ivp -- Convenient function for ODE integration.
- RK23 -- Explicit Runge-Kutta solver of order 3(2).
- RK45 -- Explicit Runge-Kutta solver of order 5(4).
- Radau -- Implicit Runge-Kutta solver of order 5.
- BDF -- Implicit multi-step variable order (1 to 5) solver.
- LSODA -- LSODA solver from ODEPACK Fortran package.
- OdeSolver -- Base class for ODE solvers.
- DenseOutput -- Local interpolant for computing a dense output.
- OdeSolution -- Class which represents a continuous ODE solution.
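As a quick illustration of the two main groups above, the following sketch integrates a function object with ``quad`` and solves a simple initial value problem with ``solve_ivp``. The integrand, the decay equation, and the tolerances are assumptions chosen only to keep the example small.

```python
import numpy as np
from scipy import integrate

# Integrate a function object: the integral of x**2 from 0 to 3 is 9.
value, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 3.0)

# Solve the initial value problem dy/dt = -2 * y, y(0) = 1, on [0, 1].
# The exact solution is y(t) = exp(-2 * t).
def decay(t, y):
    """Right-hand side of the ODE (illustrative)."""
    return -2.0 * y

result = integrate.solve_ivp(decay, (0.0, 1.0), [1.0], method='RK45',
                             rtol=1e-8, atol=1e-10)
y_end = result.y[0, -1]     # numerical value of y at t = 1
```

``quad`` returns both the estimate and an estimate of the absolute error; ``solve_ivp`` returns an object whose ``t`` and ``y`` attributes hold the time points and the solution values.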
Optimization (scipy.optimize)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Remember that for each of the following functions (or any function), you can get
help in the usual ways: ``help(some_func)`` or (in IPython)
``some_func?``.
- Local Optimization:
- minimize -- Unified interface for minimizers of multivariate functions
- minimize_scalar -- Unified interface for minimizers of univariate functions
- OptimizeResult -- The optimization result returned by some optimizers
- OptimizeWarning -- The optimization encountered problems
- General-purpose multivariate methods:
- fmin -- Nelder-Mead Simplex algorithm
- fmin_powell -- Powell's (modified) level set method
- fmin_cg -- Non-linear (Polak-Ribiere) conjugate gradient algorithm
- fmin_bfgs -- Quasi-Newton method (Broyden-Fletcher-Goldfarb-Shanno)
- fmin_ncg -- Line-search Newton Conjugate Gradient
- Constrained multivariate methods:
- fmin_l_bfgs_b -- Zhu, Byrd, and Nocedal's constrained optimizer
- fmin_tnc -- Truncated Newton code
- fmin_cobyla -- Constrained optimization by linear approximation
- fmin_slsqp -- Minimization using sequential least-squares programming
- differential_evolution -- stochastic minimization using differential evolution
- Univariate (scalar) minimization methods:
- fminbound -- Bounded minimization of a scalar function
- brent -- 1-D function minimization using Brent method
- golden -- 1-D function minimization using Golden Section method
- Equation (Local) Minimizers:
- leastsq -- Minimize the sum of squares of M equations in N unknowns
- least_squares -- Feature-rich least-squares minimization.
- nnls -- Linear least-squares problem with non-negativity constraint
- lsq_linear -- Linear least-squares problem with bound constraints
- Global Optimization:
- basinhopping -- Basinhopping stochastic optimizer
- brute -- Brute force searching optimizer
- differential_evolution -- stochastic minimization using differential evolution
- Rosenbrock function:
- rosen -- The Rosenbrock function.
- rosen_der -- The derivative of the Rosenbrock function.
- rosen_hess -- The Hessian matrix of the Rosenbrock function.
- rosen_hess_prod -- Product of the Rosenbrock Hessian with a vector.
- Fitting:
- curve_fit -- Fit curve to a set of points
- Root finding -- Scalar functions:
- brentq -- quadratic interpolation Brent method
- brenth -- Brent method, modified by Harris with hyperbolic extrapolation
- ridder -- Ridder's method
- bisect -- Bisection method
- newton -- Secant method or Newton's method
- Fixed point finding:
- fixed_point -- Single-variable fixed-point solver
- General nonlinear solvers:
- root -- Unified interface for nonlinear solvers of multivariate functions
- fsolve -- Non-linear multi-variable equation solver
- broyden1 -- Broyden's first method
- broyden2 -- Broyden's second method
- Large-scale nonlinear solvers:
- newton_krylov
- anderson
- Simple iterations:
- excitingmixing
- linearmixing
- diagbroyden
Additional information on the nonlinear solvers can be obtained from
the help on ``scipy.optimize.nonlin``.
- Linear Programming -- General linear programming solver:
- linprog -- Unified interface for minimizers of linear programming problems
- The simplex method supports callback functions, such as:
- linprog_verbose_callback -- Sample callback function for linprog (simplex)
- Assignment problems:
- linear_sum_assignment -- Solves the linear-sum assignment problem
- Utilities:
- approx_fprime -- Approximate the gradient of a scalar function
- bracket -- Bracket a minimum, given two starting points
- check_grad -- Check the supplied derivative using finite differences
- line_search -- Return a step that satisfies the strong Wolfe conditions
- show_options -- Show specific options of optimization solvers
- LbfgsInvHessProduct -- Linear operator for L-BFGS approximate inverse Hessian
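Here is a minimal sketch that uses one function from three of the groups above: ``minimize`` (local optimization), ``brentq`` (scalar root finding), and ``curve_fit`` (fitting). The starting point, the bracket, and the straight-line model are assumptions made for illustration.

```python
import numpy as np
from scipy import optimize

# Local minimization of the Rosenbrock function; its minimum is at (1, 1).
start = np.array([1.3, 0.7])
res = optimize.minimize(optimize.rosen, start, method='BFGS',
                        jac=optimize.rosen_der)

# Scalar root finding: solve cos(x) = x on the bracket [0, 1].
root = optimize.brentq(lambda x: np.cos(x) - x, 0.0, 1.0)

# Curve fitting: recover slope and intercept from noise-free samples.
def line(x, slope, intercept):
    """Straight-line model (illustrative)."""
    return slope * x + intercept

xdata = np.linspace(0.0, 10.0, 20)
ydata = line(xdata, 2.5, -1.0)
params, covariance = optimize.curve_fit(line, xdata, ydata)
```

The result of ``minimize`` is an ``OptimizeResult``; look at ``res.x`` for the location of the minimum and ``res.success`` for whether the solver reported convergence.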
Interpolation (scipy.interpolate)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sub-package for objects used in interpolation.
As listed below, this sub-package contains spline functions and classes,
one-dimensional and multi-dimensional (univariate and multivariate)
interpolation classes, Lagrange and Taylor polynomial interpolators, and
wrappers for FITPACK
and DFITPACK functions.
- Univariate interpolation
- interp1d
- BarycentricInterpolator
- KroghInterpolator
- PchipInterpolator
- barycentric_interpolate
- krogh_interpolate
- pchip_interpolate
- Akima1DInterpolator
- CubicSpline
- PPoly
- BPoly
- Multivariate interpolation
- Unstructured data:
- griddata
- LinearNDInterpolator
- NearestNDInterpolator
- CloughTocher2DInterpolator
- Rbf
- interp2d
- For data on a grid:
- interpn
- RegularGridInterpolator
- RectBivariateSpline
See also: `scipy.ndimage.map_coordinates`
- Tensor product polynomials:
- NdPPoly
- 1-D Splines
- BSpline
- make_interp_spline
- make_lsq_spline
- Functional interface to FITPACK routines:
- splrep
- splprep
- splev
- splint
- sproot
- spalde
- splder
- splantider
- insert
- Object-oriented FITPACK interface:
- UnivariateSpline
- InterpolatedUnivariateSpline
- LSQUnivariateSpline
- 2-D Splines
- For data on a grid:
- RectBivariateSpline
- RectSphereBivariateSpline
- For unstructured data:
- BivariateSpline
- SmoothBivariateSpline
- SmoothSphereBivariateSpline
- LSQBivariateSpline
- LSQSphereBivariateSpline
- Low-level interface to FITPACK functions:
- bisplrep
- bisplev
- Additional tools
- lagrange
- approximate_taylor_polynomial
- pade
See also:
- `scipy.ndimage.map_coordinates`,
- `scipy.ndimage.spline_filter`,
- `scipy.signal.resample`,
- `scipy.signal.bspline`,
- `scipy.signal.gauss_spline`,
- `scipy.signal.qspline1d`,
- `scipy.signal.cspline1d`,
- `scipy.signal.qspline1d_eval`,
- `scipy.signal.cspline1d_eval`,
- `scipy.signal.qspline2d`,
- `scipy.signal.cspline2d`.
- Functions existing for backward compatibility (should not be used in
new code):
- ``spleval``
- ``spline``
- ``splmake``
- ``spltopp``
- ``pchip``
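For a feel of how the univariate interpolators are used, here is a small sketch with ``CubicSpline``. The sample function (a sine wave) and the evaluation points are assumptions chosen for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Sample sin(x) at a handful of points.
x_known = np.linspace(0.0, 2.0 * np.pi, 10)
y_known = np.sin(x_known)

# Build a piecewise-cubic interpolant through the samples.
spline = CubicSpline(x_known, y_known)

# Evaluate the interpolant (and its first derivative) between the samples.
x_new = np.array([0.5, 1.0, 2.0])
y_new = spline(x_new)
dy_new = spline(x_new, 1)
```

Most of the other interpolator classes in the list follow the same pattern: construct the object from known data, then call it on new points.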
Fourier Transforms (``scipy.fftpack``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There is help and a number of examples here:
https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html.
Here is an example, copied from the documentation in the above
link::
import numpy as np
from scipy.fftpack import fft
def test():
# Number of sample points
N = 600
# sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N * T, N)
y = np.sin(50.0 * 2.0 * np.pi * x) + 0.5 * np.sin(80.0 * 2.0 * np.pi * x)
yf = fft(y)
from scipy.signal.windows import blackman
w = blackman(N)
ywf = fft(y * w)
xf = np.linspace(0.0, 1.0 / (2.0 * T), N // 2)
import matplotlib.pyplot as plt
plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(yf[1:N // 2]), '-b')
plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(ywf[1:N // 2]), '-r')
plt.legend(['FFT', 'FFT w. window'])
plt.grid()
plt.show()
test()
Here is a summary of the Discrete Fourier transforms support in
``scipy.fftpack``:
- Fast Fourier Transforms (FFTs)
- ``fft`` - Fast (discrete) Fourier Transform (FFT)
- ``ifft`` - Inverse FFT
- ``fft2`` - Two dimensional FFT
- ``ifft2`` - Two dimensional inverse FFT
- ``fftn`` - n-dimensional FFT
- ``ifftn`` - n-dimensional inverse FFT
- ``rfft`` - FFT of strictly real-valued sequence
- ``irfft`` - Inverse of rfft
- ``dct`` - Discrete cosine transform
- ``idct`` - Inverse discrete cosine transform
- ``dctn`` - n-dimensional Discrete cosine transform
- ``idctn`` - n-dimensional Inverse discrete cosine transform
- ``dst`` - Discrete sine transform
- ``idst`` - Inverse discrete sine transform
- ``dstn`` - n-dimensional Discrete sine transform
- ``idstn`` - n-dimensional Inverse discrete sine transform
- Differential and pseudo-differential operators
- ``diff`` - Differentiation and integration of periodic sequences
- ``tilbert`` - Tilbert transform: cs_diff(x,h,h)
- ``itilbert`` - Inverse Tilbert transform: sc_diff(x,h,h)
- ``hilbert`` - Hilbert transform: cs_diff(x,inf,inf)
- ``ihilbert`` - Inverse Hilbert transform: sc_diff(x,inf,inf)
- ``cs_diff`` - cosh/sinh pseudo-derivative of periodic sequences
- ``sc_diff`` - sinh/cosh pseudo-derivative of periodic sequences
- ``ss_diff`` - sinh/sinh pseudo-derivative of periodic sequences
- ``cc_diff`` - cosh/cosh pseudo-derivative of periodic sequences
- ``shift`` - Shift periodic sequences
- Helper functions
- ``fftshift`` - Shift the zero-frequency component to the center of the spectrum
- ``ifftshift`` - The inverse of `fftshift`
- ``fftfreq`` - Return the Discrete Fourier Transform sample frequencies
- ``rfftfreq`` - DFT sample frequencies (for usage with rfft, irfft)
- ``next_fast_len`` - Find the optimal length to zero-pad an FFT for speed
- Convolutions (``scipy.fftpack.convolve``)
- ``convolve``
- ``convolve_z``
- ``init_convolution_kernel``
- ``destroy_convolve_cache``
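The helper function ``fftfreq`` is handy for labeling the output of ``fft`` with frequencies. The following sketch, which needs no plotting, finds the dominant frequency of a synthetic signal. The sampling rate and the two test frequencies are assumptions chosen so that each frequency falls exactly on an FFT bin.

```python
import numpy as np
from scipy.fftpack import fft, fftfreq

n_samples = 800          # number of sample points
spacing = 1.0 / 800.0    # sample spacing in seconds (800 Hz sampling rate)
t = np.arange(n_samples) * spacing
signal = np.sin(2.0 * np.pi * 50.0 * t) + 0.5 * np.sin(2.0 * np.pi * 120.0 * t)

spectrum = fft(signal)
freqs = fftfreq(n_samples, d=spacing)

# Look only at the positive-frequency half of the spectrum.
half = n_samples // 2
magnitudes = 2.0 / n_samples * np.abs(spectrum[:half])
peak_freq = freqs[:half][np.argmax(magnitudes)]
```

Because the signal covers exactly one second, the frequency bins are 1 Hz apart and the strongest component shows up in the 50 Hz bin.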
Signal Processing (scipy.signal)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use this module with either of the following::
>>> import scipy.signal
>>> from scipy import signal
Here is a summary:
- Convolution
- convolve -- N-dimensional convolution.
- correlate -- N-dimensional correlation.
- fftconvolve -- N-dimensional convolution using the FFT.
- convolve2d -- 2-dimensional convolution (more options).
- correlate2d -- 2-dimensional correlation (more options).
- sepfir2d -- Convolve with a 2-D separable FIR filter.
- choose_conv_method -- Chooses faster of FFT and direct convolution methods.
- B-splines
- bspline -- B-spline basis function of order n.
- cubic -- B-spline basis function of order 3.
- quadratic -- B-spline basis function of order 2.
- gauss_spline -- Gaussian approximation to the B-spline basis function.
- cspline1d -- Coefficients for 1-D cubic (3rd order) B-spline.
- qspline1d -- Coefficients for 1-D quadratic (2nd order) B-spline.
- cspline2d -- Coefficients for 2-D cubic (3rd order) B-spline.
- qspline2d -- Coefficients for 2-D quadratic (2nd order) B-spline.
- cspline1d_eval -- Evaluate a cubic spline at the given points.
- qspline1d_eval -- Evaluate a quadratic spline at the given points.
- spline_filter -- Smoothing spline (cubic) filtering of a rank-2 array.
- Filtering
- order_filter -- N-dimensional order filter.
- medfilt -- N-dimensional median filter.
- medfilt2d -- 2-dimensional median filter (faster).
- wiener -- N-dimensional wiener filter.
- symiirorder1 -- 2nd-order IIR filter (cascade of first-order systems).
- symiirorder2 -- 4th-order IIR filter (cascade of second-order systems).
- lfilter -- 1-dimensional FIR and IIR digital linear filtering.
- lfiltic -- Construct initial conditions for `lfilter`.
- lfilter_zi -- Compute an initial state zi for the lfilter
function that corresponds to the steady state of the step
response.
- filtfilt -- A forward-backward filter.
- savgol_filter -- Filter a signal using the Savitzky-Golay filter.
- deconvolve -- 1-d deconvolution using lfilter.
- sosfilt -- 1-dimensional IIR digital linear filtering
using a second-order sections filter representation.
- sosfilt_zi -- Compute an initial state zi for the sosfilt
function that corresponds to the steady state of the step
response.
- sosfiltfilt -- A forward-backward filter for second-order sections.
- hilbert -- Compute 1-D analytic signal, using the Hilbert transform.
- hilbert2 -- Compute 2-D analytic signal, using the Hilbert transform.
- decimate -- Downsample a signal.
- detrend -- Remove linear and/or constant trends from data.
- resample -- Resample using Fourier method.
- resample_poly -- Resample using polyphase filtering method.
- upfirdn -- Upsample, apply FIR filter, downsample.
- Filter design
- bilinear -- Digital filter from an analog filter using the
bilinear transform.
- findfreqs -- Find array of frequencies for computing filter response.
- firls -- FIR filter design using least-squares error minimization.
- firwin -- Windowed FIR filter design, with frequency
response defined as pass and stop bands.
- firwin2 -- Windowed FIR filter design, with arbitrary
frequency response.
- freqs -- Analog filter frequency response from TF coefficients.
- freqs_zpk -- Analog filter frequency response from ZPK coefficients.
- freqz -- Digital filter frequency response from TF coefficients.
- freqz_zpk -- Digital filter frequency response from ZPK coefficients.
- sosfreqz -- Digital filter frequency response for SOS format filter.
- group_delay -- Digital filter group delay.
- iirdesign -- IIR filter design given bands and gains.
- iirfilter -- IIR filter design given order and critical frequencies.
- kaiser_atten -- Compute the attenuation of a Kaiser FIR filter,
given the number of taps and the transition width at
discontinuities in the frequency response.
- kaiser_beta -- Compute the Kaiser parameter beta, given the
desired FIR filter attenuation.
- kaiserord -- Design a Kaiser window to limit ripple and
width of transition region.
- minimum_phase -- Convert a linear phase FIR filter to minimum phase.
- savgol_coeffs -- Compute the FIR filter coefficients for a
Savitzky-Golay filter.
- remez -- Optimal FIR filter design.
- unique_roots -- Unique roots and their multiplicities.
- residue -- Partial fraction expansion of b(s) / a(s).
- residuez -- Partial fraction expansion of b(z) / a(z).
- invres -- Inverse partial fraction expansion for analog filter.
- invresz -- Inverse partial fraction expansion for digital filter.
- BadCoefficients -- Warning on badly conditioned filter coefficients
- Lower-level filter design functions:
- abcd_normalize -- Check state-space matrices and ensure they are rank-2.
- band_stop_obj -- Band Stop Objective Function for order minimization.
- besselap -- Return (z,p,k) for analog prototype of Bessel filter.
- buttap -- Return (z,p,k) for analog prototype of Butterworth filter.
- cheb1ap -- Return (z,p,k) for type I Chebyshev filter.
- cheb2ap -- Return (z,p,k) for type II Chebyshev filter.
- cmplx_sort -- Sort roots based on magnitude.
- ellipap -- Return (z,p,k) for analog prototype of elliptic filter.
- lp2bp -- Transform a lowpass filter prototype to a bandpass filter.
- lp2bs -- Transform a lowpass filter prototype to a bandstop filter.
- lp2hp -- Transform a lowpass filter prototype to a highpass filter.
- lp2lp -- Transform a lowpass filter prototype to a lowpass filter.
- normalize -- Normalize polynomial representation of a transfer function.
- Matlab-style IIR filter design
- butter -- Butterworth
- buttord
- cheby1 -- Chebyshev Type I
- cheb1ord
- cheby2 -- Chebyshev Type II
- cheb2ord
- ellip -- Elliptic (Cauer)
- ellipord
- bessel -- Bessel (no order selection available -- try buttord)
- iirnotch -- Design second-order IIR notch digital filter.
- iirpeak -- Design second-order IIR peak (resonant) digital filter.
- Continuous-Time Linear Systems
- lti -- Continuous-time linear time invariant system base class.
- StateSpace -- Linear time invariant system in state space form.
- TransferFunction -- Linear time invariant system in transfer function form.
- ZerosPolesGain -- Linear time invariant system in zeros, poles, gain form.
- lsim -- continuous-time simulation of output to linear system.
- lsim2 -- like lsim, but `scipy.integrate.odeint` is used.
- impulse -- impulse response of linear, time-invariant (LTI) system.
- impulse2 -- like impulse, but `scipy.integrate.odeint` is used.
- step -- step response of continuous-time LTI system.
- step2 -- like step, but `scipy.integrate.odeint` is used.
- freqresp -- frequency response of a continuous-time LTI system.
- bode -- Bode magnitude and phase data (continuous-time LTI).
- Discrete-Time Linear Systems
- dlti -- Discrete-time linear time invariant system base class.
- StateSpace -- Linear time invariant system in state space form.
- TransferFunction -- Linear time invariant system in transfer function form.
- ZerosPolesGain -- Linear time invariant system in zeros, poles, gain form.
- dlsim -- simulation of output to a discrete-time linear system.
- dimpulse -- impulse response of a discrete-time LTI system.
- dstep -- step response of a discrete-time LTI system.
- dfreqresp -- frequency response of a discrete-time LTI system.
- dbode -- Bode magnitude and phase data (discrete-time LTI).
- LTI Representations
- tf2zpk -- transfer function to zero-pole-gain.
- tf2sos -- transfer function to second-order sections.
- tf2ss -- transfer function to state-space.
- zpk2tf -- zero-pole-gain to transfer function.
- zpk2sos -- zero-pole-gain to second-order sections.
- zpk2ss -- zero-pole-gain to state-space.
- ss2tf -- state-space to transfer function.
- ss2zpk -- state-space to pole-zero-gain.
- sos2zpk -- second-order sections to zero-pole-gain.
- sos2tf -- second-order sections to transfer function.
- cont2discrete -- continuous-time to discrete-time LTI conversion.
- place_poles -- pole placement.
- Waveforms
- chirp -- Frequency swept cosine signal, with several freq functions.
- gausspulse -- Gaussian modulated sinusoid
- max_len_seq -- Maximum length sequence
- sawtooth -- Periodic sawtooth
- square -- Square wave
- sweep_poly -- Frequency swept cosine signal; freq is arbitrary polynomial
- unit_impulse -- Discrete unit impulse
- Window functions
- get_window -- Return a window of a given length and type.
- barthann -- Bartlett-Hann window
- bartlett -- Bartlett window
- blackman -- Blackman window
- blackmanharris -- Minimum 4-term Blackman-Harris window
- bohman -- Bohman window
- boxcar -- Boxcar window
- chebwin -- Dolph-Chebyshev window
- cosine -- Cosine window
- exponential -- Exponential window
- flattop -- Flat top window
- gaussian -- Gaussian window
- general_gaussian -- Generalized Gaussian window
- hamming -- Hamming window
- hann -- Hann window
- hanning -- Hann window
- kaiser -- Kaiser window
- nuttall -- Nuttall's minimum 4-term Blackman-Harris window
- parzen -- Parzen window
- slepian -- Slepian window
- triang -- Triangular window
- tukey -- Tukey window
- Wavelets
- cascade -- compute scaling function and wavelet from coefficients
- daub -- return low-pass
- morlet -- Complex Morlet wavelet.
- qmf -- return quadrature mirror filter from low-pass
- ricker -- return ricker wavelet
- cwt -- perform continuous wavelet transform
- Peak finding
- find_peaks_cwt -- Attempt to find the peaks in the given 1-D array
- argrelmin -- Calculate the relative minima of data
- argrelmax -- Calculate the relative maxima of data
- argrelextrema -- Calculate the relative extrema of data
- Spectral Analysis
- periodogram -- Compute a (modified) periodogram
- welch -- Compute a periodogram using Welch's method
- csd -- Compute the cross spectral density, using Welch's method
- coherence -- Compute the magnitude squared coherence, using Welch's method
- spectrogram -- Compute the spectrogram
- lombscargle -- Computes the Lomb-Scargle periodogram
- vectorstrength -- Computes the vector strength
- stft -- Compute the Short Time Fourier Transform
- istft -- Compute the Inverse Short Time Fourier Transform
- check_COLA -- Check the COLA constraint for iSTFT reconstruction
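Here is a small sketch that exercises a few of the functions listed above. It is only an illustration: the test signal (a 50 Hz sine sampled at 1000 Hz) and the transfer function coefficients are made-up values, not taken from the Scipy documentation.

```python
import numpy as np
from scipy import signal

# Spectral analysis: a 50 Hz sine sampled at 1000 Hz for one second.
# (Both numbers are arbitrary choices for this sketch.)
fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 50.0 * t)

# periodogram applies the named window before estimating the spectrum.
freqs, power = signal.periodogram(x, fs=fs, window='hann')
peak_freq = freqs[np.argmax(power)]

# LTI representations: convert a transfer function to zero-pole-gain
# form and back again.
b = [1.0, -3.0, 2.0]   # numerator coefficients
a = [1.0, 5.0, 6.0]    # denominator coefficients
z, p, k = signal.tf2zpk(b, a)
b2, a2 = signal.zpk2tf(z, p, k)
```

The round trip through ``tf2zpk`` and ``zpk2tf`` should reproduce the original coefficients up to floating-point error.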
Linear Algebra (scipy.linalg)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use this module with either of the following::
>>> import scipy.linalg
>>> from scipy import linalg
Here is a summary:
- Basics
- inv -- Find the inverse of a square matrix
- solve -- Solve a linear system of equations
- solve_banded -- Solve a banded linear system
- solveh_banded -- Solve a Hermitian or symmetric banded system
- solve_circulant -- Solve a circulant system
- solve_triangular -- Solve a triangular system
- solve_toeplitz -- Solve a Toeplitz system
- det -- Find the determinant of a square matrix
- norm -- Matrix and vector norm
- lstsq -- Solve a linear least-squares problem
- pinv -- Pseudo-inverse (Moore-Penrose) using lstsq
- pinv2 -- Pseudo-inverse using svd
- pinvh -- Pseudo-inverse of hermitian matrix
- kron -- Kronecker product of two arrays
- tril -- Construct a lower-triangular matrix from a given matrix
- triu -- Construct an upper-triangular matrix from a given matrix
- orthogonal_procrustes -- Solve an orthogonal Procrustes problem
- matrix_balance -- Balance matrix entries with a similarity transformation
- subspace_angles -- Compute the subspace angles between two matrices
- LinAlgError -- Generic Python-exception-derived object raised by linalg functions.
- Eigenvalue Problems
- eig -- Find the eigenvalues and eigenvectors of a square matrix
- eigvals -- Find just the eigenvalues of a square matrix
- eigh -- Find the e-vals and e-vectors of a Hermitian or symmetric matrix
- eigvalsh -- Find just the eigenvalues of a Hermitian or symmetric matrix
- eig_banded -- Find the eigenvalues and eigenvectors of a banded matrix
- eigvals_banded -- Find just the eigenvalues of a banded matrix
- eigh_tridiagonal -- Find the eigenvalues and eigenvectors of a tridiagonal matrix
- eigvalsh_tridiagonal -- Find just the eigenvalues of a tridiagonal matrix
- Decompositions
- lu -- LU decomposition of a matrix
- lu_factor -- LU decomposition returning unordered matrix and pivots
- lu_solve -- Solve Ax=b using back substitution with output of lu_factor
- svd -- Singular value decomposition of a matrix
- svdvals -- Singular values of a matrix
- diagsvd -- Construct matrix of singular values from output of svd
- orth -- Construct orthonormal basis for the range of A using svd
- cholesky -- Cholesky decomposition of a matrix
- cholesky_banded -- Cholesky decomp. of a sym. or Hermitian banded matrix
- cho_factor -- Cholesky decomposition for use in solving a linear system
- cho_solve -- Solve previously factored linear system
- cho_solve_banded -- Solve previously factored banded linear system
- polar -- Compute the polar decomposition.
- qr -- QR decomposition of a matrix
- qr_multiply -- QR decomposition and multiplication by Q
- qr_update -- Rank k QR update
- qr_delete -- QR downdate on row or column deletion
- qr_insert -- QR update on row or column insertion
- rq -- RQ decomposition of a matrix
- qz -- QZ decomposition of a pair of matrices
- ordqz -- QZ decomposition of a pair of matrices with reordering
- schur -- Schur decomposition of a matrix
- rsf2csf -- Real to complex Schur form
- hessenberg -- Hessenberg form of a matrix
- See also: scipy.linalg.interpolative -- Interpolative matrix decompositions
- Matrix Functions
- expm -- Matrix exponential
- logm -- Matrix logarithm
- cosm -- Matrix cosine
- sinm -- Matrix sine
- tanm -- Matrix tangent
- coshm -- Matrix hyperbolic cosine
- sinhm -- Matrix hyperbolic sine
- tanhm -- Matrix hyperbolic tangent
- signm -- Matrix sign
- sqrtm -- Matrix square root
- funm -- Evaluating an arbitrary matrix function
- expm_frechet -- Frechet derivative of the matrix exponential
- expm_cond -- Relative condition number of expm in the Frobenius norm
- fractional_matrix_power -- Fractional matrix power
- Matrix Equation Solvers
- solve_sylvester -- Solve the Sylvester matrix equation
- solve_continuous_are -- Solve the continuous-time algebraic Riccati equation
- solve_discrete_are -- Solve the discrete-time algebraic Riccati equation
- solve_continuous_lyapunov -- Solve the continuous-time Lyapunov equation
- solve_discrete_lyapunov -- Solve the discrete-time Lyapunov equation
- Sketches and Random Projections
- clarkson_woodruff_transform -- Applies the Clarkson-Woodruff transform (a.k.a. CountSketch)
- Special Matrices
- block_diag -- Construct a block diagonal matrix from submatrices
- circulant -- Circulant matrix
- companion -- Companion matrix
- dft -- Discrete Fourier transform matrix
- hadamard -- Hadamard matrix of order 2**n
- hankel -- Hankel matrix
- helmert -- Helmert matrix
- hilbert -- Hilbert matrix
- invhilbert -- Inverse Hilbert matrix
- leslie -- Leslie matrix
- pascal -- Pascal matrix
- invpascal -- Inverse Pascal matrix
- toeplitz -- Toeplitz matrix
- tri -- Construct a matrix filled with ones at and below a given diagonal
- Low-level routines
- get_blas_funcs
- get_lapack_funcs
- find_best_blas_type
- See also:
- scipy.linalg.blas -- Low-level BLAS functions
- scipy.linalg.lapack -- Low-level LAPACK functions
- scipy.linalg.cython_blas -- Low-level BLAS functions for Cython
- scipy.linalg.cython_lapack -- Low-level LAPACK functions for Cython
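As a quick illustration of a few of the routines summarized above, here is a minimal sketch. The 2x2 matrix and right-hand side are arbitrary example values chosen only so that the matrix is symmetric and well conditioned.

```python
import numpy as np
from scipy import linalg

# A small symmetric, positive-definite system (made-up values).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Solve A x = b directly.
x = linalg.solve(A, b)

# Factor once with LU, then reuse the factorization to solve.
lu, piv = linalg.lu_factor(A)
x_lu = linalg.lu_solve((lu, piv), b)

# Eigen-decomposition for a symmetric matrix.
eigenvalues, eigenvectors = linalg.eigh(A)

# Determinant and inverse, for comparison.
det_A = linalg.det(A)
A_inv = linalg.inv(A)
```

Factoring with ``lu_factor`` and then calling ``lu_solve`` is mainly useful when the same matrix must be solved against several right-hand sides.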
Sparse Eigenvalue Problems with ARPACK
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are examples in the Scipy documentation, here:
https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html
And, here is a summary copied from that document:
"ARPACK is a Fortran package which provides routines for quickly
finding a few eigenvalues/eigenvectors of large sparse matrices.
In order to find these solutions, it requires only
left-multiplication by the matrix in question. This operation is
performed through a reverse-communication interface. The result
of this structure is that ARPACK is able to find eigenvalues and
eigenvectors of any linear function mapping a vector to a
vector.
"All of the functionality provided in ARPACK is contained within
the two high-level interfaces scipy.sparse.linalg.eigs and
scipy.sparse.linalg.eigsh. eigs provides interfaces to find the
eigenvalues/vectors of real or complex nonsymmetric square
matrices, while eigsh provides interfaces for real-symmetric or
complex-hermitian matrices."
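The following is a minimal sketch of calling ``eigsh`` on a sparse symmetric matrix. The matrix here is an assumption made for illustration: a 100x100 diagonal matrix whose eigenvalues are simply 1 through 100, so that the result is easy to check.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# A sparse diagonal matrix whose eigenvalues are 1.0, 2.0, ..., 100.0.
n = 100
A = diags([np.arange(1.0, n + 1.0)], [0], format='csr')

# Ask ARPACK for the 3 algebraically largest eigenvalues
# and their eigenvectors.
vals, vecs = eigsh(A, k=3, which='LA')
```

Only ``k`` eigenpairs are computed, which is the point of using ARPACK for large sparse problems.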
Compressed Sparse Graph Routines (scipy.sparse.csgraph)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There is an example that implements a search for the shortest path
between two words of equal length in a word ladder (i.e. changing
just one letter in each step) in the Scipy documentation. You can
find it here:
https://docs.scipy.org/doc/scipy/reference/tutorial/csgraph.html.
You can get documentation with the following::
$ pydoc scipy.sparse.csgraph
And, in IPython, do something like this::
In [41]: from scipy.sparse import csgraph
In [42]: csgraph.connected_components?
Here is a summary of the contents:
- connected_components -- determine connected components of a graph.
- laplacian -- compute the laplacian of a graph.
- shortest_path -- compute the shortest path between points on a positive graph.
- dijkstra -- use Dijkstra's algorithm for shortest path.
- floyd_warshall -- use the Floyd-Warshall algorithm for shortest path.
- bellman_ford -- use the Bellman-Ford algorithm for shortest path.
- johnson -- use Johnson's algorithm for shortest path.
- breadth_first_order -- compute a breadth-first order of nodes.
- depth_first_order -- compute a depth-first order of nodes.
- breadth_first_tree -- construct the breadth-first tree from a given node.
- depth_first_tree -- construct a depth-first tree from a given node.
- minimum_spanning_tree -- construct the minimum spanning tree of a graph.
- reverse_cuthill_mckee -- compute permutation for reverse Cuthill-McKee ordering.
- maximum_bipartite_matching -- compute permutation to make diagonal zero free.
- structural_rank -- compute the structural rank of a graph.
- construct_dist_matrix -- Construct distance matrix from a predecessor matrix.
- csgraph_from_dense -- Construct a CSR-format sparse graph from a dense matrix.
- csgraph_from_masked -- Construct a CSR-format graph from a masked array.
- csgraph_masked_from_dense -- Construct a CSR-format sparse graph from a dense matrix.
- csgraph_to_dense -- Convert a sparse graph representation to a dense representation.
- csgraph_to_masked -- Convert a sparse graph representation to a masked array representation.
- reconstruct_path -- Construct a tree from a graph and a predecessor list.
- NegativeCycleError -- Exception raised when a graph contains a negative cycle.
Note that there are other sparse graph libraries for Python. One is
Another Python Graph Library: https://pythonhosted.org/apgl/index.html.
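Here is a small sketch of two of the ``csgraph`` routines listed above. The four-node graph is a made-up example; a zero entry in the dense matrix means "no edge".

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, dijkstra

# Weighted, directed graph (hypothetical example):
#   0 -> 1 (weight 1), 1 -> 2 (weight 2), 0 -> 2 (weight 5)
# Node 3 has no edges at all.
graph = csr_matrix(np.array([
    [0, 1, 5, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]))

# Shortest distances from node 0 to every node.
dist = dijkstra(graph, directed=True, indices=0)

# Connected components, ignoring edge direction.
n_components, labels = connected_components(graph, directed=False)
```

Unreachable nodes get a distance of ``inf``, and ``labels`` assigns a component number to each node.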
Spatial data structures and algorithms (scipy.spatial)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Provides spatial algorithms and data structures.
Here is an example, copied from the documentation::
    import numpy as np
    from scipy.spatial import Delaunay
    import matplotlib.pyplot as plt

    def test():
        points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]])
        tri = Delaunay(points)
        #
        # We can visualize it:
        plt.triplot(points[:, 0], points[:, 1], tri.simplices.copy())
        plt.plot(points[:, 0], points[:, 1], 'o')
        #
        # And add some further decorations:
        for j, p in enumerate(points):
            # label the points
            plt.text(p[0] - 0.03, p[1] + 0.03, j, ha='right')
        for j, s in enumerate(tri.simplices):
            p = points[s].mean(axis=0)
            # label triangles
            plt.text(p[0], p[1], '#%d' % j, ha='center')
        plt.xlim(-0.5, 1.5)
        plt.ylim(-0.5, 1.5)
        plt.show()
        #
        # The structure of the triangulation is encoded in the following way: the
        # simplices attribute contains the indices of the points in the
        # points array
        # that make up the triangle.  For instance:
        i = 1
        tri.simplices[i, :]
        points[tri.simplices[i, :]]
        return tri, points
Here is a summary of the contents of ``scipy.spatial`` (obtained by
doing ``$ pydoc scipy.spatial``):
- Nearest-neighbor Queries:
- KDTree -- class for efficient nearest-neighbor queries
- cKDTree -- class for efficient nearest-neighbor queries (faster impl.)
- distance -- module containing many different distance measures
- Rectangle -- Hyperrectangle class. Represents a Cartesian product of intervals.
- Delaunay Triangulation, Convex Hulls, and Voronoi Diagrams:
- Delaunay -- compute Delaunay triangulation of input points
- ConvexHull -- compute a convex hull for input points
- Voronoi -- compute a Voronoi diagram from input points
- SphericalVoronoi -- compute a Voronoi diagram from input points on the surface of a sphere
- HalfspaceIntersection -- compute the intersection points of input halfspaces
- Plotting Helpers:
- delaunay_plot_2d -- plot 2-D triangulation
- convex_hull_plot_2d -- plot 2-D convex hull
- voronoi_plot_2d -- plot 2-D voronoi diagram
- Simplex representation:
The simplices (triangles, tetrahedra, ...) appearing in the Delaunay
tessellation (N-dim simplices), convex hull facets, and Voronoi ridges
(N-1 dim simplices) are represented in the following scheme::
tess = Delaunay(points)
hull = ConvexHull(points)
voro = Voronoi(points)
# coordinates of the j-th vertex of the i-th simplex
tess.points[tess.simplices[i, j], :] # tessellation element
hull.points[hull.simplices[i, j], :] # convex hull facet
voro.vertices[voro.ridge_vertices[i, j], :] # ridge between Voronoi cells
For Delaunay triangulations and convex hulls, the neighborhood
structure of the simplices satisfies the condition:
``tess.neighbors[i,j]`` is the neighboring simplex of the i-th
simplex, opposite to the j-vertex. It is -1 in case of no
neighbor.
Convex hull facets also define a hyperplane equation::
(hull.equations[i,:-1] * coord).sum() + hull.equations[i,-1] == 0
Similar hyperplane equations for the Delaunay triangulation correspond
to the convex hull facets on the corresponding N+1 dimensional
paraboloid.
The Delaunay triangulation objects offer a method for locating the
simplex containing a given point, and barycentric coordinate
computations.
- Functions:
- tsearch
- distance_matrix
- minkowski_distance
- minkowski_distance_p
- procrustes
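To complement the summary above, here is a minimal sketch of a nearest-neighbor query and a convex hull. The five points (the corners of a unit square plus its center) are an assumption chosen so the answers are easy to verify by eye.

```python
import numpy as np
from scipy.spatial import ConvexHull, KDTree, distance_matrix

# Corners of the unit square plus one interior point (made-up data).
points = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])

# Nearest-neighbor query: which point is closest to (0.9, 0.1)?
tree = KDTree(points)
nearest_dist, nearest_idx = tree.query([0.9, 0.1])

# Convex hull: the interior point should not be one of its vertices.
hull = ConvexHull(points)
hull_vertices = sorted(hull.vertices.tolist())

# All pairwise Euclidean distances between the points.
pairwise = distance_matrix(points, points)
```

``tree.query`` returns both the distance and the index of the nearest point, so the point itself is ``points[nearest_idx]``.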
Statistics (scipy.stats)
~~~~~~~~~~~~~~~~~~~~~~~~~~
This module contains a large number of probability distributions as well as a
growing library of statistical functions.
Each univariate distribution is an instance of a subclass of ``rv_continuous``
(``rv_discrete`` for discrete distributions):
- rv_continuous
- rv_discrete
- rv_histogram
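Because every distribution is an instance of one of these classes, they all share the same small set of methods (``pdf`` or ``pmf``, ``cdf``, ``ppf``, ``rvs``, and so on). The following sketch shows a few of them; the parameter values are arbitrary choices for illustration.

```python
import math
from scipy import stats

# Each distribution object is an instance of a subclass of
# rv_continuous or rv_discrete.
is_continuous = isinstance(stats.norm, stats.rv_continuous)
is_discrete = isinstance(stats.poisson, stats.rv_discrete)

# Standard normal: density, cumulative probability, and quantile.
density_at_zero = stats.norm.pdf(0.0)
prob_below = stats.norm.cdf(1.96)
quantile = stats.norm.ppf(0.975)

# "Freeze" a distribution with fixed parameters, then draw samples.
frozen = stats.norm(loc=5.0, scale=2.0)
sample = frozen.rvs(size=1000, random_state=0)

# Discrete distributions use pmf rather than pdf.
pmf_two = stats.poisson.pmf(2, mu=3.0)
```

Freezing a distribution is convenient when the same ``loc`` and ``scale`` are used for several calls.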
Here is a summary of the items in ``scipy.stats``:
- Continuous distributions
- alpha -- Alpha
- anglit -- Anglit
- arcsine -- Arcsine
- argus -- Argus
- beta -- Beta
- betaprime -- Beta Prime
- bradford -- Bradford
- burr -- Burr (Type III)
- burr12 -- Burr (Type XII)
- cauchy -- Cauchy
- chi -- Chi
- chi2 -- Chi-squared
- cosine -- Cosine
- crystalball -- Crystalball
- dgamma -- Double Gamma
- dweibull -- Double Weibull
- erlang -- Erlang
- expon -- Exponential
- exponnorm -- Exponentially Modified Normal
- exponweib -- Exponentiated Weibull
- exponpow -- Exponential Power
- f -- F (Snedecor F)
- fatiguelife -- Fatigue Life (Birnbaum-Saunders)
- fisk -- Fisk
- foldcauchy -- Folded Cauchy
- foldnorm -- Folded Normal
- frechet_r -- Deprecated. Alias for weibull_min
- frechet_l -- Deprecated. Alias for weibull_max
- genlogistic -- Generalized Logistic
- gennorm -- Generalized normal
- genpareto -- Generalized Pareto
- genexpon -- Generalized Exponential
- genextreme -- Generalized Extreme Value
- gausshyper -- Gauss Hypergeometric
- gamma -- Gamma
- gengamma -- Generalized gamma
- genhalflogistic -- Generalized Half Logistic
- gilbrat -- Gilbrat
- gompertz -- Gompertz (Truncated Gumbel)
- gumbel_r -- Right Sided Gumbel, Log-Weibull, Fisher-Tippett, Extreme Value Type I
- gumbel_l -- Left Sided Gumbel, etc.
- halfcauchy -- Half Cauchy
- halflogistic -- Half Logistic
- halfnorm -- Half Normal
- halfgennorm -- Generalized Half Normal
- hypsecant -- Hyperbolic Secant
- invgamma -- Inverse Gamma
- invgauss -- Inverse Gaussian
- invweibull -- Inverse Weibull
- johnsonsb -- Johnson SB
- johnsonsu -- Johnson SU
- kappa4 -- Kappa 4 parameter
- kappa3 -- Kappa 3 parameter
- ksone -- Kolmogorov-Smirnov one-sided (no stats)
- kstwobign -- Kolmogorov-Smirnov two-sided test for Large N (no stats)
- laplace -- Laplace
- levy -- Levy
- levy_l
- levy_stable
- logistic -- Logistic
- loggamma -- Log-Gamma
- loglaplace -- Log-Laplace (Log Double Exponential)
- lognorm -- Log-Normal
- lomax -- Lomax (Pareto of the second kind)
- maxwell -- Maxwell
- mielke -- Mielke's Beta-Kappa
- nakagami -- Nakagami
- ncx2 -- Non-central chi-squared
- ncf -- Non-central F
- nct -- Non-central Student's T
- norm -- Normal (Gaussian)
- pareto -- Pareto
- pearson3 -- Pearson type III
- powerlaw -- Power-function
- powerlognorm -- Power log normal
- powernorm -- Power normal
- rdist -- R-distribution
- reciprocal -- Reciprocal
- rayleigh -- Rayleigh
- rice -- Rice
- recipinvgauss -- Reciprocal Inverse Gaussian
- semicircular -- Semicircular
- skewnorm -- Skew normal
- t -- Student's T
- trapz -- Trapezoidal
- triang -- Triangular
- truncexpon -- Truncated Exponential
- truncnorm -- Truncated Normal
- tukeylambda -- Tukey-Lambda
- uniform -- Uniform
- vonmises -- Von-Mises (Circular)
- vonmises_line -- Von-Mises (Line)
- wald -- Wald
- weibull_min -- Minimum Weibull (see Frechet)
- weibull_max -- Maximum Weibull (see Frechet)
- wrapcauchy -- Wrapped Cauchy
- Multivariate distributions
- multivariate_normal -- Multivariate normal distribution
- matrix_normal -- Matrix normal distribution
- dirichlet -- Dirichlet
- wishart -- Wishart
- invwishart -- Inverse Wishart
- multinomial -- Multinomial distribution
- special_ortho_group -- SO(N) group
- ortho_group -- O(N) group
- unitary_group -- U(N) group
- random_correlation -- random correlation matrices
- Discrete distributions
- bernoulli -- Bernoulli
- binom -- Binomial
- boltzmann -- Boltzmann (Truncated Discrete Exponential)
- dlaplace -- Discrete Laplacian
- geom -- Geometric
- hypergeom -- Hypergeometric
- logser -- Logarithmic (Log-Series, Series)
- nbinom -- Negative Binomial
- planck -- Planck (Discrete Exponential)
- poisson -- Poisson
- randint -- Discrete Uniform
- skellam -- Skellam
- zipf -- Zipf
- Statistical functions -- Several of these functions have a similar
version in ``scipy.stats.mstats`` that works for masked arrays.
- describe -- Descriptive statistics
- gmean -- Geometric mean
- hmean -- Harmonic mean
- kurtosis -- Fisher or Pearson kurtosis
- kurtosistest -- Test whether a dataset has normal kurtosis.
- mode -- Modal value
- moment -- Central moment
- normaltest --
- skew -- Skewness
- skewtest --
- kstat --
- kstatvar --
- tmean -- Truncated arithmetic mean
- tvar -- Truncated variance
- tmin --
- tmax --
- tstd --
- tsem --
- variation -- Coefficient of variation
- find_repeats
- trim_mean
- cumfreq
- itemfreq
- percentileofscore
- scoreatpercentile
- relfreq
- binned_statistic -- Compute a binned statistic for a set of data.
- binned_statistic_2d -- Compute a 2-D binned statistic for a set of data.
- binned_statistic_dd -- Compute a d-D binned statistic for a set of data.
- obrientransform
- bayes_mvs
- mvsdist
- sem
- zmap
- zscore
- iqr
- sigmaclip
- trimboth
- trim1
- f_oneway
- pearsonr
- spearmanr
- pointbiserialr
- kendalltau
- weightedtau
- linregress
- theilslopes
- ttest_1samp
- ttest_ind
- ttest_ind_from_stats
- ttest_rel
- kstest
- chisquare
- power_divergence
- ks_2samp
- mannwhitneyu
- tiecorrect
- rankdata
- ranksums
- wilcoxon
- kruskal
- friedmanchisquare
- combine_pvalues
- jarque_bera
- ansari
- bartlett
- levene
- shapiro
- anderson
- anderson_ksamp
- binom_test
- fligner
- median_test
- mood
- boxcox
- boxcox_normmax
- boxcox_llf
- entropy
- wasserstein_distance
- energy_distance
- Circular statistical functions
- circmean
- circvar
- circstd
- Contingency table functions
- chi2_contingency
- contingency.expected_freq
- contingency.margins
- fisher_exact
- Plot-tests
- ppcc_max
- ppcc_plot
- probplot
- boxcox_normplot
- Masked statistics functions -- Module ``scipy.stats.mstats``
contains statistical functions for masked arrays.
For more information in IPython, do::
In [1]: from scipy.stats import mstats
In [2]: mstats?
Or, from the command line do ``$ pydoc scipy.stats.mstats``.
- Univariate and multivariate kernel density estimation (``scipy.stats.kde``)
- gaussian_kde -- Representation of a kernel-density estimate using Gaussian
kernels.
Kernel density estimation is a way to estimate the probability density
function (PDF) of a random variable in a non-parametric way.
``gaussian_kde`` works for both uni-variate and multi-variate data. It
includes automatic bandwidth determination. The estimation works best for
a unimodal distribution; bimodal or multi-modal distributions tend to be
oversmoothed.
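Here is a minimal sketch of kernel density estimation with ``gaussian_kde``. The data is an assumption: 500 values drawn from a standard normal distribution with a fixed seed, so the estimated density should peak somewhere near zero.

```python
import numpy as np
from scipy import stats

# Made-up sample: 500 draws from a standard normal distribution.
rng = np.random.RandomState(42)
data = rng.normal(loc=0.0, scale=1.0, size=500)

# Build the estimator; the bandwidth is chosen automatically.
kde = stats.gaussian_kde(data)

# Evaluate the estimated density on a grid of points.
grid = np.linspace(-4.0, 4.0, 81)
density = kde(grid)

# The estimated density should integrate to (about) 1.
total = kde.integrate_box_1d(-np.inf, np.inf)
```

The ``density`` array can be passed straight to a plotting function, for example ``plt.plot(grid, density)``.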
For many more statistics-related functions, install the R software and the
interface package ``rpy``.
Multidimensional image processing (``scipy.ndimage``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The module ``scipy.ndimage`` contains various functions for
multi-dimensional image processing.
For information on these functions, do (for example, in IPython)::
In [6]: from scipy import ndimage
In [7]: ndimage?
In [8]: ndimage.convolve?
Or, from the command line, do: ``$ pydoc scipy.ndimage.convolve``.
Here is an example -- It computes the multi-dimensional convolution
of a Numpy ``ndarray``::
    import numpy as np
    from scipy import ndimage

    def test():
        a = np.array([[1, 2, 0, 0],
                      [5, 3, 0, 4],
                      [0, 0, 0, 7],
                      [9, 3, 0, 0]])
        k = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0]])
        result = ndimage.convolve(a, k, mode='constant', cval=0.0)
        return result
Here is a summary of the contents of ``scipy.ndimage``:
- Filters
- convolve -- Multi-dimensional convolution
- convolve1d -- 1-D convolution along the given axis
- correlate -- Multi-dimensional correlation
- correlate1d -- 1-D correlation along the given axis
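- gaussian_filter -- Multi-dimensional Gaussian filter
- gaussian_filter1d -- 1-D Gaussian filter
- gaussian_gradient_magnitude -- Multi-dimensional gradient magnitude using Gaussian derivatives
- gaussian_laplace -- Multi-dimensional Laplace filter using Gaussian second derivatives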
- generic_filter -- Multi-dimensional filter using a given function
- generic_filter1d -- 1-D generic filter along the given axis
- generic_gradient_magnitude
- generic_laplace
- laplace -- n-D Laplace filter based on approximate second derivatives
- maximum_filter
- maximum_filter1d
- median_filter -- Calculates a multi-dimensional median filter
- minimum_filter
- minimum_filter1d
- percentile_filter -- Calculates a multi-dimensional percentile filter
- prewitt
- rank_filter -- Calculates a multi-dimensional rank filter
- sobel
- uniform_filter -- Multi-dimensional uniform filter
- uniform_filter1d -- 1-D uniform filter along the given axis
- Fourier filters
- fourier_ellipsoid
- fourier_gaussian
- fourier_shift
- fourier_uniform
- Interpolation
- affine_transform -- Apply an affine transformation
- geometric_transform -- Apply an arbitrary geometric transform
- map_coordinates -- Map input array to new coordinates by interpolation
- rotate -- Rotate an array
- shift -- Shift an array
- spline_filter
- spline_filter1d
- zoom -- Zoom an array
- Measurements
- center_of_mass -- The center of mass of the values of an array at labels
- extrema -- Minima and maxima of an array at labels, with their positions
- find_objects -- Find objects in a labeled array
- histogram -- Histogram of the values of an array, optionally at labels
- label -- Label features in an array
- labeled_comprehension
- maximum
- maximum_position
- mean -- Mean of the values of an array at labels
- median
- minimum
- minimum_position
- standard_deviation -- Standard deviation of an n-D image array
- sum -- Sum of the values of the array
- variance -- Variance of the values of an n-D image array
- watershed_ift
- Morphology
- binary_closing
- binary_dilation
- binary_erosion
- binary_fill_holes
- binary_hit_or_miss
- binary_opening
- binary_propagation
- black_tophat
- distance_transform_bf
- distance_transform_cdt
- distance_transform_edt
- generate_binary_structure
- grey_closing
- grey_dilation
- grey_erosion
- grey_opening
- iterate_structure
- morphological_gradient
- morphological_laplace
- white_tophat
- Utility
- imread -- Load an image from a file
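As an illustration of the measurement and filter functions summarized above, here is a small sketch. The tiny binary "image" is made up; it contains two separate blobs of ones.

```python
import numpy as np
from scipy import ndimage

# A tiny binary image with two separate blobs (made-up data).
image = np.array([[0, 1, 1, 0, 0],
                  [0, 1, 1, 0, 0],
                  [0, 0, 0, 0, 1],
                  [0, 0, 0, 0, 1]])

# Label connected features; each blob gets its own integer label.
labeled, num_features = ndimage.label(image)

# Center of mass of each labeled blob, as (row, column) pairs.
centers = ndimage.center_of_mass(image, labeled, [1, 2])

# Smooth the image with a Gaussian filter.
smoothed = ndimage.gaussian_filter(image.astype(float), sigma=1.0)
```

By default ``label`` treats only horizontally and vertically adjacent pixels as connected; pass a ``structure`` argument to change that.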
File IO (scipy.io)
~~~~~~~~~~~~~~~~~~~~
Scipy provides routines to read/write a number of special file
formats. Here are some of them:
- MATLAB® files:
- loadmat -- Read a MATLAB style mat file (version 4 through 7.1)
- savemat -- Write a MATLAB style mat file (version 4 through 7.1)
- whosmat -- List contents of a MATLAB style mat file (version 4 through 7.1)
- IDL® files:
- readsav -- Read an IDL 'save' file
- Matrix Market files:
- mminfo -- Query matrix info from Matrix Market formatted file
- mmread -- Read matrix from Matrix Market formatted file
- mmwrite -- Write matrix to Matrix Market formatted file
- Unformatted Fortran files:
- FortranFile -- A file object for unformatted sequential Fortran files
- Netcdf:
- netcdf_file -- A file object for NetCDF data
- netcdf_variable -- A data object for the netcdf module
- Harwell-Boeing files:
- hb_read -- read H-B file
- hb_write -- write H-B file
- Wav sound files (``scipy.io.wavfile``):
- read -- Return the sample rate (in samples/sec) and data from a WAV file.
- write -- Write a numpy array as a WAV file.
- WavFileWarning -- Base class for warnings generated by user code.
- Arff files (``scipy.io.arff``):
- loadarff -- Read an arff file.
- MetaData -- Small container to keep useful information on an ARFF dataset.
- ArffError -- Base class for I/O related errors.
- ParseArffError -- Base class for I/O related errors.
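Here is a minimal round-trip sketch for two of these formats. The file names (``example.mat`` and ``tone.wav``), the array contents, and the 440 Hz tone are all assumptions made for this example; the files are written to a temporary directory.

```python
import os
import tempfile

import numpy as np
from scipy import io
from scipy.io import wavfile

tmpdir = tempfile.mkdtemp()

# MATLAB file: save a dictionary of arrays, then load it back.
mat_path = os.path.join(tmpdir, 'example.mat')
original = np.arange(6).reshape(2, 3)
io.savemat(mat_path, {'a': original})
loaded = io.loadmat(mat_path)

# WAV file: write one second of a 440 Hz tone as 16-bit samples,
# then read it back.
wav_path = os.path.join(tmpdir, 'tone.wav')
rate = 8000
t = np.arange(rate) / rate
tone = (0.5 * np.sin(2 * np.pi * 440.0 * t) * 32767).astype(np.int16)
wavfile.write(wav_path, rate, tone)
rate_read, tone_read = wavfile.read(wav_path)
```

``loadmat`` returns a dictionary keyed by variable name, along with a few header entries whose names start with double underscores.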
Pandas
--------
Pandas vs. Numpy -- Pandas raises Numpy data structures to a higher
level. In particular, see the ``DataFrame`` object.
For documentation on Pandas, see:
http://pandas.pydata.org/pandas-docs/stable/. There are tutorials,
get-started guides, cookbook docs, and more.
The "10 Minutes to pandas" guide in that documentation
seems especially helpful, although it does contain a lot more than
10 minutes' worth of material. It gives basic instructions on how to
use Pandas data types.
And, be sure to look at the various Pandas tutorials in the same
documentation.
There are also cookbooks full of code snippets:
- http://pandas.pydata.org/pandas-docs/stable/cookbook.html
- http://pandas.pydata.org/pandas-docs/stable/tutorials.html#pandas-cookbook
Perhaps it's advisable to view Pandas as being just as much about
techniques for (1) cleaning up your data, (2) exploring and finding
significant aspects of your data, and (3) viewing and displaying
your data, as it is about performing calculations and analysis on
it. Pandas provides such a rich set of techniques for
working with your data that you should expect to spend a reasonable
amount of time learning to do the tasks you need, rather than just
quickly learning some small set of functions.
Create Pandas data structures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is an example that creates several of the Pandas data
structures that are used in the "10 Minutes to pandas" document
referenced above::
    import numpy as np
    import pandas as pd

    def make_sample_dataframe():
        """Make sample dates and DataFrame.  Returns (dates, df)."""
        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
        return dates, df
And, here is an example of the use of the above function::
In [117]: import utils01
In [118]: dates, df = utils01.make_sample_dataframe()
In [119]:
In [119]: dates
Out[119]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [120]:
In [120]: df
Out[120]:
A B C D
2013-01-01 0.521515 1.006002 -1.408913 -0.218981
2013-01-02 -0.517541 -0.190499 0.397701 0.895858
2013-01-03 0.068253 0.499286 -1.098401 -1.323183
2013-01-04 -0.086779 0.025269 0.459892 0.588754
2013-01-05 1.384825 -1.141312 0.097294 0.169665
2013-01-06 -0.391738 -0.072600 0.196514 0.799174
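Besides a ``DataFrame`` built from a Numpy array, two other common starting points are a ``Series`` and a ``DataFrame`` built from a dictionary of columns. Here is a small sketch; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array; NaN marks missing data.
s = pd.Series([1.0, 3.0, np.nan, 6.0], name='values')

# A DataFrame from a dict: each key becomes a column.  A scalar
# value (here the Timestamp) is repeated for every row.
df2 = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'count': [1, 2, 3],
    'when': pd.Timestamp('2013-01-02'),
})

n_missing = s.isna().sum()
column_types = df2.dtypes
```

Unlike a Numpy array, each column of a ``DataFrame`` can have its own data type, as ``column_types`` shows.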
View Pandas data structures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
View the first and last rows of a ``DataFrame``::
In [34]: df.head(n=2)
Out[34]:
A B C D
2013-01-01 -0.557541 1.016474 0.933149 -0.524661
2013-01-02 1.682318 -1.605635 -0.324727 2.057636
In [35]:
In [35]: df.tail(n=3)
Out[35]:
A B C D
2013-01-04 0.696414 0.538999 1.131596 -0.960681
2013-01-05 -0.175765 -0.494210 1.111779 -0.670209
2013-01-06 -1.615098 0.018027 0.584815 -1.508152
Get the shape, column (labels), and actual data from a ``DataFrame``::
In [38]: df.shape
Out[38]: (6, 4)
In [39]: df.columns
Out[39]: Index(['A', 'B', 'C', 'D'], dtype='object')
In [40]: df.values
Out[40]:
array([[-0.55754086, 1.01647419, 0.93314867, -0.52466119],
[ 1.68231758, -1.60563477, -0.32472655, 2.05763649],
[-0.45481149, -0.09087637, -1.1383327 , -0.7950994 ],
[ 0.69641379, 0.53899898, 1.13159619, -0.96068123],
[-0.17576451, -0.49421043, 1.11177912, -0.67020918],
[-1.61509837, 0.01802738, 0.58481469, -1.50815216]])
In [41]: type(df.values)
Out[41]: numpy.ndarray
Note that ``df.values`` returns an ``ndarray``.
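A few more ways to look at a ``DataFrame`` are sketched below. The sample data is generated with a fixed seed (an assumption made here so the example is repeatable); ``to_numpy()`` is an alternative to ``df.values``.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(6, 4), index=dates, columns=list('ABCD'))

# Summary statistics (count, mean, std, min, quartiles, max) per column.
summary = df.describe()

# The row labels, and the underlying data as a Numpy array.
index = df.index
array = df.to_numpy()

# Rows sorted by the values in column B.
by_b = df.sort_values(by='B')
```

``describe()`` returns another ``DataFrame``, so its rows can be accessed by label, for example ``summary.loc['mean']``.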
Access the contents of a ``DataFrame``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Access a row or range of rows -- Use ``.iloc`` with a single index
or a slice. Examples::
In [72]: df.iloc[1]
Out[72]:
A 0.721339
B 0.733763
C -1.153457
D -1.345582
Name: 2013-01-02 00:00:00, dtype: float64
In [73]: df.iloc[1:2]
Out[73]:
A B C D
2013-01-02 0.721339 0.733763 -1.153457 -1.345582
In [74]: df.iloc[1:4]
Out[74]:
A B C D
2013-01-02 0.721339 0.733763 -1.153457 -1.345582
2013-01-03 2.047318 0.406103 -1.893892 0.065913
2013-01-04 0.737643 -1.539155 0.410927 0.038682
Access a row or range of rows -- Use ``.loc`` with index
labels. Examples::
In [64]: df.loc[dates[1]]
Out[64]:
A 0.721339
B 0.733763
C -1.153457
D -1.345582
Name: 2013-01-02 00:00:00, dtype: float64
In [65]: df.loc[dates[1]:dates[2]]
Out[65]:
A B C D
2013-01-02 0.721339 0.733763 -1.153457 -1.345582
2013-01-03 2.047318 0.406103 -1.893892 0.065913
In [66]: df.loc[dates[1]:dates[1]]
Out[66]:
A B C D
2013-01-02 0.721339 0.733763 -1.153457 -1.345582
In [67]: df.loc['2013-01-01']
Out[67]:
A 1.373992
B -0.080698
C -0.018425
D -0.424205
Name: 2013-01-01 00:00:00, dtype: float64
In [68]: df.loc['2013-01-01':'2013-01-03']
Out[68]:
A B C D
2013-01-01 1.373992 -0.080698 -0.018425 -0.424205
2013-01-02 0.721339 0.733763 -1.153457 -1.345582
2013-01-03 2.047318 0.406103 -1.893892 0.065913
Notes:
- ``dates`` was used to create the index for ``df``::
      import numpy as np
      import pandas as pd

      def make_sample_dataframe1():
          """Make sample dates and DataFrame.  Returns (dates, df)."""
          dates = pd.date_range('20130101', periods=6)
          df = pd.DataFrame(
              np.random.randn(6, 4),
              index=dates,
              columns=list('ABCD'))
          return dates, df
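One difference between the two accessors is worth a sketch: a slice with ``.iloc`` excludes its end position, as Python slices do, while a slice with ``.loc`` includes its end label. The seeded sample data below is an assumption made for repeatability.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(6, 4), index=dates, columns=list('ABCD'))

# Position-based slice: rows at positions 1 and 2 (end is excluded).
by_position = df.iloc[1:3]

# Label-based slice: rows labeled dates[1] through dates[3]
# (end label is included).
by_label = df.loc[dates[1]:dates[3]]
```

So ``by_position`` has two rows and ``by_label`` has three, even though both slices are written with the numbers 1 and 3.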
Access the rows where the content of an item (column) in that row
satisfies a condition or test::
In [10]: df.loc[df.B > 0].head()
Out[10]:
Unnamed: 0 A B C D
2 2013-01-03 0.986316 1.870495 -1.598345 -2.551315
5 2013-01-06 1.385534 1.328005 1.741578 -0.409209
7 2013-01-08 -0.820344 0.318531 0.278434 -0.898119
9 2013-01-10 -2.342766 0.048417 -0.352930 -0.134832
20 2013-01-21 -0.567319 1.784550 -0.114723 0.315661
Or::
In [9]: df.loc[df.B.apply(lambda x: x > 0)].head()
Out[9]:
Unnamed: 0 A B C D
2 2013-01-03 0.986316 1.870495 -1.598345 -2.551315
5 2013-01-06 1.385534 1.328005 1.741578 -0.409209
7 2013-01-08 -0.820344 0.318531 0.278434 -0.898119
9 2013-01-10 -2.342766 0.048417 -0.352930 -0.134832
20 2013-01-21 -0.567319 1.784550 -0.114723 0.315661
Notes:
- The use of ``.apply()`` along with ``lambda`` (or a named Python
function) enables us to select rows based on an arbitrarily
complex condition.
- Also, consider using ``functools.partial()``. The following
selects rows where the value in column B is in the range -0.1 to
0.1::
In [25]: import functools
In [26]: f = functools.partial(lambda x, y, z: z > x and z < y, -0.1, 0.1)
In [27]:
In [27]: df.loc[df.B.apply(f)].head()
Out[27]:
Unnamed: 0 A B C D
9 2013-01-10 -2.342766 0.048417 -0.352930 -0.134832
27 2013-01-28 -0.673330 0.075427 -0.477715 -0.475463
33 2013-02-03 -0.776301 0.015220 0.518606 -0.286090
38 2013-02-08 0.894722 0.005027 -0.763636 -0.150279
44 2013-02-14 -0.403519 -0.059570 0.929560 -1.065283
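For simple range tests like the one above, a vectorized boolean mask is a common alternative to ``.apply()``. Here is a sketch that compares the approaches on seeded random data (the data and the 100-row size are assumptions for this example).

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
df = pd.DataFrame(rng.randn(100, 4), columns=list('ABCD'))

# Combine two comparisons with & (the parentheses are required).
mask = (df.B > -0.1) & (df.B < 0.1)
selected = df.loc[mask]

# The same selection using .apply() with a lambda.
via_apply = df.loc[df.B.apply(lambda x: -0.1 < x < 0.1)]

# Series.between() is another option; note that it includes
# both end points by default.
via_between = df.loc[df.B.between(-0.1, 0.1)]
```

The mask form avoids calling a Python function once per row, which usually matters on large data sets; ``.apply()`` remains useful when the condition cannot be expressed with vectorized operations.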
Access a column or several columns -- Use the Python indexing
operator (``[]``), with a column label or iterable of column labels.
Or, for a single column, use dot notation. Examples::
In [98]: df['B']
Out[98]:
2013-01-01 -0.080698
2013-01-02 0.733763
2013-01-03 0.406103
2013-01-04 -1.539155
2013-01-05 -0.963585
2013-01-06 0.934215
Freq: D, Name: B, dtype: float64
In [99]: df[['B', 'D']]
Out[99]:
B D
2013-01-01 -0.080698 -0.424205
2013-01-02 0.733763 -1.345582
2013-01-03 0.406103 0.065913
2013-01-04 -1.539155 0.038682
2013-01-05 -0.963585 -0.449162
2013-01-06 0.934215 1.473294
In [100]:
In [100]: df.C
Out[100]:
2013-01-01 -0.018425
2013-01-02 -1.153457
2013-01-03 -1.893892
2013-01-04 0.410927
2013-01-05 -1.627970
2013-01-06 0.240306
Freq: D, Name: C, dtype: float64
Access individual elements by index relative to zero -- Use
``.iloc[r, c]``::
In [42]: df.iloc[0]
Out[42]:
A 1.373992
B -0.080698
C -0.018425
D -0.424205
Name: 2013-01-01 00:00:00, dtype: float64
In [43]: df.iloc[0, 1]
Out[43]: -0.08069801201343964
In [44]: df.iloc[0, 1:3]
Out[44]:
B -0.080698
C -0.018425
Name: 2013-01-01 00:00:00, dtype: float64
In [45]: df.iloc[0:4, 1]
Out[45]:
2013-01-01 -0.080698
2013-01-02 0.733763
2013-01-03 0.406103
2013-01-04 -1.539155
Freq: D, Name: B, dtype: float64
In [46]: df.iloc[0:4, 1:-1]
Out[46]:
B C
2013-01-01 -0.080698 -0.018425
2013-01-02 0.733763 -1.153457
2013-01-03 0.406103 -1.893892
2013-01-04 -1.539155 0.410927
In [47]: df.iloc[0:4, 1:]
Out[47]:
B C D
2013-01-01 -0.080698 -0.018425 -0.424205
2013-01-02 0.733763 -1.153457 -1.345582
2013-01-03 0.406103 -1.893892 0.065913
2013-01-04 -1.539155 0.410927 0.038682
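To make the difference between positions and labels concrete, here is a small sketch on a hand-made 3x3 frame (the data and variable names are illustrative only). ``.iloc`` takes zero-based integer positions, while ``.loc`` takes index and column labels:

```python
import pandas as pd

# Hypothetical data with a daily DatetimeIndex, similar in shape to the
# frames used in this section.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    index=pd.date_range("2013-01-01", periods=3),
    columns=list("ABC"),
)

# Row 0, column 1 by position.
cell_by_position = df.iloc[0, 1]

# The same cell by label.
cell_by_label = df.loc[pd.Timestamp("2013-01-01"), "B"]

# Positional slices exclude the end point: rows 0-1, columns 1 onward.
block = df.iloc[0:2, 1:]

print(cell_by_position, cell_by_label)
print(block)
```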
Iterate over a ``DataFrame``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are several ways to do this. Here are some examples::
    import utils01

    def test():
        dates, df = utils01.make_sample_dataframe1()
        # iterate over column labels.
        print("*\n* column labels --\n*")
        print([x for x in df])
        # iterate over (label, column) pairs. Note: .items() replaces
        # .iteritems(), which was removed in pandas 2.0.
        print("*\n* items --\n*")
        print([x for x in df.head(n=2).items()])
        # iterate over rows
        print("*\n* rows --\n*")
        print([x for x in df.head(n=2).iterrows()])
        # iterate over rows as named tuples.
        print("*\n* named tuples --\n*")
        print([x for x in df.head(n=2).itertuples()])
        # iterate over rows as named tuples returning one column from each tuple.
        print("*\n* column \"B\" from named tuple --\n*")
        print([x.B for x in df.head(n=2).itertuples()])
Here is the output from the above function::
In [67]: test()
*
* column labels --
*
['A', 'B', 'C', 'D']
*
* items --
*
[('A', 2013-01-01 -2.443710
2013-01-02 -1.003475
Freq: D, Name: A, dtype: float64), ('B', 2013-01-01 -0.320540
2013-01-02 -1.020769
Freq: D, Name: B, dtype: float64), ('C', 2013-01-01 0.010302
2013-01-02 0.115615
Freq: D, Name: C, dtype: float64), ('D', 2013-01-01 0.935831
2013-01-02 -0.514601
Freq: D, Name: D, dtype: float64)]
*
* rows --
*
[(Timestamp('2013-01-01 00:00:00', freq='D'), A -2.443710
B -0.320540
C 0.010302
D 0.935831
Name: 2013-01-01 00:00:00, dtype: float64), (Timestamp('2013-01-02 00:00:00', freq='D'), A -1.003475
B -1.020769
C 0.115615
D -0.514601
Name: 2013-01-02 00:00:00, dtype: float64)]
*
* named tuples --
*
[Pandas(Index=Timestamp('2013-01-01 00:00:00', freq='D'), A=-2.4437103289150857, B=-0.32054023603910436, C=0.01030189942471081, D=0.9358311337233644), Pandas(Index=Timestamp('2013-01-02 00:00:00', freq='D'), A=-1.0034752077816913, B=-1.0207687970125863, C=0.11561494820245698, D=-0.5146012044818192)]
*
* column "B" from named tuple --
*
[-0.32054023603910436, -1.0207687970125863]
Iterating over a ``pandas.DataFrame`` produces the column labels,
which can be used to access the columns of the ``DataFrame``.
Example::
In [92]: for column in df:
...: print("{}[0]: {:7.3f}".format(column, getattr(df, column)[0]))
...:
A[0]: -0.368
B[0]: 1.122
C[0]: -0.890
D[0]: 0.076
An easier (and cleaner?) way to access a column would be: ``df[column]``.
In contrast, iterating over a ``pandas.Series`` produces the items
in the ``Series``. Example (note that ``dates`` is actually a
``DatetimeIndex`` rather than a ``Series``, but it iterates over its
items in the same way)::
In [112]: for date in dates:
...: print('date: {}/{}/{}'.format(date.month, date.day, date.year))
...:
date: 1/1/2013
date: 1/2/2013
date: 1/3/2013
date: 1/4/2013
date: 1/5/2013
date: 1/6/2013
Here is a simple bit of code that iterates over each of the items
(cells) in a Pandas DataFrame. This function prints out elements
column by column::
    def show_df(df):
        for idx1, label in enumerate(df):
            print('{}. Column: {}'.format(idx1, label))
            for idx2, item in enumerate(df[label]):
                print('    {}.{}. {:+6.4f}'.format(idx1, idx2, item))
And, here is what the above (function ``show_df``) might display::
In [78]: show_df(df.head(n=2))
0. Column: A
0.0. +0.9590
0.1. -3.6568
1. Column: B
1.0. +1.1409
1.1. -0.4395
2. Column: C
2.0. +1.2634
2.1. -0.3644
3. Column: D
3.0. +0.0824
3.1. +1.1789
And, here is a function that prints out elements row by row (i.e.
one row after another)::
    def show_df_by_rows(df):
        columns = df.columns
        for row, index in enumerate(df.index):
            print('{}. Row: {}'.format(row, index))
            for idx, item in enumerate(df.loc[index]):
                print('    {}.{}. {:+6.4f}'.format(idx, columns[idx], item))
Here is a sample printout from the above function::
0. Row: 2013-01-01 00:00:00
0.A. +0.9590
1.B. +1.1409
2.C. +1.2634
3.D. +0.0824
1. Row: 2013-01-02 00:00:00
0.A. -3.6568
1.B. -0.4395
2.C. -0.3644
3.D. +1.1789
You can do something analogous with list comprehensions or generator
expressions. For example, consider this code::
    def show_dataframe(df):
        generator = ((index, b.items()) for (index, b) in
                     ((index, df.loc[index]) for index in df.index))
        for date, data in generator:
            print('date: {}'.format(date))
            for col, item in data:
                print('    col: {}  item: {:12.4f}'.format(col, item))
When we run the above, calling ``show_dataframe``, we might see::
In [90]: show_dataframe(df.tail(2))
date: 2013-01-05 00:00:00
col: A item: 0.2175
col: B item: 0.1573
col: C item: -0.2240
col: D item: 0.2395
date: 2013-01-06 00:00:00
col: A item: 0.1440
col: B item: -0.9796
col: C item: -2.2432
col: D item: -0.7182
Notes:
- In the above example, we produced generator expressions. Note the
parentheses around the outer expression and inner expression used
to produce ``generator``. If we had used square brackets instead
of parentheses, that expression would have produced lists.
- The function ``show_dataframe`` contains a nested loop: the outer
  loop iterates over the outer generator expression and, within that
  outer loop, an inner loop iterates over the ``(column, item)`` pairs
  that ``.items()`` produces for each row.
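The difference between the two bracket styles can be seen without Pandas at all. This sketch uses a plain dictionary of rows (made-up data) to show that the parenthesized form is lazy and can be consumed only once, while the square-bracket form builds a list immediately:

```python
import types

# Hypothetical "rows": a dict of dicts standing in for a DataFrame.
rows = {"r1": {"A": 1.0, "B": 2.0}, "r2": {"A": 3.0, "B": 4.0}}

# Parentheses: a generator expression; nothing is computed yet.
lazy = ((key, row.items()) for key, row in rows.items())

# Square brackets: a list comprehension; everything is computed now.
eager = [(key, list(row.items())) for key, row in rows.items()]

print(isinstance(lazy, types.GeneratorType))
print(isinstance(eager, list))

# Consuming the generator yields the same data as the list ...
consumed = [(key, list(items)) for key, items in lazy]

# ... but a generator is exhausted after one pass.
leftover = list(lazy)
print(consumed == eager, leftover)
```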
Grouping items in a DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can group items in a DataFrame according to some criteria, then
process only items in that group. For example::
In [363]: dates, df = utils01.make_sample_dataframe1()
In [364]: df
Out[364]:
A B C D
2013-01-01 0.286823 -0.490076 1.876985 0.900970
2013-01-02 0.338896 -0.111205 -1.516295 1.344511
2013-01-03 -1.045215 -0.155277 -0.238831 0.763586
2013-01-04 0.911923 0.383383 -1.838096 -0.233212
2013-01-05 -0.424031 -0.396694 -1.260573 1.912463
2013-01-06 1.198149 -0.729439 1.578052 -1.139293
In [365]: f1 = lambda x: 0 if x < 0.0 else 1
In [366]: df["E"] = [f1(x) for x in df.A]
In [367]: df
Out[367]:
A B C D E
2013-01-01 0.286823 -0.490076 1.876985 0.900970 1
2013-01-02 0.338896 -0.111205 -1.516295 1.344511 1
2013-01-03 -1.045215 -0.155277 -0.238831 0.763586 0
2013-01-04 0.911923 0.383383 -1.838096 -0.233212 1
2013-01-05 -0.424031 -0.396694 -1.260573 1.912463 0
2013-01-06 1.198149 -0.729439 1.578052 -1.139293 1
In [368]: groups = df.groupby("E")
In [369]:
In [369]: len(groups)
Out[369]: 2
In [371]: groups.get_group(0)
Out[371]:
A B C D E
2013-01-03 -1.045215 -0.155277 -0.238831 0.763586 0
2013-01-05 -0.424031 -0.396694 -1.260573 1.912463 0
In [372]:
In [372]: groups.get_group(1)
Out[372]:
A B C D E
2013-01-01 0.286823 -0.490076 1.876985 0.900970 1
2013-01-02 0.338896 -0.111205 -1.516295 1.344511 1
2013-01-04 0.911923 0.383383 -1.838096 -0.233212 1
2013-01-06 1.198149 -0.729439 1.578052 -1.139293 1
Notes:
- We use the function/lambda ``f1`` to distinguish between values
that are less than zero and those that are greater than or equal
to zero.
- We create a list of keys depending on the values in column "A".
- We create a new column in our DataFrame containing these keys.
- We group the DataFrame depending on the values in this new column.
- Next we can determine the number of groups (using ``len(groups)``).
- And we can access each group individually (with
  ``groups.get_group(n)``).
- Notice that all the items in the first group (key 0) have negative
  values in column "A", and all the items in the second group (key 1)
  have non-negative values in column "A".
An alternative way to do the above task would pass a *function* to
the ``.groupby`` method. That function could assign or select rows
in arbitrarily complex ways. For example, the following function
could assign items to two groups depending on whether the value in
column "A" is negative or positive::
In [33]: def f1(index):
...: return 1 if df.loc[index].A < 0.0 else 0
...:
...:
In [34]:
In [34]: a = df.groupby(f1)
In [35]:
In [35]: len(a)
Out[35]: 2
In [36]:
In [36]: a.get_group(0)
Out[36]:
A B C D E
2013-01-01 0.823745 1.259863 0.099038 2.401296 0
2013-01-03 1.067624 1.106958 1.616902 0.939021 0
2013-01-04 1.152899 0.190998 -0.062540 -1.786131 0
2013-01-06 0.680271 1.307369 -0.024296 -0.973855 0
In [37]:
In [37]: a.get_group(1)
Out[37]:
A B C D E
2013-01-02 -0.358235 -1.920455 -0.553173 0.580201 1
2013-01-05 -0.226727 0.180529 0.900700 -1.835082 1
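Once the rows are grouped, the usual next step is to aggregate each group. Here is a minimal sketch that groups by a derived key without adding a column to the frame; the data and the labels ``negative`` and ``non-negative`` are made up for illustration:

```python
import pandas as pd

# Hypothetical data: group on the sign of "A", then average "B".
df = pd.DataFrame({"A": [0.5, -1.0, 2.0, -0.5], "B": [1.0, 2.0, 3.0, 4.0]})

# A Series with the same index as the frame can serve as the group key.
key = (df["A"] < 0).map({True: "negative", False: "non-negative"})

# One mean of column "B" per group.
means = df.groupby(key)["B"].mean()
print(means)
```

Passing a key ``Series`` like this keeps the original ``DataFrame`` unchanged, which can be handy when the grouping is only needed once.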
Applying functions to a DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can do this in a variety of ways:
- Element-wise -- Use ``.map`` for ``Series`` and ``.applymap`` for
``DataFrame``::
In [171]: dates.map(lambda x: x.day)
Out[171]: Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
In [172]: df.applymap(lambda x: 0.0 if x < 0.0 else x * 10.0)
Out[172]:
A B C D
2013-01-01 0.000000 11.222224 0.000000 0.764820
2013-01-02 8.165304 0.000000 8.425176 0.000000
2013-01-03 0.000000 7.066568 10.162480 0.000000
2013-01-04 7.097722 0.000000 10.544352 2.593139
2013-01-05 0.000000 0.000000 10.031058 6.354610
2013-01-06 5.629199 1.180783 0.000000 0.000000
- Row-wise and column-wise -- Use one of:
- ``df.apply(fn)`` -- Apply function to each column.
- ``df.apply(fn, axis=1)`` -- Apply function to each row.
- For functions that take and return a ``DataFrame`` or that take
and return a ``Series``, use ``.pipe``. Example::
In [197]: fn = lambda x: np.abs(x)
In [198]: df.pipe(fn)
Out[198]:
A B C D
2013-01-01 0.368409 1.122222 0.889764 0.076482
2013-01-02 0.816530 0.963447 0.842518 1.371106
2013-01-03 0.164827 0.706657 1.016248 0.474849
2013-01-04 0.709772 1.695648 1.054435 0.259314
2013-01-05 0.057673 0.713738 1.003106 0.635461
2013-01-06 0.562920 0.118078 1.904701 0.149196
And, remember that there may be use cases where it is useful to
create a "vectorized" function with ``numpy.vectorize``.
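The following sketch puts the column-wise, row-wise, and "vectorized" variants side by side on a tiny made-up frame. The helper ``clip_negative`` is a hypothetical function written for this example. (Recent Pandas releases also offer ``DataFrame.map`` as the newer name for ``applymap``.)

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame.
df = pd.DataFrame({"A": [1.0, -2.0], "B": [3.0, 4.0]})

# Column-wise (default axis=0): the function receives each column.
col_ranges = df.apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the function receives each row.
row_sums = df.apply(lambda row: row.sum(), axis=1)

def clip_negative(x):
    """Scalar function: zero out negatives, scale the rest by ten."""
    return 0.0 if x < 0.0 else x * 10.0

# numpy.vectorize wraps a scalar function so it accepts whole arrays.
# It is a convenience, not a speed-up: it still loops in Python.
vclip = np.vectorize(clip_negative)
clipped = vclip(df["A"].to_numpy())

print(col_ranges)
print(row_sums)
print(clipped)
```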
Sorting a ``DataFrame`` or a ``Series``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can sort by index, value, etc. See:
http://pandas.pydata.org/pandas-docs/stable/basics.html#sorting.
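As a quick illustration, here is a sketch of the two most common cases, sorting by index and sorting by the values in a column. The frame is made up for the example:

```python
import pandas as pd

# Hypothetical frame with an unsorted string index.
df = pd.DataFrame(
    {"A": [3, 1, 2], "B": [0.5, 0.7, 0.1]},
    index=["c", "a", "b"],
)

# Sort rows by their index labels.
by_index = df.sort_index()

# Sort rows by the values in column "B", largest first.
by_value = df.sort_values(by="B", ascending=False)

print(by_index)
print(by_value)
```

Both methods return a new object and leave ``df`` itself unchanged.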
Statistical analysis
~~~~~~~~~~~~~~~~~~~~~~
You can do preliminary and rudimentary statistical analysis. See:
http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics.
For more complex work, consider using the Scipy tools.
Examples::
In [65]: df.describe()
Out[65]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.255717 -0.067143 0.211290 -0.127586
std 1.102925 0.651381 0.663725 0.691202
min -0.746677 -1.277578 -0.445694 -1.101834
25% -0.415984 -0.110226 -0.142937 -0.473979
50% -0.111748 0.004162 -0.060588 -0.210746
75% 0.545268 0.374949 0.470344 0.363150
max 2.257601 0.516208 1.357676 0.765088
In [66]:
In [66]: sp.mean(df.A)
Out[66]: 0.2557174574376679
In [67]:
In [67]: sp.std(df.A, ddof=1)
Out[67]: 1.102925321931004
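The ``ddof=1`` argument in the last call is what makes the result agree with ``df.describe()``: Pandas computes the sample standard deviation (``ddof=1``) by default, while NumPy defaults to the population standard deviation (``ddof=0``). A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical data.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Pandas default: sample standard deviation (divides by n - 1).
sample_std = s.std()

# NumPy default: population standard deviation (divides by n).
population_std = np.std(s.to_numpy())

# Passing ddof=1 to NumPy reproduces the Pandas result.
numpy_sample_std = np.std(s.to_numpy(), ddof=1)

print(sample_std, population_std, numpy_sample_std)
```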
Visualization and graphing
============================
``Matplotlib``
----------------
See: http://matplotlib.org/
Bokeh
-------
See: https://bokeh.pydata.org/en/latest/
Here are Bokeh examples taken from the documentation::
    #!/usr/bin/env python

    from bokeh.plotting import figure, output_file, show

    # Note: newer Bokeh releases use the keyword legend_label where older
    # releases used legend.

    def test01():
        # prepare some data
        x = [1, 2, 3, 4, 5]
        y = [6, 7, 2, 4, 5]
        # output to static HTML file
        output_file("lines.html")
        # create a new plot with a title and axis labels
        p = figure(title="simple line example", x_axis_label='x', y_axis_label='y')
        # add a line renderer with legend and line thickness
        p.line(x, y, legend_label="Temp.", line_width=2)
        # show the results
        show(p)

    def test02():
        # prepare some data
        x = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
        y0 = [i**2 for i in x]
        y1 = [10**i for i in x]
        y2 = [10**(i**2) for i in x]
        # output to static HTML file
        output_file("log_lines.html")
        # create a new plot
        p = figure(
            tools="pan,box_zoom,reset,save",
            y_axis_type="log", y_range=[0.001, 10**11], title="log axis example",
            x_axis_label='sections', y_axis_label='particles'
        )
        # add some renderers
        p.line(x, x, legend_label="y=x")
        p.circle(x, x, legend_label="y=x", fill_color="white", size=8)
        p.line(x, y0, legend_label="y=x^2", line_width=3)
        p.line(x, y1, legend_label="y=10^x", line_color="red")
        p.circle(
            x, y1,
            legend_label="y=10^x",
            fill_color="red", line_color="red",
            size=6)
        p.line(x, y2, legend_label="y=10^x^2", line_color="orange", line_dash="4 4")
        # show the results
        #show(p, browser="firefox")
        show(p)

    def main():
        test01()
        test02()

    if __name__ == '__main__':
        main()
There are more examples in the Bokeh "Quickstart" document:
https://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart
Altair
--------
See: https://pypi.python.org/pypi/altair
Note that ``Altair`` is not in the ``Anaconda`` distribution, but is
easy to install with ``pip``.
Optimization, parallel processing, access to C/C++, etc.
==========================================================
Numba
-------
See: http://numba.pydata.org/numba-doc/dev/index.html.
And, here is an interesting article related to Numba: https://www.anaconda.com/blog/developer-blog/parallel-python-with-numba-and-parallelaccelerator/.
From the Numba user manual::
Numba is a compiler for Python array and numerical functions
that gives you the power to speed up your applications with high
performance functions written directly in Python.
Numba generates optimized machine code from pure Python code
using the LLVM compiler infrastructure. With a few simple
annotations, array-oriented and math-heavy Python code can be
just-in-time optimized to performance similar as C, C++ and
Fortran, without having to switch languages or Python
interpreters.
Numba’s main features are:
* on-the-fly code generation (at import time or runtime, at the
user’s preference)
* native code generation for the CPU (default) and GPU hardware
* integration with the Python scientific software stack (thanks
to Numpy)
Here is some sample test code, copied from the Numba documentation::
    # file: numba_test01.py

    import numba

    @numba.jit
    def sum2d(arr):
        M, N = arr.shape
        result = 0.0
        for i in range(M):
            for j in range(N):
                result += arr[i, j]
        return result

    def plain_sum2d(arr):
        M, N = arr.shape
        result = 0.0
        for i in range(M):
            for j in range(N):
                result += arr[i, j]
        return result
And, here is an example that calls the two above functions, one
optimized by Numba and the other not. Notice the timings. The
Numba optimized version is more than two orders of magnitude
faster::
In [30]: import numba_test01 as nt
In [31]: a = np.ones((1000, 1200))
In [32]: time nt.plain_sum2d(a)
CPU times: user 621 ms, sys: 0 ns, total: 621 ms
Wall time: 622 ms
Out[32]: 1200000.0
In [33]: time nt.sum2d(a)
CPU times: user 3.68 ms, sys: 0 ns, total: 3.68 ms
Wall time: 3.7 ms
Out[33]: 1200000.0
There is lots more that can be done with Numba in the way of
optimizing code. See the docs.
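One thing to keep in mind when timing Numba code: a function decorated without an explicit signature is compiled the first time it is called, so that first call includes compilation time. Here is a sketch of a timing run that makes a warm-up call first. The ``try``/``except`` fallback is my own addition so that the snippet still runs (without any speed-up) where Numba is not installed:

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (an assumption for this sketch): a do-nothing decorator.
    def njit(func):
        return func

@njit
def sum2d(arr):
    rows, cols = arr.shape
    result = 0.0
    for i in range(rows):
        for j in range(cols):
            result += arr[i, j]
    return result

a = np.ones((200, 300))

# Warm-up call: with Numba, compilation happens here.
sum2d(a)

# Time a second call, which reuses the compiled code.
start = time.perf_counter()
total = sum2d(a)
elapsed = time.perf_counter() - start
print("total: {}  elapsed: {:.6f} s".format(total, elapsed))
```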
Dask
------
The documentation on Dask can be found here:
http://dask.pydata.org/en/latest/docs.html.
This summary of Dask is from the Dask documentation::
Dask is a flexible parallel computing library for analytic computing.
Dask is composed of two components:
1. Dynamic task scheduling optimized for computation. This is similar to
Airflow, Luigi, Celery, or Make, but optimized for interactive
computational workloads.
2. “Big Data” collections like parallel arrays, dataframes, and lists
that extend common interfaces like NumPy, Pandas, or Python iterators
to larger-than-memory or distributed environments. These parallel
collections run on top of the dynamic task schedulers.
If you are beginning to learn Dask, you might want some sample data:
- The dask tutorial contains a script for generating sample data
files. You can find the tutorial repository here:
https://github.com/dask/dask-tutorial.
- And, here is a script that will generate a few HDF5 files. I
copied it from the Dask Web site
(http://dask.pydata.org/en/latest/examples/dataframe-hdf5.html),
and made a few minor modifications::
#!/usr/bin/env python
"""
synopsis:
generate sample dask data files.
usage:
python generate_dask_data.py <prefix>
options:
-h, --help
Display this help.
"""
import sys
import string
import random
import pandas as pd
import numpy as np
def generate(prefix):
# dict to keep track of hdf5 filename and each key
fileKeys = {}
for i in range(10):
# randomly pick letter as dataset key
groupkey = random.choice(list(string.ascii_lowercase))
# randomly pick a number as hdf5 filename
filename = prefix + str(np.random.randint(100)) + '.h5'
# Make a dataframe; 26 rows, 2 columns
df = pd.DataFrame({'x': np.random.randint(1, 1000, 26),
'y': np.random.randint(1, 1000, 26)},
index=list(string.ascii_lowercase))
# Write hdf5 to current directory
df.to_hdf(filename, key='/' + groupkey, format='table')
fileKeys[filename] = groupkey
# prints hdf5 filenames and keys for each
print(fileKeys)
def main():
args = sys.argv[1:]
if len(args) != 1:
sys.exit(__doc__)
if args[0] in ('-h', '--help'):
sys.exit(__doc__)
prefix = args[0]
generate(prefix)
if __name__ == '__main__':
main()
I used the above script to build sample data files as follows::
$ ./generate_dask_data.py "data02/sample_"
Then, with ``dask.dataframe`` imported as ``dd``, I read these HDF5
files into a Dask DataFrame by using the following::
In [38]: df = dd.read_hdf('./data02/sample_*.h5', key='/*')
In [39]: df
Out[39]:
Dask DataFrame Structure:
x y
npartitions=10
int64 int64
... ...
... ... ...
... ...
... ...
Dask Name: concat, 22 tasks
In [40]:
After which, I can do the following, for example::
In [40]: df.x.mean().compute()
Out[40]: 501.53076923076924
To see how our data has been broken down into separate partitions,
I can use this function::

    def test(df):
        results = []
        for idx in range(df.npartitions):
            mean = df.get_partition(idx).x.mean().compute()
            print('idx: {} mean: {}'.format(idx, mean))
            results.append((idx, mean))
        return results

Which produces something like the following::
In [10]: test(df)
idx: 0 mean: 473.7692307692308
idx: 1 mean: 436.5769230769231
idx: 2 mean: 501.2692307692308
idx: 3 mean: 565.4230769230769
idx: 4 mean: 516.8846153846154
idx: 5 mean: 501.34615384615387
idx: 6 mean: 531.3076923076923
idx: 7 mean: 428.61538461538464
idx: 8 mean: 565.2307692307693
idx: 9 mean: 494.88461538461536
Out[10]:
[(0, 473.7692307692308),
(1, 436.5769230769231),
(2, 501.2692307692308),
(3, 565.4230769230769),
(4, 516.8846153846154),
(5, 501.34615384615387),
(6, 531.3076923076923),
(7, 428.61538461538464),
(8, 565.2307692307693),
(9, 494.88461538461536)]
Dask for big data
~~~~~~~~~~~~~~~~~~~
Dask enables you to divide a large data structure or data set, for
example, a Pandas DataFrame, into smaller structures, for example,
smaller DataFrames, then load those smaller chunks from disk and
process them.
Example:
1. First we'll create a data set, a Pandas DataFrame, that we can
divide up into smaller chunks. Here is a Python script that we
can use to create a sample CSV (comma separated values) file::
#!/usr/bin/env python
# file: write_csv.py
"""
synopsis:
Write sample CSV file from Pandas DataFrame.
usage:
python write_csv.py <outfilename> <count>
example:
python write_csv.py test_data.csv 200
"""
import sys
import numpy as np
import pandas as pd
def make_sample_dataframe(periods):
"""Make sample dates and DataFrame. Returns (dates, df)."""
dates = pd.date_range('20130101', periods=periods)
df = pd.DataFrame(
np.random.randn(periods, 4),
index=dates,
columns=list('ABCD'))
return dates, df
def create_data(outfilename, count):
dates, df = make_sample_dataframe(count)
df.to_csv(outfilename)
def main():
args = sys.argv[1:]
if len(args) != 2:
sys.exit(__doc__)
outfilename = args[0]
count = int(args[1])
create_data(outfilename, count)
if __name__ == '__main__':
main()
And, from within IPython, we can run it to create a CSV file as
follows::
In [113]: %run write_csv.py tmp2.csv 200
Now, we can read that file to create a Dask DataFrame with the
following::
In [115]: import dask.dataframe as dd
In [116]: daskdf = dd.read_csv('tmp2.csv')
2. We can look at our data with ``daskdf.head()`` and ``daskdf.tail()``::
In [117]: daskdf.head()
Out[117]:
Unnamed: 0 A B C D
0 2013-01-01 1.719008 0.168998 -0.582670 -0.199597
1 2013-01-02 0.947192 1.449137 -0.701263 0.342353
2 2013-01-03 1.321397 0.035692 0.147275 1.551782
3 2013-01-04 -0.286258 0.592772 1.770504 1.752572
4 2013-01-05 1.695924 0.159782 2.150698 -0.060106
In [118]: daskdf.tail()
Out[118]:
Unnamed: 0 A B C D
195 2013-07-15 0.303020 0.710051 -0.904407 -0.451793
196 2013-07-16 -0.703248 -0.973423 -0.830585 0.183094
197 2013-07-17 0.886046 1.530008 1.319875 -0.318807
198 2013-07-18 0.021749 2.570984 0.572013 1.249558
199 2013-07-19 -0.570810 -0.240768 2.203662 -0.014111
Also see the Pandas section for ways to view structures, for
example: `View Pandas data structures`_
3. Next, we'll divide it up -- This is an important capability of
   Dask; it enables us to process DataFrames/arrays that are either
   too large to fit comfortably in memory or of which we need only
   sub-slices. In this case, we'll specify a block size (or a
   partition size), in bytes, when we read the CSV file and create a
   Dask DataFrame::
In [58]: %run write_csv.py tmp3.csv 200
In [59]:
In [59]: df3 = dd.read_csv('tmp3.csv', blocksize=600)
In [60]:
In [60]: df3.head()
Out[60]:
Unnamed: 0 A B C D
0 2013-01-01 1.907704 0.317188 0.779075 0.327731
1 2013-01-02 -0.936242 -0.679869 -0.817254 -0.810020
2 2013-01-03 -1.465717 -0.775163 -0.621830 -0.171773
3 2013-01-04 0.878534 -0.910678 -0.363762 0.462970
4 2013-01-05 -0.182779 0.174225 -1.483841 -0.062528
In [61]: df3.tail()
Out[61]:
Unnamed: 0 A B C D
0 2013-07-15 0.426699 -2.126057 -0.784172 0.780982
1 2013-07-16 -0.727647 -1.552699 0.750276 -0.788475
2 2013-07-17 0.452168 -0.525214 0.003892 -0.029953
3 2013-07-18 -1.135117 0.626181 -0.895456 2.096875
4 2013-07-19 1.365505 -0.208806 0.115254 -1.210855
In [62]:
In [62]: df3.A.mean().compute()
Out[62]: 0.04365032375682896
In [63]:
4. And, now, we'll process that data chunk by chunk::
In [63]: for idx in range(df3.npartitions):
...: data = df3.get_partition(idx)
...: mean = data.A.mean().compute()
...: print('partition: {} mean: {}'.format(idx, mean))
...:
partition: 0 mean: 0.1307434691610682
partition: 1 mean: -0.10723637021736673
partition: 2 mean: 0.47059788011488657
partition: 3 mean: -0.029706498960742605
partition: 4 mean: 0.06754303873144374
partition: 5 mean: 0.1604556981338858
partition: 6 mean: -0.4161510144675041
partition: 7 mean: 0.6799116374415602
partition: 8 mean: 0.6303390153859068
partition: 9 mean: 0.6517677726166038
partition: 10 mean: -0.02111769936010994
o
o
o
In [64]:
Notes:
- Keep in mind that Dask is capable of "parallelizing" the above
operation. It can process multiple partitions in parallel on a
multi-core/multi-CPU machine. See the next section for help
with that.
Dask for optimized (and parallel) computing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dask enables you to describe a complex process in terms of an
execution graph: a digraph (directed graph) whose nodes are
sub-processes. The valuable thing about being able to do so is that
Dask can schedule the execution of that larger process so that some
sub-processes are executed in parallel. On multi-CPU/multi-core
hardware, this can be a big win.
Dask supports parallel processing both on a single machine and on
multiple, distributed machines. In what follows, however, I will
discuss parallel computation on a single machine.
To learn more about this, you will want to read the following:
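- `Scheduling <http://dask.pydata.org/en/latest/scheduling.html>`_ --
  http://dask.pydata.org/en/latest/scheduling.html
- `Single Machine with Dask.distributed
  <http://dask.pydata.org/en/latest/setup/single-distributed.html>`_ --
  http://dask.pydata.org/en/latest/setup/single-distributed.html
- `Dask.distributed
  <https://distributed.readthedocs.io/en/latest/index.html>`_ --
  https://distributed.readthedocs.io/en/latest/index.html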
Controlling parallelism in Dask requires understanding Dask
schedulers, how they are used by Dask, and how to use them.
Note that Dask has default schedulers. If you do nothing to change
or set the scheduler, you will be using the default, which is most
often what you want. The notes that follow will attempt to help you
determine when and under what conditions you might want to use a
different scheduler and how to do that.
Also, keep in mind two concepts that are both related to
optimization in Dask: (1) Parallelism is what you want when you have
multiple tasks and want to speed them up by running/computing them
in parallel. (2) Breaking your data and your Dask data collections
into chunks is what you want when your data set is very large and
will not fit in memory. You should keep in mind that breaking your
data into chunks may slow down processing. Here is something that
shows some of those differences::
In [57]: df1 = dd.read_csv('tmp5.csv', blocksize=1000000)
In [58]: df2 = dd.read_csv('tmp5.csv', blocksize=8000)
In [59]:
In [59]: df1.npartitions
Out[59]: 1
In [60]: df2.npartitions
Out[60]: 12
In [61]: df1.get_partition(0).size.compute()
Out[61]: 5000
In [62]: df2.get_partition(0).size.compute()
Out[62]: 450
In [63]:
In [63]: time df1.A.mean().compute()
CPU times: user 15.8 ms, sys: 7.5 ms, total: 23.3 ms
Wall time: 22.3 ms
Out[63]: 0.02893067882172706
In [64]: time df2.A.mean().compute()
CPU times: user 167 ms, sys: 9.85 ms, total: 177 ms
Wall time: 164 ms
Out[64]: 0.028930678821727045
In [65]:
Notes:
- We create ``df1`` with a single partition (or chunk) and ``df2``
with multiple partitions (in this case 12).
- The size of a single partition of ``df1`` is much larger than the
first partition of ``df2`` (5000 vs 450).
- Computing the mean of a single column of ``df1`` takes
significantly less time than the same operation on ``df2``.
Synchronous processing on the local machine -- The default scheduler
does that.
Let's figure out how to run independent computations in parallel.
For example, we'll try to compute the mean of each of the columns of
our DataFrame (four columns: "A", "B", "C", and "D") in parallel.
Here are two functions. One computes the mean for each column in our
DataFrame, one column after another. The other attempts to use
``dask.distributed`` to schedule these four tasks so that they make
use of more than one CPU core::
def compute_means_sequential(df):
"""
Sequentially compute the means of columns of dataframe.
Args:
df (dask.dataframe.DataFrame) -- A dataframe containing columns
A, B, C, and D.
Return:
The means
"""
meanA = df.A.mean().compute()
meanB = df.B.mean().compute()
meanC = df.C.mean().compute()
meanD = df.D.mean().compute()
return meanA, meanB, meanC, meanD
def compute_means_parallel(client, df):
"""
Compute in parallel the means of columns of dataframe.
Args:
client (dask.distributed.Client) -- The client to schedule
the computation.
df (dask.dataframe.DataFrame) -- A dataframe containing columns
A, B, C, and D.
Return:
The means
"""
meanA = client.submit(df.A.mean().compute)
meanB = client.submit(df.B.mean().compute)
meanC = client.submit(df.C.mean().compute)
meanD = client.submit(df.D.mean().compute)
client.gather((meanA, meanB, meanC, meanD))
return meanA.result(), meanB.result(), meanC.result(), meanD.result()
You can find a file containing these snippets here:
`snippets.py <{filename}static/snippets.py>`_.
Here is a test that uses the above on a 2-core machine::
In [17]: time snippets.compute_means_sequential(df1)
CPU times: user 167 ms, sys: 21.3 ms, total: 189 ms
Wall time: 379 ms
Out[17]:
(0.02893067882172706,
-0.05704419047235241,
-0.03281851829891229,
-0.029845199428518945)
In [18]: time snippets.compute_means_parallel(client, df1)
CPU times: user 189 ms, sys: 16.9 ms, total: 206 ms
Wall time: 281 ms
Out[18]:
(0.02893067882172706,
-0.05704419047235241,
-0.03281851829891229,
-0.029845199428518945)
Here is a test that uses the above on a 4-core machine::
In [15]: time snippets.compute_means_sequential(df1)
CPU times: user 160 ms, sys: 9.5 ms, total: 169 ms
Wall time: 303 ms
Out[15]:
(0.02893067882172706,
-0.05704419047235241,
-0.03281851829891229,
-0.029845199428518945)
In [16]:
In [16]: time snippets.compute_means_parallel(client, df1)
CPU times: user 164 ms, sys: 5.03 ms, total: 169 ms
Wall time: 224 ms
Out[16]:
(0.02893067882172706,
-0.05704419047235241,
-0.03281851829891229,
-0.029845199428518945)
Notes:
- Parallel execution on a 4-core machine takes measurably less
time. On a large data structure, this might be significant and
noticeable.
- My original test had four calls to ``print()`` in each of the
above two functions. That partially masked the time difference
between calls to these functions.
- As with any work on optimization, you will need to test with your
  data, your machine, your configuration, etc. YMMV (your mileage
  may vary).
Cython
--------
See: http://cython.org/.
Cython enables us to write or produce C code while writing code in
the style of Python. There's more to it than that, but you get the
idea. We can write code that looks a lot like Python code,
and then use Cython to turn it into C code.
Cython has another important use -- Because (1) Cython gives us easy
access to libraries of compiled C code and (2) it is easy to write
functions in Cython that can be called from Python, we can use it to
easily "wrap" C functions for use in Python. In fact, if you look
inside some Python packages, for example Lxml, you will see wrappers
for underlying C code that were produced with Cython; Lxml makes
calls into the ``libxml`` XML libraries provided by
http://www.xmlsoft.org.
Here is a bit more description from http://cython.org/:
"Cython is an optimising static compiler for both the Python programming
language and the extended Cython programming language (based on Pyrex). It
makes writing C extensions for Python as easy as Python itself.
"Cython gives you the combined power of Python and C to let you
* write Python code that calls back and forth from and to C or C++ code
natively at any point.
* easily tune readable Python code into plain C performance by
adding static type declarations.
* use combined source code level debugging to find bugs in your Python,
Cython and C code.
* interact efficiently with large data sets, e.g. using multi-dimensional
NumPy arrays.
* quickly build your applications within the large, mature and widely used
CPython ecosystem.
* integrate natively with existing code and data from legacy, low-level or
high-performance libraries and applications."
Machine learning
==================
Scikit-Learn
--------------
The ``scikit-learn`` documentation page is here:
http://scikit-learn.org/stable/user_guide.html.
EliteDataScience has an introduction to machine learning here:
https://elitedatascience.com/learn-machine-learning
EliteDataScience has provided a Scikit-Learn tutorial here:
https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn.
tensorflow
------------
Question: Is there support for tensorflow in Anaconda? Answer:
Yes, but currently, installing it is tricky. For example, see this:
https://gist.github.com/johndpope/187b0dd996d16152ace2f842d43e3990
Multiprocessing and parallelization
=====================================
``ipyparallel``
-----------------
See: https://ipyparallel.readthedocs.io/en/latest/
Dask and Dask schedulers
--------------------------
See: https://dask.pydata.org/
Also see the section on Dask elsewhere in the current document:
`Dask for optimized (and parallel) computing`_.
Data store -- HDF5, h5py, Pytables, asdf, etc
===============================================
HDF5
------
h5py
~~~~~~
You can store Pandas DataFrames and Dask DataFrames in HDF5 archives
with ``h5py``. You can read about ``h5py`` here:
- https://www.h5py.org/
- http://docs.h5py.org/en/latest/
- http://shop.oreilly.com/product/0636920030249.do -- a book.
Also see: https://dask.pydata.org/en/doc-test-build/array-overview.html#construct
Here is an example that saves and retrieves a Dask
DataFrame::
In [62]: df1, df2 = snippets.read_csv_files('tmp5.csv')
In [63]: df1.to_hdf('tmp01.hdf5', '/Version1/tmp5')
Out[63]: ['tmp01.hdf5']
In [64]:
In [64]: df1a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5')
In [65]:
In [65]: df1.A.mean().compute()
Out[65]: 0.02893067882172706
In [66]: df1a.A.mean().compute()
Out[66]: 0.02893067882172706
In [68]: df2.to_hdf('tmp01.hdf5', '/Version1/tmp5_2')
Out[68]:
['tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5',
'tmp01.hdf5']
In [69]:
In [69]: df2a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2')
In [70]:
In [70]: df2.npartitions
Out[70]: 12
In [71]: df2a.npartitions
Out[71]: 1
In [72]: df2.B.sum().compute()
Out[72]: -57.04419047235241
In [73]: df2a.B.sum().compute()
Out[73]: -57.04419047235241
Notes:
- We write a Dask DataFrame (``df1``) to an HDF5 file, then read it
  back into a separate variable (``df1a``).
- We compute the mean of column A of both DataFrames so as to show
that the one we wrote to HDF5 and the one we read back in from
HDF5 contain the same data.
- Notice that in the case of ``df2`` and ``df2a``, the ``read_hdf``
  function did not preserve the chunk size and number of partitions.
However, the ``read_hdf`` function has an optional parameter that
enables you to read a DataFrame from HDF5 creating multiple
partitions and a smaller chunk size. Example::
In [80]: df2b = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2')
In [81]: df2b.npartitions
Out[81]: 1
In [82]: df2c = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2', chunksize=100)
In [83]: df2c.npartitions
Out[83]: 10
h5serv
~~~~~~~~
There is also an HTTP server for HDF5 archives. It presents a
REST-ful interface that enables you to add, list, and retrieve data
objects from HDF5 archives on a remote machine. The data returned
in response to a retrieval request is formatted as JSON.
You can learn more about ``h5serv`` here:
http://h5serv.readthedocs.io/en/latest/.
And, you can learn about the JSON representation of HDF5 here:
http://hdf5-json.readthedocs.io/en/latest/index.html.
Pytables
~~~~~~~~~~
asdf
------
The documentation is here: https://asdf.readthedocs.io/en/latest/.
And, a bit more documentation:
https://www.sciencedirect.com/science/article/pii/S2213133715000645
CSV -- comma separated values
-------------------------------
A CSV module is in the Python standard library. See:
https://docs.python.org/3/library/csv.html
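Here is a minimal sketch of writing and then reading CSV data with that module. It uses an in-memory ``io.StringIO`` buffer in place of a real file, and the column names and values are made up for the example:

```python
import csv
import io

# An in-memory text buffer stands in for a file on disk.
buffer = io.StringIO(newline="")

# Write a header row and two data rows.
writer = csv.writer(buffer)
writer.writerow(["date", "A", "B"])
writer.writerow(["2013-01-01", 1.5, -0.25])
writer.writerow(["2013-01-02", 0.75, 2.0])

# Rewind, then read each row back as a dict keyed by the header.
buffer.seek(0)
rows = list(csv.DictReader(buffer))

# Values come back as strings, so convert before doing arithmetic.
total_a = sum(float(row["A"]) for row in rows)
print(rows[0]["date"], total_a)
```

With a real file, the ``csv`` documentation recommends opening it with ``newline=""`` in the same way.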
.. vim: ft=rst :