================================================ A summary of tools for data science for Python ================================================ :author: Dave Kuhlman :contact: dkuhlman (at) davekuhlman (dot) org :address: http://www.davekuhlman.org :revision: 1.0.1 :date: |date| .. |date| date:: %B %d, %Y :Copyright: Copyright (c) 2018 Dave Kuhlman. All Rights Reserved. This software is subject to the provisions of the MIT License http://www.opensource.org/licenses/mit-license.php. :Abstract: This document attempts to give a survey of data science tools for Python programming, along with brief introductions to help getting started with some of those tools. .. sectnum:: .. contents:: Introduction and preliminaries ================================ In this document I'll try to describe and summarize some significant tools that are available to Python programmers for data science, numerical processing, statistics, and visualizing numerical data. For each tool or package, I'll also try to give a brief overview of: - What the tool does. - What to use it for, along with a few use cases. - How to do a few common things that the tool supports. - When appropriate, a comparison with other similar tools. All these packages are available in the Anaconda distribution of Python, which makes Anaconda a very good option for data analytics and visualization. See: - https://docs.anaconda.com/anaconda/ - https://docs.anaconda.com/anaconda/packages/pkg-docs It's likely that they are also available at http://pypi.python.org and can be installed with ``pip``. If you plan on doing some exploration (and do not want to use the Anaconda distribution), you will want to consider using ``virtualenv`` (https://virtualenv.pypa.io/en/stable/) and, for even more convenience in trying out various packages and configurations, look at ``virtualenvwrapper`` (https://virtualenvwrapper.readthedocs.io/en/latest/). More information: - There is another summary of Python packages for data science here: https://elitedatascience.com/r-vs-python-for-data-science. Includes tools for the R programming language, too. Many on the examples in this document use the somewhat standard import statements, for example:: import numpy as np import scipy as sp import pandas as pd Some helpers ============== ipython ------------- IPython is an enhanced interactive Python shell. It has tab completion, gives more convenient access to help for Python modules and objects, enables you to edit and rerun previous commands, and much more. For more information, see: https://ipython.org. Anaconda ships with QtConsole that contains IPython for even more convenience. IPython profiles ~~~~~~~~~~~~~~~~~~ If you use IPython, then consider creating a data science profile. Use something like this:: $ ipython profile create datasci Then, consider putting something like the following in ``~/.ipython/profile_datasci/startup/50-config.py``:: import sys import numpy as np import scipy as sp def pdir(obj): """Print information about obj, including `dir(obj)`.""" if isinstance(obj, type): print('class: {}'.format(obj.__name__)) else: print('instance class name: {}'.format(obj.__class__.__name__)) if obj.__doc__: print('doc string: {}'.format(obj.__doc__)) else: print('doc string: no doc string') print(dir(obj)) def read_file_contents(filename): with open(filename, 'r') as infile: content = infile.read() return content You can have multiple startup files. See the ``startup/README`` file in your profile directory. 
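With that profile in place (start IPython with it as shown below), the helpers defined above are available in every session. For example (``notes.txt`` is just a hypothetical file name used for illustration)::

    In [1]: pdir(np.ndarray)                        # prints class name, doc string, and dir()
    In [2]: text = read_file_contents('notes.txt')  # read the contents of a (hypothetical) file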
Also, consider doing some customization in ``~/.ipython/profile_datasci/ipython_config.py``. And, in order to use that profile, start IPython with this:: $ ipython --profile=datasci You can find more help with profiles by running something like the following:: $ ipython help profile Or, see the following: http://ipython.readthedocs.io/en/stable/config/intro.html#profiles Getting (interactive) help and docs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Inside the standard Python interactive shell, you can get help on ``some_object`` with this:: >>> help(some_object) Inside the IPython interactive shell, you can use the above, or you can do:: In [9]: import scipy.fftpack In [10]: scipy.fftpack? In [11]: In [11]: from scipy import fftpack In [12]: fftpack? In [13]: fftpack.fft? You can use ``pydoc`` to get help at the command line. For example:: $ pydoc numpy.arange You can also use ``pydoc`` to run an HTTP server, and view the documentation in a Web browser. Do the following for help with that:: $ pydoc --help And, of course, documentation is available for the Scipy suite of tools at: http://www.scipy.org. Installing the tools ---------------------- Unless otherwise noted, each of the tools described in this document can be described with ``pip install ...`` (the standard Python install tool) or, for those who are using the Anaconda Python distribution, with ``conda install ...``. ``pip`` and ``virtualenv`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you use ``pip``, I'd recommend using ``virtualenv``, at the least, and even ``virtualenvwrapper``, for extra convenience and flexibility. ``virtualenv`` enables you to install Python packages (and therefor, the tools discussed in this document) in a separate environment, separate from your standard Python installation, and without polluting that standard installation. Since that separate installation is in its own directory, you can remove it by simply deleting that directory. ``virtualenvwrapper`` extends ``virtualenv`` by enabling you to create, manage, and switch between different ``virtualenv`` environments easily. For example, you might want to create and switch (1) between one ``virtualenv`` for text processing and another for data science or (2) between one installation for Python 2 and another for Python 3. See: - ``virtualenv`` -- https://pypi.python.org/pypi/virtualenv - ``virtualenvwrapper`` -- https://virtualenvwrapper.readthedocs.io/en/latest/ Anaconda ~~~~~~~~~~~~~~ The Anaconda installation of Python provides most of the tools discussed in this document in the standard Anaconda installation. Additional tools can be installed with ``conda install ...``, and the installation can be kept up-to-date with ``conda update --all``. In the event that you need a Python package that is not provided by Anaconda, you can use ``pip``. - The Anaconda distribution of Python -- https://continuum.io/ - ``conda``, the package manager for Anaconda -- https://conda.io/docs/index.html Other Python distributions for data science ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For more options on installing Python with a slant toward data science and scientific programming (but much else besides), see: https://www.scipy.org/install.html. Analytics =========== Numpy -------- Help with Numpy: - See the documentation page: http://www.numpy.org. 
- A tutorial: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html - Some lecture notes: http://www.scipy-lectures.org/intro/numpy/numpy.html There are (at least) two aspects to Numpy: - Primitive Numpy numeric types or scalars, for example: ``np.int32``, ``np.int64``, ``np.float32``, ``np.float64``, etc. See the following for information on these primitive types and others: https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html. - Array objects (instances of ``np.ndarray``) along with ways to deal with them. - Operations on Numpy arrays -- For information on these, see the Numpy reference manual: https://docs.scipy.org/doc/numpy/reference/index.html. Here is a quick summary: - Array creation routines -- Create arrays of different kinds, e.g. all ones, all zeros, identity, from an existing array, as a copy of an array, etc. - Array manipulation routines -- Routines that reshape an array, transpose an array, change the number of dimensions, join (concatenate, stack, etc), tiling arrays (create by repeating an array), etc. split arrays, etc. - Binary operations -- Logical binary operations on arrays, packing arrays into bits, bit-shifting operations, etc. - String operations - C-Types Foreign Function Interface (numpy.ctypeslib) - Datetime Support Functions - Data type routines - Optionally Scipy-accelerated routines (numpy.dual) - Mathematical functions with automatic domain (numpy.emath) -- Routines possibly accelerated by Scipy, but available in Numpy if Scipy is not installed. For example, routines for eigenvalues, Fourier transforms, solving linear equations, etc. Use:: >>> from numpy import dual - Floating point error handling - Discrete Fourier Transform (numpy.fft) -- Use:: >>> from numpy import fft Or, just:: >>> np.fft.fft( ... ) # etc. - Financial functions -- Loan, payment, and interest calculations. - Functional programming -- Routines and classes that assist with doing functional programming. For example, ``np.vectorize`` creates a "vectorized" function; ``np.frompyfunc`` creates a Numpy ``ufunc``. (Note that vectorized functions and universal functions can be applied to arrays. For help with the difference between vectorized and universal functions, see: https://stackoverflow.com/questions/6768245/difference-between-frompyfunc-and-vectorize-in-numpy.) Also, remember to look at ``functools`` and ``itertools`` in the standard Python library: https://docs.python.org/3/library/functional.html And, if you need parallelism across multiple CPUs and cores, look at ``ipyparallel``: https://ipyparallel.readthedocs.io/en/latest/ - Numpy-specific help functions -- Functions for getting information about objects and help with Numpy. (Also, if you are using IPython, the "?" operator gives help with a function or object, for example, ``enumerate?`` gives help on the ``enumerate`` function.) - Indexing routines - Input and output -- Routines for saving and loading arrays. (But, you may also want to explore HDF5 and ``h5py`` or ``pytables``. Both ``h5py`` and ``pytables`` are in the Anaconda Python distribution.) Also, routines for formatting arrays as strings, converting arrays to and from strings, etc.. - Linear algebra (numpy.linalg) -- Routines for the following: - Matrix and vector products - Decompositions - Matrix eigenvalues - Norms and other numbers - Solving equations and inverting matrices - Exceptions - Linear algebra on several matrices at once - Logic functions -- Functions for performing various tests on elements of Numpy arrays. 
- Masked array operations -- Support for creating and using masked arrays. A masked array is an array with a mask that marks some elements of the array as invalid. You can find some help with masked arrays in this document: http://www.scipy-lectures.org/intro/numpy/numpy.html. - Mathematical functions -- Functions for: - Trigonometric functions - Hyperbolic functions - Rounding - Sums, products, differences - Exponents and logarithms - Other special functions - Floating point routines - Arithmetic operations - Handling complex numbers - etc - Matrix library (numpy.matlib) -- Functions for creating and using matrices, as opposed to ``numpy.ndarry``. Use ``from numpy import matlib``. See this for a bit of help on the differences between arrays and matrices in Numpy: https://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u - Miscellaneous routines - Padding Arrays - Polynomials - Random sampling (numpy.random) - Set routines - Sorting, searching, and counting - Statistics - Test Support (numpy.testing) - Window functions Scipy ------- Note that Scipy, Numpy, Pandas, Matplotlib, IPython, and Sympy are all under the Scipy umbrella. For information about any of these, see: https://www.scipy.org/. What is Scipy? (1) It is many things to many people. But more seriously, (2) it is a large collection of functions for performing operations on arrays of numerical data. Think of it this way: Numpy (and Pandas) give you ways to structure and manipulate multi-dimensional arrays of numbers; Scipy gives you many functions that perform operations on those multi-dimensional arrays of numbers. What kinds of operations? Here are some categories with descriptions: - Basic functions - Special functions (scipy.special) Integration (scipy.integrate) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For help with this set of functions, do the following:: >>> from scipy import integrate >>> help(integrate) Or, in IPython, do ``integrate?`` Here is the list you will see: - Integrating functions, given function object - quad -- General purpose integration - dblquad -- General purpose double integration - tplquad -- General purpose triple integration - nquad -- General purpose n-dimensional integration - fixed_quad -- Integrate func(x) using Gaussian quadrature of order n - quadrature -- Integrate with given tolerance using Gaussian quadrature - romberg -- Integrate func using Romberg integration - quad_explain -- Print information for use of quad - newton_cotes -- Weights and error coefficient for Newton-Cotes integration IntegrationWarning -- Warning on issues during integration - Integrating functions, given fixed samples - trapz -- Use trapezoidal rule to compute integral. - cumtrapz -- Use trapezoidal rule to cumulatively compute integral. - simps -- Use Simpson's rule to compute integral from samples. - romb -- Use Romberg Integration to compute integral from (2**k + 1) evenly-spaced samples. - Solving initial value problems for ODE systems The solvers are implemented as individual classes which can be used directly (low-level usage) or through a convenience function. - solve_ivp -- Convenient function for ODE integration. - RK23 -- Explicit Runge-Kutta solver of order 3(2). - RK45 -- Explicit Runge-Kutta solver of order 5(4). - Radau -- Implicit Runge-Kutta solver of order 5. - BDF -- Implicit multi-step variable order (1 to 5) solver. - LSODA -- LSODA solver from ODEPACK Fortran package. - OdeSolver -- Base class for ODE solvers. 
- DenseOutput -- Local interpolant for computing a dense output. - OdeSolution -- Class which represents a continuous ODE solution. Optimization (scipy.optimize) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Remember that for each the following (or any) functions, you can get help in the usual ways: ``help(some_func)`` or (in IPython) ``some_func?``. - Local Optimization: - minimize -- Unified interface for minimizers of multivariate functions - minimize_scalar -- Unified interface for minimizers of univariate functions - OptimizeResult -- The optimization result returned by some optimizers - OptimizeWarning -- The optimization encountered problems - General-purpose multivariate methods: - fmin -- Nelder-Mead Simplex algorithm - fmin_powell -- Powell's (modified) level set method - fmin_cg -- Non-linear (Polak-Ribiere) conjugate gradient algorithm - fmin_bfgs -- Quasi-Newton method (Broydon-Fletcher-Goldfarb-Shanno) - fmin_ncg -- Line-search Newton Conjugate Gradient - Constrained multivariate methods: - fmin_l_bfgs_b -- Zhu, Byrd, and Nocedal's constrained optimizer - fmin_tnc -- Truncated Newton code - fmin_cobyla -- Constrained optimization by linear approximation - fmin_slsqp -- Minimization using sequential least-squares programming - differential_evolution -- stochastic minimization using differential evolution - Univariate (scalar) minimization methods: - fminbound -- Bounded minimization of a scalar function - brent -- 1-D function minimization using Brent method - golden -- 1-D function minimization using Golden Section method - Equation (Local) Minimizers: - leastsq -- Minimize the sum of squares of M equations in N unknowns - least_squares -- Feature-rich least-squares minimization. - nnls -- Linear least-squares problem with non-negativity constraint - lsq_linear -- Linear least-squares problem with bound constraints - Global Optimization: - basinhopping -- Basinhopping stochastic optimizer - brute -- Brute force searching optimizer - differential_evolution -- stochastic minimization using differential evolution - Rosenbrock function: - rosen -- The Rosenbrock function. - rosen_der -- The derivative of the Rosenbrock function. - rosen_hess -- The Hessian matrix of the Rosenbrock function. - rosen_hess_prod -- Product of the Rosenbrock Hessian with a vector. - Fitting: - curve_fit -- Fit curve to a set of points - Root finding -- Scalar functions: - brentq -- quadratic interpolation Brent method - brenth -- Brent method, modified by Harris with hyperbolic extrapolation - ridder -- Ridder's method - bisect -- Bisection method - newton -- Secant method or Newton's method - Fixed point finding: - fixed_point -- Single-variable fixed-point solver - General nonlinear solvers: - root -- Unified interface for nonlinear solvers of multivariate functions - fsolve -- Non-linear multi-variable equation solver - broyden1 -- Broyden's first method - broyden2 -- Broyden's second method - Large-scale nonlinear solvers: - newton_krylov - anderson - Simple iterations: - excitingmixing - linearmixing - diagbroyden Additional information on the nonlinear solvers can be obtained from the help on ``scipy.optimize.nonlin``. 
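To give a feel for how these routines are called, here is a minimal sketch; the exponential model and the noisy sample data are made up for illustration, and only ``numpy`` and ``scipy.optimize`` are assumed. It fits a curve with ``curve_fit`` and minimizes the Rosenbrock function with ``minimize``::

    import numpy as np
    from scipy import optimize

    def model(x, a, b):
        # A made-up model: simple exponential decay.
        return a * np.exp(-b * x)

    xdata = np.linspace(0.0, 4.0, 50)
    ydata = model(xdata, 2.5, 1.3) + 0.05 * np.random.randn(50)

    # Fit the model to the noisy data.
    params, covariance = optimize.curve_fit(model, xdata, ydata)
    print(params)

    # Minimize the Rosenbrock function from an arbitrary starting point.
    result = optimize.minimize(optimize.rosen, x0=[1.3, 0.7, 0.8],
                               method='Nelder-Mead')
    print(result.x)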
- Linear Programming -- General linear programming solver: linprog -- Unified interface for minimizers of linear programming problems - The simplex method supports callback functions, such as: linprog_verbose_callback -- Sample callback function for linprog (simplex) - Assignment problems: - linear_sum_assignment -- Solves the linear-sum assignment problem - Utilities: - approx_fprime -- Approximate the gradient of a scalar function - bracket -- Bracket a minimum, given two starting points - check_grad -- Check the supplied derivative using finite differences - line_search -- Return a step that satisfies the strong Wolfe conditions - show_options -- Show specific options optimization solvers - LbfgsInvHessProduct -- Linear operator for L-BFGS approximate inverse Hessian Interpolation (scipy.interpolate) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sub-package for objects used in interpolation. As listed below, this sub-package contains spline functions and classes, one-dimensional and multi-dimensional (univariate and multivariate) interpolation classes, Lagrange and Taylor polynomial interpolators, and wrappers for `FITPACK `__ and DFITPACK functions. - Univariate interpolation - interp1d - BarycentricInterpolator - KroghInterpolator - PchipInterpolator - barycentric_interpolate - krogh_interpolate - pchip_interpolate - Akima1DInterpolator - CubicSpline - PPoly - BPoly - Multivariate interpolation - Unstructured data: - griddata - LinearNDInterpolator - NearestNDInterpolator - CloughTocher2DInterpolator - Rbf - interp2d - For data on a grid: - interpn - RegularGridInterpolator - RectBivariateSpline See also: `scipy.ndimage.map_coordinates` - Tensor product polynomials: - NdPPoly - 1-D Splines - BSpline - make_interp_spline - make_lsq_spline - Functional interface to FITPACK routines: - splrep - splprep - splev - splint - sproot - spalde - splder - splantider - insert - Object-oriented FITPACK interface: - UnivariateSpline - InterpolatedUnivariateSpline - LSQUnivariateSpline - 2-D Splines - For data on a grid: - RectBivariateSpline - RectSphereBivariateSpline - For unstructured data: - BivariateSpline - SmoothBivariateSpline - SmoothSphereBivariateSpline - LSQBivariateSpline - LSQSphereBivariateSpline - Low-level interface to FITPACK functions: - bisplrep - bisplev - Additional tools - lagrange - approximate_taylor_polynomial - pade See also: - `scipy.ndimage.map_coordinates`, - `scipy.ndimage.spline_filter`, - `scipy.signal.resample`, - `scipy.signal.bspline`, - `scipy.signal.gauss_spline`, - `scipy.signal.qspline1d`, - `scipy.signal.cspline1d`, - `scipy.signal.qspline1d_eval`, - `scipy.signal.cspline1d_eval`, - `scipy.signal.qspline2d`, - `scipy.signal.cspline2d`. - Functions existing for backward compatibility (should not be used in new code): - ``spleval`` - ``spline`` - ``splmake`` - ``spltopp`` - ``pchip`` Fourier Transforms (``scipy.fftpack``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There is help and a number of examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html. 
Here is an example, copied from the documentation in the above link:: import numpy as np from scipy.fftpack import fft def test(): # Number of sample points N = 600 # sample spacing T = 1.0 / 800.0 x = np.linspace(0.0, N * T, N) y = np.sin(50.0 * 2.0 * np.pi * x) + 0.5 * np.sin(80.0 * 2.0 * np.pi * x) yf = fft(y) from scipy.signal import blackman w = blackman(N) ywf = fft(y * w) xf = np.linspace(0.0, 1.0 / (2.0 * T), N / 2) import matplotlib.pyplot as plt plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(yf[1:N // 2]), '-b') plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(ywf[1:N // 2]), '-r') plt.legend(['FFT', 'FFT w. window']) plt.grid() plt.show() test() Here is a summary of the Discrete Fourier transforms support in ``scipy.fftpack``: - Fast Fourier Transforms (FFTs) - ``fft`` - Fast (discrete) Fourier Transform (FFT) - ``ifft`` - Inverse FFT - ``fft2`` - Two dimensional FFT - ``ifft2`` - Two dimensional inverse FFT - ``fftn`` - n-dimensional FFT - ``ifftn`` - n-dimensional inverse FFT - ``rfft`` - FFT of strictly real-valued sequence - ``irfft`` - Inverse of rfft - ``dct`` - Discrete cosine transform - ``idct`` - Inverse discrete cosine transform - ``dctn`` - n-dimensional Discrete cosine transform - ``idctn`` - n-dimensional Inverse discrete cosine transform - ``dst`` - Discrete sine transform - ``idst`` - Inverse discrete sine transform - ``dstn`` - n-dimensional Discrete sine transform - ``idstn`` - n-dimensional Inverse discrete sine transform - Differential and pseudo-differential operators - ``diff`` - Differentiation and integration of periodic sequences - ``tilbert`` - Tilbert transform: cs_diff(x,h,h) - ``itilbert`` - Inverse Tilbert transform: sc_diff(x,h,h) - ``hilbert`` - Hilbert transform: cs_diff(x,inf,inf) - ``ihilbert`` - Inverse Hilbert transform: sc_diff(x,inf,inf) - ``cs_diff`` - cosh/sinh pseudo-derivative of periodic sequences - ``sc_diff`` - sinh/cosh pseudo-derivative of periodic sequences - ``ss_diff`` - sinh/sinh pseudo-derivative of periodic sequences - ``cc_diff`` - cosh/cosh pseudo-derivative of periodic sequences - ``shift`` - Shift periodic sequences - Helper functions - ``fftshift`` - Shift the zero-frequency component to the center of the spectrum - ``ifftshift`` - The inverse of `fftshift` - ``fftfreq`` - Return the Discrete Fourier Transform sample frequencies - ``rfftfreq`` - DFT sample frequencies (for usage with rfft, irfft) - ``next_fast_len`` - Find the optimal length to zero-pad an FFT for speed - Convolutions (``scipy.fftpack.convolve``) - ``convolve`` - ``convolve_z`` - ``init_convolution_kernel`` - ``destroy_convolve_cache`` Signal Processing (scipy.signal) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use this module with either of the following:: >>> import scipy.signal >>> from scipy import signal Here is some summary: - Convolution - convolve -- N-dimensional convolution. - correlate -- N-dimensional correlation. - fftconvolve -- N-dimensional convolution using the FFT. - convolve2d -- 2-dimensional convolution (more options). - correlate2d -- 2-dimensional correlation (more options). - sepfir2d -- Convolve with a 2-D separable FIR filter. - choose_conv_method -- Chooses faster of FFT and direct convolution methods. - B-splines - bspline -- B-spline basis function of order n. - cubic -- B-spline basis function of order 3. - quadratic -- B-spline basis function of order 2. - gauss_spline -- Gaussian approximation to the B-spline basis function. - cspline1d -- Coefficients for 1-D cubic (3rd order) B-spline. 
- qspline1d -- Coefficients for 1-D quadratic (2nd order) B-spline. - cspline2d -- Coefficients for 2-D cubic (3rd order) B-spline. - qspline2d -- Coefficients for 2-D quadratic (2nd order) B-spline. - cspline1d_eval -- Evaluate a cubic spline at the given points. - qspline1d_eval -- Evaluate a quadratic spline at the given points. - spline_filter -- Smoothing spline (cubic) filtering of a rank-2 array. - Filtering - order_filter -- N-dimensional order filter. - medfilt -- N-dimensional median filter. - medfilt2d -- 2-dimensional median filter (faster). - wiener -- N-dimensional wiener filter. - symiirorder1 -- 2nd-order IIR filter (cascade of first-order systems). - symiirorder2 -- 4th-order IIR filter (cascade of second-order systems). - lfilter -- 1-dimensional FIR and IIR digital linear filtering. - lfiltic -- Construct initial conditions for `lfilter`. - lfilter_zi -- Compute an initial state zi for the lfilter function that corresponds to the steady state of the step response. - filtfilt -- A forward-backward filter. - savgol_filter -- Filter a signal using the Savitzky-Golay filter. - deconvolve -- 1-d deconvolution using lfilter. - sosfilt -- 1-dimensional IIR digital linear filtering using a second-order sections filter representation. - sosfilt_zi -- Compute an initial state zi for the sosfilt function that corresponds to the steady state of the step response. - sosfiltfilt -- A forward-backward filter for second-order sections. - hilbert -- Compute 1-D analytic signal, using the Hilbert transform. - hilbert2 -- Compute 2-D analytic signal, using the Hilbert transform. - decimate -- Downsample a signal. - detrend -- Remove linear and/or constant trends from data. - resample -- Resample using Fourier method. - resample_poly -- Resample using polyphase filtering method. - upfirdn -- Upsample, apply FIR filter, downsample. - Filter design - bilinear -- Digital filter from an analog filter using the bilinear transform. - findfreqs -- Find array of frequencies for computing filter response. - firls -- FIR filter design using least-squares error minimization. - firwin -- Windowed FIR filter design, with frequency response defined as pass and stop bands. - firwin2 -- Windowed FIR filter design, with arbitrary frequency response. - freqs -- Analog filter frequency response from TF coefficients. - freqs_zpk -- Analog filter frequency response from ZPK coefficients. - freqz -- Digital filter frequency response from TF coefficients. - freqz_zpk -- Digital filter frequency response from ZPK coefficients. - sosfreqz -- Digital filter frequency response for SOS format filter. - group_delay -- Digital filter group delay. - iirdesign -- IIR filter design given bands and gains. - iirfilter -- IIR filter design given order and critical frequencies. - kaiser_atten -- Compute the attenuation of a Kaiser FIR filter, given the number of taps and the transition width at discontinuities in the frequency response. - kaiser_beta -- Compute the Kaiser parameter beta, given the desired FIR filter attenuation. - kaiserord -- Design a Kaiser window to limit ripple and width of transition region. - minimum_phase -- Convert a linear phase FIR filter to minimum phase. - savgol_coeffs -- Compute the FIR filter coefficients for a Savitzky-Golay filter. - remez -- Optimal FIR filter design. - unique_roots -- Unique roots and their multiplicities. - residue -- Partial fraction expansion of b(s) / a(s). - residuez -- Partial fraction expansion of b(z) / a(z). 
- invres -- Inverse partial fraction expansion for analog filter. - invresz -- Inverse partial fraction expansion for digital filter. - BadCoefficients -- Warning on badly conditioned filter coefficients - Lower-level filter design functions: - abcd_normalize -- Check state-space matrices and ensure they are rank-2. - band_stop_obj -- Band Stop Objective Function for order minimization. - besselap -- Return (z,p,k) for analog prototype of Bessel filter. - buttap -- Return (z,p,k) for analog prototype of Butterworth filter. - cheb1ap -- Return (z,p,k) for type I Chebyshev filter. - cheb2ap -- Return (z,p,k) for type II Chebyshev filter. - cmplx_sort -- Sort roots based on magnitude. - ellipap -- Return (z,p,k) for analog prototype of elliptic filter. - lp2bp -- Transform a lowpass filter prototype to a bandpass filter. - lp2bs -- Transform a lowpass filter prototype to a bandstop filter. - lp2hp -- Transform a lowpass filter prototype to a highpass filter. - lp2lp -- Transform a lowpass filter prototype to a lowpass filter. - normalize -- Normalize polynomial representation of a transfer function. - Matlab-style IIR filter design - butter -- Butterworth - buttord - cheby1 -- Chebyshev Type I - cheb1ord - cheby2 -- Chebyshev Type II - cheb2ord - ellip -- Elliptic (Cauer) - ellipord - bessel -- Bessel (no order selection available -- try butterod) - iirnotch -- Design second-order IIR notch digital filter. - iirpeak -- Design second-order IIR peak (resonant) digital filter. - Continuous-Time Linear Systems - lti -- Continuous-time linear time invariant system base class. - StateSpace -- Linear time invariant system in state space form. - TransferFunction -- Linear time invariant system in transfer function form. - ZerosPolesGain -- Linear time invariant system in zeros, poles, gain form. - lsim -- continuous-time simulation of output to linear system. - lsim2 -- like lsim, but `scipy.integrate.odeint` is used. - impulse -- impulse response of linear, time-invariant (LTI) system. - impulse2 -- like impulse, but `scipy.integrate.odeint` is used. - step -- step response of continous-time LTI system. - step2 -- like step, but `scipy.integrate.odeint` is used. - freqresp -- frequency response of a continuous-time LTI system. - bode -- Bode magnitude and phase data (continuous-time LTI). - Discrete-Time Linear Systems - dlti -- Discrete-time linear time invariant system base class. - StateSpace -- Linear time invariant system in state space form. - TransferFunction -- Linear time invariant system in transfer function form. - ZerosPolesGain -- Linear time invariant system in zeros, poles, gain form. - dlsim -- simulation of output to a discrete-time linear system. - dimpulse -- impulse response of a discrete-time LTI system. - dstep -- step response of a discrete-time LTI system. - dfreqresp -- frequency response of a discrete-time LTI system. - dbode -- Bode magnitude and phase data (discrete-time LTI). - LTI Representations - tf2zpk -- transfer function to zero-pole-gain. - tf2sos -- transfer function to second-order sections. - tf2ss -- transfer function to state-space. - zpk2tf -- zero-pole-gain to transfer function. - zpk2sos -- zero-pole-gain to second-order sections. - zpk2ss -- zero-pole-gain to state-space. - ss2tf -- state-pace to transfer function. - ss2zpk -- state-space to pole-zero-gain. - sos2zpk -- second-order sections to zero-pole-gain. - sos2tf -- second-order sections to transfer function. - cont2discrete -- continuous-time to discrete-time LTI conversion. 
- place_poles -- pole placement. - Waveforms - chirp -- Frequency swept cosine signal, with several freq functions. - gausspulse -- Gaussian modulated sinusoid - max_len_seq -- Maximum length sequence - sawtooth -- Periodic sawtooth - square -- Square wave - sweep_poly -- Frequency swept cosine signal; freq is arbitrary polynomial - unit_impulse -- Discrete unit impulse - Window functions - get_window -- Return a window of a given length and type. - barthann -- Bartlett-Hann window - bartlett -- Bartlett window - blackman -- Blackman window - blackmanharris -- Minimum 4-term Blackman-Harris window - bohman -- Bohman window - boxcar -- Boxcar window - chebwin -- Dolph-Chebyshev window - cosine -- Cosine window - exponential -- Exponential window - flattop -- Flat top window - gaussian -- Gaussian window - general_gaussian -- Generalized Gaussian window - hamming -- Hamming window - hann -- Hann window - hanning -- Hann window - kaiser -- Kaiser window - nuttall -- Nuttall's minimum 4-term Blackman-Harris window - parzen -- Parzen window - slepian -- Slepian window - triang -- Triangular window - tukey -- Tukey window - Wavelets - cascade -- compute scaling function and wavelet from coefficients - daub -- return low-pass - morlet -- Complex Morlet wavelet. - qmf -- return quadrature mirror filter from low-pass - ricker -- return ricker wavelet - cwt -- perform continuous wavelet transform - Peak finding - find_peaks_cwt -- Attempt to find the peaks in the given 1-D array - argrelmin -- Calculate the relative minima of data - argrelmax -- Calculate the relative maxima of data - argrelextrema -- Calculate the relative extrema of data - Spectral Analysis - periodogram -- Compute a (modified) periodogram - welch -- Compute a periodogram using Welch's method - csd -- Compute the cross spectral density, using Welch's method - coherence -- Compute the magnitude squared coherence, using Welch's method - spectrogram -- Compute the spectrogram - lombscargle -- Computes the Lomb-Scargle periodogram - vectorstrength -- Computes the vector strength - stft -- Compute the Short Time Fourier Transform - istft -- Compute the Inverse Short Time Fourier Transform - check_COLA -- Check the COLA constraint for iSTFT reconstruction Linear Algebra (scipy.linalg) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use this module with either of the following:: >>> import scipy.linalg >>> from scipy import linalg Here is some summary: - Basics - inv -- Find the inverse of a square matrix - solve -- Solve a linear system of equations - solve_banded -- Solve a banded linear system - solveh_banded -- Solve a Hermitian or symmetric banded system - solve_circulant -- Solve a circulant system - solve_triangular -- Solve a triangular matrix - solve_toeplitz -- Solve a toeplitz matrix - det -- Find the determinant of a square matrix - norm -- Matrix and vector norm - lstsq -- Solve a linear least-squares problem - pinv -- Pseudo-inverse (Moore-Penrose) using lstsq - pinv2 -- Pseudo-inverse using svd - pinvh -- Pseudo-inverse of hermitian matrix - kron -- Kronecker product of two arrays - tril -- Construct a lower-triangular matrix from a given matrix - triu -- Construct an upper-triangular matrix from a given matrix orthogonal_procrustes -- Solve an orthogonal Procrustes problem matrix_balance -- Balance matrix entries with a similarity transformation subspace_angles -- Compute the subspace angles between two matrices - LinAlgError -- Generic Python-exception-derived object raised by linalg functions. 
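As a quick illustration of the basics listed above, here is a small sketch (the 3x3 system is made up for the example) that solves a linear system and checks the result::

    import numpy as np
    from scipy import linalg

    a = np.array([[3.0, 2.0, 0.0],
                  [1.0, -1.0, 0.0],
                  [0.0, 5.0, 1.0]])
    b = np.array([2.0, 4.0, -1.0])

    x = linalg.solve(a, b)          # solve a @ x == b
    print(x)
    print(np.allclose(a @ x, b))    # verify the solution
    print(linalg.det(a))            # determinant
    print(linalg.inv(a))            # inverse (prefer solve() for solving systems)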
- Eigenvalue Problems - eig -- Find the eigenvalues and eigenvectors of a square matrix - eigvals -- Find just the eigenvalues of a square matrix - eigh -- Find the e-vals and e-vectors of a Hermitian or symmetric matrix - eigvalsh -- Find just the eigenvalues of a Hermitian or symmetric matrix - eig_banded -- Find the eigenvalues and eigenvectors of a banded matrix - eigvals_banded -- Find just the eigenvalues of a banded matrix - eigh_tridiagonal -- Find the eigenvalues and eigenvectors of a tridiagonal matrix - eigvalsh_tridiagonal -- Find just the eigenvalues of a tridiagonal matrix - Decompositions - lu -- LU decomposition of a matrix - lu_factor -- LU decomposition returning unordered matrix and pivots - lu_solve -- Solve Ax=b using back substitution with output of lu_factor - svd -- Singular value decomposition of a matrix - svdvals -- Singular values of a matrix - diagsvd -- Construct matrix of singular values from output of svd - orth -- Construct orthonormal basis for the range of A using svd - cholesky -- Cholesky decomposition of a matrix - cholesky_banded -- Cholesky decomp. of a sym. or Hermitian banded matrix - cho_factor -- Cholesky decomposition for use in solving a linear system - cho_solve -- Solve previously factored linear system - cho_solve_banded -- Solve previously factored banded linear system - polar -- Compute the polar decomposition. - qr -- QR decomposition of a matrix - qr_multiply -- QR decomposition and multiplication by Q - qr_update -- Rank k QR update - qr_delete -- QR downdate on row or column deletion - qr_insert -- QR update on row or column insertion - rq -- RQ decomposition of a matrix - qz -- QZ decomposition of a pair of matrices - ordqz -- QZ decomposition of a pair of matrices with reordering - schur -- Schur decomposition of a matrix - rsf2csf -- Real to complex Schur form - hessenberg -- Hessenberg form of a matrix See also: scipy.linalg.interpolative -- Interpolative matrix decompositions - Matrix Functions - expm -- Matrix exponential - logm -- Matrix logarithm - cosm -- Matrix cosine - sinm -- Matrix sine - tanm -- Matrix tangent - coshm -- Matrix hyperbolic cosine - sinhm -- Matrix hyperbolic sine - tanhm -- Matrix hyperbolic tangent - signm -- Matrix sign - sqrtm -- Matrix square root - funm -- Evaluating an arbitrary matrix function - expm_frechet -- Frechet derivative of the matrix exponential - expm_cond -- Relative condition number of expm in the Frobenius norm - fractional_matrix_power -- Fractional matrix power - Matrix Equation Solvers - solve_sylvester -- Solve the Sylvester matrix equation - solve_continuous_are -- Solve the continuous-time algebraic Riccati equation - solve_discrete_are -- Solve the discrete-time algebraic Riccati equation - solve_continuous_lyapunov -- Solve the continous-time Lyapunov equation - solve_discrete_lyapunov -- Solve the discrete-time Lyapunov equation - Sketches and Random Projections - clarkson_woodruff_transform -- Applies the Clarkson Woodruff Sketch (a.k.a CountMin Sketch) - Special Matrices - block_diag -- Construct a block diagonal matrix from submatrices - circulant -- Circulant matrix - companion -- Companion matrix - dft -- Discrete Fourier transform matrix - hadamard -- Hadamard matrix of order 2**n - hankel -- Hankel matrix - helmert -- Helmert matrix - hilbert -- Hilbert matrix - invhilbert -- Inverse Hilbert matrix - leslie -- Leslie matrix - pascal -- Pascal matrix - invpascal -- Inverse Pascal matrix - toeplitz -- Toeplitz matrix - tri -- Construct a matrix filled with ones at and 
below a given diagonal - Low-level routines - get_blas_funcs - get_lapack_funcs - find_best_blas_type - See also: - scipy.linalg.blas -- Low-level BLAS functions - scipy.linalg.lapack -- Low-level LAPACK functions - scipy.linalg.cython_blas -- Low-level BLAS functions for Cython - scipy.linalg.cython_lapack -- Low-level LAPACK functions for Cython Sparse Eigenvalue Problems with ARPACK ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There are examples in the Scipy documentation, here: https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html And, here is a summary copied from that document: "ARPACK is a Fortran package which provides routines for quickly finding a few eigenvalues/eigenvectors of large sparse matrices. In order to find these solutions, it requires only left-multiplication by the matrix in question. This operation is performed through a reverse-communication interface. The result of this structure is that ARPACK is able to find eigenvalues and eigenvectors of any linear function mapping a vector to a vector. "All of the functionality provided in ARPACK is contained within the two high-level interfaces scipy.sparse.linalg.eigs and scipy.sparse.linalg.eigsh. eigs provides interfaces to find the eigenvalues/vectors of real or complex nonsymmetric square matrices, while eigsh provides interfaces for real-symmetric or complex-hermitian matrices." Compressed Sparse Graph Routines (scipy.sparse.csgraph) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There is an example that implements a search for the shortest path between two words (of equal) length in a word ladder (i.e. changing just one letter in each step) in the Scipy documentation. You can find it here: https://docs.scipy.org/doc/scipy/reference/tutorial/csgraph.html. You can get documentation with the following:: $ pydoc scipy.sparse.csgraph And, in IPython, do something like this:: In [41]: from scipy.sparse import csgraph In [42]: csgraph.connected_components? Here is a summary of the contents: - connected_components -- determine connected components of a graph. - laplacian -- compute the laplacian of a graph. - shortest_path -- compute the shortest path between points on a positive graph. - dijkstra -- use Dijkstra's algorithm for shortest path. - floyd_warshall -- use the Floyd-Warshall algorithm for shortest path. - bellman_ford -- use the Bellman-Ford algorithm for shortest path. - johnson -- use Johnson's algorithm for shortest path. - breadth_first_order -- compute a breadth-first order of nodes. - depth_first_order -- compute a depth-first order of nodes. - breadth_first_tree -- construct the breadth-first tree from a given node. - depth_first_tree -- construct a depth-first tree from a given node. - minimum_spanning_tree -- construct the minimum spanning tree of a graph. - reverse_cuthill_mckee -- compute permutation for reverse Cuthill-McKee ordering. - maximum_bipartite_matching -- compute permutation to make diagonal zero free. - structural_rank -- compute the structural rank of a graph. - construct_dist_matrix -- Construct distance matrix from a predecessor matrix. - csgraph_from_dense -- Construct a CSR-format sparse graph from a dense matrix. - csgraph_from_masked -- Construct a CSR-format graph from a masked array. - csgraph_masked_from_dense -- Construct a CSR-format sparse graph from a dense matrix. - csgraph_to_dense -- Convert a sparse graph representation to a dense representation. - csgraph_to_masked -- Convert a sparse graph representation to a masked array representation. 
- reconstruct_path -- Construct a tree from a graph and a predecessor list. - NegativeCycleError -- Common base class for all non-exit exceptions Note that there are other sparse graph libraries for Python. One is Another Python Graph Library: https://pythonhosted.org/apgl/index.html. Spatial data structures and algorithms (scipy.spatial) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Provides spatial algorithms and data structures. Here is an example, copied from the documentation:: import numpy as np from scipy.spatial import Delaunay import matplotlib.pyplot as plt def test(): points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]]) tri = Delaunay(points) # # We can visualize it: plt.triplot(points[:, 0], points[:, 1], tri.simplices.copy()) plt.plot(points[:, 0], points[:, 1], 'o') # # And add some further decorations: for j, p in enumerate(points): # label the points plt.text(p[0] - 0.03, p[1] + 0.03, j, ha='right') for j, s in enumerate(tri.simplices): p = points[s].mean(axis=0) # label triangles plt.text(p[0], p[1], '#%d' % j, ha='center') plt.xlim(-0.5, 1.5) plt.ylim(-0.5, 1.5) plt.show() # # The structure of the triangulation is encoded in the following way: the # simplices attribute contains the indices of the points in the # points array # that make up the triangle. For instance: i = 1 tri.simplices[i, :] points[tri.simplices[i, :]] return tri, points Here is a summary of the contents of ``scipy.spatial`` (obtained by doing ``$ pydoc scipy.spatial``): - Nearest-neighbor Queries: - KDTree -- class for efficient nearest-neighbor queries - cKDTree -- class for efficient nearest-neighbor queries (faster impl.) - distance -- module containing many different distance measures - Rectangle -- Hyperrectangle class. Represents a Cartesian product of intervals. - Delaunay Triangulation, Convex Hulls, and Voronoi Diagrams: - Delaunay -- compute Delaunay triangulation of input points - ConvexHull -- compute a convex hull for input points - Voronoi -- compute a Voronoi diagram hull from input points - SphericalVoronoi -- compute a Voronoi diagram from input points on the surface of a sphere - HalfspaceIntersection -- compute the intersection points of input halfspaces - Plotting Helpers: - delaunay_plot_2d -- plot 2-D triangulation - convex_hull_plot_2d -- plot 2-D convex hull - voronoi_plot_2d -- plot 2-D voronoi diagram - Simplex representation: The simplices (triangles, tetrahedra, ...) appearing in the Delaunay tesselation (N-dim simplices), convex hull facets, and Voronoi ridges (N-1 dim simplices) are represented in the following scheme:: tess = Delaunay(points) hull = ConvexHull(points) voro = Voronoi(points) # coordinates of the j-th vertex of the i-th simplex tess.points[tess.simplices[i, j], :] # tesselation element hull.points[hull.simplices[i, j], :] # convex hull facet voro.vertices[voro.ridge_vertices[i, j], :] # ridge between Voronoi cells For Delaunay triangulations and convex hulls, the neighborhood structure of the simplices satisfies the condition: ``tess.neighbors[i,j]`` is the neighboring simplex of the i-th simplex, opposite to the j-vertex. It is -1 in case of no neighbor. Convex hull facets also define a hyperplane equation:: (hull.equations[i,:-1] * coord).sum() + hull.equations[i,-1] == 0 Similar hyperplane equations for the Delaunay triangulation correspond to the convex hull facets on the corresponding N+1 dimensional paraboloid. 
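For instance, the following short sketch (using a handful of random 2-D points, purely for illustration) verifies that the vertices of each convex hull facet satisfy that facet's hyperplane equation::

    import numpy as np
    from scipy.spatial import ConvexHull

    points = np.random.rand(30, 2)
    hull = ConvexHull(points)
    for i, facet in enumerate(hull.simplices):
        for coord in points[facet]:
            # normal . coord + offset should be (numerically) zero on the facet
            value = np.dot(hull.equations[i, :-1], coord) + hull.equations[i, -1]
            assert abs(value) < 1e-9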
The Delaunay triangulation objects offer a method for locating the simplex containing a given point, and barycentric coordinate computations. - Functions: - tsearch - distance_matrix - minkowski_distance - minkowski_distance_p - procrustes Statistics (scipy.stats) ~~~~~~~~~~~~~~~~~~~~~~~~~~ This module contains a large number of probability distributions as well as a growing library of statistical functions. Each univariate distribution is an instance of a subclass of ``rv_continuous`` (``rv_discrete`` for discrete distributions): - rv_continuous - rv_discrete - rv_histogram Here is a summary of the items in ``scipy.stats``: - Continuous distributions - alpha -- Alpha - anglit -- Anglit - arcsine -- Arcsine - argus -- Argus - beta -- Beta - betaprime -- Beta Prime - bradford -- Bradford - burr -- Burr (Type III) - burr12 -- Burr (Type XII) - cauchy -- Cauchy - chi -- Chi - chi2 -- Chi-squared - cosine -- Cosine - crystalball -- Crystalball - dgamma -- Double Gamma - dweibull -- Double Weibull - erlang -- Erlang - expon -- Exponential - exponnorm -- Exponentially Modified Normal - exponweib -- Exponentiated Weibull - exponpow -- Exponential Power - f -- F (Snecdor F) - fatiguelife -- Fatigue Life (Birnbaum-Saunders) - fisk -- Fisk - foldcauchy -- Folded Cauchy - foldnorm -- Folded Normal - frechet_r -- Deprecated. Alias for weibull_min - frechet_l -- Deprecated. Alias for weibull_max - genlogistic -- Generalized Logistic - gennorm -- Generalized normal - genpareto -- Generalized Pareto - genexpon -- Generalized Exponential - genextreme -- Generalized Extreme Value - gausshyper -- Gauss Hypergeometric - gamma -- Gamma - gengamma -- Generalized gamma - genhalflogistic -- Generalized Half Logistic - gilbrat -- Gilbrat - gompertz -- Gompertz (Truncated Gumbel) - gumbel_r -- Right Sided Gumbel, Log-Weibull, Fisher-Tippett, Extreme Value Type I - gumbel_l -- Left Sided Gumbel, etc. 
- halfcauchy -- Half Cauchy - halflogistic -- Half Logistic - halfnorm -- Half Normal - halfgennorm -- Generalized Half Normal - hypsecant -- Hyperbolic Secant - invgamma -- Inverse Gamma - invgauss -- Inverse Gaussian - invweibull -- Inverse Weibull - johnsonsb -- Johnson SB - johnsonsu -- Johnson SU - kappa4 -- Kappa 4 parameter - kappa3 -- Kappa 3 parameter - ksone -- Kolmogorov-Smirnov one-sided (no stats) - kstwobign -- Kolmogorov-Smirnov two-sided test for Large N (no stats) - laplace -- Laplace - levy -- Levy - levy_l - levy_stable - logistic -- Logistic - loggamma -- Log-Gamma - loglaplace -- Log-Laplace (Log Double Exponential) - lognorm -- Log-Normal - lomax -- Lomax (Pareto of the second kind) - maxwell -- Maxwell - mielke -- Mielke's Beta-Kappa - nakagami -- Nakagami - ncx2 -- Non-central chi-squared - ncf -- Non-central F - nct -- Non-central Student's T - norm -- Normal (Gaussian) - pareto -- Pareto - pearson3 -- Pearson type III - powerlaw -- Power-function - powerlognorm -- Power log normal - powernorm -- Power normal - rdist -- R-distribution - reciprocal -- Reciprocal - rayleigh -- Rayleigh - rice -- Rice - recipinvgauss -- Reciprocal Inverse Gaussian - semicircular -- Semicircular - skewnorm -- Skew normal - t -- Student's T - trapz -- Trapezoidal - triang -- Triangular - truncexpon -- Truncated Exponential - truncnorm -- Truncated Normal - tukeylambda -- Tukey-Lambda - uniform -- Uniform - vonmises -- Von-Mises (Circular) - vonmises_line -- Von-Mises (Line) - wald -- Wald - weibull_min -- Minimum Weibull (see Frechet) - weibull_max -- Maximum Weibull (see Frechet) - wrapcauchy -- Wrapped Cauchy - Multivariate distributions - multivariate_normal -- Multivariate normal distribution - matrix_normal -- Matrix normal distribution - dirichlet -- Dirichlet - wishart -- Wishart - invwishart -- Inverse Wishart - multinomial -- Multinomial distribution - special_ortho_group -- SO(N) group - ortho_group -- O(N) group - unitary_group -- U(N) gropu - random_correlation -- random correlation matrices - Discrete distributions - bernoulli -- Bernoulli - binom -- Binomial - boltzmann -- Boltzmann (Truncated Discrete Exponential) - dlaplace -- Discrete Laplacian - geom -- Geometric - hypergeom -- Hypergeometric - logser -- Logarithmic (Log-Series, Series) - nbinom -- Negative Binomial - planck -- Planck (Discrete Exponential) - poisson -- Poisson - randint -- Discrete Uniform - skellam -- Skellam - zipf -- Zipf - Statistical functions -- Several of these functions have a similar version in scipy.stats.mstats which work for masked arrays. - describe -- Descriptive statistics - gmean -- Geometric mean - hmean -- Harmonic mean - kurtosis -- Fisher or Pearson kurtosis - kurtosistest -- Test whether a dataset has normal kurtosis. - mode -- Modal value - moment -- Central moment - normaltest -- - skew -- Skewness - skewtest -- - kstat -- - kstatvar -- - tmean -- Truncated arithmetic mean - tvar -- Truncated variance - tmin -- - tmax -- - tstd -- - tsem -- - variation -- Coefficient of variation - find_repeats - trim_mean - cumfreq - itemfreq - percentileofscore - scoreatpercentile - relfreq - binned_statistic -- Compute a binned statistic for a set of data. - binned_statistic_2d -- Compute a 2-D binned statistic for a set of data. - binned_statistic_dd -- Compute a d-D binned statistic for a set of data. 
- obrientransform - bayes_mvs - mvsdist - sem - zmap - zscore - iqr - sigmaclip - trimboth - trim1 - f_oneway - pearsonr - spearmanr - pointbiserialr - kendalltau - weightedtau - linregress - theilslopes - ttest_1samp - ttest_ind - ttest_ind_from_stats - ttest_rel - kstest - chisquare - power_divergence - ks_2samp - mannwhitneyu - tiecorrect - rankdata - ranksums - wilcoxon - kruskal - friedmanchisquare - combine_pvalues - jarque_bera - ansari - bartlett - levene - shapiro - anderson - anderson_ksamp - binom_test - fligner - median_test - mood - boxcox - boxcox_normmax - boxcox_llf - entropy - wasserstein_distance - energy_distance - Circular statistical functions - circmean - circvar - circstd - Contingency table functions - chi2_contingency - contingency expected_freq - contingency margins - fisher_exact - Plot-tests - ppcc_max - ppcc_plot - probplot - boxcox_normplot - Masked statistics functions -- Module ``scipy.stats.mstats`` contains statistical functions for masked arrays. For more information in IPython, do:: In [1]: from scipy.stats import mstats In [2]: mstats? Or, from the command line do ``$ pydoc scipy.stats.mstats``. - Univariate and multivariate kernel density estimation (``scipy.stats.kde``) - gaussian_kde -- Representation of a kernel-density estimate using Gaussian kernels. Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. `gaussian_kde` works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed. For many more stat related functions install the software R and the interface package `rpy``. Multidimensional image processing (``scipy.ndimage``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The module ``scipy.ndimage`` contains various functions for multi-dimensional image processing. For information on these functions, do (for example, in IPython):: In [6]: from scipy import ndimage In [7]: ndimage? In [8]: ndimage.convolve? Or, from the command line, do: ``$ pydoc scipy.ndimage.convolve``. 
Here is an example -- It computes the multi-dimensional convolution of an Numpy ``ndarray``:: import numpy as np from scipy import ndimage def test(): a = np.array([[1, 2, 0, 0], [5, 3, 0, 4], [0, 0, 0, 7], [9, 3, 0, 0]]) k = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0]]) result = ndimage.convolve(a, k, mode='constant', cval=0.0) return result Here is a summary of the contents of ``scipy.ndimage``: - Filters - convolve -- Multi-dimensional convolution - convolve1d -- 1-D convolution along the given axis - correlate -- Multi-dimensional correlation - correlate1d -- 1-D correlation along the given axis - gaussian_filter - - gaussian_filter1d - - gaussian_gradient_magnitude - - gaussian_laplace - - generic_filter -- Multi-dimensional filter using a given function - generic_filter1d -- 1-D generic filter along the given axis - generic_gradient_magnitude - generic_laplace - laplace -- n-D Laplace filter based on approximate second derivatives - maximum_filter - maximum_filter1d - median_filter -- Calculates a multi-dimensional median filter - minimum_filter - minimum_filter1d - percentile_filter -- Calculates a multi-dimensional percentile filter - prewitt - rank_filter -- Calculates a multi-dimensional rank filter - sobel - uniform_filter -- Multi-dimensional uniform filter - uniform_filter1d -- 1-D uniform filter along the given axis - Fourier filters - fourier_ellipsoid - fourier_gaussian - fourier_shift - fourier_uniform - Interpolation - affine_transform -- Apply an affine transformation - geometric_transform -- Apply an arbritrary geometric transform - map_coordinates -- Map input array to new coordinates by interpolation - rotate -- Rotate an array - shift -- Shift an array - spline_filter - spline_filter1d - zoom -- Zoom an array - Measurements - center_of_mass -- The center of mass of the values of an array at labels - extrema -- Min's and max's of an array at labels, with their positions - find_objects -- Find objects in a labeled array - histogram -- Histogram of the values of an array, optionally at labels - label -- Label features in an array - labeled_comprehension - maximum - maximum_position - mean -- Mean of the values of an array at labels - median - minimum - minimum_position - standard_deviation -- Standard deviation of an n-D image array - sum -- Sum of the values of the array - variance -- Variance of the values of an n-D image array - watershed_ift - Morphology - binary_closing - binary_dilation - binary_erosion - binary_fill_holes - binary_hit_or_miss - binary_opening - binary_propagation - black_tophat - distance_transform_bf - distance_transform_cdt - distance_transform_edt - generate_binary_structure - grey_closing - grey_dilation - grey_erosion - grey_opening - iterate_structure - morphological_gradient - morphological_laplace - white_tophat - Utility - imread -- Load an image from a file File IO (scipy.io) ~~~~~~~~~~~~~~~~~~~~ Scipy provides routines to read/write a number of special file formats. 
Here are some of them: - MATLAB® files: - loadmat -- Read a MATLAB style mat file (version 4 through 7.1) - savemat -- Write a MATLAB style mat file (version 4 through 7.1) - whosmat -- List contents of a MATLAB style mat file (version 4 through 7.1) - IDL® files: - readsav -- Read an IDL 'save' file - Matrix Market files: - mminfo -- Query matrix info from Matrix Market formatted file - mmread -- Read matrix from Matrix Market formatted file - mmwrite -- Write matrix to Matrix Market formatted file - Unformatted Fortran files: - FortranFile -- A file object for unformatted sequential Fortran files - Netcdf: - netcdf_file -- A file object for NetCDF data - netcdf_variable -- A data object for the netcdf module - Harwell-Boeing files: - hb_read -- read H-B file - hb_write -- write H-B file - Wav sound files (`scipy.io.wavfile`): - read -- Return the sample rate (in samples/sec) and data from a WAV file. - write -- Write a numpy array as a WAV file. - WavFileWarning -- Base class for warnings generated by user code. - Arff files (`scipy.io.arff`): - loadarff -- Read an arff file. - MetaData -- Small container to keep useful information on a ARFF dataset. - ArffError -- Base class for I/O related errors. - ParseArffError -- Base class for I/O related errors. Pandas -------- Pandas vs. Numpy -- Pandas raises Numpy data structures to a higher level. In particular, see the ``DataFrame`` object. For documentation on Pandas, see: http://pandas.pydata.org/pandas-docs/stable/. There are tutorials, get-started guides, cookbook docs, and more. `10 Minutes to pandas `_ seems especially helpful, although it does contain an lot more than 10 minutes worth of material. It gives basic instructions on how to use Pandas data types. And, be sure to look at the various `Pandas tutorials `_. There are also cookbooks full of code snippets: - http://pandas.pydata.org/pandas-docs/stable/cookbook.html - http://pandas.pydata.org/pandas-docs/stable/tutorials.html#pandas-cookbook Perhaps it's advisable to view Pandas as just as much about learning techniques for (1) cleaning up your data; (2) exploring and finding significant aspects of your data, and (3) viewing and displaying your data, as it is about performing calculations and analysis on it. Panda contains and provides such a rich set of techniques for working with your data that you should expect to take a reasonable amount of time learning to do the tasks you need, rather than just quickly learn some small set of functions. Create Pandas data structures ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here is an example that creates several of the Pandas data structures that are used in the "10 Minutes to pandas" document referenced above:: def make_sample_dataframe(): """Make sample dates and DataFrame. 
Returns (dates, df).""" dates = pd.date_range('20130101', periods=6) df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD')) return dates, df And, here is an example of the use of the above function:: In [117]: import utils01 In [118]: dates, df = utils01.make_sample_dataframe() In [119]: In [119]: dates Out[119]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D') In [120]: In [120]: df Out[120]: A B C D 2013-01-01 0.521515 1.006002 -1.408913 -0.218981 2013-01-02 -0.517541 -0.190499 0.397701 0.895858 2013-01-03 0.068253 0.499286 -1.098401 -1.323183 2013-01-04 -0.086779 0.025269 0.459892 0.588754 2013-01-05 1.384825 -1.141312 0.097294 0.169665 2013-01-06 -0.391738 -0.072600 0.196514 0.799174 View Pandas data structures ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ View the first and last rows of a ``DataFrame``:: In [34]: df.head(n=2) Out[34]: A B C D 2013-01-01 -0.557541 1.016474 0.933149 -0.524661 2013-01-02 1.682318 -1.605635 -0.324727 2.057636 In [35]: In [35]: df.tail(n=3) Out[35]: A B C D 2013-01-04 0.696414 0.538999 1.131596 -0.960681 2013-01-05 -0.175765 -0.494210 1.111779 -0.670209 2013-01-06 -1.615098 0.018027 0.584815 -1.508152 Get the shape, column (labels), and actual data from a ``DataFrame``:: In [38]: df.shape Out[38]: (6, 4) In [39]: df.columns Out[39]: Index(['A', 'B', 'C', 'D'], dtype='object') In [40]: df.values Out[40]: array([[-0.55754086, 1.01647419, 0.93314867, -0.52466119], [ 1.68231758, -1.60563477, -0.32472655, 2.05763649], [-0.45481149, -0.09087637, -1.1383327 , -0.7950994 ], [ 0.69641379, 0.53899898, 1.13159619, -0.96068123], [-0.17576451, -0.49421043, 1.11177912, -0.67020918], [-1.61509837, 0.01802738, 0.58481469, -1.50815216]]) In [41]: type(df.values) Out[41]: numpy.ndarray Note that ``df.values`` returns an ``ndarray``. Access the contents of a ``DataFrame`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Access a row or range of rows -- Use ``.iloc`` with a single index or a slice. Examples:: In [72]: df.iloc[1] Out[72]: A 0.721339 B 0.733763 C -1.153457 D -1.345582 Name: 2013-01-02 00:00:00, dtype: float64 In [73]: df.iloc[1:2] Out[73]: A B C D 2013-01-02 0.721339 0.733763 -1.153457 -1.345582 In [74]: df.iloc[1:4] Out[74]: A B C D 2013-01-02 0.721339 0.733763 -1.153457 -1.345582 2013-01-03 2.047318 0.406103 -1.893892 0.065913 2013-01-04 0.737643 -1.539155 0.410927 0.038682 Access a row or range of rows -- Use ``.loc`` with index labels. Examples:: In [64]: df.loc[dates[1]] Out[64]: A 0.721339 B 0.733763 C -1.153457 D -1.345582 Name: 2013-01-02 00:00:00, dtype: float64 In [65]: df.loc[dates[1]:dates[2]] Out[65]: A B C D 2013-01-02 0.721339 0.733763 -1.153457 -1.345582 2013-01-03 2.047318 0.406103 -1.893892 0.065913 In [66]: df.loc[dates[1]:dates[1]] Out[66]: A B C D 2013-01-02 0.721339 0.733763 -1.153457 -1.345582 In [67]: df.loc['2013-01-01'] Out[67]: A 1.373992 B -0.080698 C -0.018425 D -0.424205 Name: 2013-01-01 00:00:00, dtype: float64 In [68]: df.loc['2013-01-01':'2013-01-03'] Out[68]: A B C D 2013-01-01 1.373992 -0.080698 -0.018425 -0.424205 2013-01-02 0.721339 0.733763 -1.153457 -1.345582 2013-01-03 2.047318 0.406103 -1.893892 0.065913 Notes: - ``dates`` was used to create the index for ``df``:: def make_sample_dataframe1(): """Make sample dates and DataFrame. 
Returns (dates, df).""" dates = pd.date_range('20130101', periods=6) df = pd.DataFrame( np.random.randn(6, 4), index=dates, columns=list('ABCD')) return dates, df Access the rows where the content of a item (column) in that row satisfies a condition or test:: In [10]: df.loc[df.B > 0].head() Out[10]: Unnamed: 0 A B C D 2 2013-01-03 0.986316 1.870495 -1.598345 -2.551315 5 2013-01-06 1.385534 1.328005 1.741578 -0.409209 7 2013-01-08 -0.820344 0.318531 0.278434 -0.898119 9 2013-01-10 -2.342766 0.048417 -0.352930 -0.134832 20 2013-01-21 -0.567319 1.784550 -0.114723 0.315661 Or:: In [9]: df.loc[df.B.apply(lambda x: x > 0)].head() Out[9]: Unnamed: 0 A B C D 2 2013-01-03 0.986316 1.870495 -1.598345 -2.551315 5 2013-01-06 1.385534 1.328005 1.741578 -0.409209 7 2013-01-08 -0.820344 0.318531 0.278434 -0.898119 9 2013-01-10 -2.342766 0.048417 -0.352930 -0.134832 20 2013-01-21 -0.567319 1.784550 -0.114723 0.315661 Notes: - The use of ``.apply()`` along with ``lambda`` (or a named Python function) enables us to select rows based on an arbitrarily complex condition. - Also, consider using ``functools.partial()``. The following selects rows where the value in column B is in the range -0.1 to 0.1:: In [25]: import functools In [26]: f = functools.partial(lambda x, y, z: z > x and z < y, -0.1, 0.1) In [27]: In [27]: df.loc[df.B.apply(f)].head() Out[27]: Unnamed: 0 A B C D 9 2013-01-10 -2.342766 0.048417 -0.352930 -0.134832 27 2013-01-28 -0.673330 0.075427 -0.477715 -0.475463 33 2013-02-03 -0.776301 0.015220 0.518606 -0.286090 38 2013-02-08 0.894722 0.005027 -0.763636 -0.150279 44 2013-02-14 -0.403519 -0.059570 0.929560 -1.065283 Access a column or several columns -- Use the Python indexing operator (``[]``), with a column label or iterable of column labels. Or, for a single column, use dot notation. Examples:: In [98]: df['B'] Out[98]: 2013-01-01 -0.080698 2013-01-02 0.733763 2013-01-03 0.406103 2013-01-04 -1.539155 2013-01-05 -0.963585 2013-01-06 0.934215 Freq: D, Name: B, dtype: float64 In [99]: df[['B', 'D']] Out[99]: B D 2013-01-01 -0.080698 -0.424205 2013-01-02 0.733763 -1.345582 2013-01-03 0.406103 0.065913 2013-01-04 -1.539155 0.038682 2013-01-05 -0.963585 -0.449162 2013-01-06 0.934215 1.473294 In [100]: In [100]: df.C Out[100]: 2013-01-01 -0.018425 2013-01-02 -1.153457 2013-01-03 -1.893892 2013-01-04 0.410927 2013-01-05 -1.627970 2013-01-06 0.240306 Freq: D, Name: C, dtype: float64 Access individual elements by index relative to zero -- Use ``.iloc[r, c]``:: In [42]: df.iloc[0] Out[42]: A 1.373992 B -0.080698 C -0.018425 D -0.424205 Name: 2013-01-01 00:00:00, dtype: float64 In [43]: df.iloc[0, 1] Out[43]: -0.08069801201343964 In [44]: df.iloc[0, 1:3] Out[44]: B -0.080698 C -0.018425 Name: 2013-01-01 00:00:00, dtype: float64 In [45]: df.iloc[0:4, 1] Out[45]: 2013-01-01 -0.080698 2013-01-02 0.733763 2013-01-03 0.406103 2013-01-04 -1.539155 Freq: D, Name: B, dtype: float64 In [46]: df.iloc[0:4, 1:-1] Out[46]: B C 2013-01-01 -0.080698 -0.018425 2013-01-02 0.733763 -1.153457 2013-01-03 0.406103 -1.893892 2013-01-04 -1.539155 0.410927 In [47]: df.iloc[0:4, 1:] Out[47]: B C D 2013-01-01 -0.080698 -0.018425 -0.424205 2013-01-02 0.733763 -1.153457 -1.345582 2013-01-03 0.406103 -1.893892 0.065913 2013-01-04 -1.539155 0.410927 0.038682 Iterate over a ``DataFrame`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There are several ways to do this. Here are some examples:: import utils01 def test(): dates, df = utils01.make_sample_dataframe1() # iterate over column labels. 
print("*\n* column labels --\n*") print([x for x in df]) # iterate over items print("*\n* items --\n*") print([x for x in df.head(n=2).iteritems()]) # iterate over rows print("*\n* rows --\n*") print([x for x in df.head(n=2).iterrows()]) # iterate over rows as named tuples. print("*\n* named tuples --\n*") print([x for x in df.head(n=2).itertuples()]) # iterate over rows as named tuples returning one column from each tuple. print("*\n* column \"B\" from named tuple --\n*") print([x.B for x in df.head(n=2).itertuples()]) Here is the output from the above function:: In [67]: test() * * column labels -- * ['A', 'B', 'C', 'D'] * * items -- * [('A', 2013-01-01 -2.443710 2013-01-02 -1.003475 Freq: D, Name: A, dtype: float64), ('B', 2013-01-01 -0.320540 2013-01-02 -1.020769 Freq: D, Name: B, dtype: float64), ('C', 2013-01-01 0.010302 2013-01-02 0.115615 Freq: D, Name: C, dtype: float64), ('D', 2013-01-01 0.935831 2013-01-02 -0.514601 Freq: D, Name: D, dtype: float64)] * * rows -- * [(Timestamp('2013-01-01 00:00:00', freq='D'), A -2.443710 B -0.320540 C 0.010302 D 0.935831 Name: 2013-01-01 00:00:00, dtype: float64), (Timestamp('2013-01-02 00:00:00', freq='D'), A -1.003475 B -1.020769 C 0.115615 D -0.514601 Name: 2013-01-02 00:00:00, dtype: float64)] * * named tuples -- * [Pandas(Index=Timestamp('2013-01-01 00:00:00', freq='D'), A=-2.4437103289150857, B=-0.32054023603910436, C=0.01030189942471081, D=0.9358311337233644), Pandas(Index=Timestamp('2013-01-02 00:00:00', freq='D'), A=-1.0034752077816913, B=-1.0207687970125863, C=0.11561494820245698, D=-0.5146012044818192)] * * column "B" from named tuple -- * [-0.32054023603910436, -1.0207687970125863] While iterating over a ``pandas.DataFrame`` produces the column label, which can be used to access the columns of the ``DataFrame``. Example:: In [92]: for column in df: ...: print("{}[0]: {:7.3f}".format(column, getattr(df, column)[0])) ...: A[0]: -0.368 B[0]: 1.122 C[0]: -0.890 D[0]: 0.076 An easier (and cleaner?) way to access a column would be: ``df[column]``. In contrast, iterating over a ``pandas.Series``, produces the items in the ``Series``. Example (note that ``dates`` is a ``Series``):: In [112]: for date in dates: ...: print('date: {}/{}/{}'.format(date.month, date.day, date.year)) ...: date: 1/1/2013 date: 1/2/2013 date: 1/3/2013 date: 1/4/2013 date: 1/5/2013 date: 1/6/2013 Here is a simple bit of code that iterates over each of the items (cells) in a Pandas DataFrame. This function prints out elements column by column:: def show_df(df): for idx1, label in enumerate(df): print('{}. Column: {}'.format(idx1, label)) for idx2, item in enumerate(df[label]): print(' {}.{}. {:+6.4f}'.format(idx1, idx2, item)) And, here is what the above (function ``show_df``) might display:: In [78]: show_df(df.head(n=2)) 0. Column: A 0.0. +0.9590 0.1. -3.6568 1. Column: B 1.0. +1.1409 1.1. -0.4395 2. Column: C 2.0. +1.2634 2.1. -0.3644 3. Column: D 3.0. +0.0824 3.1. +1.1789 And, here is a function that prints out elements row by row (i.e. one row after another):: def show_df_by_rows(df): columns = df.columns for row, index in enumerate(df.index): print('{}. Row: {}'.format(row, index)) for idx, item in enumerate(df.loc[index]): print(' {}.{}. {:+6.4f}'.format(idx, columns[idx], item)) Here is a sample printout from the above function:: 0. Row: 2013-01-01 00:00:00 0.A. +0.9590 1.B. +1.1409 2.C. +1.2634 3.D. +0.0824 1. Row: 2013-01-02 00:00:00 0.A. -3.6568 1.B. -0.4395 2.C. -0.3644 3.D. 
+1.1789

You can do something analogous with list comprehensions or generator expressions. For example, consider this code::

    def show_dataframe(df):
        generator = ((index, b.items())
                     for (index, b) in ((index, df.loc[index]) for index in df.index))
        for date, data in generator:
            print('date: {}'.format(date))
            for col, item in data:
                print(' col: {} item: {:12.4f}'.format(col, item))

When we run the above, calling ``show_dataframe``, we might see::

    In [90]: show_dataframe(df.tail(2))
    date: 2013-01-05 00:00:00
     col: A item: 0.2175
     col: B item: 0.1573
     col: C item: -0.2240
     col: D item: 0.2395
    date: 2013-01-06 00:00:00
     col: A item: 0.1440
     col: B item: -0.9796
     col: C item: -2.2432
     col: D item: -0.7182

Notes:

- In the above example, we produced generator expressions. Note the parentheses around the outer expression and inner expression used to produce ``generator``. If we had used square brackets instead of parentheses, that expression would have produced lists.
- The function ``show_dataframe`` contains a nested loop whose outer loop iterates over the outer generator expression and, within that outer loop, an inner loop iterates over each nested inner generator expression.

Grouping items in a DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can group items in a DataFrame according to some criteria, then process only items in that group. For example::

    In [363]: dates, df = utils01.make_sample_dataframe1()
    In [364]: df
    Out[364]:
                       A         B         C         D
    2013-01-01  0.286823 -0.490076  1.876985  0.900970
    2013-01-02  0.338896 -0.111205 -1.516295  1.344511
    2013-01-03 -1.045215 -0.155277 -0.238831  0.763586
    2013-01-04  0.911923  0.383383 -1.838096 -0.233212
    2013-01-05 -0.424031 -0.396694 -1.260573  1.912463
    2013-01-06  1.198149 -0.729439  1.578052 -1.139293
    In [365]: f1 = lambda x: 0 if x < 0.0 else 1
    In [366]: df["E"] = [f1(x) for x in df.A]
    In [367]: df
    Out[367]:
                       A         B         C         D  E
    2013-01-01  0.286823 -0.490076  1.876985  0.900970  1
    2013-01-02  0.338896 -0.111205 -1.516295  1.344511  1
    2013-01-03 -1.045215 -0.155277 -0.238831  0.763586  0
    2013-01-04  0.911923  0.383383 -1.838096 -0.233212  1
    2013-01-05 -0.424031 -0.396694 -1.260573  1.912463  0
    2013-01-06  1.198149 -0.729439  1.578052 -1.139293  1
    In [368]: groups = df.groupby("E")
    In [369]:
    In [369]: len(groups)
    Out[369]: 2
    In [371]: groups.get_group(0)
    Out[371]:
                       A         B         C         D  E
    2013-01-03 -1.045215 -0.155277 -0.238831  0.763586  0
    2013-01-05 -0.424031 -0.396694 -1.260573  1.912463  0
    In [372]:
    In [372]: groups.get_group(1)
    Out[372]:
                       A         B         C         D  E
    2013-01-01  0.286823 -0.490076  1.876985  0.900970  1
    2013-01-02  0.338896 -0.111205 -1.516295  1.344511  1
    2013-01-04  0.911923  0.383383 -1.838096 -0.233212  1
    2013-01-06  1.198149 -0.729439  1.578052 -1.139293  1

Notes:

- We use the function/lambda ``f1`` to distinguish between values that are less than zero and those that are greater than or equal to zero.
- We create a list of keys depending on the values in column "A".
- We create a new column in our DataFrame containing these keys.
- We group the DataFrame depending on the values in this new column.
- Next we can determine the number of groups (using ``len(groups)``).
- And we can access each group individually (with ``groups.get_group(n)``).
- Notice that all the items in the first group have negative values in column "A", and all the items in the second group have positive values in column "A".
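A ``GroupBy`` object can also compute aggregates for each group directly, without retrieving the groups one at a time. Here is a minimal, hedged sketch; it assumes the ``df`` (with its added column "E") built just above, and the particular aggregations are chosen only for illustration::

    groups = df.groupby("E")
    # Mean of each numeric column, computed separately for each group.
    group_means = groups.mean()
    # Number of rows in each group.
    group_sizes = groups.size()
    # Different aggregations for different columns.
    summary = groups.agg({'A': 'sum', 'B': 'mean'})
    print(group_means)
    print(group_sizes)
    print(summary)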
An alternative way to do the above task would be to pass a *function* to the ``.groupby`` method. That function could assign or select rows in arbitrarily complex ways. For example, the following function could assign items to two groups depending on whether the value in column "A" is negative or positive::

    In [33]: def f1(index):
        ...:     return 1 if df.loc[index].A < 0.0 else 0
        ...:
        ...:
    In [34]:
    In [34]: a = df.groupby(f1)
    In [35]:
    In [35]: len(a)
    Out[35]: 2
    In [36]:
    In [36]: a.get_group(0)
    Out[36]:
                       A         B         C         D  E
    2013-01-01  0.823745  1.259863  0.099038  2.401296  0
    2013-01-03  1.067624  1.106958  1.616902  0.939021  0
    2013-01-04  1.152899  0.190998 -0.062540 -1.786131  0
    2013-01-06  0.680271  1.307369 -0.024296 -0.973855  0
    In [37]:
    In [37]: a.get_group(1)
    Out[37]:
                       A         B         C         D  E
    2013-01-02 -0.358235 -1.920455 -0.553173  0.580201  1
    2013-01-05 -0.226727  0.180529  0.900700 -1.835082  1

Applying functions to a DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can do this in a variety of ways:

- Element-wise -- Use ``.map`` for ``Series`` and ``.applymap`` for ``DataFrame``::

      In [171]: dates.map(lambda x: x.day)
      Out[171]: Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
      In [172]: df.applymap(lambda x: 0.0 if x < 0.0 else x * 10.0)
      Out[172]:
                         A          B          C         D
      2013-01-01  0.000000  11.222224   0.000000  0.764820
      2013-01-02  8.165304   0.000000   8.425176  0.000000
      2013-01-03  0.000000   7.066568  10.162480  0.000000
      2013-01-04  7.097722   0.000000  10.544352  2.593139
      2013-01-05  0.000000   0.000000  10.031058  6.354610
      2013-01-06  5.629199   1.180783   0.000000  0.000000

- Row-wise and column-wise -- Use one of:

  - ``df.apply(fn)`` -- Apply function to each column.
  - ``df.apply(fn, axis=1)`` -- Apply function to each row.

- For functions that take and return a ``DataFrame`` or that take and return a ``Series``, use ``.pipe``. Example::

      In [197]: fn = lambda x: np.abs(x)
      In [198]: df.pipe(fn)
      Out[198]:
                         A         B         C         D
      2013-01-01  0.368409  1.122222  0.889764  0.076482
      2013-01-02  0.816530  0.963447  0.842518  1.371106
      2013-01-03  0.164827  0.706657  1.016248  0.474849
      2013-01-04  0.709772  1.695648  1.054435  0.259314
      2013-01-05  0.057673  0.713738  1.003106  0.635461
      2013-01-06  0.562920  0.118078  1.904701  0.149196

And, remember that there may be use cases where it is useful to create a "vectorized" function with ``numpy.vectorize``.

Sorting a ``DataFrame`` or a ``Series``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can sort by index, value, etc. See: http://pandas.pydata.org/pandas-docs/stable/basics.html#sorting.
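The linked documentation has the details; as a quick, hedged sketch (assuming the ``df`` built earlier in this section), the most commonly used entry points are ``sort_values`` and ``sort_index``::

    # Sort rows by the values in column "A".
    df.sort_values(by='A')
    # Same, but descending.
    df.sort_values(by='A', ascending=False)
    # Sort rows by the (date) index.
    df.sort_index(ascending=False)
    # Sort a single column (a Series) by its values.
    df.B.sort_values()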
Statistical analysis
~~~~~~~~~~~~~~~~~~~~~~

You can do preliminary and rudimentary statistical analysis. See: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics. For more complex work, consider using the Scipy tools. Examples::

    In [65]: df.describe()
    Out[65]:
                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean   0.255717 -0.067143  0.211290 -0.127586
    std    1.102925  0.651381  0.663725  0.691202
    min   -0.746677 -1.277578 -0.445694 -1.101834
    25%   -0.415984 -0.110226 -0.142937 -0.473979
    50%   -0.111748  0.004162 -0.060588 -0.210746
    75%    0.545268  0.374949  0.470344  0.363150
    max    2.257601  0.516208  1.357676  0.765088
    In [66]:
    In [66]: sp.mean(df.A)
    Out[66]: 0.2557174574376679
    In [67]:
    In [67]: sp.std(df.A, ddof=1)
    Out[67]: 1.102925321931004

Visualization and graphing
============================

``Matplotlib``
----------------

See: http://matplotlib.org/
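The Matplotlib documentation is extensive; as a minimal, hedged sketch of the kind of thing it does (and of how nicely it cooperates with Pandas), the following plots the sample ``DataFrame`` used earlier in this document and writes the figure to a file. This is not taken from the Matplotlib docs, and the output file name is made up for illustration::

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    # Pandas delegates the actual drawing to Matplotlib.
    df.plot()
    plt.title('sample data')
    plt.savefig('sample_plot.png')    # or plt.show() for an interactive window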
Bokeh
-------

See: https://bokeh.pydata.org/en/latest/

Here are Bokeh examples taken from the documentation::

    #!/usr/bin/env python

    from bokeh.plotting import figure, output_file, show

    def test01():
        # prepare some data
        x = [1, 2, 3, 4, 5]
        y = [6, 7, 2, 4, 5]
        # output to static HTML file
        output_file("lines.html")
        # create a new plot with a title and axis labels
        p = figure(title="simple line example", x_axis_label='x', y_axis_label='y')
        # add a line renderer with legend and line thickness
        p.line(x, y, legend="Temp.", line_width=2)
        # show the results
        show(p)

    def test02():
        # prepare some data
        x = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
        y0 = [i**2 for i in x]
        y1 = [10**i for i in x]
        y2 = [10**(i**2) for i in x]
        # output to static HTML file
        output_file("log_lines.html")
        # create a new plot
        p = figure(
            tools="pan,box_zoom,reset,save",
            y_axis_type="log", y_range=[0.001, 10**11],
            title="log axis example",
            x_axis_label='sections', y_axis_label='particles'
        )
        # add some renderers
        p.line(x, x, legend="y=x")
        p.circle(x, x, legend="y=x", fill_color="white", size=8)
        p.line(x, y0, legend="y=x^2", line_width=3)
        p.line(x, y1, legend="y=10^x", line_color="red")
        p.circle(
            x, y1, legend="y=10^x", fill_color="red", line_color="red", size=6)
        p.line(x, y2, legend="y=10^x^2", line_color="orange", line_dash="4 4")
        # show the results
        #show(p, browser="firefox")
        show(p)

    def main():
        test01()
        test02()

    if __name__ == '__main__':
        main()

There are more examples in the Bokeh "Quickstart" document: https://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart

Altair
--------

See: https://pypi.python.org/pypi/altair

Note that ``Altair`` is not in the ``Anaconda`` distribution, but is easy to install with ``pip``.

Optimization, parallel processing, access to C/C++, etc.
==========================================================

Numba
-------

See: http://numba.pydata.org/numba-doc/dev/index.html.

And, here is an interesting article related to Numba: https://www.anaconda.com/blog/developer-blog/parallel-python-with-numba-and-parallelaccelerator/.

From the Numba user manual::

    Numba is a compiler for Python array and numerical functions that gives
    you the power to speed up your applications with high performance
    functions written directly in Python.

    Numba generates optimized machine code from pure Python code using the
    LLVM compiler infrastructure. With a few simple annotations,
    array-oriented and math-heavy Python code can be just-in-time optimized
    to performance similar to C, C++ and Fortran, without having to switch
    languages or Python interpreters.

    Numba’s main features are:

    * on-the-fly code generation (at import time or runtime, at the user’s
      preference)
    * native code generation for the CPU (default) and GPU hardware
    * integration with the Python scientific software stack (thanks to Numpy)

Here is some sample test code, copied from the Numba documentation::

    # file: numba_test01.py

    import numba

    @numba.jit
    def sum2d(arr):
        M, N = arr.shape
        result = 0.0
        for i in range(M):
            for j in range(N):
                result += arr[i, j]
        return result

    def plain_sum2d(arr):
        M, N = arr.shape
        result = 0.0
        for i in range(M):
            for j in range(N):
                result += arr[i, j]
        return result

And, here is an example that calls the two above functions, one optimized by Numba and the other not. Notice the timings. The Numba optimized version is more than two orders of magnitude faster::

    In [30]: import numba_test01 as nt
    In [31]: a = np.ones((1000, 1200))
    In [32]: time nt.plain_sum2d(a)
    CPU times: user 621 ms, sys: 0 ns, total: 621 ms
    Wall time: 622 ms
    Out[32]: 1200000.0
    In [33]: time nt.sum2d(a)
    CPU times: user 3.68 ms, sys: 0 ns, total: 3.68 ms
    Wall time: 3.7 ms
    Out[33]: 1200000.0

There is a lot more that can be done with Numba in the way of optimizing code. See the docs.

Dask
------

The documentation on Dask can be found here: http://dask.pydata.org/en/latest/docs.html.

This summary of Dask is from the Dask documentation::

    Dask is a flexible parallel computing library for analytic computing.

    Dask is composed of two components:

    1. Dynamic task scheduling optimized for computation. This is similar to
       Airflow, Luigi, Celery, or Make, but optimized for interactive
       computational workloads.
    2. “Big Data” collections like parallel arrays, dataframes, and lists
       that extend common interfaces like NumPy, Pandas, or Python iterators
       to larger-than-memory or distributed environments. These parallel
       collections run on top of the dynamic task schedulers.

If you are beginning to learn Dask, you might want some sample data:

- The dask tutorial contains a script for generating sample data files. You can find the tutorial repository here: https://github.com/dask/dask-tutorial.
- And, here is a script that will generate a few HDF5 files. I copied it from the Dask Web site (http://dask.pydata.org/en/latest/examples/dataframe-hdf5.html), and made a few minor modifications::

      #!/usr/bin/env python
      """
      synopsis: generate sample dask data files.
      usage: python generate_dask_data.py
      options: -h, --help Display this help.
""" import sys import string import random import pandas as pd import numpy as np def generate(prefix): # dict to keep track of hdf5 filename and each key fileKeys = {} for i in range(10): # randomly pick letter as dataset key groupkey = random.choice(list(string.ascii_lowercase)) # randomly pick a number as hdf5 filename filename = prefix + str(np.random.randint(100)) + '.h5' # Make a dataframe; 26 rows, 2 columns df = pd.DataFrame({'x': np.random.randint(1, 1000, 26), 'y': np.random.randint(1, 1000, 26)}, index=list(string.ascii_lowercase)) # Write hdf5 to current directory df.to_hdf(filename, key='/' + groupkey, format='table') fileKeys[filename] = groupkey # prints hdf5 filenames and keys for each print(fileKeys) def main(): args = sys.argv[1:] if len(args) != 1: sys.exit(__doc__) if args[0] in ('-h', '--help'): sys.exit(__doc__) prefix = args[0] generate(prefix) if __name__ == '__main__': main() I used the above script to build sample data files as follows:: $ ./generate_dask_data.py "data02/sample_" Then I read these HDF5 files into a Dask DataFrame by using the following:: In [38]: df = dd.read_hdf('./data02/sample_*.h5', key='/*') In [39]: df Out[39]: Dask DataFrame Structure: x y npartitions=10 int64 int64 ... ... ... ... ... ... ... ... ... Dask Name: concat, 22 tasks In [40]: After which, I can do the following, for example:: In [40]: df.x.mean().compute() Out[40]: 501.53076923076924 We can do something that indicates how our data has been broken down into separate partitions. I can use this function:: def test(df): results = [] for idx in range(df.npartitions): mean = df.get_partition(idx).x.mean().compute() print('partition: {} mean: {}'.format(idx, mean)) results.append((idx, mean)) return results Which produces something like the following:: In [10]: test(df) idx: 0 mean: 473.7692307692308 idx: 1 mean: 436.5769230769231 idx: 2 mean: 501.2692307692308 idx: 3 mean: 565.4230769230769 idx: 4 mean: 516.8846153846154 idx: 5 mean: 501.34615384615387 idx: 6 mean: 531.3076923076923 idx: 7 mean: 428.61538461538464 idx: 8 mean: 565.2307692307693 idx: 9 mean: 494.88461538461536 Out[10]: [(0, 473.7692307692308), (1, 436.5769230769231), (2, 501.2692307692308), (3, 565.4230769230769), (4, 516.8846153846154), (5, 501.34615384615387), (6, 531.3076923076923), (7, 428.61538461538464), (8, 565.2307692307693), (9, 494.88461538461536)] Dask for big data ~~~~~~~~~~~~~~~~~~~ Dask enables you to divide a large data structure or data set, for example, a Pandas DataFrame, into smaller structures, for example, smaller DataFrames, then load those smaller chunks from disk and process them. Example: 1. First we'll create a data set, a Pandas DataFrame, that we can divide up into smaller chunks. Here is a Python script that we can use to create a sample CSV (comma separated values) file:: #!/usr/bin/env python # file: write_csv.py """ synopsis: Write sample CSV file from Pandas DataFrame. usage: python write_csv.py example: python write_csv.py test_data.csv 200 """ import sys import numpy as np import pandas as pd def make_sample_dataframe(periods): """Make sample dates and DataFrame. 
Returns (dates, df).""" dates = pd.date_range('20130101', periods=periods) df = pd.DataFrame( np.random.randn(periods, 4), index=dates, columns=list('ABCD')) return dates, df def create_data(outfilename, count): dates, df = make_sample_dataframe(count) df.to_csv(outfilename) def main(): args = sys.argv[1:] if len(args) != 2: sys.exit(__doc__) outfilename = args[0] count = int(args[1]) create_data(outfilename, count) if __name__ == '__main__': main() And, from within IPython, we can run it to create a CSV file as follows:: In [113]: %run write_csv.py tmp2.csv 200 Now, we can read that file to create a Dask DataFrame with the following:: In [115]: import dask.dataframe as dd In [116]: daskdf = dd.read_csv('tmp2.csv') 2. We can look at our data with ``df.head()`` and ``df.tail()``:: In [117]: daskdf.head() Out[117]: Unnamed: 0 A B C D 0 2013-01-01 1.719008 0.168998 -0.582670 -0.199597 1 2013-01-02 0.947192 1.449137 -0.701263 0.342353 2 2013-01-03 1.321397 0.035692 0.147275 1.551782 3 2013-01-04 -0.286258 0.592772 1.770504 1.752572 4 2013-01-05 1.695924 0.159782 2.150698 -0.060106 In [118]: daskdf.tail() Out[118]: Unnamed: 0 A B C D 195 2013-07-15 0.303020 0.710051 -0.904407 -0.451793 196 2013-07-16 -0.703248 -0.973423 -0.830585 0.183094 197 2013-07-17 0.886046 1.530008 1.319875 -0.318807 198 2013-07-18 0.021749 2.570984 0.572013 1.249558 199 2013-07-19 -0.570810 -0.240768 2.203662 -0.014111 Also see the Pandas section for ways to view structures, for example: `View Pandas data structures`_ 3. Next, we'll divide it up -- This is an important capability of Dask; it enables us to process Dataframes/arrays that are either too large to fit comfortably in memory or which we are only interested in sub-slices. In this case, we'll specify a block size (or a partition size) when we read the CSV file and create a Dask DataFrame:: In [58]: %run write_csv.py tmp4.csv 500 In [59]: In [59]: df3 = dd.read_csv('tmp3.csv', blocksize=600) In [60]: In [60]: df3.head() Out[60]: Unnamed: 0 A B C D 0 2013-01-01 1.907704 0.317188 0.779075 0.327731 1 2013-01-02 -0.936242 -0.679869 -0.817254 -0.810020 2 2013-01-03 -1.465717 -0.775163 -0.621830 -0.171773 3 2013-01-04 0.878534 -0.910678 -0.363762 0.462970 4 2013-01-05 -0.182779 0.174225 -1.483841 -0.062528 In [61]: df3.tail() Out[61]: Unnamed: 0 A B C D 0 2013-07-15 0.426699 -2.126057 -0.784172 0.780982 1 2013-07-16 -0.727647 -1.552699 0.750276 -0.788475 2 2013-07-17 0.452168 -0.525214 0.003892 -0.029953 3 2013-07-18 -1.135117 0.626181 -0.895456 2.096875 4 2013-07-19 1.365505 -0.208806 0.115254 -1.210855 In [62]: In [62]: df3.A.mean().compute() Out[62]: 0.04365032375682896 In [63]: 4. And, now, we'll process that data chunk by chunk:: In [63]: for idx in range(df3.npartitions): ...: data = df3.get_partition(idx) ...: mean = data.A.mean().compute() ...: print('partition: {} mean: {}'.format(idx, mean)) ...: partition: 0 mean: 0.1307434691610682 partition: 1 mean: -0.10723637021736673 partition: 2 mean: 0.47059788011488657 partition: 3 mean: -0.029706498960742605 partition: 4 mean: 0.06754303873144374 partition: 5 mean: 0.1604556981338858 partition: 6 mean: -0.4161510144675041 partition: 7 mean: 0.6799116374415602 partition: 8 mean: 0.6303390153859068 partition: 9 mean: 0.6517677726166038 partition: 10 mean: -0.02111769936010994 o o o In [64]: Notes: - Keep in mind that Dask is capable of "parallelizing" the above operation. It can process multiple partitions in parallel on a multi-core/multi-CPU machine. See the next section for help with that. 
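As an aside before moving on: instead of looping over ``get_partition`` explicitly, Dask can apply a function to every partition for you. Here is a minimal, hedged sketch; it assumes the ``df3`` DataFrame created from the CSV file above, and ``map_partitions`` hands each underlying Pandas ``DataFrame`` (one per partition) to the function::

    import dask.dataframe as dd

    df3 = dd.read_csv('tmp3.csv', blocksize=600)
    # One value per partition: the mean of column A of that chunk.
    per_partition_means = df3.map_partitions(lambda pdf: pdf.A.mean()).compute()
    print(per_partition_means)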
Dask for optimized (and parallel) computing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dask enables you to describe a complex process in terms of an execution graph: a digraph (directed graph) whose nodes are sub-processes. The valuable thing about being able to do so is that Dask can schedule the execution of that larger process so that some sub-processes are executed in parallel. On multi-CPU/multi-core hardware, this can be a big win.

Dask supports parallel processing both on a single machine and on multiple, distributed machines. In what follows, however, I will discuss parallel computation on a single machine.

To learn more about this, you will want to read the following:

- Scheduling -- http://dask.pydata.org/en/latest/scheduling.html
- Single Machine with Dask.distributed -- http://dask.pydata.org/en/latest/setup/single-distributed.html
- Dask.distributed -- https://distributed.readthedocs.io/en/latest/index.html

Controlling parallelism in Dask requires understanding Dask schedulers, how they are used by Dask, and how to use them. Note that Dask has default schedulers. If you do nothing to change or set the scheduler, you will be using the default, which is most often what you want. The notes that follow will attempt to help you determine when and under what conditions you might want to use a different scheduler and how to do that.

Also, keep in mind two concepts that are both related to optimization in Dask: (1) Parallelism is what you want when you have multiple tasks and want to speed them up by running/computing them in parallel. (2) Breaking your data and your Dask data collections into chunks is what you want when your data set is very large and will not fit in memory. You should keep in mind that breaking your data into chunks may slow down processing. Here is something that shows some of those differences::

    In [57]: df1 = dd.read_csv('tmp5.csv', blocksize=1000000)
    In [58]: df2 = dd.read_csv('tmp5.csv', blocksize=8000)
    In [59]:
    In [59]: df1.npartitions
    Out[59]: 1
    In [60]: df2.npartitions
    Out[60]: 12
    In [61]: df1.get_partition(0).size.compute()
    Out[61]: 5000
    In [62]: df2.get_partition(0).size.compute()
    Out[62]: 450
    In [63]:
    In [63]: time df1.A.mean().compute()
    CPU times: user 15.8 ms, sys: 7.5 ms, total: 23.3 ms
    Wall time: 22.3 ms
    Out[63]: 0.02893067882172706
    In [64]: time df2.A.mean().compute()
    CPU times: user 167 ms, sys: 9.85 ms, total: 177 ms
    Wall time: 164 ms
    Out[64]: 0.028930678821727045
    In [65]:

Notes:

- We create ``df1`` with a single partition (or chunk) and ``df2`` with multiple partitions (in this case 12).
- The size of a single partition of ``df1`` is much larger than the first partition of ``df2`` (5000 vs 450).
- Computing the mean of a single column of ``df1`` takes significantly less time than the same operation on ``df2``.

Synchronous processing on the local machine -- the default scheduler does that. Let's figure out how to do that in parallel; for example, we'll try to compute the mean of each of the columns of our dataframe (four columns: "A", "B", "C", and "D") in parallel.

Here are two functions. One computes the mean for each column in our DataFrame, one column after another. The other attempts to use ``dask.distributed`` to schedule these four tasks so that they make use of more than one CPU core::

    def compute_means_sequential(df):
        """
        Sequentially compute the means of columns of dataframe.

        Args:
            df (dask.dataframe.DataFrame) -- A dataframe containing
            columns A, B, C, and D.
Return: The means """ meanA = df.A.mean().compute() meanB = df.B.mean().compute() meanC = df.C.mean().compute() meanD = df.D.mean().compute() return meanA, meanB, meanC, meanD def compute_means_parallel(client, df): """ Compute in parallel the means of columns of dataframe. Args: client (dask.distributed.Client) -- The client to schedule the computation. df (dask.dataframe.DataFrame) -- A dataframe containing columns A, B, C, and D. Return: The means """ meanA = client.submit(df.A.mean().compute) meanB = client.submit(df.B.mean().compute) meanC = client.submit(df.C.mean().compute) meanD = client.submit(df.D.mean().compute) client.gather((meanA, meanB, meanC, meanD)) return meanA.result(), meanB.result(), meanC.result(), meanD.result() You can find a file containing these snippets here: `snippets.py <{filename}static/snippets.py>`_. Here is a test that uses the above on a 2-core machine:: In [17]: time snippets.compute_means_sequential(df1) CPU times: user 167 ms, sys: 21.3 ms, total: 189 ms Wall time: 379 ms Out[17]: (0.02893067882172706, -0.05704419047235241, -0.03281851829891229, -0.029845199428518945) In [18]: time snippets.compute_means_parallel(client, df1) CPU times: user 189 ms, sys: 16.9 ms, total: 206 ms Wall time: 281 ms Out[18]: (0.02893067882172706, -0.05704419047235241, -0.03281851829891229, -0.029845199428518945) Here is a test that uses the above on a 4-core machine:: In [15]: time snippets.compute_means_sequential(df1) CPU times: user 160 ms, sys: 9.5 ms, total: 169 ms Wall time: 303 ms Out[15]: (0.02893067882172706, -0.05704419047235241, -0.03281851829891229, -0.029845199428518945) In [16]: In [16]: time snippets.compute_means_parallel(client, df1) CPU times: user 164 ms, sys: 5.03 ms, total: 169 ms Wall time: 224 ms Out[16]: (0.02893067882172706, -0.05704419047235241, -0.03281851829891229, -0.029845199428518945) Notes: - Parallel execution on a 4-core machine takes measurably less time. On a large data structure, this might be significant and noticeable. - My original test had four calls to ``print()`` in each of the above two functions. That partially masked the time difference between calls to these functions. - As with any work on optimization, you will need to test with your data, your machine, your configuration, etc. YMMV (your mileage my vary). Cython -------- See: http://cython.org/. Cython enables us to write or produce C code while writing code in the style of Python. There's more to it than that, but you get the idea. We can write code that looks a lot like Python code, and then use Cython to turn it into C code. Cython has another important use -- Because (1) Cython gives us easy access to libraries of compiled C code and (2) it is easy to write functions in Cython that can be called from Python, we can use it to easily "wrap" C functions for use in Python. In fact, if you look inside some Python packages, for example Lxml, you will see wrappers for underlying C code that were produced with Cython; Lxml makes calls into the ``libxml`` XML libraries provided by http://www.xmlsoft.org. Here is a bit more description from http://cython.org/: "Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex). It makes writing C extensions for Python as easy as Python itself. "Cython gives you the combined power of Python and C to let you * write Python code that calls back and forth from and to C or C++ code natively at any point. 
* easily tune readable Python code into plain C performance by adding static type declarations.

* use combined source code level debugging to find bugs in your Python, Cython and C code.

* interact efficiently with large data sets, e.g. using multi-dimensional NumPy arrays.

* quickly build your applications within the large, mature and widely used CPython ecosystem.

* integrate natively with existing code and data from legacy, low-level or high-performance libraries and applications."

Machine learning
==================

Scikit-Learn
--------------

The ``scikit-learn`` documentation page is here: http://scikit-learn.org/stable/user_guide.html.

EliteDataScience has an introduction to machine learning here: https://elitedatascience.com/learn-machine-learning

EliteDataScience has provided a Scikit-Learn tutorial here: https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn.

tensorflow
------------

Question: Is there support for tensorflow in Anaconda?

Answer: Yes, but currently, installing it is tricky. For example, see this: https://gist.github.com/johndpope/187b0dd996d16152ace2f842d43e3990

Multiprocessing and parallelization
===================================

``ipyparallel``
-----------------

See: https://ipyparallel.readthedocs.io/en/latest/

Dask and Dask schedulers
--------------------------

See: https://dask.pydata.org/

Also see the section on Dask elsewhere in the current document: `Dask for optimized (and parallel) computing`_.

Data store -- HDF5, h5py, Pytables, asdf, etc
===============================================

HDF5
------

h5py
~~~~~~

You can store Pandas DataFrames and Dask DataFrames in HDF5 archives with ``h5py``. You can read about ``h5py`` here:

- https://www.h5py.org/
- http://docs.h5py.org/en/latest/
- http://shop.oreilly.com/product/0636920030249.do -- a book.

Also see: https://dask.pydata.org/en/doc-test-build/array-overview.html#construct

Here is an example that saves and retrieves a Dask DataFrame::

    In [62]: df1, df2 = snippets.read_csv_files('tmp5.csv')
    In [63]: df1.to_hdf('tmp01.hdf5', '/Version1/tmp5')
    Out[63]: ['tmp01.hdf5']
    In [64]:
    In [64]: df1a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5')
    In [65]:
    In [65]: df1.A.mean().compute()
    Out[65]: 0.02893067882172706
    In [66]: df1a.A.mean().compute()
    Out[66]: 0.02893067882172706
    In [68]: df2.to_hdf('tmp01.hdf5', '/Version1/tmp5_2')
    Out[68]:
    ['tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5',
     'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5',
     'tmp01.hdf5', 'tmp01.hdf5']
    In [69]:
    In [69]: df2a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2')
    In [70]:
    In [70]: df2.npartitions
    Out[70]: 12
    In [71]: df2a.npartitions
    Out[71]: 1
    In [72]: df2.B.sum().compute()
    Out[72]: -57.04419047235241
    In [73]: df2a.B.sum().compute()
    Out[73]: -57.04419047235241

Notes:

- We write a Dask DataFrame (``df1``) to HDF5, then read it back into a separate variable (``df1a``).
- We compute the mean of column A of both DataFrames so as to show that the one we wrote to HDF5 and the one we read back in from HDF5 contain the same data.
- Notice that in the case of ``df2`` and ``df2a``, the ``read_hdf`` function did not preserve the chunk size and number of partitions. However, the ``read_hdf`` function has an optional parameter that enables you to read a DataFrame from HDF5 creating multiple partitions and a smaller chunk size.

Example::

    In [80]: df2b = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2')
    In [81]: df2b.npartitions
    Out[81]: 1
    In [82]: df2c = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2', chunksize=100)
    In [83]: df2c.npartitions
    Out[83]: 10
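The examples above go through Dask; for completeness, here is a minimal, hedged sketch of using ``h5py`` itself to write and read a Numpy array. The file and dataset names are made up for illustration::

    import h5py
    import numpy as np

    data = np.random.randn(100, 4)
    # Write a dataset into an HDF5 file.
    with h5py.File('tmp02.hdf5', 'w') as outfile:
        outfile.create_dataset('/Version1/raw', data=data)
    # Read it back.
    with h5py.File('tmp02.hdf5', 'r') as infile:
        retrieved = infile['/Version1/raw'][:]
    print(retrieved.shape)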
h5serv
~~~~~~~~

There is also an HTTP server for HDF5 archives. It presents a REST-ful interface that enables you to add, list, and retrieve data objects from HDF5 archives on a remote machine. The data returned in response to a retrieval request is formatted as JSON.

You can learn more about ``h5serv`` here: http://h5serv.readthedocs.io/en/latest/. And, you can learn about the JSON representation of HDF5 here: http://hdf5-json.readthedocs.io/en/latest/index.html.

Pytables
~~~~~~~~~~

asdf
------

The documentation is here: https://asdf.readthedocs.io/en/latest/. And, a bit more documentation: https://www.sciencedirect.com/science/article/pii/S2213133715000645

CSV -- comma separated values
-------------------------------

A CSV module is in the Python standard library. See: https://docs.python.org/3/library/csv.html

.. vim: ft=rst :