Author: | Dave Kuhlman |
---|---|
Contact: | dkuhlman (at) davekuhlman (dot) org |
Address: | http://www.davekuhlman.org |
Revision: | 1.0.1 |
Date: | September 06, 2023 |
Copyright: | Copyright (c) 2018 Dave Kuhlman. All Rights Reserved. This software is subject to the provisions of the MIT License http://www.opensource.org/licenses/mit-license.php. |
---|---|
Abstract: | This document attempts to give a survey of data science tools for Python programming, along with brief introductions to help getting started with some of those tools. |
In this document I'll try to describe and summarize some significant tools that are available to Python programmers for data science, numerical processing, statistics, and visualizing numerical data. For each tool or package, I'll also try to give a brief overview and some help on getting started.
All these packages are available in the Anaconda distribution of Python, which makes Anaconda a very good option for data analytics and visualization. See:
It's likely that they are also available at http://pypi.python.org and can be installed with pip. If you plan on doing some exploration (and do not want to use the Anaconda distribution), you will want to consider using virtualenv (https://virtualenv.pypa.io/en/stable/) and, for even more convenience in trying out various packages and configurations, look at virtualenvwrapper (https://virtualenvwrapper.readthedocs.io/en/latest/).
More information:
Many of the examples in this document use the somewhat standard import statements, for example:
import numpy as np
import scipy as sp
import pandas as pd
IPython is an enhanced interactive Python shell. It has tab completion, gives more convenient access to help for Python modules and objects, enables you to edit and rerun previous commands, and much more.
For more information, see: https://ipython.org.
Anaconda ships with QtConsole, which contains IPython, for even more convenience.
If you use IPython, then consider creating a data science profile. Use something like this:
$ ipython profile create datasci
Then, consider putting something like the following in ~/.ipython/profile_datasci/startup/50-config.py:
import sys
import numpy as np
import scipy as sp

def pdir(obj):
    """Print information about obj, including `dir(obj)`."""
    if isinstance(obj, type):
        print('class: {}'.format(obj.__name__))
    else:
        print('instance class name: {}'.format(obj.__class__.__name__))
    if obj.__doc__:
        print('doc string: {}'.format(obj.__doc__))
    else:
        print('doc string: no doc string')
    print(dir(obj))

def read_file_contents(filename):
    with open(filename, 'r') as infile:
        content = infile.read()
    return content
You can have multiple startup files. See the startup/README file in your profile directory.
Also, consider doing some customization in ~/.ipython/profile_datasci/ipython_config.py.
And, in order to use that profile, start IPython with this:
$ ipython --profile=datasci
You can find more help with profiles by running something like the following:
$ ipython help profile
Or, see the following: http://ipython.readthedocs.io/en/stable/config/intro.html#profiles
Inside the standard Python interactive shell, you can get help on some_object with this:
>>> help(some_object)
Inside the IPython interactive shell, you can use the above, or you can do:
In [9]: import scipy.fftpack
In [10]: scipy.fftpack?
In [11]:
In [11]: from scipy import fftpack
In [12]: fftpack?
In [13]: fftpack.fft?
You can use pydoc to get help at the command line. For example:
$ pydoc numpy.arange
You can also use pydoc to run an HTTP server, and view the documentation in a Web browser. Do the following for help with that:
$ pydoc --help
And, of course, documentation is available for the Scipy suite of tools at: http://www.scipy.org.
Unless otherwise noted, each of the tools described in this document can be installed with pip install ... (the standard Python install tool) or, for those who are using the Anaconda Python distribution, with conda install ....
If you use pip, I'd recommend using virtualenv, at the least, and even virtualenvwrapper, for extra convenience and flexibility. virtualenv enables you to install Python packages (and therefore, the tools discussed in this document) in a separate environment, separate from your standard Python installation, and without polluting that standard installation. Since that separate installation is in its own directory, you can remove it by simply deleting that directory. virtualenvwrapper extends virtualenv by enabling you to create, manage, and switch between different virtualenv environments easily. For example, you might want to create and switch (1) between one virtualenv for text processing and another for data science or (2) between one installation for Python 2 and another for Python 3. See:
The Anaconda installation of Python provides most of the tools discussed in this document in the standard Anaconda installation. Additional tools can be installed with conda install ..., and the installation can be kept up-to-date with conda update --all. In the event that you need a Python package that is not provided by Anaconda, you can use pip.
For more options on installing Python with a slant toward data science and scientific programming (but much else besides), see: https://www.scipy.org/install.html.
Help with Numpy:
There are (at least) two aspects to Numpy:
Primitive Numpy numeric types or scalars, for example: np.int32, np.int64, np.float32, np.float64, etc. See the following for information on these primitive types and others: https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html.
Array objects (instances of np.ndarray) along with ways to deal with them.
Operations on Numpy arrays -- For information on these, see the Numpy reference manual: https://docs.scipy.org/doc/numpy/reference/index.html. Here is a quick summary:
Array creation routines -- Create arrays of different kinds, e.g. all ones, all zeros, identity, from an existing array, as a copy of an array, etc.
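For instance, here is a short sketch of a few of those creation routines (the values are chosen just for illustration):

>>> import numpy as np
>>> np.zeros((2, 3))              # 2x3 array of zeros
>>> np.ones(4)                    # 1-D array of ones
>>> np.eye(3)                     # 3x3 identity matrix
>>> np.arange(0, 10, 2)           # evenly spaced values: 0, 2, 4, 6, 8
>>> np.array([[1, 2], [3, 4]])    # from an existing (nested) sequence
>>> np.copy(np.ones(3))           # a copy of an existing array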
Array manipulation routines -- Routines that reshape an array, transpose an array, change the number of dimensions, join arrays (concatenate, stack, etc.), tile arrays (create a larger array by repeating an array), split arrays, etc. A short sketch of a few of these follows.
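>>> a = np.arange(6)
>>> a.reshape(2, 3)                # change the shape to 2x3
>>> a.reshape(2, 3).T              # transpose
>>> np.concatenate([a, a])         # join two arrays end to end
>>> np.vstack([a, a])              # stack as rows
>>> np.tile(a, 2)                  # build a new array by repeating a
>>> np.split(a, 3)                 # split into three equal pieces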
Binary operations -- Logical binary operations on arrays, packing arrays into bits, bit-shifting operations, etc.
String operations
C-Types Foreign Function Interface (numpy.ctypeslib)
Datetime Support Functions
Data type routines
Optionally Scipy-accelerated routines (numpy.dual) -- Routines that may be accelerated by Scipy if it is installed, but are still available in Numpy if Scipy is not. For example, routines for eigenvalues, Fourier transforms, solving linear equations, etc. Use:
>>> from numpy import dual
Mathematical functions with automatic domain (numpy.emath)
Floating point error handling
Discrete Fourier Transform (numpy.fft) -- Use:
>>> from numpy import fft
Or, just:
>>> np.fft.fft( ... ) # etc.
Financial functions -- Loan, payment, and interest calculations.
Functional programming -- Routines and classes that assist with doing functional programming. For example, np.vectorize creates a "vectorized" function; np.frompyfunc creates a Numpy ufunc. (Note that vectorized functions and universal functions can be applied to arrays. For help with the difference between vectorized and universal functions, see: https://stackoverflow.com/questions/6768245/difference-between-frompyfunc-and-vectorize-in-numpy.)
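Here is a small sketch showing both; the step function is made up for illustration:

>>> def step(x):
...     return 0.0 if x < 0.0 else 1.0
...
>>> vstep = np.vectorize(step)            # broadcasts over array arguments
>>> vstep(np.array([-1.5, 0.0, 2.5]))     # -> float array: [0., 1., 1.]
>>> ustep = np.frompyfunc(step, 1, 1)     # a ufunc with 1 input and 1 output
>>> ustep(np.array([-1.5, 0.0, 2.5]))     # -> object-dtype array: [0.0, 1.0, 1.0]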
Also, remember to look at functools and itertools in the standard Python library: https://docs.python.org/3/library/functional.html
And, if you need parallelism across multiple CPUs and cores, look at ipyparallel: https://ipyparallel.readthedocs.io/en/latest/
Numpy-specific help functions -- Functions for getting information about objects and help with Numpy. (Also, if you are using IPython, the "?" operator gives help with a function or object, for example, enumerate? gives help on the enumerate function.)
Indexing routines
Input and output -- Routines for saving and loading arrays. (But, you may also want to explore HDF5 and h5py or pytables. Both h5py and pytables are in the Anaconda Python distribution.) Also, routines for formatting arrays as strings, converting arrays to and from strings, etc.
Linear algebra (numpy.linalg) -- Routines for matrix and vector products, decompositions, eigenvalues, norms, and solving systems of linear equations.
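A minimal sketch of a couple of those routines:

>>> a = np.array([[3.0, 1.0], [1.0, 2.0]])
>>> b = np.array([9.0, 8.0])
>>> np.linalg.solve(a, b)          # solve a @ x == b; here x is [2., 3.]
>>> np.linalg.det(a)               # determinant
>>> w, v = np.linalg.eig(a)        # eigenvalues and eigenvectors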
Logic functions -- Functions for performing various tests on elements of Numpy arrays.
Masked array operations -- Support for creating and using masked arrays. A masked array is an array with a mask that marks some elements of the array as invalid. You can find some help with masked arrays in this document: http://www.scipy-lectures.org/intro/numpy/numpy.html.
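A minimal sketch of creating and using a masked array:

>>> data = np.array([1.0, 2.0, -999.0, 4.0])
>>> masked = np.ma.masked_array(data, mask=[False, False, True, False])
>>> masked.mean()                        # mean of the valid entries only (7/3)
>>> np.ma.masked_where(data < 0, data)   # mask by condition instead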
Mathematical functions -- Functions for trigonometric and hyperbolic calculations, rounding, sums, products, differences, exponents and logarithms, and other element-wise arithmetic.
Matrix library (numpy.matlib) -- Functions for creating and using matrices, as opposed to numpy.ndarray. Use from numpy import matlib. See this for a bit of help on the differences between arrays and matrices in Numpy: https://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u
Miscellaneous routines
Padding Arrays
Polynomials
Random sampling (numpy.random)
Set routines
Sorting, searching, and counting
Statistics
Test Support (numpy.testing)
Window functions
Note that Scipy, Numpy, Pandas, Matplotlib, IPython, and Sympy are all under the Scipy umbrella. For information about any of these, see: https://www.scipy.org/.
What is Scipy? (1) It is many things to many people. But more seriously, (2) it is a large collection of functions for performing operations on arrays of numerical data. Think of it this way: Numpy (and Pandas) give you ways to structure and manipulate multi-dimensional arrays of numbers; Scipy gives you many functions that perform operations on those multi-dimensional arrays of numbers.
What kinds of operations? Here are some categories with descriptions:
For help with this set of functions, do the following:
>>> from scipy import integrate
>>> help(integrate)
Or, in IPython, do integrate?
Here is the list you will see:
Integrating functions, given function object
Integrating functions, given fixed samples
Solving initial value problems for ODE systems
The solvers are implemented as individual classes which can be used directly (low-level usage) or through a convenience function.
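As a small illustration of the first two categories, here is a hedged sketch using quad (integration of a function object) and trapz (integration of fixed samples):

>>> import numpy as np
>>> from scipy import integrate
>>> result, abserr = integrate.quad(np.sin, 0.0, np.pi)   # integrate a function object
>>> result                                                # (almost exactly) 2.0
>>> x = np.linspace(0.0, np.pi, 101)
>>> integrate.trapz(np.sin(x), x)                         # integrate fixed samples; close to 2.0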
Remember that for each of the following (or any) functions, you can get help in the usual ways: help(some_func) or (in IPython) some_func?.
Local Optimization:
General-purpose multivariate methods:
Constrained multivariate methods:
Univariate (scalar) minimization methods:
Equation (Local) Minimizers:
Global Optimization:
Rosenbrock function:
Fitting:
Root finding -- Scalar functions:
Fixed point finding:
General nonlinear solvers:
Large-scale nonlinear solvers:
Simple iterations:
Additional information on the nonlinear solvers can be obtained from the help on scipy.optimize.nonlin.
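As one small, hedged example of the minimizers and scalar root finders listed above, consider the following sketch (the starting point is arbitrary):

>>> import numpy as np
>>> from scipy import optimize
>>> result = optimize.minimize(optimize.rosen, x0=np.array([1.3, 0.7, 0.8]),
...                            method='Nelder-Mead')
>>> result.x                                            # close to [1., 1., 1.], the known minimum
>>> optimize.brentq(lambda x: x ** 2 - 2.0, 0.0, 2.0)   # a scalar root; approximately sqrt(2)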
Linear Programming -- General linear programming solver:
linprog -- Unified interface for minimizers of linear programming problems
The simplex method supports callback functions, such as:
linprog_verbose_callback -- Sample callback function for linprog (simplex)
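And here is a minimal, hedged linprog sketch (the coefficients are made up; note that linprog minimizes, so we negate the objective to maximize):

>>> from scipy.optimize import linprog
>>> c = [-1.0, -2.0]                          # maximize x + 2*y by minimizing its negation
>>> A_ub = [[1.0, 1.0], [1.0, -1.0]]          # x + y <= 4 and x - y <= 2
>>> b_ub = [4.0, 2.0]
>>> res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
>>> res.x                                     # the optimal point
>>> res.fun                                   # the optimal (minimized) objective value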
Assignment problems:
Utilities:
Sub-package for objects used in interpolation.
As listed below, this sub-package contains spline functions and classes, one-dimensional and multi-dimensional (univariate and multivariate) interpolation classes, Lagrange and Taylor polynomial interpolators, and wrappers for FITPACK and DFITPACK functions.
See also: scipy.ndimage.map_coordinates
Tensor product polynomials:
1-D Splines
Functional interface to FITPACK routines:
Object-oriented FITPACK interface:
2-D Splines
Additional tools
See also:
Functions existing for backward compatibility (should not be used in new code):
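As a small, hedged illustration of the 1-D interpolation classes, consider interp1d:

>>> import numpy as np
>>> from scipy import interpolate
>>> x = np.linspace(0.0, 10.0, 11)
>>> y = np.sin(x)
>>> f_linear = interpolate.interp1d(x, y)                 # linear interpolation by default
>>> f_cubic = interpolate.interp1d(x, y, kind='cubic')    # cubic spline interpolation
>>> x_new = np.linspace(0.0, 10.0, 41)
>>> y_new = f_cubic(x_new)                                # evaluate at the new points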
There is help and a number of examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html.
Here is an example, copied from the documentation in the above link:
import numpy as np
from scipy.fftpack import fft

def test():
    # Number of sample points
    N = 600
    # sample spacing
    T = 1.0 / 800.0
    x = np.linspace(0.0, N * T, N)
    y = np.sin(50.0 * 2.0 * np.pi * x) + 0.5 * np.sin(80.0 * 2.0 * np.pi * x)
    yf = fft(y)
    from scipy.signal import blackman
    w = blackman(N)
    ywf = fft(y * w)
    xf = np.linspace(0.0, 1.0 / (2.0 * T), N // 2)
    import matplotlib.pyplot as plt
    plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(yf[1:N // 2]), '-b')
    plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(ywf[1:N // 2]), '-r')
    plt.legend(['FFT', 'FFT w. window'])
    plt.grid()
    plt.show()

test()
Here is a summary of the Discrete Fourier transforms support in scipy.fftpack:
Use this module with either of the following:
>>> import scipy.signal
>>> from scipy import signal
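Here is a small, hedged sketch of the kind of thing scipy.signal does -- designing a low-pass Butterworth filter and applying it without phase distortion (the signal itself is made up):

>>> import numpy as np
>>> from scipy import signal
>>> rng = np.random.RandomState(0)
>>> t = np.linspace(0.0, 1.0, 500)
>>> x = np.sin(2 * np.pi * 5.0 * t) + 0.5 * rng.randn(500)   # a noisy 5 Hz sine
>>> b, a = signal.butter(4, 0.05)            # 4th-order low-pass Butterworth filter
>>> x_smooth = signal.filtfilt(b, a, x)      # zero-phase (forward-backward) filtering
>>> x_median = signal.medfilt(x, kernel_size=5)   # median filtering, for comparison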
Here is a summary:
Use this module with either of the following:
>>> import scipy.linalg
>>> from scipy import linalg
Here is a summary:
Basics
Eigenvalue Problems
Decompositions
See also: scipy.linalg.interpolative -- Interpolative matrix decompositions
Matrix Functions
Matrix Equation Solvers
Sketches and Random Projections
Special Matrices
Low-level routines
See also: scipy.linalg.blas (low-level BLAS functions) and scipy.linalg.lapack (low-level LAPACK functions).
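Here is a minimal sketch of a few of the basic routines:

>>> import numpy as np
>>> from scipy import linalg
>>> a = np.array([[1.0, 2.0], [3.0, 4.0]])
>>> b = np.array([5.0, 6.0])
>>> linalg.solve(a, b)             # solve the linear system a @ x == b
>>> linalg.det(a)                  # determinant
>>> linalg.inv(a)                  # inverse
>>> lu, piv = linalg.lu_factor(a)  # LU factorization, reusable for many right-hand sides
>>> linalg.lu_solve((lu, piv), b)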
There are examples in the Scipy documentation, here: https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html
And, here is a summary copied from that document:
"ARPACK is a Fortran package which provides routines for quickly finding a few eigenvalues/eigenvectors of large sparse matrices. In order to find these solutions, it requires only left-multiplication by the matrix in question. This operation is performed through a reverse-communication interface. The result of this structure is that ARPACK is able to find eigenvalues and eigenvectors of any linear function mapping a vector to a vector.
"All of the functionality provided in ARPACK is contained within the two high-level interfaces scipy.sparse.linalg.eigs and scipy.sparse.linalg.eigsh. eigs provides interfaces to find the eigenvalues/vectors of real or complex nonsymmetric square matrices, while eigsh provides interfaces for real-symmetric or complex-hermitian matrices."
There is an example that implements a search for the shortest path between two words of equal length in a word ladder (i.e. changing just one letter in each step) in the Scipy documentation. You can find it here: https://docs.scipy.org/doc/scipy/reference/tutorial/csgraph.html.
You can get documentation with the following:
$ pydoc scipy.sparse.csgraph
And, in IPython, do something like this:
In [41]: from scipy.sparse import csgraph
In [42]: csgraph.connected_components?
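And here is a minimal, hedged sketch of using a couple of the csgraph routines on a small, made-up graph (stored as a sparse adjacency matrix):

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> from scipy.sparse.csgraph import shortest_path, connected_components
>>> graph = csr_matrix(np.array([[0, 1, 2, 0],
...                              [0, 0, 0, 1],
...                              [0, 0, 0, 3],
...                              [0, 0, 0, 0]]))
>>> dist_matrix = shortest_path(graph, method='D', directed=True)     # Dijkstra's algorithm
>>> n_components, labels = connected_components(graph, directed=False)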
Here is a summary of the contents:
Note that there are other sparse graph libraries for Python. One is Another Python Graph Library: https://pythonhosted.org/apgl/index.html.
Provides spatial algorithms and data structures.
Here is an example, copied from the documentation:
import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt

def test():
    points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]])
    tri = Delaunay(points)
    #
    # We can visualize it:
    plt.triplot(points[:, 0], points[:, 1], tri.simplices.copy())
    plt.plot(points[:, 0], points[:, 1], 'o')
    #
    # And add some further decorations:
    for j, p in enumerate(points):
        # label the points
        plt.text(p[0] - 0.03, p[1] + 0.03, j, ha='right')
    for j, s in enumerate(tri.simplices):
        p = points[s].mean(axis=0)
        # label triangles
        plt.text(p[0], p[1], '#%d' % j, ha='center')
    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.show()
    #
    # The structure of the triangulation is encoded in the following way: the
    # simplices attribute contains the indices of the points in the points array
    # that make up the triangle.  For instance:
    i = 1
    tri.simplices[i, :]
    points[tri.simplices[i, :]]
    return tri, points
Here is a summary of the contents of scipy.spatial (obtained by doing $ pydoc scipy.spatial):
Nearest-neighbor Queries:
Delaunay Triangulation, Convex Hulls, and Voronoi Diagrams:
Plotting Helpers:
Simplex representation:
The simplices (triangles, tetrahedra, ...) appearing in the Delaunay tesselation (N-dim simplices), convex hull facets, and Voronoi ridges (N-1 dim simplices) are represented in the following scheme:
tess = Delaunay(points)
hull = ConvexHull(points)
voro = Voronoi(points)

# coordinates of the j-th vertex of the i-th simplex
tess.points[tess.simplices[i, j], :]         # tesselation element
hull.points[hull.simplices[i, j], :]         # convex hull facet
voro.vertices[voro.ridge_vertices[i, j], :]  # ridge between Voronoi cells
For Delaunay triangulations and convex hulls, the neighborhood structure of the simplices satisfies the condition:
tess.neighbors[i,j] is the neighboring simplex of the i-th simplex, opposite to the j-vertex. It is -1 in case of no neighbor.
Convex hull facets also define a hyperplane equation:
(hull.equations[i,:-1] * coord).sum() + hull.equations[i,-1] == 0
Similar hyperplane equations for the Delaunay triangulation correspond to the convex hull facets on the corresponding N+1 dimensional paraboloid.
The Delaunay triangulation objects offer a method for locating the simplex containing a given point, and barycentric coordinate computations.
Functions:
This module contains a large number of probability distributions as well as a growing library of statistical functions.
Each univariate distribution is an instance of a subclass of rv_continuous (rv_discrete for discrete distributions).
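Here is a minimal sketch using the normal distribution (the parameters are chosen just for illustration):

>>> from scipy import stats
>>> stats.norm.pdf(0.0)                           # density of the standard normal at 0
0.3989422804014327
>>> stats.norm.cdf(1.96)                          # cumulative probability, about 0.975
>>> stats.norm.rvs(loc=10.0, scale=2.0, size=5)   # five random samples
>>> frozen = stats.norm(loc=10.0, scale=2.0)      # a "frozen" distribution
>>> frozen.mean(), frozen.std()
(10.0, 2.0)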
Here is a summary of the items in scipy.stats:
Continuous distributions
Multivariate distributions
Discrete distributions
Statistical functions -- Several of these functions have a similar version in scipy.stats.mstats which work for masked arrays.
Circular statistical functions
Contingency table functions
Plot-tests
Masked statistics functions -- Module scipy.stats.mstats contains statistical functions for masked arrays.
For more information in IPython, do:
In [1]: from scipy.stats import mstats
In [2]: mstats?
Or, from the command line do $ pydoc scipy.stats.mstats.
Univariate and multivariate kernel density estimation (scipy.stats.kde)
gaussian_kde -- Representation of a kernel-density estimate using Gaussian kernels.
Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.
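Here is a small, hedged sketch of gaussian_kde on some made-up, normally distributed samples:

>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.RandomState(0)
>>> samples = rng.normal(loc=0.0, scale=1.0, size=500)   # pretend these are measurements
>>> kde = stats.gaussian_kde(samples)                    # automatic bandwidth selection
>>> grid = np.linspace(-3.0, 3.0, 61)
>>> density = kde(grid)                                  # the estimated PDF on the grid
>>> kde.integrate_box_1d(-1.0, 1.0)                      # estimated P(-1 < X < 1)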
For many more statistics-related functions, install the software R and the interface package rpy.
The module scipy.ndimage contains various functions for multi-dimensional image processing.
For information on these functions, do (for example, in IPython):
In [6]: from scipy import ndimage
In [7]: ndimage?
In [8]: ndimage.convolve?
Or, from the command line, do: $ pydoc scipy.ndimage.convolve.
Here is an example -- It computes the multi-dimensional convolution of a Numpy ndarray:
import numpy as np
from scipy import ndimage

def test():
    a = np.array([[1, 2, 0, 0],
                  [5, 3, 0, 4],
                  [0, 0, 0, 7],
                  [9, 3, 0, 0]])
    k = np.array([[1, 1, 1],
                  [1, 1, 0],
                  [1, 0, 0]])
    result = ndimage.convolve(a, k, mode='constant', cval=0.0)
    return result
Here is a summary of the contents of scipy.ndimage:
Scipy provides routines to read/write a number of special file formats. Here are some of them:
Pandas vs. Numpy -- Pandas raises Numpy data structures to a higher level. In particular, see the DataFrame object.
For documentation on Pandas, see: http://pandas.pydata.org/pandas-docs/stable/. There are tutorials, get-started guides, cookbook docs, and more.
10 Minutes to pandas seems especially helpful, although it does contain a lot more than 10 minutes worth of material. It gives basic instructions on how to use Pandas data types.
And, be sure to look at the various Pandas tutorials.
There are also cookbooks full of code snippets:
Perhaps it's advisable to view Pandas as being just as much about learning techniques for (1) cleaning up your data, (2) exploring and finding significant aspects of your data, and (3) viewing and displaying your data, as it is about performing calculations and analysis on it. Pandas provides such a rich set of techniques for working with your data that you should expect to spend a reasonable amount of time learning to do the tasks you need, rather than quickly learning some small set of functions.
Here is an example that creates several of the Pandas data structures that are used in the "10 Minutes to pandas" document referenced above:
def make_sample_dataframe():
    """Make sample dates and DataFrame.  Returns (dates, df)."""
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(
        np.random.randn(6, 4),
        index=dates,
        columns=list('ABCD'))
    return dates, df
And, here is an example of the use of the above function:
In [117]: import utils01
In [118]: dates, df = utils01.make_sample_dataframe()
In [119]:
In [119]: dates
Out[119]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [120]:
In [120]: df
Out[120]:
                   A         B         C         D
2013-01-01  0.521515  1.006002 -1.408913 -0.218981
2013-01-02 -0.517541 -0.190499  0.397701  0.895858
2013-01-03  0.068253  0.499286 -1.098401 -1.323183
2013-01-04 -0.086779  0.025269  0.459892  0.588754
2013-01-05  1.384825 -1.141312  0.097294  0.169665
2013-01-06 -0.391738 -0.072600  0.196514  0.799174
View the first and last rows of a DataFrame:
In [34]: df.head(n=2)
Out[34]:
                   A         B         C         D
2013-01-01 -0.557541  1.016474  0.933149 -0.524661
2013-01-02  1.682318 -1.605635 -0.324727  2.057636
In [35]:
In [35]: df.tail(n=3)
Out[35]:
                   A         B         C         D
2013-01-04  0.696414  0.538999  1.131596 -0.960681
2013-01-05 -0.175765 -0.494210  1.111779 -0.670209
2013-01-06 -1.615098  0.018027  0.584815 -1.508152
Get the shape, column (labels), and actual data from a DataFrame:
In [38]: df.shape
Out[38]: (6, 4)
In [39]: df.columns
Out[39]: Index(['A', 'B', 'C', 'D'], dtype='object')
In [40]: df.values
Out[40]:
array([[-0.55754086,  1.01647419,  0.93314867, -0.52466119],
       [ 1.68231758, -1.60563477, -0.32472655,  2.05763649],
       [-0.45481149, -0.09087637, -1.1383327 , -0.7950994 ],
       [ 0.69641379,  0.53899898,  1.13159619, -0.96068123],
       [-0.17576451, -0.49421043,  1.11177912, -0.67020918],
       [-1.61509837,  0.01802738,  0.58481469, -1.50815216]])
In [41]: type(df.values)
Out[41]: numpy.ndarray
Note that df.values returns an ndarray.
Access a row or range of rows -- Use .iloc with a single index or a slice. Examples:
In [72]: df.iloc[1]
Out[72]:
A    0.721339
B    0.733763
C   -1.153457
D   -1.345582
Name: 2013-01-02 00:00:00, dtype: float64
In [73]: df.iloc[1:2]
Out[73]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
In [74]: df.iloc[1:4]
Out[74]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
2013-01-03  2.047318  0.406103 -1.893892  0.065913
2013-01-04  0.737643 -1.539155  0.410927  0.038682
Access a row or range of rows -- Use .loc with index labels. Examples:
In [64]: df.loc[dates[1]]
Out[64]:
A    0.721339
B    0.733763
C   -1.153457
D   -1.345582
Name: 2013-01-02 00:00:00, dtype: float64
In [65]: df.loc[dates[1]:dates[2]]
Out[65]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
2013-01-03  2.047318  0.406103 -1.893892  0.065913
In [66]: df.loc[dates[1]:dates[1]]
Out[66]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
In [67]: df.loc['2013-01-01']
Out[67]:
A    1.373992
B   -0.080698
C   -0.018425
D   -0.424205
Name: 2013-01-01 00:00:00, dtype: float64
In [68]: df.loc['2013-01-01':'2013-01-03']
Out[68]:
                   A         B         C         D
2013-01-01  1.373992 -0.080698 -0.018425 -0.424205
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
2013-01-03  2.047318  0.406103 -1.893892  0.065913
Notes:
dates was used to create the index for df:
def make_sample_dataframe1():
    """Make sample dates and DataFrame.  Returns (dates, df)."""
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(
        np.random.randn(6, 4),
        index=dates,
        columns=list('ABCD'))
    return dates, df
Access the rows where the content of an item (column) in that row satisfies a condition or test:
In [10]: df.loc[df.B > 0].head()
Out[10]:
    Unnamed: 0         A         B         C         D
2   2013-01-03  0.986316  1.870495 -1.598345 -2.551315
5   2013-01-06  1.385534  1.328005  1.741578 -0.409209
7   2013-01-08 -0.820344  0.318531  0.278434 -0.898119
9   2013-01-10 -2.342766  0.048417 -0.352930 -0.134832
20  2013-01-21 -0.567319  1.784550 -0.114723  0.315661
Or:
In [9]: df.loc[df.B.apply(lambda x: x > 0)].head()
Out[9]:
    Unnamed: 0         A         B         C         D
2   2013-01-03  0.986316  1.870495 -1.598345 -2.551315
5   2013-01-06  1.385534  1.328005  1.741578 -0.409209
7   2013-01-08 -0.820344  0.318531  0.278434 -0.898119
9   2013-01-10 -2.342766  0.048417 -0.352930 -0.134832
20  2013-01-21 -0.567319  1.784550 -0.114723  0.315661
Notes:
The use of .apply() along with lambda (or a named Python function) enables us to select rows based on an arbitrarily complex condition.
Also, consider using functools.partial(). The following selects rows where the value in column B is in the range -0.1 to 0.1:
In [25]: import functools
In [26]: f = functools.partial(lambda x, y, z: z > x and z < y, -0.1, 0.1)
In [27]:
In [27]: df.loc[df.B.apply(f)].head()
Out[27]:
    Unnamed: 0         A         B         C         D
9   2013-01-10 -2.342766  0.048417 -0.352930 -0.134832
27  2013-01-28 -0.673330  0.075427 -0.477715 -0.475463
33  2013-02-03 -0.776301  0.015220  0.518606 -0.286090
38  2013-02-08  0.894722  0.005027 -0.763636 -0.150279
44  2013-02-14 -0.403519 -0.059570  0.929560 -1.065283
Access a column or several columns -- Use the Python indexing operator ([]), with a column label or iterable of column labels. Or, for a single column, use dot notation. Examples:
In [98]: df['B']
Out[98]:
2013-01-01   -0.080698
2013-01-02    0.733763
2013-01-03    0.406103
2013-01-04   -1.539155
2013-01-05   -0.963585
2013-01-06    0.934215
Freq: D, Name: B, dtype: float64
In [99]: df[['B', 'D']]
Out[99]:
                   B         D
2013-01-01 -0.080698 -0.424205
2013-01-02  0.733763 -1.345582
2013-01-03  0.406103  0.065913
2013-01-04 -1.539155  0.038682
2013-01-05 -0.963585 -0.449162
2013-01-06  0.934215  1.473294
In [100]:
In [100]: df.C
Out[100]:
2013-01-01   -0.018425
2013-01-02   -1.153457
2013-01-03   -1.893892
2013-01-04    0.410927
2013-01-05   -1.627970
2013-01-06    0.240306
Freq: D, Name: C, dtype: float64
Access individual elements by index relative to zero -- Use .iloc[r, c]:
In [42]: df.iloc[0]
Out[42]:
A    1.373992
B   -0.080698
C   -0.018425
D   -0.424205
Name: 2013-01-01 00:00:00, dtype: float64
In [43]: df.iloc[0, 1]
Out[43]: -0.08069801201343964
In [44]: df.iloc[0, 1:3]
Out[44]:
B   -0.080698
C   -0.018425
Name: 2013-01-01 00:00:00, dtype: float64
In [45]: df.iloc[0:4, 1]
Out[45]:
2013-01-01   -0.080698
2013-01-02    0.733763
2013-01-03    0.406103
2013-01-04   -1.539155
Freq: D, Name: B, dtype: float64
In [46]: df.iloc[0:4, 1:-1]
Out[46]:
                   B         C
2013-01-01 -0.080698 -0.018425
2013-01-02  0.733763 -1.153457
2013-01-03  0.406103 -1.893892
2013-01-04 -1.539155  0.410927
In [47]: df.iloc[0:4, 1:]
Out[47]:
                   B         C         D
2013-01-01 -0.080698 -0.018425 -0.424205
2013-01-02  0.733763 -1.153457 -1.345582
2013-01-03  0.406103 -1.893892  0.065913
2013-01-04 -1.539155  0.410927  0.038682
There are several ways to do this. Here are some examples:
import utils01

def test():
    dates, df = utils01.make_sample_dataframe1()
    # iterate over column labels.
    print("*\n* column labels --\n*")
    print([x for x in df])
    # iterate over items
    print("*\n* items --\n*")
    print([x for x in df.head(n=2).iteritems()])
    # iterate over rows
    print("*\n* rows --\n*")
    print([x for x in df.head(n=2).iterrows()])
    # iterate over rows as named tuples.
    print("*\n* named tuples --\n*")
    print([x for x in df.head(n=2).itertuples()])
    # iterate over rows as named tuples returning one column from each tuple.
    print("*\n* column \"B\" from named tuple --\n*")
    print([x.B for x in df.head(n=2).itertuples()])
Here is the output from the above function:
In [67]: test()
*
* column labels --
*
['A', 'B', 'C', 'D']
*
* items --
*
[('A', 2013-01-01   -2.443710
2013-01-02   -1.003475
Freq: D, Name: A, dtype: float64), ('B', 2013-01-01   -0.320540
2013-01-02   -1.020769
Freq: D, Name: B, dtype: float64), ('C', 2013-01-01    0.010302
2013-01-02    0.115615
Freq: D, Name: C, dtype: float64), ('D', 2013-01-01    0.935831
2013-01-02   -0.514601
Freq: D, Name: D, dtype: float64)]
*
* rows --
*
[(Timestamp('2013-01-01 00:00:00', freq='D'), A   -2.443710
B   -0.320540
C    0.010302
D    0.935831
Name: 2013-01-01 00:00:00, dtype: float64), (Timestamp('2013-01-02 00:00:00', freq='D'), A   -1.003475
B   -1.020769
C    0.115615
D   -0.514601
Name: 2013-01-02 00:00:00, dtype: float64)]
*
* named tuples --
*
[Pandas(Index=Timestamp('2013-01-01 00:00:00', freq='D'), A=-2.4437103289150857, B=-0.32054023603910436, C=0.01030189942471081, D=0.9358311337233644),
 Pandas(Index=Timestamp('2013-01-02 00:00:00', freq='D'), A=-1.0034752077816913, B=-1.0207687970125863, C=0.11561494820245698, D=-0.5146012044818192)]
*
* column "B" from named tuple --
*
[-0.32054023603910436, -1.0207687970125863]
Iterating over a pandas.DataFrame produces the column labels, which can be used to access the columns of the DataFrame. Example:
In [92]: for column in df:
    ...:     print("{}[0]: {:7.3f}".format(column, getattr(df, column)[0]))
    ...:
A[0]:  -0.368
B[0]:   1.122
C[0]:  -0.890
D[0]:   0.076
An easier (and cleaner?) way to access a column would be: df[column].
In contrast, iterating over a pandas.Series produces the items in the Series. Example (note that dates is a DatetimeIndex, which also iterates over its items):
In [112]: for date in dates:
     ...:     print('date: {}/{}/{}'.format(date.month, date.day, date.year))
     ...:
date: 1/1/2013
date: 1/2/2013
date: 1/3/2013
date: 1/4/2013
date: 1/5/2013
date: 1/6/2013
Here is a simple bit of code that iterates over each of the items (cells) in a Pandas DataFrame. This function prints out elements column by column:
def show_df(df):
    for idx1, label in enumerate(df):
        print('{}. Column: {}'.format(idx1, label))
        for idx2, item in enumerate(df[label]):
            print('    {}.{}. {:+6.4f}'.format(idx1, idx2, item))
And, here is what the above (function show_df) might display:
In [78]: show_df(df.head(n=2))
0. Column: A
    0.0. +0.9590
    0.1. -3.6568
1. Column: B
    1.0. +1.1409
    1.1. -0.4395
2. Column: C
    2.0. +1.2634
    2.1. -0.3644
3. Column: D
    3.0. +0.0824
    3.1. +1.1789
And, here is a function that prints out elements row by row (i.e. one row after another):
def show_df_by_rows(df):
    columns = df.columns
    for row, index in enumerate(df.index):
        print('{}. Row: {}'.format(row, index))
        for idx, item in enumerate(df.loc[index]):
            print('    {}.{}. {:+6.4f}'.format(idx, columns[idx], item))
Here is a sample printout from the above function:
0. Row: 2013-01-01 00:00:00
    0.A. +0.9590
    1.B. +1.1409
    2.C. +1.2634
    3.D. +0.0824
1. Row: 2013-01-02 00:00:00
    0.A. -3.6568
    1.B. -0.4395
    2.C. -0.3644
    3.D. +1.1789
You can do something analogous with list comprehensions or generator expressions. For example, consider this code:
def show_dataframe(df):
    generator = ((index, b.items())
                 for (index, b) in ((index, df.loc[index]) for index in df.index))
    for date, data in generator:
        print('date: {}'.format(date))
        for col, item in data:
            print('    col: {}  item: {:12.4f}'.format(col, item))
When we run the above, calling show_dataframe, we might see:
In [90]: show_dataframe(df.tail(2))
date: 2013-01-05 00:00:00
    col: A  item:       0.2175
    col: B  item:       0.1573
    col: C  item:      -0.2240
    col: D  item:       0.2395
date: 2013-01-06 00:00:00
    col: A  item:       0.1440
    col: B  item:      -0.9796
    col: C  item:      -2.2432
    col: D  item:      -0.7182
Notes:
You can group items in a DataFrame according to some criteria, then process only items in that group. For example:
In [363]: dates, df = utils01.make_sample_dataframe1()
In [364]: df
Out[364]:
                   A         B         C         D
2013-01-01  0.286823 -0.490076  1.876985  0.900970
2013-01-02  0.338896 -0.111205 -1.516295  1.344511
2013-01-03 -1.045215 -0.155277 -0.238831  0.763586
2013-01-04  0.911923  0.383383 -1.838096 -0.233212
2013-01-05 -0.424031 -0.396694 -1.260573  1.912463
2013-01-06  1.198149 -0.729439  1.578052 -1.139293
In [365]: f1 = lambda x: 0 if x < 0.0 else 1
In [366]: df["E"] = [f1(x) for x in df.A]
In [367]: df
Out[367]:
                   A         B         C         D  E
2013-01-01  0.286823 -0.490076  1.876985  0.900970  1
2013-01-02  0.338896 -0.111205 -1.516295  1.344511  1
2013-01-03 -1.045215 -0.155277 -0.238831  0.763586  0
2013-01-04  0.911923  0.383383 -1.838096 -0.233212  1
2013-01-05 -0.424031 -0.396694 -1.260573  1.912463  0
2013-01-06  1.198149 -0.729439  1.578052 -1.139293  1
In [368]: groups = df.groupby("E")
In [369]:
In [369]: len(groups)
Out[369]: 2
In [371]: groups.get_group(0)
Out[371]:
                   A         B         C         D  E
2013-01-03 -1.045215 -0.155277 -0.238831  0.763586  0
2013-01-05 -0.424031 -0.396694 -1.260573  1.912463  0
In [372]:
In [372]: groups.get_group(1)
Out[372]:
                   A         B         C         D  E
2013-01-01  0.286823 -0.490076  1.876985  0.900970  1
2013-01-02  0.338896 -0.111205 -1.516295  1.344511  1
2013-01-04  0.911923  0.383383 -1.838096 -0.233212  1
2013-01-06  1.198149 -0.729439  1.578052 -1.139293  1
Notes:
An alternative way to do the above task would be to pass a function to the .groupby method. That function could assign or select rows in arbitrarily complex ways. For example, the following function could assign items to two groups depending on whether the value in column "A" is negative or positive:
In [33]: def f1(index):
    ...:     return 1 if df.loc[index].A < 0.0 else 0
    ...:
    ...:
In [34]:
In [34]: a = df.groupby(f1)
In [35]:
In [35]: len(a)
Out[35]: 2
In [36]:
In [36]: a.get_group(0)
Out[36]:
                   A         B         C         D  E
2013-01-01  0.823745  1.259863  0.099038  2.401296  0
2013-01-03  1.067624  1.106958  1.616902  0.939021  0
2013-01-04  1.152899  0.190998 -0.062540 -1.786131  0
2013-01-06  0.680271  1.307369 -0.024296 -0.973855  0
In [37]:
In [37]: a.get_group(1)
Out[37]:
                   A         B         C         D  E
2013-01-02 -0.358235 -1.920455 -0.553173  0.580201  1
2013-01-05 -0.226727  0.180529  0.900700 -1.835082  1
You can do this in a variety of ways:
Element-wise -- Use .map for Series and .applymap for DataFrame:
In [171]: dates.map(lambda x: x.day)
Out[171]: Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
In [172]: df.applymap(lambda x: 0.0 if x < 0.0 else x * 10.0)
Out[172]:
                   A          B          C         D
2013-01-01  0.000000  11.222224   0.000000  0.764820
2013-01-02  8.165304   0.000000   8.425176  0.000000
2013-01-03  0.000000   7.066568  10.162480  0.000000
2013-01-04  7.097722   0.000000  10.544352  2.593139
2013-01-05  0.000000   0.000000  10.031058  6.354610
2013-01-06  5.629199   1.180783   0.000000  0.000000
Row-wise and column-wise -- Use .apply (and the related .agg and .transform methods), with the axis argument selecting whether the function is applied to each column or to each row.
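For example, a short, hedged sketch (df is the sample DataFrame created above):

df.apply(np.mean)                       # apply column-wise: one value per column
df.apply(np.mean, axis=1)               # apply row-wise: one value per row
df.agg(['min', 'max'])                  # several aggregations at once
df.transform(lambda x: x - x.mean())    # transform each column, keeping the shape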
For functions that take and return a DataFrame or that take and return a Series, use .pipe. Example:
In [197]: fn = lambda x: np.abs(x)
In [198]: df.pipe(fn)
Out[198]:
                   A         B         C         D
2013-01-01  0.368409  1.122222  0.889764  0.076482
2013-01-02  0.816530  0.963447  0.842518  1.371106
2013-01-03  0.164827  0.706657  1.016248  0.474849
2013-01-04  0.709772  1.695648  1.054435  0.259314
2013-01-05  0.057673  0.713738  1.003106  0.635461
2013-01-06  0.562920  0.118078  1.904701  0.149196
And, remember that there may be use cases where it is useful to create a "vectorized" function with numpy.vectorize.
You can sort by index, value, etc. See: http://pandas.pydata.org/pandas-docs/stable/basics.html#sorting.
You can do preliminary and rudimentary statistical analysis. See: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics.
For more complex work, consider using the Scipy tools.
Examples:
In [65]: df.describe()
Out[65]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.255717 -0.067143  0.211290 -0.127586
std    1.102925  0.651381  0.663725  0.691202
min   -0.746677 -1.277578 -0.445694 -1.101834
25%   -0.415984 -0.110226 -0.142937 -0.473979
50%   -0.111748  0.004162 -0.060588 -0.210746
75%    0.545268  0.374949  0.470344  0.363150
max    2.257601  0.516208  1.357676  0.765088
In [66]:
In [66]: sp.mean(df.A)
Out[66]: 0.2557174574376679
In [67]:
In [67]: sp.std(df.A, ddof=1)
Out[67]: 1.102925321931004
See: https://bokeh.pydata.org/en/latest/
Here are Bokeh examples taken from the documentation:
#!/usr/bin/env python

from bokeh.plotting import figure, output_file, show

def test01():
    # prepare some data
    x = [1, 2, 3, 4, 5]
    y = [6, 7, 2, 4, 5]
    # output to static HTML file
    output_file("lines.html")
    # create a new plot with a title and axis labels
    p = figure(title="simple line example",
               x_axis_label='x', y_axis_label='y')
    # add a line renderer with legend and line thickness
    p.line(x, y, legend="Temp.", line_width=2)
    # show the results
    show(p)

def test02():
    # prepare some data
    x = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
    y0 = [i**2 for i in x]
    y1 = [10**i for i in x]
    y2 = [10**(i**2) for i in x]
    # output to static HTML file
    output_file("log_lines.html")
    # create a new plot
    p = figure(
        tools="pan,box_zoom,reset,save",
        y_axis_type="log", y_range=[0.001, 10**11],
        title="log axis example",
        x_axis_label='sections', y_axis_label='particles'
    )
    # add some renderers
    p.line(x, x, legend="y=x")
    p.circle(x, x, legend="y=x", fill_color="white", size=8)
    p.line(x, y0, legend="y=x^2", line_width=3)
    p.line(x, y1, legend="y=10^x", line_color="red")
    p.circle(
        x, y1, legend="y=10^x", fill_color="red", line_color="red", size=6)
    p.line(x, y2, legend="y=10^x^2", line_color="orange", line_dash="4 4")
    # show the results
    #show(p, browser="firefox")
    show(p)

def main():
    test01()
    test02()

if __name__ == '__main__':
    main()
There are more examples in the Bokeh "Quickstart" document: https://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart
See: https://pypi.python.org/pypi/altair
Note that Altair is not in the Anaconda distribution, but is easy to install with pip.
See: http://numba.pydata.org/numba-doc/dev/index.html.
And, here is an interesting article related to Numba: https://www.anaconda.com/blog/developer-blog/parallel-python-with-numba-and-parallelaccelerator/.
From the Numba user manual:
Numba is a compiler for Python array and numerical functions that gives you the power to speed up your applications with high performance functions written directly in Python. Numba generates optimized machine code from pure Python code using the LLVM compiler infrastructure. With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time optimized to performance similar as C, C++ and Fortran, without having to switch languages or Python interpreters. Numba’s main features are:
- on-the-fly code generation (at import time or runtime, at the user’s preference)
- native code generation for the CPU (default) and GPU hardware
- integration with the Python scientific software stack (thanks to Numpy)
Here is some sample test code, copied from the Numba documentation:
# file: numba_test01.py

import numba

@numba.jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

def plain_sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result
And, here is an example that calls the two above functions, one optimized by Numba and the other not. Notice the timings. The Numba optimized version is more than two orders of magnitude faster:
In [30]: import numba_test01 as nt
In [31]: a = np.ones((1000, 1200))
In [32]: time nt.plain_sum2d(a)
CPU times: user 621 ms, sys: 0 ns, total: 621 ms
Wall time: 622 ms
Out[32]: 1200000.0
In [33]: time nt.sum2d(a)
CPU times: user 3.68 ms, sys: 0 ns, total: 3.68 ms
Wall time: 3.7 ms
Out[33]: 1200000.0
There is lots more that can be done with Numba in the way of optimizing code. See the docs.
The documentation on Dask can be found here: http://dask.pydata.org/en/latest/docs.html.
This summary of Dask is from the Dask documentation:
Dask is a flexible parallel computing library for analytic computing. Dask is composed of two components:
1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
2. “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
If you are beginning to learn Dask, you might want some sample data:
The dask tutorial contains a script for generating sample data files. You can find the tutorial repository here: https://github.com/dask/dask-tutorial.
And, here is a script that will generate a few HDF5 files. I copied it from the Dask Web site (http://dask.pydata.org/en/latest/examples/dataframe-hdf5.html), and made a few minor modifications:
#!/usr/bin/env python

"""
synopsis:
    generate sample dask data files.
usage:
    python generate_dask_data.py <file_prefix>
options:
    -h, --help    Display this help.
"""

import sys
import string
import random
import pandas as pd
import numpy as np

def generate(prefix):
    # dict to keep track of hdf5 filename and each key
    fileKeys = {}
    for i in range(10):
        # randomly pick letter as dataset key
        groupkey = random.choice(list(string.ascii_lowercase))
        # randomly pick a number as hdf5 filename
        filename = prefix + str(np.random.randint(100)) + '.h5'
        # Make a dataframe; 26 rows, 2 columns
        df = pd.DataFrame({'x': np.random.randint(1, 1000, 26),
                           'y': np.random.randint(1, 1000, 26)},
                          index=list(string.ascii_lowercase))
        # Write hdf5 to current directory
        df.to_hdf(filename, key='/' + groupkey, format='table')
        fileKeys[filename] = groupkey
    # prints hdf5 filenames and keys for each
    print(fileKeys)

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        sys.exit(__doc__)
    if args[0] in ('-h', '--help'):
        sys.exit(__doc__)
    prefix = args[0]
    generate(prefix)

if __name__ == '__main__':
    main()
I used the above script to build sample data files as follows:
$ ./generate_dask_data.py "data02/sample_"
Then I read these HDF5 files into a Dask DataFrame by using the following:
In [38]: df = dd.read_hdf('./data02/sample_*.h5', key='/*')
In [39]: df
Out[39]:
Dask DataFrame Structure:
                    x      y
npartitions=10
                int64  int64
                  ...    ...
...               ...    ...
                  ...    ...
                  ...    ...
Dask Name: concat, 22 tasks
In [40]:
After which, I can do the following, for example:
In [40]: df.x.mean().compute() Out[40]: 501.53076923076924
To see how our data has been broken down into separate partitions, we can use this function:
def test(df):
    results = []
    for idx in range(df.npartitions):
        mean = df.get_partition(idx).x.mean().compute()
        print('partition: {} mean: {}'.format(idx, mean))
        results.append((idx, mean))
    return results
Which produces something like the following:
In [10]: test(df)
partition: 0 mean: 473.7692307692308
partition: 1 mean: 436.5769230769231
partition: 2 mean: 501.2692307692308
partition: 3 mean: 565.4230769230769
partition: 4 mean: 516.8846153846154
partition: 5 mean: 501.34615384615387
partition: 6 mean: 531.3076923076923
partition: 7 mean: 428.61538461538464
partition: 8 mean: 565.2307692307693
partition: 9 mean: 494.88461538461536
Out[10]:
[(0, 473.7692307692308),
 (1, 436.5769230769231),
 (2, 501.2692307692308),
 (3, 565.4230769230769),
 (4, 516.8846153846154),
 (5, 501.34615384615387),
 (6, 531.3076923076923),
 (7, 428.61538461538464),
 (8, 565.2307692307693),
 (9, 494.88461538461536)]
Dask enables you to divide a large data structure or data set, for example, a Pandas DataFrame, into smaller structures, for example, smaller DataFrames, then load those smaller chunks from disk and process them.
Example:
First we'll create a data set, a Pandas DataFrame, that we can divide up into smaller chunks. Here is a Python script that we can use to create a sample CSV (comma separated values) file:
#!/usr/bin/env python
# file: write_csv.py

"""
synopsis:
    Write sample CSV file from Pandas DataFrame.
usage:
    python write_csv.py <outfilename> <num_rows>
example:
    python write_csv.py test_data.csv 200
"""

import sys
import numpy as np
import pandas as pd

def make_sample_dataframe(periods):
    """Make sample dates and DataFrame.  Returns (dates, df)."""
    dates = pd.date_range('20130101', periods=periods)
    df = pd.DataFrame(
        np.random.randn(periods, 4),
        index=dates,
        columns=list('ABCD'))
    return dates, df

def create_data(outfilename, count):
    dates, df = make_sample_dataframe(count)
    df.to_csv(outfilename)

def main():
    args = sys.argv[1:]
    if len(args) != 2:
        sys.exit(__doc__)
    outfilename = args[0]
    count = int(args[1])
    create_data(outfilename, count)

if __name__ == '__main__':
    main()
And, from within IPython, we can run it to create a CSV file as follows:
In [113]: %run write_csv.py tmp2.csv 200
Now, we can read that file to create a Dask DataFrame with the following:
In [115]: import dask.dataframe as dd
In [116]: daskdf = dd.read_csv('tmp2.csv')
We can look at our data with daskdf.head() and daskdf.tail():
In [117]: daskdf.head()
Out[117]:
   Unnamed: 0         A         B         C         D
0  2013-01-01  1.719008  0.168998 -0.582670 -0.199597
1  2013-01-02  0.947192  1.449137 -0.701263  0.342353
2  2013-01-03  1.321397  0.035692  0.147275  1.551782
3  2013-01-04 -0.286258  0.592772  1.770504  1.752572
4  2013-01-05  1.695924  0.159782  2.150698 -0.060106
In [118]: daskdf.tail()
Out[118]:
     Unnamed: 0         A         B         C         D
195  2013-07-15  0.303020  0.710051 -0.904407 -0.451793
196  2013-07-16 -0.703248 -0.973423 -0.830585  0.183094
197  2013-07-17  0.886046  1.530008  1.319875 -0.318807
198  2013-07-18  0.021749  2.570984  0.572013  1.249558
199  2013-07-19 -0.570810 -0.240768  2.203662 -0.014111
Also see the Pandas section for ways to view structures, for example: View Pandas data structures
Next, we'll divide it up -- This is an important capability of Dask; it enables us to process DataFrames/arrays that are either too large to fit comfortably in memory or of which we only need sub-slices. In this case, we'll specify a block size (or a partition size) when we read the CSV file and create a Dask DataFrame:
In [58]: %run write_csv.py tmp4.csv 500
In [59]:
In [59]: df3 = dd.read_csv('tmp3.csv', blocksize=600)
In [60]:
In [60]: df3.head()
Out[60]:
   Unnamed: 0         A         B         C         D
0  2013-01-01  1.907704  0.317188  0.779075  0.327731
1  2013-01-02 -0.936242 -0.679869 -0.817254 -0.810020
2  2013-01-03 -1.465717 -0.775163 -0.621830 -0.171773
3  2013-01-04  0.878534 -0.910678 -0.363762  0.462970
4  2013-01-05 -0.182779  0.174225 -1.483841 -0.062528
In [61]: df3.tail()
Out[61]:
  Unnamed: 0         A         B         C         D
0 2013-07-15  0.426699 -2.126057 -0.784172  0.780982
1 2013-07-16 -0.727647 -1.552699  0.750276 -0.788475
2 2013-07-17  0.452168 -0.525214  0.003892 -0.029953
3 2013-07-18 -1.135117  0.626181 -0.895456  2.096875
4 2013-07-19  1.365505 -0.208806  0.115254 -1.210855
In [62]:
In [62]: df3.A.mean().compute()
Out[62]: 0.04365032375682896
In [63]:
And, now, we'll process that data chunk by chunk:
In [63]: for idx in range(df3.npartitions):
    ...:     data = df3.get_partition(idx)
    ...:     mean = data.A.mean().compute()
    ...:     print('partition: {} mean: {}'.format(idx, mean))
    ...:
partition: 0 mean: 0.1307434691610682
partition: 1 mean: -0.10723637021736673
partition: 2 mean: 0.47059788011488657
partition: 3 mean: -0.029706498960742605
partition: 4 mean: 0.06754303873144374
partition: 5 mean: 0.1604556981338858
partition: 6 mean: -0.4161510144675041
partition: 7 mean: 0.6799116374415602
partition: 8 mean: 0.6303390153859068
partition: 9 mean: 0.6517677726166038
partition: 10 mean: -0.02111769936010994
  o
  o
  o
In [64]:
Notes:
Dask enables you to describe a complex process in terms of an execution graph: a digraph (directed graph) whose nodes are sub-processes. The valuable thing about being able to do so is that Dask can schedule the execution of that larger process so that some sub-processes are executed in parallel. On multi-CPU/multi-core hardware, this can be a big win.
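Here is a minimal, hedged sketch of building such a graph with dask.delayed (the load and process functions are made up; Dask only runs them when .compute() is called, and independent tasks may run in parallel):

from dask import delayed

@delayed
def load(i):
    # stand-in for an expensive load step
    return list(range(i * 10, i * 10 + 10))

@delayed
def process(chunk):
    # stand-in for an expensive per-chunk computation
    return sum(chunk)

def build_and_run():
    chunks = [load(i) for i in range(4)]        # four independent load tasks
    partials = [process(chunk) for chunk in chunks]
    total = delayed(sum)(partials)              # combine the partial results
    return total.compute()                      # execute the whole graph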
Dask supports parallel processing both on a single machine and on multiple, distributed machines. In what follows, however, I will discuss parallel computation on a single machine.
To learn more about this, you will want to read the following:
Controlling parallelism in Dask requires understanding Dask schedulers, how they are used by Dask, and how to use them.
Note that Dask has default schedulers. If you do nothing to change or set the scheduler, you will be using the default, which is most often what you want. The notes that follow will attempt to help you determine when and under what conditions you might want to use a different scheduler and how to do that.
Also, keep in mind two concepts that are both related to optimization in Dask: (1) Parallelism is what you want when you have multiple tasks and want to speed them up by running/computing them in parallel. (2) Breaking your data and your Dask data collections into chunks is what you want when your data set is very large and will not fit in memory. You should keep in mind that breaking your data into chunks may slow down processing. Here is something that shows some of those differences:
In [57]: df1 = dd.read_csv('tmp5.csv', blocksize=1000000)
In [58]: df2 = dd.read_csv('tmp5.csv', blocksize=8000)
In [59]:
In [59]: df1.npartitions
Out[59]: 1
In [60]: df2.npartitions
Out[60]: 12
In [61]: df1.get_partition(0).size.compute()
Out[61]: 5000
In [62]: df2.get_partition(0).size.compute()
Out[62]: 450
In [63]:
In [63]: time df1.A.mean().compute()
CPU times: user 15.8 ms, sys: 7.5 ms, total: 23.3 ms
Wall time: 22.3 ms
Out[63]: 0.02893067882172706
In [64]: time df2.A.mean().compute()
CPU times: user 167 ms, sys: 9.85 ms, total: 177 ms
Wall time: 164 ms
Out[64]: 0.028930678821727045
In [65]:
Notes:
Synchronous processing on the local machine -- The default scheduler does that.
Let's figure out how to do that in parallel. For example, we'll try to compute the mean of each of the columns of our dataframe (four columns: "A", "B", "C", and "D") in parallel.
Here are two functions. One computes the mean for each column in our DataFrame, one column after another. The other attempts to use dask.distributed to schedule these four tasks so that they make use of more than one CPU core:
def compute_means_sequential(df):
    """
    Sequentially compute the means of columns of dataframe.

    Args:
        df (dask.dataframe.DataFrame) -- A dataframe containing
            columns A, B, C, and D.
    Return:
        The means
    """
    meanA = df.A.mean().compute()
    meanB = df.B.mean().compute()
    meanC = df.C.mean().compute()
    meanD = df.D.mean().compute()
    return meanA, meanB, meanC, meanD

def compute_means_parallel(client, df):
    """
    Compute in parallel the means of columns of dataframe.

    Args:
        client (dask.distributed.Client) -- The client to schedule
            the computation.
        df (dask.dataframe.DataFrame) -- A dataframe containing
            columns A, B, C, and D.
    Return:
        The means
    """
    meanA = client.submit(df.A.mean().compute)
    meanB = client.submit(df.B.mean().compute)
    meanC = client.submit(df.C.mean().compute)
    meanD = client.submit(df.D.mean().compute)
    client.gather((meanA, meanB, meanC, meanD))
    return meanA.result(), meanB.result(), meanC.result(), meanD.result()
You can find a file containing these snippets here: snippets.py.
Here is a test that uses the above on a 2-core machine:
In [17]: time snippets.compute_means_sequential(df1)
CPU times: user 167 ms, sys: 21.3 ms, total: 189 ms
Wall time: 379 ms
Out[17]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)
In [18]: time snippets.compute_means_parallel(client, df1)
CPU times: user 189 ms, sys: 16.9 ms, total: 206 ms
Wall time: 281 ms
Out[18]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)
Here is a test that uses the above on a 4-core machine:
In [15]: time snippets.compute_means_sequential(df1)
CPU times: user 160 ms, sys: 9.5 ms, total: 169 ms
Wall time: 303 ms
Out[15]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)
In [16]:
In [16]: time snippets.compute_means_parallel(client, df1)
CPU times: user 164 ms, sys: 5.03 ms, total: 169 ms
Wall time: 224 ms
Out[16]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)
Notes:
See: http://cython.org/.
Cython enables us to write or produce C code while writing code in the style of Python. There's more to it than that, but you get the idea. We can write code that looks a lot like Python code, and then use Cython to turn it into C code.
Cython has another important use -- Because (1) Cython gives us easy access to libraries of compiled C code and (2) it is easy to write functions in Cython that can be called from Python, we can use it to easily "wrap" C functions for use in Python. In fact, if you look inside some Python packages, for example Lxml, you will see wrappers for underlying C code that were produced with Cython; Lxml makes calls into the libxml XML libraries provided by http://www.xmlsoft.org.
Here is a bit more description from http://cython.org/:
"Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex). It makes writing C extensions for Python as easy as Python itself.
- "Cython gives you the combined power of Python and C to let you
- write Python code that calls back and forth from and to C or C++ code natively at any point.
- easily tune readable Python code into plain C performance by adding static type declarations.
- use combined source code level debugging to find bugs in your Python, Cython and C code.
- interact efficiently with large data sets, e.g. using multi-dimensional NumPy arrays.
- quickly build your applications within the large, mature and widely used CPython ecosystem.
- integrate natively with existing code and data from legacy, low-level or high-performance libraries and applications."
And, the scikit-learn documentation page is here: http://scikit-learn.org/stable/user_guide.html.
EliteDataScience has an introduction to machine learning here: https://elitedatascience.com/learn-machine-learning
EliteDataScience has provided a Scikit-Learn tutorial here: https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn.
Question: Is there support for tensorflow in Anaconda? Answer: Yes, but currently, installing it is tricky. For example, see this: https://gist.github.com/johndpope/187b0dd996d16152ace2f842d43e3990
Also see the section on Dask elsewhere in the current document: Dask for optimized (and parallel) computing.
You can store Pandas DataFrames and Dask DataFrames in HDF5 archives with h5py. You can read about h5py here:
Also see: https://dask.pydata.org/en/doc-test-build/array-overview.html#construct
Here is an example that saves and retrieves a Dask DataFrame:
In [62]: df1, df2 = snippets.read_csv_files('tmp5.csv')
In [63]: df1.to_hdf('tmp01.hdf5', '/Version1/tmp5')
Out[63]: ['tmp01.hdf5']
In [64]:
In [64]: df1a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5')
In [65]:
In [65]: df1.A.mean().compute()
Out[65]: 0.02893067882172706
In [66]: df1a.A.mean().compute()
Out[66]: 0.02893067882172706
In [68]: df2.to_hdf('tmp01.hdf5', '/Version1/tmp5_2')
Out[68]:
['tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5',
 'tmp01.hdf5']
In [69]:
In [69]: df2a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2')
In [70]:
In [70]: df2.npartitions
Out[70]: 12
In [71]: df2a.npartitions
Out[71]: 1
In [72]: df2.B.sum().compute()
Out[72]: -57.04419047235241
In [73]: df2a.B.sum().compute()
Out[73]: -57.04419047235241
Notes:
We write a Dask DataFrame (df1) to HDF5, then read it back into a separate variable (df1a).
We compute the mean of column A of both DataFrames so as to show that the one we wrote to HDF5 and the one we read back in from HDF5 contain the same data.
Notice that in the case of df2 and df2a, the HDF5 round trip did not preserve the chunk size and number of partitions. However, the read_hdf function has an optional parameter that enables you to read a DataFrame from HDF5 creating multiple partitions and a smaller chunk size. Example:
In [80]: df2b = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2')
In [81]: df2b.npartitions
Out[81]: 1
In [82]: df2c = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2', chunksize=100)
In [83]: df2c.npartitions
Out[83]: 10
There is also an HTTP server for HDF5 archives. It presents a REST-ful interface that enables you to add, list, and retrieve data objects from HDF5 archives on a remote machine. The data returned in response to a retrieval request is formatted as JSON.
You can learn more about h5serv here: http://h5serv.readthedocs.io/en/latest/.
And, you can learn about the JSON representation of HDF5 here: http://hdf5-json.readthedocs.io/en/latest/index.html.
The documentation is here: https://asdf.readthedocs.io/en/latest/.
And, a bit more documentation: https://www.sciencedirect.com/science/article/pii/S2213133715000645
A CSV module is in the Python standard library. See: https://docs.python.org/3/library/csv.html
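For example, a minimal sketch of reading a CSV file with that module (the file name is hypothetical, using column labels like those in the sample CSV files created earlier in this document):

import csv

with open('test_data.csv', 'r', newline='') as infile:
    reader = csv.DictReader(infile)     # each row becomes a dict keyed by the header row
    for row in reader:
        print(row['A'], row['B'])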