An Introduction to Python for Scientists

Tiago M. D. Pereira, ITA/UiO

http://folk.uio.no/tiago/intropython



Why Python?

  • Easy to learn, easy to use
  • Incredibly flexible and powerful: many libraries
  • Truly general purpose language: from shell scripts to GPU big data
  • Free as in free beer!
  • Number #4 in IEEE Top 10 Programming Languages to Learn 2016

Who uses Python?

  • Google: "Python where we can, C++ where we must"
  • Your Linux/Mac most likely uses Python for essential system tasks
  • Most uses of Python outside science: tons of companies for web backends
  • Last 10 years large shift to Python in scientific community (also in Astronomy: astropy/pyraf, etc.)

Python in brief

  • High-level, object-oriented, interpreted language
  • Clear, readable syntax forces clean code
  • Runs virtually everywhere
  • Extendable with C/C++/FORTRAN modules
  • Name comes from Monty Python
  • Rift in 2.x/3.x versions: just use 3.x

How to get started?

  • Python is most likely already installed in your system
  • However, better to use a Python distribution
  • Anaconda recommended, keeps things tidy, easy to manage packages
conda install mypackage
conda uninstall mypackage
conda update mypackage
conda list

Python as a calculator

$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 3e-34/5.6**1.34
2.9822860601211976e-35
>>> 4+5
9
>>> "hello"+" world"
'hello world'
>>> "hello"+4
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'int' objects

Python for Science: basic packages

  • Numpy
  • Scipy
  • Matplotlib
  • IPython

IPython

  • Feature-rich interactive prompt for python
  • Makes it very easy to interact with data
  • Jupyter notebooks: annotate, share, organise your work!
  • Highly customisable: will fit your workflow
  • ipyparallel: distribute your work across machines
In [1]:
for a in range(3):
    print(a)
0
1
2
In [2]:
from math import *

def my_function(a, b):
    return a + cos(b * sqrt(a) / pi)

my_function(10, 3.4)
Out[2]:
9.039164227121516
In [4]:
Out[2] + 10
Out[4]:
19.039164227121518
In [5]:
cd ~/Desktop/
/Users/tiago/Desktop
In [6]:
ls ~/*ncdf
/Users/tiago/output_ray.ncdf

Numpy: efficient arrays for Python

In [7]:
import numpy as np
a = np.arange(100).reshape(10, 10)
print(a)
[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]
In [8]:
b = np.ones((10, 10), dtype='uint8')
c = np.sin(2 * a / pi) + np.sqrt(b ** 1.5) + np.random.normal(0, 10, size=(10, 10))
In [9]:
c[c < 0] = 0.
c[-3:, -3:]
Out[9]:
array([[ 15.98910938,   0.        ,   2.56011278],
       [  0.39043593,   5.03626436,   0.        ],
       [  0.        ,   6.52683314,   0.        ]])

Scipy: numerical tools for Python

Subpackage Description
cluster Clustering algorithms
constants Physical and mathematical constants
fftpack Fast Fourier Transform routines
integrate Integration and ordinary differential equation solvers
interpolate Interpolation and smoothing splines
io Input and Output
linalg Linear algebra
ndimage N-dimensional image processing
odr Orthogonal distance regression
optimize Optimization and root-finding routines
signal Signal processing
sparse Sparse matrices and associated routines
spatial Spatial data structures and algorithms
special Special functions
stats Statistical distributions and functions
weave C/C++ integration
In [10]:
from scipy import linalg
data = np.array([[1, 2],
                 [3, 4]])
linalg.det(data)
Out[10]:
-2.0
In [11]:
linalg.inv(data)
Out[11]:
array([[-2. ,  1. ],
       [ 1.5, -0.5]])
In [ ]:
from scipy.io.idl import readsav
data = readsav("my_idl_save_file")

Matplotlib: powerful plotting made easy!

Some examples

In [12]:
from scipy.optimize import curve_fit

def gaussian(x, a, b, c, d):
    ''' Returns a gaussian function for a, b, c, d = m, stdev, max, offset '''
    return d + c / (b * np.sqrt(2 * np.pi)) * np.exp(-((x - a)**2 / (2 * b**2)))

x = np.arange(-120, 50.)
data = gaussian(x, 0, 15, 30, 4) + np.random.normal(0, 0.3, size=170)
In [13]:
fit, covar = curve_fit(gaussian, x, data)
plot(x, data, 'b-')
plot(x, gaussian(x, *fit), 'r-')
Out[13]:
[<matplotlib.lines.Line2D at 0x113084210>]

Performance of Python

  • Pure Python code is very high-level, much slower than C, Fortran
  • Array operations with numpy and most scipy routines written in C/Fortran
  • For performance, avoid loops and use array operations
  • Python can be extended by writing C or Fortran modules for bottlenecks
  • Cython makes the transition to C much easier
  • numba performs just-in-time (jit) compilation, very easy to use
  • f2py run Fortran routines from python
  • PyPy alternative implementation of Python with jit (not ready for prime time)
In [49]:
def det_by_lu(y, x):
    y[0] = 1.

    N = x.shape[0]
    for k in range(N):
        y[0] *= x[k,k]
        for i in range(k+1, N):
            x[i,k] /= x[k,k]
            for j in range(k+1, N):
                x[i,j] -= x[i,k] * x[k,j]
In [ ]:
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef cython_det_by_lu(double[:] y, double[:,:] x):
    y[0] = 1.

    cdef int N = x.shape[0]
    cdef int i,j,k

    for k in range(N):
        y[0] *= x[k,k]
        for i in range(k+1, N):
            x[i,k] /= x[k,k]
            for j in range(k+1, N):
                x[i,j] -= x[i,k] * x[k,j]
In [ ]:
from numba import jit, void, double

fastdet_by_lu = jit(void(double[:], double[:,:]))(det_by_lu)

Python in parallel

Many ways to run python programs in parallel, YMMV! Most common methods:

  • ipyparallel: high-level parellisation with hosts/workers, run parallel sessions of python
  • MPI4py: run Python with MPI (more manual and explicit parallelisation)
  • Openmp with Cython

On the GPU:

  • Accelerate CUDA libraries: BLAS, FFT, RAND, SPARSE, implicit use of GPU
  • Accelerate CUDA jit: similar to numba, easiest way to get started with CUDA
  • pyCUDA: python bindings to CUDA: lower level (kernels written in C), but more control
  • pyOpenCL: similar to pyCUDA, but for OpenCL
In [ ]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(1000, dtype='i')
    comm.Send([data, MPI.INT], dest=1, tag=77)
elif rank == 1:
    data = np.empty(1000, dtype='i')
    comm.Recv([data, MPI.INT], source=0, tag=77)

Run with:

mpirun -np 16 python my_parallel_program.py

Other packages of interest:

  • Astropy / SunPy
  • Kivy: framework for touch-screen mobile applications
  • Mayavi: powerful 3D visualisations made easy

Astropy

A comprehensive Python package for astronomers:

  • Coordinate system transformations
  • Units and constants (WCS, etc.)
  • Extensive I/O modules: fits and more
  • Virtual Observatory: search, download, and handle data
  • Visualisation, statistics, image processing (cross correlation, better convolution, filters used in astronomy)