# <center> Introduction to Python Session 2

Instructor: Jerry Qiushi Yu
    
Topics for this session:

0. List
1. Function
2. I/O stream
3. Numpy


<span style="font-family:Papyrus; font-size:em;"> *Thanks to Professor Kerby Shedden and Professor Marcio Mourao for making their teaching materials available. Part of this session's material is based on theirs.* </span>

## <center> List
    
A list is the most basic container in Python. We can put homogeneous or heterogeneous objects in a list.

We can index and slice a list. Indexing in Python starts with $0$.

We can use an iterator to loop through a list.

In [22]:
la=[1,2,3]
print(la)
lb=[1,2,[2,4]]
print(lb)
lc=[1,2,"cat"]
print(lc)

[1, 2, 3]
[1, 2, [2, 4]]
[1, 2, 'cat']


In [19]:
print(la[0])
la_one=la[0]
print(la_one)
la_onetwo=la[0:2:1] # the pattern for slicing: begin(inclusive):end(exclusive):step
print(la_onetwo)
la_all=la[:]
print(id(la), id(la_all)) # id gives the unique label for an object in computer memory

1
1
[1, 2]
2932254011272 2932244995400
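
The two different ids printed above suggest that `la[:]` builds a new list rather than a second name for `la`. A minimal sketch of what that means in practice, using the illustrative names `alias` and `copy_` (they are not part of the original cells):

```python
la = [1, 2, 3]

alias = la       # a second name for the same list object
copy_ = la[:]    # a full slice builds a new list with the same elements

alias[0] = 99    # visible through `la`, because both names share one object
copy_[1] = -1    # does not touch `la`

print(la)                        # [99, 2, 3]
print(copy_)                     # [1, -1, 3]
print(alias is la, copy_ is la)  # True False
```

The copy made by slicing is shallow: a nested list, such as the inner list of `lb`, would still be shared between the original and the copy.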


In [4]:
for element in la:
    print(element)

1
2
3


In [6]:
# a range object can be indexed and iterated much like a list, but it computes its values on demand
# we usually use a range object for indexing in a loop
range(6)

range(0, 6)

In [8]:
for number in range(0,6):
    print(number**2)

0
1
4
9
16
25


## <center> Function Definitions

We use the keyword def to define functions in Python. A string literal placed as the first statement of the function body has a special status: it is the docstring of the function. Various tools that automatically prepare documentation from source code use the docstring, so make a habit of always supplying one! Docstrings are typically triple-quoted even when they fit on one line, which makes them easy to expand later.

In [127]:
def get_strings(lst):
    """Returns from lst the elements whose type is 'str'."""
    out = []
    for i in lst:
        if type(i) is str:
            out.append(i) # Adds i to the list.
    return out       

In [128]:
print(get_strings) # Prints the function object.

<function get_strings at 0x0000018C54A0C158>


In [129]:
print(get_strings.__doc__) # Prints the docstring of the function.

Returns from lst the elements whose type is 'str'.


In [130]:
strings = get_strings([1,2.0,'one',10,'two']) # Obtain the 'str' elements from the list.
print(strings)

['one', 'two']


Now let's generalize the type of the elements filtered by the function.

In [131]:
def get_type_elements(lst, retType):
    """Returns from lst the elements whose type is 'retType'."""
    out = []
    for i in lst:
        if type(i) is retType:
            out.append(i) 
    return out

In [132]:
get_type_elements([1,2.0,'one',10,'two'])

TypeError: get_type_elements() missing 1 required positional argument: 'retType'

In [39]:
get_type_elements([1,2.0,'one',10,'two'], int)

[1, 10]
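
The test `type(i) is retType` only accepts an exact type match. As a hedged sketch of an alternative, the hypothetical helpers below (`get_exact` and `get_instances` are illustrative names, not part of the notebook) contrast that test with `isinstance`, which also accepts subclasses. The difference shows up with `bool`, which is a subclass of `int`:

```python
def get_exact(lst, ret_type):
    """Return the elements of lst whose type is exactly ret_type."""
    return [item for item in lst if type(item) is ret_type]


def get_instances(lst, ret_type):
    """Return the elements of lst that are instances of ret_type (subclasses included)."""
    return [item for item in lst if isinstance(item, ret_type)]


data = [1, True, 2.0, 'one']
exact = get_exact(data, int)      # bool is a distinct type, so True is skipped
loose = get_instances(data, int)  # bool subclasses int, so True is kept
print(exact)  # [1]
print(loose)  # [1, True]
```

Which behavior is preferable depends on whether subclasses should count as matches; the notebook's exact-type test is the stricter of the two.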

Now let's assume that most of the time you'll be calling this function to return strings from the list. You can give the parameter a default value in the function definition.

In [133]:
def get_type_elements(lst, retType=str):
    """Returns from lst the elements whose type is 'retType'."""
    out = []
    for i in lst:
        if type(i) is retType:
            out.append(i)
    return out

In [134]:
get_type_elements([1,2.0,'one',10,'two']) # Calls the function with one positional argument.

['one', 'two']

In [135]:
get_type_elements([1,2.0,'one',10,'two'], int) # Calls the function with two positional arguments.

[1, 10]

In [136]:
get_type_elements(lst=[1,2.0,'one',10,'two'], retType=str) # Calls the function with keyword arguments.

['one', 'two']

In [137]:
get_type_elements(retType=str, lst=[1,2.0,'one',10,'two']) # If using only keyword arguments, the order does not matter.

['one', 'two']

In [138]:
get_type_elements([1,2.0,'one',10,'two'], retType=float) # You can combine positional and keyword arguments.

[2.0]

In [140]:
get_type_elements(lst=[1,2.0,'one',10,'two'], float) 
# Positional arguments must precede keyword arguments.
# In this example, the first argument is a keyword argument and the second is positional,
# so Python raises a SyntaxError.

SyntaxError: positional argument follows keyword argument (<ipython-input-140-5e2205a866f7>, line 1)

We can also create anonymous (lambda) functions in Python. They are handy as short, temporary functions passed as arguments to higher-order functions.

In [44]:
triple = lambda x: x * 3 # Defines a lambda or anonymous function
triple(3)
# These temporary functions are typically passed as arguments to other functions.

9

In [45]:
list(filter(lambda x: x%2==0, range(1,20))) # Creates a list with even numbers

[2, 4, 6, 8, 10, 12, 14, 16, 18]
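
For comparison, here is a small sketch of two related idioms: a list comprehension that gives the same result as the `filter` call above, and a lambda used as the `key` argument of `sorted`. The variable names and the word list are made up for illustration:

```python
evens_filter = list(filter(lambda x: x % 2 == 0, range(1, 20)))
evens_comp = [x for x in range(1, 20) if x % 2 == 0]  # same result, no lambda needed

words = ['banana', 'fig', 'apple']
by_length = sorted(words, key=lambda w: len(w))  # the lambda supplies the sort key

print(evens_filter == evens_comp)  # True
print(by_length)                   # ['fig', 'apple', 'banana']
```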

# <center> Scoping Rules

Python scripts can contain function definitions, class definitions, and executable code. A script is executed by the interpreter from top to bottom. Statements are executed as they are encountered. Function and class definitions are processed when encountered (they are compiled to byte code and stored), but they are not immediately evaluated. Therefore, variables referenced inside a function or class definition do not need to exist at the point where the function definition appears in the script. This is a slightly subtle point and we will elaborate below with examples.

Python function bodies create their own scope. This means that a variable with a given name that is defined inside a function is unrelated to any variable with the same name defined outside the function. However, if a variable is referred to but not defined inside the function, it will resolve (or bind) to a variable that was created outside of the function body. This is also a subtle point that will hopefully be clarified below.

First, the simplest possible setting is a pure function that only refers to its arguments and to variables defined inside its own scope.

In [56]:
def f1():
    var=5
    print("From f1 - var is: ", var)

f1()

From f1 - var is:  5


The following cell raises an error because we try to access var in the global scope, whereas var exists only in the local scope of the function 'f1'.

In [57]:
print(var) # This variable is not recognized outside the scope of the function 'f1'

NameError: name 'var' is not defined

Python is a statically scoped and a lexically scoped language. This means that a variable used inside a function (that is not local to the function scope) resolves to a variable defined outside the function based on the position of that variable in the source file, rather than to the current state of the environment. 

For example, if var is referenced inside a function f1, but var is not defined inside f1, then var resolves to the first variable named var that is defined in the enclosing scope of the definition of f1. Note that var did not need to be defined before f1 was defined; it only needed to be defined before f1 was called.

In [58]:
var = 10

def f1():
    print("From f1 - var is: ", var)
    
f1()
print("From outside - var is: ", var)

From f1 - var is:  10
From outside - var is:  10
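
The point that a global only has to exist when the function is called, not when it is defined, can be checked with a short sketch. The names `show_limit` and `limit` are hypothetical and chosen only for this illustration:

```python
def show_limit():
    # `limit` does not exist yet when this definition is processed
    return limit * 2


try:
    show_limit()               # called too early: the name cannot be resolved
except NameError as err:
    early_error = str(err)

limit = 21                     # now the global exists
late_result = show_limit()     # the lookup happens at call time

print(early_error)
print(late_result)             # 42
```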


The two 'vars' below correspond to two different variables.

In [59]:
var = 10

def f1():
    var = 5
    print("From f1 - var is: ", var)

f1()
print("From outside - var is: ", var)

From f1 - var is:  5
From outside - var is:  10


The following output generates an error because var is being defined locally but referenced before that happens.

In [3]:
var = 10

def f1():
    var = var * 10
    print("From f1 - var is: ", var)

f1()
print("From outside - var is: ", var)

UnboundLocalError: local variable 'var' referenced before assignment

This problem can be solved by using the 'global' keyword and modifying a variable defined outside the function.

In [63]:
var = 10

def f1():
    global var
    var = var * 10
    print("From f1 - var is: ", var)

f1()
print("From outside - var is: ", var)

From f1 - var is:  100
From outside - var is:  100


To clarify the notion of lexical scope, in the following example we have a function f1 that refers to a non-local variable var. The “most local” var relative to the definition of f1 determines the name resolution. Thus when f1 is called, var holds the value 5. The most local var relative to the call of f1 is a different variable (also called var). This variable holds the value 10, and is not referenced at all in the code below.

In [52]:
var = 5

def f1():
    print(var + 1)

def f2(f1):
    var = 10
    f1()
    
f1()
f2(f1)

6
6


Things get a bit more complicated when we have nested scopes. In the following example, there are two different variables called var. When var is referenced inside of f2, it resolves to the variable in the closest enclosing scope, which is the variable holding the value 5.

In [53]:
var = 3

def f1():

    var = 5

    def f2():
        return var

    r = f2()
    print(r)
    
f1()

5


If no variable named var is encountered in the first enclosing scope, the interpreter goes out one level, and so on. Thus, in the next example the var referenced in f2 resolves to the var in module scope (holding value 3).

In [54]:
var = 3

def f1():

    def f2():
        return var
    
    r = f2()
    print(r)

f1()

3


One more important concept we must discuss is the closure. Below, the function f defines and returns the function g, and g refers to variables defined in the scope of f. Normally, after f returns its scope becomes unreachable, and all variables defined there are reclaimed. But in this case, since g remains in scope, the scope of f must remain reachable via g. In this case, g is called a closure because it not only contains the statements defining g (its function body), but also encapsulates its enclosing scope (the scope of f).

In [55]:
def f(y):

    z = 3 * y

    def g():
        return z

    return g
# g() calls the function g and returns its result
# g alone refers to the function object itself
g = f(3)
r = g()
print(r)

9
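
A closure can also update the variables it captures. The sketch below is an illustration rather than part of the original material: it uses the `nonlocal` keyword (the enclosing-scope counterpart of `global`) in a hypothetical `make_counter` function, and shows that each returned closure keeps its own copy of the enclosing scope:

```python
def make_counter(start=0):
    """Return a function that remembers and increments its own count."""
    count = start

    def step():
        nonlocal count   # rebind the variable in make_counter's scope
        count += 1
        return count

    return step


counter_a = make_counter()
counter_b = make_counter(10)

first = counter_a()    # 1
second = counter_a()   # 2
other = counter_b()    # 11, independent of counter_a
print(first, second, other)
```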


Finally, note that in Python (unlike in some other languages), control-flow blocks do not create scope: new scopes come from modules (files), function bodies, class bodies, and (in Python 3) comprehensions. In particular, loops do not have their own scope. Thus the loop index variable in the example below clobbers the existing variable with the same name:

In [147]:
i = 3

for i in range(5):
    pass

print(i)

4
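
By contrast with the plain loop above, a comprehension in Python 3 keeps its loop variable to itself. A minimal sketch, with illustrative variable names:

```python
i = 3

squares = [i * i for i in range(5)]  # this `i` lives inside the comprehension only
after_comprehension = i              # still 3

for i in range(5):                   # a plain loop rebinds the outer `i`
    pass
after_loop = i                       # 4

print(squares, after_comprehension, after_loop)
```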


# <center> Reading and writing files

Please run the two cells below first for preparation.

In [160]:
# Let's print the current working directory(cwd)
import os
os.getcwd()
#os.chdir('PATH')

'C:\\Users\\jerry'

In [161]:

#create one file in the working directory
with open('workfile', 'wt') as fout:
    fout.write('')
    
#create another file in the working directory
with open('textFile.txt', 'wt') as fout:
    fout.write('first\nsecond\n\nthird\nfourth\nfifth\n')

You'll certainly need to read data from files into data structures. For structured data, say a table stored in an Excel or CSV file, you'll be able to use the pandas input/output functions to read the data. But it is very important to know how to work with files in Python, especially if you need to deal with unstructured data.

In [163]:
f = open('workfile', 'r') 
# Opens the file specified in the first argument with mode specified in the second argument.
# second argument: 'r': reading mode. 'w':writing mode

In [164]:
f.close() # Always close the file object to free OS resources.

A more convenient way of opening files is the 'with' statement, which guarantees that the file object is closed when the block ends, even if an error occurs inside it.

In [165]:
with open('workfile') as f:
    read_data = f.read()

In [166]:
f.closed # Tells whether the file is open or closed.

True

In [167]:
read_data # This file is actually completely empty.

''

I created a file named 'textFile.txt' with the content below.

first
second

third
fourth
fifth

In [169]:
f = open('textFile.txt', 'r') # If you don't specify the second argument, the file will be opened in reading mode.
read_data = f.read() # Notice that function 'read' reads the entire content of the file.
read_data

'first\nsecond\n\nthird\nfourth\nfifth\n'

In [170]:
print(read_data, end='') # The print function (as seen before) will interpret the end of line characters.

first
second

third
fourth
fifth


In [171]:
f.read() # You have reached the end of the file. There is nothing more to be read.
# The file position is now at the end of the file, so f.read() returns an empty string.

''

In [172]:
f.seek(0) # Moves the file position back to the beginning of the file.

0

In [173]:
print(f.readline(), end='')

first


In [174]:
print(f.readline(), end='')

second


In [175]:
for line in f: # This is a very convenient, fast way of accessing lines of a file.
    print(line, end='')


third
fourth
fifth


In [176]:
f.seek(0)
list(f) # You can also create a list out of lines in a file.

['first\n', 'second\n', '\n', 'third\n', 'fourth\n', 'fifth\n']

In [180]:
f.seek(0)
f.readlines() # Returns the same as above.

['first\n', 'second\n', '\n', 'third\n', 'fourth\n', 'fifth\n']

In [181]:
f.seek(0)
lines = [x.rstrip() for x in f] # Now cleaning up the list while reading lines from the file.
lines

['first', 'second', '', 'third', 'fourth', 'fifth']
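
The cleaned list above still contains an empty string for the blank line. As a sketch of one way to drop such lines while reading, the cell below writes a small file to a temporary directory (the path is created with `tempfile`, so it does not touch the working directory) and filters during iteration:

```python
import os
import tempfile

# Write a small file containing a blank line into a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'textFile.txt')
with open(path, 'w') as fout:
    fout.write('first\nsecond\n\nthird\n')

# Iterate line by line, strip the newline, and skip lines that are blank.
with open(path) as fin:
    non_empty = [line.rstrip('\n') for line in fin if line.strip()]

print(non_empty)  # ['first', 'second', 'third']
```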

You can also supply an argument to the read function. Its meaning depends on the opening mode. If you open the file in text mode, the argument is the number of characters to read, and how bytes map to characters is determined by the file's encoding (the default is platform dependent; pass encoding='utf-8' to open to be explicit). If you open the file in binary mode (by adding the character 'b' to the mode), the argument is the number of raw bytes.

In [182]:
f.seek(0)
f.read(3) # Reads the first three characters (the file is open in text mode).

'fir'

In [17]:
f.read(7) # Again, notice that the read method advances the file position every time it is used.

'st\nseco'
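
The difference between counting characters and counting bytes only becomes visible with non-ASCII text. The sketch below assumes a UTF-8 encoded temporary file (the file name and its content are made up); the accented letter is written with the escape `\u00e9` so the example does not depend on the editor's encoding:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'accent.txt')
text = '\u00e9t\u00e9'             # three characters: e-acute, t, e-acute

with open(path, 'w', encoding='utf-8') as fout:
    fout.write(text)               # each e-acute takes two bytes in UTF-8

with open(path, 'r', encoding='utf-8') as fin:
    text_chunk = fin.read(2)       # text mode: two characters

with open(path, 'rb') as fin:
    byte_chunk = fin.read(2)       # binary mode: two raw bytes, only the first character

print(len(text_chunk), len(byte_chunk))
print(byte_chunk)                  # b'\xc3\xa9'
```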

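The sys module reports the default encoding Python uses when converting between str and bytes, as below. Note that this is not necessarily the encoding that open uses for text files; that default comes from the locale (see locale.getpreferredencoding).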

In [18]:
import sys
sys.getdefaultencoding() 

'utf-8'

Now, if you want to write to a file...

In [183]:
import os
os.getcwd()
f = open('newtextfile.txt', 'w') # Make sure you open the file in writing mode.
f.write('This is a new line!\n')

20

What if we would like to write a tuple to a file?

In [184]:
value = ('A string', 10)
s = str(value)  # convert the tuple to string
print(s, end='')
f.write(s)

('A string', 10)

16

In [185]:
f.close()

In [186]:
with open('newtextfile.txt') as f:
    lines = [x.rstrip() for x in f]
lines

['This is a new line!', "('A string', 10)"]
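
Reading the file back gives the tuple as a string, not as a tuple. One possible way to recover the original value, sketched here under the assumption that the file holds a plain Python literal, is `ast.literal_eval` from the standard library. The temporary path is created only for this illustration:

```python
import ast
import os
import tempfile

value = ('A string', 10)
path = os.path.join(tempfile.mkdtemp(), 'value.txt')

with open(path, 'w') as fout:
    fout.write(repr(value) + '\n')   # repr gives text that Python can parse back

with open(path) as fin:
    restored = ast.literal_eval(fin.readline().strip())

print(restored == value, type(restored))  # True <class 'tuple'>
```

`ast.literal_eval` only evaluates literals (strings, numbers, tuples, lists, dicts, and so on), which makes it a safer choice than `eval` for data read from a file.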

# <center> Numpy

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides several array-like data structures, including a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

The most commonly used array-like data structure is the ndarray (“n-dimensional array”) object. An ndarray is a Python wrapper around a contiguous chunk of memory that allows it to be manipulated like an array.

Conventionally 'Numpy' is abbreviated as np:

In [21]:
import numpy as np

x = np.zeros(5)
print(type(x))
x

<class 'numpy.ndarray'>


array([0., 0., 0., 0., 0.])

Numpy arrays are, in most cases, homogeneous, contiguous, typed arrays. This makes them dramatically faster than core Python lists for many operations, since the Python list stores all values by indirection and is dynamically typed. The main exception to this would be if you need to store heterogeneous data, and/or you need to shrink or grow the array frequently, in which case the Python list type may actually be more efficient.

There are currently 24 Numpy data types, called “dtypes”, documented in the NumPy reference. This includes the usual 12 numeric types (1, 2, 4, and 8 byte signed and unsigned integers, 4 and 8 byte floating point values, and 8 and 16 byte complex number values). In addition there are string, date/time, and Python object dtypes. The default type for many array creation operations is float64, which is an 8 byte floating point value that is mostly interchangeable with a regular Python float value.
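
The byte width of each numeric dtype can be inspected directly through the `itemsize` attribute. A short sketch (the selection of dtype names is arbitrary):

```python
import numpy as np

# itemsize is the number of bytes used by one element of the given dtype
names = ['int8', 'uint16', 'int32', 'int64',
         'float32', 'float64', 'complex64', 'complex128']
sizes = {name: np.dtype(name).itemsize for name in names}

for name, nbytes in sizes.items():
    print(f'{name:>10}: {nbytes} bytes')
```

The number in a dtype name is its width in bits, so `complex64` holds two 4 byte floats (8 bytes in total) and `complex128` holds two 8 byte floats.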


The np.zeros function creates an array of zeros, defaulting to float64 type. The following are all equivalent:

In [187]:
m = 10
x = np.zeros(m)
print(x.dtype)
x = np.zeros(m, np.float64)
print(x.dtype)
x = np.zeros(m, dtype=np.float64)
print(x.dtype)
x = np.zeros(m, dtype=float)
print(x.dtype)
x = np.zeros(m, dtype='d')
print(x.dtype)
x = np.zeros(m, dtype='double')
print(x.dtype)

float64
float64
float64
float64
float64
float64


The following examples create arrays of zeros with other data types:

In [25]:
x = np.zeros(m, np.int32)
print(x.dtype)
x = np.zeros(m, np.uint8)
print(x.dtype)
x = np.zeros(m, np.int64)
print(x.dtype)

int32
uint8
int64


Below are some other ways to create arrays. Each of these functions can take the dtype argument specifying any dtype, but we use the defaults here (float64 for np.ones; for np.arange with integer arguments, the platform's default integer type):

In [26]:
x = np.ones(m)   # Sets all values to 1
print(x, x.dtype)
x = np.arange(m) # 0, 1, 2, ..., m-1
print(x, x.dtype)

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] float64
[0 1 2 3 4 5 6 7 8 9] int32


Unlike Python lists, Numpy arrays behave like mathematical vectors and matrices with respect to arithmetic operations, e.g. you can do something like this:

In [27]:
x = np.arange(5)
print(x)
y = np.arange(1, 6)
print(y)

x + y  # Pointwise sum
x - y  # Pointwise difference
x / y  # Pointwise quotient
x ** y # Pointwise exponentiation
x % y  # Pointwise remainder
x * y  # Pointwise product

[0 1 2 3 4]
[1 2 3 4 5]


array([ 0,  2,  6, 12, 20])

There is a lot more going on here... each of these operators has to be implemented separately for each dtype, e.g. + for int64 is a different function than + for float32. If you pass in mixed dtypes, e.g. multiplying an int64 array by a float32 array, there will be a type promotion: both inputs are converted to a common dtype that can represent them (float64 in this case) before the multiplication function is called. These hidden type promotions can degrade performance so sometimes it is better to convert your data to a common dtype before an intensive calculation begins. Conversions can be done with the astype method:

In [27]:
x = np.arange(5, dtype=np.float64)
y = x.astype(np.int32)
z = x.astype(np.float32)
# Convert the arrays to compatible types before an operation; this can make it more efficient.
x,y,z

(array([0., 1., 2., 3., 4.]),
 array([0, 1, 2, 3, 4], dtype=int32),
 array([0., 1., 2., 3., 4.], dtype=float32))

Another place where type promotion occurs is in performing division. Dividing integer arrays with / produces a float64 result, whereas dividing two float32 arrays keeps the result in float32.
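
NumPy can report the dtype that a mixed operation will produce, which is a convenient way to check promotion rules before running a large computation. A minimal sketch with illustrative variable names:

```python
import numpy as np

i64 = np.arange(1, 4, dtype=np.int64)
f32 = np.ones(3, dtype=np.float32)

mixed = (i64 * f32).dtype                         # dtype of an int64 * float32 product
predicted = np.result_type(np.int64, np.float32)  # the same answer without computing anything
int_division = (i64 / i64).dtype                  # true division of integer arrays
f32_division = (f32 / f32).dtype                  # both inputs float32

print(mixed, predicted, int_division, f32_division)
# float64 float64 float64 float32
```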

Indexing and slicing numpy arrays behaves similarly to indexing and slicing Python lists. Note the difference to lists - slices will normally return a “view” of the underlying data, meaning that if you change a slice, the same values will change in the parent array:

In [28]:
x = list(range(10))
# Slicing a core Python list creates a new list with its own memory.
y = x[3:6]
y[0] = 99
print(x,y)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [99, 4, 5]


In [29]:
x = np.arange(10)
y = x[3:6]
# Slicing a numpy array creates a new view onto the same memory.
y[0] = 99
print(x,y)

[ 0  1  2 99  4  5  6  7  8  9] [99  4  5]


Note that views can result even for certain types of non-contiguous slices. See the following example:

In [28]:
x = np.arange(20)
y = x[::2]
# x[::2] selects all the even-indexed elements
y[4] = 99
print(x,y)

[ 0  1  2  3  4  5  6  7 99  9 10 11 12 13 14 15 16 17 18 19] [ 0  2  4  6 99 10 12 14 16 18]
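
When a slice should not write through to its parent, an explicit copy detaches it. The sketch below uses the `base` attribute, which points at the array that owns the memory (or is `None` when the array owns its own data); the names `view` and `detached` are illustrative:

```python
import numpy as np

x = np.arange(10)

view = x[3:6]             # basic slice: a view onto x's memory
detached = x[3:6].copy()  # explicit copy: owns its own memory

view[0] = 99              # changes x[3]
detached[1] = -1          # leaves x alone

print(view.base is x)         # True
print(detached.base is None)  # True
print(x)                      # [ 0  1  2 99  4  5  6  7  8  9]
```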


Numpy uses a very general approach for indexing from the array to its low-level memory block. As a result, for many operations we do not need to copy the underlying data. For example, we may have a very large array x, and if we create a new variable y = x.T holding the transpose of x, then y and x share the same storage. In addition to the example above, see the following example:

In [29]:
x = np.random.normal(size=(3, 2))
y = x.T

print('This is x:\n', x, '\n')
print('This is y:\n', y, '\n')

print(id(x))
print(id(y))
print(x.flags.owndata)
print(y.flags.owndata)

This is x:
 [[ 0.15971286 -0.43476738]
 [-1.43458464 -1.33690231]
 [-0.31839109  0.58305919]] 

This is y:
 [[ 0.15971286 -1.43458464 -0.31839109]
 [-0.43476738 -1.33690231  0.58305919]] 

2431678792448
2431678774368
True
False


The Python data analysis tools (core Python, Numpy, Pandas, and others) lack a high performance and universal way to represent missing values. The current work-around is to use NaN and None to represent missing values, but this approach has limitations. One issue is that by definition, NaN is the only value that is not equal to itself:

In [32]:
x = float('nan')
print(x == x)

False


This means, for example, that you cannot count the NaN values in an array in the obvious way:

In [33]:
x = np.array([1, np.nan, 2, np.nan])
print((x == np.nan).sum())
#NaN is not equal to itself.

0


The proper way to detect NaN values is with the np.isnan function:

In [34]:
print(np.isnan(x).sum())

2


Another issue is that NaN exists for float type variables (float32 and float64 in Numpy) but not for other variable types, e.g. integers. Therefore, an array built from integers together with a missing value is promoted to float type, and assigning NaN into an existing integer array raises an error.
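
The interaction between NaN and integer arrays can be seen in a short sketch. The variable names are illustrative; `np.nanmean` is shown as one of the NumPy reductions that skip NaN entries:

```python
import numpy as np

mixed = np.array([1, 2, np.nan])   # NaN forces a floating-point dtype
print(mixed.dtype)                 # float64

ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan               # an integer slot cannot hold NaN
    assignment_error = None
except ValueError as err:
    assignment_error = str(err)
print(assignment_error)

nan_mean = np.nanmean(mixed)       # NaN entries are skipped
print(nan_mean)                    # 1.5
```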

Numpy provides two main ways to work with string data. The first approach, which is much more common, uses the Python string pool to manage the strings, and simply places the object id’s into the ndarray. This produces an array of type Object, e.g.

In [32]:
x = np.array(["cat", "dog", "fish"], dtype = 'O')
x.dtype

dtype('O')

You can see that this array only contains object id’s by running the following:

In [33]:
s = "fish"
x[0] = s
print(id(s), id(x[0]))
x
# s and x[0] are pointers to the actual strings.

2431628516184 2431628516184


array(['fish', 'dog', 'fish'], dtype=object)

Note that this array can actually hold references to any Python object, not just strings:

In [34]:
x[0] = {i : i for i in range(10)}
x

array([{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9},
       'dog', 'fish'], dtype=object)

The other way to store strings in an ndarray is to use a fixed string width, in which case the string data is actually packed into the array directly:

In [150]:
x = np.array(["cat", "dog"], dtype='U3')
print(x)
x.dtype

['cat' 'dog']


dtype('<U3')

The dtype “<U3” refers to a Unicode string of 3 characters. Note that in this setting, if you attempt to assign a string that does not fit into the allotted storage, the string is truncated:

In [149]:
x[0] = "fish"
print(x)

['fis' 'dog']


## Arithmetic operations

As seen earlier, Numpy arrays behave like mathematical vectors and matrices with respect to arithmetic operations. Let's see an example with a two dimensional Numpy array:

In [88]:
import numpy as np

x = np.ones((3,4)) # Creates a two dimensional numpy array.
x, x.dtype

(array([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]), dtype('float64'))

In [4]:
print(x.ndim)  # Returns number of dimensions in an array.
print(x.shape) # Returns a tuple describing the array shape.
print(x.size)  # Returns number of elements.

2
(3, 4)
12


In [5]:
y = np.array([[1, 2], [3, 4]])
z = np.array([[5, 6], [7, 8]])

In [6]:
print(y)
print(z)

[[1 2]
 [3 4]]
[[5 6]
 [7 8]]


In [7]:
y + z # Sums element-wise.

array([[ 6,  8],
       [10, 12]])

In [8]:
np.round(y / z, 2) # Divides element wise.

array([[0.2 , 0.33],
       [0.43, 0.5 ]])

In [9]:
w = np.array([[1, 2], [3, 4], [5, 6]]) # Let's create a new 3x2 Numpy array.
w

array([[1, 2],
       [3, 4],
       [5, 6]])

In [10]:
w + z # Notice the error message below. We'll get to it later in this lecture.

ValueError: operands could not be broadcast together with shapes (3,2) (2,2) 

An easy way to avoid making copies when performing array arithmetic in Numpy is to use the in-place arithmetic operators +=, *=, etc. When we use x = x + y, a new allocation is made to hold the value x + y, and this allocated memory is then assigned to x, with the previous memory block of x (eventually) being garbage collected. But x += y does not result in a new allocation, as seen below:

In [None]:
y = np.array([[1, 2], [3, 4]])
z = np.array([[5, 6], [7, 8]])

print(id(y))
y = y + z   # Regular sum.
print(id(y))
print(y)

In [None]:
y = np.array([[1, 2], [3, 4]])
z = np.array([[5, 6], [7, 8]])

print(id(y))
y += z        # In-place sum.
print(id(y))
print(y)
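
Since `id` values differ from run to run, the same comparison can be written with the `is` operator, which gives a reproducible answer. A minimal sketch, where `original` is an illustrative extra name that keeps a reference to the first array:

```python
import numpy as np

y = np.array([[1, 2], [3, 4]])
z = np.array([[5, 6], [7, 8]])

original = y
y = y + z                    # builds a new array and rebinds the name y
rebound = y is original      # False: y now names a different object

y = original
y += z                       # writes the result into the existing buffer
same_object = y is original  # True: no new array was created

print(rebound, same_object)  # False True
print(original)
```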

## Indexing and Slicing

In [None]:
x = np.array([[1, 2], [3, 4]])
x

In [None]:
print(x[0]) # Returns the first row as a one-dimensional array.
print(x[1:]) # Slicing keeps both dimensions: a 2-d array holding only the second row.
print(x[0][1]) # Returns the second element of the first row.
print(x[0,1]) # A more concise way of writing the above lookup.

Remember that slices will normally return a “view” of the underlying data, meaning that if you change a slice, the same values will change in the parent array:

In [None]:
x = np.array([[1, 2, 3], [4, 5, 6]])
y = x[:,0] # retrieves the first column.
y[1]=100 # modifies its element.
x
#np.array slicing is different from list slicing. List slicing returns an independent object, 
#but array slicing shares the memory with the original object.

Now suppose you want to retrieve elements from the first and the third columns of x:

In [None]:
x = np.array([[1, 2, 3], [4, 5, 6]])
y = x[:,[0,2]] # We use a list to index - this is also called advanced or fancy indexing.
y[1]=100
x
# Fancy indexing creates an independent copy, so x is unchanged.

When you use advanced indexing, no view is provided and a copy of the original object will be made.
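
Whether two arrays share storage can be checked with `np.shares_memory`. The sketch below compares a basic slice with an advanced (list) index on the same array; the names `basic` and `fancy` are illustrative:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

basic = x[:, 0]        # basic slicing
fancy = x[:, [0, 2]]   # advanced (list) indexing

basic_shares = np.shares_memory(x, basic)  # True: a view
fancy_shares = np.shares_memory(x, fancy)  # False: a copy

fancy[1] = 100         # changes only the copy
print(basic_shares, fancy_shares)
print(x)               # unchanged
```

Note that this concerns reading with advanced indexing. Assigning through it, as in `x[:, [0, 2]] = 0`, does modify `x`.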

In [15]:
x = np.array([[1,2], [3,4], [5,6]])
y = x[[-1,-2],:] # Another example of advanced indexing with negative indices.
y

array([[5, 6],
       [3, 4]])

In [12]:
x = np.array([[1,-2], [3,4], [-5,6]])
print(x)
print(x<0) # x <0 is a boolean array.

[[ 1 -2]
 [ 3  4]
 [-5  6]]
[[False  True]
 [False False]
 [ True False]]


In [13]:
x[x<0]=0 # This is called boolean indexing. Here I am setting negative entries to zero.
x

array([[1, 0],
       [3, 4],
       [0, 6]])
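
Boolean assignment changes the array in place. When the original should be kept, `np.where` or `np.clip` build a new array instead. A hedged sketch of both, with illustrative variable names:

```python
import numpy as np

x = np.array([[1, -2], [3, 4], [-5, 6]])

zeroed = np.where(x < 0, 0, x)   # take 0 where the mask is True, else the value of x
clipped = np.clip(x, 0, None)    # same result here: values below 0 are raised to 0
negatives = x[x < 0]             # reading with a boolean mask returns a 1-D copy

print(zeroed)
print(negatives)                 # [-2 -5]
print(x[0, 1])                   # -2, the original is untouched
```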


## Linear Algebra

Numpy supports a wide range of matrix and linear algebra operations. For example, you can use:

1. 'np.linalg.solve' to get the exact solution of a linear system of equations
2. 'np.linalg.svd' for the singular value decomposition
3. 'np.linalg.cholesky' for the Cholesky decomposition
4. 'np.linalg.inv' for matrix inversion
5. 'np.linalg.lstsq' for the least-squares solution of a system of equations

We will not attempt to describe all of them here. Just a couple of quick examples:

In [18]:
u = np.array([1,2])
v = np.array([1,2])

a = np.dot(u, v)   # inner product or matrix multiplication
b = np.outer(u, v) # outer product.
print('Inner Product: \n', a)
print('Outer Product: \n', b)

Inner Product: 
 5
Outer Product: 
 [[1 2]
 [2 4]]


In [19]:
A = np.random.randint(0,10,(2,2))
A
[L,V] = np.linalg.eig(A)
print('Eigen Values: \n', L)
print('Eigen Vectors: \n', V)
# V is a 2-d array whose columns are the eigenvectors:
# column V[:, i] is the eigenvector for the eigenvalue L[i].

Eigen Values: 
 [6. 0.]
Eigen Vectors: 
 [[ 1.  -0.8]
 [ 0.   0.6]]
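
The pairing between eigenvalues and eigenvectors can be verified numerically. The sketch below uses a fixed symmetric matrix (chosen for illustration, not taken from the random example above) and also shows `np.linalg.solve` on the same matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
L, V = np.linalg.eig(A)

# Column i of V pairs with eigenvalue L[i]:  A @ V[:, i] == L[i] * V[:, i]
for i in range(len(L)):
    print(np.allclose(A @ V[:, i], L[i] * V[:, i]))  # True

# All pairs at once: V * L scales column i of V by L[i] through broadcasting
columns_ok = np.allclose(A @ V, V * L)

b = np.array([3.0, 3.0])
solution = np.linalg.solve(A, b)   # solves A @ solution == b
print(columns_ok, solution)
```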


## Vectorization

Vectorization within NumPy is used to express operations as occurring on entire arrays rather than their individual elements. Here’s a definition from Wes McKinney:

"
This practice of replacing explicit loops with array expressions is commonly referred to as vectorization. In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact seen in any kind of numerical computations. (see [here](https://www.safaribooksonline.com/library/view/python-for-data/9781449323592/ch04.html?orpq))
"

Let's see an example:

In [188]:
np.random.seed(52452)
values = np.random.choice([True,False], size=500)

In [189]:
def count_transitions1(values):
    """Returns the number of transitions from either False to True or from True to False"""
    output = 0
    # zip pairs each element with its successor, so we can loop over both sequences at once
    for x,y in zip(values[:-1],values[1:]):
        if x!=y:
            output+=1
    return output

In [190]:
def count_transitions2(values):
    """Returns the number of transitions from either False to True or from True to False"""
    output = np.sum(values[:-1]!=values[1:])
    return output

In [191]:
print(count_transitions1(values))
print(count_transitions2(values))

245
245


In [192]:
%timeit -n 1000 count_transitions1(values)
%timeit -n 1000 count_transitions2(values)

495 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.88 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
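
The `%timeit` magic only works inside IPython or Jupyter. As a sketch of the same comparison for a plain Python script, the standard `timeit` module can be used. The function names, the seed, and the number of repetitions below are arbitrary choices for illustration, and no particular speed-up is claimed:

```python
import timeit

import numpy as np


def count_transitions_loop(values):
    """Count positions where consecutive elements differ, using a Python loop."""
    return sum(1 for x, y in zip(values[:-1], values[1:]) if x != y)


def count_transitions_vec(values):
    """Same count, computed by comparing two shifted views of the array."""
    return int(np.count_nonzero(values[:-1] != values[1:]))


small = np.array([True, True, False, False, True])
print(count_transitions_loop(small), count_transitions_vec(small))  # 2 2

rng = np.random.default_rng(0)          # seed chosen arbitrarily
values = rng.random(500) < 0.5          # boolean array of length 500
loop_time = timeit.timeit(lambda: count_transitions_loop(values), number=200)
vec_time = timeit.timeit(lambda: count_transitions_vec(values), number=200)
print(f'loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s')
```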


## Reducing functions

Numpy has reducing functions that collapse a multidimensional array along a single axis. The axes are numbered 0 (rows), 1 (columns), etc.; reducing over axis 0 combines the rows and yields one value per column. For example:

In [196]:
x = np.random.normal(size=(5,10))
x

array([[-0.78123224,  0.77682032,  0.63369182, -1.53993691, -0.20792374,
        -1.08904293,  1.79800305, -2.01340275,  0.02558545, -0.72738964],
       [-0.34794223, -0.79806307, -0.90678211, -1.05429223, -0.54624062,
        -0.35779351, -0.33593238, -0.51285328,  0.4023334 ,  0.76043202],
       [-1.14747964, -1.32380239, -0.73932437, -0.45555228, -1.77574302,
        -0.68495813,  0.2711496 ,  1.3484854 ,  0.11496398, -0.58343985],
       [ 0.22573678, -0.93900494, -0.04295383, -2.43586682,  0.29301177,
        -0.30513033, -0.73816706,  2.36734933,  0.20367889,  0.65738482],
       [-0.05439771,  0.8440284 ,  0.14541008, -0.45797527, -0.6719289 ,
        -0.47676485, -1.50909447,  0.02340448,  1.24607509,  1.03106125]])

In [201]:
print(x.mean(axis=0)) # column-wise means, size=10.

[-0.42106301 -0.28800434 -0.18199168 -1.1887247  -0.5817649  -0.58273795
 -0.10280825  0.24259664  0.39852737  0.22760972]


In [202]:
print(x.mean(axis=1)) # row-wise means, size=5.

[-0.31248276 -0.3697134  -0.49757007 -0.07139614  0.01198181]
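
The effect of the axis argument is easier to follow on a small array with known values. The sketch below also shows the `keepdims` option, which keeps the collapsed axis with length 1; the variable names are illustrative:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)       # [[0 1 2], [3 4 5]]

col_sums = x.sum(axis=0)             # collapses axis 0: one value per column
row_sums = x.sum(axis=1)             # collapses axis 1: one value per row
kept = x.sum(axis=1, keepdims=True)  # same values, collapsed axis kept with length 1

print(col_sums, col_sums.shape)      # [3 5 7] (3,)
print(row_sums, row_sums.shape)      # [ 3 12] (2,)
print(kept.shape)                    # (2, 1)
```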


## Broadcasting

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python.

One use of broadcasting is if we want to center or scale an array by column:

In [39]:
x = np.random.random(size=(10000, 4))
print(id(x))

print(x.mean(0))
print(x.std(0))

print(x.shape)
print(x.mean(0).shape)

x -= x.mean(0)
x /= x.std(0)

print(x.mean(0))
print(x.std(0))

print(id(x))

2159153825312
[0.50196951 0.50373012 0.49870297 0.49596574]
[0.28617644 0.28823266 0.2895602  0.28989964]
(10000, 4)
(4,)
[-4.75308681e-16 -3.66739417e-15  1.75690573e-15 -3.78919118e-16]
[1. 1. 1. 1.]
2159153825312


In the example above, x.mean(0) returns an array with shape (4,), which matches the shape of x, (10000, 4), when the two shapes are compared from the right. Therefore the shapes are compatible for broadcasting. The behavior in this case is that the result of x.mean(0) is only computed one time, and the same result is used for centering each row of x.
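
Shape compatibility can be checked without doing any arithmetic. The helper below is a hypothetical convenience function (`broadcast_shape` is not a NumPy API); it relies on `np.broadcast`, which raises a `ValueError` for incompatible shapes:

```python
import numpy as np


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or None if the shapes are incompatible."""
    try:
        return np.broadcast(np.empty(shape_a), np.empty(shape_b)).shape
    except ValueError:
        return None


print(broadcast_shape((10000, 4), (4,)))   # (10000, 4)
print(broadcast_shape((10, 2), (10, 1)))   # (10, 2)
print(broadcast_shape((3, 2), (2, 2)))     # None: 3 and 2 differ and neither is 1
```

Shapes are compared from the right, and two lengths are compatible when they are equal or when one of them is 1.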

There is a special case of the broadcasting rules that applies when a dimension’s length is equal to 1. In this case, the value in that dimension is copied to match the dimension on the same axis in the other array:

In [40]:
a = np.array([10.0, 15.0, 20.0])
b = np.array([5.0, 5.0, 5.0])
a / b

array([2., 3., 4.])

In [41]:
a = np.array([10.0, 15.0, 20.0])
print(a.shape)
b = np.array([5.0])
print(b.shape)
a / b

(3,)
(1,)


array([2., 3., 4.])

In [42]:
x = np.random.normal(size=(10, 2))
y = np.random.normal(size=(10, 1))
print(x + y)

[[-0.09081042 -1.38249452]
 [ 0.12437537  0.1764374 ]
 [-3.10067992  1.04663713]
 [ 0.34642815  0.29682263]
 [ 1.43060946 -1.16568979]
 [-0.12527773 -0.31937966]
 [ 0.54686872 -0.70074562]
 [ 0.30170236  0.4162724 ]
 [-2.20153494  0.10991214]
 [-0.67763269  0.17865969]]


There is a special syntax for adding a new axis of length 1 to an array:

In [43]:
x = np.zeros(10)  # shape is (10,).
y = x[:, None]    # shape is (10,1).
z = x[None, :]    # shape is (1,10).
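As a small sketch of the alternatives, the same shape change can be written in several ways. `np.newaxis` is simply an alias for `None`; the names `y1` to `y4` are made up for illustration.

```python
import numpy as np

x = np.zeros(10)                  # shape (10,)

y1 = x[:, None]                   # indexing with None
y2 = x[:, np.newaxis]             # np.newaxis is an alias for None
y3 = x.reshape(-1, 1)             # -1 lets numpy infer the length
y4 = np.expand_dims(x, axis=1)    # explicit function call

print(np.newaxis is None)
print(y1.shape, y2.shape, y3.shape, y4.shape)
print(np.shares_memory(x, y1))    # the new axis is a view, not a copy
```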

A common setting where this is useful is centering or scaling a two-dimensional array by row. The row means have shape (10,), which does not align from the right with (10, 3). Adding a new axis of length 1 gives them shape (10, 1), so the broadcasting rules apply when they otherwise would not:

In [44]:
x = np.random.normal(size=(10, 3))
x -= x.mean(1)[:, None]
x /= x.std(1)[:, None]

See the intermediate values below:

In [45]:
x.mean(1), x.mean(1).shape

(array([ 7.40148683e-17, -1.48029737e-16,  0.00000000e+00, -3.70074342e-17,
        -1.85037171e-17,  1.85037171e-17,  0.00000000e+00, -4.07081776e-16,
        -9.71445147e-17,  1.48029737e-16]), (10,))

In [46]:
x.mean(1)[:, None], x.mean(1)[:, None].shape

(array([[ 7.40148683e-17],
        [-1.48029737e-16],
        [ 0.00000000e+00],
        [-3.70074342e-17],
        [-1.85037171e-17],
        [ 1.85037171e-17],
        [ 0.00000000e+00],
        [-4.07081776e-16],
        [-9.71445147e-17],
        [ 1.48029737e-16]]), (10, 1))
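An alternative to the `[:, None]` pattern is the `keepdims=True` argument of `mean` and `std`, which keeps the reduced axis with length 1. The sketch below uses made-up data and checks that both approaches agree; it also shows that leaving out the extra axis fails.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 3))      # made-up data for illustration

# keepdims=True keeps the reduced axis, so the result has shape (10, 1).
row_means = x.mean(axis=1, keepdims=True)
row_stds = x.std(axis=1, keepdims=True)
z = (x - row_means) / row_stds

# Same computation with an explicitly added axis.
z_ref = (x - x.mean(1)[:, None]) / x.std(1)[:, None]

print(row_means.shape)
print(np.allclose(z, z_ref))

# Without the extra axis, (10, 3) and (10,) cannot be broadcast.
try:
    x - x.mean(1)
    failed = False
except ValueError:
    failed = True
print(failed)
```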

In [215]:
a=np.array(range(9))
a=a.reshape(3,3)
b=np.array(range(100, 103))
print('a: ', a, '\n', 'b: ', b)


a:  [[0 1 2]
 [3 4 5]
 [6 7 8]] 
 b:  [100 101 102]



In [218]:
a+b # b has shape (3,), so it is added to each row of a

array([[100, 102, 104],
       [103, 105, 107],
       [106, 108, 110]])

In [220]:
a+b[:,None] # b[:,None] has shape (3,1), so it is added to each column of a

array([[100, 101, 102],
       [104, 105, 106],
       [108, 109, 110]])

## Structured Array

Numpy provides structured arrays, which store fields of heterogeneous data types together with their names. Structured arrays are helpful for small data sets, but we will soon see that pandas is much more powerful for working with structured data. Even so, structured numpy arrays give good intuition about how pandas works.

In [75]:
ID=['Alice', 'Bob', 'Cathy']
age=[25,45,37]
weight=[55.0, 85.5, 68.0]
np.dtype({'names':('ID', 'age', 'weight'),'formats':('U', 'i4', 'f8')})


dtype([('ID', '<U'), ('age', '<i4'), ('weight', '<f8')])

In [76]:
data=np.zeros(3,dtype={'names':('ID', 'age', 'weight'),
                      'formats':('U10', 'i4', 'f8')})
# U10 is a Unicode string of at most 10 characters, i4 is int32, f8 is float64
data['ID']=ID
data['age']=age
data['weight']=weight


In [87]:
data[-1]['ID']
data['ID']
data['age']
list(data)

[('Alice', 25, 55.), ('Bob', 45, 85.5), ('Cathy', 37, 68.)]
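Fields of a structured array can be combined with Boolean masks and sorting, much like columns of a table. The following is a small self-contained sketch; the array name `people` and the variable `older` are made up for illustration, and the values repeat the example above.

```python
import numpy as np

people = np.zeros(3, dtype=[('ID', 'U10'), ('age', 'i4'), ('weight', 'f8')])
people['ID'] = ['Alice', 'Bob', 'Cathy']
people['age'] = [25, 45, 37]
people['weight'] = [55.0, 85.5, 68.0]

# Boolean mask built from one field, applied to whole records.
older = people[people['age'] > 30]
print(older['ID'])

# A field behaves like an ordinary array.
print(people['weight'].mean())

# Sort the records by a named field.
print(np.sort(people, order='age')['ID'])
```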

### Structured Array to Pandas DataFrame

In [89]:
import pandas as pd
data_pd=pd.DataFrame(data)
data_pd

|   | ID | age | weight |
|---|-------|-----|--------|
| 0 | Alice | 25 | 55.0 |
| 1 | Bob | 45 | 85.5 |
| 2 | Cathy | 37 | 68.0 |


## One Last Example for Practicing What's Covered in This Session

Let's simulate a data set with 4 covariates (including the intercept) and a response variable. The dimension of the data set is 100 by 5.

In [122]:
np.random.seed(15315)
X=np.random.normal(size=(100,4))
X[:,0]=1
beta=[-2,3,5,-4]
Y=X.dot(beta)+np.random.normal(size=len(X))
data=np.column_stack([X,Y])
np.savetxt('SimulatedData.txt', data)
del X
del Y
del data

Let's practice reading in the simulated data set file with a Python I/O stream.

In [227]:
# Please type the code above for reading in the simulated data.
# Please think about what each step is doing

The simple OLS model assumes the linear functional form: $Y=X \beta+\epsilon$.

The OLS estimator of $\beta$ has the formula: $\hat{\beta}=(X^T X)^{-1} X^T Y$.

In [223]:
# Please type the code above for computing the OLS estimator for beta.
# Please think about what each step is doing

In [224]:
#help(OLS)

Help on function OLS in module __main__:

OLS(X, Y)
    input: covariate matrix of X, including intercept
           response variable, Y vector
    output: OLS estimator of beta hat



In [125]:
#OLS(X,Y)

array([-2.0184651 ,  2.97390324,  5.15071529, -3.76644403])

In [225]:
# We can also use a numpy function to compute the OLS estimate of beta directly
#np.linalg.lstsq(X,Y, rcond=None)[0]

array([-2.0184651 ,  2.97390324,  5.15071529, -3.76644403])
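The agreement between the closed-form estimator and `np.linalg.lstsq` can be sketched on a small made-up data set. The function name `ols_sketch` is hypothetical and is not the `OLS` function referred to above. As a design choice, it solves the normal equations $(X^T X)\hat{\beta} = X^T Y$ with `np.linalg.solve` instead of forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))       # made-up covariates
X[:, 0] = 1                        # first column is the intercept
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=50)

def ols_sketch(X, Y):
    """Return the OLS estimate by solving (X^T X) b = X^T Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

beta_hat = ols_sketch(X, Y)
beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

For a well-conditioned design matrix the two results should match to floating-point precision; `lstsq` is generally the more robust choice when the columns of X are nearly collinear.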