Python Basics¶

This is an introductory guide to Python, ideal for beginners.

Let's create a simple bank program¶

P.S. To run a cell and create the next one simultaneously, press Shift+Enter.
When no cell is being edited (the left side of the cell goes blue), you can also press 'a' to make a cell above and 'b' to make a cell below.

Variables¶

action = "Withdraw"
# Action we're doing
balance = 50000
# Our original balance
interest = 0.4
# Interest rate of our account (0.4 = 40%)
amount = 50
# Amount used in an action

Notice how a small [1] appears next to the code cell above. This means it was the first cell executed in this session. Keep an eye on this number in case you run cells in the wrong order.

print(type(action))
print(type(balance))
print(type(interest))
print(type(amount))

<class 'str'>
<class 'int'>
<class 'float'>
<class 'int'>

type() tells us the datatype Python has inferred, since we don't declare it ourselves. We can also convert between compatible datatypes...

balance = float(balance)
print(type(balance))

<class 'float'>
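
Conversions do not always preserve the value, so it helps to check what each one returns. The sketch below is illustrative only; the variable names are made up for this example and are not part of the bank program.

```python
# Converting between built-in types (names here are illustrative)
balance = 50000
as_float = float(balance)     # int -> float
as_text = str(as_float)       # float -> str
back_to_int = int(as_float)   # float -> int
truncated = int(49.99)        # int() drops the decimal part, it does not round

print(as_float, as_text, back_to_int, truncated)
```

Note that int() truncates towards zero, so use round() first if you want the nearest whole number.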

Selection (If/Else)¶

if action == "Withdraw":
    print("You're withdrawing funds")

You're withdrawing funds

Note that the condition in a selection statement doesn't need to be wrapped in brackets.

if action == "Withdraw" and amount <= balance:
    print("You've decided to withdraw and have sufficient funds")

You've decided to withdraw and have sufficient funds

Chaining logic uses the keywords 'and' / 'or'. Let's also copy this cell

if action == "Withdraw" and amount <= balance:
    balance -= amount
    print(balance)

49950.0

Let's add other cases such as 'Deposit' and 'Calculate Interest'

if action == "Withdraw":
    print("Withdraw")
elif action == "Deposit":
    print("Deposit")
else:
    print("Calculate interest")

Withdraw

Python uses 'elif' rather than 'else if'.
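
As a rough sketch of where this is heading, the branches could update the balance instead of just printing the action name. The starting values and the "Unknown action" message are assumptions chosen for illustration.

```python
# Illustrative starting values
action = "Deposit"
balance = 50000.0
amount = 50
interest = 0.4

if action == "Withdraw" and amount <= balance:
    balance -= amount
elif action == "Deposit":
    balance += amount
elif action == "Calculate interest":
    balance = balance * (1 + interest)
else:
    # Reached for an unrecognised action, or a withdrawal larger than the balance
    print("Unknown action or insufficient funds")

print(balance)
```

Only the first branch whose condition is true runs, so the order of the elif checks matters.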

Iteration¶

To calculate compound interest, let's test both types of loop in Python.
We will do this by saying that each month the balance increases by the rate we set in interest (0.4, i.e. 40%).

months = 5

for i in range (0, 3):
    print(i)

0
1
2

Above is a for (Count-controlled) loop. i is the counter and will increase within the range we set. Our range is 0 to 3 which means it will go up to (but not including) 3.

for j in range(0,months):
    balance = balance * (1 + interest)
    print("Balance:"+balance)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-0fe331476169> in <module>()
      1 for j in range(0,months):
      2     balance = balance * (1 + interest)
----> 3     print("Balance:"+balance)

TypeError: must be str, not float

Be careful with datatypes. Although print will accept any datatype on its own, when you join values to a string with +, every other value must first be converted to a string with str().
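
There are a few common ways to avoid this error. The sketch below assumes Python 3.6 or later for the f-string line; the value of balance is illustrative.

```python
balance = 70000.0

print("Balance: " + str(balance))   # convert explicitly, then concatenate
print("Balance:", balance)          # pass separate arguments; print converts each one
print(f"Balance: {balance:.2f}")    # f-string, formatted to two decimal places

message = f"Balance: {balance:.2f}"
```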

balance = 50000.0
print("Starting balance = "+str(balance))
for j in range(0,months):
    balance = balance * (1 + interest)
    print("Balance:"+str(balance))

Starting balance = 50000.0
Balance:70000.0
Balance:98000.0
Balance:137200.0
Balance:192080.0
Balance:268912.0

We could also do this with a while (condition-controlled) loop, although for a task with a fixed number of repetitions a for loop is typically preferred.

balance = 50000.0
print("Starting balance = "+str(balance))

counter = 0
while counter < months:
    balance = balance * (1 + interest)
    print("Balance:"+str(balance))
    counter += 1

Starting balance = 50000.0
Balance:70000.0
Balance:98000.0
Balance:137200.0
Balance:192080.0
Balance:268912.0

Lists¶

Although understanding variables is essential, you're likely to mostly use constructs such as lists and dictionaries.

myList = list()
# or
myList = []

Creating an empty list

myList = [1,2,3]

Creating a simple 3 item list

myList = ["abc", 2, "def"]

Not fussy about mixing data

myList.append(4)
print(myList)

['abc', 2, 'def', 4]

Also happy to add items

myList[0] = "xyz"
print(myList)

['xyz', 2, 'def', 4]

Changing items in a list also uses standard zero indexing (first item is 0)

for i in range(0, len(myList)):
    print(myList[i])

xyz
2
def
4

Iterating through a list can be done two ways. The first is simply setting a range up to the length of the list

for i in myList:
    print(i)

xyz
2
def
4

Even easier is Python's built-in way of looping over a list, which visits every item in turn. Instead of i holding the current index, it holds the current item.

Functions¶

You may not need to produce many functions in your project but understanding them is essential

def myFunc():
    print("abc")

Defining a function uses the keyword 'def' followed by the function name with brackets.

myFunc()

abc

To call it we simply type its name with brackets. As you can see the function prints "abc" as expected

def myFuncWithParams(x):
    print(x.lower())

Functions may also require parameters. In this case our function should print whatever is passed to it in lowercase

myFuncWithParams("I lovE PythoN")

i love python
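
Functions can also return a value and give parameters default values. As a sketch, the compound interest loop from earlier could be wrapped in a function; the name compound and its parameters are hypothetical, chosen for this example.

```python
def compound(balance, interest, months=1):
    """Return the balance after applying interest once per month."""
    for _ in range(months):
        balance = balance * (1 + interest)
    return balance

# months is optional; it defaults to 1 when left out
final = compound(50000.0, 0.4, months=5)
print(final)
```

Returning the result, rather than printing it inside the function, lets the caller store it or use it in further calculations.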

Python Essentials¶

This should build upon the last code demo with a few more advanced concepts.

Imports¶

import numpy as np # importing numpy as np

You can import a Python module using the import command. You can also rename it (i.e. numpy to np)

data_new = [6, 7, 8, 0, 1]
data = np.array(data_new) # accessing numpy as np. Here I am converting a list to array
print(data)

[6 7 8 0 1]

Above we've used numpy to create a numpy array out of the list. This will be useful later as numpy arrays are used by modules later in your project

More Strings¶

a = 'Big data'
print(type(a))
print(isinstance(a, str))

<class 'str'>
True

You can get the type of an object using the type function, and check whether an object is an instance of a particular type using the isinstance function.

x = ' This is big data examiner'
x[10] = 'f'

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-0f8fdae752aa> in <module>()
      1 x= ' This is big data examiner'
----> 2 x[10] = 'f'

TypeError: 'str' object does not support item assignment

x = x[0:9] + "a lecture"
print(x)

 This is a lecture

Strings are immutable, so a character cannot be replaced by assigning to its index. Instead, build a new string, for example by slicing and concatenating.
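
A minimal sketch of swapping a single character by slicing around it. The string here is illustrative (it has no leading space, unlike the one above).

```python
x = "This is big data examiner"
index = 10  # position of the character to swap (the 'g' in "big")

# Everything before the index + new character + everything after it
changed = x[:index] + "f" + x[index + 1:]

print(changed)
print(x)  # the original string is left untouched
```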

x = 'Java is a powerful programming language'
y = x.replace('Java', 'Python')
print(y)

Python is a powerful programming language

Replace is another useful function to replace characters and words

a = 'Python'
print(list(a))
print(a[:3])  
print(a[3:])

['P', 'y', 't', 'h', 'o', 'n']
Pyt
hon

Python is also quite flexible in converting between datatypes. For example it will turn a string into a list of characters with ease

# String concatenation is very important
p = "Python is the best programming language"
q = ", I have ever seen"
print(p+q)

Python is the best programming language, I have ever seen

String concatenation

print("Costs £%.3f for a %s"  %(1.35, 'bag of sweets'))
print("Costs £%.2f for a %s"  %(0.73, 'apple'))
print("Costs £%.d for a %s"  %(1.13, 'chocolate bar'))

Costs £1.350 for a bag of sweets
Costs £0.73 for a apple
Costs £1 for a chocolate bar

You will do a lot of string formatting during data analysis. Use %s to format an argument as a string, %d for an integer (any decimal part is dropped), and %.3f for a number with 3 decimal places. To do more with strings, look into string formatting in Python.
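
The % operator is one of several formatting styles. As a sketch, here is the same message built three ways; the f-string version assumes Python 3.6 or later, and the variable names are illustrative.

```python
price = 1.35
item = "bag of sweets"

old_style = "Costs £%.2f for a %s" % (price, item)           # % operator
with_format = "Costs £{:.2f} for a {}".format(price, item)   # str.format
with_fstring = f"Costs £{price:.2f} for a {item}"            # f-string

print(old_style)
print(with_format)
print(with_fstring)
```

All three produce the same text, so the choice is mostly about readability.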

Date-time¶

# Python date and time module provides datetime, date and time types
from datetime import datetime, date, time
td = datetime(1989,6,9,5,1,30)  # do not write number 6 as 06, you will get an invalid token error.
print(td.day)

9

print(td.minute)
print(td.date())
print(td.time())

1
1989-06-09
05:01:30

print (td.strftime('%d/%m/%y %H:%M:%S'))

09/06/89 05:01:30

Datetime is a useful module for date and time formatting. It allows you to create a datetime object and then print and format elements as you wish.

Note that pressing shift + tab on a function should tell you its parameters

td = datetime(1989,6,9,5,1, 30)
td1 = datetime(1988,8, 31, 11, 2, 23)
new_time = td1 - td  # you can subtract one datetime object from another
print(new_time)

-282 days, 6:00:53

Dates and times can also be subtracted from one another to calculate the difference. The result is negative here because td1 is earlier than td.
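
Subtracting two datetime objects gives a timedelta. As a short sketch using the same two dates, abs() turns a negative difference into a positive duration, and the days attribute and total_seconds() method expose it as plain numbers.

```python
from datetime import datetime

td = datetime(1989, 6, 9, 5, 1, 30)
td1 = datetime(1988, 8, 31, 11, 2, 23)

gap = abs(td1 - td)          # positive duration between the two moments
print(gap)                   # 281 days, 17:59:07
print(gap.days)              # whole days only
print(gap.total_seconds())   # the full duration in seconds
```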

Handling Exceptions¶

print (float('7.968'))
print (float('Big data'))

7.968

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-37-b9f28340c149> in <module>()
      1 print (float('7.968'))
----> 2 print (float('Big data'))

ValueError: could not convert string to float: 'Big data'

A string that doesn't represent a number cannot be converted to a float (a numeric datatype). To avoid hitting this error we should use a try-except statement (just like a try-catch in Java).

def return_float(x):
    try:
        return float(x)
    except ValueError:  # raised when the text is not a valid number
        return 0

print (return_float('4.55'))
print (return_float('big data'))

4.55
0

The error for converting a string has been handled in the except section of the statement, so instead of printing an error, it returns 0
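
Returning 0 can hide bad data, because a failed conversion then looks like a real zero. One possible variation, sketched below with a hypothetical parse_float helper, names the exceptions it expects and lets the caller choose the fallback value.

```python
def parse_float(text, default=None):
    """Return text as a float, or default if it cannot be converted."""
    try:
        return float(text)
    except (ValueError, TypeError):
        # ValueError: a non-numeric string; TypeError: e.g. None was passed in
        return default

print(parse_float("4.55"))
print(parse_float("big data"))
print(parse_float(None, default=0.0))
```

Catching only the exceptions you expect means unrelated problems still raise an error instead of being silently swallowed.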

Tuples¶

deep_learning = ('SkLearn', 'Open cv', 'Torch')  # you can unpack a tuple
print(deep_learning[0])

SkLearn

Tuples are immutable, which means their items and length can't be changed after creation, but just like lists, items can be fetched by index.

x,y,z= deep_learning
print (x)
print (y)
print (z)

SkLearn
Open cv
Torch

Because our tuple is 3 items long, it can also be unpacked into 3 separate variables using x, y, z. (The same can be done with a list.)
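
The sketch below shows what happens when you try to change a tuple, along with star-unpacking, which collects the remaining items into a list. The variable names are illustrative.

```python
libraries = ("SkLearn", "Open cv", "Torch")

try:
    libraries[0] = "Pandas"   # tuples do not support item assignment
except TypeError as error:
    print("Cannot modify a tuple:", error)

# The starred name collects whatever is left over into a list
first, *rest = libraries
print(first)
print(rest)
```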

More Lists¶

countries = ['Usa', 'Russia', 'Usa', 'Germany', 'France', 'Italy']
countries.count('Usa')  # .count can be used to count how many times a value appears in a list/tuple

2

Use of the count function

x = [3,2,3]
x.extend([4,9,6])
print(x)

[3, 2, 3, 4, 9, 6]

When adding multiple items extend is used rather than append

x.sort()
print(x)

[2, 3, 3, 4, 6, 9]

Python also has a handy sort function

countries.sort()
print(countries)
countries.sort(key=len)  # countries are sorted according to number of characters
print(countries)

['France', 'Germany', 'Italy', 'Russia', 'Usa', 'Usa']
['Usa', 'Usa', 'Italy', 'France', 'Russia', 'Germany']

You can also define the sort key if the default isn't what you want (e.g. sorting by length rather than alphabetically).
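
The sort method changes the list in place. If you want to keep the original order, the built-in sorted function returns a new list and accepts the same key and reverse arguments, as sketched here.

```python
countries = ['Usa', 'Russia', 'Usa', 'Germany', 'France', 'Italy']

# Longest names first; the original list is not modified
by_length_desc = sorted(countries, key=len, reverse=True)
print(by_length_desc)

# Ignore case when comparing
case_insensitive = sorted(['usa', 'France', 'italy'], key=str.lower)
print(case_insensitive)
```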

languages = ['Python', 'Pandas', 'Keras', 'Tensorflow']

for i,val in enumerate(languages):
    print (i,val)

0 Python
1 Pandas
2 Keras
3 Tensorflow

When iterating over a sequence, you can keep track of the index of the current element with 'enumerate', which gives the counter (i) and the item (val).

first_name = ['Ben', 'John', 'Kevin']
last_name = ['Andrew', 'Bustard', 'McLaughlin']
combined = zip(first_name, last_name)

for i in combined:
    print(i)

('Ben', 'Andrew')
('John', 'Bustard')
('Kevin', 'McLaughlin')

Zipping is also useful for combining lists into tuples (grouping items from separate lists).
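
Two common follow-ups, sketched below, are unpacking each pair directly in the loop and turning the pairs into a dictionary. The names full_names and lookup are illustrative.

```python
first_name = ['Ben', 'John', 'Kevin']
last_name = ['Andrew', 'Bustard', 'McLaughlin']

# Unpack each (first, last) pair as the loop runs
full_names = [f"{first} {last}" for first, last in zip(first_name, last_name)]
print(full_names)

# dict() accepts an iterable of (key, value) pairs
lookup = dict(zip(first_name, last_name))
print(lookup['John'])
```

Note that a zip object can only be looped over once, so call zip again (or wrap it in list()) if you need the pairs a second time.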

list(reversed(range(20)))

[19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Reversed list

Dictionaries¶

myDict = {'a' : 3, 'b' : 6}
# key : value

Dictionaries are another important construct. Dictionaries allow you to map keys to values, which allows us to get items by id rather than index

print(myDict.get('a'))

3

How you get an item by key from a dictionary

for key in myDict:
    print(key)

a
b

Looping over a dictionary directly gives its keys (to loop over the values instead, use myDict.values())

for key, value in myDict.items():
    print(key)
    print(value)

a
3
b
6

Printing both key and value in loop

myDict.update({'a' : 4})
print(myDict)

{'a': 4, 'b': 6}

Updating an item by key

myDict.update({'c' : 12})
print(myDict)

{'a': 4, 'b': 6, 'c': 12}

Adding an item is the same (will add if item is not found)

print(myDict.pop('a'))
print(myDict)

4
{'b': 6, 'c': 12}

Delete using pop (which also returns the value of the item)
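
For reference, square brackets are a common alternative to update and get. The sketch below uses a made-up dictionary called stock.

```python
stock = {'a': 3, 'b': 6}

stock['c'] = 12               # add a new key (or overwrite an existing one)
missing = stock.get('z', 0)   # get() with a default for keys that are absent
has_a = 'a' in stock          # membership test checks the keys
values = list(stock.values())

print(stock, missing, has_a, values)
```

Reading a missing key with square brackets (stock['z']) raises a KeyError, whereas get returns None or the default you supply.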

Cleaning data¶

Raw data is messy, so you have to clean the data set to make it ready for analysis. Here we have a list of countries containing unnecessary punctuation, inconsistent capitalisation and white space.

First, I am importing the Python [regular expression](https://docs.python.org/2/library/re.html) module (re)
Second, I am creating a function called removeBadCharacters to remove the unnecessary punctuation
Third, I am using some of Python's inbuilt functions to clean the text

countries = ['       Argentina', '$USA$', 'france', 'GerMany', 'Kenya!', 'India##', 'Spain(www.spain.com)']
import re

Above is a typical example of the kind of data you may need to clean. Creating functions to apply cleaning is a very good idea (especially when you start using pandas).

def removeBadCharacters(text):
    return re.sub(r'[^\w\s]','',text)

for i in range(0, len(countries)):
    countries[i] = removeBadCharacters(countries[i])
print(countries)

['       Argentina', 'USA', 'france', 'GerMany', 'Kenya', 'India', 'Spainwwwspaincom']

The function removeBadCharacters uses re to apply a regex (a special character sequence describing a pattern) that replaces every character that is not a letter, digit, underscore or whitespace with an empty string (i.e. removes it). The result is the cleaned list shown above.

for i in range(0, len(countries)):
    countries[i] = countries[i].strip()
print(countries)

['Argentina', 'USA', 'france', 'GerMany', 'Kenya', 'India', 'Spainwwwspaincom']

strip is one of Python's inbuilt string methods. It removes all leading and trailing whitespace.

for i in range(0, len(countries)):
    countries[i] = countries[i].lower().capitalize()
print(countries)

['Argentina', 'Usa', 'France', 'Germany', 'Kenya', 'India', 'Spainwwwspaincom']

lower() makes all characters lowercase and capitalize() makes the first letter uppercase.

The formatting is nearly complete but the last item which had a URL in brackets is still incorrect. Try creating a function called removeUrl which turns a string such as "Spain(www.spain.com)" to "Spain"
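
The three cleaning steps used so far could also be combined into one function and applied with a list comprehension. This is only a sketch: clean_country is a hypothetical name, and URL removal is deliberately left out because it is the exercise above.

```python
import re

def clean_country(text):
    """Strip punctuation and surrounding whitespace, then normalise capitalisation."""
    no_punctuation = re.sub(r'[^\w\s]', '', text)
    # capitalize() also lowercases everything after the first letter
    return no_punctuation.strip().capitalize()

raw = ['       Argentina', '$USA$', 'france', 'GerMany', 'Kenya!', 'India##']
cleaned = [clean_country(name) for name in raw]
print(cleaned)
```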

Visualisation¶

import matplotlib.pyplot as plt

plt.plot([1,2,3,4])
plt.ylabel('Some numbers')
plt.show()

y = [3, 10, 7, 5, 3, 4.5, 6, 8.1]
x = range(len(y))
print(x)

range(0, 8)

plt.bar(x, y, color="blue")
plt.show()

myDict = {'PersonA':26, 'PersonB': 17, 'PersonC':30}
plt.bar(list(myDict.keys()), list(myDict.values()))

<BarContainer object of 3 artists>

x = np.linspace(0,10,10)
y1 = x
y2 = x**2
y3 = x**3
y4 = np.sqrt(x)

fig, ax_lst = plt.subplots(2, 2)  # one figure with a 2x2 grid of axes
plt.subplot(2,2,1)
plt.plot(x, y1, 'ro')
plt.subplot(2,2,2)
plt.plot(x, y2, 'bo')
plt.subplot(2,2,3)
plt.plot(x, y3, 'go')
plt.subplot(2,2,4)
plt.plot(x, y4, 'yo')

[<matplotlib.lines.Line2D at 0x7f7d58664278>]

The Live EDA Demo¶

This demo extends what we've covered about Python and also introduces the basics of some other essential libraries such as pandas. This guide should act as a simplified version of the kind of notebook you will produce throughout your project.

Importing Data & Cleaning¶

import pandas as pd

Pandas is the package you will be learning next. It's used for accessing and modifying datasets. For now we will just use it to open our 'games' dataset and print

df = pd.read_csv('games.csv')
df.head()

# df = pd.read_csv('_____.tab', sep='\t') --> Needed for household dataset tab files
df2 = pd.read_csv("example.tab", sep='\t')
df2.head()

Pandas allows us to open a csv/tab file and save it as a dataframe (called 'df'). This is extremely useful as it will mostly organise the data as we want, and allows us to easily get columns/rows. df.head() is used to print the top of the table just to have a quick peek.

print(df.shape)

(16598, 11)

Shape tells us its structure. This dataset has 16598 rows and 11 columns.

print(df.dtypes)

Rank              int64
Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object

dtypes tells us datatypes of the columns

print(df.loc[0])

Rank                     1
Name            Wii Sports
Platform               Wii
Year                  2006
Genre               Sports
Publisher         Nintendo
NA_Sales             41.49
EU_Sales             29.02
JP_Sales              3.77
Other_Sales           8.46
Global_Sales         82.74
Name: 0, dtype: object

df.loc (locate) gets row data by index label, so here we've selected the first row and printed out all of its details.

for i in range(0, 20):
    print(df.loc[i]["Genre"])

Sports
Platform
Racing
Sports
Role-Playing
Puzzle
Platform
Misc
Platform
Shooter
Simulation
Racing
Role-Playing
Sports
Sports
Misc
Action
Action
Platform
Misc

Above is a basic example of iterating through rows in our data. At each iteration we've used df.loc[i] to get the current row's data, and then specify we only want "Genre" returned. Through learning pandas you will find more efficient ways of doing this

for i in range(0, df.shape[0]):
    if df.loc[i]["Genre"] == "Misc":
        df.drop(i, inplace=True)

print(df["Genre"])

0              Sports
1            Platform
2              Racing
3              Sports
4        Role-Playing
5              Puzzle
6            Platform
8            Platform
9             Shooter
10         Simulation
11             Racing
12       Role-Playing
13             Sports
14             Sports
16             Action
17             Action
18           Platform
20       Role-Playing
21           Platform
22           Platform
23             Action
24             Action
25       Role-Playing
26       Role-Playing
27             Puzzle
28             Racing
29            Shooter
30       Role-Playing
31            Shooter
32       Role-Playing
             ...     
16568          Puzzle
16569         Shooter
16570      Simulation
16571       Adventure
16572       Adventure
16573          Racing
16574          Racing
16575       Adventure
16576          Sports
16577         Shooter
16578          Sports
16579          Sports
16580       Adventure
16581          Sports
16582          Action
16583          Action
16584          Puzzle
16585         Shooter
16586       Adventure
16587          Sports
16588          Puzzle
16589          Action
16590    Role-Playing
16591       Adventure
16592      Simulation
16593        Platform
16594         Shooter
16595          Racing
16596          Puzzle
16597        Platform
Name: Genre, Length: 14859, dtype: object

Above we use drop in the loop to remove any games in the "Misc" genre. It is a simple (and by no means the most efficient) way of removing rows based on a check.

You'll have to do a lot more cleaning than this, so follow the Codecademy pandas course closely to learn how to do it more effectively and efficiently.
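
As a taste of a more efficient approach, a boolean mask can filter rows without an explicit loop. The sketch below uses a tiny made-up DataFrame in place of the games dataset; the column names mirror it, but the rows are invented for illustration.

```python
import pandas as pd

# A small invented frame standing in for the games dataset
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Genre": ["Sports", "Misc", "Racing", "Misc"],
})

# The comparison gives one True/False per row; indexing with it keeps the True rows
filtered = games[games["Genre"] != "Misc"]

print(filtered)
print(filtered.shape)
```

As with drop, the original index labels are kept, so the remaining rows here are labelled 0 and 2.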

Analysing data¶

Once your data is clean and you've set up your dataframe for analysis, begin testing and further understanding your data

platforms = dict(df["Platform"].value_counts())
print(platforms)

{'PS2': 1939, 'DS': 1770, 'PS3': 1205, 'X360': 1139, 'PS': 1120, 'PSP': 1107, 'Wii': 1045, 'PC': 936, 'XB': 778, 'GBA': 712, 'GC': 520, '3DS': 456, 'PSV': 389, 'PS4': 321, 'N64': 301, 'SNES': 222, 'XOne': 198, 'SAT': 158, '2600': 128, 'WiiU': 122, 'NES': 96, 'GB': 90, 'DC': 52, 'GEN': 26, 'NG': 12, 'WS': 6, 'SCD': 4, '3DO': 3, 'TG16': 2, 'PCFX': 1, 'GG': 1}

Pandas has some simple tools for inspecting your data. For example, by counting we can see how popular each category is, but visualisations are much better...

import matplotlib.pyplot as plt

Matplotlib is a basic graphical package for python. We'll use it to make some simple graphs and draw some meaning from our data

plt.plot(platforms.keys(), platforms.values())

[<matplotlib.lines.Line2D at 0x7f7d4b9b0438>]

As you can see, the default can be messy. To clear it up, let's make it bigger and then change to a bar chart.

plt.figure(figsize=(15,9))
plt.plot(platforms.keys(), platforms.values())

[<matplotlib.lines.Line2D at 0x7f7d4b8c97b8>]

We've added a figsize parameter when we initialise the figure. Now it's a little easier to read, but the chart type still doesn't make sense.

plt.figure(figsize=(15,9))
plt.bar(list(platforms.keys()), list(platforms.values()))

<BarContainer object of 31 artists>

Great, now we can see the data and understand it effectively. By creating multiple charts and comparing variables we can start to draw greater meaning from the data and help us make a discovery

platform = []
popularity = []
numberOfPlatforms = 9

for key in sorted(platforms, key=platforms.get, reverse=True):
    platform.append(key)
    popularity.append(platforms.get(key))
    
platform_sample = platform[0:numberOfPlatforms]
popularity_sample = popularity[0:numberOfPlatforms]

print(platform_sample)
print(popularity_sample)

['PS2', 'DS', 'PS3', 'X360', 'PS', 'PSP', 'Wii', 'PC', 'XB']
[1939, 1770, 1205, 1139, 1120, 1107, 1045, 936, 778]

Before moving onto the next visualisation I'm sorting the keys/values we have for each platform and splitting these into lists. This way I can get the data for the 9 most popular platforms which will help simplify the chart

other = 0
for i in range(numberOfPlatforms, len(popularity)):
    other += popularity[i]
platform_sample.append("Other")
popularity_sample.append(other)

print(platform_sample)
print(popularity_sample)

['PS2', 'DS', 'PS3', 'X360', 'PS', 'PSP', 'Wii', 'PC', 'XB', 'Other']
[1939, 1770, 1205, 1139, 1120, 1107, 1045, 936, 778, 3820]

I'm also including a loop to sum all other items not included in these lists, which is named the "Other" category. This is important to reflect the fact that there are more than 9 platforms
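
The same top-N plus "Other" grouping can be written more compactly with slicing and sum. This sketch uses a shortened, illustrative version of the platforms dictionary and a top_n of 3 rather than 9.

```python
# Shortened sample of the platform counts, for illustration
platforms = {'PS2': 1939, 'DS': 1770, 'PS3': 1205, 'PC': 936, 'GB': 90, 'DC': 52}
top_n = 3

# (name, count) pairs ordered from most to least popular
ranked = sorted(platforms.items(), key=lambda pair: pair[1], reverse=True)

labels = [name for name, count in ranked[:top_n]] + ["Other"]
sizes = [count for name, count in ranked[:top_n]]
sizes.append(sum(count for name, count in ranked[top_n:]))

print(labels)
print(sizes)
```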

fig1, ax1 = plt.subplots()
ax1.pie(popularity_sample, labels=platform_sample, autopct='%1.1f%%', shadow=False, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

Now we have a pie chart that accurately reflects our data, with some summarisation in the "Other" category.

For a challenge, when you've got a better understanding of pandas, try creating dataframes for different years and build a chart to compare how popular certain gaming platforms were in each year side-by-side.

What else should you include in your notebook¶

More advanced visualisations; There are more effective and advanced charts to use in matplotlib and seaborn
Extra datasets; Gather more insight and collect more data to build a greater understanding in your analysis
Look from a different angle; Analyse all relevant data you have in several different ways and think outside the box to make a useful discovery
Consider using machine learning; Use machine learning to build a model which can use your analysis to predict future trends

Don't worry about this too much yet, we'll cover all of this throughout the course. For now get more familiar with python and begin learning pandas and then matplotlib/seaborn

The Live Pandas Demo¶

This demo will show you how pandas works.

import pandas as pd

data

%time

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.48 µs

$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$

	Name	Price
0	Apple	3.3
1	Banana	1.3

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
0	1	Wii Sports	Wii	2006.0	Sports	Nintendo	41.49	29.02	3.77	8.46	82.74
1	2	Super Mario Bros.	NES	1985.0	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24
2	3	Mario Kart Wii	Wii	2008.0	Racing	Nintendo	15.85	12.88	3.79	3.31	35.82
3	4	Wii Sports Resort	Wii	2009.0	Sports	Nintendo	15.75	11.01	3.28	2.96	33.00
4	5	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37