The #100DaysOfMLCode is a challenge in which you spend at least 1 hour a day learning about Machine Learning and you publish what you have been learning to keep yourself accountable during that time.
During my first week of #100DaysOfMLCode I’ve been working on two different courses in no particular order. Here is the list of courses:
- Machine Learning Crash Course
- Computer Vision Udacity Nanodegree (free preview)
This is what I learned about Machine Learning during my first 2 days:
Key Machine Learning Terminology
- Feature: an input variable we feed into a model. It can be as simple as a single number or as complex as an image (which in reality is a vector of numbers, where each pixel is a feature).
- Label: the thing we are predicting, normally referred to as y.
- Prediction (or predicted value): the value a previously trained model outputs for a given input, referred to as y'.
Regression vs. classification:
- A regression model predicts continuous values.
- A classification model predicts discrete values.
Linear Regression
Linear regression is a method for finding the straight line or hyperplane that best fits a set of points.
Line formula:
y = wx + b
Where:
w = Weights
x = Input features
b = Bias
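As a quick illustration of the line formula, here is a minimal sketch in plain Python. The weight, bias, and input values are made up for the example, and `predict` is a hypothetical helper name, not something from the course.

```python
def predict(x, w, b):
    """Return the prediction y' = w * x + b for a single input feature."""
    return w * x + b


# Hypothetical values chosen only for illustration.
w = 2.0  # weight (slope of the line)
b = 1.0  # bias (intercept)

print(predict(3.0, w, b))  # 2.0 * 3.0 + 1.0 = 7.0
```

With several features, w and x become vectors and the product becomes a sum of w_i * x_i terms, but the idea is the same.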
Some convenient loss functions for linear regression are:
- L2 loss, also called squared error, which is equal to (observation - prediction)^2, that is, (y - y')^2
- Mean Squared Error (MSE): the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and take the average (divide by the number of examples N): MSE = (1/N) * sum of (y - y')^2 over all examples.
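The two losses above can be sketched in a few lines of plain Python. The labels and predictions here are invented numbers, and the function names are my own, so treat this as an illustration rather than the course's code.

```python
def squared_error(observation, prediction):
    """L2 loss for a single example: (observation - prediction)^2."""
    return (observation - prediction) ** 2


def mean_squared_error(observations, predictions):
    """Average of the squared errors over all examples."""
    losses = [squared_error(y, y_pred) for y, y_pred in zip(observations, predictions)]
    return sum(losses) / len(losses)


# Made-up labels (y) and predictions (y') for three examples.
labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]

# Squared errors are 1, 0 and 4, so the MSE is (1 + 0 + 4) / 3.
print(mean_squared_error(labels, predictions))
```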
When training a model, we want to minimize the loss as much as possible to make the model more accurate, without overfitting.
This is what I learned about Pandas
Pandas is a great Python API for column-oriented data analysis.
To import pandas use the following line:
import pandas as pd
There are 2 primary data structures used in Pandas:
Series: represents a single column.
DataFrame: similar to a relational data table; it is composed of one or more Series.
To create a Series:
city_names = pd.Series(['Barcelona', 'Madrid', 'Valencia'])
population = pd.Series([1609000, 3166000, 790201])
To create a dataframe with the previous series use the following:
spain_cities_df = pd.DataFrame({ 'City name': city_names, 'Population': population })
A DataFrame is created by passing a dictionary that maps each column name (a string) to the Series holding that column's content.
Most commonly you will not write the content of a dataframe but read it from a file such as a comma separated values file (csv for short).
spain_cities_df = pd.read_csv('path/to/file.csv', sep=',')
You can get interesting statistics with the df.describe() function: for each numeric column you get the count, mean, std, min, 25%, 50%, 75% and max.
spain_cities_df.describe()
Another useful function is df.head(), which displays the first 5 rows so you can get an idea of what the DataFrame contains.
spain_cities_df.head()
Similarly, df.tail() returns the last 5 rows of the DataFrame. Both functions accept an integer for the number of rows to return; the default is 5, but you can use any number you want, for example:
spain_cities_df.tail(20)
Will return the last 20 rows of the DataFrame.
A powerful feature is graphing. DataFrame.hist lets you quickly study the distribution of values in a column:
spain_cities_df.hist('Population')
To access data just use a column name as the key of the dataframe:
spain_cities_df['City name']
Will return the whole Series, with the 3 items inside.
To access just one item in that column you can do this:
spain_cities_df['City name'][0]
That will return “Barcelona” as a string
It is also possible to return only a slice of the dataframe (by slicing as you would do with any array in python)
spain_cities_df[0:2]
Will return a DataFrame with the first 2 rows of the original DataFrame.
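Putting the access patterns above together, here is a small self-contained sketch. It rebuilds the example DataFrame and shows column access, single-item access, and row slicing. The variable names `cities`, `first_city` and `first_two_rows` are only illustrative.

```python
import pandas as pd

# Rebuild the example DataFrame from the post.
city_names = pd.Series(['Barcelona', 'Madrid', 'Valencia'])
population = pd.Series([1609000, 3166000, 790201])
spain_cities_df = pd.DataFrame({'City name': city_names, 'Population': population})

# Selecting a column by name returns a Series.
cities = spain_cities_df['City name']

# With the default integer index, label 0 is the first row.
first_city = spain_cities_df['City name'][0]

# Slicing with [0:2] selects rows 0 and 1, not columns.
first_two_rows = spain_cities_df[0:2]

print(first_city)
print(first_two_rows)
```

Note that the `[0]` lookup is by index label, so it only means "first item" while the DataFrame keeps its default integer index.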
Pandas also supports arithmetic on a whole Series. For example, dividing the Population Series by 1000 divides every value in that column by 1000.
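The original snippet for this step is not shown, so the following is my own minimal sketch of element-wise division on a Series, assuming the column in question is the population one. The name `population_in_thousands` is hypothetical.

```python
import pandas as pd

population = pd.Series([1609000, 3166000, 790201])

# Arithmetic on a Series is applied to every element and returns a new Series.
population_in_thousands = population / 1000

print(population_in_thousands)
```

The original `population` Series is left untouched; the division produces a new Series of floats.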
To add a new Series (column) to a DataFrame, simply assign it to a new column name.
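As a hedged example of adding columns, the sketch below derives two new columns from the existing population data. The column names 'Population (thousands)' and 'Over one million' are made up for this illustration.

```python
import pandas as pd

city_names = pd.Series(['Barcelona', 'Madrid', 'Valencia'])
population = pd.Series([1609000, 3166000, 790201])
spain_cities_df = pd.DataFrame({'City name': city_names, 'Population': population})

# Assigning to a column name that does not exist yet creates the column.
spain_cities_df['Population (thousands)'] = spain_cities_df['Population'] / 1000

# Comparisons are also element-wise, so this adds a boolean column.
spain_cities_df['Over one million'] = spain_cities_df['Population'] > 1000000

print(spain_cities_df)
```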
Every row in a DataFrame (and every value in a Series) gets an auto-generated integer index by default. Once assigned, an index value stays with its row: even if the data is reordered, the index moves with the row.
DataFrame.reindex reorders rows (it accepts a list of index values as the new order) and returns a new DataFrame.
spain_cities_df.reindex([2, 0, 1])
Will sort the cities as Valencia, Barcelona, Madrid
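To see that the index values travel with their rows, here is a short sketch that rebuilds the example DataFrame and reindexes it. The variable name `reordered` is my own.

```python
import pandas as pd

city_names = pd.Series(['Barcelona', 'Madrid', 'Valencia'])
population = pd.Series([1609000, 3166000, 790201])
spain_cities_df = pd.DataFrame({'City name': city_names, 'Population': population})

# reindex returns a new DataFrame; the original keeps its order.
reordered = spain_cities_df.reindex([2, 0, 1])

print(list(reordered['City name']))  # ['Valencia', 'Barcelona', 'Madrid']
print(list(reordered.index))         # [2, 0, 1]
```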
Pandas is huge and these are just the basics, of course, but knowing just this much it is already possible to do a lot of data analysis!
This is what I learned during my first 2 days of 100 Days of ML Code!
During the first week I was also learning about Convolutional Neural Networks and Computer Vision, but I will be posting about that in the next couple of days!
I post my daily updates on my Twitter account @georgestudenko and you can also see my daily progress on my GitHub repository.