Machine Learning Course text notes

By admin, October 13, 2018

This post contains notes from the Machine Learning course taken at the University of Toronto, 2018 (still ongoing).

This may make references to material that is copyrighted. Therefore, the material will not be attached.

The working book is Géron, A; Hands-On Machine Learning with Scikit-Learn & TensorFlow; O’Reilly; 7th release; 20-07-2018.

Most of the mentioned examples are part of the Jupyter notebooks mentioned in the working book.

This article will be divided into 12 parts, named day classes. Mostly, each of them represents a working book chapter.

O’Reilly’s multi-book subscription service is Safari Books Online. The mentioned book is available there.

http://oreilly.com/safari

It is really worthwhile to mention that both libraries – the Montréal/Québec BAnQ and the Toronto Public Library – offer access to Safari Books Online. They also offer interesting language-training tools and online books. The biggest public libraries have this kind of agreement with publishers. Amazing resources to stay updated!

Installing locally

Run once

virtualenv -p "c:\Program Files\python36\python.exe" env

Run always

.\env\Scripts\activate

jupyter nbextension enable toc2/main

jupyter notebook

pip3 install --upgrade -r .\handson-ml\requirements.txt

Class 18-09-2018

Math + statistics + computer science. SAS, AS400 mainframe, COBOL and Fortran, R, SPSS (hard if you miss a comma)

Tools – ML – where it came from and where it is going (AI)

Internal and external clients should know the concepts.

Tackle the project – (bitcoin price opportunity) – Kaggle

45,000K dollars in prizes

Beware with IT – 250 business indicators, fancy dashboards, for what? You may be good with software, but what is the problem? Which educated decision should we provide?

Credit reporting in Canada: it has been about logistic regression modeling for the last 20 years.

Regarding the model: if the environment changes little over time, you may keep the model for a long time.

However, in the next second, your model starts to be deprecated.

ML holds the ability to dynamically adapt to the changes

 

“DECISION MAKING IS THE HARDEST THING FOR A HUMAN TO DO.”

 

We do not want to be wrong, but there is an uncertainty. We live in a 3D Universe and have no idea about the future.

“Move on!” Assessing Good decision: were the outcomes good?

The best decisions are made with almost total control of the data. And they will still have some chance of failing.

Canada – A company has a likelihood of bankruptcy of 2%. Some sectors have 4%.

If you start a business nowadays, you MUST know the data, the missing data, and the gaps in order to make an EDUCATED DECISION.

“Algorithms do not think. A programmer came up with them. Therefore, there is room for mistakes.”

Models are not only mathematical ones. Engineering: the scale model of an airplane.

Rules-based models. Experts have experience. Wisdom is the integration of experience.

E.g. Crossing the street. An extremely dangerous decision. Some die doing that. Algorithm: detect the presence of cars. If not, OK. There are cars? Yes. Are they moving? Yes. Are they starting/accelerating? You estimate if something changed all the time.

We are pattern recognition animals. (Nobel Prize 2002)

System 1 – survival of the species – get unknown patterns and prepare your body to fight for life.

System 2 – Work the patterns.

It will appear as a sequence of if-then rules.

The difference between Artificial Intelligence and Machine Learning (from static modeling to dynamic modeling)

However, when the system itself makes the decision based on the information of a prediction, you have artificial intelligence.

That means we need to model the decision (a succession of rules, or odd rules).

OK, so when is it AI or not?

“Everybody is doing AI?” Not really.

Acquisition -> extract -> Process -> Make the decision

OCR – reading. ML or AI? It is AI because it makes a decision: is it an A, a B, an O, or a 0?

Our course is about providing info to humans so that they may make a decision.

Types of Machine Learning

Supervised, unsupervised, semi-supervised and Reinforcement Learning

Learn on the fly or by batch

Instance-based or model-based

Supervised Learning Classification

Columns (attributes) – variables; labeled items are used to identify unidentified items;

Features – the term is used mainly in trees, random forests, etc.

A feature is an attribute value;

Independent variables (statistics)

Example :

Universe – companies in Canada

The probability that a company goes bankrupt: get the characteristics of the companies that went bankrupt to predict which current companies will go bankrupt.

This model will work for some time and then become deprecated.

You teach the system. Easier and more precise. Very targeted, very specific. It may have pitfalls in the way it was defined or classified. A line in the sand. But you need to do that.

You use labels.

Unsupervised Learning

Data is unlabeled. It separates the universe into clusters: comparing all the different attributes, it builds groups that are as different as possible from each other and as alike as possible inside each group.

E.g. Marketers, consumer analysts use this.

Postal code can group you by age, interests, etc. They have so much information about you that your postal code may suffice.

Another is anomaly detection (cancer cells?)

Semi-supervised Learning

A mix of the two above.

On one side of the spectrum we have supervised learning; on the other side, unsupervised. We can classify who the people in different photos are by knowing some labeled ones.

Reinforcement Learning (rewarding)

Reward the system to find the best process. Water? Fire? Fire burns (-50 points). I need water (50 points).

Accumulate as many points as possible (walking robots)

The reward means you need to know well what problem you need to solve.

 

Reference about the reward to an old article.

What do you have to lose and what do you want to keep?

Like a chess game, or a life.

2nd WW – after the war you will come back home. Win it!

Vietnam War – After 1 year you will go back home. Do nothing and go back home (survive)

The goal of all companies is: MAKING MONEY.

If the parameters do not help that, it is wrong. It needs to be in the right direction. Sometimes the companies do not have strategies.

Batch learning

Takes time to process all available data at once.

Efficiency and accuracy. Apply the prices of the current month and compare with the model results.

Online Learning (on the fly)

The trading market needs to react live to retrain based on the feed of data.

It has a lot of constraints (physical) – speed, volume, process.

AWS – Ontario and BC. OK, but for all of Canada … a server was needed. What will you win, what will you lose?

The curse of dimensionality – the cost of doing it weekly was not giving much more value but would cost 10 times more. Monthly is OK.

Market crisis – out of scope. It should be another model / project.

If your opinion is too strong and you say something, it may happen. Moody’s has been accused of triggering it; they identified that with the Asian Tigers in 1997/98. Macroeconomic models have this kind of consequence.

If someone at a company like Teranet says something like this, it may happen again.

Independent variables should be independent. An example of a macro variable with low influence is the weather.

Fast (high-frequency) trading once almost broke the market because of a lack of bias in their systems.

Instance-Based Learning

Learn by heart. (K-Nearest Neighbors)

The smallest vector distance wins.
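The idea above can be sketched with scikit-learn’s KNeighborsClassifier; the points and labels here are made up for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points with two labels (hypothetical data, for illustration only)
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)  # "learning by heart": it just stores the instances

# A new point gets the label of its 3 nearest stored neighbors
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # → [0 1]
```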

 

Model-Based Learning

The universe will have a model (an equation). The model then computes the results. It ignores that, close to the item, there are different items.

The equation separates the classes.

 

Common Challenges of ML

 

Insufficient quantity of data (not representative of the Universe)

Nonrepresentative Training data (or the data has different structure)

Poor-Quality Data: missing data (fill it in with means, medians, averages, …), outliers (maybe ignore them, or remove the 2.5% tails at the borders of the Gaussian). (Houses given for free?)
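The fill-and-trim ideas above can be sketched in pandas; the series here is hypothetical:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 100.0])  # hypothetical column with a gap and an outlier

# Fill the missing value with the median
s_filled = s.fillna(s.median())
print(s_filled.isnull().sum())  # → 0

# Drop values beyond the 2.5th/97.5th percentiles (the tails of the Gaussian)
lo, hi = s_filled.quantile([0.025, 0.975])
trimmed = s_filled[(s_filled >= lo) & (s_filled <= hi)]
```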

“Give me everything”

Irrelevant features (feature selection, feature extraction [merges])

Overfitting and Underfitting

XG Boost – Extreme Gradient Boost (balance all the time the efficiency of the model, avoiding under and overfitting)

Testing and Validation

Cross-validation – we can now cross-validate (do that 10 times to test different 20% portions).
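A minimal sketch of cross-validation with scikit-learn’s cross_val_score, using the iris dataset as a stand-in; with cv=5, each fold holds out a different 20% of the data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5: train on 80%, validate on the held-out 20%, five times
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```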

 

Assignment 1

Rename – Put your name

https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/

https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy

https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/

 

Linear regression

https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9

 

 

Class 25-09-2018

 

End to end project

Unbalanced classes: (business, computer science, statistics). K-Means

End-to-end project: Understand Logical path.

Machine learning – 90% of it is based on techniques from 20 years ago, such as prediction models. However, the tools and the dynamic nature of the environment are the key to modern ML.

Most Functions are ready to use and they are fast. We need to know what is happening behind the scenes.

Concept of pipelines: a task that runs other tasks several times in a given sequence (automation).
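A minimal sketch of the pipeline idea using scikit-learn’s Pipeline; the steps and data here are illustrative, not from the course:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A pipeline chains preprocessing steps so they run in the same order every time
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values
    ("scaler", StandardScaler()),                   # standardize features
])

X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0]])  # hypothetical data
X_prepared = num_pipeline.fit_transform(X)
print(X_prepared)  # no NaNs left, columns standardized
```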

As modelers, in most cases you will likely not be an expert in the industry you work in. You will probably look at different industries, because it is way more fun than specializing in a single one.

Non-technical questions to the so-called experts

 

  • Identify the questions. Extract as much as you can (why, what, how, when) to understand where you are. Buy the expert a beer and understand the thing.
  • What is the goal in terms of accuracy? It needs to make sense.
  • Teranet example.

 

Concept of the loss function

 

It will give different weights. Gains versus costs.

Accuracy and complexity (overfitting, underfitting)

Simplifying the model. The concept of Occam’s razor (Middle Ages): “The best decision: the simplest way is always the best.”

Example: the health industry. False positives and false negatives (classified as sick but not, or classified as not sick but actually sick).

False negatives need to be punished more if they are the more serious error.

 

Assumptions

 

When you start a model, you always must make assumptions.

Even if you are not considering all of them, list them so that you (and your partners) may consider them.

Document, and use a naming convention for files (it must be easy).

latitude 20640 non-null float64

housing_median_age 20640 non-null float64

total_rooms 20640 non-null float64

total_bedrooms 20433 non-null float64 <– unbalanced (this is heaven)

population 20640 non-null float64

households 20640 non-null float64

median_income 20640 non-null float64

median_house_value 20640 non-null float64

ocean_proximity 20640 non-null object

 

<1H OCEAN 9136

INLAND 6551

NEAR OCEAN 2658

NEAR BAY 2290

ISLAND 5 <- will unbalance the model
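The counts above can be reproduced with pandas value_counts; the series below is rebuilt from the listed counts:

```python
import pandas as pd

ocean = pd.Series(["<1H OCEAN"] * 9136 + ["INLAND"] * 6551 + ["NEAR OCEAN"] * 2658
                  + ["NEAR BAY"] * 2290 + ["ISLAND"] * 5)
counts = ocean.value_counts()
print(counts)

# A category with only 5 of 20,640 rows can unbalance a split or a model
print(counts / counts.sum())  # ISLAND is about 0.02% of the data
```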

 

Know the data. You will need to take a look at it, get acquainted with it, sleep with it.

Quantitative: OK; qualitative: discard at first.

Sampling – a whole course: 80 – 20, random, or other.

 

Data visualization

 

Apply storytelling concepts to your presentations.

You need to make your data a storyteller. Make it simple. Visualization must be simple.

You may lose interest and legitimacy if visualization is too complex or unclear.

You may show all the different steps (in a simple way, as a sequence of simple things).

 

Correlate variables

 

Build a correlation matrix among the variables and look at which ones are correlated.
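A sketch of the correlation step in pandas, with a hypothetical mini-frame (the column names echo the housing data; the values are invented):

```python
import pandas as pd

# Hypothetical numeric frame; corr() builds the full correlation matrix
df = pd.DataFrame({
    "median_income": [2.0, 3.5, 5.1, 7.4, 8.9],
    "median_house_value": [110, 180, 250, 330, 410],
    "households": [300, 280, 310, 290, 305],
})

corr = df.corr()
# Variables most correlated with the target, strongest first
print(corr["median_house_value"].sort_values(ascending=False))
```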

 

Creating ratios

 

Sometimes, if you divide values by one another, you may gain some insight (ratios).

In Europe, most of the financial models are built on financial ratios. This may well enlighten your model.
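A sketch of ratio features in pandas; the values are invented, and the ratio names follow the book’s housing example:

```python
import pandas as pd

df = pd.DataFrame({"total_rooms": [880, 7099], "households": [126, 1138],
                   "total_bedrooms": [129, 1106]})

# New ratio features often carry more signal than the raw counts
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
print(df[["rooms_per_household", "bedrooms_per_room"]])
```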

Labeled variables

 

Rename labels as dummy variables – OneHot encoder – create dummies (Ocean = 0, inland = 1)

If you have others, transpose them into another column, e.g. island = 0, continent = 1
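A minimal sketch of dummy encoding with pandas get_dummies (the labels are invented; scikit-learn’s OneHotEncoder does the same job inside a pipeline):

```python
import pandas as pd

s = pd.Series(["OCEAN", "INLAND", "ISLAND", "INLAND"])

# get_dummies creates one 0/1 column per label
dummies = pd.get_dummies(s)
print(dummies)
```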

Linear regression – if it is good enough (and performs as well as the others), use it

Can I remove variables without hurting the performance of the model?

Avoid obvious ratios to avoid collinearity. E.g. number of employees vs. revenue: usually they are directly related.

First model to test always

 

Linear regression, unless you are seasoned in the problem.

In the end, it is about helping someone to make an educated decision.

Excel example: Same scatter plots.

Simple tools sometimes make you faster. You could use them in an airport!

Questions about assignments

Useful links to know more about statistics in Python

https://docs.python.org/3/library/statistics.html

https://en.wikipedia.org/wiki/Norm_(mathematics)

https://stackoverflow.com/questions/25050311/extract-first-item-of-each-sublist-in-python

Other links I had contact with this week: Apache MXNet Gluon

https://github.com/Ishitori/DeepLearningWithMXNetGluon/blob/master/Toronto%20AI%20presentation.pptx

 

Class 02-10-2018

 

Assignment: the same transformation, but change the model (Lasso Regression).

The rule of thumb for the assignment is having an average price for the real estate in a region.

Use regression.

Lasso (Least Absolute Shrinkage and Selection Operator) and (Ridge) – Used when:

– We have a large number of variables producing overfitting (more than 10 variables)

– Large enough to produce computational challenge (millions or billions of features)

They are similar, but Ridge performs L2 Regularization whereas Lasso uses L1 regularization.

Advantages over Ridge:

– For the same values of alpha, the coefficients of Lasso are far smaller;

– For the same alpha, Lasso has a higher RSS (poorer fit);

– It produces a sparser space (more zeroes in the table);

– Higher RSS for higher alphas
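The sparsity claim can be checked on a small synthetic example where only the first feature matters (the data and the alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] * 3 + rng.randn(100) * 0.1  # only the first feature matters

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# L1 drives irrelevant coefficients exactly to zero; L2 only shrinks them
print(np.sum(lasso.coef_ == 0))  # many exact zeros
print(np.sum(ridge.coef_ == 0))  # typically none
```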

In KNN, your K should ideally be as small as possible, while still avoiding overfitting.

In other words, the trade-off of your accuracy is the subject of modeling.

Predict belonging to a group: Classifier

Predict a value: regressor

The professor has done it using Excel (60K lines – apart from 10K of test – 28 × 28 pixels (784 features), each cell having a value from 0 to 255)

Size and volume are OK. The curse of size might be a problem.

“By having a vector of 784 features, which label (number) will it have?”

Binary: FIVE and NOT-FIVE

Sometimes the data comes in some order (time, increasing value, etc.) even when no real pattern exists.

Randomizing the order (shuffling) is a way to make classifiers insensitive to that.

SGDClassifier – dates back to the 60s … Adaline, the first application of a neural network

Inputs need to be balanced anyway

cross_val_score – check / validate performance

Problems: you have 10% of fives. To get 90% accuracy, you only need to say “everyone is a non-five”.

Confusion matrix (true positives, false positives, true negatives, false negatives)

Precision – type I error: positives vs. false positives (check the PPT)

Recall – type II error: negatives vs. false negatives

Different weights: sick – not sick

Recall: the risk you take vs. accuracy. If recall is 0 in this case, your accuracy will be 90% and not good at all.

F1 is used to compare different models, whereas a modeling team will ask for recall and precision, which are more precise – slide 16
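These metrics can be computed with scikit-learn; the five/not-five labels below are invented so the numbers are easy to check by hand:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical "five (1) vs not-five (0)" labels and predictions
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(precision_score(y_true, y_pred))   # TP / (TP + FP) → 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) → 0.75
print(f1_score(y_true, y_pred))          # harmonic mean, for comparing models → 0.75
```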

MOST IMPORTANT QUESTION

 

To a customer, or even an internal client: where do you want to go in terms of acquiring new clients, % of loss? 5.75%?

ROC – Receiver Operating Characteristic curve

Slide 22: the square would be ideal – total recall, 100% precision: the area under the curve (AUC) will be bigger
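A minimal ROC/AUC sketch with scikit-learn (the labels and classifier scores are invented):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))  # → 0.75; 1.0 would be the ideal "square"
```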

Client (internal, in the company)

Customers (real external clients)

Best having: legal advice and legal support.

Also: documentation

  • Include: confusion matrix
  • ROC curve
  • Why you calculated that: cleaning the data process and everything else
  • It makes mistakes. What are the consequences of the mistakes? E.g. the financial market.
  • Hire an insurance for your service if you are independent or own a company
  • Demonstrate that your model is good.
  • Was the model done according to the highest standards in the industry?
  • If errors are within the expected rate, then it is OK. The standard of your industry.
  • It should be inside your documentation.

Results may have an emotional impact, and affect customers’ or clients’ decisions

Models are not perfect.

OvO and OvA classification.

SVM is better OvO; others are better OvA

Slide 32: where it is white is where we made mistakes (he divided the value of each cell by the sum of the column and blacked out the diagonal)

A data analyst is also a storyteller

 

Class 09-10-2018

 

Review lesson 1: training set and test set – make sure you are choosing the right set.

You should have a test set. Always.

If your input is in 4 dimensions, your output will be 4 dimensions!

Mistakes in code might happen. You could compute it manually on a piece of paper: take a small sample (5 numbers) and submit it to your model.

Independent variables = descriptors = attributes = features (they can be the same, but not always) = predictors.

Clustering and unsupervised learning

 

k-Means, DBSCAN and hierarchical. The first two are more similar: they are top-down. These three are covered here because they are completely different among all of those referred to in the book. One separates, another aggregates. Hierarchical is bottom-up.
Between them: k-Means computes distances; DBSCAN works on agglomerations and computes density.
Why do we use each one, and in which situation?

Why we use each one, in which situation.

Even if it is hard to estimate, the model should make business sense. It should make sense to users, even if it is already mathematically correct.

If you can show something by reducing dimensions, ideally down to 3D, in your storytelling, that would be great.

Most of the time it is applied to marketing (regroup users, consumers, individuals x behavior patterns) in order to tailor the offer. Very popular.

 

Compress data

Detect outliers – DBSCAN does it; k-means, never

“Example” is a bad word on the slide: it comes from “sample”. It should be OBSERVATION

Lawyers love this kind of approach, to separate similar documents (contents, subjects)

DBSCAN identifies outliers

K-means

You decide the number of clusters. We can make a guess using some calculations in advance

In finance, the customer might need only 3 clusters. Banks, credit score: approve, refuse, manual review. If your model gives 5 clusters, it is of no use.

Sometimes you can have 5, and they want 7.

The weakness of the model is also the power of it: you decide the number of clusters.

It creates seeds to place the centroids.

You never have outliers with K-means. Largely used, simple.

The value of K can be data-driven (slide 24)

Wikipedia article: Determining the number of clusters in a data set; the elbow method

Check the silhouette in the assignment

You will create groups that look alike.
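A sketch of k-Means together with the silhouette check mentioned above, on three synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(42)
# Three hypothetical, well-separated blobs
X = np.vstack([rng.randn(50, 2) + c for c in ([0, 0], [10, 0], [0, 10])])

# You choose K; inertia (elbow) and the silhouette help justify the choice
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))
```

On this data, the silhouette peaks at k = 3, matching the three blobs.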

DBScan

k-Means clusters are convex; DBSCAN’s are not. Slide 28

DBSCAN does better where there are shapes. It is focused on density.

Epsilon = 1.02 (the neighborhood radius)

Minpoints = 4 (minimum points inside the radius)

If you increase the epsilon a lot, the required density will be smaller. The points will be loosely grouped.

If you reduce it, you will have a lot of outliers.

In city projects it could be useful – housing density, for instance. Also the density of cells.
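A minimal DBSCAN sketch using the epsilon and minimum-points values from above (the two blobs and the lone far-away point are synthetic):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# Two dense hypothetical blobs plus one far-away point
X = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + [8, 8], [[50.0, 50.0]]])

db = DBSCAN(eps=1.02, min_samples=4).fit(X)  # eps = radius, min_samples = minimum points
print(set(db.labels_))  # label -1 marks outliers, something k-means never produces
```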

Hierarchical Clustering

Like jobs in an org chart: e.g. Nilton works under Velaqua.

Every single point a cluster where you merge two of them.

The number of clusters depends on where you will cut, at which level (2, 4, 6, …)

It is a bottom-up strategy.
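The bottom-up merging can be sketched with scikit-learn’s AgglomerativeClustering (the six points are invented; n_clusters picks the level where the tree is cut):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three hypothetical pairs of nearby points
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]])

# Bottom-up: every point starts as its own cluster and the closest pairs are merged;
# n_clusters chooses the level at which the tree is cut
agg = AgglomerativeClustering(n_clusters=3).fit(X)
print(agg.labels_)  # one label per pair
```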
