This post contains notes from the Machine Learning course taken at the University of Toronto in 2018 (still ongoing).
These notes may reference copyrighted material; that material will therefore not be attached.
The working book is Géron, A., Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 7th release, 2018-07-20.
Most of the examples mentioned come from the Jupyter notebooks that accompany the working book.
This article will be divided into 12 parts, one per class day. Each mostly corresponds to a chapter of the working book.
O'Reilly's multi-book subscription service is Safari Books Online, and the working book is available there.
It is worth mentioning that both the Montréal/Québec BAnQ and the Toronto Public Library offer access to Safari Books Online and http://Lynda.com . They also offer interesting language-training tools and online books. The biggest public libraries have this kind of agreement with publishers. Amazing resources for staying up to date!
Run once:
virtualenv -p "c:\Program Files\python36\python.exe" env
Run always:
.\env\Scripts\activate
pip3 install --upgrade -r .\handson-ml\requirements.txt
jupyter nbextension enable toc2/main
jupyter notebook
Math + statistics + computer science. SAS, AS/400 mainframe, COBOL and Fortran, R, SPSS (hard if you miss a comma).
Tools: ML, where it came from and where it is going (AI).
Internal and external clients should know the concepts.
Tackle the project (e.g., a bitcoin price opportunity on Kaggle).
Prizes of 45,000 dollars.
Beware of IT: 250 business indicators and fancy dashboards, but for what? You may be good with software, but what is the problem? Which educated decision should we support?
Credit reporting in Canada has been about modeling logistic regression for the last 20 years.
If the environment changes little over time, you may keep a model for quite a long time.
However, from the next second on, your model starts to become deprecated.
ML holds the ability to adapt dynamically to the changes.
“DECISION MAKING IS THE HARDEST THING FOR A HUMAN TO DO.”
We do not want to be wrong, but there is uncertainty. We live in a 3D universe and have no idea about the future.
"Move on!" Assessing a good decision: were the outcomes good?
The best decisions are made with almost total control of the data, and they will still have some chance of failing.
In Canada, a company has a 2% likelihood of bankruptcy; some sectors reach 4%.
If you start a business nowadays, you MUST know the data, the missing data, and the gaps in order to make an EDUCATED DECISION.
"Algorithms do not think. A programmer came up with them. Therefore, there is room for mistakes."
Models are not only mathematical ones. Engineering: the scale model of an airplane.
Rules-based models: experts have experience, and wisdom is the integration of experience.
E.g., crossing the street: an extremely dangerous decision; some people die doing it. Algorithm: detect the presence of cars. If there are none, OK. Are there cars? Yes. Are they moving? Yes. Are they starting or accelerating? You keep re-estimating whether something changed, all the time.
We are pattern-recognition animals. (Daniel Kahneman, Nobel Prize 2002)
System 1: survival of the species; it picks up unknown patterns and prepares your body to fight for life.
System 2: works the patterns.
It will appear as a sequence of if-then rules.
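Purely as an illustration of such a sequence of if-then rules, here is a minimal Python sketch of the street-crossing decision above; all names and conditions are my own, not from the course.

```python
# Rule-based decision sketch: re-evaluated continuously as conditions change.
def safe_to_cross(cars_present, cars_moving, cars_accelerating):
    if not cars_present:
        return True                       # no cars: OK to cross
    if cars_moving or cars_accelerating:  # something is changing: wait
        return False
    return True                           # cars present but static

print(safe_to_cross(cars_present=True, cars_moving=False, cars_accelerating=False))
```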
The difference between Artificial Intelligence and Machine Learning (from static modeling to dynamic modeling).
However, when, based on the predicted information, the system itself makes the decision, you have artificial intelligence.
That means we need to model the decision (a succession of rules, possibly odd ones).
OK, so when is it AI or not?
"Is everybody doing AI?" Not really.
Acquisition -> Extraction -> Processing -> Decision.
OCR (reading): ML or AI? It is AI, because it makes a decision about whether a character is an A, a B, an O, or a 0.
Our course is about providing information to humans so that they may make a decision.
Supervised, unsupervised, semi-supervised, and reinforcement learning.
Learning on the fly (online) or by batch.
Instance-based or model-based.
Columns (attributes): variables; labeled items are used to identify unidentified items.
Features: the term used more with tree-based methods (random forests, etc.).
A feature is an attribute plus its value.
In statistics: independent variables.
Example:
Universe: companies in Canada.
The probability that a company goes bankrupt: use the characteristics of companies that went bankrupt to predict which current companies will go bankrupt.
This model will work for some time and then become deprecated.
Supervised: you teach the system. Easier and more precise; very targeted, very specific. It may have pitfalls in the way things were defined or classified: a line in the sand. But you need to draw it.
You use labels.
Unsupervised: the data is unlabeled. It separates into clusters, dividing your universe into groups by comparing all the attributes, so that the groups are as different from each other as possible and as alike as possible inside each group.
E.g., marketers and consumer analysts use this.
Your postal code can group you by age, interests, etc. They have so much information about you that your postal code may suffice.
Another application is anomaly detection (cancer cells?).
Semi-supervised: a mix of the two above.
On one side of the spectrum we have supervised learning; on the other, unsupervised. We can classify who the people are in different photos by knowing some labeled ones.
Reinforcement learning: reward the system for finding the best process. Water? Fire? Fire burns (-50 points). I need water (+50 points).
Accumulate as many points as possible (walking robots).
The reward means you need to know well what problem you need to solve.
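A toy sketch of the reward idea, using the -50/+50 values from the notes; the epsilon-greedy strategy and every name here are illustrative assumptions, not the course's algorithm.

```python
import random

rewards = {"water": 50, "fire": -50}          # the environment's payoffs
estimates = {"water": 0.0, "fire": 0.0}       # agent's running estimates
counts = {"water": 0, "fire": 0}
total = 0

for step in range(100):
    # explore 10% of the time, otherwise pick the best-known action
    if random.random() < 0.1:
        action = random.choice(list(rewards))
    else:
        action = max(estimates, key=estimates.get)
    r = rewards[action]
    total += r
    counts[action] += 1
    estimates[action] += (r - estimates[action]) / counts[action]  # running mean

print(total, estimates)  # the agent learns to prefer "water"
```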
There is a reference about rewards in an old article.
What do you have to lose, and what do you want to keep?
Like a chess game, or a life.
WWII: after the war, you will be back. Win it!
Vietnam War: after one year, you will go back home. Do nothing and go back home (survive).
The goal of all companies is MAKING MONEY.
If the parameters do not help with that, something is wrong. They need to point in the right direction. Sometimes companies do not have strategies.
Batch learning takes time because it processes all available data at once.
Efficiency and accuracy: apply the prices of the current month and compare them with the model results.
The trading market needs to react live, retraining on the data feed.
It has a lot of physical constraints: speed, volume, processing.
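A minimal sketch of that kind of on-the-fly (online) learning, assuming scikit-learn's SGDRegressor and partial_fit; the synthetic "price feed" and its three indicators are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=42)

for day in range(30):                       # one mini-batch per day
    X_batch = rng.rand(100, 3)              # three made-up indicators
    y_batch = X_batch @ np.array([2.0, -1.0, 0.5]) + rng.randn(100) * 0.1
    model.partial_fit(X_batch, y_batch)     # incremental update, no full retrain

print(model.coef_)  # should approach [2.0, -1.0, 0.5]
```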
AWS: Ontario and BC were OK, but for all of Canada … a server was needed. What will you win, what will you lose?
The curse of dimensionality: doing it weekly was not giving much more value but would cost 10 times more. Monthly is OK.
A market crisis is out of scope; it should be another model / project.
If your opinion is too strong and you say something, it may happen: Moody's has been accused of triggering such events, as identified with the Asian Tigers in 1997/98. Macroeconomic models have this kind of consequence.
If a company like Teranet says something like this, it may happen again.
Independent variables should be independent. An example of a macro variable with low influence is the weather.
Fast (high-frequency) trading once almost broke the market because of the lack of bias in their systems.
Instance-based: learning by heart (k-Nearest Neighbors).
The smallest vector distance wins.
Model-based: the universe gets a model (an equation), and the model then computes the results. It ignores the fact that, close to a given item, there may be different items.
The equation separates the classes.
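A from-scratch sketch of the instance-based idea (k-nearest neighbors with Euclidean distances), in the spirit of the links further below; the toy 2-D points and labels are made up.

```python
import numpy as np

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
    nearest = y_train[np.argsort(distances)[:k]]     # labels of the k closest
    return np.bincount(nearest).argmax()             # majority vote

print(knn_predict(np.array([1.2, 1.5])))  # -> 0 (closest to the first group)
```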
Insufficient quantity of data (not representative of the universe).
Nonrepresentative training data (or data with a different structure).
Poor-quality data: missing data (fill it in with means or medians), outliers (maybe ignore them, or remove the 2.5% tails at the borders of the Gaussian); houses given away for free? See the sketch after this list.
"Give me everything."
Irrelevant features (feature selection; feature extraction, i.e., merging features).
Overfitting and underfitting.
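A minimal pandas sketch of the poor-quality-data fixes mentioned in the list (median imputation and clipping the 2.5% tails); the column and its values are made up.

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 120.0, None, 95.0, 5000.0, 110.0]})

df["price"] = df["price"].fillna(df["price"].median())   # impute the gaps
low, high = df["price"].quantile([0.025, 0.975])         # tail limits
df["price"] = df["price"].clip(lower=low, upper=high)    # trim the outliers

print(df)
```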
XGBoost (Extreme Gradient Boosting): balances the efficiency of the model all the time, avoiding under- and overfitting.
Cross-validation: we can now cross-validate (e.g., do it 10 times, each time testing on a different 20% of the data).
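A minimal sketch of that scheme, assuming scikit-learn's ShuffleSplit (10 splits, 20% held out each time); the iris dataset and the model are just placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)  # 10 x 20%
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())
```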
Rename the notebook: put your name on it.
https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/
https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy
https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
Linear regression
https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9
Unbalanced classes (business, computer science, statistics). K-Means.
End-to-end project: understand the logical path.
Machine learning: 90% of it is based on techniques from 20 years ago, such as prediction models. However, the tools and the dynamic nature of the environment are the key to modern ML.
Most functions are ready to use, and they are fast. We still need to know what is happening behind the scenes.
The concept of pipelines: a way to run a set of tasks several times in a given sequence (automation).
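A minimal sketch of such a pipeline in scikit-learn: the same fixed sequence (scale, then fit) runs automatically each time it is called; the data is made up.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(50, 2) * 1000          # unscaled toy features
y = X[:, 0] * 0.3 + X[:, 1] * 0.7

pipe = Pipeline([
    ("scaler", StandardScaler()),   # step 1: standardize
    ("model", LinearRegression()),  # step 2: fit the regression
])
pipe.fit(X, y)
print(pipe.predict(X[:3]))
```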
As modelers, in most cases you will likely not be an expert in the industry you work in. You will probably look at different industries, because that is far more fun than specializing in a single one.
It will assign different weights: gains versus costs.
Accuracy and complexity (overfitting, underfitting).
Simplifying the model. The concept of Occam's razor (Middle Ages): "the simplest solution is always the best."
Example: the health industry. False positives and false negatives (diagnosed sick but not, or not diagnosed but actually sick).
False negatives need to be penalized more if they are the more serious error.
When you start a model, you must always make assumptions.
State them even if you are not considering all of them, so that you (and your partners) may consider them.
Use documentation and a naming convention for files (it must be easy).
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64 <- has missing values (this is heaven)
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5 <- will unbalance the model
Know the data. You will need to take a look at it, get acquainted with it, sleep with it.
Quantitative: OK; qualitative: discard at first.
Sampling (a whole course in itself): 80-20, random, or other schemes.
Apply storytelling concepts to your presentations.
You need to make your data a storyteller. Make it simple. Visualization must be simple.
You may lose interest and legitimacy if the visualization is too complex or unclear.
You may show all the different steps (in a simple way, as a sequence of simple things).
Build a correlation matrix among the variables.
Sometimes dividing values by one another gives you some insight (ratios).
In Europe, most of the financial models use financial ratios. This may well enlighten your model.
Recode labels as dummy variables: a one-hot encoder creates dummies (e.g., ocean = 0, inland = 1).
If you have other categories, transpose them into additional columns (e.g., island = 0, continent = 1).
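A minimal sketch of creating dummies, here with pandas' get_dummies rather than scikit-learn's OneHotEncoder; the categories are invented, echoing ocean_proximity.

```python
import pandas as pd

df = pd.DataFrame({"proximity": ["OCEAN", "INLAND", "ISLAND", "INLAND"]})
dummies = pd.get_dummies(df["proximity"])  # one 0/1 column per category
print(dummies)
```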
Linear regression: if it is good enough (and performs as well as the others), use it.
Can I remove variables without hurting the performance of the model?
Avoid obvious ratios, to avoid collinearity. E.g., number of employees vs. revenue: usually they are directly related.
Start with linear regression, unless you are seasoned in the problem.
In the end, it is about helping someone make an educated decision.
Excel example: the same scatter plots.
Simple tools sometimes make you faster. You could use them in an airport!
Questions about assignments
Useful links to know more about statistics in Python
https://docs.python.org/3/library/statistics.html
https://en.wikipedia.org/wiki/Norm_(mathematics)
https://stackoverflow.com/questions/25050311/extract-first-item-of-each-sublist-in-python
Other links I came across this week: Apache Gluon
https://github.com/Ishitori/DeepLearningWithMXNetGluon/blob/master/Toronto%20AI%20presentation.pptx
Assignment: the same transformation, but change the model (Lasso regression).
The rule of thumb for the assignment is to have an average price for real estate in a region.
Use regression.
Lasso (Least Absolute Shrinkage and Selection Operator) and Ridge are used when:
- We have a large number of variables producing overfitting (more than 10 variables);
- The data is large enough to produce a computational challenge (millions or billions of features).
They are similar, but Ridge performs L2 regularization whereas Lasso uses L1 regularization.
Compared with Ridge, Lasso:
- For the same values of alpha, has far smaller coefficients;
- For the same alpha, has a higher RSS (a poorer fit);
- Produces a sparser space (more zeros in the table);
- Has a higher RSS for higher alphas.
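A minimal sketch contrasting the two penalties for the same alpha; the data is synthetic, with only two truly useful features, so Lasso's tendency to zero out coefficients shows clearly.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = X[:, 0] * 3.0 + X[:, 1] * 0.5 + rng.randn(100) * 0.1  # only 2 useful features

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: sparse solution
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: small but nonzero everywhere
print("Lasso:", np.round(lasso.coef_, 2))  # mostly exact zeros
print("Ridge:", np.round(ridge.coef_, 2))
```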
In KNN, your K should ideally be as small as possible, while still avoiding overfitting.
In other words, the trade-off around your accuracy is the very subject of modeling.
Predict belonging to a group: Classifier
Predict a value: regressor
The professor has done it using Excel (60K rows, apart from 10K of test; 28 x 28 = 784 features, with pixel values in each cell from 0 to 255).
Size and volume are OK. The curse of size might be a problem.
“By having a vector of 784 features, which label (number) will it have?”
Binary: FIVE and NOT-FIVE
Sometimes the data has some order (time, increasing value, etc.) even when a real pattern does not exist.
Randomizing the order (shuffling) is a way to make classifiers insensitive to that.
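A minimal sketch of that shuffle; the toy arrays stand in for the 60,000-row training set.

```python
import numpy as np

X_train = np.arange(10).reshape(5, 2)   # toy features
y_train = np.array([0, 0, 1, 1, 1])     # toy labels, suspiciously ordered
shuffle_index = np.random.permutation(len(X_train))
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
print(y_train)                           # order is now randomized
```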
SGDClassifier: the idea goes back to the 1960s … Adaline, one of the first applications of a neural network.
Inputs need to be balanced anyway.
cross_val_score: check / validate performance.
Problem: if you have 10% fives, you can reach 90% accuracy just by saying "everyone is a non-five".
Confusion matrix (true positives, false positives, true negatives, false negatives).
Precision, tied to type I errors (false positives): TP / (TP + FP) (check the PPT).
Recall, tied to type II errors (false negatives): TP / (TP + FN).
Different weights: sick vs. not sick.
Recall: the risk you take vs. accuracy. If recall is 0 in this case, your accuracy will be 90% and the model not good at all.
F1 (the harmonic mean of precision and recall) is used to compare different models, whereas a modeling team will ask for recall and precision, which are more precise (slide 16).
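A minimal sketch of these metrics with scikit-learn, on made-up five / not-five labels (1 = five).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows: actual, cols: predicted
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))          # harmonic mean of the two
```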
To a customer, or even an internal client: where do you want to go in terms of acquiring new clients, and at what % of loss? 5.75%?
ROC: Receiver Operating Characteristic curve.
Slide 22: the square would be ideal (total recall, 100% precision); the area under the curve (AUC) would be as big as possible.
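A minimal sketch of the curve and its area, assuming made-up decision scores; the closer the curve hugs the ideal square, the larger the AUC.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))  # 1.0 would be the ideal square
```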
Client (internal, in the company)
Customers (real external clients)
Best to have legal advice and legal support.
Also: documentation.
Results may have an emotional impact and affect customers' or clients' decisions.
Models are not perfect.
OvO and OvA classification.
SVM does better with OvO; the others do better with OvA.
Slide 32: the white cells are where we made mistakes (he divided the value of each cell by the sum of its column and blacked out the diagonal).
A data analyst is also a storyteller.
Review lesson 1 on training sets and test sets: make sure you are choosing the right set.
You should have a test set. Always.
If your input is in 4 dimensions, your output will be in 4 dimensions!
Mistakes in code might happen. You can compute a small sample (5 numbers) manually on a piece of paper and submit it to your model.
Independent variables = descriptors = attributes = features (they can be the same, but not always) = predictors.
k-Means, DBSCAN, and hierarchical clustering. The first two are more similar: they are top-down. These three are covered here because they are the most different among all of those referred to in the book. One separates, another aggregates; hierarchical is bottom-up.
Between the first two: k-Means computes distances, while DBSCAN works on agglomerations and computes density.
Why we use each one, and in which situation.
Even if it is hard to estimate, the model should make sense businesswise. It should make sense to users even when it is mathematically correct.
If you can show something by reducing dimensionality, ideally down to 3D, in your storytelling, that is great.
Most of the time this is applied to marketing (regrouping users, consumers, and individuals by behavior patterns) in order to tailor the offer. Very popular.
Compressing data.
Detecting outliers: DBSCAN does it; k-Means never does.
"Example" is a poor word in the slide. Since it comes from a sample, it should be OBSERVATION.
Lawyers love this kind of approach, to separate similar documents (contents, subjects).
DBSCAN identifies outliers.
With k-Means, you decide the number of clusters. We can guess it using some calculations in advance.
In finance, the customer might need only 3. Banks, credit score: approve, reject, manual review. If your model gives 5 clusters, it is of no use.
Sometimes you can have 5, and they want 7.
The weakness of the model is also its power: you decide the number of clusters.
It creates seeds to place the centroids.
You never have outliers with k-Means. Largely used, simple.
The value of K can be data-driven (slide 24).
Wikipedia article: "Determining the number of clusters in a data set"; the elbow method.
Check the silhouette in the assignment.
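A minimal sketch of choosing K in a data-driven way with the silhouette score (higher is better); the three-blob data is made up, so a peak at K = 3 is expected.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))  # expect the peak at k=3
```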
You will create groups that look alike.
k-Means works with convex clusters; DBSCAN does not require convexity (slide 28).
DBSCAN does better where there are shapes. It is focused on density.
Epsilon = 1.02 (the radius of the neighborhood).
MinPoints = 4 (the minimum number of points inside it).
If you increase epsilon a lot, the required density becomes smaller and the points become loose.
If you reduce it, you will have a lot of outliers.
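A minimal sketch with the epsilon and MinPoints values from the notes; points far from any dense region come out labeled -1, i.e., outliers. The data is made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
dense = rng.randn(30, 2) * 0.3               # one tight cluster
stray = np.array([[5.0, 5.0], [-4.0, 6.0]])  # two isolated points
X = np.vstack([dense, stray])

db = DBSCAN(eps=1.02, min_samples=4).fit(X)
print(db.labels_)  # cluster ids; -1 marks the outliers
```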
In city projects, it could be useful: housing density, for instance; also the density of cells.
As Nilton jobs under Velaqua.
Every single point starts as a cluster, and you merge two of them at a time.
The number of clusters depends on where you cut, at which level (2, 4, 6, …).
It is a bottom-up strategy.
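A minimal sketch of that bottom-up merging with SciPy; where you cut the tree (t = 2 vs. t = 3) decides the number of clusters. The 1-D points are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [1.2], [5.0], [5.1], [9.0]])
Z = linkage(X, method="ward")                  # merge history, bottom-up
print(fcluster(Z, t=2, criterion="maxclust"))  # cut into 2 clusters
print(fcluster(Z, t=3, criterion="maxclust"))  # cut lower: 3 clusters
```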