Machine Learning Notes

Dec 23, 2020

Visit my personal website adamnovotny.com to view the most recently updated version of ML and data resources I’ve found helpful.

AdaBoost: Fits a sequence of weak learners on repeatedly modified data. The modifications are based on errors made by previous learners.
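
To make the "repeatedly modified data" idea concrete, here is a minimal sketch of one reweighting round of discrete AdaBoost. It assumes labels in {-1, +1} and a weak learner whose weighted error is strictly between 0 and 1; the function name `adaboost_reweight` and the toy data are illustrative, not taken from any library.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of discrete AdaBoost reweighting; labels are -1 or +1."""
    # Weighted error rate of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    # Vote of this learner in the final ensemble (assumes 0 < err < 1).
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake, so
    # mistakes are scaled up and correct predictions are scaled down.
    scaled = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # the weak learner misses the last example
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the misclassified example carries half of the total weight, so the next weak learner is pushed toward getting it right.
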
Analysis of variance ANOVA
Anomaly detection: future anomalies may look nothing like past examples. This is where it differs from supervised learning, which assumes that future examples fall within the range of the training data. Gaussian Mixtures are one option for anomaly detection.
Bayesian modelling
Beta Distribution: a probability distribution over probabilities, bounded on [0, 1]
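
A short sketch of what "a distribution on probabilities" means in practice. The density uses the standard Beta formula. The coin-flip style update (prior Beta(2, 2), then 7 successes and 3 failures) is an illustrative assumption of mine, not something from the notes.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at p; zero outside the open interval (0, 1)."""
    if not 0.0 < p < 1.0:
        return 0.0
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

# Hypothetical example: start from a Beta(2, 2) belief about a success rate,
# then observe 7 successes and 3 failures (conjugate update for binomial data).
a_post, b_post = 2 + 7, 2 + 3
posterior_mean = beta_mean(a_post, b_post)
print(posterior_mean, beta_pdf(0.5, a_post, b_post))
```

The posterior is again a Beta distribution, which is why it is a convenient way to express uncertainty about a rate.
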
Classification algorithms comparison

Expectation-maximization (EM): starts from randomly initialized components and computes, for each point, the probability of it being generated by each component of the model (E-step). It then tweaks the parameters to maximize the likelihood of the data given those assignments (M-step), and repeats until convergence. Example: Gaussian Mixture
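
The two alternating steps can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration under my own assumptions: fixed initial guesses instead of random ones (for reproducibility), a fixed number of iterations, and a small variance floor. The names `em_gmm_1d` and `normal_pdf` are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1D Gaussian mixture with EM; the arguments are initial guesses."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: probability that each component generated each point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
            weights[j] = nj / len(data)
    return means, variances, weights

data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.1, 5.2, 4.8]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, variances, weights)
```

On this toy data the two component means settle near the centers of the two visible groups.
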
Gradient Boosting: builds an additive model of weak learners stage by stage, which allows optimization of arbitrary differentiable loss functions. - Risk of overfitting
Hypothesis tests

KNN: + Simple, flexible, naturally handles multiple classes. - Slow at scale, sensitive to feature scaling and irrelevant features
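
A minimal brute-force sketch of KNN classification, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are illustrative.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Brute force: compute the distance to every training point, then sort.
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.0, 4.8)))
```

Every prediction scans the whole training set, which is the "slow at scale" cost. The vote depends on raw distances, so a feature measured on a larger scale dominates unless features are rescaled first.
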
K-means: aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Use the “elbow” method to identify a suitable number of clusters (k)
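
A rough sketch of inertia and the elbow idea. It assumes a naive initialization (the first k points) and a fixed number of Lloyd iterations, which is a simplification of what libraries do; `kmeans` and `inertia` here are hypothetical helper names.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two points."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm with a naive init: the first k points as centroids."""
    centroids = [points[i] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def inertia(points, centroids):
    """Within-cluster sum of squares: squared distance to the nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.9, 1.1), (5.2, 4.9), (4.8, 5.1)]
inertia_by_k = {k: inertia(points, kmeans(points, k)) for k in (1, 2, 3)}
print(inertia_by_k)
```

On this toy data inertia drops sharply from k=1 to k=2 and only slightly from k=2 to k=3. That bend is the "elbow" suggesting two clusters.
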
Lasso: linear model regularization technique with tendency to prefer solutions with fewer non-zero coefficients
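
Why Lasso produces exact zeros can be seen in the one-feature case, where the solution is a soft-thresholding of the least-squares numerator. This sketch assumes a single feature, no intercept, and the objective 0.5 * sum((y - w*x)^2) + alpha * |w|. The scaling of alpha differs between libraries, so treat the numbers as illustrative.

```python
def soft_threshold(value, alpha):
    """Shrink `value` toward zero by `alpha`, returning exactly 0 inside [-alpha, alpha]."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0

def lasso_coef_1d(x, y, alpha):
    """Minimizer of 0.5 * sum((y - w*x)^2) + alpha * |w| for one feature, no intercept."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    return soft_threshold(xy, alpha) / xx

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
for alpha in (0.0, 14.0, 30.0):
    print(alpha, lasso_coef_1d(x, y, alpha))
```

With a large enough penalty the coefficient becomes exactly zero rather than merely small, which is the feature-selection behavior described above.
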

Linear Discriminant Analysis (LDA): A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix.
Linear regression assumptions (LINE): 1) Linearity, 2) Independence of errors, 3) Normality of errors, 4) Equal variance of errors. Tests of assumptions: i) plot each feature on the x-axis vs y_error (the residuals), ii) plot y_predicted on the x-axis vs y_error, iii) histogram of errors
Overfitting, bias-variance and learning curves. Overfitting (high variance) options: more data, increase regularization, or decrease model complexity
Overspecified model: can be used for prediction of the label, but should not be used to ascribe the effect of a feature on the label
PCA: projects data onto k orthogonal vectors chosen to minimize the perpendicular distance from the points to their projections. PCA can also be thought of as an eigenvalue/eigenvector decomposition of the covariance matrix
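
The eigenvalue view can be sketched for two-dimensional data, where the covariance matrix is 2x2 and its eigenvalues have a closed form. This is a hand-rolled illustration (the function name `pca_2d` is hypothetical); real workloads would use a linear algebra library.

```python
import math

def pca_2d(points):
    """Return the first principal direction and both eigenvalues for 2D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the symmetric sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + spread, mid - spread
    # Eigenvector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam1, lam2

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
direction, lam1, lam2 = pca_2d(points)
explained_ratio = lam1 / (lam1 + lam2)
print(direction, explained_ratio)
```

The points lie close to the line y = x, so the first component points roughly along that diagonal and carries almost all of the variance.
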
Receiver operating characteristic (ROC): relates true positive rate (y-axis) and false positive rate (x-axis). A confusion matrix defines TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
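
The two rates can be computed directly from the confusion-matrix counts at a given score threshold; sweeping the threshold traces out the ROC curve. The labels, scores, and the helper name `tpr_fpr` below are made up for illustration.

```python
def tpr_fpr(y_true, scores, threshold):
    """True and false positive rates when predicting 1 for scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
# One (FPR, TPR) point per threshold, from strict to lenient.
roc_points = []
for threshold in (1.0, 0.65, 0.5, 0.0):
    tpr, fpr = tpr_fpr(y_true, scores, threshold)
    roc_points.append((fpr, tpr))
print(roc_points)
```

A threshold above every score gives the point (0, 0), and a threshold of 0 gives (1, 1); the thresholds in between fill in the curve.
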
Preprocessing: duplicates -> outliers -> missing values -> feature correlation -> feature distribution/skew
Naive Bayes
Normal Equation

Random Forests: each tree is built using a sample of rows drawn with replacement (a bootstrap sample) from the training set. + Less prone to overfitting than a single decision tree
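
A small sketch of the row-sampling step only (no trees are built here). It draws row indices with replacement using the standard library; the helper name `bootstrap_rows` and the seed are arbitrary choices.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Draw n_rows row indices with replacement, as done for each tree."""
    return rng.choices(range(n_rows), k=n_rows)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 1000
sample = bootstrap_rows(n_rows, rng)
unique_fraction = len(set(sample)) / n_rows
# Rows never drawn for this tree are its "out-of-bag" rows.
out_of_bag = set(range(n_rows)) - set(sample)
print(unique_fraction, len(out_of_bag))
```

Each bootstrap sample typically contains roughly 63% of the distinct rows, so every tree sees a somewhat different dataset. That decorrelation between trees is part of why averaging them reduces overfitting.
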
Reinforcement Learning

Ridge Regression regularization: imposes a penalty on the size of the coefficients
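
The effect of the penalty is easiest to see with one feature and no intercept, where the ridge solution has a closed form. This sketch assumes the objective sum((y - w*x)^2) + alpha * w^2; the function name `ridge_coef_1d` is illustrative.

```python
def ridge_coef_1d(x, y, alpha):
    """Minimizer of sum((y - w*x)^2) + alpha * w^2 for one feature, no intercept."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    # The penalty adds alpha to the denominator, shrinking w toward zero.
    return xy / (xx + alpha)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
for alpha in (0.0, 14.0, 1000.0):
    print(alpha, ridge_coef_1d(x, y, alpha))
```

The coefficient shrinks smoothly as alpha grows but does not reach exactly zero for any finite alpha.
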
R2: strength of a linear relationship. Can be 0 for nonlinear relationships, even strong ones. Never decreases on the training data when more features are added
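
A quick check of the nonlinear case, using a hand-written one-feature least-squares fit. The data (y = x^2 on a symmetric range) is chosen by me to make the point; `fit_line` and `r_squared` are hypothetical helper names.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    return slope, mean_y - slope * mean_x

def r_squared(x, y):
    """R2 of the best-fit line: 1 - SS_residual / SS_total."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_quadratic = [xi ** 2 for xi in x]   # perfectly determined by x, but not linearly
y_linear = [2 * xi + 1 for xi in x]
r2_quadratic = r_squared(x, y_quadratic)
r2_linear = r_squared(x, y_linear)
print(r2_quadratic, r2_linear)
```

Here y is fully determined by x in the quadratic case, yet the best-fit line is flat and explains none of the variance.
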
Sample variance: divide by n-1 rather than n to get an unbiased estimator, because 1 degree of freedom is used to estimate the mean (b0)
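
The difference between the two divisors, shown with a small made-up dataset and cross-checked against the standard library's `statistics` module. The helper `variance_with_divisor` is an illustrative name.

```python
import statistics

def variance_with_divisor(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
population_var = variance_with_divisor(data, n)      # divides by n
sample_var = variance_with_divisor(data, n - 1)      # divides by n - 1
print(population_var, statistics.pvariance(data))
print(sample_var, statistics.variance(data))
```

Dividing by n-1 always gives the larger value, compensating for the fact that deviations are measured from the sample mean rather than the true mean.
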
Sigmoid
SMOTE: the algorithm is parameterized with k_neighbors. It generates a new point on the vector between a minority class point and one of its k nearest minority class neighbors, placed a random fraction in [0, 1] of the way from the original point
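
The interpolation step can be sketched in a few lines. This is a simplified illustration of the idea and not the imbalanced-learn implementation: it uses brute-force neighbor search, and the function name `smote_point` and toy points are hypothetical.

```python
import math
import random

def smote_point(minority, k_neighbors, rng):
    """Create one synthetic point between a minority point and a near neighbor."""
    base = rng.choice(minority)
    # Brute-force search for the k nearest other minority points.
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(p, base))[:k_neighbors]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random fraction in [0, 1) of the way toward the neighbor
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

rng = random.Random(42)  # fixed seed so the sketch is repeatable
minority = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (2.0, 2.0)]
synthetic = [smote_point(minority, k_neighbors=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Every synthetic point lies on a segment between two existing minority points, so new samples stay inside the region the minority class already occupies.
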
SVM: Effective in high-dimensional spaces, including when the number of dimensions exceeds the number of examples. SVMs do not directly provide probability estimates
Stochastic gradient descent cost function

Written by Adam Novotny