# Machine Learning Notes

This alphabetically sorted collection of AI, ML, and data resources was last updated on 3/26/2021.

- AdaBoost: Fits a sequence of weak learners on repeatedly modified data. Each modification re-weights the training examples based on the errors made by previous learners.
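
The re-weighting step can be sketched in a few lines. This is a minimal illustration of one boosting round, not a full AdaBoost implementation; the function name and toy inputs are my own:

```python
import math

def adaboost_weight_update(weights, misclassified, error):
    """One boosting round: up-weight the points the weak learner got wrong.

    weights: current sample weights; misclassified: bool per sample;
    error: weighted error rate of this weak learner (0 < error < 0.5).
    """
    alpha = 0.5 * math.log((1 - error) / error)  # the learner's vote weight
    updated = [w * math.exp(alpha if miss else -alpha)
               for w, miss in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # renormalize to sum to 1

alpha, new_weights = adaboost_weight_update(
    [0.25] * 4, [True, False, False, False], error=0.25)
# the single misclassified point now carries half of the total weight
```

The next weak learner is then fit against these new weights, so it focuses on the hard examples.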
- Analysis of variance (ANOVA)
- Bayesian modelling
- Beta Distribution: a probability distribution over probabilities, bounded on [0, 1]
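
A quick sketch of the Beta density using only the standard library (the `beta_pdf` helper is my own, not from any package):

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in [0, 1], via the Gamma function."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * x ** (a - 1) * (1 - x) ** (b - 1)

# Beta(2, 2) is symmetric around its mean a / (a + b) = 0.5,
# making it a reasonable prior for a coin believed to be roughly fair
print(beta_pdf(0.5, 2, 2), beta_pdf(0.9, 2, 2))
```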
- Classification algorithms comparison

- Confidence interval: linear regression coefficient

- Expectation-maximization (EM): assumes the data come from a mixture of random components. The E-step computes, for each point, the probability that it was generated by each component of the model; the M-step then tweaks the parameters to maximize the likelihood of the data given those assignments. Example: Gaussian Mixture Models
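
The E/M alternation can be shown with a tiny 1-D, two-component Gaussian mixture. This is a deliberately simplified sketch (toy data, crude initialization, no convergence check), not a production EM:

```python
import math

def em_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns the fitted means."""
    mu = [min(data), max(data)]  # crude initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in (0, 1)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate parameters from the soft assignments
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
            pi[k] = nk / len(data)
    return mu

data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
mu = em_1d(data)  # the means migrate toward the two clusters
```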
- F-statistic: determines whether to reject a reduced model (R) in favor of a full model (F). Reject the reduced model if F is large, or equivalently if its associated p-value is small

- Gradient Boosting: optimization of arbitrary differentiable loss functions. — Risk of overfitting
- KNN: + Simple, flexible, naturally handles multiple classes. — Slow at scale, sensitive to feature scaling and irrelevant features
- K-means: aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion. Use the “elbow” method to identify the right number of means
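A minimal 1-D sketch of Lloyd's algorithm with the inertia computed at the end; running it for increasing k and plotting inertia against k is what produces the "elbow". Function name and toy data are my own:

```python
def kmeans_1d(data, k, iters=20):
    """Lloyd's algorithm in one dimension; returns centroids and inertia."""
    centroids = sorted(data)[:: max(1, len(data) // k)][:k]  # spread-out init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:  # assignment step: nearest centroid
            nearest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[nearest].append(x)
        # update step: move each centroid to its cluster mean
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    inertia = sum(min((x - c) ** 2 for c in centroids) for x in data)
    return centroids, inertia

data = [0.0, 0.2, 0.1, 10.0, 10.2, 9.9]
centroids, inertia = kmeans_1d(data, k=2)
```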
- Lasso: linear model regularization technique with tendency to prefer solutions with fewer non-zero coefficients
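
The sparsity tendency comes from the soft-thresholding operator that coordinate-descent lasso solvers apply to each coefficient; a small standalone sketch (the helper name is my own):

```python
def soft_threshold(z, alpha):
    """Proximal operator of the L1 penalty used inside lasso solvers:
    shrinks z toward zero and returns exactly 0 when |z| <= alpha."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

coefs = [soft_threshold(z, alpha=0.5) for z in [2.0, 0.3, -0.1, -1.2]]
# small coefficients are zeroed out entirely; large ones are shrunk
```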

- Linear Discriminant Analysis (LDA): A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix.
- Linear regression assumptions (LINE): 1) Linearity, 2) Independence of errors, 3) Normality of errors, 4) Equal variances. Tests of assumptions: i) plot each feature on x-axis vs y_error, ii) plot y_predicted on x-axis vs y_error, iii) histogram of errors
- Overfitting, bias-variance and learning curves. Overfitting (high variance) options: more data, increase regularization, or decrease model complexity
- Overspecified model: can be used for prediction of the label, but should not be used to ascribe the effect of a feature on the label
- PCA: transform the data using the k vectors that minimize the perpendicular distance to the points. PCA can also be thought of as an eigenvalue/eigenvector decomposition
- Pearson’s correlation coefficient

- Receiver operating characteristic (ROC): relates true positive rate (y-axis) and false positive rate (x-axis). A confusion matrix defines TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
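
The two rates fall directly out of the confusion-matrix counts; a tiny worked example with made-up counts:

```python
def tpr_fpr(tp, fp, fn, tn):
    """True and false positive rates from confusion-matrix counts."""
    return tp / (tp + fn), fp / (fp + tn)

# 100 actual positives (80 caught), 100 actual negatives (10 false alarms)
tpr, fpr = tpr_fpr(tp=80, fp=10, fn=20, tn=90)
```

Sweeping the classification threshold and plotting (fpr, tpr) pairs traces out the ROC curve.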
- Naive Bayes
- Normal Equation

- Random Forests: each tree is built using a sample of rows (with replacement) from the training set. + Less prone to overfitting
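
The row-sampling (bagging) step behind each tree is simple to sketch; the function name and seed are my own illustrative choices:

```python
import random

def bootstrap_sample(rows, seed=None):
    """Draw len(rows) rows with replacement: the bagging step behind each tree."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]

rows = list(range(10))
sample = bootstrap_sample(rows, seed=0)
# same size as the original; duplicates are expected, and the rows never
# drawn form the "out-of-bag" set used to estimate generalization error
```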
- Ridge Regression regularization: imposes a penalty on the size of the coefficients

- R2: measures the strength of a linear relationship. Can be near 0 even for strong nonlinear relationships. Never decreases as more features are added
- Sample variance: divided by n-1 to achieve an unbiased estimator, because 1 degree of freedom is used up estimating the sample mean
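
The standard library makes the n-1 vs n distinction explicit (toy data chosen so the population variance is a round number):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sample_var = statistics.variance(data)       # divides by n - 1 (unbiased)
population_var = statistics.pvariance(data)  # divides by n
# the sample variance is always the larger of the two
```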
- SMOTE algorithm is parameterized with k_neighbors. Generates a new minority-class point on the vector between an existing minority point and one of its nearest neighbors, placed a random fraction in [0, 1] of the way from the original point
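
The interpolation step can be sketched as follows; this omits the k-nearest-neighbor search entirely and the function name is my own:

```python
import random

def smote_point(x, neighbor, rng=random):
    """Synthesize a minority-class point on the segment between x and one of
    its nearest neighbors (the neighbor search itself is omitted here)."""
    gap = rng.random()  # random fraction in [0, 1) of the way to the neighbor
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

new_point = smote_point([1.0, 1.0], [3.0, 2.0])
# new_point lies somewhere on the line segment between the two inputs
```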
- SQL COALESCE(): evaluates its arguments in order and returns the first value that does not evaluate to NULL
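
A runnable demonstration using Python's built-in sqlite3 (table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (nickname TEXT, full_name TEXT)")
conn.execute(
    "INSERT INTO users VALUES (NULL, 'Ada Lovelace'), ('grace', 'Grace Hopper')")
rows = conn.execute(
    "SELECT COALESCE(nickname, full_name, 'anonymous') FROM users ORDER BY rowid"
).fetchall()
# the first non-NULL argument wins for each row:
# Ada has no nickname, so her full name is returned instead
```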
- SQL window functions: e.g. ROW_NUMBER() OVER (PARTITION BY ...)
- SVM: Effective in high dimensional spaces, even when the number of dimensions exceeds the number of examples. SVMs do not directly provide probability estimates
- Stochastic gradient descent cost function

This article was originally published on my personal website adamnovotny.com