• Backward selection. We start with all variables in the model, and remove the variable with the largest p-value—that is, the variable that is the least statistically significant. The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.
• Bootstrap Bootstrap is used to determine the mean and other statistical parameters like the standard error by repeatedly sampling a given dataset randomly with replacement and then using these samples to estimate these statistical parameters like mean and standard error.
• Bayes Classifier If we know the probability distribution of the data in a classification setting, then a bayes classifier each observation to the most likely class, given its predictor values. The bayes classifier minimizes the test error. Prediction can be made using $Pr(Y = j|X = x_0 )$
The expected test MSE, for a given value $x_0$ , can always be decomposed into the sum of three fundamental quantities: the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$ and the variance of the error terms. $E (y_0 − \hat{f}(x_0))^2 = Var( \hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$. See More: Bias
• Classification
Classification is used to predict which class a data point is part of (discrete value). Example: I have an unknown fruit that is yellow in color, 5.5 inches long, diameter of an inch, and density of X. What fruit is this? I would use classification for this kind of problem to classify it as a banana (as opposed to an apple or orange).
• Confusion Matrix
 Predicted Negative Positive Actual Negative FN FP Positive TN TP
• Sensitivity TP / (TP + TN)
• Specificity FN / (FN + FP)
• Cross Validation
• Validation Set - When half of the data is used for training and the rest for validation. Has high bias because only half of the data is used to construct the classifier/regression.
• Leave one out cross validation - When all but one data point is used to train and then validate on just one data point. This has low bias because almost all of the data is used for constructing the classifier/regression. But it might show a high variance of test error because the classifiers that are being constructed are higly correlated. The output they produce when averaged give high variance.
• k-fold cross validation - When data is divided into k parts and one part is used as a hold out set.
• Dependent Variable Same as Output Variable
• F-statistic : In multiple linear regression, a F-statistic is used to measure the quality of fit: $\frac{(TSS − RSS)/p}{RSS/(n − p − 1)}$, if this quantity is high, at least one of the predictors is relevant. Just using p-value sometimes randomly finds a particular predictor to be relevant even if it is not relevant.
• Feature Same as Input Variable
• Forward selection. We begin with the null model—a model that con- tains an intercept but no predictors. We then fit p simple linear re- gressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model. This approach is continued until some stopping rule is satisfied.
• heteroscedasticity When the error term $\epsilon$ in linear regression across different points has a variance. Sometimes, it can be spotted as a funnel shape in the residual graph.
• High Leverage points: In linear regression, high leverage point is the point with an unusual value of x.
• Independent Variable Same as Input Variable
• Input Variable
Column used by a learning algorithm to Make prediction or other analysis. X is the Input varibale if we are analyzing f or predicting Y using X: $Y = f(X)$
• Interaction term: When using linear regression with a multiplicative term, the multiplicative term is called the interaction term: $Y = aX_1 + bX_2 + cX_1 * X_2 + d$, the $cX_1 * X_2$ term is the interaction term.
• Irreducible Error
If $Y = f(X)$ is the function being analyzed, and we come up with an analysis $\hat{Y} = \hat{f}(\hat{X}, \epsilon)$ , $\epsilon$ is something that the analysis is not measuring as an input variable. This is the irreducible error.
• Linear Discriminant Analysis
This is a classification method in which points belonging to each class in the training set are modelled as a multivariate gaussian with priors as the number of points in a particular class. For simplification, the covariance matrix is averaged and assumed to be the same. This leads to linear decision boundaries.
• Linear Regression
1. Simple Linear Regression: Predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is approximately a linear relationship between X and Y . Mathematically, we can write this linear relationship as $Y \approx \beta_0 + \beta_1 X$ IF we are trying to fit this relationship on a set $(x_0, y_0), (x_1, y_1) ...$, one way is to try to minimize $\sum{e_i^2}$, where $e_i = y_i - \hat{y}_i$ where $\hat{y} = \beta_0 + \beta_1 * x_i$
2. Multiple Linear Regression: When Y is predicted on the basis of multiple Xs A good introduction
• Logistic Regression
Logistic regression is a form of classification where each class gets a linear classifier $V_{c, X} = \beta_{0c} + \beta_{1c} * X_1 + \beta_{2c} * X_2 ...$. A new point is classified to the line nearest to it. The way to train these classifiers is to model probabilites for each class. The probability of a point X being in class i is $P_{ci, X} = exp(V_{ci, X}) / \sum\nolimits_{j} exp(V_{cj, X})$ . For the training set, these probabilities are multiplied and then an Expectation Maximization algorithm(Newton Raphson) is used to find the $\beta$ coefficients.
• Mixed selection. This is a combination of forward and backward se- lection. We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit. We con- tinue to add variables one-by-one. Of course, as we noted with the Advertising example, the p-values for variables can become larger as new predictors are added to the model. Hence, if at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We con- tinue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.
• Non Parametric Methods
Does not assume that sample data comes from a population that follows a probability distribution based on a fixed set of parameters. All of the data is used to make a prediction or other analysis. For eg, kNN or kMeans.
• Outliers: An outlier is a point for which y i is far from the value predicted by the model. It has a high residual.
• Output Variable
Column being predicted or analyzed. Y is the Output varibale if we are analyzing f or predicting Y using X: $Y = f(X)$
• p-Value The p-value is defined as the probability of obtaining a result equal to or “more extreme” than what was actually observed, when the null hypothesis is true.
• Parametric Methods
Assumes that sample data comes from a population that follows a probability distribution based on a fixed set of parameters. The data can be divided between train and test data. training data can be used to estimate the probability distribution. Eg, Linear Regression, Neural Networks.
• Polynomial Regression: When we include polynomial terms in linear regression. $Y = aX + bX^2 + cX^3 ...$
• Predictors Same as Input Variable
• R-Square $(1 - RSS/TSS)$, where TSS is $\sum{(y_i - mean(y)) ^ 2}$. The value is between 0 and 1. One means perfect fit. In the simple linear regression case with one variable, this value is same as the correlation between x and y.
This is a classification method in which points belonging to each class in the training set are modelled as a multivariate gaussian with priors as the number of points in a particular class. The covariance matrix is not averaged and is assumed to be different for each class. This leads to quadratic decision boundaries.
• Reducible Error
If $Y = f(X)$ is the function being analyzed, and we come up with an analysis $\hat{Y} = \hat{f}(\hat{X}, \epsilon)$ , $\epsilon$ is something that the analysis is not measuring as an input variable. But $\hat{f}$ is not a perfect estimate of $f$. This difference between $\hat{f}$ and $f$ is called the reducible error.
• Regression
Regression is used to predict continuous values. Example: I have a house with W rooms, X bathrooms, Y square-footage and Z lot-size. Based on other houses in the area that have recently sold, how much (dollar amount) can I sell my house for? I would use regression for this kind of problem.
• ResponseSame as Output Variable
• RSE : Residual Standard error = $\sqrt{RSS/(n-2)}$. This can be used to estimate the $\epsilon$ if the true distribution is $Y = mX + C + \epsilon$.Linear RegressionRSS
• RSS : Residual Sum of Squares in Linear Regression. $\sum{e_i^2}$. Linear Regression
• Standard Error : How accurate is the sample mean (or another quantity) $\hat{\mu}$ as an estimate of $\mu$? We can get the answer by standard error which in the case of sample mean is $\sigma^2 / n$ where $\sigma$ is the standard deviation of the sample.
• Supervised Learning
In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output. Classification(add link), Regression(add link)
• t value : $\beta / Standard Error(\beta)$ follows the t -distribution and the corresponding p-value can be computed using a table Linear Regression
• Unsupervised Learning
Unsupervised learning, on the other hand, allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we dont necessarily know the effect of the variables. Eg, clustering (add link)