Mathematics is the study of topics such as quantity, structure, space and change. Mathematics provides a means to represent the real world in the form of numbers, relationships and equations.
Machine learning employs a wide variety of mathematical concepts and functions in order to create processes that:
- analyze vast amounts of data in workable timeframes
- discover actions, relationships and forces occurring in nature
- model the real world
- interact with the real world
In short, mathematics is the base methodology employed to create machine learning models that are used to address challenging applications such as image recognition, face recognition, natural language processing and pattern recognition:
Mathematics -> Models -> Applications
Below is a summary of prominent machine learning mathematical topics grouped into these categories:
- Activation Functions
- Graph Theory
- Information Theory
- Linear Algebra
- Mathematical Constants
- Mathematical Models
- Mathematical Operations
- Matrix Mathematics
- Accuracy and Precision
- A/B Testing
- Cluster Analysis
- Coefficient of Determination
- Data Cleaning
- False Positive and False Negative Errors
- Hypothesis Testing
- K-Means Clustering
- Law of Large Numbers
- Mean Squared Error
- Logistic Regression
- Loss Function
- Precision and Recall
- Regression Analysis
- Sample Size Determination
- Selection Bias
- Standard Deviation
- Statistical Classification
- Statistical Inference
- Statistical Power
- Statistical Reliability and Validity
Activation Function Data Flow Equation
The activation function data flow equation is shown below. Input values v, weights w and bias b are fed to a summation function which passes its output s to an activation function g which in turn passes its outputs t to other network nodes.
the activation function data flow equation is:
Activation Function Examples from Nature
An example from biology is the shift in the action potential of a brain cell:
An example is the haemodynamic response function, which shows the body responding to environmental changes by delivering nutrients such as oxygen and glucose to stressed tissues:
Activation Function Examples in Neural Networks
In neural networks, activation functions are used in combination with other functions such a matrix addition and multiplication. The networks below show two examples from Google research. The network on the left was designed by human experts and the network on the right by an automated process:
Rectifier (RELU) Activation function
A rectifier function (also referred to as a Rectified Linear Unit or ReLU) is defined as:
Common positive comments about ReLU activation functions include:
- Faster convergence of stochastic gradient descent. Due to their linear, non-saturating form.
- Simpler, faster computation. Due to their linear form not involving exponential operations.
Common negative comments include:
- They can be fragile and 'die'. If the learning rate is set too high, a large gradient flowing through a ReLU neuron can cause weights to update is a way that the neuron will never activate again.
The derivative of the function is:
The curve produced has a fairly gradual rise:
Common negative comments about the sigmoid activation function include:
- Sigmoids can saturate and kill gradients. Gradients (change) at the tails of 0 and 1 are almost zero.
- Sigmoid outputs are all positive values. This can bias network results. The effect can be mitigated by not using sigmoids in the final layers of a network.
Softmax Activation Function
The Softmax function "squashes" a K-dimensional vector z of arbitrary real values to a K-dimensional vector of real values in the range [0, 1] that add up to 1. If:
then the total of all values of z used as a power of e (Euler's Number) is:
and the individual normalized values j of K dimensioned vector z are:
Note that we're using the values of z as a power of e (Euler's Number) instead of using just z alone. This will change negative values to positive values and give disproportionately more weight to originally positive values.
TANH Activation Function
The tanh (Hyperbolic Tangent) function is the hyperbolic analogue of the tan circular function used throughout trigonometry.
The equation for tanh is:
The derivative of tanh is:
Compared to the Sigmoid function, tanh produces a more rapid rise in result values.
Common negative comments about tanh activation functions include:
- Tanh can saturate and kill gradients. Gradients (change) at the tails of 0 and 1 are almost zero.
Calculus is used in machine learning to represent change. Training machine learning models and performing inference against those models involves a lot of change measurement, and calculus is used extensively for these operations.
then the derivative function is expressed as the change in f(x), also thought of as the y axis, with respect to every point x:
then the derivative of this function can be expressed as:
An example is the derivative of:
Splitting it into two functions, it's made up of an outer function f(g) and an inner function g(x):
Taking the derivative of each of these separately:
Putting it all together:
Differential Calculus Sum and Difference Rule
In calculus, the sum rule is used to differentiate functions of the form:
The derivatives of these functions are:
An example is:
The differential of the function is:
An example is:
An example is:
Histogram of Oriented Gradients
The Histogram of Oriented Gradients technique counts occurrences of gradient orientation in localized portions of an image.
This is illustrated by the area S below:
Stochastic Gradient Descent
- Stochastic: random processes are used
- Gradient: a derivative based change in a function output value
- Descent: moving from greater to lesser values of a function output value
Derivatives are key to making stochastic gradient descent work, so let's take a look at them first before tackling SGD. Derivatives are shown graphically in the illustration below, where:
The curved blue line shows the function:
The derivative of this function is:
The tangent line of a curve shows its slope, which is also the derivative. The two straight lines show the tangent of f(x) at two points,
red line - shows the tangent for x=2:
green line - shows the tangent for x=1:
Gradients provide a couple of key pieces of information that are essential for performing SGD:
- The direction of descent - in the graph above, moving left descends the curve
- The size of descent steps - when the derivative (slope) is larger, steps can be larger
Now let's take a look at a formula for using SGD to find a minimum for our function above. The formula iteratively uses a step function to subtract values from x to arrive at the estimated minimum:
In the graph illustration above:
Using SGD, we're iteratively reducing the value of x, using the formula above, to find the minimal value of y. The multiplier used in the x reduction equation is called the learning rate:
There are a number of approaches to calculating the learning rate. These include:
- Momentum -remembers the update at each iteration and determines the next update as a linear combination of the gradient and the previous update
- Averaging - records an average of its parameter vector over time
- AdaGrad - increases the initial learning rate for more sparse parameters and decreases the learning rate for less sparse ones
- RMSProp - (for Root Mean Square Propagation) is a method in which the learning rate is adapted for each of the parameters - the idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight
- Adam - (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer - in this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used
So as we reduce the value of x the value of y will also decrease. Because we want to find the minimum value of y we want to stop iterating the reduction of x when y=0. In our example, we're able to look at the graph and see that the minimum is at y=0. But in actual SGD applications, the value of y is unknown. We won't be able to determine the exact minimum, so we'll want to stop the iteration of x when the reductions in x get very small.
Below are the calculations from running a number of iterations for the calculation of y based on the values shown:
The (x,y) pair values are plotted in the graph below. Notice that as the value of y approaches 0 the descent slows.:
SGD will typically be used to find the minimum for a function of multiple variables, not just one as we used in our example above. Also, the vertical axis dependent variable is typically referred to as a cost, or loss, function.
The illustration below shows a cost/loss function J with two independent variables:
Moving down the gradient descent curve might look something like this:
Let's say we have a cost/loss function J that is similar to the single variable function we explored but with two variables:
We would want to find the minimum of this function, which can be expressed as:
In our simple example above, we used the derivative of our function as part of the calculation to modify the variable x. Because we're now dealing with two variables, we'll use partial derivatives with respect to each variable.
A graph is a structure of related objects. Below is an example of a graph with 6 nodes (also called vertices) 7 lines (also called edges).
The illustration below is of a graph characterized by the vertices V and edges E:
Graph Node Functions
Graph nodes perform a variety of machine learning functions. These include:
Activation functions generate an output value based on an input value and function equation:
Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
Input functions collect and organize data that will be passed on to other graph nodes.
Matrix operations such as addition, multiplication and rotation use input matrices to create new output matrices.
Mathematical operations apply input data to a mathematical function to produce output data.
Memory functions retain information for use by other graph nodes.
Output functions collect and organize data that will be passed out of the graph.
Pooling functions combine and simplify data.
In a feed forward graph, connections between nodes do not form a cycle. Data flows forward from node to node and nodes perform graph node functions. These graphs form the basis for feed-forward convolutional neural networks.
Nearest Neighbor Algorithm
The nearest neighbor algorithm is a computational method for finding an efficient path through a graph, using these steps
- start on an arbitrary vertex as current vertex.
- find out the shortest edge connecting current vertex and an unvisited vertex V.
- set current vertex to V.
- mark V as visited.
- if all the vertices in domain are visited, then terminate.
- Go to step 2.
Information entropy is defined as the average amount of information produced by a probabilistic stochastic source of data. Generally, information entropy is the average amount of information conveyed by an event, when considering all possible outcomes. Entropy is zero when one outcome is certain to occur.
Cross Entropy (loss)
Cross entropy can be used to calculate loss. In the example below, the cross entropy function is defined (using log base e) as:
In the example below, because 1-hot labels are being used:
and the cross-entropy is measuring the loss associated with representing the values of S in L:
Linear algebra deals with calculations of spaces defined by lines, planes and vectors.
Eigenvalues and Eigenvectors
Eigenvalues together with eigenvectors define aspects of linear transforms on spaces. An eigenvector defines a direction that is not changed. An eigenvalue defines a length change related to the eigenvector.
Below, in a shear mapping linear transform, the red arrow changes direction but the blue arrow does not. The blue arrow is an eigenvector of this shear mapping because it doesn't change direction, and since its length is unchanged, its eigenvalue is 1.
A Euclidean vector is a geometric object that has magnitude (or length) and direction, such as in the example below:
A matrix (plural matrices) is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. Matrices allow the efficient and compact storage of information that can then be used in many types of machine learning functions. Important matrix operations include:
The Derivative of e to the Power of x
gives a curve at which the slope at any value x is also the value of y, which means the derivative is equal to the function itself, or:
Characteristics of e
e also has these important characteristics:
- e is the base rate of growth shared by all continually growing processes
- e shows up whenever systems grow exponentially and continuously, such as population, interest calculations and radioactive decay
- every rate of growth can be considered a scaled version of e (unit growth, perfectly compounded)
- e allows taking a simple growth rate where all change happens as the end of a finite period and find the impact of compound, continuous growth at infinitely smaller intervals
Compound Growth using e
Let's take a closer look at this last point, the impact of smaller intervals of compound growth calculation. Let:
The equation for calculating F is:
If we perform the calculation only once per period, then, n=1 and:
On the other extreme, if we perform compounding calculations continually letting n --> infinity, then:
So now we see how e fits into this equation. But how did it get there? To explore this, let's start with a simplified version of equation (1) above and let:
With these values, equation (1) becomes:
Now let's solve this equation for values of n. After n=10, we'll use a logarithmic scale and jump by a factor of 10 for each new calculation so we can see what happens with very large values of n:
In graph form, it looks like the chart below. Notice the jump between 10 and 100 as we move from a linear graph to a logarithmic graph for values of F:
What we see is that as the value of n approaches infinity, the value of F converges on the value of e: 2.7182820... This means that:
Another equation for calculating the value of e is:
An Example of Compound Growth using e
Let's look at an example of equation (1) when we let n --> infinity and have other values that are not equal to 1 for I, r, t:
equation (1) looks like this:
Combining the Rate of Growth and Time Periods using e
Using n --> infinity allows us to combine rate of growth and time periods calculated into one number. In the example above:
Deriving the Use of e in Compound Growth Equations
We haven't yet explained how we derived the following substitution in equation (1):
In other words, how did we simplify to using only e and rt as shown in equation (7)? Explaining it is a somewhat lengthy process, so we'll go through it in detailed steps...
Changing r/n to a Denominator
First, let's change r/n to a denominator using the algebraic rule of complex fraction inversion:
Applying the Power Rule to the Exponents
Next we're going to use the algebra power rule which says:
To do this, we'll first let:
and substitute it into our equation (8) so that::
Now we're going to introduce r into the exponent nt by adding it in as both a numerator and denominator that cancel each other out and rearranging the single fraction into two fractions, so that:
Now we'll apply the power rule from equation (9) with our new exponent values of x from equation (11) to get :
Next we'll replace x in equation (13) with the value from above:
Showing How e Enters our Equation
We still have to see how e enters this equation. To do this, within equation (14) let:
Within this equation, now look at:
Remember, we already know from equation (5) above that:
So equation (1) is basically the same as y except we're using n/r instead of just n. Now consider this:
Think about it, n/r might approach infinity at a somewhat slower pace than n, but it's still approaching infinity. So now we can say that for y from equation (15):
Using Backward Substitution to Show a Final Result
Now we can do backward substitutions through our sequence of equations to show a final result.
Combining equations (16) and (15) we have:
Combining equations (17) and (14) we have:
Combining equations (18) and (13) we have:
Combining equations (19) and (11) we have:
Combining equations (20) and (1) we have:
Which, finally, shows how we got to equation (3):
Curve (Model) Fitting
In the example below, the dotted black line represents fitting points generated with a sine function. The solid lines are various degree polynomials:
- Red - first degree
- Green - second degree
- Orange - third degree
- Blue - fourth degree
In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
In the example below, the green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on it and it is likely to have a higher error rate on new unseen data, compared to the black line.
Regularization is a process of introducing or removing information in order to solve an ill-posed problem or to prevent overfitting, as illustrated in the example below. Machine learning dropout functions are a form of regularization. Regularization is often accomplished by adding a tuning parameter to a model training process.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would have poor predictive performance, as shown in the figure below:
Basic Mathematical Operations
Basic operations include:
- Elementary Arithmetic - e.g. addition, subtraction, division, multiplication
- Exponentiation - one number (base) raised to the power (exponent) of another number
- Logarithms - inverse of exponentiation, the logarithm of a number is the exponent to which another fixed number, the base, must be raised to produce that number - as an example, since 2^3=8, log2(8)=3
- Square root - a square root of a number a is a number y such that y^2 = a; in other words, a number y whose square (the result of multiplying the number by itself, or y⋅y) is a.
- Summation - the addition of a sequence of numbers
A matrix (plural matrices) is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. Matrix mathematics is important in machine learning because data is often sent between the nodes in a model graph and operated on in a matrix format. Matrices are a mathematically convenient way to apply mathematical operations to sets of data.
Matrix Basic OPERATIONS
Basic matrix operations include addition, scalar multiplication and transposition.
the matrix product AB (denoted without multiplication signs or dots) is defined to be the n × p matrix:
An example of multiplying AB and BA is shown below, notice that in the result, the number of rows is determined by the left matrix operand and the number of columns by the right matrix operand:
Central Limit Theorem
The central limit theorem establishes that, in most situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (a bell curve) even if the original variables themselves are not normally distributed.
This provides a way to understand characteristics of a population of data points using only samples taken from that population.
To see this visually, look at the example below. Random 0s and 1s were generated, and then their means calculated for sample sizes ranging from 1 to 512. Note that as the sample size increases the tails become thinner and the distribution becomes more concentrated around the mean.
If two events, A and B, are independent then the joint probability is multiplicative:
Events are independent if the occurrence of one does not affect the probability of occurrence of the other.
A simple example of coin tosses demonstrates this. If a coin is tossed twice, there are four possible combinations of results for which side of the coin is facing up in the first and second toss:
- head tail
- head head
- tail tail
- tail head
If we let
- A = first toss
- B = second toss
then the probability of each of these four result combinations is:
Let's see why. For the first toss, there are only two possible results:
If the coin toss is fair (no bias in the toss is introduced) there's a .50 probability of getting a head facing up. This is because of the finite number of possible results, in this case two - head or tail. No other outcome is possible. This is called a finite geometry, any geometric system that has only a finite number of points, in our case two. We are assuming that the coin cannot land on its edge, so it can only land in one of two ways, showing a head or tail facing up.
Let's say the coin is tossed and a head is the result:
This means that the first event is now in the past and cannot be changed. This is due to the arrow of time, a concept that says time only moves forward, not backward. So at this point, there is no possibility that the first toss produced a tail. There's a zero probability that the first toss is a tail. This eliminates two of the previously possible results:
- tail tail
- tail head
We are now left with only two possible outcomes for our experiment:
- head head
- head tail
The second toss result is not effected by the first toss result. The toss results are independent. Again, there's a .50 probability of a head or tail in the second toss. Let's say we get this result:
Now that this result has occurred, it also cannot be changed due to the arrow of time.
So our final result is:
- head tail
Looking at the tosses and possible results from a point in time before the tosses were made, we can see that our four possible results each had a .25 probability of happening:
- head tail: .25 probability
- head head: .25 probability
- tail tail: .25 probability
- tail head: .25 probability
Why? Because the second toss probabilities were only .50 of .50. The first toss had already occurred and eliminated two of the potential result combinations, leaving only two of the four available total results.
You can see these results visually with the coin combinations pictured below:
Of the four possible combinations, only one is possible with two tosses of the coin. Since each combination is equally possible, each has a .25 probability of happening.
Mutually Exclusive Events
If two events are mutually exclusive then only one or the other of the two can occur and the probability of either occurring is additive:
In probability theory, events E1, E2, ..., En are said to be mutually exclusive if the occurrence of any one of them implies the non-occurrence of the remaining n − 1 events.
A good example of the probability of mutually exclusive events is calculating the probability of drawing a King or Queen from a deck of cards. There are 4 Kings in the deck, so the probability of drawing a King is 4/52, or 1/13. And since there are also 4 Queens in the deck, the probability of drawing a Queen is also 1/13. So, if:
then the probability of drawing a King or Queen is:
Probability Density Function
The Probability Density Function is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. This probability is given by the integral of this variable’s PDF over that range—that is, it is given by the area under the density function but above the horizontal axis and between the lowest and greatest values of the range. The probability density function is nonnegative everywhere, and its integral over the entire space is equal to one. In the graphic below, IQR is the interquartile range.
Accuracy and Precision
Accuracy and precision describe the proximity of measurement results to the true value and variation in measurement data, as illustrated below:
Coefficient of Determination
The coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
In the example below, since the regression line does not miss any of the points by very much, the R2 of the regression is relatively high:
Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data quality screens include:
Extrapolation is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable, such as shown in the example below:
False Positive and False Negative Errors
To understand these errors, first consider the hypotheses being tested. These can be stated as the null (H0) and alternative hypothesis (H1):
- Null Hypothesis (H0): also thought of as the default state of nature ... an example being healthy persons in a hypothesis test of the side effects of a new vaccination.
- Alternative Hypothesis (H1): is the negative of the null hypothesis, and indicates a state that deviates from the default ... an example being sick persons as a result of the side effects of a new vaccination.
During hypothesis testing, false testing results can occur:
- False Positive (type I error): the null hypothesis is true, but is rejected ... an example is a healthy person being indicated as being made sick in the test of a new vaccination.
- False Negative (type II error): the null hypothesis is false, but fails to be rejected ... an example is a sick person being indicated as being healthy in the test of a new vaccine.
The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".
Law of Large Numbers
The law of large numbers is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
Proofs of the law of large numbers are shown here. Intuitively, it's possible to understand what's happening when considering:
- The experiment results are derived from a universe of potential results where an actual theoretical mean exists. In the example below, it's the mean of dice rolls, which is 3.5.
- The first experiment result establishes an initial distance from the theoretical mean.
- In order for the second experiment result has a finite probability of moving the observed averages farther away from the theoretical mean. That probability is a fractional number between 0 and 1. Let's call that probability p-2.
- Subsequent experiment results also have probabilities, p-n.
- Each experiment result is an independent event. That is, one does not depend on the other.
- The joint probability of a group of independent events occurring is determined by their product.
- So in our example, the joint probability would be: p-2 x p-3 x p-4 x ... p-n
- An example with a small number of results is: .84 x .76 x .63 x .58 = .2332
- As you can see, the joint probability gets small very quickly.
- This is an over simplification, but it helps understand the mathematics at work.
In the sample experiment below, notice that by 1000 trials, the observed averages variation from the theoretical mean has been reduced to almost zero. And by only a very small number of trials, the observed averages have come very close to the theoretical mean.
Mean Squared Error
Mean squared error (MSE) is used to measure the average distance of observed data points from predicted data points. Squaring the numbers eliminates negative values. A formula for MSE is:
Precision and Recall
Resampling is a process of systematically changing sample data sets used for testing a model or hypothesis. There are a number of approaches that can be used, such as:
- the size of the population from which the sample is taken is at least 10 times the size of the sample - as the population size increases above that level, the sample size would increase somewhat, but quickly level off at a constant number - this shows how large populations can be accurately sampled using relatively small sample sizes
- s = sample size
- z = Z-score - which relates to the desired confidence level of the result
- 90% confidence = 1.645 Z-score
- 95% confidence = 1.960 Z-score
- 99% confidence = 2.576 Z-score
- d = standard of deviation - how much variance expected in the responses
- a safe starting value is .5
- m = margin of error - how much error to allow in the result
- a typical value would be +/- 5%, or .05
an example using:
- z = 1.96
- d = .5
- m = .05
the calculation is:
Sampling is concerned with the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population. Important factors for proper sampling include:
- Sample Size: the sample size used in a study is determined based on the expense of data collection, and the need to have sufficient statistical power.
- Sampling Methods: include -
- Simple random sampling: entirely by chance
- Systematic sampling: from an ordered sampling frame such as a phone book
- Stratified sampling: from homogeneous subgroups
- Probability proportional to size sampling
- Cluster sampling: of groups such as by geography
- Quota sampling: of groups such as individuals by age and sex
- Accidental sampling: of the population that is close at hand
- Voluntary sampling: by self selection
- Panel sampling: of the same groups repeatedly over time
- Snowball sampling: of small groups that self recruit over time
The standard deviation (also represented by the Greek letter sigma σ or the Latin letter s) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. The standard deviation of a random variable, statistical population, data set, or probability distribution is the square root of its variance. A simplified way to think of this is that roughly two thirds of the data points are within one standard deviation of the mean and 95% within two standard deviations.
In machine learning, you'll often see the process of using a model to determine the characteristics of input data as inference. For example, the input might be the image of a person and inference is used to perform face recognition to determine the identity of the person.
The statistical power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. It can be equivalently thought of as the probability of accepting the alternative hypothesis (H1) when it is true—that is, the ability of a test to detect an effect, if the effect actually exists.
Power analysis is an experimental technique for determining the effect of different sample sizes.
The trigonometric functions are functions of an angle. The derivation of the word comes from the ancient Greek (trigon) for triangle. The functions relate the angles of a triangle to the lengths of its sides. The most familiar trigonometric functions are the sine, cosine, and tangent. The trigonometric functions of angle theta are illustrated in the diagram below: