#+TITLE: Neural Networks

* meta
| Prof.  | Thomas P. Caudell                                                  |
| Office | ECE Rm 235D                                                        |
| Text   | "Neural Networks: a comprehensive foundation" Haykin Second Edition |

All homework should be submitted electronically
- book file:nn-a-comprehensive-foundation.pdf

* class notes
** 2010-08-24 Tue
\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
  \node [neuron] (neuron) at (0,0) {$\Sigma$};
  \node (input) at (-2,0) {$\overrightarrow{x}$};
  \node (weight) at (-0.75,1) {$\overrightarrow{w}$};
  \node (output) at (1,0) {$y$};
  \draw (-1.5,1) -- (neuron); \draw (-1.5,0.5) -- (neuron); \draw (-1.5,0) -- (neuron);
  \draw (-1.5,-0.5) -- (neuron); \draw (-1.5,-1) -- (neuron);
  \draw (neuron) -- node[above] {$\Phi$} (output);
\end{tikzpicture}

file:data/canonical-model.png

Canonical Model: /embodied/, always exists with inputs and outputs
- input vector $\overrightarrow{x}$
- many incoming axons with weights $\overrightarrow{w}$
- \Sigma of inputs $v = \overrightarrow{w}^T\overrightarrow{x}$
- \Phi some (usually non-linear) post-processing of the sum, e.g.
  - pure linear identity
  - binary on/off
- output y

Structure
- single neuron
- levels of neurons
- trees of neurons
- arbitrary graphs (cycles) -- this introduces /time/ or /memory/ into the system

** 2010-08-26 Thu
NN work was initially derived from efforts to understand the brain. Today we'll talk about biological models of the brain, basically all the stuff we'll be throwing out.
- neuron doctrine :: brain composed of /discrete/ cells, not one contiguous tissue (emerged late 1800s)
: 0>----- 0>------ 0>------

two kinds of cells in the brain
- neurons :: $10^{10}$ neurons in the brain -- many, many more at birth
- glial :: ~100 times as many glial cells as neurons
  - interface between body and neurons (/blood brain interface/)
  - clean up the waste produced by the neurons
  - provide scaffolding/structure on which neurons grow

Three parts of a neuron
:   dendrites           soma                    axons
:   ---------           ----                    -----
:    inputs            computer                outputs
:
: -----------\
:             \            /------------------\
:             |            |       soma       |
: ------------+            |       50nm       |   length 200microns -> 2m
:             |            |                  |>-------------------------
:             +------------+                  | \          ^
:             |            |                  |  \         |
: -------------+           \------------------/   \     myelin coating
:             |                                    \
: -------------/                               axon hillock

general types of neurons
- unipolar :: dendrite and axon are connected to each other (no computing)
- bipolar :: dendritic tree and a single axon (as shown above) -- e.g. sensing, some dendrites in the eye actually sense light, some in the skin actually detect mechanical pressure, etc.
- multipolar :: many bushy dendritic trees, and a single axon -- e.g.
  in spine and related to motor control
- pyramidal :: conical body with multiple potentially long dendritic trees out the point of the cone, and one branching axon coming out the base -- in cerebral cortex, used for higher order cognition
- Purkinje cell :: very well organized comb-like dendrites, can have 100s of thousands of inputs, used in motor control, in cerebellum

synapses
- electrolytes consisting of sodium, calcium, chloride, and potassium ions (not electrons), which flow down axons as charge
- neurons have an internal negative charge on the order of 60-70 millivolts
- the neuron slowly accumulates positive charge until the /axon hillock/ fires, sends the charge down the axon, and resets the neuron to a resting charge -- this happens over a period of ~1ms
- the pulse is /regenerated/ on the way down the axon, ensuring that the height and the width of the pulse are maintained from the beginning to the end of the axon -- the regenerators are like voltage-dependent valves and pumps along the axon
- myelin is mainly fat, so it looks white; /white matter/ in the brain is therefore mainly connections, and /grey matter/ in the brain has more neuronal bodies
- signals travel along an axon w/o myelin at ~10 meters per second; glial cells wrap the axon in myelin, which insulates portions of the axon s.t.
  those portions are /skipped/ by the traveling spike, resulting in conduction speeds of up to ~100 meters per second
- max clock-speed of a neuron is ~1 kilohertz
- a typical neuron could have on the order of 10,000 synapses

reaction time
- for, say, braking in your car, this can be ~1/2 second; that's like 5-10 serial steps of neurons, plus the flow down the spine to the motor control

synapse
:                      /-----------------
:                      | Soma,
:              cleft   | dendrite,
:              ~30nm   | axon,
: ------------\        | or even another
:  synaptic   |        | synaptic bulb
:  bulb       |        |
:             | chem   |
:  transmits  | signal |
:  electric   | -----> |
:  sig to     |        |
:  chem sig   |        |
: ------------/        |
:                      |
:                      |
:                      \-----------------

learning takes place largely at the synapse, by changing how much chemical is released per electric pulse.

** 2010-08-31 Tue
\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[-),>=stealth', shorten >=1pt, auto]
  \node (input) at (-2,0) {$\overrightarrow{x}$};
  \node [neuron] (neuron1) at (0,0) {$1$};
  \node [neuron] (neuron2) at (1,0) {$2$};
  \draw (-1.5,1) -- (neuron1); \draw (-1.5,0.5) -- (neuron1); \draw (-1.5,0) -- (neuron1);
  \draw (-1.5,-0.5) -- (neuron1); \draw (-1.5,-1) -- (neuron1);
  \draw (neuron1) -- (neuron2);
  \draw (neuron2) -- (1.75,0);
\end{tikzpicture}

file:data/neurons.png

Input of charge along dendrites
- soma is constantly leaking charge
- each incoming impulse jumps up the charge in the soma
- inputs arriving at different distances down the dendrites will take different amounts of time to arrive
- complex spatio-temporal integration

Hebb's rule
- if a neuron's firing is correlated with the firing of a synapse on the neuron, then the strength of the synapse's effect on the neuron will be increased
- (1) above is the /pre-synaptic/ neuron
- (2) above is the /post-synaptic/ neuron
- when they fire together the synapse increases in strength due to the sympathetic electrical and chemical processes

Cerebral Cortex flattens out to ~2 sq. feet and ~1mm thick, this is the
darker /gray matter/ (no myelin), under this sheet are bundles of connections between areas of the cortex (more myelin), the /white matter/.

** 2010-09-02 Thu
*** bias
in many cases learning cannot take place without an initial bias; where does this initial bias come from?

\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
  \node (bias) at (0,1) {bias};
  \node [neuron] (neuron) at (0,0) {$\Sigma$};
  \node (input) at (-2,0) {$\overrightarrow{x}$};
  \node (weight) at (-0.75,1) {$\overrightarrow{w}$};
  \node (output) at (1.5,0) {$y$};
  \draw (bias) -- (neuron);
  \draw (-1.5,1) -- (neuron); \draw (-1.5,0.5) -- (neuron); \draw (-1.5,0) -- (neuron);
  \draw (-1.5,-0.5) -- (neuron); \draw (-1.5,-1) -- (neuron);
  \draw (neuron) -- node[above] {$\Phi$} (output);
\end{tikzpicture}

file:data/bias.png

\begin{equation}
  v = \sum_{i=1}^{n} w_i x_i + w_0
\end{equation}

/bias/ can be considered (and implemented as) an axon with constant input or a separate parameter to \Sigma

*** \Pi neuron
\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
  \node [neuron] (neuron) at (0,0) {$\Pi$};
  \node (input) at (-2,0) {$\overrightarrow{x}$};
  \node (weight) at (-0.75,1) {$\overrightarrow{w}$};
  \node (output) at (1.5,0) {$y$};
  \draw (-1.5,1) -- (neuron); \draw (-1.5,0.5) -- (neuron); \draw (-1.5,0) -- (neuron);
  \draw (-1.5,-0.5) -- (neuron); \draw (-1.5,-1) -- (neuron);
  \draw (neuron) -- node[above] {$\Phi$} (output);
\end{tikzpicture}

file:data/pi-neuron.png

Unlike a \Sigma neuron, a \Pi neuron takes the product of its inputs rather than the sum.
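
A minimal sketch of the two neuron types, to make the difference concrete. The function names, the sample numbers, and the choice to multiply each input by its weight before taking the product in the \Pi neuron are assumptions for illustration; the notes only say that the \Pi neuron takes a product rather than a sum.

#+begin_src python
import math

def sigma_neuron(x, w, w0=0.0, phi=lambda v: v):
    """Sum neuron: v = sum_i w_i * x_i + w0, then y = phi(v)."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return phi(v)

def pi_neuron(x, w, phi=lambda v: v):
    """Product neuron (assumed form): v = prod_i (w_i * x_i), then y = phi(v)."""
    v = math.prod(wi * xi for wi, xi in zip(w, x))
    return phi(v)

x = [1.0, 2.0, 3.0]      # illustrative inputs
w = [0.5, -1.0, 2.0]     # illustrative weights

y_sum = sigma_neuron(x, w, w0=0.25)
y_prod = pi_neuron(x, w)

# The bias can equally be treated as one more weight on a constant input of 1.
y_sum_as_input = sigma_neuron([1.0] + x, [0.25] + w)
#+end_src

Both ways of handling the bias give the same activation here, which is why later sections simply fold $w_0$ into the weight vector.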
*** adapting activation functions (\phi) weights and structures
:PROPERTIES:
:CUSTOM_ID: notes-activation-functions
:END:
(see reading-activation-functions)

activation functions (\phi) in increasing complexity -- these will all be monotonically non-decreasing (biological plausibility)
- constant
- linear
- piecewise linear
- sigmoid $\frac{1}{1+e^{-a(v+b)}}$
  - equals 1 at +\infty
  - equals 0 at -\infty
  - where /a/ controls the slope and /b/ shifts the curve along /v/, which sets its value at the origin
- hyperbolic tangent, a sigmoid rescaled and shifted down so it equals -1 at -\infty
- there is one which is the /most/ biologically plausible
  \begin{equation}
    \phi(v) = \frac{v^2}{1+v^2}
  \end{equation}
- stochastic activation functions
  - in one type the neuron fires with some probability
  - in the other type the output /itself/ is a probability, however this is problematic because there is no way for a set of neurons to normalize their outputs (/probabilistic neural networks/)

we like these to be
- monotonically non-decreasing
- bounded
- continuously differentiable

*** radial basis neurons
Rather than a sum or product of inputs, these take the difference between each input and its weight. With the Gaussian output below, the smaller the difference between the input vector $\bar{x}$ and the weight vector $\bar{w}$, the more active the neuron.
\begin{equation}
  v = |\bar{w}-\bar{x}|^2
\end{equation}
and
\begin{equation}
  y = e^{-v^{2}}
\end{equation}

*** architectures
layers
- input layer $l_0$
- single layer has input and 1 processing layer ($l_1$)

recurrent system
- like a single-layer feed forward, but all output neurons are connected (/lateral/ recurrent system)
- or with outputs from some $l_i$ going back into some $l_j$ s.t.
  $j<i$

recurrent systems can have very weird behavior; in the terminology of control systems they are /non-linear/ (see /non-linear dynamical systems/)

*** invariants
a variety of inputs can come from the same object (distance, orientation, etc...); the network needs to learn only the invariant information

** 2010-09-07 Tue
The textbook is written by someone from a signal-processing background; he includes many flow diagrams, and we can ignore them if we like.

Today we'll finish Chapt. 1
- chapter 2 is learning
- chapter 3 is the first /neural network/ chapter (single layer perceptron)
- chapter 4 multi-layer perceptrons

*** Structure
\usetikzlibrary{arrows}
\tikzstyle{off} = [circle, draw]
\tikzstyle{on} = [circle, draw, fill=blue!40]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
  \node [off] (01) at (0,0) {}; \node [off] (02) at (0,1) {}; \node [off] (03) at (0,2) {};
  \node [off] (04) at (0,3) {}; \node [off] (05) at (0,4) {};
  \node (x) at (-1,2) {$\bar{x}$};
  \node [off] (11) at (1,1.5) {}; \node [on] (12) at (1,2.5) {}; \node [off] (13) at (1,3.5) {};
  \node [on] (21) at (2,0) {}; \node [on] (22) at (2,1) {};
  \node [off] (23) at (2,2) {}; \node [on] (24) at (2,3) {};
  \draw (01) -- (11); \draw (02) -- (12); \draw (02) -- (13); \draw (03) -- (11);
  \draw (03) -- (12); \draw (04) -- (13); \draw (05) -- (13);
  \draw (11) -- (21); \draw (12) -- (22); \draw (13) -- (23);
  \draw (12) -- (24); \draw (11) -- (24);
\end{tikzpicture}

file:data/knowledge-rep.png

How do you represent knowledge in a neural network? As some pattern of activation in the network.
- /distance/, \forall $\bar{x}$ \exists some $\bar{y}$ which is the /activation/ of the network due to the input. Each $\bar{y}$ can be thought of as a point in an /n/ dimensional space (/n/ neurons in the network). We can take the Euclidean distance between these points as a measure of their similarity.
  - Manhattan distance is $L_1$
    \begin{equation}
      L_1 = d_{kj} = \sum_{l=1}^{n}{|y_{kl} - y_{jl}|}
    \end{equation}
  - Euclidean distance is $L_2$
    \begin{equation}
      L_2 = d_{kj} = |\bar{y}_k - \bar{y}_j| = \left(\sum_{l=1}^{n}{(y_{kl} - y_{jl})}^2\right)^{\frac{1}{2}}
    \end{equation}
  - another interesting one is $L_\infty$
  - also the dot product of the vectors is interesting
  - other metrics could be /statistical/, the mean of the activated values or something
  - edit or Hamming distance

  A /metric/ or /distance/ must be
  - positive :: $d \geq 0$
  - triangular :: $d_{12}+d_{23} \geq d_{13}$
  - symmetric :: $d_{12} = d_{21}$
- an /important/ input should stimulate /more/ neurons
- /prior/ information can be built into the structure of the network or the pre-processor of the network
- three ways to handle /invariants/ -- it's very possible for your system to lock onto the wrong invariant if you're not careful in the selection of test data
  1) structure
  2) training
  3) feature space transformation (preprocessing), e.g. we might calculate the /moments/ of each image in a series of images
- moments :: the following are examples of moments, something like $\frac{y(x)}{m_0}$ for input images could be used to control for the overall brightness of the system, these can also be used for translation (e.g. $x'=(x-m_1)$), rotation, etc...
  assuming there's only one item of interest in the scene
  - $m_0 = \int_{-\infty}^{\infty} y(x) dx$ (area under the curve)
  - $m_1 = \int_{-\infty}^{\infty} y(x) x dx$ (expected value of x or mean, once divided by $m_0$)
  - $m_n = \int_{-\infty}^{\infty} y(x) x^n dx$

standard geometrical invariants
- translation -- could add another layer that /or/'s together a bunch of inputs from different locations
- rotation
- scale -- ratios are invariant over different scales

** 2010-09-09 Thu
*** chapter 2
\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
  \node [neuron] (neuron) at (0,0) {$\Sigma$};
  \node (input1) at (-1.5,0.5) {$x_{1}$};
  \node (input2) at (-1.5,-0.5) {$x_{2}$};
  \node (output) at (1,0) {$d$};
  \draw (input1) -- node[above] {$w_1$} (neuron);
  \draw (input2) -- node[below] {$w_2$} (neuron);
  \draw (neuron) -- node[above] {$\Phi$} (output);
\end{tikzpicture}

file:data/turner.png

Consider a neuron w/2 inputs $x_1$ and $x_2$, and for each combination we have a desired output value /d/.

table of desired behavior
| x1  | x2 | d | y |
|-----+----+---+---|
| ... |    |   |   |

In this example let the error measure be $\frac{1}{2}e^2=\frac{1}{2}(d-y)^2$ with $e = d - y$; the $\frac{1}{2}$ is there for the kinetic energy analogy.

Kinetic Energy
\begin{equation}
  e = \frac{1}{2}mv^2
\end{equation}

So for the above, let's make our neuron linear, and have our desired value be either positive or negative. We want to minimize the error using a /gradient descent algorithm/.
\begin{equation}
  w_1(n+1) = w_1(n)-\eta\frac{\delta e^2(n)}{\delta w_1}
\end{equation}
where \eta provides a scaling factor to convert between units of error and units of weight. So what's our derivative? The neuron is linear, so $y = w_1x_1 + w_2x_2$ and $e = d - y$, which gives
\begin{equation}
  \frac{\delta e^2(n)}{\delta w_1} = -2e(n)x_1(n)
\end{equation}

**** learning algorithms
1) supervised vs. unsupervised
2) local vs. global
3) statistical vs. deterministic
4) memorization vs. generalization
5) fast vs.
   slow, meaning the rate of weight change per experience; fast learning typically involves a lot of forgetting (see /stability plasticity dilemma/)

** 2010-09-21 Tue
on paper

** 2010-09-23 Thu
*** optimization
- ECE506 is entirely dedicated to optimization

we have some function, and we want to find the minimum

set xrange[-5:5]
set yrange[0:25]
plot x**2+5, '-' w p ls 2
3 14
e

file:data/opt-func.svg

- we can do /gradient descent/ with
  \begin{equation*}
    \bar{\nabla}E = \frac{\delta E}{\delta w_1}\hat{w_1} + \frac{\delta E}{\delta w_2}\hat{w_2} + \frac{\delta E}{\delta w_3}\hat{w_3}
  \end{equation*}
  to update our weights with
  \begin{equation*}
    \bar{w}(n+1) = \bar{w}(n) - \eta\bar{\nabla}E
  \end{equation*}
  Using this we can show that, for a small enough \eta, the error will not increase. Considering the single-dimension case with a first-order Taylor expansion
  \begin{eqnarray*}
    E(w(n+1)) &\approx& E(w(n))+\frac{\delta E}{\delta w} (w(n+1)-w(n))\\
              &=& E(w(n)) - \eta \left(\frac{\delta E}{\delta w}\right)^2
  \end{eqnarray*}
  and the subtracted term is never negative.
- using /Newton's Method/ we can compute the $\Delta w$ required to take us directly to the minimum of a second-order approximation of the error
  \begin{eqnarray*}
    \Delta E &=& E(w(n+1)) - E(w(n))\\
             &\approx& \frac{\delta E}{\delta w}\Delta w + \frac{1}{2}\frac{\delta^2 E}{\delta w^2}\Delta w^2
  \end{eqnarray*}
  this is smallest where its derivative with respect to $\Delta w$ is zero, i.e. $\frac{\delta E}{\delta w} + \frac{\delta^2 E}{\delta w^2}\Delta w = 0$, so solving for $\Delta w$ we get
  \begin{eqnarray*}
    \Delta w &=& -\frac{\frac{\delta E}{\delta w}}{\frac{\delta^2 E}{\delta w^2}}\\
             &=& -H^{-1}(n)\nabla E(n)
  \end{eqnarray*}
  where H is the /Hessian/ (an NxN matrix of all second partial derivatives of E with respect to an N-length weight vector)

*** training
we have a set of training vectors
| x | d |
|---+---|
|   |   |

for each training vector we can do /gradient descent/ of the weights towards that vector

incremental learning
- linear \phi
  \begin{equation*}
    w(n+1) = w(n) + \eta e(n) x(n)
  \end{equation*}
- non-linear \phi
  \begin{equation*}
    w(n+1) = w(n) + \eta \phi'(v(n)) e(n) x(n)
  \end{equation*}

an /epoch/ is a run through all of our training vectors

after an epoch we can assess our progress as
the overall error
\begin{equation*}
  E(k) = \sum_{n=1}^{N}{e^2(n)}
\end{equation*}
to get our cumulative error on the same scale as our per-vector error we can take the /root mean square/ (RMS) error
\begin{equation*}
  E_{RMS}(k) = \sqrt{\frac{1}{N}\sum_{n=1}^{N}{e^2(n)}}
\end{equation*}

** 2010-09-28 Tue
*** Questions
- HW 2.12 :: what is the question asking? these are two normalized Gaussians, we take the difference of these two Gaussians (Mexican hat). What happens if we translate this along the x axis?
- HW 2.10 :: two sums, write out the expression as the positive sum of the wx's minus the sum of the cy's or something... there are a number of ways this can be expressed
- general :: assume that the internal activation of a network under no input is set to 0
- Eulerian integration :: $\frac{\delta y}{\delta t} = f(y)$, and we know the value of y at $t=0$. we can put this initial value in and use $\frac{\delta y}{\delta t}$ to algebraically compute $\delta y$ given some $\delta t$ (i.e. $\delta y = f(y)\delta t$). We can then just keep doing that.

*** Project
- this weekend we'll get the API code
- the first step is to run a dumb agent that does nothing or provides a random sequence of actions, we then try to beat that
- so how could we use the LMS neuron for the project? we could use a competitive layer of /winner take all neurons/ along our line of sight to select the brightest spot in our field of vision, then turn towards that spot. right before we eat an object we'll have a strong RGB input in our center neuron, we can treat this center neuron as an /LMS neuron/ with a desired output of a positive $\Delta energy$. this could be a simple starting architecture.
- our /experimental setup/ should report both the average length of lifetimes and the standard deviation of this length over a number of trials -- maybe even a /t test/?
- if we want to we can exceed the page limit with an appendix of additional figures
- it would be good to try to compute the upper bound on the possible life-span given some assumptions
- we'll get a tentative outline

*** Perceptrons
- error minimization ::
  - linear \phi
  - minimize $\frac{1}{2}e^2(n)$ where $e(n) = d(n) - y(n)$
- perceptron ::
  - non-linear \phi
  - minimizing another criterion function aside from the squared error

** 2010-09-30 Thu
some current research uses complex numbers to propagate activation with a unit amplitude, but with both frequency and phase

*** perceptron
\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
  \node [neuron] (neuron) at (0,0) {$\Sigma$};
  \node (input) at (-2,0) {$\overrightarrow{x}$};
  \node (weight) at (-0.75,1) {$\overrightarrow{w}$};
  \node (bias) at (0,2) {bias = 1};
  \node (output) at (1,0) {$y$};
  \draw (-1.5,1) -- (neuron); \draw (-1.5,0.5) -- (neuron); \draw (-1.5,0) -- (neuron);
  \draw (-1.5,-0.5) -- (neuron); \draw (-1.5,-1) -- (neuron);
  \draw (bias) -- node[right] {$w_0$} (neuron);
  \draw (neuron) -- node[above] {$\phi$} (output);
\end{tikzpicture}

file:data/perceptron.png

with
\begin{equation*}
  \phi = \left\{
    \begin{array}{rl}
      1 &: v > 0\\
      -1 &: v \leq 0
    \end{array}
  \right.
\end{equation*}

the only other neural network architecture that provably converges is /adaptive resonance/

** 2010-10-05 Tue
*** perceptron (cont)
- treat bias just like any other weight
- $w_0$ is the bias weight, which is updated like any other
- if error then update weights with
  \begin{equation*}
    \bar{w}(n+1) = \bar{w}(n) + \eta(\bar{w}(old)-\bar{x})
  \end{equation*}
  or
  \begin{equation*}
    \bar{w}(n+1) = \bar{w}(n) \pm \eta\bar{x}
  \end{equation*}
  where the sign is + for a misclassified pattern from $c_{+1}$ and - for one from $c_{-1}$, and /if error/ means
  \begin{equation*}
    if \left\{
      \begin{array}{rcl}
        \bar{w}^T\bar{x} > 0 &and& \bar{x} \in c_{-1}\\
        \bar{w}^T\bar{x} \leq 0 &and& \bar{x} \in c_{+1}
      \end{array}
    \right.
\end{equation*}

*** simulation of multilayer perceptrons
for a multilayer feed-forward network we can use a matrix representation of the neurons and their weights; running the neural network then reduces to matrix multiplication. the following computes the activation
\begin{equation*}
  \bar{v}(n+1) = \bar{W}(n) \bar{y}(n)
\end{equation*}
and the output of the entire network
\begin{equation*}
  \bar{y}(n+1) = \Phi(\bar{v}(n+1))
\end{equation*}

*** multilayer perceptrons
- learning /internal weights/, how to update weights which are further back in the network?

** 2010-10-12 Tue
*** back-propagation learning
\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
  \node [neuron] (00) at (0,0) {}; \node [neuron] (01) at (0,1) {}; \node at (0,1.5) {$l_1$};
  \node [neuron] (10) at (1,0) {}; \node [neuron] (11) at (1,1) {}; \node at (1,1.5) {$l_2$};
  \node [neuron] (20) at (2,0) {}; \node [neuron] (21) at (2,1) {}; \node at (2,1.5) {$l_3$};
  \node at (0,-0.5) {$k$}; \node at (1,-0.5) {$j$}; \node at (2,-0.5) {$i$};
  \node (d1) at (3,1) {$d_1$}; \node (d2) at (3,0) {$d_2$};
  \draw (-0.75,0) -- (00); \draw (-0.75,1) -- (01);
  \draw (00) -- node[below] {$w_{jk}$} (10); \draw (00) -- (11); \draw (01) -- (10); \draw (01) -- (11);
  \draw (10) -- node[below] {$w_{ij}$} (20); \draw (10) -- (21); \draw (11) -- (20); \draw (11) -- (21);
  \draw (20) -- (d2); \draw (21) -- (d1);
\end{tikzpicture}

file:data/back-prop.png

- dependencies :: E \leftarrow e \leftarrow y \leftarrow v \leftarrow w
- partial error ::
  \begin{equation*}
    e_i = d_i - y_i
  \end{equation*}
- error ::
  \begin{equation*}
    E(n) = \frac{1}{2}\sum_i{e^2_{i}(n)}
  \end{equation*}
- output layer ::
  \begin{eqnarray*}
    \Delta w_{ij} &=& - \eta \frac{\delta E}{\delta w_{ij}}\\
                  &=& - \eta e_i\frac{\delta e_i}{\delta w_{ij}}\\
                  &=& - \eta e_i \frac{\delta e_i}{\delta y_i}\frac{\delta y_i}{\delta w_{ij}}\\
                  &=& \eta e_i \phi_i' y_j
  \end{eqnarray*}
  where $\phi'_i
= \frac{\delta}{\delta v_i}\phi_i(v_i)$
- local gradient :: how the overall error changes as the activation of a single neuron changes, $\delta_i=-\frac{\delta E}{\delta v_i}$, and hence the change in weights between any two neurons is as follows
  \begin{equation*}
    \Delta w_{ij} = \eta \delta_i y_j
  \end{equation*}
  this is like local Hebbian learning

Now layer by layer
- Output Layer
  \begin{equation*}
    \Delta w_{ij} = - \eta \frac{\delta E}{\delta v_i} y_j = \eta \delta_i y_j
  \end{equation*}
- Hidden Layer
  \begin{equation*}
    \Delta w_{jk} = - \eta \frac{\delta E}{\delta v_j}y_k
  \end{equation*}
  where
  \begin{equation*}
    \frac{\delta E}{\delta v_j} = \sum_i{e_i \frac{\delta e_i}{\delta v_j}}
  \end{equation*}
  and e changes with y, and y changes with v, and $v_i$ changes with $v_j$...
  \begin{eqnarray*}
    \frac{\delta E}{\delta v_j} &=& - \sum_i{e_i \phi'_i \frac{\delta v_i}{\delta y_j} \frac{\delta y_j}{\delta v_j}}\\
                                &=& - \sum_i{e_i \phi'_i w_{ij} \phi'_j}\\
    \Delta w_{jk} &=& \eta y_k \phi'_j \sum_i{e_i \phi'_i w_{ij}}\\
                  &=& \eta \delta_j y_k
  \end{eqnarray*}
  finally we get
  \begin{equation*}
    \delta_j = \phi_j' \sum_i{\delta_i w_{ij}}
  \end{equation*}

Weight changes propagate back through the network; $\delta_j$ depends on the sum of the $\delta_i$'s. The \delta for each neuron need be computed only once.

This is a two-pass algorithm. for each pattern we pass through forward computing the v, y, and \phi' values, saving the y and \phi' values. then on the way back we combine the e values with the saved \phi' values to get the \delta values of the output layer, and work backwards.

** 2010-10-19 Tue
*note*: there may be a mistake in the /summary of back propagation/ section (eq. 4.47) the correct equation is (eq.
4.39)

*** back-propagation review
- forward pass
  - clamp on the inputs
  - compute activations for nodes and their outputs through the network, and store these
  - at the output we compute the errors $e_i(n)$
- backward pass
  - \forall layers
    - compute the \delta's, $\delta_j(n) = \phi'_j\sum_i{\delta_i w_{ij}}$
    - compute the $\Delta w_{jk}(n) = \eta \delta_j y_k$
    - calculate \phi'
    - loop back to the next previous layer and repeat

*** momentum
- without momentum
  \begin{equation*}
    \Delta w_{jk}(n) = \eta \delta_j(n) y_k(n)
  \end{equation*}
- with momentum
  \begin{equation*}
    \Delta w_{jk}(n) = \alpha \Delta w_{jk}(n-1) + \eta \delta_j(n) y_k(n)
  \end{equation*}
  if \alpha and \eta are non-negative and sum to one, then the above is a /convex combination/

** 2010-10-21 Thu
file:data/back-prop.png

*** Back Propagation Learning Algorithm
1) hidden neurons: compute and store on the forward pass
   - $v_j = \sum_k w_{jk}x_k$
   - $y_j=\phi(v_j)$
   - $\phi'(v_j)$
2) output neurons: compute and store on the forward pass
   - $v_i= \sum_j w_{ij}y_j$
   - $y_i = \phi(v_i)$
   - $\phi'(v_i)$
3) then we can compute the error as $e_i = d_i-y_i$
4) output neurons: backwards
   - $\delta_i = e_i\phi'(v_i)$
   - $\Delta w_{ij} = \eta \delta_i y_j$
5) hidden neurons: backwards
   - $\delta_j = \phi'(v_j) \sum_i{\delta_i w_{ij}}$
   - $\Delta w_{jk} = \eta \delta_j y_k$
6) finally do one more pass through the network in which we add all of the $\Delta w$ values to our weights

For back propagation with momentum we save the old $\Delta w$ so that we can use it to calculate our new $\Delta w$.
\begin{equation*}
  \Delta w_{ij}(n) = \alpha \Delta w_{ij}(n-1) + \eta \delta_i(n) y_j(n)
\end{equation*}

*** Training
- 2 inputs and 4 outputs
- for a full pass you can track a /pattern error/ $\frac{1}{2}\sum_i{e^2_{i}}$, however for a more intuitive metric it may be useful to look at the RMS error, which is "of the same size" as the errors themselves, $\sqrt{\frac{1}{|i|}\sum_i{e^2_{i}}}$
- the epoch error could be taken as the RMS error over the entire set $\sqrt{\frac{1}{|epoch||i|}\sum_n\sum_i e_i^{2}(n)}$, we should plot these by epoch
- after each epoch we can turn off learning (backward pass) and compute the errors generated from the /testing/ set of samples, giving us another error (i.e. $error_{testing}$). We should plot both $error_{testing}$ and $error_{training}$ on the same scale (note we should re-run the training data w/o learning). This is called a /generalization plot/.
- the training error should keep decreasing (for a small enough \eta), however it is possible that the testing error could begin to rise if we're over-fitting the training data.
- it would be good to look at both incremental and batch update of the weights (e.g. do or do not update mid-epoch)
- we should also look at how the order of presentation affects the performance of the network (only has an effect when doing incremental weight updates)
- stopping criteria
  - some error threshold
  - testing error starts to increase
  - etc...
- 3 architectures, 3 numbers of nodes, possibly vary \eta, \alpha, and even break some connections or remove some neurons after training to see how the network holds up
- weight initialization is yet another thing we could vary across multiple runs. There could be many local minima which we could land in depending on our starting position (or initial weights).
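
The forward pass, backward pass, momentum update and per-epoch RMS error described above can be sketched as follows. This is only an illustration under stated assumptions: a logistic sigmoid for \phi (so $\phi'(v) = y(1-y)$), one hidden layer, incremental updates, XOR as a stand-in training set with the 0.1/0.9 targets, and hypothetical names (=train=, =n_hidden=, =rms_by_epoch=) and parameter values that do not come from the notes or the assignment.

#+begin_src python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train(patterns, n_hidden=3, eta=0.5, alpha=0.5, epochs=200, seed=0):
    """Incremental back-propagation with momentum; returns the RMS error per epoch."""
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    # w_hid[j][k]: input k -> hidden j; the last column is the bias weight.
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    dw_hid = [[0.0] * (n_in + 1) for _ in range(n_hidden)]   # previous Delta w
    dw_out = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for x, d in patterns:
            # Forward pass: compute and keep the outputs of every layer.
            xb = list(x) + [1.0]
            y_hid = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_hid]
            hb = y_hid + [1.0]
            y_out = [sigmoid(sum(w * h for w, h in zip(row, hb))) for row in w_out]
            # Backward pass: delta_i = e_i phi'(v_i), delta_j = phi'(v_j) sum_i delta_i w_ij.
            e = [di - yi for di, yi in zip(d, y_out)]
            delta_out = [ei * yi * (1.0 - yi) for ei, yi in zip(e, y_out)]
            delta_hid = [
                y_hid[j] * (1.0 - y_hid[j])
                * sum(delta_out[i] * w_out[i][j] for i in range(n_out))
                for j in range(n_hidden)
            ]
            # Delta w(n) = alpha * Delta w(n-1) + eta * delta * y, then apply it.
            for i in range(n_out):
                for j in range(n_hidden + 1):
                    dw_out[i][j] = alpha * dw_out[i][j] + eta * delta_out[i] * hb[j]
                    w_out[i][j] += dw_out[i][j]
            for j in range(n_hidden):
                for k in range(n_in + 1):
                    dw_hid[j][k] = alpha * dw_hid[j][k] + eta * delta_hid[j] * xb[k]
                    w_hid[j][k] += dw_hid[j][k]
            sq_err += sum(ei * ei for ei in e)
        history.append(math.sqrt(sq_err / (len(patterns) * n_out)))
    return history

xor = [([0, 0], [0.1]), ([0, 1], [0.9]), ([1, 0], [0.9]), ([1, 1], [0.1])]
rms_by_epoch = train(xor)
#+end_src

Plotting =rms_by_epoch= against the epoch number gives the training half of a generalization plot; the testing half would come from a second, forward-only pass over held-out patterns after each epoch.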
For a given output neuron (expanding each \phi with a Taylor expansion)
\begin{eqnarray*}
  y_o &=& \phi\left[\sum_i w_{oi}\,\phi\left[\sum_j w_{ij}x_j\right]\right]\\
      &=& \sum_l a_l\left(\sum_i w_{oi}\sum_m a_m\left(\sum_j w_{ij} x_j\right)^m\right)^l
\end{eqnarray*}
So you could have a very large number of local minima. Typically you want to pick your weights in a random distribution centered around 0 -- small weights lead to large values of \phi' and large changes in weight.

** 2010-10-26 Tue
/Bayes error/ is the theoretical best generalization error achievable. So for example in our back-prop assignment, we won't get better (at least on the test data) than the Bayes error, which is around 13-14%. Note this is "classification error", i.e. the percentage misclassified, not RMS error. This could be a good stopping criterion.

*** heuristics
(see /convergence heuristics/ in the text book)

- you can look at the variance of your training data, and get a feel for what the $\sigma_w$ of your weights should be
- for our homework assignment, the most effective solution will be to center our input data on the origin, and force a unit standard deviation on the input -- this will keep us from saturating our neurons, which would reduce their information capacity to binary on/off. Note: this is a moment transformation, subtracting the first moment and dividing by the square root of the second central moment.
- also, maybe set target values to something achievable (e.g. 0.1 and 0.9 instead of the asymptotic 0 and 1)

*** universal approximation theorem
(p.208 in the text)

For certain types of bounded functions over a finite domain \exists a single-hidden-layer feed forward neural network which can approximate that function arbitrarily well (i.e. \forall \epsilon \exists a number n s.t. a network with n neurons in the hidden layer can approximate the function to within \epsilon).
\begin{equation*}
  F(\bar{x}) = \sum_{j=1}^{n}\alpha_j\phi\left(\sum_i w_{ji}x_i+b_j\right)
\end{equation*}
This is like a Fourier series, a weighted sum of basis functions used to approximate an arbitrary function (although the sigmoids, unlike sines and cosines, are not orthonormal).
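
As a small illustration of the form $F(\bar{x})$ above, the sketch below evaluates a single-hidden-layer sum and hand-picks two sigmoid neurons whose difference makes a "bump". The function name, the logistic sigmoid, and the weights (steepness 20, edges at -1 and 1) are assumptions chosen for the example, not values from the text; stacking many such bumps is one intuitive way to see why a wide enough hidden layer can follow an arbitrary bounded function.

#+begin_src python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def single_hidden_layer(x, alphas, weights, biases, phi=sigmoid):
    """F(x) = sum_j alpha_j * phi(sum_i w_ji * x_i + b_j)."""
    return sum(
        a * phi(sum(w * xi for w, xi in zip(w_j, x)) + b)
        for a, w_j, b in zip(alphas, weights, biases)
    )

# Two hidden neurons: sigmoid(20(x + 1)) - sigmoid(20(x - 1)) is close to 1
# for -1 < x < 1 and close to 0 elsewhere.
alphas = [1.0, -1.0]
weights = [[20.0], [20.0]]
biases = [20.0, -20.0]

inside = single_hidden_layer([0.0], alphas, weights, biases)
outside = single_hidden_layer([3.0], alphas, weights, biases)
#+end_src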
*** back propagation to do other things
We could also, for example, take the partial of E with respect to /a/ (the slope of the sigmoid function) of a neuron, and use back-propagation to adjust these values.

We can take partials with respect to our inputs, $\Delta x_i = -\eta \frac{\delta E}{\delta x_i}$, to guess what x would likely give us any particular output /y/. You would need an initial guess of inputs, but for any initial guess back-prop could be used to move from the guess input to another input which is more appropriate for a particular desired output.

** 2010-10-28 Thu
*** finishing off Chapter 4
\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto, scale=0.25]
  \node (a) at (10,12) {};
  \node (b) at (10,8) {};
  \foreach \y in {1,...,20} {
    \node (in-\y) at (0,\y) {};
    \node (out-\y) at (20,\y) {};
    \draw (in-\y) -- (a); \draw (in-\y) -- (b);
    \draw (a) -- (out-\y); \draw (b) -- (out-\y);
  }
\end{tikzpicture}

file:data/compression.png

- hidden layers can be used as feature detectors. if you force a large amount of data through a *small* hidden layer then the data will be /compressed/ through that layer, which will require /discovery/ of structures in the data to achieve the compression. such hidden layers are sometimes called "feature detectors".
- Auto-associative network that maps inputs to identical outputs. This could also be used for compression or encryption. the bottleneck hidden layer could be considered a compressed (and probably unintuitive, encrypted) version of the inputs.
- Introduce /weight sharing/ where each neuron in the hidden layer shares the same weight structure (e.g. Mexican hat). This could be used, for example, to build an apple detector over images: no matter where the apple is present in the original image, the same weight pattern will be present near that part of the image.
- /prediction/: say we have a time series, we can take a series of values as input, and then take a single later value as the desired output. In this way we can train a predictor.
  :                   +--------+
  :      +----------->|   NN   |
  :      |            |        | output
  :      |  input     |        |--------+
  :      |       +--->|        |        |
  :      |       |    +--------+        |
  :      |       |                      |
  :      |       |                      v
  : -----------------------------------time-series-->
  The Prediction Company, formed out of the Santa Fe Institute, does things like this for financial prediction.
- In practice we won't know how to build a network, i.e. how many layers and how many neurons in the layers. We want to limit the complexity of the network. You can add a /penalty/ term s.t. when a weight has too much penalty it is removed.
  - you can start big and cut things out, intermittently removing all small weights from the network; this won't remove neurons or layers but it will simplify the network
  - you can add weights. start with a single neuron, learning with the standard \Delta-rule. whenever an input results in a large error a new neuron is introduced which reduces that error and is connected to every existing neuron. The network is then trained through normal back-propagation. These can work very well.
  - GA, the chromosome is generally the adjacency matrix of the neural network (with a set number of neurons constant across the entire species). This matrix could be linearized out into one long vector. Then simple mutation and any length-preserving method of crossover can be used. More generally any method of graph crossover could be used.

*** radial basis neural networks
- \phi-separability :: a data set is \phi-separable if \exists a function \phi which separates the classes of the set

Cover's Theorem: Any set of data with two classes (dichotomy) is more likely to be linearly separable the higher the dimension of the space in which the data is embedded.

as we non-linearly map our data into a higher dimensional space the linear separability of the data will increase.
Eventually we can just use a single perceptron to learn the data.

We can use /radial basis neurons/ to perform this non-linear mapping. If we take a set of i functions $\phi_i$, then we can use these i functions to map a point in 2 dimensions to a point in 2i dimensions by passing each coordinate through all i functions.

\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto, scale=0.25]
  \node [neuron] (out) at (20,10) {};
  \foreach \x in {2.5,5,...,17.5} {
    \node [neuron] (in-\x) at (0,\x) {};
  }
  \foreach \y in {1,3,...,20} {
    \node [neuron] (phi-\y) at (10,\y) {$\phi_{\y}$};
    \draw (phi-\y) -- (out);
  }
  \foreach \x in {2.5,5,...,17.5} {
    \foreach \y in {1,3,...,20} {
      \draw (in-\x) -- (phi-\y);
    }
  }
\end{tikzpicture}

file:data/radial-basis.pdf

** 2010-11-02 Tue
Three smaller topics we'll be hitting
- radial basis neurons
- support vector machines
- committee machines

*** radial basis neurons
- $H = \{x_1, \ldots, x_n\}$
- Dichotomy = $(H_1, H_2)$
- a set of functions $\bar{\phi}(x)$

A dichotomy is \phi-separable if \exists $\bar{w}$ s.t.
- $\bar{w}^T\bar{\phi}(\bar{x}) > 0$ if $\bar{x} \in H_1$
- $\bar{w}^T\bar{\phi}(\bar{x}) \leq 0$ if $\bar{x} \in H_2$

*** radial basis network for XOR
\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto, scale=1.75]
  \node [neuron] (x1) at (0,0.5) {input $x_1$};
  \node [neuron] (x2) at (0,-0.5) {input $x_2$};
  \node [neuron] (r1) at (1,0.5) {radial 1};
  \node [neuron] (r2) at (1,-0.5) {radial 2};
  \node [neuron] (out) at (2,0) {regular};
  \node (y) at (3,0) {y};
  \draw (x1) -- (r1);
  \draw (x1) -- (r2);
  \draw (x2) -- (r1);
  \draw (x2) -- (r2);
  \draw (r1) -- (out);
  \draw (r2) -- (out);
  \draw (out) -- (y);
\end{tikzpicture}

file:data/radial-basis-network.png

- $\phi_1(x) = e^{-|x-t_{1}|^{2}}$ where $t_1=(1,1)$
- $\phi_2(x) = e^{-|x-t_{2}|^{2}}$ where $t_2=(0,0)$

| x   | \phi_1 | \phi_2 |
|-----+--------+--------|
| 1,1 |   1.0  |  0.14  |
| 0,1 |   0.37 |  0.37  |
| 1,0 |   0.37 |  0.37  |
| 0,0 |   0.14 |  1.0   |

The classes of XOR, which are not linearly separable in the input space, are mapped by \phi_1 and \phi_2 to a new plane in which they are linearly separable.

*** interpolative functions
An interpolative function will pass through all data points (even if it doesn't accurately reflect the behavior of the original function between the given data points).

F is interpolative if $F(x_i)=d_i$ \forall i \in [1..N]

\begin{equation*}
  F(x) = \sum_{i=1}^{N} w_i \phi(||x-x_i||)
\end{equation*}

In matrix notation, with $\phi_{ji} = \phi(||x_j-x_i||)$, this is $\Phi \bar{w} = \bar{d}$ or
\begin{equation*}
  \left[
    \begin{array}{ccc}
      \phi_{11} & \ldots & \phi_{1N}\\
      \vdots & \ddots & \vdots\\
      \phi_{N1} & \ldots & \phi_{NN}
    \end{array}
  \right]
  \left[
    \begin{array}{c}
      w_1\\
      \vdots\\
      w_N
    \end{array}
  \right]
  =
  \left[
    \begin{array}{c}
      d_1\\
      \vdots\\
      d_N
    \end{array}
  \right]
\end{equation*}

Then you can find the weight vector with a single matrix inversion, $\bar{w} = \Phi^{-1}\bar{d}$.

A perceptron allows you to split a space with a hyperplane, whereas a radial basis neuron allows you to split a space with a hypersphere.
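As a minimal sketch of the interpolation condition above (not code from the course), the following builds the matrix \Phi for Gaussian basis functions centered on every data point and solves $\Phi \bar{w} = \bar{d}$ directly. The XOR data, the Gaussian width =sigma=, and the helper name =gaussian_design_matrix= are assumptions made for illustration.

```python
import numpy as np


def gaussian_design_matrix(points, centers, sigma=1.0):
    """Return Phi with Phi[j, i] = exp(-||x_j - t_i||^2 / (2 sigma^2)).

    Hypothetical helper: `sigma` is an assumed common width for all bases.
    """
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


# Illustrative data: the four XOR inputs and their desired outputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
d = np.array([0.0, 1.0, 1.0, 0.0])

# One basis function per data point, so Phi is square (N x N).
Phi = gaussian_design_matrix(X, X)

# Solve Phi w = d; a linear solve is used instead of forming the inverse.
w = np.linalg.solve(Phi, d)

# The interpolant reproduces every desired value at the data points.
F = Phi @ w
print(np.allclose(F, d))
```

With distinct data points the Gaussian matrix is invertible, which is why the direct solve works here; once there are fewer centers than data points \Phi is no longer square and this shortcut is unavailable.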
**** Application
Let's say we want to fit a bunch of data with radial basis neurons using Gaussian functions for our \phi's. If we have 10,000 data points and we only want 10 hidden neurons, then we could just pick 10 points at random to be the centers of our neurons. Random selection works very well because we tend to sample from the dense areas of the data.

Since 10 does not equal 10,000 we can't use simple matrix inversion to learn our weights, because a 10x10,000 matrix is not square.

Instead of picking centers at random we could also use
- clustering algorithms, or
- gradient descent -- the free parameters in this approach are the weights and the centers of the spheres. We could then compute the partials of the error with respect to the centers $\bar{t}$ and the weights $\bar{w}$, and descend on both the centers and weights.

** 2010-11-04 Thu
*** learning on radial basis functions
(RBF learning strategies in the book p.320)

types
1) random centers
2) self organizing centers
3) supervised -- gradient descent centers

Let's assume they're Gaussians for this example
\begin{equation*}
  \phi(|\bar{x}-\bar{t}|) = \exp\left(-\frac{|\bar{x}-\bar{t}|^2}{2\sigma^2}\right)
\end{equation*}

So for a network with a hidden layer of m radial basis neurons, all connected to all inputs and all feeding into a single perceptron, the output is
\begin{equation*}
  y=\sum_{i=0}^{m}w_i\phi(|\bar{x}-\bar{t}_i|)
\end{equation*}

initial values for our Gaussians
1) Under type (1) we set the centers $\bar{t}_i$ to random input points. To pick \sigma we want some overlap between the Gaussians positioned around different centers, but we also want different values for points closer to a particular center. So to approximate a good value for \sigma we can find the maximum distance between any pair of centers, and then set $\sigma^2 \simeq \frac{d_{max}^{2}}{\sqrt{m}}$. Why do we divide this by $\sqrt{m}$ (which is related to the dimensionality of the space)? Not sure, but this is an accepted heuristic.
   Then, once these initial values are selected, we can use gradient descent to learn the values of $w_i$ \forall i. The centers and the \sigma's are fixed, so we only need to update the weights.
2) We select the centers using something like k-means. For k-means we
   1) pick a value of k (this is somewhat unsatisfying; we could instead start with only 1 center, and then add new centers whenever the max center-to-$x_i$ distance is over some preset threshold)
   2) randomly pick k centers (possibly from the data)
   3) loop over the data ($x_i$'s) and associate each $x_i$ with the closest center
   4) move each center towards the center of mass of its associated $x_i$'s by some learning parameter \eta
   5) repeat steps 3 and 4 until the centers stop moving
   6) compute the standard deviation for each center, which is just the standard deviation of the $x_i$'s assigned to that center
3) We can bring in a desired value, compute an error, take the partials of the error with respect to the weights, and then do gradient descent on each of the hidden radial basis neurons. If we think of every radial basis neuron as a function like $\phi(\bar{x},\bar{t}_i,\sigma_i)$ then we can take partials with respect to any of the three arguments to \phi, e.g.
   \begin{equation*}
     \bar{t}_i(new) = \bar{t}_i(old) - \eta \frac{\delta e^2}{\delta \bar{t}_i}
   \end{equation*}
   We're still guessing at the number of hidden nodes, but having made that choice this works very well at moving around the centers of these nodes.

   We can use a covariance matrix to allow \sigma to vary across dimensions. We could then learn these $m_i^{2}$ elements of the covariance matrix using gradient descent.

** 2010-11-16 Tue
*** last time
Hebbian learning in a /winner take all/ system: with a laterally connected output layer, learning can be localized to the winner for any particular input.

*** self organizing maps
/highly/ recommended book \leftarrow should purchase
| Title  | Self Organization and Associative Memory |
| Author | T. Kohonen                               |
| ISBN   | 3-540-18314-0                            |

Also, this is Chapter 9 in the text.

The locations of neurons relative to each other in some space will begin to become important. Imagine that all output neurons are embedded in a two dimensional planar grid. Imagine local inhibition of these output neurons on this 2D grid.

A /self organizing map/ is a neural network which is continuous s.t. \forall \eta \exists \delta s.t. if the outputs of two inputs $x_1$ and $x_2$ are within \delta in the output space (our 2D plane) then $x_1$ and $x_2$ are within \eta in the input space.

typical representation of these 2D output neural networks

\usetikzlibrary{arrows}
\tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
\begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
  \node (input) at (0,4) {$\bar{x}$};
  \foreach \x in {3,2,1} {
    \foreach \y in {1,2,3} {
      \node [neuron] (\x-\y) at (\x,\y) {};
      \draw (input) to (\x-\y);
    }
  }
  % \draw (neuron) -- node[above] {$\Phi$} (output);
\end{tikzpicture}

file:data/self-organizing-map.png

These networks often have alternating /competitive/ and /cooperative/ phases. In normal competitive Hebbian learning only the winner has its weights changed. In the Kohonen self-organizing map, however, there is also a cooperative phase in which the winner shares learning with its neighbors, they share slightly less with their neighbors, etc...

- competitive :: first we find the /winner/ by comparing the input $\bar{x}$ with each weight vector $\bar{w}_j$; the winner is the neuron whose weight vector is closest to the input (for normalized vectors this is the neuron with the largest inner product).
  \begin{equation*}
    J(\bar{x})=\arg\min_{j \in 1,\ldots,n}(||\bar{x} - \bar{w}_j||)
  \end{equation*}
  So the neuron with the weight vector closest to the x vector is the winner. We then move this weight vector slightly (by \eta) closer to the x vector.
- cooperative :: For a generic neuron in the network
  \begin{equation*}
    \bar{w}_j(n+1) = \bar{w}_j(n) + \eta h(J(\bar{x}),j)(\bar{x}(n) - \bar{w}_j(n))
  \end{equation*}
  where the $h$ function needs to return the highest values for those neurons nearest to the winner, with unity value right on the winner and fairly quickly dropping to zero as we move away from the winner. We can specify $h$ as
  \begin{equation*}
    h(J(\bar{x}),j) =
    \left\{
      \begin{array}{rl}
        1 & j=J(\bar{x})\\
        f(||\bar{r}_{J(\bar{x})} - \bar{r}_j||) & j \neq J(\bar{x})
      \end{array}
    \right.
  \end{equation*}
  or actually more like
  \begin{equation*}
    h(J,j) = \exp\left(\frac{-d(J,j)^2}{2\sigma^2}\right)
  \end{equation*}
  where the distance metric is
  \begin{equation*}
    d(J,j) = ||\bar{r}_J - \bar{r}_j||
  \end{equation*}
  and where \sigma is used to control the /spread/ of the weight propagation.

So how do you pick \eta and \sigma? At least initially you will want \sigma to allow the weight updates to touch most weights. Generally you will want to diminish both \eta and \sigma as time progresses.

Pictured in the input space, this /pretty much/ amounts to the following: whenever an input lands in the input space, all weights shift slightly towards that input, with the degree of each weight's movement based upon its closeness to the input (or rather its closeness to the closest weight).

There are some very cool demonstrations and visualizations of these sorts of systems available on-line.

** 2010-11-18 Thu
*** continuous time systems -- Hopfield network
We can turn our Canonical model into a continuous time differential equation, in which case we get
\begin{equation*}
  \frac{\delta v_j(t)}{\delta t} = \sum_i w_{ji}(t) y_i(t)
\end{equation*}

This is similar to an RC circuit in which the input is the battery and the neuron is the capacitor.

Given this equilibrium equation...
\begin{equation*}
  \frac{\delta v_j(t)}{\delta t} = \sum_i w_{ji}(t) y_i(t) - \frac{v_j(t)}{\tau}
\end{equation*}

When we first add input, if the sum term is a positive constant then the derivative will initially be positive (because $v_j$ is initially 0). Eventually the derivative reaches 0, when the sum equals $v_j/\tau$. This would be one time step in our model.

we also have...
\begin{equation*}
  y_j(t) = \phi(v_j(t))
\end{equation*}

These two equations together form a coupled system of non-linear equations.

some points
1) if \phi is linear then it is a coupled system of linear equations and would be more amenable to analysis
2) there is no learning in the above; it gets considerably more complicated when learning is included
   \begin{equation*}
     \frac{\delta w_{ji}(t)}{\delta t} = F(?) - \frac{w_{ji}(t)}{\tau}
   \end{equation*}
   ultimately this is a system of three highly related equations
3) this system of equations can generally be bounded in space, as the weights and activations won't normally go off to \infty

*** dynamical systems aside
\begin{equation*}
  \frac{\delta x_j(t)}{\delta t} = F_j(\bar{x}(t))
\end{equation*}

- autonomous :: there is no explicit time term inside of the main function
- non-autonomous :: the function "F" does depend on time in some way, e.g. $F_j(\bar{x}(t),t)$

Because /non-autonomous/ systems are very difficult we will focus on systems without an explicit time term.

So fixed points in the state space of this system of equations will correspond to states in which the neural network has ceased to learn and has /learned/ some input; this can be thought of as a /memory/.

It is possible to /linearize/ these differential equations in some bounded areas of space.

** 2010-11-23 Tue
after today we'll be doing adaptive resonance

*** basins of attraction
To study the stability of our "memory" attractors we'll be using Lyapunov's Theory (direct method) to determine if these attractors exist.
Lyapunov's Theory: assume \exists $V(x)$ which is
1) continuous
2) $V(x^*)=0$ where $x^*$ is an attractor
3) $V(x)>0$ when $x \neq x^*$

In the range of the attractor (in its basin) $\frac{\delta V(x)}{\delta t}<0$; this defines the region of the attractor (i.e. where $\frac{\delta V(x)}{\delta t}<0$ is true).

*** neural dynamics
the application of dynamic systems analysis to neural networks

- additive model
  \begin{eqnarray*}
    \frac{\delta v_j(t)}{\delta t} &=& -v_j(t)+\sum_{i=1}^{n}w_{ji}(t)\phi(v_i(t))+I_j\\
    y_j(t) &=& \phi(v_j(t))
  \end{eqnarray*}
- alternate additive model
  \begin{equation*}
    \frac{\delta y_j(t)}{\delta t} = -y_j(t)+\phi\left(\sum_{i=1}^{n} w_{ji}y_i(t)\right)
  \end{equation*}

*** Hopfield Network and Stability
\begin{equation*}
  \frac{\delta v_j}{\delta t}=-\frac{v_j}{\tau}+\sum_{i=1}^{n} w_{ji} \phi(v_i(t)) + I_j
\end{equation*}
with
- symmetric weights, $w_{ji} = w_{ij}$
- the \phi function is invertible
- and \exists a Lyapunov function
  \begin{equation*}
    V = - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ji}y_jy_i + \sum_{j=1}^{n}\frac{1}{\tau}\int_0^{y_j}\phi_{j}^{-1}(y)dy - \sum_{j=1}^{n}I_jy_j
  \end{equation*}
- $\frac{\delta V}{\delta t} < 0$

** 2010-11-30 Tue
*** Adaptive Resonance Theory ART-1
:       +--------------------+    dipole layer, each neuron is a pair
:    F2 |o o o o o o o o o o | <- connected as a flip-flop, and all
:       +--------------------+    pairs connected as winner-take-all
:
:                                            +----+    gain control
:    Full Bidirectional connectivity         | GC |<- fully connected
:    (top down weights are binary)           +----+    to F1 and F2
:
:       +---------------------------------------------+
:    F1 |o o o o o o o o o o o o o o o o o o o o o o o|
:       +---------------------------------------------+  +---+    vigilance,
:          ^    ^    ^    ^    ^    ^                    | e | <- fully connected
:          |    |    |    |    |    |                    +---+    to F1 and F0
:       +---------------------------------------------+
:    F0 |o o o o o o o o o o o o o o o o o o o o o o o|
:       +---------------------------------------------+
:          ^    ^    ^    ^    ^
:          |    |    |    |    |
:
:    Input I is binary vector

Neurons
- GC :: is on unless it is inhibited; any inputs turn it off
- F0 :: these neurons are binary, on or off depending on their input
- F1 :: these are threshold neurons, which have three inputs (one from F0, one from GC and one from each neuron in F2 which is initially only 1); each requires at least 2 inputs to turn on
- F2 :: each neuron is a flip-flop pair of neurons, with the whole layer connected into a winner-take-all network. Let $T_i$ be the weights up from F1 to neuron i in F2; then the downward weights of i, $B_i$, are always a binary scaling of $T_i$
- e :: the vigilance neuron compares the activation of F0 with the activation of F1, to see if the top-down weights (defining the center of the active neuron in F2) are /close enough/ to the input. We can calculate this with
  \begin{equation*}
    \frac{||T_1||}{||I||} \geq \rho
  \end{equation*}
  where $T_1$ is the activation pattern of F1 (what remains of the input after matching against the top-down weights) and I is the input vector. /e/ turns on when this ratio is less than \rho. When /e/ turns on it activates the flip-flop neuron of the active neuron in F2, which latches that neuron off. When the effects of F2 turning off propagate through the network and the original input activation propagates back up to F2, a new neuron is recruited to activate in response to the input.
Let's look at the behavior of this system
1) initially only GC is turned on
2) let's apply a pattern
3) the F0 neurons receiving 1 inputs will activate
4) the F1 neurons related to the active F0 neurons will activate (because they now have two inputs)
5) the active F1 neurons then push activation to the recruited neurons in F2, and the winner turns on
6) when the winner turns on it begins inhibiting GC and activating back from F2 to F1 (according to its top-down weight vector)
7) without GC on, all neurons in F1 which aren't activated by the active neuron in F2 turn off; those neurons in F1 which remain on are now the intersection of those in the input pattern and those with top-down weights from the active neuron in F2

This is like /leader clustering/, where we test for /closest/ and /close-enough/. The /closest/ part of this is controlled by the winner-take-all formation in F2; the vigilance neuron /e/ is responsible for the /close-enough/ test.

We then learn the weights between F1 and F2 with Hebbian learning. All weights are held between 1 and 0. As the activation /resonates/, those neurons in F1 which /resonate/ with the active neuron in F2 will have their weights to and from F2 tend to 1, and all others have their weights tend to 0.

This is similar to /template matching/ in the pattern matching literature.

** 2010-12-02 Thu
*** Adaptive Resonance Theory ART (continued)
:      k=1             nf2
:   +---------------------------+
: F2|   ------------------      | \
:   |                           |  |
:   |            T              |  |
:   |             k             |  |
:   |                           |  |
: F1| -----------------------   |  |
:   |                           |\ /
:   |                           | e
: F0| -----------------------   |/
:   |                           |
:   +---------------------------+

- closest :: How do we compute the closest neuron in F2? We use the activation normalized by the number of up weights (the size of the template)
  \begin{equation*}
    k = \arg\max_{k=1,\ldots,n_{f2}}\left\{\frac{||I \wedge T_k||}{\beta + ||T_k||}\right\}
  \end{equation*}
  \beta has the effect of breaking ties in favor of the larger template; \beta is typically set to $\frac{1}{n_{f0}}$.
- close enough :: How do we decide if this is close enough?
  \begin{equation*}
    \frac{||I \wedge T_k||}{||I||} \ge \rho
  \end{equation*}
  If the above is not true, then we reset and add a new neuron to F2.

  If it is true then we /learn/ and update $T_k$ using the /gated Hebbian/ rule we discussed last time.

\uparrow the above is *the entirety* of the ART algorithm \uparrow

some properties:
- this can all be described by a set of coupled differential equations
- the order of presentation of training points affects the learned structure

*** limits on the maximum required number of epochs
We can prove this using /complement coding/ -- meaning \forall inputs I we concatenate I and its complement $I^C$ to get $(I,I^C)$. The number of 1's in this concatenation will always be equal to the number of bits in I.

This will converge (number of nodes and values of weights) in 1 epoch.

*** how to use real-valued inputs
thermometer code
- normalize your inputs to between 0 and 1
- then multiply each normalized value by $n_{f0}$ and represent it by that many consecutive 1's, padding the rest with zeros

This real-valued case with unary encoding can be /very/ easily visualized geometrically: every neuron in F2 becomes a box with as many dimensions as there are real values in the input.

*** fuzzy-ART
- we replace \wedge with a /fuzzy-\wedge/ which is just min; it returns the min of its inputs (which don't have to be 0 or 1 but can be any real value between 0 and 1).
- similarly the complement becomes $1-x$, so the complement of (0.1,0.3) becomes (0.9,0.7)

*** ART-2
Not very popular. This is the case where the F2 row is not winner-take-all but rather can have multiple neurons turn on.
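Returning to the ART-1 procedure above (closest, close enough, learn), here is a minimal sketch of that loop, not the lecture's own implementation. It assumes fast learning (on resonance the template becomes $I \wedge T_k$), assumes that after a reset the next-best F2 neuron is tried before a new one is recruited, and uses hypothetical names (=art1_cluster=, =rho=, =beta=).

```python
import numpy as np


def art1_cluster(inputs, rho=0.7, beta=None):
    """Cluster binary vectors with a simplified ART-1 style loop.

    Hypothetical helper: returns (templates, labels), where labels[i] is
    the index of the F2 template that resonated with inputs[i].
    """
    inputs = np.asarray(inputs, dtype=bool)
    if beta is None:
        beta = 1.0 / inputs.shape[1]  # the notes suggest beta = 1 / n_f0
    templates, labels = [], []
    for I in inputs:
        norm_I = I.sum()
        # "closest": rank templates by |I ^ T_k| / (beta + |T_k|)
        scores = [np.sum(I & T) / (beta + T.sum()) for T in templates]
        chosen = None
        for k in np.argsort(scores)[::-1]:
            # "close enough": vigilance test |I ^ T_k| / |I| >= rho
            if norm_I > 0 and np.sum(I & templates[k]) / norm_I >= rho:
                chosen = int(k)
                break
        if chosen is None:
            # every candidate was reset: recruit a new F2 neuron
            templates.append(I.copy())
            chosen = len(templates) - 1
        else:
            # assumed fast learning: keep only the resonating bits
            templates[chosen] = I & templates[chosen]
        labels.append(chosen)
    return templates, labels


patterns = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0]]
templates, labels = art1_cluster(patterns, rho=0.7)
print(labels)  # [0, 0, 1, 2]
```

Lowering =rho= makes the close-enough test more permissive, so the last pattern joins the first template instead of recruiting a new neuron; presenting the patterns in a different order can change which templates form.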
*** ART-Map for supervised learning
:             x
:            / \
:           /   \
:          x     x     Association Matrix
:           \   /
:            \ /
:             X
:
:
:  +----------+   +----------+
:  |          |   |          |
:  |    A     |   |    B     |
:  |          |   |          |
:  |          |   |          |
:  +----------+   +----------+
:       x              d

/x/ and /d/ are the inputs and the desired values. These train A and B, which are both ART-1 architectures, and their outputs go into an association matrix in which the related outputs are associated.

* reading
** 1st Chapter
*** Benefits of Neural Networks
- nonlinearity :: in the weighting between neurons, each neuron is non-linear as is their sum
- input-output mapping :: ideal for supervised learning
- adaptable :: easily re-weighted to adapt to a changing environment; however, it shouldn't change too fast in response to fleeting disturbances. This is called the /stability-plasticity dilemma/
- evidential response :: can return /confidence/ along with its classifications
- contextual information :: each neuron can potentially affect each other neuron, allowing for natural spread of contextual information
- fault tolerant :: naturally distributed, so if any single neuron is damaged the global behavior is impaired but not drastically affected; /robust/, /graceful degradation/
- VLSI :: massively parallel, take advantage of parallel hardware
- uniform :: regardless of problem domain the same structure (interconnected neurons) is used, which allows for modularity and composability
- biological analogy :: as an analog of a biological system, many good ideas can be borrowed from nature

*** Types of activation functions
:PROPERTIES:
:CUSTOM_ID: reading-activation-functions
:END:
(see notes-activation-functions)

- Threshold ::
     \begin{equation}
       \phi(v) =
       \left\{
         \begin{array}{lcl}
           1 &\text{if}& v \geq 0\\
           0 &\text{if}& v < 0
         \end{array}
       \right.
     \end{equation}
- Piecewise Linear ::
     \begin{equation}
       \phi(v) =
       \left\{
         \begin{array}{lcl}
           1 &\text{if}& v \geq +\frac{1}{2}\\
           v &\text{if}& +\frac{1}{2} > v > -\frac{1}{2}\\
           0 &\text{if}& v \leq -\frac{1}{2}
         \end{array}
       \right.
     \end{equation}
- Sigmoid ::
     \begin{equation}
       \phi(v) = \frac{1}{1+\exp(-av)}
     \end{equation}
- Stochastic ::
     \begin{equation}
       \phi(v) =
       \left\{
         \begin{array}{lcl}
           +1 &\text{with probability}& P(v)\\
           -1 &\text{with probability}& 1-P(v)
         \end{array}
       \right.
     \end{equation}
     where
     \begin{equation}
       P(v) = \frac{1}{1+\exp(-v/\tau)}
     \end{equation}
     where \tau is a /pseudo temperature/ used to control the amount of noise in the system

*** Network Architectures
- single-layer feed-forward :: single layer of input nodes which connect to a single layer of output nodes, acyclic
- multi-layer feed-forward :: like the above but with /hidden layers/ between the input and output layers; these can be particularly useful when the size of the input layer is large. These are /fully connected/ when every node on a layer is connected to every node on the adjacent layers
- recurrent networks :: neural networks which have at least one feedback loop, which will typically involve /unit-delay elements/. Feedback loops have a significant effect on the behavior of the network

*** Knowledge Representation
It is difficult to talk about /knowledge/ being represented inside of a neural network, as any such knowledge will be stored implicitly in the structure of the network. As such, any /knowledge/ which a neural network may contain about its /world/ (inputs and outputs) is directed toward acting in the world.

Knowledge Rules
1) similar inputs from similar classes should result in similar (e.g.
   by Euclidean distance of a vector of neuron activations) internal states and should thus be classified similarly
2) opposite of (1): items in different categories should be given different internal representations
3) important features should be allotted a large number of neurons in the network
4) prior information and invariance should be built into the design of the network
   - biologically plausible
   - smaller size
   - limits the search space of the network
   - faster information transmission through the network
   - cheaper to build

*** How to Build Prior Information and Invariance into a Network
No general rules.

two ad-hoc rules for prior information
1) Restricting network architecture through use of local connections known as /receptive fields/
2) Constraining the choice of weights through /weight-sharing/

invariance, e.g. same object but viewed from different angles, same voice but spoken loudly or softly

methods of entraining invariance into a network
1) invariance by structure
2) invariance by training
3) invariant feature space, assuming \exists features which are constant across all variants of the same input

** 2nd Chapter
Learning
1) neural network is /stimulated/
2) it /changes/
3) it responds /differently/ because of the change

*** learning paradigms
**** Error-correction Learning
- directed to a particular neuron
- the output of that neuron is compared to the desired output
- the difference, or /error/, is used to re-weight the inputs to the neuron
- this process continues until the neural net hits some steady state (this is implemented with /back-propagation/)

**** Memory-based Learning
- each input is associated with an output
- when a new unknown input is seen it is classified as the same value as the nearest (Euclidean) known input

**** Hebbian Learning
associative learning
1) if the neurons on either side of a synapse are activated simultaneously then the strength of the synapse is increased
2) if the neurons are activated asynchronously then the synapse is
   weakened or eliminated

this sort of learning is
- time-dependent
- local
- interactive (depends on both sides of the synapse)
- conjunctional or correlational

There is strong physiological evidence that associative /Hebbian/ learning takes place in the brain.

**** Competitive Learning
- each input neuron is attached to every output neuron
- random weights on all of these attachments
- for each set of inputs the output neuron with the highest activation is the /winner/, and all of its active input connections are strengthened
- this process continues until the weights stabilize

With k output neurons this performs similarly to k-means clustering, in which each output neuron is finally associated with a cluster of similar inputs.

**** Boltzmann Learning
The neural network or /Boltzmann machine/ is characterized by an energy function.
\begin{equation}
  E = -\frac{1}{2}\sum_j\sum_{k \neq j}w_{kj}x_kx_j
\end{equation}

The machine operates by selecting a neuron at random during the learning process and flipping its output with probability
\begin{equation}
  P(x_k \rightarrow -x_k) = \frac{1}{1 + \exp(-\Delta E_k/\tau)}
\end{equation}

two running conditions
- /clamped condition/ in which the visible neurons (attached to the environment) can't be changed
- /free-running condition/ in which all neurons can be flipped

let
- $p_{kj}^{+}$ denote the correlation between neurons j and k in the clamped condition
- $p_{kj}^{-}$ denote the correlation between neurons j and k in the free-running condition

then the change in weight is
\begin{equation}
  \Delta w_{kj} = \eta(p_{kj}^{+} - p_{kj}^{-}), j \neq k
\end{equation}

*** credit assignment problem
How to assign credit and blame to inner portions of a neural network. This is sometimes complicated by temporal delay, requiring assignment of credit to past actions/events.

*** Learning without a Teacher
these first two require a critic

digraph fsm {
  environment -> critic [label="primary reinforcement"];
  environment -> critic [label="state"];
  environment -> "neural network" [label="state"];
  critic -> "neural network" [label="heuristic reinforcement"];
  "neural network" -> environment [label="actions"];
}

file:data/critic.png

- in /reinforcement learning/ the system matches input to output to maximize some scalar performance measure.
- in /delayed reinforcement/ learning the system attempts to minimize a /cost-to-go/ function, which is the cost of a sequence of actions taken over some sequence of inputs. It learns from the results of actions. Related to /dynamic programming/.

digraph fsm {
  environment -> "neural network";
}

file:data/wo-critic.png

An example of unsupervised learning: a two-layer neural network in which the first layer is the /input/ layer and the second the /competitive/ layer, s.t. the neurons in the competitive layer compete with each other to respond to an input.

*** learning tasks
**** pattern association
- associative memory :: distributed memory which learns by /association/
- auto-association :: (unsupervised) a set of patterns is repeatedly presented to the network, and it tries to /store/ them s.t. when presented with a noisy version of a pattern it returns the original pattern
- hetero-association :: (supervised) an arbitrary set of input patterns is paired with another arbitrary set of output patterns

**** pattern recognition
assignments of /inputs/ to /classes/

digraph fsm {
  "input pattern" -> "unsupervised network for\lfeature extraction";
  "unsupervised network for\lfeature extraction" -> "supervised network for\lclassification" [label="feature vector"];
}

file:data/pattern-association-design.png

**** function approximation
Attempt to match some function $d = f(x)$ where x is the input vector and d is the desired output. Supervised learning can be used to train the network to match $f$.

* questions
- in implementations, is a neuron normally allowed to reference itself?
- what state is normally contained in a neuron?