#+TITLE: Neural Networks
#+OPTIONS: toc:2 num:nil ^:t TeX:t LaTeX:t
#+STARTUP: hideblocks

* meta
| Prof.  | Thomas P. Caudell                                                   |
| Office | ECE Rm 235D                                                         |
| Text   | "Neural Networks: a comprehensive foundation" Haykin Second Edition |

All homework should be submitted electronically
- book file:nn-a-comprehensive-foundation.pdf

* class notes
** 2010-08-24 Tue
#+begin_src latex  :file data/canonical-model.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
    \node [neuron] (neuron) at (0,0)     {$\Sigma$};
    \node          (input)  at (-2,0)    {$\overrightarrow{x}$};
    \node          (weight) at (-0.75,1) {$\overrightarrow{w}$};
    \node          (output) at (1,0)     {$y$};

    \draw (-1.5,1)    -- (neuron);
    \draw (-1.5,0.5)  -- (neuron);
    \draw (-1.5,0)    -- (neuron);
    \draw (-1.5,-0.5) -- (neuron);
    \draw (-1.5,-1)   -- (neuron);
    \draw (neuron)    -- node[above] {$\Phi$} (output);
  \end{tikzpicture}
#+end_src

Canonical model: /embodied/, it always exists with inputs and outputs
- input vector $\overrightarrow{x}$
- many incoming axons with weights $\overrightarrow{w}$
- \Sigma of inputs $v = \overrightarrow{w}^{T}\overrightarrow{x}$
- \Phi some non-linear post-processing of the sum
  - pure linear identity
  - binary on/off
- output $y$
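
A minimal sketch of the canonical model, assuming plain Python lists for
the input and weight vectors.  The names =neuron=, =identity= and
=binary= are hypothetical, chosen only for this illustration.

#+begin_src python
def neuron(w, x, phi):
    """Canonical neuron: weighted sum of the inputs followed by an activation."""
    v = sum(wi * xi for wi, xi in zip(w, x))  # v = w^T x
    return phi(v)

def identity(v):
    """Pure linear identity activation."""
    return v

def binary(v):
    """Binary on/off activation (threshold at 0 is an assumption)."""
    return 1 if v > 0 else 0

w = [0.5, -1.0, 2.0]
x = [1.0, 1.0, 1.0]
linear_output = neuron(w, x, identity)  # 0.5 - 1.0 + 2.0
binary_output = neuron(w, x, binary)
#+end_src

Swapping the =phi= argument is all it takes to move between the
activation choices listed above.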

- single neuron
- levels of neurons
- trees of neurons
- arbitrary graphs (cycles) -- this introduces /time/ or /memory/ into
  the system

** 2010-08-26 Thu
NN work was initially derived from efforts to understand the brain.  Today
we'll talk about biological models of the brain, basically all the
stuff we'll be throwing out.

- neuron doctrine :: brain composed of /discrete/ cells, not one
     contiguous tissue (emerged late 1800s)
     : 0>----- 0>------ 0>------

two kinds of cells in the brain
- neurons :: $10^{10}$ neurons in the brain -- many, many more at birth
- glial :: ~100 times as many glial cells as neurons
     - interface between body and neurons /blood brain interface/
     - clean up the waste produced by the neurons
     - provide scaffolding/structure on which neurons grow

Three parts of neuron
:     dendrites                   soma                  axons
:     ---------                   ----                  -----
:      inputs                   computer               outputs
:  -----------\
:              \            /------------------\
:              |            |    soma          |
:  ------------+            |    50um          |  length 200microns -> 2m
:              |            |                  |>-------------------------
:              +------------+                  | \             ^
:              |            |                  |  \            |
: -------------+            \------------------/   \     myelin coating
:              |                                    \
: -------------/                                     axon hillock

general types of neurons
- unipolar :: dendrite and axon are connected to each other (no computing)
- bipolar :: dendritic tree and a single axon (as shown above) --
     e.g. sensing, some dendrites in the eye actually sense light, some in
     the skin actually detect mechanical pressure, etc.
- multipolar :: many bushy dendritic trees, and a single axon --
     e.g. in the spine and related to motor control
- pyramidal :: conical body with multiple potentially long dendritic
     trees out the point of the cone, and one branching axon coming
     out the base -- in the cerebral cortex, used for higher order cognition
- purkinje cell :: very well organized comb-like dendrites, can have
     100s of thousands of inputs, used in motor control, in the cerebellum

- electrolytes consist of sodium, calcium, chlorine, and potassium
  ions (not electrons), which flow down axons as charge
- neurons have an internal negative charge on the order of 60-70 mV
- the neuron slowly accumulates positive charge until the /axon hillock/
  fires, sends the charge down the axon, and resets the neuron to a resting
  charge -- this happens on a period of ~1ms
- pulse is /regenerated/ on the way down the axon ensuring that the
  height and the width of the pulse is maintained from the beginning
  to the end of the axon -- these are like voltage-dependent valves
  and pumps along the axon
- myelin is mainly fat, so it looks white, so /white matter/ in the
  brain is mainly connections, and /grey matter/ in the brain has more
  neuronal bodies
- signals travel along an axon w/o myelin at ~10 meters per second;
  glial cells can wrap the axon in myelin, which insulates
  portions of the axon s.t. those portions are /skipped/ by the
  traveling spike, resulting in speeds of up to ~100 meters per
  second
- max clock-speed of a neuron is ~1 kilo-hertz
- a typical neuron could have on the order of 10,000 synapses

reaction time
- for, say, braking in your car it can be ~1/2 second, that's like 5-10
  serial steps of neurons, plus the flow down the spine to the motor

:                       /-----------------
:                       |   Soma,
:                 cleft |   dendrite,
:                 ~30nm |   axon,
:  ------------\        |   or even another
:   synaptic   |        |   synaptic bulb
:   bulb       |        |
:              | chem   |
:   transmits  | signal |
:   electric   | -----> |
:   sig to     |        |
:   chem sig   |        |
:  ------------/        |
:                       |
:                       |
:                       \-----------------

learning takes place largely at the synapse, by changing how much
chemical is released per electric pulse.

** 2010-08-31 Tue
#+begin_src latex :file data/neurons.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[-),>=stealth', shorten >=1pt, auto]
    \node          (input)   at (-2,0)    {$\overrightarrow{x}$};
    \node [neuron] (neuron1) at (0,0)     {$1$};
    \node [neuron] (neuron2) at (1,0)     {$2$};

    \draw (-1.5,1)    -- (neuron1);
    \draw (-1.5,0.5)  -- (neuron1);
    \draw (-1.5,0)    -- (neuron1);
    \draw (-1.5,-0.5) -- (neuron1);
    \draw (-1.5,-1)   -- (neuron1);
    \draw (neuron1)   -- (neuron2);
    \draw (neuron2)   -- (1.75,0);
  \end{tikzpicture}
#+end_src

Input of charge along dendrites
- soma is constantly leaking charge
- each incoming impulse jumps up the charge in the soma
- inputs arriving at different distances down the dendrites will take
  different amounts of time to arrive
- complex spatio-temporal integration

Hebb's rule
- if a neuron's firing is correlated with the firing of a synapse on
  the neuron, then the strength of the synapse's effect on the neuron
  will be increased
  - (1) above is /pre-synaptic/ neuron
  - (2) above is /post-synaptic/ neuron
  - when they fire together the synapse increases in strength due to the
    sympathetic electrical and chemical processes
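
A rough sketch of this rule, assuming the simplest product form in which
a weight grows by =eta * x * y= whenever pre-synaptic activity =x= and
post-synaptic activity =y= coincide.  The function name =hebb_update=
and the learning rate =eta= are assumptions for illustration, not
something fixed by the lecture.

#+begin_src python
def hebb_update(w, x, y, eta=0.1):
    """Strengthen each weight in proportion to correlated pre/post activity."""
    return [wi + eta * xi * y for wi, xi in zip(w, x)]

w = [0.0, 0.0]
# input 0 fires together with the post-synaptic neuron, input 1 stays silent
for _ in range(5):
    w = hebb_update(w, x=[1.0, 0.0], y=1.0)
#+end_src

Only the synapse whose firing coincides with the post-synaptic firing
is strengthened; the silent synapse keeps its weight.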

The cerebral cortex flattens out to ~2 sq feet and is ~1mm thick; this is
the darker /gray matter/ (no myelin).  Under this sheet are bundles of
connections between areas of the cortex (more myelin), the /white matter/.

** 2010-09-02 Thu
*** bias
in many cases learning cannot take place without an initial bias,
where does this initial bias come from?
#+begin_src latex  :file data/bias.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
    \node          (bias)   at (0,1)     {bias};
    \node [neuron] (neuron) at (0,0)     {$\Sigma$};
    \node          (input)  at (-2,0)    {$\overrightarrow{x}$};
    \node          (weight) at (-0.75,1) {$\overrightarrow{w}$};
    \node          (output) at (1.5,0)     {$y$};

    \draw (bias)      -- (neuron);
    \draw (-1.5,1)    -- (neuron);
    \draw (-1.5,0.5)  -- (neuron);
    \draw (-1.5,0)    -- (neuron);
    \draw (-1.5,-0.5) -- (neuron);
    \draw (-1.5,-1)   -- (neuron);
    \draw (neuron)    -- node[above] {$\Phi$} (output);
  \end{tikzpicture}
#+end_src
\begin{equation}
  v = \sum_{i=1}^{n} w_i x_i + w_0
\end{equation}

/bias/ can be considered (and implemented as) an axon with constant
input or a separate parameter to \Sigma

*** \Pi neuron
#+begin_src latex  :file data/pi-neuron.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
    \node [neuron] (neuron) at (0,0)     {$\Pi$};
    \node          (input)  at (-2,0)    {$\overrightarrow{x}$};
    \node          (weight) at (-0.75,1) {$\overrightarrow{w}$};
    \node          (output) at (1.5,0)     {$y$};

    \draw (-1.5,1)    -- (neuron);
    \draw (-1.5,0.5)  -- (neuron);
    \draw (-1.5,0)    -- (neuron);
    \draw (-1.5,-0.5) -- (neuron);
    \draw (-1.5,-1)   -- (neuron);
    \draw (neuron)    -- node[above] {$\Phi$} (output);
  \end{tikzpicture}
#+end_src
Unlike a \Sigma neuron, a \Pi neuron takes the product of its inputs rather
than the sum.

*** adapting activation functions (\phi) weights and structures
    :PROPERTIES:
    :CUSTOM_ID: notes-activation-functions
    :END:
(see reading-activation-functions)

activation functions (\phi) in increasing complexity -- these will all be
monotonically non-decreasing (biological plausibility)
- constant
- linear
- piecewise linear
- sigmoid $\frac{1}{1+e^{-a(v+b)}}$
  - equals 1 at +\infty
  - equals 0 at -\infty
  - where /a/ controls the slope and /b/ controls the value at the origin
- hyperbolic tangent, a sigmoid rescaled and shifted down so it equals -1 at -\infty
- there is one which is the /most/ biologically plausible
  \begin{equation}
    \phi(v) = \frac{v^{2}}{1+v^{2}}
  \end{equation}
- stochastic activation functions
  - in one type the neuron fires with some probability
  - in the other type the output /itself/ is a probability, however
    this is problematic because there is no way for a set of neurons
    to normalize their outputs (/probabilistic neural networks/)

we like these to be
- monotonically non-decreasing
- bounded
- continuously differentiable
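
The candidate activation functions listed above can be sketched as
follows.  This is only an illustration: the function names are
hypothetical, and the particular piecewise-linear form (a ramp clamped
to [0, 1]) is an assumption, since the notes do not pin one down.

#+begin_src python
import math

def sigmoid(v, a=1.0, b=0.0):
    """Logistic sigmoid 1 / (1 + exp(-a (v + b)))."""
    return 1.0 / (1.0 + math.exp(-a * (v + b)))

def tanh_from_sigmoid(v):
    """Hyperbolic tangent written as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * v) - 1.0

def rational(v):
    """The v^2 / (1 + v^2) form from the list above."""
    return v * v / (1.0 + v * v)

def piecewise_linear(v):
    """Assumed ramp: 0 below -1/2, 1 above +1/2, linear in between."""
    return max(0.0, min(1.0, v + 0.5))

samples = [sigmoid(v) for v in (-2.0, 0.0, 2.0)]  # bounded and non-decreasing
#+end_src

Note that =sigmoid= as written will overflow for very large negative
arguments; a production version would need to guard against that.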

*** radial basis neurons
Rather than a sum or product of inputs, these take the difference
between each input and its weight.  So the smaller the difference
between the input vector $\bar{x}$ and the weight vector $\bar{w}$ the
more active the neuron.

\begin{equation}
  v = \|\bar{w}-\bar{x}\|_{2}
\end{equation}
\begin{equation}
  y = e^{-v^{2}}
\end{equation}
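
A small sketch of a radial basis neuron, assuming /v/ is the Euclidean
distance between the weight and input vectors and the output is
$e^{-v^{2}}$.  Under that reading the output peaks at 1 when the input
matches the weights exactly and falls off with distance.  The name
=rbf_neuron= is hypothetical.

#+begin_src python
import math

def rbf_neuron(w, x):
    """Gaussian response to the distance between weights and input."""
    v = math.sqrt(sum((wi - xi) ** 2 for wi, xi in zip(w, x)))  # Euclidean distance
    return math.exp(-v ** 2)

at_centre = rbf_neuron([1.0, 2.0], [1.0, 2.0])   # input equals the weights
nearby = rbf_neuron([0.0, 0.0], [1.0, 0.0])
far_away = rbf_neuron([0.0, 0.0], [3.0, 4.0])
#+end_src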

*** architectures
- input layer $l_0$
- single layer has input and 1 processing layer ($l_1$)

recurrent system
- like a single-layer feed forward, but all output neurons are
  connected (/lateral/ recurrent system)
- or with outputs from some $l_i$ going back into some $l_j$ s.t. $j<i$

recurrent systems can have very weird behavior, in terminology of
control systems they are /non-linear/ (see /non-linear dynamical
systems/)

*** invariants
a variety of inputs can come from the same object (distance,
orientation, etc...), the network needs to learn only the invariant
information

** 2010-09-07 Tue
The textbook is written by someone from a signal-processing
background; he includes many flow-diagrams, and we can ignore them if
we like.

Today we'll finish Chapt. 1
- chapter 2 is learning
- chapter 3 is the first /neural network/ chapter (single layer
  perceptrons)
- chapter 4 multi-layer perceptrons

*** Structure
#+begin_src latex  :file data/knowledge-rep.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{off} = [circle, draw]
  \tikzstyle{on}  = [circle, draw, fill=blue!40]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
    \node [off] (01) at (0,0) {};
    \node [off] (02) at (0,1) {};
    \node [off] (03) at (0,2) {};
    \node [off] (04) at (0,3) {};
    \node [off] (05) at (0,4) {};
    \node       (x)  at (-1,2) {$\bar{x}$};

    \node [off] (11) at (1,1.5) {};
    \node [on]  (12) at (1,2.5) {};
    \node [off] (13) at (1,3.5) {};

    \node [on]  (21) at (2,0) {};
    \node [on]  (22) at (2,1) {};
    \node [off] (23) at (2,2) {};
    \node [on]  (24) at (2,3) {};

    \draw (01) -- (11);
    \draw (02) -- (12);
    \draw (02) -- (13);
    \draw (03) -- (11);
    \draw (03) -- (12);
    \draw (04) -- (13);
    \draw (05) -- (13);

    \draw (11) -- (21);
    \draw (12) -- (22);
    \draw (13) -- (23);
    \draw (12) -- (24);
    \draw (11) -- (24);
  \end{tikzpicture}
#+end_src
#+Caption: Activation (blue) of a network given an input $\bar{x}$

How do you represent knowledge in a neural network?  As some pattern of
activation in the network.

- /distance/, \forall $\bar{x}$ \exists some $\bar{y}$ which is the /activation/
  of the network due to the input.  Each $\bar{y}$ can be thought of
  as a point in an /n/ dimensional space (/n/ neurons in the network).
  We can take the distance between these points as a measure
  of their similarity.
  - Manhattan distance is $L_1$
    \begin{equation}
      d_{kj} = \sum_{l=1}^{n}{|y_{kl} - y_{jl}|}
    \end{equation}
  - Euclidean distance is $L_2$
    \begin{equation}
      d_{kj} = |\bar{y}_k - \bar{y}_j| = \left(\sum_{l=1}^{n}{(y_{kl} - y_{jl})}^{2}\right)^{\frac{1}{2}}
    \end{equation}
  - another interesting one is $L_\infty$
  - also dot product of the vectors is interesting
  - other metrics could be /statistical/, the mean of the activated
    values or something
  - edit or hamming distance

    A /metric/ or /distance/ must be
    - positive :: $d \geq 0$
    - triangular :: $d_{12}+d_{23} \geq d_{13}$
    - symmetric :: $d_{12} = d_{21}$

- an /important/ input should stimulate /more/ neurons

- /prior/ information can be built into the structure of the network
  or the pre-processor of the network

- three ways to handle /invariants/ -- it's very possible for your
  system to lock onto the wrong invariant if you're not careful in
  selection of test data
  1) structure
  2) training
  3) feature space transformation (preprocessing) e.g. we might
     calculate the /moments/ of each image in a series of images
     - moments :: the following are examples of moments, something like
          $\frac{y(x)}{m_0}$ for input images could be used to control
          for the overall brightness of the system, these can also be
          used for translation (e.g. $x'=(x-m_1)$), rotation,
          etc... assuming there's only one item of interest in the
          image
       - $m_0 = \int_{-\infty}^{\infty} y(x) dx$ (area under the curve)
       - $m_1 = \int_{-\infty}^{\infty} y(x)x dx$ (expected value of x or mean)
       - $m_n = \int_{-\infty}^{\infty} y(x)x^{n} dx$

  standard geometrical invariants
  - translation -- could add another layer that /or/'s together a
    bunch of inputs from different locations
  - rotation
  - scale -- ratios are invariant over different scales
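
As a sketch of the moment-based preprocessing mentioned above, the
following approximates the moments of a sampled 1-D signal with a
Riemann sum and uses them to remove overall brightness and translation.
The names =moment= and =normalize= are hypothetical, and the centre is
taken as $m_1/m_0$ (which reduces to $m_1$ once the signal has unit
area); both are assumptions made for this illustration.

#+begin_src python
def moment(xs, ys, n, dx):
    """Riemann-sum approximation of m_n = integral of y(x) x^n dx."""
    return sum(y * x ** n for x, y in zip(xs, ys)) * dx

def normalize(xs, ys, dx):
    """Divide out brightness (m_0) and shift the mean position to 0."""
    m0 = moment(xs, ys, 0, dx)
    m1 = moment(xs, ys, 1, dx)
    centre = m1 / m0
    return [x - centre for x in xs], [y / m0 for y in ys]

xs = [float(i) for i in range(10)]
bump = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 0, 3, 6, 3, 0, 0]   # moved right by 3 and 3x brighter

bump_x, bump_y = normalize(xs, bump, 1.0)
shifted_x, shifted_y = normalize(xs, shifted, 1.0)
#+end_src

After normalization the two signals have the same shape at the same
shifted coordinates, so a network fed the normalized data would not
have to learn brightness or translation invariance itself.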

** 2010-09-09 Thu
*** chapter 2
#+begin_src latex  :file data/turner.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
    \node [neuron] (neuron) at (0,0)       {$\Sigma$};
    \node          (input1) at (-1.5,0.5)  {$x_{1}$};
    \node          (input2) at (-1.5,-0.5) {$x_{2}$};
    \node          (output) at (1,0)       {$d$};

    \draw (input1)  -- node[above] {$w_1$} (neuron);
    \draw (input2)  -- node[below] {$w_2$} (neuron);
    \draw (neuron)  -- node[above] {$\Phi$} (output);
  \end{tikzpicture}
#+end_src

Consider a neuron w/2 inputs $x_1$ and $x_2$ and for each combination we
have a desired output value $d$.

table of desired behavior
| x1  | x2 | d | y |
| ... |    |   |   |

In this example let the error be $e = d - y$, so that
$\frac{1}{2}e^{2}=\frac{1}{2}(d-y)^{2}$; the $\frac{1}{2}$ is there for
the kinetic energy analogy.

Kinetic Energy
\begin{equation}
  e = \frac{1}{2}mv^{2}
\end{equation}

So for the above, let's make our neuron linear, and have our desired
value be either positive or negative.

We want to minimize the error using a /gradient descent algorithm/.
\begin{equation}
  w_1(n+1) = w_1(n)-\eta\frac{\partial e^{2}(n)}{\partial w_1}
\end{equation}
where \eta provides a scaling factor to convert between units of error
and units of weight.  So what's our derivative?
\begin{equation}
  \frac{\partial e^{2}}{\partial w_1} = -2e(n)x_1(n)
\end{equation}
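
A minimal sketch of this gradient descent step for the two-input linear
neuron.  It assumes the error is $e = d - y$, so that stepping against
the gradient of $e^{2}$ adds $2 \eta e x_i$ to each weight.  The name
=lms_step=, the training table, and the value of =eta= are made up for
the example.

#+begin_src python
def lms_step(w, x, d, eta):
    """One gradient descent step on e^2 for a linear neuron."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # linear neuron output
    e = d - y
    # d(e^2)/dw_i = -2 e x_i, so descending the gradient adds 2 eta e x_i
    return [wi + 2 * eta * e * xi for wi, xi in zip(w, x)], e

# hypothetical table of desired behavior: d simply copies x2
samples = [([1.0, 1.0], 1.0), ([1.0, -1.0], -1.0),
           ([-1.0, 1.0], 1.0), ([-1.0, -1.0], -1.0)]
w = [0.0, 0.0]
for _ in range(50):
    for x, d in samples:
        w, e = lms_step(w, x, d, eta=0.05)
#+end_src

With this table the weights settle near (0, 1), i.e. the neuron learns
to ignore $x_1$ and pass $x_2$ through.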

**** learning algorithms
1) supervised vs. unsupervised
2) local vs. global
3) statistical vs. deterministic
4) memorization vs. generalization
5) fast vs. slow, meaning the rate of weight change per experience,
   fast learning typically involves a lot of forgetting (see
   /stability plasticity dilemma/)

** 2010-09-21 Tue
on paper

** 2010-09-23 Thu
*** optimization
- ECE506 is entirely dedicated to optimization

we have some function, and we want to find the minimum
#+begin_src gnuplot :file data/opt-func.png :exports none :results silent
  set xrange[-5:5]
  set yrange[0:25]
  plot x**2+5, '-' w p ls 2
  3 14
  e
#+end_src

- we can do /gradient descent/ with
  \begin{equation}
    \nabla E = \frac{\partial E}{\partial w_1}\hat{w}_1 + \frac{\partial E}{\partial w_2}\hat{w}_2 + \frac{\partial E}{\partial w_3}\hat{w}_3
  \end{equation}

  to update our weights with
  \begin{equation}
    \bar{w}(n+1) = \bar{w}(n) - \eta\nabla E
  \end{equation}

  Using this we can prove that the error will not increase (for small
  enough \eta).  Considering the single-dimension case with a Taylor
  expansion
  \begin{eqnarray*}
    E(w(n+1)) &=& E(w(n))+\frac{\partial E}{\partial w} (w(n+1)-w(n)) + \ldots\\
              &\approx& E(w(n)) - \eta \left(\frac{\partial E}{\partial w}\frac{\partial E}{\partial w}\right)
  \end{eqnarray*}

- using /Newton's Method/ we can compute the $\Delta w$ required to take us
  directly to the minimum of the quadratic approximation
  \begin{eqnarray*}
    \Delta E &=& E(w(n+1)) - E(w(n))\\
        &=& \frac{\partial E}{\partial w}\Delta w + \frac{1}{2}\frac{\partial^{2} E}{\partial w^{2}}\Delta w^{2}
  \end{eqnarray*}
  so setting the derivative of $\Delta E$ with respect to $\Delta w$ to
  zero and solving for $\Delta w$ we can get
  \begin{eqnarray*}
    \Delta w &=& -\frac{\frac{\partial E}{\partial w}}{\frac{\partial^{2} E}{\partial w^{2}}}\\
        &=& -H^{-1}(n)\nabla E(n)
  \end{eqnarray*}
  where H is the /Hessian/ (an NxN matrix of all second partial
  derivatives of E with respect to an N-length weight vector)
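
To make the comparison concrete, here is a sketch of both methods on
the one-dimensional error surface $E(w) = w^{2} + 5$ (the function
plotted above), starting from $w = 3$.  The helper names and the step
size are assumptions for the example.

#+begin_src python
def E(w):
    return w ** 2 + 5

def dE(w):
    """First derivative of E."""
    return 2 * w

def d2E(w):
    """Second derivative of E (the 1x1 Hessian)."""
    return 2.0

def gradient_descent(w, eta, steps):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        w = w - eta * dE(w)
    return w

def newton_step(w):
    """Single Newton update: delta w = -E'(w) / E''(w)."""
    return w - dE(w) / d2E(w)

after_one_gd_step = gradient_descent(3.0, eta=0.1, steps=1)
after_newton = newton_step(3.0)
#+end_src

Because this surface is exactly quadratic, a single Newton step lands
on the minimum, while gradient descent only creeps towards it at a rate
set by =eta=.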

*** training
we have a set of training vectors
| x | d |
|   |   |

for each training vector we can do /gradient descent/ of the weights
towards that vector

incremental learning
- linear \phi
  \begin{equation}
    w(n+1) = w(n) + \eta e(n) x(n)
  \end{equation}
- non-linear \phi
  \begin{equation}
    w(n+1) = w(n) + \eta \phi'(v(n)) e(n) x(n)
  \end{equation}

an /epoch/ is a run through all of our training vectors

after an epoch we can assess our progress as the overall error
\begin{equation}
  E(k) = \sum_{n=1}^{N}{e^{2}(n)}
\end{equation}
to get our cumulative error in the same scale as our per-vector error
we can take the /root mean square/ (RMS) error
\begin{equation}
  E_{RMS}(k) = \sqrt{\frac{1}{N}\sum_{n=1}^{N}{e^{2}(n)}}
\end{equation}
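
The following sketch runs incremental learning with a linear \phi over
a small made-up training set and records the RMS error after each pass
through the data.  =train_epoch=, the data, and the learning rate are
all hypothetical choices for illustration.

#+begin_src python
import math

def train_epoch(w, data, eta):
    """One pass of incremental learning; returns new weights and RMS error."""
    squared = 0.0
    for x, d in data:
        e = d - sum(wi * xi for wi, xi in zip(w, x))     # error for this vector
        w = [wi + eta * e * xi for wi, xi in zip(w, x)]  # w(n+1) = w(n) + eta e x
        squared += e * e
    return w, math.sqrt(squared / len(data))

# training vectors generated by the weights (2, -1)
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = [0.0, 0.0]
history = []
for k in range(100):
    w, rms = train_epoch(w, data, eta=0.1)
    history.append(rms)
#+end_src

Plotting =history= against the pass number gives the learning curve;
here the RMS error shrinks towards zero because the data are exactly
realizable by a linear neuron.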

** 2010-09-28 Tue
*** Questions
- HW 2.12 :: what is the question asking?  these are two normalized
     Gaussians, we take the difference of these two Gaussians (Mexican
     hat).  What happens if we translate this along the x axis?
- HW 2.10 :: two sums, write out the expression as the positive sum of
     the wx's minus the sum of the cy's or something... there are a
     number of ways this can be expressed
- general :: assume that the internal activation of a network under no
     input is set to 0
- Eulerian integration :: $\frac{dy}{dt} = f(y)$, and we know the
     value of y at t=0.  we can put this initial value in and use
     $\frac{dy}{dt}$ to algebraically compute $\delta y$ given some
     $\delta t$.  We can then just keep doing that.
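
A short sketch of the Eulerian integration described above, assuming a
fixed time step.  The function name =euler= and the example system (a
simple decay, $f(y) = -y$) are chosen only for illustration.

#+begin_src python
import math

def euler(f, y0, dt, steps):
    """Integrate dy/dt = f(y) forward from y0 with fixed time step dt."""
    y = y0
    ys = [y]
    for _ in range(steps):
        y = y + f(y) * dt      # delta y = f(y) * delta t
        ys.append(y)
    return ys

# decay dy/dt = -y, whose exact solution is y(t) = y0 * exp(-t)
ys = euler(lambda y: -y, y0=1.0, dt=0.01, steps=100)
exact = math.exp(-1.0)
#+end_src

The final value approximates the exact solution at t=1; shrinking =dt=
(and taking more steps) tightens the approximation.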

*** Project
- this weekend we'll get the API code

- the first step is to run a dumb agent that does nothing or provides
  a random sequence of actions, we then try to beat that

- so how could we use the LMS neuron for the project?  we could use a
  competitive layer of /winner take all neurons/ along our line of
  sight to select the brightest spot in our field of vision, then turn
  towards that spot.

  right before we eat an object we'll have a strong RGB input in our
  center neuron, we can treat this center neuron as an /LMS neuron/
  with a desired output of a positive $\Delta energy$.

  this could be a simple starting architecture.

- our /experimental setup/ should report both average length of
  lifetimes and standard deviations on this length over a number of
  trials -- maybe even a /t test/?

- if we want to we can exceed the page limit with an appendix of
  additional figures

- would be good to try to compute the upper bound on the possible
  life-span given some assumptions

- we'll get a tentative outline

*** Perceptrons
- error minimization ::
  - linear \phi
  - minimize $\frac{1}{2}e^{2}(n)$ where $e(n) = d(n) - y(n)$
- perceptron ::
  - non-linear \phi
  - minimizing another criterion function aside from the squared error

** 2010-09-30 Thu
some current research uses complex numbers to propagate activation
with a unit amplitude, but with both frequency and phase

*** perceptron
#+begin_src latex  :file data/perceptron.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
    \node [neuron] (neuron) at (0,0)     {$\Sigma$};
    \node          (input)  at (-2,0)    {$\overrightarrow{x}$};
    \node          (weight) at (-0.75,1) {$\overrightarrow{w}$};
    \node          (bias)   at (0,2)     {bias = 1};
    \node          (output) at (1,0)     {$y$};

    \draw (-1.5,1)    -- (neuron);
    \draw (-1.5,0.5)  -- (neuron);
    \draw (-1.5,0)    -- (neuron);
    \draw (-1.5,-0.5) -- (neuron);
    \draw (-1.5,-1)   -- (neuron);
    \draw (bias)      -- node[right] {$w_0$} (neuron);
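    \draw (neuron)    -- node[above] {$\phi$} (output);
  \end{tikzpicture}
#+end_src
\begin{equation}
  \phi(v) = \left\{
    \begin{array}{rl}
       1 &: v > 0\\
      -1 &: v \leq 0
    \end{array}
  \right.
\end{equation}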

the only other neural network architecture that provably converges is
/adaptive resonance/

** 2010-10-05 Tue
*** perceptron (cont)
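- treat bias just like any other weight
- $w_0$ is the bias weight, which is updated like any other
- if error then update weights with
  \begin{equation}
    \bar{w}(n+1) = \bar{w}(n) + \eta(\bar{w}(old)-\bar{x})
  \end{equation}
  \begin{equation}
    \bar{w}(n+1) = \bar{w}(n) \pm \eta\bar{x}
  \end{equation}
  (subtracting in the first case below and adding in the second)
  where /if error/ means
  \begin{equation}
    \mbox{if} \left\{
      \begin{array}{lcl}
        \bar{w}^{T}\bar{x} > 0 &\mbox{and}& \bar{x} \in c_{-1}\\
        \bar{w}^{T}\bar{x} \leq 0 &\mbox{and}& \bar{x} \in c_{+1}
      \end{array}
    \right.
  \end{equation}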
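
A sketch of perceptron learning with the bias treated as weight $w_0$
on a constant input of 1.  It assumes targets of +1 and -1, the
threshold activation defined earlier, and an update of $\pm \eta
\bar{x}$ only when a pattern is misclassified.  The function names, the
training set (logical AND), and =eta= are hypothetical.

#+begin_src python
def phi(v):
    """Threshold activation: +1 if v > 0, otherwise -1."""
    return 1 if v > 0 else -1

def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Perceptron learning; returns weights with w[0] as the bias weight."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            xb = [1.0] + list(x)                      # constant bias input
            y = phi(sum(wi * xi for wi, xi in zip(w, xb)))
            if y != d:                                # update only on error
                w = [wi + eta * d * xi for wi, xi in zip(w, xb)]
                errors += 1
        if errors == 0:                               # every pattern correct
            break
    return w

# logical AND with targets in {-1, +1} is linearly separable
samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(samples)
predictions = [phi(w[0] + w[1] * x[0] + w[2] * x[1]) for x, _ in samples]
#+end_src

Multiplying the input by the target =d= gives the plus or minus sign of
the update in a single expression.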

*** simulation of multilayer perceptrons
for a multilayer feed-forward network we can use a matrix
representation of the neurons and their weights, then the running of
the neural network could be reduced to matrix multiplication.

the following computes the activation
\begin{equation}
  \bar{v}(n+1) = \bar{W}(n) \bar{y}(n)
\end{equation}
and the output of the entire network
\begin{equation}
  \bar{y}(n+1) = \Phi(\bar{v}(n+1))
\end{equation}
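
A sketch of this matrix view of a feed-forward pass, using plain nested
lists so it stays self-contained.  The weights, layer sizes, and the
choice of =math.tanh= for \Phi are assumptions for the example.

#+begin_src python
import math

def matvec(W, y):
    """Multiply weight matrix W (list of rows) by vector y."""
    return [sum(wij * yj for wij, yj in zip(row, y)) for row in W]

def forward(layers, x, phi=math.tanh):
    """Run the input through each layer: v = W y, then y = phi(v)."""
    y = list(x)
    for W in layers:                 # one weight matrix per layer
        v = matvec(W, y)             # activations of the next layer
        y = [phi(vi) for vi in v]    # outputs of the next layer
    return y

layers = [
    [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]],   # 2 inputs -> 3 hidden neurons
    [[1.0, -1.0, 0.5]],                      # 3 hidden -> 1 output neuron
]
out = forward(layers, [1.0, 0.0])
#+end_src

Each row of a matrix holds the incoming weights of one neuron, so
adding a neuron is adding a row and adding a layer is appending a
matrix.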

*** multilayer perceptrons
- learning /internal weights/, how to update weights which are further
  back in the network?

** 2010-10-12 Tue
*** back-propagation learning
#+begin_src latex  :file data/back-prop.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
    \node [neuron] (00) at (0,0)     {};
    \node [neuron] (01) at (0,1)     {};
    \node at (0,1.5) {$l_1$};
    \node [neuron] (10) at (1,0)     {};
    \node [neuron] (11) at (1,1)     {};
    \node at (1,1.5) {$l_2$};
    \node [neuron] (20) at (2,0)     {};
    \node [neuron] (21) at (2,1)     {};
    \node at (2,1.5) {$l_3$};

    \node at (0,-0.5) {$k$};
    \node at (1,-0.5) {$j$};
    \node at (2,-0.5) {$i$};

    \node (d1) at (3,1) {$d_1$};
    \node (d2) at (3,0) {$d_2$};

    \draw (-0.75,0) -- (00);
    \draw (-0.75,1) -- (01);
    \draw (00) -- node[below] {$w_{jk}$} (10);
    \draw (00) -- (11);
    \draw (01) -- (10);
    \draw (01) -- (11);
    \draw (10) -- node[below] {$w_{ij}$} (20);
    \draw (10) -- (21);
    \draw (11) -- (20);
    \draw (11) -- (21);
    \draw (20) -- (d2);
    \draw (21) -- (d1);
  \end{tikzpicture}
#+end_src

- dependencies :: E \leftarrow e \leftarrow y \leftarrow v \leftarrow w
- partial error ::
     \begin{equation}
       e_i = d_i - y_i
     \end{equation}
- error ::
     \begin{equation}
       E(n) = \frac{1}{2}\sum_i{e^{2}_{i}(n)}
     \end{equation}
- output layer ::
     \begin{eqnarray*}
       \Delta w_{ij} &=& - \eta \frac{\partial E}{\partial w_{ij}}\\
             &=& - \eta e_i\frac{\partial e_i}{\partial w_{ij}}\\
             &=& - \eta e_i \frac{\partial e_i}{\partial y_i}\frac{\partial y_i}{\partial w_{ij}}\\
             &=& \eta e_i \phi_i' y_j
     \end{eqnarray*}
     where $\phi'_i = \frac{\partial}{\partial v_i}\phi_i(v_i)$
- local gradient :: how the overall error changes as the activation of
     a single neuron changes $\delta_i=-\frac{\partial E}{\partial v_i}$ and hence the
     change in weights between any two neurons is as follows
     \begin{equation}
       \Delta w_{ij} = \eta \delta_i y_j
     \end{equation}
     this is like local Hebbian learning

Now layer by layer
- Output Layer
  \begin{equation}
    \Delta w_{ij} = - \eta \frac{\partial E}{\partial v_i} y_j = \eta \delta_i y_j
  \end{equation}
- Hidden Layer
  \begin{eqnarray*}
    \Delta w_{jk} &=& - \eta \frac{\partial E}{\partial v_j}y_k\\
    \frac{\partial E}{\partial v_j} &=& \sum_i{e_i \frac{\partial e_i}{\partial v_j}}
  \end{eqnarray*}
  and e changes with v, and y changes with v, and $v_i$ changes with $y_j$
  \begin{eqnarray*}
    \frac{\partial E}{\partial v_j} &=& - \sum_i{e_i \phi'_i \frac{\partial v_i}{\partial y_j} \frac{\partial y_j}{\partial v_j}}\\
          &=& - \sum_i{e_i \phi'_i w_{ij} \phi'_j}\\
    \Delta w_{jk} &=& \eta y_k \phi'_j \sum_i{e_i \phi'_i w_{ij}}\\
          &=& \eta \delta_j y_k
  \end{eqnarray*}
  finally we get
  \begin{equation}
    \delta_j = \phi_j' \sum_i{\delta_i w_{ij}}
  \end{equation}

Weight changes propagate back through the network, the $\delta_j$ is dependent
on the sum of the $\delta_i$'s.  The $\delta$ for each neuron need be computed only
once.

This is a two-pass algorithm.  For each pattern we pass through
forward computing the v, y, and the \phi' values, saving the y and \phi'
values.  Then on the way back we use the e and \phi' values to get
the \delta values of the output layer, and work backwards.

** 2010-10-19 Tue
*note*: there may be a mistake in the /summary of back propagation/
        section (eq. 4.47) the correct equation is (eq. 4.39)

*** back-propagation review
- forward pass
  - clamp on the inputs
  - compute activations for nodes and their outputs through the
    network, and store these
  - at the output we compute the errors $e_i(n)$
- backward pass
  - \forall layers
    - compute the \delta's, $\delta_j(n) = \phi'_j\sum_i{\delta_i w_{ij}}$
    - compute the $\Delta w_{jk}(n) = \eta \delta_j y_k$
    - calculate \phi'
    - move back to the previous layer and repeat

*** momentum
- without momentum
  \begin{equation}
    \Delta w_{ij}(n) = \eta \delta_i(n) y_j(n)
  \end{equation}
- with momentum
  \begin{equation}
    \Delta w_{ij}(n) = \alpha \Delta w_{ij}(n-1) + \eta \delta_i(n) y_j(n)
  \end{equation}

  if \alpha and \eta sum to one, then the above is a /convex combination/

** 2010-10-21 Thu
*** Back Propagation Learning Algorithm
1) hidden neurons: compute and store on the forward pass
   - $v_j = \sum_k w_{jk}x_k$
   - $y_j=\phi(v_j)$
   - $\phi'(v_j)$
2) output neurons: compute and store on the forward pass
   - $v_i= \sum_j w_{ij}y_j$
   - $y_i = \phi(v_i)$
   - $\phi'(v_i)$
3) then can compute the error as $e_i = d_i-y_i$
4) output neuron: backwards
   - $\delta_i = e_i\phi'(v_i)$
   - $\Delta w_{ij} = \eta \delta_i y_j$
5) hidden neurons: backwards
   - $\delta_j = \phi'(v_j) \sum_i{\delta_i w_{ij}}$
   - $\Delta w_{jk} = \eta \delta_j y_k$ (with $y_k = x_k$ at the input layer)
6) finally do one more forward pass through the network in which we
   add all of the $\Delta w$ values to our weights

For back propagation with momentum we save the old $\Delta w$ so that we
can use it to calculate our new $\Delta w$.
\begin{equation}
  \Delta w_{ij}(n) = \alpha \Delta w_{ij}(n-1) + \eta \delta_i(n) y_j(n)
\end{equation}
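
The steps above can be sketched for a tiny network with one hidden
layer and a single output neuron.  This is an illustration under stated
assumptions, not the assignment's reference code: the logistic sigmoid
for \phi, a bias input of 1 on every layer, incremental (per-pattern)
updates, the logical OR training set, and the hand-picked starting
weights are all choices made for the example.

#+begin_src python
import math

def phi(v):
    """Logistic sigmoid; its derivative is y (1 - y)."""
    return 1.0 / (1.0 + math.exp(-v))

def train(patterns, W_hid, W_out, eta=0.5, alpha=0.5, epochs=2000):
    """Incremental back-propagation with momentum; returns weights and RMS history."""
    dW_hid = [[0.0] * len(row) for row in W_hid]   # previous Delta w (momentum)
    dW_out = [0.0] * len(W_out)
    history = []
    for _ in range(epochs):
        squared = 0.0
        for x, d in patterns:
            # forward pass: compute and store the outputs of every layer
            yk = [1.0] + list(x)                                   # bias + inputs
            yj = [1.0] + [phi(sum(w * y for w, y in zip(row, yk))) for row in W_hid]
            yi = phi(sum(w * y for w, y in zip(W_out, yj)))
            e = d - yi
            squared += e * e
            # backward pass: local gradients, output first, then hidden
            delta_i = e * yi * (1.0 - yi)
            delta_j = [yj[j + 1] * (1.0 - yj[j + 1]) * delta_i * W_out[j + 1]
                       for j in range(len(W_hid))]
            # Delta w(n) = alpha * Delta w(n-1) + eta * delta * y
            dW_out = [alpha * dw + eta * delta_i * y for dw, y in zip(dW_out, yj)]
            dW_hid = [[alpha * dw + eta * dj * y for dw, y in zip(row, yk)]
                      for row, dj in zip(dW_hid, delta_j)]
            W_out = [w + dw for w, dw in zip(W_out, dW_out)]
            W_hid = [[w + dw for w, dw in zip(row, drow)]
                     for row, drow in zip(W_hid, dW_hid)]
        history.append(math.sqrt(squared / len(patterns)))
    return W_hid, W_out, history

patterns = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 1.0)]  # logical OR
W_hid = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]   # small hand-picked starting weights
W_out = [0.1, -0.3, 0.2]
W_hid, W_out, history = train(patterns, W_hid, W_out)
#+end_src

Note that the hidden \delta values are computed from the output weights
/before/ those weights are changed, which is why the sketch builds both
sets of $\Delta w$ first and applies them afterwards.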

*** Training
- 2 inputs and 4 outputs
- for a full pass you can track a /pattern error/
  $\frac{1}{2}\sum_i{e^{2}_{i}}$, however for a more intuitive metric it may
  be useful to look at the RMS error which is "of the same size" as
  the errors themselves $\sqrt{\frac{1}{|i|}\sum_i{e^{2}_{i}}}$
- the epoch error could be taken as the RMS error over the entire set
  $\sqrt{\frac{1}{|epoch||i|}\sum_n\sum_i e_i^{2}(n)}$, we should plot these by epoch
- after each epoch we can turn off learning (backward pass) and compute
  the errors generated from the /testing/ set of samples giving us
  another error (i.e. $error_{testing}$).  We should plot both $error_{testing}$
  and $error_{training}$ on the same scale (note we should re-run the
  training data w/o learning).  This is called a /generalization
- the training error *will* monotonically decrease, however it is
  possible that the testing error could begin to rise if we're
  over-fitting the training data.
- would be good to look at both incremental and batch update of the
  weights (e.g. do or do not update mid-epoch)
- we should also look at how the order of presentation affects the
  performance of the network (only has an effect when doing
  incremental weight updates)
- stopping criteria
  - some error threshold
  - testing error starts to increase
  - etc...
- 3 architectures, 3 numbers of nodes, possibly to vary \eta, \alpha, and even
  breaking some connections or removing some neurons after training to
  see how the network holds up
- weight initialization is yet another thing we could vary across
  multiple runs.  There could be many local minima which we could land
  in depending on our starting position (or initial weights).  For a
  given output neuron (expanded using a Taylor expansion)
  \begin{eqnarray*}
    y_o &=& \phi\left[\sum_i w_{oi}\phi\left[\sum_j w_{ij}x_j\right]\right]\\
        &=& \sum_l a_l\left(\sum_i w_{oi}\sum_m a_m\left(\sum_j w_{ij} x_j\right)^{m}\right)^{l}
  \end{eqnarray*}
  So you could have a very large number of local minima.  Typically
  you want to pick your weights in a random distribution centered
  around 0 -- small weights lead to large values of \phi' and large
  changes in weight.

** 2010-10-26 Tue
/Bayes error/ is the theoretical best generalization error achievable.
So for example in our back-prop assignment, we won't get better (at
least on the test data) than the Bayes error which is around 13-14%.
Note this is "classification error" (percent misclassified), not RMS error.
This could be a good stopping criterion.

*** heuristics
(see /convergence heuristics/ in the text book)
- you can look at the variance of your training data, and get a feel
  for what the $\sigma_w$ of your weights should be
- for our homework assignment, the most effective solution will be to
  center our input data on the origin, and force a unit standard
  deviation on the input -- this will keep us from saturating our
  neurons, which would reduce their information capacity to binary
  on/off. Note: this is a moment transformation, subtracting the first
  moment and dividing by the square root of the second central moment.
- also, maybe set target values to something achievable (e.g. 0.1 and
  0.9 instead of the asymptotic 0 and 1)
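
The input centering and scaling heuristic can be sketched as follows
for a single input column.  The name =standardize= and the sample
values are made up, and the population (divide by n) form of the
standard deviation is an assumption.

#+begin_src python
import math

def standardize(column):
    """Shift a column to zero mean and scale it to unit standard deviation."""
    n = len(column)
    mean = sum(column) / n                                     # first moment
    std = math.sqrt(sum((c - mean) ** 2 for c in column) / n)  # root of second central moment
    return [(c - mean) / std for c in column]

raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(raw)
#+end_src

The same mean and standard deviation computed on the training data
should be reused for the testing data, and a constant column (zero
standard deviation) would need special handling that this sketch
leaves out.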

*** universal approximation theorem
(p.208 in the text)

For certain types of bounded functions over a finite domain \exists a
single-hidden-layer feed forward neural network which can approximate
that function arbitrarily well (i.e. \forall \epsilon \exists a number n s.t. a network
with n neurons in the hidden layer can approximate the function to
within \epsilon).

\begin{equation}
  F(\bar{x}) = \sum_{j=1}^{n}\alpha_j\phi_j\left(\sum_i w_{ji}x_i+b_j\right)
\end{equation}

This is like a Fourier Transform, a sum of orthonormal parts to
approximate an arbitrary function.

*** back propagation to do other things
We could also for example take the partial with respect to /a/ (the
slope of the sigmoid function) of a neuron, and use back-propagation to
adjust these slopes.

We can take partials with respect to our inputs
$\Delta x_i = -\eta \frac{\partial E}{\partial x_i}$ to
guess what x would likely give us any particular output /y/.  You
would need an initial guess of inputs, but for any initial guess
back-prop could be used to move from the guess input to another input
which is more appropriate for a particular desired output.

** 2010-10-28 Thu
*** finishing off Chapter 4
#+begin_src latex  :file data/compression.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto, scale=0.25]
    \node (a) at (10,12) {};
    \node (b) at (10,8) {};
    \foreach \y in {1,...,20}
    {
      \node (in-\y) at (0,\y) {};
      \node (out-\y) at (20,\y) {};
      \draw (in-\y) -- (a);
      \draw (in-\y) -- (b);
      \draw (a) -- (out-\y);
      \draw (b) -- (out-\y);
    }
  \end{tikzpicture}
#+end_src

- hidden layers can be used as feature detectors.  if you force a
  large amount of data through a *small* hidden layer then the data
  will be /compressed/ through that layer which will require
  /discovery/ of structures in the data to achieve the compression.
  such hidden layers are sometimes called "feature detectors".

- Auto-associative network that maps inputs to identical outputs.
  This could also be used for compression or encryption.  the
  bottleneck hidden layer could be considered a compressed (and
  probably unintuitive encryption) of the inputs.

- Introduce /weight sharing/ where each neuron in the hidden layer
  shares the same weight structure (e.g. Mexican hat).  This could be
  used, for example, to build an apple detector over images: no matter
  where the apple is present in the original image the same weight
  pattern will be present near that part of the image.

- /prediction/: say we have a time series, we can take a series of
  values as input, and then take a single later value as the desired
  output.  In this way we can train a predictor.
  :               +--------+
  :  +----------->|   NN   |
  :  |            |        | output
  :  |  input     |        |--------+
  :  |       +--->|        |        |
  :  |       |    +--------+        |
  :  |       |                      |
  :  |       |                      v
  : -----------------------------------time-series-->
  The Prediction Company, formed out of the Santa Fe Institute, does
  things like this for financial prediction.

- In practice we won't know how to build a network, i.e. how many
  layers and how many neurons in the layers.  We want to limit the
  complexity of the network.  You can add a /penalty/ term s.t. when a
  weight has too much penalty it is removed.
  - you can start big and cut things out, intermittently remove all
    small weights from the network, this won't remove neurons or
    layers but it will simplify the network
  - you can add weights.  start with a single neuron, doing learning
    with the standard \Delta-rule.  whenever an input results in a large
    error a new neuron is introduced which reduces that error and is
    connected to every existing neuron.  The network is then trained
    through normal backwards propagation.  These can work very well.
  - GA, the chromosome is generally the adjacency matrix of the neural
    network (with a set number of neurons constant across the entire
    species).  This matrix could be linearized out into one long
    vector.  Then simple mutation and any length-preserving method of
    crossover can be used.  More generally any method of graph
    crossover could be used.
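
For the /prediction/ item above, a small sketch of how a time series
can be cut into (input window, later value) training pairs.  The
function name =make_windows= and the =width= and =horizon= parameters
are illustrative choices, not something from the lecture.

#+begin_src python
def make_windows(series, width, horizon=1):
    """Return (inputs, target) pairs: `width` consecutive values as input,
    and the value `horizon` steps after the window as the desired output."""
    pairs = []
    for start in range(len(series) - width - horizon + 1):
        inputs = series[start:start + width]
        target = series[start + width + horizon - 1]
        pairs.append((inputs, target))
    return pairs

series = [1, 2, 3, 4, 5, 6]
pairs = make_windows(series, width=3)
print(pairs)
#+end_src

Each pair can then be fed to any supervised learner as one training
example.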

*** radial basis neural networks
- \phi-separability :: a data set is \phi-separable if \exists a function \phi which
    separates the classes of the set

Cover's Theorem: Any set of data with two classes (dichotomy) is more
likely to be linearly separable the higher the dimension of the space
in which the data is embedded.

as we non-linearly map our data into a higher dimensional space the
linear separability of the data will increase.  eventually we can just
use a single perceptron to learn the data.

we can use /radial basis neurons/ to perform this non-linear mapping

if we take a set of i functions \phi_i, then we can use these i functions
to map a point in 2 dimensions to a point in 2i dimensions by passing
each coordinate through all i functions.
#+begin_src latex  :file data/radial-basis.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto, scale=0.25]
    \node [neuron] (out) at (20,10) {};
    % input nodes are named by count, since "." should not appear in node names
    \foreach \x [count=\i] in {2.5,5,...,17.5}
      \node [neuron] (in-\i) at (0,\x) {};
    \foreach \y in {1,3,...,20} {
      \node [neuron] (phi-\y) at (10,\y) {$\phi_{\y}$};
      \draw (phi-\y) -- (out);
    }
    \foreach \i in {1,...,7}
    {\foreach \y in {1,3,...,20}
      {\draw (in-\i) -- (phi-\y);}}
  \end{tikzpicture}
#+end_src


** 2010-11-02 Tue
Three smaller topics we'll be hitting
- radial basis neurons
- support vector machines
- committee machines

*** radial basis neurons
- $H = \{x_1 \ldots x_n\}$
- Dichotomy = $(H_1, H_2)$
- a set of functions $\bar{\phi}(x)$

a dichotomy is \phi-separable if \exists $\bar{w}$ s.t.
- $\bar{w}^{T}\bar{\phi}(x) > 0$ if $\bar{x} \in H_1$
- $\bar{w}^{T}\bar{\phi}(x) \leq 0$ if $\bar{x} \in H_2$

*** radial basis network for X-or
#+begin_src latex  :file data/radial-basis-network.pdf :border 1em :packages '(("" "tikz")) :exports none
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto, scale=1.75]
    \node [neuron] (x1) at (0,0.5) {input $x_1$};
    \node [neuron] (x2) at (0,-0.5) {input $x_2$};
    \node [neuron] (r1) at (1,0.5) {radial 1};
    \node [neuron] (r2) at (1,-0.5) {radial 2};
    \node [neuron] (out) at (2,0) {regular};
    \node (y) at (3,0) {y};
    \draw (x1) -- (r1);
    \draw (x1) -- (r2);
    \draw (x2) -- (r1);
    \draw (x2) -- (r2);
    \draw (r1) -- (out);
    \draw (r2) -- (out);
    \draw (out) -- (y);
  \end{tikzpicture}
#+end_src

- $\phi_1 = e^{-|x-t_{1}|^{2}}$ where $t_1=(1,1)$
- $\phi_2 = e^{-|x-t_{2}|^{2}}$ where $t_2=(0,0)$

| x   | \phi_1 | \phi_2 |
|-----+--------+--------|
| 1,1 |    1.0 |   0.13 |
| 0,1 |   0.36 |   0.36 |
| 1,0 |   0.36 |   0.36 |
| 0,0 |   0.13 |    1.0 |

the non-linearly separable classes of Xor are mapped by \phi_1 and \phi_2 to a
new plane in which they are separable
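
A quick check of the mapping, as a sketch: the two Gaussians are
evaluated at the four Xor inputs (the values agree with the table up to
its two-digit truncation), and a single threshold on $\phi_1 + \phi_2$
then separates the classes.  The threshold 0.9 and the names =rbf= and
=classify= are illustrative assumptions.

#+begin_src python
import math

def rbf(x, t):
    """Gaussian radial basis function exp(-|x - t|^2)."""
    return math.exp(-sum((xi - ti) ** 2 for xi, ti in zip(x, t)))

t1, t2 = (1, 1), (0, 0)
xor = {(1, 1): 0, (0, 1): 1, (1, 0): 1, (0, 0): 0}   # input -> Xor label
features = {x: (rbf(x, t1), rbf(x, t2)) for x in xor}

def classify(x, threshold=0.9):
    """Linear decision in feature space: a large phi1 + phi2 means class 0."""
    phi1, phi2 = features[x]
    return 0 if phi1 + phi2 > threshold else 1

for x, (phi1, phi2) in features.items():
    print(x, round(phi1, 4), round(phi2, 4), classify(x))
#+end_src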

*** interpolative functions
An interpolative function will pass through all data points (even if
it doesn't accurately reflect the behavior of the original function
between the given data points).

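F is interpolative if $F(x_i)=d_i$ \forall $i \in [1..N]$
\begin{equation}
  F(x) = \Sigma_{i=1}^{N}w_i\phi_i(x-x_i)
\end{equation}

In matrix notation $\Phi \bar{w} = \bar{d}$, with $\phi_{ji} = \phi_i(x_j-x_i)$, or
\begin{equation}
  \begin{pmatrix}
    \phi_{11} & \ldots & \phi_{1N}\\
    \vdots & \ddots & \vdots\\
    \phi_{N1} & \ldots & \phi_{NN}
  \end{pmatrix}
  \bar{w} = \bar{d}
\end{equation}

Then you can find the weight vector with a single matrix inversion,
$\bar{w} = \Phi^{-1}\bar{d}$.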
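
A small sketch of exact interpolation, assuming Gaussian basis
functions with $\sigma = 1$ centered on each of four made-up 1-D data
points.  Rather than forming the inverse explicitly, the square system
is solved by Gaussian elimination, which gives the same weights; =solve=
and =F= are illustrative names.

#+begin_src python
import math

def gaussian(r, sigma=1.0):
    return math.exp(-r * r / (2.0 * sigma * sigma))

def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w

xs = [0.0, 1.0, 2.0, 3.0]       # data points (also the centers)
ds = [0.0, 1.0, 0.0, -1.0]      # desired values
phi = [[gaussian(abs(xj - xi)) for xi in xs] for xj in xs]
w = solve(phi, ds)

def F(x):
    return sum(wi * gaussian(abs(x - xi)) for wi, xi in zip(w, xs))

print([round(F(x), 6) for x in xs])
#+end_src

The fitted function passes through every data point; nothing is
guaranteed about its behavior between them.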

A perceptron allows you to split a space with a hyperplane, however a
radial basis neuron allows you to split a space with a hypersphere.

**** Application
Let's say we want to fit a bunch of data with radial basis neurons
using Gaussian functions for our \phi's.  If we have 10,000 data points
and we only want 10 hidden neurons, then we could just pick 10 points
at random to be the centers of our neurons.  Random works very well
because we tend to sample from dense sample areas.

Since 10 does not equal 10,000 we can't use simple matrix inversion to
learn our weights because a 10x10,000 matrix is not square.

we could also use
- clustering algorithms, or
- gradient descent -- the free parameters in this approach are the
  weights and the centers of the spheres, so we could compute the
  partials of the error with respect to t and w and descend on both
  the centers and weights

** 2010-11-04 Thu
*** learning on radial basis functions
(RBF learning strategies in the book p.320)

1) random centers
2) self organizing centers
3) supervised -- gradient descent centers

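let's assume they're Gaussians for this example
\begin{equation}
  \phi(|\bar{x}-\bar{t}|) = \exp\left(-\frac{|\bar{x}-\bar{t}|^{2}}{2\sigma^{2}}\right)
\end{equation}

So for a network with a hidden layer of m radial basis neurons all
connected to all inputs and all feeding into a single perceptron the
output is
\begin{equation}
  y = \Sigma_{i=1}^{m} w_i \phi(|\bar{x}-\bar{t}_i|)
\end{equation}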

initial values for our Gaussians
1) Under type (1) we set the centers $t_i$ to random input points.  To
   pick \sigma we want some overlap between the Gaussians positioned around
   different centers, but we also want different values for points
   closer to a particular center.  So to approximate a good value for
   \sigma we can find the maximum distance $d_{max}$ between any pair of
   centers, and then we can set $\sigma^{2} \simeq \frac{d_{max}^{2}}{\sqrt{m}}$.

   Why do we divide this by $\sqrt{m}$ (which is related to the
   dimensionality of the space)?  Not sure, but this is an accepted
   heuristic.

   Then to select the values for $w_i$,

   Once these initial values are selected we can use gradient descent
   to improve the values of $w_i$ \forall i.  The centers and the
   \sigma's are fixed, so we only need update the weights.

2) we select the centers using something like k-means.  For k-means we
   1) pick a value of k (this is somewhat unsatisfying, we could start
      with only 1 center, and then add new centers whenever the max
      center to $x_i$ distance is over some preset threshold)
   2) randomly pick k centers (possibly from the data)
   3) loop over the data ($x_i$'s) and associate each x with the closest
      center
   4) for each center we move it towards the center of mass of the
      associated $x_i$'s by some learning parameter \eta
   5) keep doing it until the centers stop moving
   6) the standard deviations of these centers are easy to compute,
      just compute the standard deviation of the $x_i$'s assigned to each
      center
3) we can bring in a desired value, compute an error, and take the
   partial errors along the weights, and then we can do gradient
   descent on each of the hidden radial basis neurons.

   If we think of every radial basis neuron as a function like
   $\phi(\bar{x},\bar{t}_i,\sigma_i)$ then we can take partials with
   respect to all three of the arguments to \phi, e.g.
   \begin{equation}
     \bar{t}_i(new) = \bar{t}_i(old) - \eta \frac{\delta e^{2}}{\delta \bar{t}_i}
   \end{equation}

   We're still guessing at the number of hidden nodes, but having made
   that choice this works very well at moving around the centers of
   these nodes.

   We can use a covariance matrix to allow \sigma to vary across
   dimensions, we could then learn these $m_i^{2}$ elements of the
   covariance matrix using gradient descent.
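
The k-means steps of strategy (2) can be sketched as below.  This is
only an illustration under assumptions: the six 2-D points are made up,
the initial centers are taken from the data, and =kmeans= and =spread=
are hypothetical helper names.  =spread= uses the root mean squared
distance to the center as the width of each Gaussian.

#+begin_src python
import math

def kmeans(points, centers, eta=0.5, tol=1e-9, max_iter=1000):
    """Move each center a fraction eta toward the mean of its assigned points."""
    centers = [list(c) for c in centers]
    groups = []
    for _ in range(max_iter):
        # associate each point with the closest center
        groups = [[] for _ in centers]
        for p in points:
            dists = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        # move each center toward the center of mass of its points
        moved = 0.0
        for c, group in zip(centers, groups):
            if not group:
                continue
            for d in range(len(c)):
                mean_d = sum(p[d] for p in group) / len(group)
                step = eta * (mean_d - c[d])
                c[d] += step
                moved = max(moved, abs(step))
        if moved < tol:          # centers have stopped moving
            break
    return centers, groups

def spread(center, group):
    """Root mean squared distance of a group's points to its center."""
    total = sum(sum((pi - ci) ** 2 for pi, ci in zip(p, center)) for p in group)
    return math.sqrt(total / len(group))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(points, centers=[points[0], points[3]])
sigmas = [spread(c, g) for c, g in zip(centers, groups)]
print(centers, sigmas)
#+end_src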

** 2010-11-16 Tue
*** last time
In Hebbian learning in a /winner take all/ system with a laterally
connected output layer, learning can be localized to the winner for
any particular input.

*** self organizing maps
/highly/ recommended book \leftarrow should purchase
| Title  | Self Organization and Associative Memory |
| Author | T. Kohonen                               |
| ISBN   | 3-540-18314-0                            |

also this is Chapter 9 in the text.

The locations of neurons relative to each other in some space will
begin to become important.

Imagine that all output neurons are embedded in a two dimensional
planar grid.  Imagine local inhibition of these output neurons on this
2D grid.

A /self organizing map/ is a neural network which is continuous s.t. \forall
\eta \exists \delta s.t. if the outputs of two inputs $x_1$ and $x_2$ are within \delta in the
output space (our 2D plane) then $x_1$ and $x_2$ are within \eta in the input
space.
typical representation of these 2D output neural networks
#+begin_src latex  :file data/self-organizing-map.pdf :border 1em :packages '(("" "tikz")) :exports none :results silent
  \tikzstyle{neuron} = [circle, draw, text centered, font=\footnotesize]
  \begin{tikzpicture}[->,>=stealth', shorten >=1pt, auto]
    \node (input) at (0,4) {$\bar{x}$};
    \foreach \x in {3,2,1} {
      \foreach \y in {1,2,3} {
        \node [neuron] (\x-\y) at (\x,\y) {};
        \draw (input) to (\x-\y);
      }
    }
    % \draw (neuron)    -- node[above] {$\Phi$} (output);
  \end{tikzpicture}
#+end_src

These neurons often have alternating /competitive/ and /cooperative/
phases.  In normal competitive Hebbian learning only the winner has
his weights changed, however in the Kohonen self-organizing map there
is also a cooperative phase in which the winner shares learning with
his neighbors, they share slightly less with their neighbors, etc...

- competitive :: first we find the /winner/ by computing the distance
     between the input $\bar{x}$ and each weight vector $\bar{w}_j$;
     the winner is the neuron with the smallest distance.
     \begin{equation}
       J(\bar{x}) = \arg\min_{j \in 1..n} ||\bar{x} - \bar{w}_j||
     \end{equation}
     so the neuron with the weight vector closest to the x vector is
     the winner.  We then move this weight vector slightly (by \eta)
     closer to the x vector.

- cooperative :: For a generic neuron in the network
     \begin{equation}
       \bar{w}_j(n+1) = \bar{w}_j(n) + \eta h(J(\bar{x}),j)(\bar{x}(n) - \bar{w}_j(n))
     \end{equation}
     where the $h$ function needs to return the highest values for
     those neurons nearest to the winner with unity value right on the
     winner and fairly quickly dropping to zero as we move away from
     the winner.  We can specify $h$ as
     \begin{equation}
       h(J(\bar{x}),j) = \begin{cases}
         1 & j=J(\bar{x})\\
         f(||\bar{r}_{J(\bar{x})} - \bar{r}_j||) & j \neq  J(\bar{x})
       \end{cases}
     \end{equation}
     or actually more like
     \begin{equation}
       h(J,j) = \exp\left(\frac{-d(J,j)^{2}}{2\sigma^{2}}\right)
     \end{equation}
     where the distance metric
     \begin{equation}
       d(J,j) = ||\bar{r}_J - \bar{r}_j||
     \end{equation}
     and where \sigma is used to control the /spread/ of the weight
     updates.

     So how do you pick \eta and \sigma?  At least initially you will want \sigma
     to allow the weight updates to touch most weights.  Generally you
     will want to diminish both \eta and \sigma as time progresses.

pictured in the input space the above /pretty much/ amounts to:
whenever an input lands in the input space, all weights shift slightly
towards that input with the degree of each weight's movement based upon
its closeness to the input (or rather its closeness to the closest
weight, the winner).
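
One competitive plus cooperative update can be sketched as below,
assuming for simplicity that the output neurons sit on a 1-D grid at
positions $r_j = j$ (the lecture uses a 2-D grid) and that $h$ is the
Gaussian neighborhood.  The weights, input, \eta and \sigma are made-up values
and the function names are illustrative.

#+begin_src python
import math

def winner(x, weights):
    """Index of the weight vector closest (Euclidean) to the input x."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    return dists.index(min(dists))

def neighborhood(J, j, sigma):
    """Gaussian h(J, j) over grid distance; equals 1 on the winner itself."""
    d = abs(J - j)
    return math.exp(-d * d / (2.0 * sigma * sigma))

def som_step(x, weights, eta, sigma):
    """Return the winner and the updated weights after one input x."""
    J = winner(x, weights)
    updated = []
    for j, w in enumerate(weights):
        h = neighborhood(J, j, sigma)
        updated.append([wi + eta * h * (xi - wi) for wi, xi in zip(w, x)])
    return J, updated

weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
J, updated = som_step([2.0, 1.0], weights, eta=0.5, sigma=1.0)
print(J, updated)
#+end_src

Shrinking =eta= and =sigma= between calls gives the diminishing
schedule mentioned above.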

There are some very cool demonstrations of and visualizations of these
sorts of systems available on-line.

** 2010-11-18 Thu
*** continuous time systems -- Hopfield network
we can turn our Canonical model into a continuous time differential
equation, in which case we get
\begin{equation}
  \frac{\delta v_j(t)}{\delta t} = \Sigma_i w_{ji}(t) y_i(t)
\end{equation}
This is similar to an RC circuit in which the input is the battery and
the neuron is the capacitor.

Given this equilibrium equation...
\begin{equation}
  \frac{\delta v_j(t)}{\delta t} = \Sigma_i w_{ji}(t) y_i(t) - \frac{v_j(t)}{\tau}
\end{equation}
When we first add input, if the sum term is a positive constant then
the derivative will initially be positive (because $v_j$ is initially
0); however, eventually the derivative is 0 when the sum equals
$v_j/\tau$.  This would be one time step in our model.

we also have...
\begin{equation}
  y_j(t) = \phi(v_j(t))
\end{equation}
these two equations together form a coupled system of non-linear
differential equations.
some points
1) if \phi is linear then it is a coupled system of linear equations and
   would be more amenable to analysis
2) there is no learning in the above, it gets considerably more
   complicated once learning is included
   \begin{equation}
     \frac{\delta w_{ji}(t)}{\delta t} = F(?) - \frac{w_{ji}(t)}{\tau}
   \end{equation}
   ultimately this is a system of three highly related equations
3) this system of equations can generally be bounded in space as the
   weights and activations won't normally go off to \infty

*** dynamical systems aside
\begin{equation}
  \frac{\delta x_j(t)}{\delta t} = F_j(\bar{x}(t))
\end{equation}

- autonomous :: there is no explicit time term inside of the main
     function "F"
- non-autonomous :: the function "F" does depend on time in some way
     e.g. $F_j(\bar{x}(t),t)$

because /non-autonomous/ systems are very difficult we will focus on
systems without an explicit time term.

so fixed points in the state space of this system of equations will
correspond to states in which the neural network has ceased to learn
and has /learned/ some input, this can be thought of as a /memory/.

it is possible to /linearize/ these differential equations in some
bounded areas of space.

** 2010-11-23 Tue
after today we'll be doing adaptive resonance

*** basins of attraction
to study the stability of our "memory" attractors we'll be using
Lyapunov's Theory (direct method) to determine if these attractors
are stable.

Lyapunov's Theory, assume \exists $V(x)$ which is
1) continuous
2) $V(x^{*})=0$ where $x^{*}$ is an attractor
3) $V(x)>0$ when $x \neq x^{*}$

In the range of the attractor (in its basin) $\frac{\delta V(x)}{\delta t}<0$,
this defines the region of the attractor (i.e. where $\frac{\delta V(x)}{\delta
t}<0$ is true).

*** neural dynamics
the application of dynamic systems analysis to neural networks

- additive model
  \begin{align}
    \frac{\delta v_j(t)}{\delta t} &= -v_j(t)+\Sigma_{i=1}^{n}w_{ji}(t)\phi(v_i(t))+I_j\\
    y_j(t) &= \phi(v_j(t))
  \end{align}

- alternate additive model
  \begin{equation}
    \frac{\delta y_j(t)}{\delta t} = -y_j(t)+\phi\left(\Sigma_{i=1}^{n} w_{ji}y_i(t)\right)
  \end{equation}

*** Hopfield Network and Stability
\begin{equation}
  \frac{\delta v_j}{\delta t}=-\frac{v_j}{\tau}+\Sigma_{i=1}^{n} w_{ji} \phi(v_i(t)) + I_j
\end{equation}
- $w_{ji} = w_{ij}$ (the weights are symmetric)
- the \phi function is invertible
- and \exists a Lyapunov function
  \begin{equation}
    V = - \frac{1}{2} \Sigma_{i=1}^{n} \Sigma_{j=1}^{n} w_{ji}y_jy_i +
    \Sigma_{j=1}^{n}\frac{1}{\tau}\int_0^{y_j}\phi_{j}^{-1}(y)\delta y - \Sigma_{j=1}^{n}I_jy_j
  \end{equation}
- $\frac{\delta V}{\delta t} < 0$
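
A numerical sketch of the stability claim, under assumptions: two
neurons with symmetric weights, $\phi = \tanh$ (so $\phi^{-1}$ is atanh
and its integral has a closed form), and small Euler steps standing in
for the continuous dynamics.  The weights, initial state and step size
are made up, and the function names are illustrative.

#+begin_src python
import math

def phi(v):
    return math.tanh(v)

def phi_inv_integral(y):
    """Integral of atanh(s) ds from 0 to y."""
    return y * math.atanh(y) + 0.5 * math.log(1.0 - y * y)

def energy(v, w, bias, tau):
    """Lyapunov function of the network, written in terms of y = phi(v)."""
    y = [phi(vj) for vj in v]
    n = len(v)
    pair = -0.5 * sum(w[j][i] * y[j] * y[i] for j in range(n) for i in range(n))
    leak = sum(phi_inv_integral(yj) for yj in y) / tau
    drive = -sum(bj * yj for bj, yj in zip(bias, y))
    return pair + leak + drive

def step(v, w, bias, tau, dt):
    """One Euler step of dv_j/dt = -v_j/tau + sum_i w_ji phi(v_i) + I_j."""
    y = [phi(vj) for vj in v]
    n = len(v)
    return [v[j] + dt * (-v[j] / tau + sum(w[j][i] * y[i] for i in range(n)) + bias[j])
            for j in range(n)]

w = [[0.0, 2.0], [2.0, 0.0]]     # symmetric: w[j][i] == w[i][j]
bias = [0.0, 0.0]
tau, dt = 1.0, 0.01
v = [0.5, -0.2]
energies = [energy(v, w, bias, tau)]
for _ in range(2000):
    v = step(v, w, bias, tau, dt)
    energies.append(energy(v, w, bias, tau))
print(energies[0], energies[-1], v)
#+end_src

The recorded energy falls as the state settles into one of the
attractors.  Making the weights asymmetric removes this guarantee.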

** 2010-11-30 Tue
*** Adaptive Resonance Theory ART-1
:                +--------------------+      dipole layer, each neuron is a pair
: F2             |o o o o o o o o o o |   <- connected as a flip-flop, and all
:                +--------------------+      pairs connected as winner-take-all
:                                                +----+   gain control
:              Full Bidirectional connectivity   | GC |<- fully connected
:               (top down weights are binary)    +----+   to F1 and F2
:     +---------------------------------------------+
: F1  |o o o o o o o o o o o o o o o o o o o o o o o|
:     +---------------------------------------------+    +---+    vigilance,
:       ^        ^    ^           ^      ^     ^         | e | <- fully connected
:       |        |    |           |      |     |         +---+    to F1 and F0
:     +---------------------------------------------+
: F0  |o o o o o o o o o o o o o o o o o o o o o o o|
:     +---------------------------------------------+
:        ^         ^       ^       ^           ^
:        |         |       |       |           |
:            Input I is binary vector

- GC :: is on unless it is inhibited, any inputs turn it off
- F0 :: these neurons are binary on or off depending on their input
- F1 :: these are threshold neurons, which have three inputs (one from
        F0, one from GC and one from each neuron in F2 which is
        initially only 1), it requires at least 2 inputs to turn on
- F2 :: each neuron is a flip-flop pair of neurons with the whole
        layer connected into a winner-take all network, let $T_i$ be the
        weights up from F1 to neuron i in F2, then the downward
        weights of i, $B_i$, are always a binary scaling of $T_i$
- e ::  the vigilance neuron checks the Hamming distance between the
        activation of F0 and the activation of F1, to see if the
        top-down weights (defining the center of the active neuron in
        F2) are /close enough/ to the input (the activation in F0), we
        can calculate this with
        \begin{equation}
          \frac{||T_1||}{||I||} \geq \rho
        \end{equation}
        where $T_1$ is the activation pattern of F1 and I is the input
        vector.  Since the active F1 neurons are a subset of the
        input, this ratio is one minus the normalized Hamming distance
        $\frac{||T_1-I||}{||I||}$.  /e/ turns on when the ratio is
        less than \rho.
        When /e/ turns on it activates the flip-flop neuron of the
        active neuron in F2 which latches off that F2 neuron, and when
        the effects of it turning off propagate through the network
        and the original input activation propagates back up to F2, a
        new neuron is recruited to activate in response to the input.

Let's look at the behavior of this system
1) initially only GC is turned on
2) let's apply a pattern
3) the F0 neurons of the 1 inputs will activate
4) the F1 neurons related to the active F0 neurons will activate
   (because they now have two inputs)
5) the active F1 neurons then push activation to the recruited neurons
   in F2, and the winner turns on
6) when the winner turns on it begins inhibiting GC and activating
   back from F2 to F1 (according to its top-down weight vector)
7) without GC on all neurons in F1 which aren't activated by the
   active neuron in F2 turn off, those neurons in F1 which are turned
   on are now the intersection of those in the input pattern and those
   with top-down weights from the active neuron in F2

This is like /leader clustering/ where we test for /closest/ and
/close-enough/.  The /closest/ part of this is controlled by the
winner-take-all formation in F2, the Vigilance neuron /e/ is
responsible for the /close-enough/ test.

We then learn the weights between F1 and F2 with Hebbian learning.
All weights are held between 1 and 0, as the activation /resonates/
then those neurons in F1 which /resonate/ with the active neuron in F2
will have their weights to and from F2 tend to 1 and all others have
their weights tend to 0.  This is similar to /template matching/ in
the pattern matching literature.

** 2010-12-02 Thu
*** Adaptive Resonance Theory ART (continued)

:            k=1             nf2
:      +---------------------------+
:    F2|     ------------------    | \   
:      |                           |  |
:      |        T                  |  |
:      |         k                 |  |
:      |                           |  |
:    F1|  -----------------------  |  |
:      |                           |\ /
:      |                           | e
:    F0|  -----------------------  |/
:      |                           |
:      +---------------------------+

- closest :: How do we compute the closest neuron in F2?
  The activation normalized by the number of up weights (size of the
  template)
  \begin{equation}
    k = \arg\max_{k=1..n_{f2}}\left\{\frac{||I \wedge T_k||}{\beta + ||T_k||}\right\}
  \end{equation}
  \beta has the effect of breaking ties in favor of the larger template, \beta
  is typically set to $\frac{1}{n_{f0}}$.

- close enough :: How do we decide if this is close enough
  \begin{equation}
    \frac{||I \wedge T_k||}{||I||} \geq \rho
  \end{equation}

  If the above is not true, then we reset and add a new neuron to F2

  If it is true then we /learn/ and update $T_k$ using the /gated
  Hebbian/ we discussed last time

\uparrow the above is *the entirety* of the ART algorithm \uparrow
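
The two tests above can be sketched as a short clustering loop.  This
follows the notes literally and makes two assumptions that are not
stated in them: the learning step is taken to be the fast-learning
limit $T_k \leftarrow I \wedge T_k$, and only the single winner is tested
before a new neuron is recruited (fuller descriptions of ART-1 keep
searching the remaining F2 neurons first).  Inputs are assumed to
contain at least one 1 bit; =art1= and its helpers are illustrative
names and the four patterns are made up.

#+begin_src python
def norm(bits):
    """Number of 1 bits."""
    return sum(bits)

def bit_and(a, b):
    return [x & y for x, y in zip(a, b)]

def art1(patterns, rho, beta=None):
    """Cluster binary patterns; returns the templates and a label per pattern."""
    templates, labels = [], []
    for I in patterns:
        b = beta if beta is not None else 1.0 / len(I)
        k = None
        if templates:
            # closest: choice function over the existing F2 neurons
            scores = [norm(bit_and(I, T)) / (b + norm(T)) for T in templates]
            k = scores.index(max(scores))
            # close enough: vigilance test, otherwise reset
            if norm(bit_and(I, templates[k])) / norm(I) < rho:
                k = None
        if k is None:
            templates.append(list(I))                 # recruit a new F2 neuron
            k = len(templates) - 1
        else:
            templates[k] = bit_and(I, templates[k])   # assumed fast learning
        labels.append(k)
    return templates, labels

patterns = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
templates, labels = art1(patterns, rho=0.6)
print(templates, labels)
#+end_src

Raising =rho= makes the /close-enough/ test stricter and so produces
more, smaller clusters.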

some properties:
- this can all be described by a set of coupled differential equations
- the order of presentation of training points affects the learned structure

*** limits on the maximum required number of epochs
we can prove this using /complement coding/ -- meaning \forall inputs I we
concatenate I and its complement $I^{C}$ to get $(I,I^{C})$.  The
number of 1's in this concatenation will always be equal to the
number of bits in I.

This will converge (number of nodes and values of weights) in 1 epoch.
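
A tiny sketch of the coding step for binary inputs; =complement_code=
is an illustrative name.

#+begin_src python
def complement_code(bits):
    """Concatenate a binary vector with its bitwise complement."""
    return list(bits) + [1 - b for b in bits]

coded = complement_code([1, 0, 1, 1])
print(coded, sum(coded))
#+end_src

Whatever the input, the number of 1's in the coded vector equals the
length of the original vector, so every coded input has the same norm.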

*** how to use real-valued inputs
thermometer code
- normalize your inputs to between zero and 1
- then multiply each normalized value by $n_{f0}$ and represent it by that
  many consecutive 1's and pad the rest with zeros

this real value case with unary encoding can be /very/ easily
visualized geometrically where every neuron in F2 becomes a box in as
many dimensions as there are real values in the input
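
A sketch of the thermometer code for one real value.  The range
=lo=..=hi= used for normalization and the number of bits per value
(=n_bits=) are illustrative parameters, and rounding to the nearest
whole number of 1's is an assumption.

#+begin_src python
def thermometer(value, lo, hi, n_bits):
    """Encode value in [lo, hi] as consecutive 1's padded with 0's."""
    scaled = (value - lo) / (hi - lo)       # normalize to [0, 1]
    ones = round(scaled * n_bits)
    return [1] * ones + [0] * (n_bits - ones)

print(thermometer(5, 0, 10, 4))
print(thermometer(0.75, 0.0, 1.0, 8))
#+end_src

Several real inputs would each be encoded this way and concatenated
into one binary input vector.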

*** fuzzy-ART
- we replace \wedge with a /fuzzy-\wedge/ which is just min, it returns the min
  of its inputs (which don't have to be 0 or 1 but can be any real
  value between 0 and 1).
- similarly complement becomes $1-x$, so the complement of (0.1,0.3)
  becomes (0.9,0.7)
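
The two fuzzy operators, as a short sketch with illustrative function
names:

#+begin_src python
def fuzzy_and(a, b):
    """Element-wise min; agrees with the ordinary AND on 0/1 values."""
    return [min(x, y) for x, y in zip(a, b)]

def fuzzy_complement(a):
    """Element-wise 1 - x; agrees with the ordinary complement on 0/1 values."""
    return [1.0 - x for x in a]

print(fuzzy_and([0.2, 0.8], [0.5, 0.4]))
print(fuzzy_complement([0.1, 0.3]))
#+end_src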

*** ART-2
Not very popular, is the case where the F2 row is not winner-take-all
but rather can have multiple neurons turn on.

*** ART-Map
for supervised learning

:                    x
:                   / \
:                  /   \
:                 x     x Association Matrix
:                  \   /
:                   \ /
:                    X
:     +----------+      +----------+
:     |          |      |          |
:     |    A     |      |    B     |
:     |          |      |          |
:     |          |      |          |
:     +----------+      +----------+
:         x                  d

/x/ and /d/ are the inputs and the desired values, these train A and B
which are both ART-1 architectures, and their outputs go into an
association matrix in which the related outputs are associated.

* reading
** 1st Chapter
*** Benefits of Neural Networks
- nonlinearity :: in the weighting between neurons, each neuron is
     non-linear as are their sum
- input-output mapping :: ideal for supervised learning
- adaptable :: easily re-weighted to adapt to a changing environment,
     however shouldn't change too fast in response to fleeting
     disturbances; this is called the /stability-plasticity dilemma/
- evidential response :: can return /confidence/ along with its
     decision
- contextual information :: each neuron can potentially affect each
     other neuron allowing for natural spread of contextual information
- fault tolerant :: naturally distributed so if any single neuron is
     damaged the global behavior is impaired but not drastically
     affected; /robust/, /graceful degradation/
- VLSI :: massively parallel, take advantage of parallel hardware
- uniform :: regardless of problem domain the same structure
     (interconnected neurons) is used, allows for modularity and
- biological analogy :: as an analog of a biological system, many good
     ideas can be borrowed from nature

*** Types of activation functions
    :PROPERTIES:
    :CUSTOM_ID: reading-activation-functions
    :END:
(see notes-activation-functions)

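- Threshold ::
  \begin{equation}
    \phi(v) = \begin{cases}
         1 & \text{if } v \geq 0\\
         0 & \text{if } v < 0
    \end{cases}
  \end{equation}
- Piecewise Linear ::
  \begin{equation}
    \phi(v) = \begin{cases}
         1 & \text{if } v \geq +\frac{1}{2}\\
         v & \text{if } +\frac{1}{2} > v > -\frac{1}{2}\\
         0 & \text{if } v \leq -\frac{1}{2}
    \end{cases}
  \end{equation}
- Sigmoid ::
  \begin{equation}
    \phi(v) = \frac{1}{1+\exp(-av)}
  \end{equation}
- Stochastic ::
  \begin{equation}
    \phi(v) = \begin{cases}
         +1 & \text{with probability } P(v)\\
         -1 & \text{with probability } 1-P(v)
    \end{cases}
  \end{equation}
  \begin{equation}
    P(v) = \frac{1}{1+\exp(-v/\tau)}
  \end{equation}
  where \tau is a /pseudo temperature/ used to control the amount of
  noise in the system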
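
The four activation functions written out as a sketch, transcribed
directly from the definitions above.  The default values for =a= and
=tau= and the optional =rng= argument are illustrative choices.

#+begin_src python
import math
import random

def threshold(v):
    return 1 if v >= 0 else 0

def piecewise_linear(v):
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v

def sigmoid(v, a=1.0):
    """a is the slope parameter."""
    return 1.0 / (1.0 + math.exp(-a * v))

def stochastic(v, tau=1.0, rng=random):
    """Return +1 with probability P(v) = 1 / (1 + exp(-v / tau)), else -1."""
    p = 1.0 / (1.0 + math.exp(-v / tau))
    return 1 if rng.random() < p else -1

print(threshold(0.2), piecewise_linear(0.25), sigmoid(0.0), stochastic(0.0))
#+end_src

A larger =tau= pushes $P(v)$ toward one half for every $v$, which is the
sense in which it controls the noise.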

*** Network Architectures
- single-layer feed-forward :: single layer of input nodes which
     connect to a single layer of output nodes, acyclic
- multi-layer feed-forward :: like the above but with /hidden layers/
     between the input and output layer, these can be particularly
     useful when the size of the input layer is large.  These are
     /fully connected/ when every node on a layer is connected to
     every node on the adjacent layers
- recurrent networks :: a neural network which has at least one
     feedback loop, which will typically involve /unit-delay
     elements/.  feedback loops have significant effect on the
     behavior of the network

*** Knowledge Representation
It is difficult to talk about /knowledge/ being represented inside of
a neural network, as any such knowledge will be stored implicitly in
the structure of the network.  As such any /knowledge/ which a neural
network may contain about its /world/ (inputs and outputs) is
directed toward acting in the world.

Knowledge Rules
1) similar inputs from similar classes should result in similar
   internal states (e.g. by Euclidean distance of a vector of neuron
   activation) and should thus be classified similarly
2) opposite of (1) items in different categories should be given
   different internal representations
3) important features should be allotted a large number of neurons in
   the network
4) prior information and invariance should be built into the design of
   the network
   - biologically plausible
   - smaller size
     - limits the search space of the network
     - faster information transmission through the network
     - cheaper to build

*** How to Build Prior Information and Invariance into a Network
no general rules.

two ad-hoc rules for prior information
1) Restricting network architecture through use of local connections
   known as /receptive fields/
2) Constraining the choice of weights through /weight-sharing/

invariance, e.g. same object but viewed from different angles, same
voice but spoken loudly or softly

methods of entraining invariance into a network
1) invariance by structure
2) invariance by training
3) invariant feature space, assuming \exists features which are constant
   across all variants of the same input

** 2nd Chapter
1) neural network is /stimulated/
2) it /changes/
3) it responds /differently/ because of the change

*** learning paradigms
**** Error-correction Learning
- directed to a particular neuron
- the output of that neuron is compared to the desired output
- the difference /error/ is used to re-weight the inputs to the neuron
- this process continues until the neural net hits some steady state

(this is implemented with /back-propagation/)

**** Memory-based Learning
- each input is associated with an output
- when a new unknown input is seen it is classified as the same value
  as the nearest (Euclidean) known input
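
A minimal nearest-neighbor sketch of this idea.  The stored examples
and labels are made up and =nearest_neighbor= is an illustrative name.

#+begin_src python
def nearest_neighbor(memory, x):
    """Return the output stored with the known input closest (Euclidean) to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best_input, best_output = min(memory, key=lambda pair: sq_dist(pair[0], x))
    return best_output

memory = [((0.0, 0.0), "A"), ((1.0, 1.0), "B"), ((0.0, 1.0), "C")]
print(nearest_neighbor(memory, (0.9, 0.8)))
#+end_src

Comparing squared distances gives the same ordering as comparing
distances, so the square root is skipped.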

**** Hebbian Learning
associative learning
1) if the neurons on either side of an axon are activated
   simultaneously then the strength of the axon is increased
2) if the neurons are activated asynchronously then the synapse is
   weakened or eliminated

this sort of learning is
- time-dependent
- local
- interactive (depends on both sides of the synapse)
- conjunctional or correlational

strong physiological evidence that associative /Hebbian/ learning
takes place in the brain

**** Competitive Learning
- each input neuron is attached to every output neuron
- random weights on all of these attachments
- for each set of inputs the output neuron with the highest activation
  is the /winner/, and all of its active input connections are
  strengthened
- this process continues until the weights stabilize

with k output neurons this performs similar to k-means clustering, in
which each output neuron is finally associated with a cluster of
similar inputs

**** Boltzmann Learning
the neural network or /Boltzmann machine/ is characterized by an energy
\begin{equation}
  E = -\frac{1}{2}\Sigma_j\Sigma_{k \neq j}w_{kj}x_kx_j
\end{equation}

the machine operates by selecting a neuron at random during the
learning process and flipping its output with probability
\begin{equation}
  P(x_k \rightarrow -x_k) = \frac{1}{1 + \exp(-\Delta E_k/\tau)}
\end{equation}

two running conditions
- /clamped condition/ in which the visible neurons (attached to the
  environment) can't be changed
- /free-running condition/ in which all neurons can be flipped

- $p_{kj}^{+}$ denotes the correlation between neurons j and k when in
  the clamped condition
- $p_{kj}^{-}$ denotes the correlation between j and k when in the
  free-running condition
then the change in weight is
\begin{equation}
  \Delta w_{kj} = \eta(p_{kj}^{+} - p_{kj}^{-}), j \neq k
\end{equation}

*** credit assignment problem
how to assign credit and blame to inner portions of a neural network,
this is sometimes complicated through temporal delay, requiring
assignment of credit to past actions/events

*** Learning without a Teacher
these first two require a critic
#+begin_src dot :file data/critic.png :results silent :exports none
  digraph fsm {
    environment -> critic [label="primary reinforcement"];
    environment -> critic [label="state"];
    environment -> "neural network" [label="state"];
    critic -> "neural network" [label="heuristic reinforcement"];
    "neural network" -> environment [label="actions"];
  }
#+end_src
#+Caption: Reinforcement Learning with a critic

- in /reinforcement learning/ the system matches input to output to
  maximize some scalar performance measure.
- in /delayed reinforcement/ learning the system attempts to minimize
  a /cost-to-go/ function, which is the cost of a sequence of actions
  taken over some sequence of inputs.  learn from the results of
  actions.  related to /dynamic programming/

#+begin_src dot :file data/wo-critic.png :results silent :exports none
  digraph fsm {
    environment -> "neural network";
  }
#+end_src
#+Caption: Unsupervised learning w/o a critic

an example of unsupervised learning: a two-layer neural network in
which the first layer is the /input/ layer and the second the
/competitive/ layer, s.t. the neurons in the competitive layer compete
with each other to respond to an input.

*** learning tasks
**** pattern association
- associative memory :: distributed memory which learns by /association/
- auto-association :: (unsupervised) a set of patterns are repeatedly
     presented to the network, and it tries to /store/ them s.t. when
     presented with a noisy version of a pattern it returns the
     original pattern
- hetero-association :: (supervised) arbitrary set of input patterns
     are paired with another arbitrary set of output patterns

**** pattern recognition
assignments of /inputs/ to /classes/

#+begin_src dot :file data/pattern-association-design.png :results silent :exports none
  digraph fsm {
    "input pattern" -> "unsupervised network for\lfeature extraction";
    "unsupervised network for\lfeature extraction" -> "supervised network for\lclassification" [label="feature vector"];
  }
#+end_src
#+Caption: classical design for pattern association network

**** function approximation
Attempt to match some function $d = f(x)$ where x is the input vector
and d is the desired output.  Supervised learning can be used to train
the network to match $f$.

* questions
- in implementations, is a neuron normally allowed to reference
- what state is normally contained in a neuron?