#+TITLE: Basic Graduate Evolution
#+EXPORT_EXCLUDE_TAGS: hide
#+OPTIONS: toc:2 num:nil ^:nil

* Notes
** Review genetic basics <2013-01-15 Tue>
*** DNA and RNA
nucleotide \simeq base

Bases
- purines (A G) and pyrimidines (C T; U replaces T in RNA)
- A-T pairs have 2 H bonds, less stable (fewer bonds)
- G-C pairs have 3 H bonds

| RNA | normally single stranded |
| DNA | double stranded          |

Each strand has a 5' end and a 3' end.  The convention is to write
from the 5' end to the 3' end when writing out DNA.  The strands are
anti-parallel.

RNA (as compared to DNA)
- single stranded
- much shorter (some small RNAs are only 20-24 nucleotides long)
- Bases: (A C G U)

*** Genes
- traditionally, something that codes for a protein, i.e., gene \rightarrow
  messenger-RNA \rightarrow protein (this was previously the only functionality
  known for DNA segments)
- now, genes also include sequences which do other things, e.g.,
  sequences transcribed into RNA which is functional in itself

Types of Genes
- protein coding
- RNA
- regulatory

Eukaryotic protein coding genes comprise both transcribed (exons and
introns) and non-transcribed (flanking regions, e.g., the promoter) parts.

Genes are transcribed by RNA polymerase, which first anchors to a
"promoter region" which flanks the gene, and then travels along the
gene transcribing its contents.

A "codon" is 3 nucleotides which are translated into 1 amino acid.

Gene Structure (we're talking about protein coding gene)
- TATA box :: in the promoter region -19 from the gene start, in a
     GC-rich region of the DNA.  Not required, but prevalent

A Gene with the promoter region.
:     promoter region       Gene->
:   ---->  --  <---  --     ----------   --------   ---------
:   GC    CAAT   GC  TATA    
:   box    box  box  box   

RNA polymerase
| I   | ribosomal RNA  |
| II  | protein coding |
| III | small RNA      |

**** Genes specifying proteins

Transcription and translation process
1. DNA -> pre-RNA (which *does* include non-coding introns)
2. pre-RNA -> mature mRNA, this is done by the splicing machinery
   (these arose early in evolution and are very intricate machines).
   Genes are spliced using the "GT-AG rule" (introns often start with
   GT and end with AG)
3. Mature mRNA -> the remainder of the gene is split up into codons
   and translated.  Translation starts at the start codon (AUG, which
   codes for methionine) and ends at one of three stop codons (UGA
   UAA UAG), which do not code for an amino acid.  The coding region
   is still capped by small untranslated sections on either side.
4. Protein -> only the translated regions are converted to amino acids

There are many other regulatory sequences in introns or between genes
which do things like increase or decrease the rate of transcription.

**** RNA-specifying genes (transcribed to RNA, not translated to protein)
- transfer RNA
- ribosomal RNA
- similar sequences between (eukaryotes and prokaryotes) which means
  they arose early and are important

Degenerate genetic code
- 20 Amino Acids
- 64 codons (4 \times 4 \times 4)

**** Regulatory Genes
- regulate the expression of another gene
- not transcribed or translated

enhancers or repressors (change tempo of transcription)

notable examples
- replicator genes :: initialize and terminate replication
- telomeres :: "cap" at the end of a chromosome, these erode with
     age, these don't erode as quickly in sea turtles
- segregator genes :: help split the DNA pair (sisters) during cell
     division, this is where the zippers attach to unzip
- recombination genes :: sites for recombination during meiosis
     (crossover more likely to happen here)

*** Amino Acids
Composed of
- a central carbon
- an amino end
- a carboxyl end
- a hydrogen
- a side chain -- this is the most variable and the most important

Simplest is glycine
- single H side chain
- fits in nooks and crannies of proteins

Five classes
- positively charged
- negatively charged
- hydrophobic
- neutral
- special

*** Protein
string of amino acids

- secondary structure -- two most common 2-D structures for folded proteins
  - \alpha helix
  - \beta pleated sheet
- tertiary structure -- these combine into the 3-D structure of the protein
- quaternary structure -- two or more tertiary molecules combined
** More genetic basics <2013-01-17 Thu>
*** phase
Mutations can cause loss of the /phase/ (correct codon alignment),
which garbles translation and results in a garbled protein.  The phase
is determined by the start or initiation codon, and is called the
"reading frame" or "open reading frame" (ORF).

*** degeneracy
Genetic code is degenerate but not ambiguous.

file:data/codon_table.jpg

Multiple codons coding for an amino acid will often differ in the
third position.

Terminology
- fourfold-degenerate site :: any nucleotide in this position will
     specify the same amino acid.
- twofold-degenerate site :: two of the four nucleotides at this
     position will specify the same amino acid.
- non-degenerate site :: every nucleotide here results in a different
     amino acid, called an "amino-acid substitution".
- synonymous codons :: different codons which code for the same amino
     acid.

In most codons the first two positions are /non-degenerate sites/.

numbers of codons
- 61 sense codons (remaining 3 are STOP codons)
- 549 possible codon nucleotide substitutions
  - 61 codons \times 3 positions \times (4-1) alternatives for each position
- assuming equally probable mutations (not true), then 70% of all 3rd
  position changes are synonymous
- 2nd site is the most sensitive to substitution, 0% of changes are
  synonymous
- in the 1st site only 4% of changes are synonymous
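
The counts above can be checked by brute force.  The following is a
minimal sketch, assuming the standard genetic code and equally
probable substitutions; the names =CODE= and =count_changes= are made
up for this illustration.

#+begin_src python
  from itertools import product

  BASES = "TCAG"
  # Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = STOP).
  AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
           "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
  CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

  def count_changes():
      """Return {position: (synonymous, total)} over the 61 sense codons."""
      counts = {0: [0, 0], 1: [0, 0], 2: [0, 0]}
      for codon, aa in CODE.items():
          if aa == "*":
              continue  # STOP codons are not sense codons
          for pos in range(3):
              for base in BASES:
                  if base == codon[pos]:
                      continue
                  mutant = codon[:pos] + base + codon[pos + 1:]
                  counts[pos][1] += 1
                  if CODE[mutant] == aa:
                      counts[pos][0] += 1
      return {pos: tuple(c) for pos, c in counts.items()}

  for pos, (syn, total) in count_changes().items():
      print(f"position {pos + 1}: {syn}/{total} synonymous ({syn / total:.0%})")
#+end_src

Each position contributes 61 \times 3 = 183 of the 549 possible changes.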

Main evolutionary forces (on populations)
- mutation
- selection
- migration (new alleles from outside)
- drift

*** mutations
Mutations are hereditary changes in the genetic material due to
errors in DNA replication or DNA repair.

Types of mutations
- substitutions
- recombination (crossing-over and gene conversion)
- deletions/insertions ("indels")
- duplication
- inversions

Evolutionarily, only mutations that occur in germ cells are
hereditary.

**** classes of mutations
***** substitutions
also called "point mutations"

classifications
1. transitions (purine to purine, pyrimidine to pyrimidine), there
   are four possibilities for this
   - a \leftrightarrow g
   - t \leftrightarrow c
2. transversions change the type, there are eight possibilities for
   this

If these were equally likely you would expect twice as many
transversions as compared to transitions.

You actually see many more transitions than transversions.
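
The four-versus-eight count can be reproduced by enumerating all
twelve directed substitutions.  This is only an illustrative sketch;
=is_transition= is a made-up helper name.

#+begin_src python
  PURINES = {"A", "G"}  # the pyrimidines are C and T

  def is_transition(a, b):
      """True if a -> b stays within the purines or within the pyrimidines."""
      return (a in PURINES) == (b in PURINES)

  # all ordered pairs of distinct bases
  subs = [(a, b) for a in "ACGT" for b in "ACGT" if a != b]
  transitions = [s for s in subs if is_transition(*s)]
  transversions = [s for s in subs if not is_transition(*s)]
  print(len(transitions), len(transversions))  # 4 8
#+end_src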

Also classified by effect (only applies to protein coding genes)
1. silent or synonymous (no amino acid change)
2. replacement or non-synonymous (cause aa change)
   - missense (now codes a new aa)
   - nonsense (changes to a termination codon, will prematurely
     terminate translation)

***** crossing-over and gene conversion
- homologous recombination :: occurs between strands which are similar
     through a shared common ancestry

This happens most often during repair.  Say a chromosome breaks (very
common); a machine comes along and repairs the broken chromosome, and
sometimes this causes a change.

This also happens during meiosis.

Two types of homologous recombination
1. crossing over (reciprocal recombination), two chromosomes pair up
   and each gets a bit of the other
2. gene-conversion (nonreciprocal recombination), a bit of one goes
   to another, but one remains unchanged

***** insertions and deletions
- unequal crossing over :: Indels may be caused by unequal crossing
     over, which is often triggered by similar sub-sequences; this can
     cause an insertion in one chromosome and a deletion in the other.
     Generally 10-13 nucleotides long.

     :  -----=====-----                -------------
     :                          x
     :    ---------======-----     -----====---====------

- replication slippage :: (look up in a text book), when a repeating
     pattern is offset.  These can be thousands of base pairs long

- retro-transposition :: Selfish elements which are prone to copy
     themselves anywhere throughout the genome (discovered by Barbara
     McClintock in corn).  Sometimes these elements will grab a nearby
     section of the sequence and bring it along.

***** inversion
A rotation of a double-stranded segment by 180 degrees.

:  |     |
: a b c d e f g h
:    to
: a d c b e f g h

These often occur between genes where they don't have much effect.

**** spatial distribution of mutation
Often 100-fold differences in mutation rates between elements of the
sequence.  These are often very repetitive sequences of the genome.

Some groups of nucleotides are prone to change.  E.g., =CG= is easily
methylated causing the C to become a T.  =TT= is also a hot spot.

palindromes

epigenetics (non-genetic), e.g., chemical elements changing genetic
interpretations.
- chemicals inhibiting expression
- DNA bound up into little balls by proteins (histones), this
  inhibits transcription

**** substitution in non-coding sequences and pseudo-genes
important to determine the pattern of spontaneous mutation

no selective pressure

mutation accumulation experiments, you maintain the lowest possible
population size

you could bottleneck a population, no selection, only genetic drift

Pseudo-genes are dead (premature STOP or something), these can also
indicate baseline rates.
- can compare to an active duplicate
- can compare to a homolog (inactive in people, compared to active in
  chimps)

Trend from GC to AT, so non-coding regions become =AT=-rich.

** Dynamics of Genes in Populations <2013-01-22 Tue>
- Q :: Why does population size matter?
- A :: Selection does not work in small populations.

*** Evolution as a population-level process
Big "macro-level" changes (e.g., between species) are results of the
same small "micro-level" changes between individuals.

Four major evolutionary forces
- mutation
- random genetic drift -- ∃! animal, the Atlantic eel, which has an
  effectively infinite population size, because they all come together
  to one place annually to mate.
- natural selection
- gene flow

The study of gene changes in populations is /population genetics/.
- what influences the frequency of a mutant allele over time
- how is genetic variability maintained
- probability of going to fixation
- how fast will replacement take place
- influence of chance effects on molecular genetic change

*** Definitions
(see #terms)

locus, allele, allele frequencies, genotype, phenotype, discrete
trait, continuous or quantitative trait, homozygous vs. heterozygous,
genotypic vs. phenotypic ratios, dominant vs. recessive allele,
evolution, natural selection, fitness (w)

*** Punnett Square
- BB is homozygous
- Bb is heterozygous

Mendel bred purple \times purple plants and got both purple and white
plants. Specifically 705 purple and 224 white.

Diploid parents produce haploid gametes (else there would be
a combinatorial explosion in the ploidy of the offspring).

A Punnett Square
|        |   | pollen      |             |
|        |   | B           | b           |
|--------+---+-------------+-------------|
| pistil | B | BB (purple) | Bb (purple) |
|        | b | Bb (purple) | bb (white)  |

- phenotypic ratio of above is
  - 3 purple
  - 1 white
- genotypic ratio of above is
  - 1 BB
  - 2 Bb
  - 1 bb
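
The same ratios fall out of enumerating every pairing of gametes.
This is a small sketch; =punnett= is a hypothetical helper, and it
assumes B (purple) is fully dominant over b (white).

#+begin_src python
  from collections import Counter
  from itertools import product

  def punnett(parent1, parent2):
      """Count offspring genotypes of two diploid parents, e.g. 'Bb' x 'Bb'."""
      # sorted() puts the upper-case (dominant) allele first, so 'bB' -> 'Bb'
      return Counter("".join(sorted(g)) for g in product(parent1, parent2))

  genotypes = punnett("Bb", "Bb")
  phenotypes = Counter("purple" if "B" in g else "white"
                       for g in genotypes.elements())
  print(dict(genotypes))   # {'BB': 1, 'Bb': 2, 'bb': 1}
  print(dict(phenotypes))  # {'purple': 3, 'white': 1}
#+end_src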

*** Changes in allele frequencies
Problem
- 1000 peppered moths in Manchester
- dark melanic form of allele is dominant (M)
- ancestral is recessive (m)
- 825 melanic
- 175 peppered
- 512 of melanic are heterozygous

Some calculations
- Phenotypic ratio is
  825/175 or 4.7 melanic to peppered

- Genotypic ratio is
  313 MM, 512 Mm, 175 mm or
  1.79 : 2.93 : 1

- Allele frequencies of M and m
  ((2 * 313) + 512) / 2000 = 0.569
  ((2 * 175) + 512) / 2000 = 0.431
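
The allele counting can be written out as a short function.  A minimal
sketch: =allele_frequencies= is a made-up name, and the genotype
counts are those of the moth problem (313 MM = 825 melanic minus 512
heterozygotes).

#+begin_src python
  def allele_frequencies(n_MM, n_Mm, n_mm):
      """Return (p, q), the frequencies of M and m, from genotype counts."""
      total_alleles = 2 * (n_MM + n_Mm + n_mm)  # every diploid carries 2 alleles
      p = (2 * n_MM + n_Mm) / total_alleles
      q = (2 * n_mm + n_Mm) / total_alleles
      return p, q

  p, q = allele_frequencies(313, 512, 175)
  print(round(p, 3), round(q, 3))  # 0.569 0.431
#+end_src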

Allele frequencies are changed by
- selection
- drift
- migration

2 Mathematical approaches
- deterministic :: (analytic) can predict changes unambiguously.  The
     first of these was "Hardy-Weinberg".
- stochastic :: probabilistic, associates probability distributions
     with environmental conditions

Deterministic assumptions
- infinite population size
- constant environment

Needed for Darwinian selection (influenced by Malthusian principles)
- variation
- environmental limit to population size (carrying capacity)
- differential reproduction (because of the above)

*** Types of mutation
:                 +------------ mutation ------------------+
:                 |                  |                     |
:                 |                  |                     |
:                 |                  |                     |
:            deleterious          neutral            advantageous
:                 |                  |                     |
:                 |                  |                     |
:                 |                  |                     |
:                 |                  |                     |
:            purifying            chance             positive selection
:            selection            events                   or
:                                                     overdominant     
:                                                      selection

Normally selection /reduces/ genetic variation, however "overdominant"
selection can increase genetic variation.  This is when the
heterozygote has the highest fitness.

*** Hardy-Weinberg principle
1 locus, 2 alleles (A1 A2)
- 3 possible diploid genotypes
- allelic frequencies are
  - f(A1) = p
  - f(A2) = q
  - p + q = 1

| genotype  | A1A1  | A1A2 | A2A2  |
|-----------+-------+------+-------|
| frequency | $p^2$ | 2pq  | $q^2$ |

genotypic frequencies
  - f(A1A1) = $p^2$
  - f(A1A2) = 2pq
  - f(A2A2) = $q^2$

This is *the* null model.

Back to our problem.
- melanic is dominant and is 87% (includes heterozygotes)
- \rightarrow 13% is non-melanic
- \rightarrow $q^2$ = 0.13
- \rightarrow q = \sqrt{0.13} = 0.36
- \rightarrow p = 1 - q = 0.64
- \rightarrow f(Mm) = 2pq = 2 \times 0.36 \times 0.64 = 0.46
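
The chain of steps above can be sketched in a few lines, assuming the
population is in Hardy-Weinberg equilibrium so that the recessive
phenotype frequency equals the square of q.  The function name
=hardy_weinberg_from_recessive= is made up for this illustration.

#+begin_src python
  from math import sqrt

  def hardy_weinberg_from_recessive(freq_recessive):
      """Estimate allele and genotype frequencies from the frequency of
      the recessive phenotype, assuming Hardy-Weinberg proportions."""
      q = sqrt(freq_recessive)
      p = 1 - q
      return {"p": p, "q": q, "MM": p * p, "Mm": 2 * p * q, "mm": q * q}

  for name, value in hardy_weinberg_from_recessive(0.13).items():
      print(name, round(value, 2))
#+end_src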

Graph (frequency of A, by frequency of each genotype in population)

#+begin_src gnuplot :exports both :file data/hardy-weinberg.png
  set xrange [0:1]
  set xlabel 'frequency of A'
  plot x * x title 'AA', (1-x) * (1-x) title 'aa', 2 * x * (1-x) title 'Aa'
#+end_src

#+RESULTS:
file:data/hardy-weinberg.png

*** Natural Selection changes allelic frequencies
| genotype  | A1 A1        | A1 A2        | A2 A2        |
|-----------+--------------+--------------+--------------|
| fitness   | $w_{11}$     | $w_{12}$     | $w_{22}$     |
|-----------+--------------+--------------+--------------|
| frequency |              |              |              |
| after     | $p^2 w_{11}$ | $2pq w_{12}$ | $q^2 w_{22}$ |
| selection |              |              |              |

\begin{equation*}
  q' = \frac{pq w_{12} + q^2 w_{22}}{p^2 w_{11} + 2pq w_{12} + q^2 w_{22}}
\end{equation*}

change in frequency
\begin{equation*}
  \Delta q = q' - q
\end{equation*}

\begin{equation*}
  \Delta q = \frac{pq[p(w_{12} - w_{11}) + q(w_{22} - w_{12})]}{p^2 w_{11} + 2pq w_{12} + q^2 w_{22}}
\end{equation*}

Example
- heterozygous individuals have lighter eye spots (increased predation)
- relative fitness of genotypes
  | SS |   1 |
  | Ss | 0.9 |
  | ss | 0.6 |
- p(S) = 0.7

p = 0.7
q = 0.3

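Frequencies in the original
| SS | $p^2$ | 0.49 |
| Ss | 2pq   | 0.42 |
| ss | $q^2$ | 0.09 |

Relative fitness
| SS | $w_{11}$ |   1 |
| Ss | $w_{12}$ | 0.9 |
| ss | $w_{22}$ | 0.6 |

Next generation frequencies (before dividing by $\bar{w}$)
- f(SS)' = $p^2 w_{11}$ = 0.49 \times 1 = 0.49
- f(Ss)' = $2pq w_{12}$ = 0.42 \times 0.9 = 0.378
- f(ss)' = $q^2 w_{22}$ = 0.09 \times 0.6 = 0.054
- $\bar{w}$ = 0.49 + 0.378 + 0.054 = 0.922

Next generation's allelic frequencies
- p' = (f(SS)' \times 2 + f(Ss)') / (2 \times $\bar{w}$) = 1.358 / 1.844 = 0.736
- q' = (f(ss)' \times 2 + f(Ss)') / (2 \times $\bar{w}$) = 0.486 / 1.844 = 0.264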
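
One generation of selection can be sketched as follows.  This is an
illustration, not a definitive implementation; the name =select= is
made up, and it assumes random mating (Hardy-Weinberg proportions)
before selection acts.

#+begin_src python
  def select(p, w11, w12, w22):
      """One generation of selection at a locus with two alleles.

      Returns (p', q', mean fitness)."""
      q = 1 - p
      # mean fitness: genotype frequencies weighted by their fitness
      w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
      # each heterozygote carries one copy of each allele, hence pq (not 2pq)
      p_next = (p * p * w11 + p * q * w12) / w_bar
      q_next = (p * q * w12 + q * q * w22) / w_bar
      return p_next, q_next, w_bar

  p_next, q_next, w_bar = select(0.7, 1.0, 0.9, 0.6)
  print(round(p_next, 3), round(q_next, 3), round(w_bar, 3))  # 0.736 0.264 0.922
#+end_src

With the eye spot fitness values, S rises from 0.7 to about 0.736 in
a single generation.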

** Dynamics of Genes in Populations <2013-01-24 Thu>
Selection is limited in that it can't reduce mean population
fitness, so /drift/ is essential to navigate landscapes with valleys.

*** changing allele frequencies with overdominance
- Overdominance is also called heterozygote superiority.
- When the heterozygote has a higher fitness than either homozygote.
- this is one of the few times (along with frequency dependent
  selection) in which selection *increases* genetic diversity
- called "balancing" or "stabilizing" selection

The equilibrium frequency ($\hat{p}$) of A1 is given by
\begin{equation*}
  \hat{p} = \frac{w_{12} - w_{22}}{2w_{12} - w_{11} - w_{22}}
\end{equation*}

When w11=0.9, w12=1, and w22=0.8 then $\hat{p} = 0.667$.

$\bar{w}$ is the /average/ fitness of the entire population.
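
A quick numerical check of the stable equilibrium: iterate the
selection recursion from several starting frequencies and see where
it settles.  This is a sketch under the assumption of random mating
and constant fitness values; =next_p= and =iterate= are made-up names.

#+begin_src python
  def next_p(p, w11, w12, w22):
      """Frequency of A1 after one generation of selection."""
      q = 1 - p
      w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
      return (p * p * w11 + p * q * w12) / w_bar

  def iterate(p, generations, w11, w12, w22):
      """Apply next_p repeatedly and return the final frequency."""
      for _ in range(generations):
          p = next_p(p, w11, w12, w22)
      return p

  # overdominance: the heterozygote (w12) is fitter than both homozygotes
  for p0 in (0.05, 0.5, 0.95):
      print(p0, "->", round(iterate(p0, 2000, 0.9, 1.0, 0.8), 3))
#+end_src

All three starting points end at the same value, 0.667, which is what
setting the change in frequency to zero gives for these fitness
values: (1 - 0.8) / (2 - 0.9 - 0.8).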

*** Examples of Overdominant Selection -- Sickle-cell Anemia
"find them and grind them" \leftarrow experimental identification of
overdominant selection.  Cavalli-Sforza is prolific in this area.

| type      | codon | amino-acid | cell shape |
|-----------+-------+------------+------------|
| wild type | =GAG= | glu        | doughnut   |
| mutant    | =GTG= | val        | sickle     |

The wild type allele is dominant, but it is not a perfect
dominance relation.

| alleles | anemia | malaria    | fitness |
|---------+--------+------------+---------|
| SS      | normal | vulnerable |     0.9 |
| Ss      | slight | resistant  |     1.0 |
| ss      | severe |            |     0.2 |

*** Underdominance
This is an /unstable/ equilibrium, any deviation from equilibrium will
grow, and the population falls away toward fixation of one allele.

This is another instance of a valley which selection cannot traverse.

*** Drift
- Changes in allele frequency due solely to chance effects.
- Moral is not to assume that every trait is adaptive.
- Especially important in our current world of many species with
  severely reduced population sizes.  Note: this will reduce the
  likelihood that these populations will be able to adapt to climate
  change.

Stochastic events from ecology have huge effects on small populations.
- Allee effect
- catastrophe

file:data/drift-in-pops.png

#+begin_src lisp
  ;; the above generated with the following
  (loop :for pop-size :in '(100000 1000000) :do
     (with-open-file (out (format nil "/tmp/pop~d.data" pop-size) :direction :output)
       (gen-drift out :pop-size pop-size)))
#+end_src

*** Wright-Fisher model of random genetic drift
- depiction of the sampling process in populations of finite size
- the distribution of frequencies of gametes is expected to follow a
  binomial distribution

Process
1. N individuals in $P_0$ produce \infin gametes
2. 2N gametes are selected from the pool of \infin gametes
3. N individuals in $P_1$

Consider a diploid population of N individuals w/2N genes

when 2N gametes are sampled from the \infin gamete pool the probability
$P_i$ of i genes of type A is given by

\begin{equation*}
  P_i = \frac{(2N)!}{i! (2N-i)!} p^i q^{2N-i}
\end{equation*}

Make some graphs of a population of a given size with some number of
genes, showing the frequency of each gene over a number of generations
to show the increased effect of chance in smaller populations.
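
A minimal sketch of such a simulation, using only the sampling step
described above (2N gametes drawn from an effectively infinite pool).
The function name =wright_fisher= and its parameters are assumptions
for this illustration, not the code used for the figure in these
notes.

#+begin_src python
  import random

  def wright_fisher(n_individuals, p0=0.5, generations=100, seed=None):
      """Simulate the frequency of one allele under pure drift.

      Each generation draws 2N gametes; each draw is the allele with
      probability p, its current frequency (binomial sampling)."""
      rng = random.Random(seed)
      n_genes = 2 * n_individuals
      p, trajectory = p0, [p0]
      for _ in range(generations):
          copies = sum(rng.random() < p for _ in range(n_genes))
          p = copies / n_genes
          trajectory.append(p)
      return trajectory

  # smaller populations wander further from the starting frequency
  for n in (10, 100, 1000):
      trajectory = wright_fisher(n, seed=1)
      print(n, [round(x, 2) for x in trajectory[::20]])
#+end_src

Once the frequency reaches 0 or 1 it stays there, since there is no
mutation in this sketch to reintroduce the lost allele.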

*** Pea pod
A pot of 100 seeds, 50 round and 50 wrinkled.

Enumerate all possible samples (here of four seeds) and the related
probability of each sample.

- Probability of four round seeds = $\frac{4!}{4! 0!} 0.5^4 0.5^0 = 2^{-4}$
- Probability of three round seeds = $\frac{4!}{3! 1!} 0.5^3 0.5^1 = 4 \times 2^{-3} \times 2^{-1}$
- Probability of two of each = $\frac{4!}{2! 2!} 0.5^2 0.5^2 = 6 \times 2^{-2} \times 2^{-2}$
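
The full enumeration for a sample of four seeds can be sketched with
the binomial formula.  This assumes each seed is drawn independently
with probability 0.5 of being round (sampling with replacement);
=binomial_pmf= is a made-up helper name.

#+begin_src python
  from math import comb

  def binomial_pmf(k, n, p):
      """Probability of exactly k successes in n independent draws."""
      return comb(n, k) * p**k * (1 - p)**(n - k)

  # number of round seeds in a sample of four, from 4 down to 0
  for k in range(4, -1, -1):
      print(k, "round:", binomial_pmf(k, 4, 0.5))
#+end_src

The five probabilities are 0.0625, 0.25, 0.375, 0.25 and 0.0625, which
sum to 1.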

** Population Size and Neutral Theory <2013-01-29 Tue>
*** Wright-Fisher model of random genetic drift
N individuals \rightarrow \infty gametes \rightarrow N individuals \rightarrow \infty gametes

\begin{equation*}
  P_i = \frac{(2N)!}{i! (2N-i)!} p^i q^{2N-i}
\end{equation*}

The model describes an idealized population and makes the following assumptions
- all individuals contribute gametes equally to the next generation
- population size is constant
- non-overlapping generations

*** effective population size
Ne: The size of an idealized population which would have the same
effect of random sampling on gene frequency as that in the actual
population.

N: The observed actual population size (census size).

Generally Ne << N, or roughly $\frac{N}{3}$, because
- pre- and post-reproductive individuals should not be counted
- difference in ratio of males to females, in which case
  \begin{eqnarray*}
    N_e &=& \frac{4 N_m N_f}{N_m + N_f}, \text{ where}\\
    N_m &=& \text{num males}\\
    N_f &=& \text{num females}
  \end{eqnarray*}

Also due to long-term variations in pop-size
- environmental catastrophes
- cyclical model of reproduction
- local extinction and re-colonization events

The long term size is the harmonic mean of population sizes (where n
is the number of generations).  Thus dips in population size affect
the long term size *more* than temporary peaks.
\begin{equation*}
  N_e = \frac{n}{\frac{1}{N_1} + \frac{1}{N_2} + \ldots + \frac{1}{N_n}}
\end{equation*}

Long bottlenecks have *much* more of an impact on maintained genetic
diversity than short bottlenecks.  It is impressive how much variation
may be maintained even through very narrow short bottlenecks.
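
Both corrections to Ne are easy to try out numerically.  The sketch
below uses made-up function names (=ne_sex_ratio=, =ne_harmonic=) and
made-up population sizes chosen only to show the effect.

#+begin_src python
  def ne_sex_ratio(n_males, n_females):
      """Effective size with unequal numbers of breeding males and females."""
      return 4 * n_males * n_females / (n_males + n_females)

  def ne_harmonic(sizes):
      """Long-term effective size: harmonic mean of per-generation sizes."""
      return len(sizes) / sum(1 / n for n in sizes)

  print(ne_sex_ratio(10, 90))  # 36.0, although the census size is 100
  # one generation at 10 pulls the long-term size far below 1000 ...
  print(round(ne_harmonic([1000, 1000, 10, 1000, 1000])))
  # ... while one generation at 100000 barely raises it
  print(round(ne_harmonic([1000, 1000, 100000, 1000, 1000])))
#+end_src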

*** gene substitution and related topics (definitions)
- gene substitution :: the process whereby a mutant allele completely
     replaces the predominant or wild-type allele in the population
- fixation time :: time (usually measured in # generations) it takes
     for a mutant allele to become fixed in a population
- fixation probability :: the chance that a new mutant allele will
     reach fixation in a population
- rate of gene substitution :: number of substitutions or fixations
     per unit time

*** fixation probability
- fixation probability determined by
  1. initial frequency (often 1/2N)
  2. selective advantage
  3. the effective population size Ne \leftarrow very important

The fate of an allele depends on a combination of s and Ne
- Selection coefficient (s), with relative fitness w = 1 + s
  | s=0 | neutral     |
  | s>0 | beneficial  |
  | s<0 | deleterious |
- in small populations basically everything is neutral

P, the probability of fixation
- neutral :: $P = \frac{1}{2N}$
- small selection coefficients :: $P = \frac{2s}{1 - e^{-4Ns}}$
- positive values of s and large N :: $P \approx 2s$

**** example of fixation probability
- Ne = 1000 = N
- a new mutant arises

so initial frequency = 1/2000

1. let this new mutant be neutral, so s = 0

   then probability of fixation P = 1/2000.

2. with a slight selective advantage, e.g., s=0.01

   then $P = \frac{2s}{1 - e^{-4Ns}} \simeq 2s = 0.02$

3. with a slight selective disadvantage, e.g., s=-0.001

   then $P = \frac{2s}{1 - e^{-4Ns}} = 3.73 \times 10^{-5}$, well below
   the neutral value of 1/2000

#+begin_src lisp
  ;; fixation probability of a new mutant; undefined at s = 0 (use 1/2N)
  (defun adv (s N) (/ (* 2 s) (- 1 (exp (* -4 N s)))))
#+end_src

*** fixation time
the time required for fixation or loss of a neutral allele depends upon
- initial frequency
- population size N

times shorten as frequency approaches 1 or 0

For a new mutation (Kimura and Ohta 1969) with initial frequency
1/2N, the mean time to fixation is \approx
- neutral allele :: $\bar{t} = 4N$ generations
- allele with a selective advantage of s :: $\bar{t} = \frac{2}{s}\ln{(2N)}$ generations

**** example (let's say a mouse)
- Ne = $10^6$
- generation time = 2 years

For a neutral mutant allele it will take 4N = $4 \times 10^6$ generations = 8 mil. years.

For a slightly selective allele (s=0.01)

$\frac{2}{0.01} \times \ln{(2 \times 10^6)} \approx 2900$ generations $\approx$ 5,800 years
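
The mouse numbers can be reproduced with a short sketch.  It assumes
the approximations 4N generations (neutral) and (2/s) ln(2N)
generations (advantageous); the function names are made up for this
illustration.

#+begin_src python
  from math import log

  def fixation_time_neutral(n):
      """Mean generations to fixation of a new neutral mutation."""
      return 4 * n

  def fixation_time_selected(n, s):
      """Mean generations to fixation with selective advantage s."""
      return (2 / s) * log(2 * n)

  N, years_per_generation = 10**6, 2
  print(fixation_time_neutral(N) * years_per_generation)  # 8000000
  # roughly 5,800 years
  print(round(fixation_time_selected(N, 0.01) * years_per_generation))
#+end_src

Using ln(2N) in the second function is what reproduces the 5,800
years quoted in the example.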

*** gene substitutions (neutral mutations)
Rate of gene substitutions (reaching fixations) per unit time.

neutral mutations
- neutral mutation rate = \mu per gene per generation
- number of mutations arising at a locus in a diploid pop of size N =
  2N\mu per generation
- but the probability of fixation of each mutation is P = 1/2N

Therefore the substitution rate is

K = (number of mutations)(probability of fixation)

or

\begin{equation*}
  K = 2 N \mu P = (2N\mu)(1/2N) = \mu
\end{equation*}

substitution rate = mutation rate (thanks Kimura)

*** gene substitutions (advantageous mutations)
- advantageous mutation rate \mu per gene per generation
- mutations per locus is 2N\mu per generation
- probability of fixation is P=2s

therefore

K = (num of mutations) (probability of fixation)

\begin{equation*}
  K = 2N\mu P = (2N\mu)(2s) = 4Ns\mu
\end{equation*}
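
The contrast between the two rates can be sketched numerically.  The
values of N, \mu and s below are made up for illustration, and the
advantageous case relies on the approximation P = 2s, which only
holds for positive s and large N.

#+begin_src python
  def substitution_rate_neutral(n, mu):
      """K = (2N mu)(1/2N); the population size cancels, leaving mu."""
      return (2 * n * mu) * (1 / (2 * n))

  def substitution_rate_advantageous(n, mu, s):
      """K = (2N mu)(2s) = 4 N s mu; here the population size matters."""
      return (2 * n * mu) * (2 * s)

  mu, s = 1e-6, 0.01  # hypothetical mutation rate and selection coefficient
  for n in (1000, 100000):
      print(n, substitution_rate_neutral(n, mu),
            substitution_rate_advantageous(n, mu, s))
#+end_src

The neutral rate stays at \mu for both population sizes, while the
advantageous rate grows in proportion to N.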

*** getting started with neutral theory
Rates and patterns of nucleotide substitutions, and molecular clock.

Darwin didn't know the /mechanisms/ of heredity.

Mendel was treated as a crackpot because most people studied
bio-metric continuous quantitative traits with normal distributions.
So Mendel went back to gardening.

Neo-Darwinian theory: Darwin and Mendel combined.
- mutation ultimate source of genetic variation
- natural selection given the dominant "creative" role in shaping
  genetic make-up of populations

Selectionism
- natural selection *only* evolutionary force
- polymorphism must be maintained by balancing selection
- gene substitution must be due to selection of advantageous mutations

* Terms
  :PROPERTIES:
  :CUSTOM_ID: terms
  :END:
- epigenetic :: Heritable changes in gene expression caused by
     mechanisms other than changes in the underlying DNA sequence.
- co-dominance :: when the heterozygote expresses both component alleles
- overdominance :: when the heterozygote is more fit than either
     homozygote, this is also called "heterozygote superiority"
- balancing selection :: selection in overdominance
- stabilizing selection :: selection in overdominance
- underdominance :: (or heterozygote inferiority) when the
     heterozygote has the lowest fitness
- drift :: stochastic evolutionary force which overwhelms selection in
     small populations
- locus :: chromosomal location of a gene (often a synonym for gene)
- allele :: alternate forms of a gene
- allele frequencies :: relative proportions of alleles in a population
- genotype :: genetic constitution of an individual
- phenotype :: observable characteristic or trait of an organism
- discrete trait :: finite number of phenotypes in discrete classes,
     often controlled by one or a few genes
- continuous or quantitative trait :: what it sounds like, controlled
     by at least 100 genes, most traits fall into these categories
- homozygous vs. heterozygous :: whether alleles are of the same type
     or different
- genotypic vs. phenotypic ratios :: 
- dominant vs. recessive allele :: expressed
- evolution :: change in allele frequency
- natural selection :: differential reproduction of genetically
     distinct individuals or genotypes within a population
- fitness (w) :: measure of an individual's ability to reproduce
  - absolute fitness -- total progeny (might only count progeny which
    make it to reproductive age)
  - relative fitness -- progeny relative to rest of the population,
    the most fit genotype is assigned a fitness of 1