PROFESSOR: Why don’t

we get started? So today we’re going to talk

about comparative genomics. And first, a brief review

of what we did last time. So last time we talked

about global alignment of protein sequences,

including the Needleman-Wunsch and Smith-Waterman algorithms. And we talked about gap

penalties a little bit and started to introduce the

PAM series of matrices which are well described in the text. So what I wanted to

do is just briefly go over what I started to

talk about at the end, about Markov models of evolution. Because they’re relevant,

not only for the PAM series, but also for some other

topics in the course. A short unit on

molecular evolution we’re going to do today. And then they also introduce

hidden Markov models that will come up

later in the course. So the example that we gave of

a Markov model was DNA sequence evolution in successive

generations where the observation here is that the

base at a particular position at generation n+1 here depends

on the base at that position at generation n. But conditional on knowing

the base at generation n, you don’t learn

anything from knowing what that base was

at generation n-1. That’s the essence of

the Markov properties. So here’s the formal

definition, as we saw before. Any questions on this? And I asked you to review

your conditional probability if it was rusty, because

that’s very relevant. OK so in this example you might,

if you had a random variable x that represented the genotype

at a particular locus, let’s say the

apolipoprotein locus, and it had alleles

A and a, then you might write something

like the probability that Bart’s genotype

is a homozygous given his grandfather’s genotype

and his dad’s genotype is equal to just the conditional

probability given his father’s genotype. So those are the sorts

of things that you can do with Markov chains. So when you’re working

with Markov chains, matrices are extremely useful. So another thing that will

be helpful in this part of the course and then again

in Professor Fraenkel’s part, where he’s

talking– he’ll use also some ideas from

linear algebra– is to review your basics

of matrices and vector multiplication. OK so, if you now make a

model of molecular evolution where sn is– so s

is this variable that represents a particular

base in the genome and n is the generation. And then to describe

the evolution of this base over

time, we’re going to imagine that its evolution

is described by a Markov chain. And a Markov chain can be

described by, in this case, a 4 by 4 matrix, since there are

four possible nucleotides at generation i, for example,

and four possible at generation i plus one. And you simply need

to specify what the conditional probability

that the base will be, of any possible base, at

the next generation, given what it is at the current generation. So here’s the matrix up here. And it describes, for

example, the probability of going from a c to an a. So then in general

you might know that that base is a g

at the first generation. But in general you

won’t necessarily know what base it is if

you’re modeling events that may happen in the future. So the most general

way of describing what’s happening at

that base is a vector of probabilities of the

four possible bases– so qa, qc, qg, qt, with those

probabilities summing up to 1. And so then it turns out

that, with this notation, the content of the

vector at generation n plus 1 is equal to simply the

vector at generation n multiplied on the

right by the matrix, just using the standard

vector matrix multiplication. So for example,

if we have vectors with four things in

them, and we have a 4 by 4 matrix, then to get

this term here in this vector you multiply– you basically

take the dot product of this vector times

this first column. The vector times

the first column will give you that entry. And this times this

column will give you that entry in the

vector, and so forth. And you can see that the

way this makes sense, the way the matrix is defined,

that first column tells you the probability that you’ll have

an a at the next generation, conditional on each

of the four bases at the previous generation. And so you just multiply by

the probabilities of those four bases times the appropriate

conditional probability here. And those are all

the ways that you can be an a

at generation n plus 1. And so it’s also true that if

you want to go further in time, so from generation

n to generation n plus k– k is some integer–

then this just corresponds to sequential multiplication

by the matrix k– I’m sorry, by the matrix p. So qn plus 1 equals qn times p. And then qn plus 2 will

equal q– I’m sorry. That’s a really bad q,

but– qn plus 1 times p, which will equal qn times p

squared, where p squared means matrix multiplication, again

using the standard rules of matrix multiplication

that you can look up.
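To make this concrete, here is a minimal sketch in Python (using NumPy; the 4 by 4 matrix P below is a made-up example, not one from the lecture) of propagating the base distribution forward by repeated vector-matrix multiplication:

```python
import numpy as np

# Hypothetical 4x4 transition matrix; rows = current base (A, C, G, T),
# columns = next base. Each row sums to 1.
P = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
])

q = np.array([0.0, 0.0, 1.0, 0.0])  # start as 100% G

# q_{n+1} = q_n P, applied k times gives q_{n+k} = q_n P^k
for _ in range(5):
    q = q @ P
print(q)  # distribution after 5 generations

# Equivalently, use the matrix power directly
q0 = np.array([0.0, 0.0, 1.0, 0.0])
print(q0 @ np.linalg.matrix_power(P, 5))
```

So one of the things you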

might think about here is what happens

after a long time? If you start from some vector

q– for example, q is (0, 0, 1, 0). That is, there’s a 100% chance of g. What would happen if

you run this matrix on that over a long

period of time? And we’ll come back to that

question a little bit later. So thinking about the

Dayhoff matrices– and again, I’m not going to go

into detail here, because it’s well

described in the text. Dayhoff looked at these highly

identical alignments, these 85% identical alignments, and

calculated the mutability of each residue and these

mutation probabilities for how often each residue

changes into each other one and then scaled them so that on

average the chance of mutating is 1% and then took

these probabilities, these frequencies,

of mutation M(a, b), divided by the frequency of

the residue b, took the log, and then just multiplied by

two just for scaling purposes, and came up with a– and then

rounded to the nearest integer, again for practical purposes.
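In symbols, the recipe just described is roughly

$$ s(a,b) = \mathrm{round}\!\left( 2\,\log \frac{M(a,b)}{f(b)} \right), $$

where M(a,b) is the observed mutation probability and f(b) is the background frequency of residue b; the exact log base and scaling are Dayhoff’s conventions, which are spelled out in the text. And that’s how she came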

up with her PAM 1 matrix. And then you can use

matrix multiplication to derive all the

successive PAM series. Just multiply the PAM1 matrix

times itself to get the PAM2 and recalculate the scores. So if you actually use

PAM matrices in practice there are some issues. And these are also well

described in the text. And the fundamental

problem seems to be that the way the

proteins evolve over short periods of time and

the way they evolve over long periods of time

is somewhat different. And basically this model, this

Markov model of evolution, is not quite right, that things

don’t– what you see in short periods of time– it does not

match long periods of time. And why is that? A number of possible reasons. But keep in mind that

in addition to proteins simply changing their

amino acid sequence, other things can

happen in evolution. You can have insertions

and deletions that are not captured

by this Markov model. And you can also have birth

and death of proteins. A protein can evolve according

to this model for millions of years. And then it can become unneeded,

and just be lost, for example. So real protein evolution

is more complicated. And so about 20 years ago

or so Henikoff and Henikoff decided to develop a

new type of matrix. And the way they did it was

to identify these things called blocks, which are regions

of reasonably high similarity, but not as high as

Dayhoff required. So there were many more–

Dayhoff was working in the ’70s. They were working in the ’90s. So there were many more

proteins available. And they could identify,

with confidence, basically a much larger data

set, including more distantly related, but

still confidently alignable protein sequences. And they derived new parameters. And in the end this matrix they

came up with called BLOSUM62 seems to work well in

a variety of contexts when comparing moderately

distantly related proteins or quite

distantly related proteins. If you’re comparing

very similar proteins it almost doesn’t matter. Any reasonable

matrix will probably give you the right answer. But when you’re comparing

the more distant ones, that’s where it

becomes challenging. And so this is the

BLOSUM62 matrix here. And you can see it’s similar

to the PAM matrices in that– I think we showed PAM 250

last time– in that you have a diagonal with all

positive numbers. And it’s also similar in

that, for example, tryptophan down here has a higher

positive score than others. It’s plus 9. And cysteine is also

one of the higher ones. But those are less extreme. And basically, maybe over short

periods of evolutionary time, you don’t change your cysteine. But over longer periods

there is some rewiring of disulfide bonding, and

so cysteines can change. Something like that

may be going on. So we’ve just talked about

pairwise sequence alignments. But in practice you

often have, especially these days you often have,

many proteins. So you want to align three or

five or 10 different proteins together to find

out which residues are most conserved, for example. And so basically

the principles are similar to pairwise alignment. But now you want

to find alignments that bring the greatest

number of single characters into register. So if you’re aligning

three proteins, you really want to have

columns where all three are the same residue, or

very similar residues. And you need to then

define scoring systems, define gap penalties,

and so forth. This is also reasonably

well described in the text. I just wanted to

make one comment about the sort of computational

complexity of multiple sequence alignment. So if you think about

pairwise sequence alignment, say with Needleman-Wunsch

or Smith-Waterman, with a sequence

of length– let’s say you’re aligning one

protein of sequence length n to another of length n, what is

the computational complexity of that calculation in using

this big O notation that we’ve talked about? Let’s just say

standard gap penalties, linear gap penalties. Anyone? Or does it matter? Yeah, go ahead. STUDENT: n squared. PROFESSOR: It’s n squared. So even though this has gaps,

with local– with ungapped it was also n

squared, or n times n. So why is it that gaps

don’t make it worse? Or do they? Any thoughts on that? STUDENT: You put a constant

number of gaps in the sequence. So it’s just that the

essence of the complexity should still be n squared. PROFESSOR: You put a

constant number of gaps? The– I mean, yeah– let’s just

hear a few different comments. And then we’ll try to summarize. Go ahead. STUDENT: So we’re

still only filling out an n by n matrix

at any given time. PROFESSOR: You’re still filling

out an n by n matrix, right. There happen to be

a few more things. The recursion is slightly

more complicated. But there’s a few

more things you have to calculate

to fill in each. But it’s like three

things, or four things. It’s not– so it doesn’t

grow with the size. So it’s just still n squared,

but with a larger constant. OK, good.
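As a side note, here is a minimal sketch in Python of the Needleman-Wunsch fill with a linear gap penalty (the scoring values are arbitrary placeholders, not the course’s); you can see the n squared behavior directly: one pass over an (n+1) by (m+1) matrix, with a constant number of operations per cell.

```python
# Minimal Needleman-Wunsch fill, linear gap penalty.
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # substitution
                          F[i - 1][j] + gap,    # gap in y
                          F[i][j - 1] + gap)    # gap in x
    return F[n][m]  # optimal global alignment score

print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE"))
```

And then if you did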

affine gap penalty, remember where you

had a gap opening penalty and a gap

extension, what then? Does that make it worse? Or is it still n squared? STUDENT: I think

it’s still n squared. PROFESSOR: Why is that? STUDENT: Computing the affine

gap penalty is no more than o of n, right? PROFESSOR: Yeah,

basically with the affine you have to keep track of

two things at each place. So yeah, it is. You’re right. It’s still n squared. It’s just you got to keep track

of two numbers in each place there. OK, good. And so what about when

we go to three proteins? So how would you

generalize, let’s say, the Needleman-Wunsch algorithm

to align three proteins? Any ideas? What structure would

you use, or what– analogous to a matrix–

yeah, in the back. STUDENT: Another way to do

this would be have a 3D matrix. PROFESSOR: OK, a 3D

matrix, like a cube. And can everyone visualize that? So yeah, basically

you could have a version of Needleman-Wunsch

that was on a cube. And it started in

the 0, 0, 0 corner and went down to the n, n,

n corner, filling in in 3D. OK so what kind of

computational complexity do you think that

algorithm would have? STUDENT: n cubed? PROFESSOR: n cubed. Yeah, makes sense. There would be a similar

number, a few operations to fill in each

element in the cube. And there are n cubed of them. So the way that the problem

grows with n is as n cubed. And what about in general,

if you have k sequences? STUDENT: n to the k? PROFESSOR: n to the k. So is this practical? With three proteins and modern

computers you could do it. You could implement

Needleman-Wunsch on a cube. But what about with 20 proteins? Is that practical? So it’s really not. So if proteins are 500

residues long, then there are 500 to the 20th cells, right. It starts to explode. So that approach

really only works in two dimensions and a little

bit in three dimensions. And it becomes impractical. So you need to use a

variety of shortcuts. And so this is, again,

described pretty well in chapter six of the text. And a commonly used– if

you’re looking for a default multiple sequence aligner,

CLUSTALW is a common one. There’s a web

interface if you just need to do one or

two alignments. That works fine. You can also download a

version called CLUSTALX and run it locally. And it does a lot of things

with pairwise alignments and then combining the

pairwise alignments. It aligns the two

closest things first and then brings in the

next closest, and so forth. And it does a lot of

tricks that are– they’re basically heuristics. They’re things

that usually work, give you a reasonable

answer, but don’t necessarily guarantee that you will find the

optimal alignment if you were to do it on a 20 dimensional

cube, for example. So they work reasonably

well in practice. And then there’s a variety

of other algorithms. OK, good. So that’s a review of what

we’ve mostly been talking about. And now I want to introduce

a couple of new topics. So we’re going to briefly

talk a little bit more about Markov models

of sequence evolution. And these are closely related

to some classic evolutionary theory from

Jukes-Cantor and Kimura. So we’ll just

briefly mention that. And we’ll talk a little

bit about different types of selection that sequences can

undergo– so neutral, negative, and positive– and how

you might distinguish among those for protein

coding sequences. And this will basically

serve as an intro into the main topic today,

which is comparative genomics. And comparative genomics– it’s

not really a field, exactly. It’s more of an approach. But I wanted to give you

some actual concrete examples of computational biology

research, successful research that has led to various types of

insights into gene regulation, in this case,

mostly to emphasize that computational biology

is not just a bag of tools. We’ve mostly been

talking about tools. We introduced tools

for local alignment and multiple alignment and

statistics and so forth. But really it’s a

living, breathing field with active research. And even using–

comparative genomics is one of my favorite

areas within this field. Because it’s very powerful. And you can often use

very simple ideas. And simple algorithms

can sometimes give you a really interesting

biological result, if you have the right

sequences and ask the question the right way. So I have posted a dozen of my

favorite comparative genomics papers in a special

section on the website. Obviously I’m not asking

you to read all of these. But I’m going to give you a few

insights and approaches that were used in each of

these papers here, just to give you a flavor

of some of the things that you can do with

comparative genomics, in the hopes that this might

inspire some of your projects. So hopefully you’re

going to start thinking about finding teammates

and thinking about projects. And this will hopefully

help in that direction. Of course, they don’t have to be

comparative genomics projects. You could do anything

in computational biology or systems biology

in this class. But that’s just one area

to start thinking about. Yeah, I’ll also–

I’m sorry, I think I haven’t posted this yet. But I will also post

this review by Sabeti that has a good discussion of

positive selection a little bit later. Again, not required. All right, so let’s go

back to this question that I posed earlier. We have a Markov model of

DNA sequence evolution. And we– sn is the

base at generation n. And then what happens

after a long time? If you take any vector–

q, to start with, might be a known

base, for example– and apply that

matrix many times, what happens as n

goes to infinity? And so it turns out that

there’s fairly classical theory here that gives us an answer. This is not all the

theory that exists, but this describes

the typical case. So the theory says that if all

of the elements in the matrix are greater than 0,

and then of course all of the– pij’s, when you sum

over j, they have to equal 1. That’s just for it to be a

well-defined Markov chain. Because you’re

going from i to j. And so from any base

you have to go– the probability of going

to one of those four bases has to sum to 1. And so if those conditions

hold, then there is a unique vector r such

that r equals r times p. And the limit of

q times p to the n equals r, independent

of what q was. So basically, wherever

you were starting from– you could

have been starting from 100% g, or 50%

a, 50% g, or 100% c– you apply this

matrix many, many times, you will eventually

approach this vector r. And the theory doesn’t

say what r is, exactly. But it says that r

equals r times p. And that turns out to basically

implicitly define what r is. That is, you can solve

for r using that equation. And r, for this reason, because

the matrix doesn’t move r, r is called the

stationary distribution. And it’s often also called

the limiting distribution, for obvious reasons. And if you want to read more,

like where this theory comes from, here’s a

reasonable reference.
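If you want to see this numerically, here is a small Python sketch (with an arbitrary two-state example matrix, just for illustration): you can find r either by iterating q times powers of P until it stops changing, or by taking a left eigenvector of P with eigenvalue 1.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])  # arbitrary example; rows sum to 1

# Iteration: start anywhere, e.g. 100% in state 0
q = np.array([1.0, 0.0])
for _ in range(1000):
    q = q @ P
print(q)  # converges to the stationary distribution r

# Eigenvector: left eigenvectors of P are eigenvectors of P.T
vals, vecs = np.linalg.eig(P.T)
r = np.real(vecs[:, np.argmax(np.real(vals))])
r /= r.sum()  # normalize to a probability vector
print(r)      # same answer: [0.75, 0.25]
```

So any questions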

about this theory? All the elements

in the matrix have to be strictly greater

than 1– I’m sorry, strictly greater than 0. Otherwise, really no conditions. All right, question? Yeah, go ahead. STUDENT: Does the [INAUDIBLE]

distribution ever change, based on the sequence, or are

we assuming that it doesn’t? PROFESSOR: The theory

says it only depends on p. It doesn’t depend on q. So it depends on the model

of how the changes happen, the conditional probability

of what the base will be at the next

generation given what it is at the current generation. It doesn’t depend

where you start. q is what your

starting point is, what base you’re initially at. Does that make sense? And this is obviously

a very simplified case, where we’re just modeling

evolution of one base, and we’re not thinking

about whether the rates vary at different positions

or within– this is the simplest case. But it’s important to

understand the simplest case before you start

to generalize that. OK, so let’s do

some examples here. So here are some matrices. So it turns out the math is

a lot easier if you limit yourself to a two-letter

alphabet instead of four. So that’s what I’ve done here. So let’s look at these matrices

and think about what they mean. So we have two-letter alphabet. R is purine. Y is pyrimidine. These matrices describe

the conditional probability that, at the next generation,

you’ll be, for example– oops, here we go. That, for example, if

you start at purine, that you’ll remain purine

at the next generation. That would be 1 minus P. And

the probability that you’ll change to pyrimidine is P. And

the probability of pyrimidine will remain as a

pyrimidine is 1 minus P.
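Written as a matrix, with rows and columns ordered R, Y (writing p for the mutation probability), this is

$$ P = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}. $$

So what is the stationary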

distribution of this matrix? OK, so if p is small, this

describes a typical model, where most of the

time you remain– DNA replication and

repair is faithful. You maintain the same base. But occasionally a mutation

happens with probability p. Anyone want to guess what the

stationary distribution is or describe a strategy

for finding it? Like what do we know

about this distribution? Or imagine you

start with a purine and then you apply

this matrix many times to that vector that’s 1

comma 0, what will happen? Yeah, Levi. STUDENT: Probably 50-50 because

any way that you skew it, it would be pushed

towards the center because there’s more

[INAUDIBLE] the other. PROFESSOR: OK,

everyone get that? So Levi’s comment was

that it’s probably 50-50. Because mutation

probabilities are symmetrical. Purine-pyrimidine and

pyrimidine-purine are the same. So if you were to start

with say, lots of purine, then there will be more

mutation toward pyrimidine in a given generation. So if you think about this

is your population of R and that’s your population

of Y, then if this is bigger than that, you’ll tend

to push it more that way. And there will be less

mutation coming this way, until they’re equal. And then you’ll have equal

flux going both directions. So that’s a good way

to think about it. And that’s correct. Can you think of how

would you show that? What’s a way of solving for

the stationary distribution? Anyone? So remember, we’ll

just get back one. The theory says

that R equals RP. That’s the key. R equals RP. So what is R? Well we don’t know R. So we

let that be a general vector. So notice there’s only

one free parameter. Because the two components

have to sum to 1. It’s a frequency vector,

so x and 1 minus x. And we just multiply

this times the matrix. So you take x comma 1 minus x. And you multiply

it by this matrix. The matrix is 1 minus P P.

I’m using too much space here. I’ll just make it a little

smaller– P 1 minus P. And that’s going

to equal R. And so we’ll get x times 1

minus P plus– remember, it’s dot product of this

times this column, right? So x times 1 minus P

plus 1 minus x times P. That’s the first component. And the second

component will be xp plus 1 minus x times 1 minus p. OK, everyone got that? So now what do we do? STUDENT: r. PROFESSOR: What’s that? STUDENT: Make that

equal to the initial r. PROFESSOR: Yeah, make that

equal to the initial r. So it’s two equations

and– well, you really only need one equation here. Because we’ve already

simplified it. In general there will

be two equations. There will be one

equation that says that the components of

the vector sum to 1. And there will be another

equation coming from here. But we can just use

either one, either term. So we know that

the first component of a vector– if this vector

is equal to that vector, then the first components

have to be equal, right? So x equals x

times– times what? Times 1 minus p, just

combining these two. And then plus what are all the–

I’m sorry, that’s 1 minus p– 1 minus p here. And then there’s another

term here, minus another p. And then there’s a

term that’s just p. And so then what do you do? You just solve for x. And I think when

you work this out you’ll get two p x equals

p, so x equals 1/2. Right, everyone got that?
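Written out cleanly, that first-component equation is

$$ x = x(1-p) + (1-x)p = x - 2px + p \;\Longrightarrow\; 2px = p \;\Longrightarrow\; x = \tfrac{1}{2}. $$

OK, so yeah. So if x is 1/2, then the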

vector is 1/2 comma 1/2, which is the unbiased. All right, what about

this next matrix, right below– 1

minus p 1 minus q. p and q are two positive

numbers that are different. So now there’s actually

a different probability of mutating purine to pyrimidine

and pyrimidine to purine. So Levi, can we

apply your approach to see what the answer is? STUDENT: Not exactly. PROFESSOR: Not exactly? OK, yeah, it’s not as obvious. It’s not symmetrical anymore. But can anyone guess

what the answer might be? Yeah, go ahead Diego. STUDENT: It’ll go either

all the way to one side or the other, depending on q and p. PROFESSOR: All the way to

one side or all the way to the other? So meaning it’ll be all purine

or all pyrimidine again. STUDENT: Yeah,

depending on which– PROFESSOR: Which is bigger? OK, anyone else have

an alternative theory? Yeah, go ahead. What was your name again? STUDENT: Daniel. PROFESSOR: Sorry, Daniel? STUDENT: Daniel, yeah. PROFESSOR: Daniel. OK, go ahead. STUDENT: It’ll reach some

intermediate equilibrium once they balance

each other out. And that would be exactly–

I’m not sure– some ratio of q to p. PROFESSOR: OK. How many people think

that might happen? OK, some people. OK Daniel has maybe

slightly more supporters. So let’s see. So how are we going

to solve this? How do we figure out what the

stationary distribution is? You just use that same approach. So you can do– you

have x, 1 minus x, times that matrix, which has got

the 1 minus p, p, q, 1 minus q entries. OK, and so now you’ll

get x 1 minus p. Anyway, go through

the same operations. Solve for x. And you will get– I think I put

the answer on the slide here. You will get q over p plus q. So as Danny predicted, some

ratio involving q’s and p’s. And does this make sense? Seeing what the

answer is, can you rationalize why that’s true? STUDENT: It’s like a

kind of equilibrium. You have one mode of

force play pushing one way and another

different one in this case pushing the other. PROFESSOR: Yeah, that’s

basically the same idea. And so they have

to be in balance. So the one that has less, where

the mutation rate is a lower, will end up being bigger, so

that the amount that flows out will be the same as the

amount that flows in. You can apply Levi’s

idea of thinking about how much flux

is going in each way. So there’s going to be some

flux p in one direction, q in the other direction. And you want x times p to

equal 1 minus x times q. And this is the

value of x that works. OK, good?
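In symbols, the balance condition is

$$ x\,p = (1-x)\,q \;\Longrightarrow\; x = \frac{q}{p+q}, $$

which matches the q over p plus q on the slide. What about this guy down here? So this is a very special matrix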

called the identity matrix. And what kind of model

of evolution is this? STUDENT: There’s no mutation. PROFESSOR: There’s no evolution. This is like a perfect

replication repair system. The base never changes. So what’s a stationary

distribution? STUDENT: It’s all– PROFESSOR: What’s that? STUDENT: It’ll just

stay where it is. PROFESSOR: It’ll

stay where it is. That’s right. So any vector is

stationary for this matrix. Remember that the

theory said there’s a unique stationary

distribution. This seems to be inconsistent. Why is it not inconsistent? Sally? STUDENT: We defined all of the

variables to be greater than 0. So when you have anything that’s

[INAUDIBLE] that is equal to 0. PROFESSOR: Right, so a

condition of the theorem is that all the entries be

strictly greater than 0. And this is why. If you have 0s, in there

then crazy things can happen. Wherever you start, that’s where

you end up with this matrix. So every vector is stationary. And what about this crazy

matrix over here, matrix q? What does it do? Joe. STUDENT: It’s going to

swap them back and forth. PROFESSOR: It swaps

them back and forth. So this is like a

hyper mutable organism that has such a high

mutation rate that it always mutates every base

to the other kind. It’s never happy

with its genome. It always wants to switch

it, get something better. And so what can you say about

the stationary distribution for this matrix? Jeff? STUDENT: There isn’t

going to be one. PROFESSOR: There

isn’t going to be one? Anyone else? STUDENT: Well, actually, I

guess 1, 1, like 0.5, 0.5. PROFESSOR: 0.5, 0.5

would be stationary. Because you’re– STUDENT: But you

won’t converge to it. PROFESSOR: But you

won’t converge to it. That’s right. It’s

stationary, but not limiting. And again, the

theory doesn’t apply. Because there’s some

0s in this matrix. But you can still

think about that. OK, everyone got that? All right, good. OK so let’s talk now

about Jukes-Cantor. So Jukes-Cantor is very

much a Markov model of DNA sequence evolution. And it simply has– now

we’ve got four bases. It’s got probability alpha

of mutating from each base to any other base. And so the overall mutation

rate, or probability of substitution, at one

generation is three alpha. Because from the base G there’s

an alpha probability mutate to A, an alpha probability

to C, an alpha to T, so the three alpha. And you can basically

write a recursion that describes

what’s going on here. So if you start

with a G at time 0, the probability of a G at

time 1 is 1 minus 3 alpha. It’s a probability

that you didn’t mutate. But then, at generation

two, you have to consider two cases really. First of all, if you

didn’t mutate, that’s PG1. Then you have a 1

minus 3 alpha probability of not mutating

again, so remaining G. But you might have mutated. With probability 1

minus PG 1 you mutated. And then whatever

you were– might be a C– you have

an alpha probably of mutating back to G.

Does that make sense? Everyone clear why there’s a 3

in one place and only one alpha in the other? All right, so you can

actually solve this recursion. And you get this

expression here, P G of t equals 1/4 plus 3/4 E

to the minus 4 alpha t.
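Writing the recursion and its solution out cleanly:

$$ P_G(t+1) = P_G(t)\,(1-3\alpha) + \bigl(1-P_G(t)\bigr)\,\alpha, \qquad P_G(t) = \tfrac{1}{4} + \tfrac{3}{4}\,e^{-4\alpha t}. $$

OK so what does that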

tell you about– we know from our

previous discussion what the stationary distribution

of this Markov chain is going to be. What will it be? What’s the stationary

distribution? STUDENT: 1/4 of each. PROFESSOR: 1/4 of each. And why, Daniel, is that? STUDENT: Because the probability

of them moving to any base is the same? PROFESSOR: Right, it’s

totally symmetrical. So that has to be the

answer by symmetry. And you could solve it. You could use this same

approach with defining a value– the theory applies

if alpha is greater than 0 and less than 1– or

less than– I think it has to be less than

a quarter, actually, or something like that. And you can apply the theory. So there will be a

stationary distribution. You can set up a vector. Now you have to have four

terms in it and multiplication. And then you’ll get a

system of basically four equations and four unknowns. And you can solve that

system using linear algebra and get the answer. And yeah, the answer will

be 1/4, as you guessed. And so what this

Jukes-Cantor expression tells you is how quickly does

it get to that equilibrium. We’re thinking about G.

You can start at 100% G. And it will then approach 1/4. You can see 1/4

is clearly what’s going to happen in the limit. Because as t gets big that

second term is going to 0. And so what does the

distribution look like? How rapidly do you approach 1/4? You approach it exponentially. So you start at 1 here. And this is 0. This is 1/4. You’ll start here. And you’ll go like that. You go rapidly at the beginning. And then you get just

a very gradual approach to 1/4. So you can do a little bit more

algebra with this expression. And here’s where the really

useful part comes in. And you can show

that K, which we’ll define as the true number

of substitutions that have occurred at this particular

base that we’re considering, is related to D, where D is

the fraction of positions that differ when you just take

say the parental sequence and the daughter sequence,

the eventual sequence that you get to. You just match those two. And you count up

the differences. That’s D. And then K

is the actual number of substitutions

that have occurred. And those are related by this

equation, K equals minus 3/4 times the natural log of 1 minus 4/3 d. So let’s try to

think about, first of all, what is the

shape of that curve? What does that look like? Here’s 0. I’ll put 1 over here. So we all know that log–

if it was just simply log of something

between 0 and 1, it would look like

what– look like that. Starts from negative infinity

and comes up to 0 at 1. But it’s actually not log

of D. It’s log of 1 minus D, or 1 minus a constant times

D. So that will flip it. So the minus infinity

will be there. It will come in like that. And then we also have minus 3/4. There’s a minus in front

of this whole thing. So all these logs are of

numbers that are less than 1. So they’re all negative. But then it’ll get flipped. So it’ll actually

look like that. And it will go to

infinity where? Where does this go to infinity? So if this is now

K is on this axis. And yeah, sorry if

that wasn’t clear. D is here. So this is just

again, this is if we did log of D it

would look like this. If we do log of 1 minus

something times D, that’ll flip it. And then if we do minus that,

it’ll flip it again that way. OK so now K, as a function of

D, is going to look like this. Sometimes people

like to put– anyway, but let’s just think about this. So it’s going to go to

up to infinity somewhere. And where is that? STUDENT: 3/4. PROFESSOR: 3/4. So does that make sense? Can someone tell

us what’s going on and what is the use of

this whole thing here? Yeah, in the back. What’s your name? STUDENT: Julianne. PROFESSOR: Yeah, Julianne. Go ahead. STUDENT: [INAUDIBLE] 0. So part, it would give

you negative infinity. And so you just

solve for D in there. PROFESSOR: OK, so when D is

3/4 you’ll get 1 minus 1. You get 0. That’ll be negative infinity. And then there’s

a minus in front, so it’ll be positive infinity. So that’s true. And does that intuitively

make sense to you? We have a sequence. It’s evolving randomly,

according to this model. And then we have that

ancestral sequence. And then we have a modern

descendant of that sequence, millions of generations– or

maybe thousands of generations, or some large number

of generations away. We line up those two sequences. We count how many matches

and how many mismatches. What’s the fraction

of mismatches, of differences we have? Basically if that– let’s

look at a different case. What if d is very small? What if it’s like 1%. Then what happens? If d is small, turns out

k is pretty much like d. It grows linearly with

d in the beginning. So does that make sense? That makes sense. Because k is the true number

of substitutions that happen. When you go one

generation, the true number of substitutions and

the measured number of substitutions is the same. Because there’s

no back mutations. But when you go further,

there’s an increasing chance of a back– there’s

an increasing chance of a mutation, therefore

increasing chance that you also have

a back mutation. And so this is what

happens at long time. So basically this is linear

here and then goes up like that. And so what this

allows you to do is– d is something that

you can measure. And then k is something

that you want to know. The point is, if I measure

the difference between human and chimp sequence, it

might be only 1% different. And if I have an idea of

mutation rate per generation, I can figure out how

many generations apart, or how much time has passed,

since humans split from chimp. But if I go to mouse, where the

average base might be– there might be only a 50%

matching– if that’s true, there have been a

lot of changes there. There will be a lot of bases

that have changed once, as well as a lot that

may have changed twice, and may have actually

changed back. And so let’s say human

and mouse are 50% identical. That 50% identical–

I can’t just compare it to let’s

say the 1% with chimp and say it’s 50 times longer. That 50% will be

an underestimate of the true difference. Because there’s been some

back mutations as well. And so you have to

use this formula to figure out what the

true evolutionary time is, the true number of

changes that happened.
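A small Python sketch of this correction, straight from the formula above (the example values are made up):

```python
import math

# Jukes-Cantor correction: estimate the true number of substitutions
# per site, K, from the observed fraction of differing sites, D.
# Only defined for D < 3/4, where the log's argument stays positive.
def jukes_cantor_K(D):
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * D)

print(jukes_cantor_K(0.01))  # ~0.01: nearly linear for small D
print(jukes_cantor_K(0.50))  # ~0.82: well above the naive count of 0.5
```

Yeah, go ahead. STUDENT: Does simple count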

refer to just the difference in the amount of mutations? Or what’s– PROFESSOR: The simple count

is what you actually observe. So you have a

stretch of sequence– let’s say the beta globin

genomic locus in human. You line it up to the beta

globin locus in chimp. You count what fraction

of positions differ? What fractions are different? That’s d. And then k is– actually, it’s

slightly complicated here. Because if this is

human and that’s chimp, then k is more like–

because you don’t actually observe the ancestor. You observe chimp. So you have to go back to the

ancestor and then forward. So that’s the relevant

number of generations. And so k will tell

you how many changes must have occurred to

give you that observed fraction of differences. And for short

distances, it’s linear. And then for long, it’s

logarithmic, basically. Yeah, question. STUDENT: So I’m guessing all

of [INAUDIBLE] that selection is absent. PROFESSOR: Right, right. This is ignoring selection. That’s a good point. So think about this. And let me if other

questions come up. So this actually

came up the other day when we were talking about

DNA substitution models. So Kimura and

others have observed that transitions occur much

more often than transversions, maybe two to three

times as often, and so proposed a

matrix like this.
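For reference, the standard Kimura two-parameter matrix has this general form (rows and columns ordered A, C, G, T, with transition probability alpha larger than transversion probability beta):

$$ P = \begin{pmatrix} 1-\alpha-2\beta & \beta & \alpha & \beta \\ \beta & 1-\alpha-2\beta & \beta & \alpha \\ \alpha & \beta & 1-\alpha-2\beta & \beta \\ \beta & \alpha & \beta & 1-\alpha-2\beta \end{pmatrix}. $$

And now you can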

use what you know about stationary

distributions to solve for the limiting or stationary

distribution of this matrix. And actually, you will find

it’s still symmetrical. It’s a little bit

more complicated now, but you’ll still

get that 1/4, 1/4. But then more

recently others have observed that

really, dinucleotides matter in terms

of mutation rates, particularly in

vertebrates. So what’s special about vertebrates is

that they have methylation machinery that methylates

CpG dinucleotides on the C. And that makes those

C’s hypermutable. They mutate at about 10 times

the rate of any other base. And so you can give a

higher mutation rate to C, but that doesn’t

really capture it. It’s really a higher mutation

rate of C’s that are next to G’s. And so you can

define a model that’s 16 by 16, which has

dinucleotide mutation rates. And that’s actually

a better model of DNA sequence evolution. And it’s just the math

gets a little hairier if you want to calculate

stationary distribution. But again, it can be done. And it’s actually

pretty easy to simulate. Knowing that it will

converge to the stationary, you can just run the

thing many times. And you’ll get to the answer. And there’s even been

strand-specific models proposed, where there are

some differences between how the repair machinery treats

the two DNA strands that are related to transcription

coupled repair. So you actually get

some asymmetries there. And this is a

reasonably rich area. And you can look at some

of these references. All right, so one more

topic, while we’re on evolution– this

is very classical. But I just wanted to make sure

that everyone has seen it. If you are looking specifically

at protein coding sequences, exons, and you know the reading

frame, you can just align them. And then you can look

at two different types of substitutions. You can look at what are

called the nonsynonymous substitutions, so changes

to the codons that change the underlying amino acid,

the encoded amino acid. And you define

often a term that’s either called Ka or dN,

depending on who you read, that is the fraction of

nonsynonymous substitutions divided by nonsynonymous sites. And in this case let’s

do synonymous first. So you can also look

at the other changes. So these are now

synonymous changes which are base

changes to triplets that do not change the

encoded amino acid. So in this case, there

are three of those. And a lot of

evolutionary approaches are just based on calculating

these two numbers. You count synonymous changes. You divide by

synonymous sites, count non-synonymous substitutions,

divide by non-synonymous sites. And so what do we

mean synonymous site? Well if you have only amino

acids that are fourfold, that have fourfold

degenerate codons, which is all of them are

like that in this case, then for example GG– or

let’s see what’s up here. Yeah, CC anything

codes for proline. Do we have any of those? Actually, these are not

all fourfold degenerate. I apologize. But glycine, for example–

so GG anything is glycine. So in this triplet,

this triplet here, there’s one synonymous site. The third position is

a synonymous site. You can change that without

changing the amino acid. But the other two

are non-synonymous. So to a first

approximation, you take non-synonymous

substitutions and divide by the number

of codons– I’m sorry, the number of codons

times 2, since there are two non-synonymous

positions in each codon. And you take synonymous

substitutions, divide by the number of codons. OK, does that make sense? One per codon.
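Here is a toy Python version of that crude first approximation (the codon table is truncated to just the codons in the made-up example sequences; a real implementation would count synonymous and non-synonymous sites per codon from the full genetic code):

```python
# Toy Ka/Ks counter: assume two non-synonymous sites and one
# synonymous site per codon, as in the approximation above.
CODON_TO_AA = {"GGA": "G", "GGC": "G", "CCA": "P", "CCG": "P",
               "ATG": "M", "ATA": "I"}  # truncated table

def ka_ks(seq1, seq2):
    syn = nonsyn = 0
    n_codons = len(seq1) // 3
    for i in range(0, n_codons * 3, 3):
        c1, c2 = seq1[i:i+3], seq2[i:i+3]
        if c1 == c2:
            continue
        if CODON_TO_AA[c1] == CODON_TO_AA[c2]:
            syn += 1      # codon changed, amino acid did not
        else:
            nonsyn += 1   # codon change altered the amino acid
    Ka = nonsyn / (2 * n_codons)  # two non-synonymous sites per codon
    Ks = syn / n_codons           # one synonymous site per codon
    return Ka, Ks

print(ka_ks("GGACCAATG", "GGCCCGATA"))  # approximately (0.167, 0.667)
```

OK and so what do you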

then do with this? You can correct this

value using– basically this is the

Jukes-Cantor correction that we just calculated,

this minus 3/4 log of 1 minus 4/3 d. That applies to codon evolution

as well as individual base evolution. And what people

often do with this is they calculate Ka

and Ks for a whole gene. Let’s say you have

alignments of all human genes to their orthologs

in mouse– that is, the corresponding

homologous gene in mouse. And you calculate Ka Ks. And then you can

look at those genes where this ratio is

significantly less than 1, or around 1, or greater than 1. And that actually

tells you something about how that– the

type of selection that that gene is experiencing. So what would you

expect to see– or if I told you

we’ve got two genes and the Ka/Ks ratio

is much less than 1. It’s like 0.2. What would that tell you? Or what could you infer

about the selection that’s happening to that gene? Ka/Ks is much less than 1. Any ideas? Julianne, yeah. STUDENT: The protein

sequence is important– or the amino acid sequence. PROFESSOR: Yeah, exactly. The amino acid

sequence is important. Because you assume

that those synonymous sites and non-synonymous

sites– they’re going to mutate at

the same rate, right? The mutation processes don’t

know about protein coding. So what you’re seeing

is an absence, a loss, of the non-synonymous changes. 80% of those

non-synonymous changes have been kicked

out by evolution. You’re only seeing 20%. And you’re using, assuming

the non-synonymous are neutral– I’m sorry. I seem to have trouble

with these words today. But you assume that the

synonymous ones are neutral. And then that

calibrates everything. And then you see that the

non-synonymous are much lower. Therefore you must have

lost– these ones must have been kicked

out by evolution. So the amino acid

sequence is important. And it’s optimal in some sense. The protein works– the organism

does not want to change it. Or changes to that

protein sequence make the protein worse. And so you don’t see them. And that’s what you see

for most protein coding genes in the genome– a Ka/Ks

ratio that’s well below one. It says we care

what the protein is. And it’s pretty good already. And we don’t want to change it. All right, what

about a gene that has a Ka/Ks ratio of around 1? Anyone have an idea what would

that tell you about that gene? There are some– Daniel? STUDENT: The sequence is– it

doesn’t particularly matter. Maybe it’s a non-coding,

non-regulatory patch of DNA. I assume there

must be something. PROFESSOR: Yeah, so it could

be that it’s not really protein coding after all. It’s non-coding. Then this whole triplet thing we

were doing to it is arbitrary. So you don’t expect any

particular distribution. That’s true. Any other possibilities? Yeah, Tim. STUDENT: Could be that there

are opposite forces that are equilibrating. For example, we’re

taking the unit of the gene. But maybe in one

half of the gene there’s a strong selective

pressure for non-synonymous and in the other half it’s

strong selective pressure for synonymous. Alternatively, it could be

in the same par of the gene, but it’s involved in

two different processes. It’s pleiotropic. So in one process it’s

selecting this one thing. PROFESSOR: Yeah, or

one period of time, if you’re looking at 10

million years of evolution, it could have been for this

first five million years it was under negative selection, and

then it was under positive. And it averages out. Yes, all those things are

possible, but kind of unusual. And so maybe if

you saw that the– if you plotted

Ka/Ks along the gene and you saw that it was high

in one area and low in another, then that would tell

you that you probably shouldn’t be taking the

average across the gene. And that would be a

good thing to look for. But what if– again, so

we said if Ka/Ks is near 1 it could be that it’s not really

a protein coding gene at all. That’s certainly possible. It could also be though

that it’s a pseudogene. Or it’s a gene that is no

longer needed by the organism. It still codes for protein,

but the organism just couldn’t care less

about its function. It’s something that maybe

evolved in some other time. It helps you adapt to

when the temperature gets below minus 20. But it never gets

below minus 20 anymore. And so there’s no selection

on it, or something like that. So neutral indicates– this

is called neutral evolution. And then what about a gene which

has a Ka/Ks ratio significantly greater than 1? Any thoughts on what that might

mean and what kind of genes might happen to–

yes, what’s your name? STUDENT: Simona. PROFESSOR: Simona, go ahead. STUDENT: It might be a gene

that’s selected against, so something that’s detrimental

to the cell or the organism. PROFESSOR: It’s detrimental–

so the existing protein is bad for you, so you

want to change it. So it’s better to change

it to something else. That’s true. Can you think of an example

where that might be the case? STUDENT: A gene that

produces a toxin. PROFESSOR: A gene

that produces toxin. You might just lose

the gene completely if it produced a toxin. Any other examples you can

think of or other people? Yeah, Jeff. STUDENT: Maybe a

pigment that makes the organism more susceptible

to being eaten by a predator. PROFESSOR: OK, yeah if

it was a polar organism and it happened to have this

gene that made the fur dark and it showed up against the

snow, or something like that. And you can imagine that. Or a very common

case is, for example, a receptor that’s used by

a virus to enter the cell. It probably had

some other purpose. But if the virus

is very virulent, you really just want

to change that receptor so that the virus can’t

attack it anymore. So you see this kind

of thing is much rarer. It’s only less than

1% of genes probably are under positive selection,

depending on how you measure it and what time

period you look at. But it tends to be really

recent, really strong selection for changing the

protein sequence. And the most common– well,

probably the most common– is these immune arms races

between a host and a pathogen. But there are other cases too. You can have very

strong selection where– well, I don’t

want to– basically where a protein is maladapted, like

the organism moves from a very cold environment to a

very warm environment. And you just need to

change a lot of stuff to make those proteins

better adapted. Occasionally you can get

positive selection there. Yeah, go ahead. STUDENT: So the situation

where K or Ks is 1– could it be possible that

the mRNA is under selection? PROFESSOR: Yeah, so that

basically we have always been implicitly assuming that

the synonymous substitution rate was neutral. But it could actually

be it’s not neutral. That’s under negative

selection too. And it happens

that they balance. That’s also possible. So for that, to

assess that, you might want to compare the synonymous

substitution rate of that gene to neighboring genes. And if you find it’s

much lower, that could indicate that

the coding sequences– the third base of codons

is under selection– could be for splicing, maybe. It could be for RNA secondary

structure, translation, different other–

that’s a good point. So yeah, you guys have

already poked holes in this. This is a method. It gives you something. You’ll see it used. It gives you some inferences. But there are cases where

it doesn’t fully work. OK, good. So in the remaining

time I wanted to do some examples of

comparative genomics. So as I mentioned

before, these are chosen to just give you some

examples of types of things you can learn about

gene regulation by comparing genomes again,

often by using really simple methods,

just blasting all the genes against each

other or things like this. And also, if you do choose

to read some of these papers, it can give you some

experience looking at this literature in

regulatory genomics. So the papers I’ve chosen–

we’ll start with Bejerano et al from 2002, who basically sought

to identify regulatory elements that are things that are

under evolutionary constraint. That’s all he was

trying to find. Didn’t know what

their functions were. But they turned out to be

interesting nonetheless, which is maybe a

little surprising. And then this other work from

Eddy Rubin’s lab and others– Steve Brenner’s lab–

actually characterized some of these extremely

conserved regions and assessed their function. And then Bejerano came

back a few years later and actually had a paper about

where these extremely conserved regions actually came from. So we’ll talk about those. Then we’ll look at

some papers that have to do with inferring

the regulatory targets of a transacting factor. And the factors that

we’ll consider here will be microRNAs,

mostly, either trying to understand what the

rules are for microRNA targeting and these

Lewis et al papers, or trying to identify

the regulatory targets in the genome. And then, time permitting, we’ll

talk about a few other examples of slightly more exotic things. Graveley identified

a pair– or pairs– of interacting

regulatory elements through a clever comparative

genomic approach. And then I’ll talk about

these two examples at the end if there’s time, where a new

class of transacting factors was inferred from the

locations of the encoded genes in the genome. And also an inference was

made about the functions of some repetitive

elements from, again, looking at the matching

between these elements and another genome. All right, so first

example– Bejerano “Ultraconserved elements.” So they defined, in a fairly

arbitrary way, ultraconserved elements as unusually

long segments that are 100% identical between

human, mouse, and rat. This was in 2000–

I’m sorry, I might have the wrong– it’s

either 2004 or 2002. I forget. This was basically when the

first three mammalian genomes had been sequenced, which

were human, mouse, and rat. And there were whole

genome alignments. So they basically said let’s

try to use these whole genome alignments to find

what’s the most conserved thing in mammals. So they wanted to see if

there’s anything 100% conserved. And so they did

statistics to say what’s an unusually long

region of 100% identity. Any ideas how you would do

that calculation, what kind of statistics you would use? They used a really

simple approach. What they did was they

took one megabase segments of the genome, assuming it

might vary across the genome. They took ancestral repetitive

elements– so repetitive elements that were

inserted, that were present in mouse,

rat, and human– and assumed that they

were neutrally evolving, they were not under selection. And then therefor you could look

at the number of differences and get an idea what the

background rate of mutation is. And they use that. And they found that

that rate was– this is from their supplementary

data– that was never greater than 0.68. And so they just said well, if

we have a probability of– I’m sorry. One is heads. So if they’re all

three the same– yeah, so if we have a probability

of 0.7 of heads, meaning that they’re

all three the same, then the chance that you

have 200 heads in a row would be P to the 200, just like [INAUDIBLE] trials. And you can just multiply that

just like [INAUDIBLE] trials. And you can just multiply that

times the size of the genome. And you say it’s extremely

unlikely that you’ll ever see anything where there’s 200

identical nucleotides in a row.
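The back-of-the-envelope version of that calculation in Python (0.7 and the genome size are the round numbers from the argument above):

```python
# Chance of 200 positions in a row being identical across all three
# genomes, if each position matches with probability at most 0.7,
# times the rough number of possible start positions in the genome.
p_same = 0.7
run_prob = p_same ** 200        # ~1e-31
genome_size = 3e9               # rough size of the human genome
print(run_prob, run_prob * genome_size)  # expected count: ~3e-22
```

So that’s what they defined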

as an ultraconserved element. So it all seems

very silly for now, until you actually

get to what they find. So they looked at where

are these elements around the genome. They found about 100 overlapped

exons of known protein coding genes, 100 are in

introns, and the remainder are in intergenic regions. So then they looked at

well what kind of genes contain exons with

overlapping– or contain ultraconserved elements

that overlap exons? Those are type 1 genes. And what kind of

genes are next to the intergenic

ultraconserved elements, to try to get some clues about

the function of these elements. And so they did this early

gene ontology analysis. And what they found was that

the ultraconserved elements that overlapped exons

tended to fall in genes that encoded RNA-binding

proteins, particularly splicing factors, by an order of

magnitude more frequent. And then the type 2

genes, the ones that were next to these intergenic

ultraconserved regions, tended to be

transcription factors. In particular, homeobox

transcription factors were the most enriched class. So this gave them some clues

about what might be going on. Particularly the second

class was followed up by Eddy Rubin’s

lab at Berkeley. And they tested 167 extremely

conserved sequences. So some of them were these

ultraconserved elements. And some of them were

just highly conserved, but not quite 100% conserved. And they had an assay

where they have a reporter. It’s a lacZ with a– you

take a minimal promoter, fuse it to lacZ, and then

you take your element of interest and

fuse it upstream. And then you do staining

of whole mount embryos. And you say what pattern

of gene expression does this element

drive, or does it drive a pattern of

gene expression? And so 45% of the time it

drove a particular pattern of gene expression. So it functioned as an enhancer. And these are the types

of patterns that they saw. So they saw often

forebrain, sometimes midbrain, neural

tube, limb, et cetera. So many of these

things are enhancers that drive particular

developmental patterns of gene expression. So that turned out to be actually–

that was a pretty good way to identify

developmental enhancers. So they wondered, is

there anything special about these ultraconserved

regions, these 100% identical regions, versus

others that are 95% identical? And so they tested

a bunch of each. And they found absolutely

no difference there. They drive similar

types of expression. And you can even find

individual instances of them that drive pretty much exactly

the same pattern of expression. So this whole 100%

identical thing was just a purely– it

was purely arbitrary. But still, it’s useful. These things are among the

most interesting enhancers that have been identified. So what about the– oh yeah,

so where did they come from? OK, so this is totally

from left field. Bejerano was looking at some of

these ultraconserved elements, probably just blasting them

against different genomes as they came out, and noticed

something very, very strange. And that was there

had recently been some sequencing from coelacanth. So for those of you who

aren’t fish experts, this is a lobed fin fish,

where they found fossils from dating back to

400 million years. And they noticed that these

fossils– the morphology never changed. From 400 million, 300 million

years, you could see this fish. It was exactly like this. And it has lobed fins. That was why they’re

interested in it. Because the fins– they

have a round structure. They look almost like

limbs, like maybe this guy could have evolved into

something that would eventually live on land. Anyway, but they

thought it was extinct. And then somebody caught one. In the ’70s, in the West Indian

Ocean, from deep water fishing, they pulled one up,

and it looked exactly like these fossils from

400 million years before. And so then of course

somebody took some DNA and did some sequencing. And what Bejerano noticed is

that this one megabase or so coelacanth sequence had a

very common repeat in it that was around 500 bases or so,

that looked like a SINE element. SINE elements– short,

interspersed nuclear element, like Alus, if you’re

familiar with those, so some sort of repetitive element. And this repetitive

element was very similar to these ultraconserved

enhancers in mammals. So something that

we normally think of as the least

conserved of all, like a repetitive element

that inserts itself randomly in the genome, had become–

some of these elements had become among the

most conserved sequences later in evolution. So how does that make

any sense at all? Anyone have a theory on that? I can tell you how

they interpreted it. So their theory– here’s some

text from their– anyway, you can look at the paper

for the details here. But their theory is basically

that once you have a repetitive element– initially it’s

a parasitic element, inserts itself

randomly in the genome, doesn’t actually do anything. But once you have hundreds

of them, by chance there will be perhaps

a set of genes that have this

element next to them, where you’d like to

control them coordinately. You’d like to turn all those

genes on or all those genes off in a particular circumstance–

a stress response, during development,

something like that. And so then it’s relatively

easy to evolve a transcription factor, for example,

that will bind to some sequence

in that element. And then it’ll turn

on all those genes. Of course, it’ll turn

on all the genes that have the

elements near them. So it’ll probably turn

on some extra genes that you don’t want. But you can then– selection

will then tune these elements. It gives you a quick way of

generating a large-scale gene expression response. Because you’ve got so many

of these things scattered across the genome. And so this– that’s as good

as an explanation as we have, I would say, for what

is going on here. And there have been some

theories about this. And they point out

that something like 50% of our genome actually

comes from transposons, if you go back far enough. Some are recent,

some are ancient. And maybe a lot of the

regulatory elements– not just these ultraconserved

enhancers, but others– may have evolved in this way. So basically you insert a bunch

of random junk throughout. And then the fact that

it’s all identical, because it derived

from a common source– that fact actually turns it into something that’s useful,

a useful regulatory element. All right, just wanted

to throw that out. So what about the exonic

ultraconserved elements? So here’s one. This is a 618-nucleotide region that’s 100% identical between

human, mouse, and rat. It’s one of the

longest in the genome. And where is it? It’s in a splicing

factor gene called SRp20. And it’s actually not in

the protein-coding part. It’s in an essentially non-coding

exon of this splicing factor. So it’s this yellow exon here. And what you’ll notice is

there’s this little red thing here. That’s a stop codon. So this gene is

spliced to produce two different isoforms. The full-length is the

blue, when you just use all the blue exons. But when you include

this yellow exon, there’s a premature

termination codon that you hit. So you don’t make

full-length protein. Instead, that mRNA is

degraded in a pathway called nonsense-mediated mRNA decay. So the purpose of

this exon appears to be so that this gene

can regulate expression of the protein at the

level of splicing. And others have shown that this

protein, the protein product, actually binds to

that exon and promotes the splicing of that exon. So it’s basically a form of

negative autoregulation. The gene– when the

protein gets high, it comes back and shifts the

splicing of its own transcripts to produce a non-functional

form of the message and reduce the

protein expression. So the theory is that this

helps to keep this splicing factor at a constant

level throughout time and between different

cells, which might be important for splicing. But that’s only a theory. It could be something else.
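Just to make the feedback logic concrete, here’s a toy simulation– entirely my own sketch with made-up parameters, not a model from the lecture or from any paper– of negative autoregulation through a poison exon: the more protein there is, the more of its own transcripts get shunted into the NMD-degraded isoform.

```python
# Toy model (my own sketch with invented parameters, not from the paper):
# the more SRp20 protein there is, the larger the fraction of its own
# transcripts spliced into the poison-exon isoform, which NMD destroys
# and which therefore yields no protein.

def steady_protein(steps=500, k_txn=10.0, k_deg=0.1, K=50.0):
    protein = 0.0
    for _ in range(steps):
        poison_fraction = protein / (protein + K)      # grows with protein
        productive_mrna = k_txn * (1.0 - poison_fraction)
        protein += productive_mrna - k_deg * protein   # simple Euler step
    return protein

# Doubling transcription raises protein only ~1.6-fold, not 2-fold:
# the feedback buffers the level, as the autoregulation theory predicts.
print(round(steady_protein(k_txn=10.0), 1))   # ~50.0
print(round(steady_protein(k_txn=20.0), 1))   # ~78.1
```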

And the theory does not explain why you need some 600 nucleotides perfectly conserved in order to

have this function. So I think these exonic

ones are still fairly mysterious and

worth investigating. A couple of examples from microRNAs– you probably know about these, so it’s just a brief review on microRNAs. They are these small,

non-coding RNAs, typically 20 to 22

nucleotides or so. They have a characteristic

RNA secondary structure in their precursor, often called the pre-miRNA. And they’re typically produced from primary transcripts, or from introns of protein-coding genes, which are then processed in the nucleus by an enzyme called Drosha into a hairpin

structure, like so. And then that is exported

to the cytoplasm, where it’s further processed

by an enzyme called Dicer to produce the mature microRNA, which enters the RISC complex, which then pairs the

microRNA with mRNA targets, usually in the 3′-UTR. And that either inhibits

their translation or triggers the decay

of those messages. So microRNAs can be really important. Weird animation–

but for example, this bantam microRNA in flies

inhibits a proapoptotic gene hid. If you delete bantam,

apoptosis goes crazy. And you can see this

is a normal fly. There’s a little fly in there

with red eyes and so forth. In this guy there’s

just a sack of mush. All the cells– most of

the cells actually died. So microRNAs play

important roles in developmental pathways. And so we wanted to figure out

the rules for their targeting. And so this was an early

study from Ben Lewis, where he looked for conserved

instances of segments, short oligonucleotides,

that are perfectly complementary to different parts

of the microRNA, using again these human,

mouse, rat alignments, which were what was

available at the time. And what he found was that if

you took the set of microRNAs which were known, and you

identified targets of these, defined as 7-mers that

are perfectly conserved in 3′-UTRs of

mammalian messages, and then you looked at how many

you got and you compared that to the number of targets

of shuffled microRNA– so where you take the

whole set of microRNAs, randomly permute their sequences

so you generate random stuff, look at how many conserved

targets they have– that there was a significant

signal above background, in the sense of real

conserved targets, specifically only for the

5′-end of the microRNA. In particular, bases 2 to 8 of

the microRNA gave a signal. And no other positions

in the microRNA gave a significant

signal above background. And so that led to the inference that the 5′-end of the microRNA is what matters, specifically these bases.
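To make that analysis concrete, here’s a rough sketch of the counting idea– my reconstruction under simplifying assumptions, not the actual published pipeline; the utr_sets structure and the function names are invented for illustration.

```python
# Rough sketch of the counting idea (a reconstruction, not the published
# pipeline; the UTR data structure is invented for illustration): count
# 3'-UTRs that conserve a perfect match to the 7-mer complementary to
# microRNA bases 2-8 (the "seed"), then compare against shuffled microRNAs.

import random

RNA_COMP = str.maketrans("ACGU", "UGCA")

def seed_site(mirna):
    """DNA 7-mer a target UTR must contain: revcomp of miRNA bases 2-8."""
    seed = mirna[1:8]                                  # bases 2-8, 1-based
    return seed.translate(RNA_COMP)[::-1].replace("U", "T")

def conserved_target_count(mirna, utr_sets):
    """Count UTRs whose seed match is present in every aligned species."""
    site = seed_site(mirna)
    return sum(all(site in utr for utr in species_utrs)
               for species_utrs in utr_sets)

def shuffled(mirna):
    """Random permutation of the microRNA, as a background control."""
    return "".join(random.sample(mirna, len(mirna)))

# utr_sets would hold (human, mouse, rat) 3'-UTR triples; the signal is the
# real count compared to the distribution of counts over shuffled controls.
```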

And then later, alignments of paralogous microRNA

genes, shown here– so these are

different let-7 genes. You can actually see the 5′-end of the microRNA– the microRNA is shown here in blue, and this is the fold-back. So you get conservation of the

microRNA and of the other arm of the fold-back,

which is complementary. Little conservation of the loop,

but the most conserved part of the microRNA is the very

5′-end, consistent with that idea. Just one more example,

because it’s so cool– so this is the dscam

gene in drosophila. And this gene has four different

alternative spliced regions which are each spliced by

mutually exclusive splicing. So there are actually

12 copies of exon 4 and 48 different

copies of exon 6. And messages from this

gene only ever contain one of those particular exons. And so Brent Graveley asked

how does this gene get spliced in a mutually exclusive way? How do you only choose one of

those 48 different versions of exon 6? And so what he did was some

sequencing from various fly and other insect species of

this locus, did some alignments. And he noticed that there was

this very conserved sequence just downstream of exon 5, right

upstream of this cluster. And then, looking

more carefully, he saw that there is another

sequence, just immediately upstream of each of

the alternative exons, that was very similar

between all those exons, and also conserved

across the insects. And then he stared at these for a while, and recognized that

actually this sequence up at the 5′-end– its consensus is perfectly complementary to the sequence that’s found

upstream of all of the other exons. And so what that

suggested, immediately, is that splicing

requires the pairing of this sequence

from exon 5 to one of those downstream sequences. And then you’ll splice

to the next exon that’s immediately downstream and

skip out all of the others. And that’s been

subsequently confirmed: that is the mechanism.
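Since the whole argument rests on reverse complementarity, here’s a tiny illustration– with placeholder sequences, not the real Dscam docking site or selectors– of the check that Graveley’s observation implies.

```python
# Tiny illustration of the key observation, using placeholder sequences
# (not the real Dscam elements): the conserved "docking site" downstream of
# exon 5 is the reverse complement of the consensus "selector" sequence
# sitting upstream of each exon 6 variant.

DNA_COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(DNA_COMP)[::-1]

docking_site = "GGCATTA"                        # placeholder consensus
selectors = ["TAATGCC", "TAATGCC", "TAGTGCC"]   # one per variant exon

# Only a selector matching the docking site's reverse complement can pair
# with it; pairing commits splicing to the exon immediately downstream.
for i, sel in enumerate(selectors, start=1):
    status = "pairs" if sel == revcomp(docking_site) else "no pairing"
    print(f"selector {i}: {status}")
```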

So this just shows you that to figure this out by molecular genetics would

have been extremely difficult. But sometimes with comparative genomics, when you ask the right question, you can actually get really clear mechanistic

insights from sequences. So that’s it. And I’m actually passing

the baton over to David, who will take

over next week.