Submitted by shitboots t3_zdkpgb in MachineLearning

Paper: https://www.cs.toronto.edu/~hinton/FFA13.pdf

Twitter summary: https://twitter.com/martin_gorner/status/1599755684941557761

Abstract:

> The aim of this paper is to introduce a new learning procedure for neural networks and to demonstrate that it works well enough on a few small problems to be worth serious investigation. The Forward-Forward algorithm replaces the forward and backward passes of backpropagation by two forward passes, one with positive (i.e. real) data and the other with negative data which could be generated by the network itself. Each layer has its own objective function which is simply to have high goodness for positive data and low goodness for negative data. The sum of the squared activities in a layer can be used as the goodness but there are many other possibilities, including minus the sum of the squared activities. If the positive and negative passes can be separated in time, the negative passes can be done offline, which makes the learning much simpler in the positive pass and allows video to be pipelined through the network without ever storing activities or stopping to propagate derivatives.
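
For readers who want the mechanics in code: below is a minimal, unofficial sketch of a layer trained the way the abstract describes (per-layer "goodness" = sum of squared activities, pushed above a threshold for positive data and below it for negative data). The layer sizes, threshold value, optimizer, and normalization details are assumptions for illustration, not taken from any released implementation.

```python
import torch
import torch.nn.functional as F

class FFLayer(torch.nn.Module):
    """One fully connected layer with its own local objective (no backprop across layers)."""

    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, d_out)
        self.threshold = threshold  # goodness threshold; the value here is a guess
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Length-normalize the input so a layer cannot simply reuse the previous
        # layer's goodness; only the orientation of the activity vector is passed on.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        # Goodness = sum of squared activities. Push it above the threshold for
        # positive (real) data and below the threshold for negative data.
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        # softplus(z) = -log sigmoid(-z), i.e. the logistic loss on (goodness - threshold)
        loss = (F.softplus(self.threshold - g_pos) +
                F.softplus(g_neg - self.threshold)).mean()
        self.opt.zero_grad()
        loss.backward()  # the gradient never leaves this layer
        self.opt.step()
        # Detach the outputs so the next layer sees plain data, not a computation graph.
        with torch.no_grad():
            return self.forward(x_pos), self.forward(x_neg)
```

Stacking several such layers and calling `train_step` on each one per batch gives layer-local training with no backward pass between layers; again, this is a reading of the abstract, not the author's code.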

230

Comments


lfotofilter t1_iz32jjy wrote

Geoff Hinton by now must know each of the 60,000 digits of MNIST like an old friend.

103

AsIAm t1_iz4g8q4 wrote

He knows the true probability distribution of the MNIST.

42

katprop t1_iz28tqp wrote

I watched his NeurIPS presentation. While I love explorations of alternatives to backprop, does anyone else feel like he's going a bit off the deep end by saying this paper could explain why people sleep and that we'll use non-binary computers in the future?

56

gambs t1_iz406vc wrote

Hinton has figured out how the brain works every year since the mid-80s, let the man cook

54

gunshoes t1_iz2cili wrote

These OG guys from the PDP days usually do that. I just take it as a bit of garnish for some fun hypotheticals.

52

Direct_Ad_7772 t1_iz4gg0e wrote

I think trying to understand the mind must be one of his main motivations. If it weren't for that, he would not have contributed to machine learning to begin with. So going off the deep end is a side effect of whatever it is that made him a great researcher.

9

ReginaldIII t1_iz4ries wrote

Do you have access to the video of his presentation still?

It bothers me greatly that they paywall their presentations even after the conference has ended.

By all means have exclusivity for the duration of the actual conference, and limit commenting and discussion to conference attendees. But as soon as the conference ends they should flip the switch and make everything public. There's literally no reason not to; it isn't going to stop people from wanting to attend.

8

logicbloke_ t1_iz5x64f wrote

This, 10x. I wish the paper presentations and keynotes were made available online. It doesn't take much effort to record audio plus slides of a presentation.

It wouldn't take anything away from the in-person conference, which is more about networking and discussion.

3

suedepaid t1_izco494 wrote

I was also frustrated about that, but I went on the website and it looks like they're gonna publish them all in a couple weeks. Still a bit frustrated at the delay, but it's a bit understandable.

3

The_Real_RM t1_iz41rkd wrote

What's funny is that a few decades from now the only relevant brains in the world will be the ones this guy brought into existence. It's just a self-fulfilling prophecy.

5

ktpr t1_iz2eya1 wrote

If he mentioned those extrapolations in a psychology or neuroscience conference he would be laughed out of the room. World class expertise in one area does not translate to informed speculation in another.

−5

Nameless1995 t1_iz2j0ja wrote

Incidentally, Hinton has a lot of professional experience in psychology/cognitive science: https://www.cs.toronto.edu/~hinton/fullcv.pdf

> Jan 76 - Sept 78: Research Fellow, Cognitive Studies Program, Sussex University, England
>
> Oct 78 - Sept 80: Visiting Scholar, Program in Cognitive Science, University of California, San Diego
>
> Oct 80 - Sept 82: Scientific Officer, MRC Applied Psychology Unit, Cambridge, England
>
> Jan 82 - June 82: Visiting Assistant Professor, Psychology Department, University of California, San Diego

46

ktpr t1_iz2te98 wrote

Impressive. But the latest multi-month appointment was nearly 40 years ago. Boulder of salt here.

−11

uotsca t1_iz2vg3b wrote

Hinton was educated at King's College, Cambridge, graduating in 1970 with a Bachelor of Arts in experimental psychology.

9

kebabmybob t1_iz3n3gc wrote

Cog sci stuff is all sophistry of this exact flavor. With respect to neuroscience you might be right.

5

master3243 t1_iz2f181 wrote

Interesting read, I'm always interested in research about alternatives to backprop.

One important passage (for the curious who won't read the paper):

> The forward-forward algorithm is somewhat slower than backpropagation and does not generalize quite as well on several of the toy problems investigated in this paper so it is unlikely to replace backpropagation for applications where power is not an issue. The exciting exploration of the abilities of very large models trained on very large datasets will continue to use backpropagation.

> The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning (Jabri and Flower, 1992).

44

whatstheprobability t1_iz58l5i wrote

I feel like this is saying:

  1. this won't generally replace backprop, but it could lead to insight that will lead to algorithms that will replace backprop
  2. this could improve upon backprop for some specific use cases (low power), so even if it doesn't lead to major insights, researchers can still justify spending time on it

Does that sound right?

9

amassivek t1_izoh41k wrote

There is a framework for learning with forward passes, a friendly and thorough tutorial: https://amassivek.github.io/sigprop .

The most interesting insights from the framework:

  • This algorithm provides an explanation for how neurons in the brain without error connections receive learning signals.
  • It works for continuous networks with Hebbian learning. This provides evidence for this algorithm as a model of learning in the brain.
  • It works for spiking neural networks using only the membrane potential (aka voltage in hardware). This supports applying this algorithm for learning on neuromorphic chips.

The Signal Propagation framework paper: https://arxiv.org/abs/2204.01723 . The Forward-Forward algorithm is an implementation of this framework.

I am an author of this work. I was presenting this work at a reading group when one of the members pointed out the connection between signal propagation and Forward-Forward.

4

kebabmybob t1_iz3ntyw wrote

What a chad, no grad students or anybody on this paper.

32

seiqooq t1_iz43dss wrote

Probably explains why the title of the paper isn't “forward passes are all you need”.

58

modeless t1_iz28lbg wrote

This seems more interesting than the capsule stuff he was working on before. Biologically plausible learning rules are cool. Does it work on imagenet though?

20

new_name_who_dis_ t1_iz2b35v wrote

Is this actually biologically plausible? The idea of negative data seems pretty contrived.

I see that Hinton claims it's more biologically plausible, but I don't see any justification for that statement apart from comparing it to other biologically plausible approaches, and from spending time discussing why backprop is definitely not biologically plausible.

I'm not a neuroscientist so don't have much background on this.

25

modeless t1_iz2bm8r wrote

Well no one knows exactly what the brain is up to in there, but we don't see enough backwards connections or activation storage to make backprop plausible, so this is a way of learning without backwards connections, and that alone makes it more biologically plausible.

26

new_name_who_dis_ t1_iz2c6t0 wrote

I’ve heard that Hebbian learning is how brains learn, and this doesn’t seem like Hebbian learning.

However, idk if Hebbian learning is even how neuroscientists think we learn in contemporary research.

5

whymauri t1_iz38qtl wrote

As of 2019, it is what I was taught in a graduate course on associative memory and emergent dynamics in the brain. We read Hertz's Theory of Neural Computation. This was right before people started working on the connection between Hopfield networks and self-attention.

7

fortunum t1_iz2v4li wrote

Check out E-prop for recurrent spiking NNs.

3

Commyende t1_iz2euh0 wrote

Synapses can be excitatory or inhibitory, so that's basically like positive/negative, but I don't really know if that tracks with this algorithm 100%

8

jms4607 t1_iz38c09 wrote

I think the pos/neg here is more like contrastive learning.

8

new_name_who_dis_ t1_iz2fjjk wrote

It's negative data. It's basically contrastive learning, except without backprop. Like you pass a positive example and then a negative example in each forward pass, and update the weights based on how they fired in each pass.

It's a really cool idea, I'm just interested if it's actually biologically plausible.

I might be wrong, but an inhibitory synaptic connection sounds like a neural connection with weight 0, i.e. it doesn't fire with the other neuron.

6

Commyende t1_iz2wzk0 wrote

Inhibitory synapses reduce the likelihood of the downstream neuron firing.

6

Red-Portal t1_iz2kafb wrote

Geoff... everything is great but please stop abusing footnotes...

12

kebabmybob t1_iz3nsfu wrote

I like it this way. 100x more readable than your standard terse academic paper which gets off on appearing overly complex.

11

Red-Portal t1_iz3u8k2 wrote

Oh I'm not saying you should just remove the footnotes. I'm saying it's better to blend them into the main text so I don't have to jump back and forth...

2

ppg_dork t1_iz6btrx wrote

No! I think all academic papers should be structured like Infinite Jest!

1

Ulfgardleo t1_iz2ampb wrote

I will start believing in Hinton's algorithms once they prove that it is consistent with some vector field whose fixed points are meaningful optima of some objective function.

9

_der_erlkonig_ t1_iz3k920 wrote

Out of curiosity, why do you include this as a requirement for an algorithm to be good/interesting/useful/etc?

2

Ulfgardleo t1_iz3q7pd wrote

I did not. I did it for Hinton.

A heuristic can be useful without proof, especially for tasks that are very difficult to solve. However, you have to supply strong theoretical arguments for why it should work. A biological analogy is not enough, especially if it is one that we do not understand either.

Otherwise you end up like the other category of nature-inspired optimization heuristics that pretend to optimize by mimicking the hunting patterns of the Harris hawk. And I wish I were making this up.

8

chaosmosis t1_iz3ymas wrote

Gimmick animal optimization procedures are my guilty pleasure. They're like intellectually cute to me or something. I get happy every time I come across a new one.

7

Ulfgardleo t1_iz437fi wrote

I have a story to tell about the one time I got invited as an external evaluator for an MSc thesis. I agreed, later opened it, and then realized it was a comparison of 10 animal-migration algorithms.

This thesis sat on my desk for WEEKS because I did not know how to grade it. How do you grade pseudoscience?!? Like, it is not the students' fault that they fell prey to this topic, but I also can't condone their not figuring out that it IS pseudoscience.

3

chaosmosis t1_iz8fzyo wrote

I think the main problem is that they aren't theory driven except in an ad hoc sense. They'd be fine if they hadn't become a fad published on by everyone and their mother.

For actually neat discussions of distributed computing in animals, I don't think it's possible to do better than reading about octopuses. Strong recommend for Other Minds to anyone interested in the area.

2

Red-Portal t1_iz67ufu wrote

Yeah there is a whole "zoo" of those things haha.

3

PolywogowyloP t1_iz36kj7 wrote

I'm excited to see an alternative to backprop, but I think the most exciting part of this for me is the ability to still learn through stochastic layers in the model. I think this could have some major applications in probabilistic models for distributions without reparameterization tricks.

9

jms4607 t1_j1s103c wrote

Are there any problems with the reparam trick?

1

Ford_O t1_iz2eau3 wrote

So that's why I keep getting nightmares.

Jokes aside, this sounds quite plausible. However, I am unsure if this can ever be more efficient than backprop. Still, it could have a huge impact on neuroscience if it turns out that's what happens during sleep.

7

nikgeo25 t1_iz48wsj wrote

Paper reads like an idea he had in the shower. Where's the math and connection to existing work? Normalizing each layer after maximizing a square. Someone's gonna show he's doing some fancy PCA in no time I bet.

6

Wild-Ad3931 t1_iz8urzx wrote

Did anyone understand how the weights are updated?

6

SeverelyCanadian t1_izv8vvr wrote

I wondered this too. It's very unclear, and seems like a central detail is missing.

3

SatoshiNotMe t1_iz4cehg wrote

Odd thing about the abstract: it suddenly says “video” near the end. Is it only for video data?

5

tchumbae t1_iz55i98 wrote

The idea behind the paper is very cool, but there has been previous work that replaces the backward pass with a second forward pass. Check out this work by G. Dellaferrera and G. Kreiman!

5

nikgeo25 t1_iz97ctl wrote

Also the work by Ma and Wright that uses a form of generalized nonlinear PCA. Search for ReduNet.

1

Competitive_Dog_6639 t1_iz2oaue wrote

Hinton is awesome and I really enjoyed his NeurIPS talk. Naive question: are single-layer gradients biologically plausible? My understanding is that gradients back through multiple layers are not. The FF algorithm still uses gradients within single layers though, right?

4

dasayan05 t1_iz2ucqp wrote

yes, they are like "local" updates I believe

4

IDe- t1_iz356pg wrote

Backprop has really overstayed its welcome. It's great to see people doing something about it.

2

bohreffect t1_iz3gn0s wrote

You're sleeping on differentiable programming then

2

IDe- t1_iz6z4y3 wrote

The issue is that requiring a model to be differentiable puts far too many limitations on the types of models you can formulate. Much of the research in the last few decades has focused on how to deal with issues caused purely because of the artificial constraint of differentiability. It's purely "local optimization" in the space of potential models, when what we really should be doing is "basin-hopping".

2

bohreffect t1_iz74sa2 wrote

But implying that backprop is getting old neglects all of the real-world applications that haven't been explored yet.

I understand there are problems where differentiability is an intractable assumption, but saying "oh, old thing, how gauche" isn't particularly constructive.

1

IDe- t1_iz77rsw wrote

Ah, I didn't intend to say that it's old or useless, just that I think it receives disproportionate research focus/effort.

2

[deleted] t1_iz6e54k wrote

"differentiable"

1

bohreffect t1_iz6emfb wrote

I mean, can you not compute the Jacobian of a constrained optimization program and stack that into any differentiable composition of functions?

People snoozin'.

1

[deleted] t1_iz6hlao wrote

no you can't because it's not actually a Jacobian

1

bohreffect t1_iz6j2xr wrote

The Jacobian of the solution of a constrained optimization program with respect to its parameters, but I thought that was understood amongst the towering intellects of neural network aficionados, e.g. the original commenter finding backprop to be stale.

Here's the stochastic programming version: Section 3.3. https://proceedings.neurips.cc/paper/2017/file/3fc2c60b5782f641f76bcefc39fb2392-Paper.pdf
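
For anyone who hasn't seen the trick, here is a toy, self-contained sketch of the idea for the simplest case: an equality-constrained QP whose minimizer is the solution of its KKT linear system. Because the linear solve is differentiable, autograd gives you the Jacobian of the argmin with respect to the problem data. The inequality-constrained and stochastic versions in the linked paper need more machinery, so treat this purely as an illustration.

```python
import torch

def qp_argmin(Q, q, A, b):
    """x*(Q, q, A, b) = argmin_x 0.5 x^T Q x + q^T x  s.t.  A x = b.
    The KKT conditions form a linear system, and torch.linalg.solve is
    differentiable, so gradients flow through the argmin."""
    n, m = Q.shape[0], A.shape[0]
    kkt = torch.cat([torch.cat([Q, A.T], dim=1),
                     torch.cat([A, torch.zeros(m, m)], dim=1)], dim=0)
    rhs = torch.cat([-q, b])
    sol = torch.linalg.solve(kkt, rhs)
    return sol[:n]  # discard the dual variables

# Jacobian of the minimizer with respect to the linear cost term q.
Q = torch.eye(3)
A = torch.ones(1, 3)      # constraint: x1 + x2 + x3 = 1
b = torch.tensor([1.0])
q = torch.zeros(3, requires_grad=True)
jac = torch.autograd.functional.jacobian(lambda q_: qp_argmin(Q, q_, A, b), q)
print(jac)                # 3x3 sensitivity of x* to q, ready to be chained into a larger model
```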

1

Ulfgardleo t1_iz9fjio wrote

Funny that stuff always comes back. We used to differentiate SVM solutions wrt kernel parameters like that back in the day.

1

eccstartup t1_iz3mej4 wrote

It would be good if someone could provide the code.

2

ReasonablyBadass t1_iz3nk3n wrote

Can someone ELI5 what negative data means here? How does the network generate it?

2

Paluure t1_iz6zvqd wrote

Basically, for an unsupervised task, it's nonsense data that does not fall under any meaningful class in the training dataset. It can be anything. In the paper, they modify each MNIST image so that it isn't a digit anymore but still looks like one. The network doesn't generate the negative images, you do, and you feed them in as "bad data" right after the "good data" to create a contrast between the two for the model to learn from.

For a supervised task, "bad data" can also be nonsense (just as in the unsupervised task) or can be mislabeled data, such as feeding an image of "5" but embedding "4" as the label inside the image. That's obviously wrong, and is considered bad data.
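
To make that concrete, here is a rough, unofficial sketch of the two kinds of negative data described above. The mask construction is only an approximation of the repeated-blurring procedure the paper describes, and the function names and parameters are made up for illustration.

```python
import numpy as np

def hybrid_negative(img_a, img_b, blur_passes=6, rng=np.random):
    """Unsupervised case: blend two real digits with a blurred random mask so the
    result keeps digit-like strokes locally but is not any real digit globally."""
    mask = (rng.rand(*img_a.shape) > 0.5).astype(np.float32)
    for _ in range(blur_passes):  # repeated local averaging -> large blobby regions
        mask = (mask
                + np.roll(mask, 1, axis=0) + np.roll(mask, -1, axis=0)
                + np.roll(mask, 1, axis=1) + np.roll(mask, -1, axis=1)) / 5.0
    mask = (mask > 0.5).astype(np.float32)
    return mask * img_a + (1.0 - mask) * img_b

def wrong_label_negative(img, true_label, num_classes=10, rng=np.random):
    """Supervised case: overwrite the first pixels with a one-hot encoding of a
    *wrong* label; the same image with the correct label embedded is positive data."""
    wrong = (true_label + rng.randint(1, num_classes)) % num_classes
    out = img.copy().reshape(-1)
    out[:num_classes] = 0.0
    out[wrong] = img.max()  # embed the label at the image's intensity scale
    return out.reshape(img.shape)
```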

4

ObjectManagerManager t1_iz5xous wrote

(Confession: I haven't read the paper yet). I have a couple of questions:

  1. If each layer has its own objective function, couldn't you train the layers one at a time, front to back (see the sketch after these questions)? e.g., train the first layer to convergence, then train the second layer, and so on. I doubt this would be faster than training it end-to-end, but a) as the early layers adapt, they screw up the representations being fed to the later layers anyway, so it probably wouldn't be too much slower than training it end-to-end, and b) it would use significantly less memory (e.g., if you pre-compute the inputs to a layer just before you begin training it, you could imagine training any arbitrarily deep model with a finite amount of memory).
  2. What's the motivation behind "goodness"? Suppose we're talking about classification. Why doesn't each layer just minimize cross entropy? I guess that'd require each layer to have its own flatten + linear projection layers. But then you wouldn't have to concatenate the label and the input data, and so inference complexity would be (mostly) independent of the number of classes. Thinking of a typical CNN, a layer could be organized:
    1. Batch norm
    2. Activation (e.g., ReLU)
    3. Convolution (the output of which is fed into the next layer)
    4. Pooling
    5. Flatten
    6. Linear projection
    7. Cross entropy loss

Can anyone (who has read the paper) answer these questions?
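
On question 1: here is a hedged sketch of what that greedy, one-layer-at-a-time schedule could look like, reusing the hypothetical `FFLayer.train_step` from the sketch near the top of the thread (the loader format and epoch count are assumptions, not anything from the paper).

```python
import torch

def train_greedy(layers, data_loader, epochs_per_layer=5):
    """Train forward-forward style layers strictly one at a time, front to back.
    Earlier layers are frozen, so peak memory is roughly one layer's worth."""
    for i, layer in enumerate(layers):
        for _ in range(epochs_per_layer):
            for x_pos, x_neg in data_loader:    # loader yields (positive, negative) batches
                with torch.no_grad():           # already-trained layers are frozen
                    for frozen in layers[:i]:
                        x_pos = frozen(x_pos)
                        x_neg = frozen(x_neg)
                layer.train_step(x_pos, x_neg)  # local objective only; nothing flows backwards
        # In practice you could precompute and cache the frozen layers' outputs once
        # before training the next layer, which is the finite-memory point in (1b).
```

Whether this converges as well as updating all layers on every batch is exactly the open question being asked here.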

2

sytelus t1_iz8t24n wrote

Was anyone able to reproduce the results of the forward-forward algorithm?

2

kourouklides t1_j05bmni wrote

In my view, this sounds very boring. It would've been revolutionary if he had come up with a new gradient-free deep learning method to completely get rid of gradients. With very few exceptions, during the last 10 years or so we keep seeing small, incremental changes in ML, but no breakthroughs.

2

Abhijithvega t1_j0esd7x wrote

Transformers? PINNs? Skip connections, Adam, hell, even RNNs happened less than 10 years ago.

2

kourouklides t1_j0jyi5c wrote

  1. A simple google search would've revealed to you the following: "The concept of RNN was brought up in 1986. And the famous LSTM architecture was invented in 1997." Hence, not even close.
  2. Didn't I specify that "With very few exceptions?" You merely mentioned those exceptions.
  3. Do you realize that in order to challenge someone's argument you need to specify the two quantities being compared? What specific decade are you comparing it with?

1

WashiBurr t1_iz34l3z wrote

Definitely interesting at the very least.

1

ClassicJewJokes t1_iz75pwk wrote

Capsule Nets 2: Electric Boogaloo. My man off da perc, I like it.

1

wilgamesh t1_izzoduc wrote

Hinton cites Francis Crick's 1983 idea about the function of (dream) sleep in his list of references.

Like the second forward pass, which reduces the goodness of "negative data", Crick proposed that REM sleep is "reverse learning" that removes "undesirable modes."

Quite elegant to see this implemented...

1

Sepic2 t1_j0fzir8 wrote

Maybe a dumb question, but I don't see how this method enables learning in any way:

- The (first) forward pass calculates the loss/goodness, and then you need backpropagation to change the weights of the network according to derivatives of the loss/goodness. How does the network learn if the weights are not changed and you only calculate goodness?

The paper says: "The positive pass operates on real data and adjusts the weights to increase the goodness in every hidden layer. The negative pass operates on "negative data" and adjusts the weights to decrease the goodness in every hidden layer"

- Could it be that in the first "forward" you actually do both a forward and a backward pass, and the name just sounds fancy, with the second "forward" implementing contrastive learning in a clever way?

1

kourouklides t1_j0jzimn wrote

Well, nobody really knows if this method actually works, because Hinton only got as far as writing the paper. He didn't get as far as actually publishing code for it (yet).

1