Submitted by Vegetable-Skill-9700 t3_121a8p4 in MachineLearning
Databricks's open-source LLM, Dolly, performs reasonably well on many instruction-based tasks while being ~25x smaller than GPT-3, challenging the notion that bigger is always better.
From my personal experience, the quality of the model depends a lot on the fine-tuning data, not just sheer size. If you choose your fine-tuning data carefully, you can fine-tune a smaller model to perform better than the state-of-the-art GPT-X on your task. The future of LLMs might look more open-source than we imagined three months ago.
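For concreteness, here's a minimal sketch of what that kind of instruction fine-tuning can look like with Hugging Face transformers. The base model, the dataset file, and the prompt format are just placeholder assumptions, not a specific recipe:

```python
# Minimal sketch: instruction fine-tuning a small open-source causal LM.
# Model name, dataset file, and prompt template are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "EleutherAI/pythia-1.4b"      # any small open-source base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assume a curated instruction dataset with "instruction" and "response" fields
data = load_dataset("json", data_files="curated_instructions.jsonl")["train"]

def to_features(example):
    # Format each example as a prompt/response pair and tokenize it
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(to_features, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="small-instruct-ft",
                           per_device_train_batch_size=4,
                           num_train_epochs=2,
                           learning_rate=1e-5),
    train_dataset=tokenized,
    # mlm=False -> standard next-token (causal LM) objective with padding
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The training loop itself is boring; the part that actually moves the needle is how carefully that curated_instructions.jsonl is put together.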
I'd love to hear everyone's opinions on how they see the future of LLMs evolving. Will it be a few players (OpenAI) cracking AGI and conquering the whole world, or a lot of smaller open-source models that ML engineers fine-tune for their own use cases?
P.S. I am kinda betting on the latter and building UpTrain, an open-source project that helps you collect that high-quality fine-tuning dataset.
soggy_mattress t1_jdl4zkg wrote
I think of the 100B-parameter models as analogous to the first room-sized computers built back in the '40s and '50s. The pattern seems to be: first prove the concept, no matter how inefficiently, and then optimize it as much as possible.