gamerx88
gamerx88 t1_jdmrlhh wrote
Reply to comment by wojapa in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
No, check their Git repo. They used HF Transformers' AutoModelForCausalLM in their training script. It's supervised fine-tuning.
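For anyone curious, here's a minimal sketch of what that kind of setup looks like (illustrative only, not their exact script; the base model and toy data are placeholders):

```python
# Illustrative SFT sketch with HF Transformers -- not their exact script.
# The base model and toy data below are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy instruction/response pairs, just to show the shape of the data.
# (A real script would mask pad tokens out of the labels.)
texts = ["Instruction: say hi\nResponse: hi",
         "Instruction: add 1+1\nResponse: 2"]
enc = tok(texts, padding=True, return_tensors="pt")
dataset = [{"input_ids": i, "attention_mask": m, "labels": i}
           for i, m in zip(enc["input_ids"], enc["attention_mask"])]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()  # plain next-token maximum likelihood, i.e. supervised fine-tuning
```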
gamerx88 t1_jdmr4n2 wrote
Reply to comment by wojtek15 in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
My observations are similar to yours, but I think Stanford's claim was that it rivalled text-davinci-003's dialogue or chat capabilities, and only in a single-turn setting.
gamerx88 t1_jdmql8y wrote
The answer is probably not. DeepMind's Chinchilla paper shows that many of those 100B+ LLMs are oversized relative to the amount of data used to pre-train them.
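A rough back-of-envelope using Chinchilla's roughly 20-tokens-per-parameter rule of thumb (my approximation of the paper's compute-optimal result, not its exact figures):

```python
# Back-of-envelope using Chinchilla's ~20 tokens/parameter heuristic.
# My rough approximation of the paper's compute-optimal result, not exact figures.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Roughly compute-optimal number of training tokens for a given model size."""
    return n_params * tokens_per_param

for n_params in (1.3e9, 70e9, 175e9):
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:5.1f}B params -> ~{tokens / 1e12:.2f}T tokens")

# GPT-3 (175B) was reportedly trained on ~300B tokens, far short of the
# ~3.5T this heuristic suggests -- hence "oversized" for its data.
```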
gamerx88 t1_jdmpdtf wrote
Reply to comment by ginger_beer_m in [R] Hello Dolly: Democratizing the magic of ChatGPT with open models by austintackaberry
Not if they adopt the technology
gamerx88 t1_jdmndip wrote
Food for thought. Is this really surprising, considering that the InstructGPT paper in early 2022 already showed how even a 1.3B model, after RLHF, could beat a much larger 175B model?
I guess what this shows is that it's the data that matters rather than SFT vs RLHF. I'm wondering whether any ablation studies have been done here.
gamerx88 t1_jcx0t9r wrote
Reply to comment by fullstackai in [D] Unit and Integration Testing for ML Pipelines by Fender6969
Ah, that makes sense.
gamerx88 t1_jctqruk wrote
For ETL, write unit tests to handle input edge cases, e.g. null values, mis-formatted fields, and out-of-range values, as well as some simple working cases.
For model training, the test focus is on having "valid" hyperparams and configurations. I write test cases that try to overfit on a small training set, i.e. confirm that the model can learn (see the sketch below). There are also some robustness tests that I sometimes run post-training, but those are very specific to certain NLP tasks and applications.
For model serving, test successful parsing of the request and the subsequent feature transformation (if any); very similar to ETL.
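Here's a minimal pytest-style sketch of the overfit check (toy model and data, not from any specific pipeline):

```python
# Pytest-style "can the model overfit a tiny batch?" sanity check.
# Toy model and data -- illustrative, not from any specific pipeline.
import torch
import torch.nn as nn

def test_model_overfits_small_batch():
    torch.manual_seed(0)
    x = torch.randn(16, 10)                      # 16 examples, 10 features
    y = torch.randint(0, 2, (16,))               # random binary labels
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(500):                         # enough steps to memorize 16 points
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    # If the model can't drive loss near zero on 16 points, something is
    # wrong with the architecture, loss, or optimizer configuration.
    assert loss.item() < 0.05
```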
gamerx88 t1_jctp6px wrote
Reply to comment by fullstackai in [D] Unit and Integration Testing for ML Pipelines by Fender6969
Is there a reason you feel the need for such rigour? 100% coverage is overkill even for typical software projects, IMO.
You'd probably end up writing tests even for simple one-liner functions, which gets exhausting.
gamerx88 t1_j9evm62 wrote
Reply to [D] Things you wish you knew before you started training on the cloud? by I_will_delete_myself
How do you utilize a spot instance for training? How do you automatically resume training from a checkpoint? Or are you referring to something like SageMaker's managed spot training?
gamerx88 t1_j7smwbb wrote
Reply to [Discussion] Is ChatGPT and/or OpenAI really the leader in the space? by wonderingandthinking
Leader in what space, and in what sense? Fundamental research? Innovation? Market share for LLMs? Hype?
gamerx88 t1_j70rs5v wrote
Reply to [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
Without referring to the paper again, my intuition is that a pairwise loss over final outputs does not gel well with how the model generates text auto-regressively.
Generation with GPT is basically a token-by-token decoding process, with the previous time steps taken into account. Think about the difference between a supervised learning problem and a reinforcement learning one: the former ignores the step-by-step nature of the generation scheme, and is a poorer fit for a decoding problem.
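To make the "pairwise loss over final outputs" part concrete: the reward model in InstructGPT is trained with a Bradley-Terry style objective along these lines (my paraphrase of the paper's loss; variable names and dummy values are mine):

```python
# Sketch of the pairwise (Bradley-Terry style) reward-model loss used in
# InstructGPT; variable names and dummy reward values are mine.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_chosen: torch.Tensor,
                     reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_w - r_l): pushes the preferred completion's score above
    # the rejected one's. Note it scores whole outputs -- it says nothing
    # about which token-level decisions were good, which is exactly the
    # credit-assignment problem RL handles over the decoding steps.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

r_w = torch.tensor([1.2, 0.3])   # reward-model scores for preferred outputs
r_l = torch.tensor([0.4, -0.1])  # scores for rejected outputs
print(pairwise_rm_loss(r_w, r_l))
```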
gamerx88 t1_j6cqerx wrote
It's not about large data or the number of parameters. OpenAI has not actually revealed details of ChatGPT's architecture and training. What is special is the fine-tuning procedure: alignment through RLHF on the underlying LLM (nicknamed GPT-3.5), which makes it extremely good at giving "useful" responses to prompts/instructions.
Prior to this innovation, zero-shot and in-context few-shot learning with LLMs barely worked. Users had to trial-and-error their way to some obtuse prompt to get the LLM to generate a sensible response, if it worked at all. This is because LLM pre-training is purely about language structure, without accounting for intent (what the human wishes to obtain via the prompt). Supervised fine-tuning on instruction-output pairs helped, but not by much. With RLHF, however, the process is so effective that a mere 6B parameter model (fine-tuned with RLHF) is able to surpass a 175B parameter model. Check out the InstructGPT paper for details.
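The rough shape of the RLHF objective: maximize the reward-model score while a KL penalty keeps the policy close to the SFT model. A toy sketch (all tensors and `beta` below are dummies/illustrative, not OpenAI's values):

```python
# Rough shape of the InstructGPT-style RLHF objective: maximize the
# reward-model score minus a KL penalty that keeps the policy close to
# the SFT model. All tensors and beta below are dummies/illustrative.
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_sft: torch.Tensor,
                        beta: float = 0.02) -> torch.Tensor:
    # Per-sequence KL estimate from the sampled tokens: sum(log pi - log pi_sft)
    kl = (logprobs_policy - logprobs_sft).sum(dim=-1)
    return reward - beta * kl  # this is what PPO then maximizes

reward = torch.tensor([0.8])                           # reward-model score
lp_policy = torch.log(torch.tensor([[0.5, 0.4, 0.6]]))
lp_sft = torch.log(torch.tensor([[0.45, 0.5, 0.55]]))
print(kl_penalized_reward(reward, lp_policy, lp_sft))
```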
gamerx88 t1_j3qft42 wrote
Reply to comment by rodeowrong in [R] Diffusion language models by benanne
What do you mean Transformers took over? In what area or sense? You mean took over in popularity?
gamerx88 t1_j3m0drc wrote
Yes, we used DistilBERT (and even logistic regression) heavily at my previous startup, where data volume was web-scale.
Depending on the exact problem, large transformer models can be overkill. For some straightforward text classification, even logistic regression with some feature engineering can come within 3 percentage points of a transformer, at a negligible fraction of the cost.
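The kind of cheap baseline I mean, sketched with scikit-learn (toy data, obviously; the numbers above came from real tasks, not this snippet):

```python
# The kind of cheap baseline I mean: TF-IDF features + logistic regression.
# Toy data below -- on many straightforward classification tasks this lands
# surprisingly close to a fine-tuned transformer at a tiny fraction of the cost.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, loved it", "terrible, broke in a day",
         "works as advertised", "waste of money"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["really loved this"]))
```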
gamerx88 t1_j3fx20a wrote
I am very impressed by the underlying GPT-3.5 LLM and the capabilities that alignment via RLHF has unlocked in LLMs, but I don't believe any serious NLP researchers or practitioners think that NLP is solved.
There are still tonnes of challenges and limitations that need to be solved before this tech is ready, e.g. the very convincing hallucinations, and failures on simple math problems and second-order reasoning tasks, among others. Many other areas of NLP remain unresolved as well.
Having been in the NLP field for close to 10 years and having experienced several other developments and paradigm shifts (RNN/LSTM, attention, transformer models, LLMs with emergent capabilities), I am more optimistic than fearful about this development's impact on our jobs.
Each of these past developments made certain expertise obsolete, but also expanded the problem space that NLP can tackle. The net effect, however, has been consistently positive, with the money in and demand for NLP expertise increasing.
gamerx88 t1_j2vzjfx wrote
"An empirical analysis of compute-optimal large language model training" by Deepmind, suggesting that LLMs are over-parameterized or under-trained (insufficient data used in training).
gamerx88 t1_j239l9d wrote
Reply to ML Impacts [D] by evomed
Technological improvements and economic restructuring taking away jobs is nothing new, and it is not AI-specific. Such creative destruction ultimately leads to a more productive economy and better standards of living for all.
I do recognize, however, that this net benefit is not equally distributed throughout society. Those who bear the brunt of the cost (unemployment) may not get even a shred of the payoff from improved productivity. Secondly, I think the potential scale of disruption from AI may be far greater than on other occasions in history, and there may be short-term suffering on an unprecedented scale.
Hence, I think policymakers should seriously consider ideas like universal basic income and enhanced social safety nets when the time comes.
gamerx88 t1_iwp3jh9 wrote
Reply to comment by tis_avionics in [D] AMA: Team OPUS.AI by tis_avionics
Came across that. Not sure if it's a problem on my side, but the audio (if there is any) does not work.
gamerx88 t1_iwkogi7 wrote
I work in NLP. Ever since HuggingFace became mainstream, we have almost never had to do this.
We used to have to implement the cutting-edge stuff ourselves, because papers did not come with code, or shipped code that required a huge amount of work to run. Now models often appear on HF within a few weeks of publication.
The only occasion in the last 2 or 3 years where I wrote a DNN from scratch was when I had to give a short lecture, for pedagogical reasons.
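To illustrate what "appearing on HF" buys you in practice, loading a published checkpoint is usually a one-liner these days (the model name below is a real public checkpoint, used purely as an example):

```python
# What "appears on HF" buys you in practice: a published model is usually
# a one-liner away. The checkpoint below is just an example.
from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("HuggingFace made this a one-liner."))
```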
gamerx88 t1_iwko24c wrote
Reply to [D] AMA: Team OPUS.AI by tis_avionics
Your website is very sparse on details. Give us a 5 minute elevator pitch.
gamerx88 t1_jdn1dd3 wrote
Reply to comment by currentscurrents in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
> In the long run I expect this will flip; computers will get very fast and data will be the limiting factor.
I agree, but I think data is already a limiting factor today, with the largest publicly known models at 175B. The data used to train these models supposedly already covers a majority of the open internet.