LetterRip
LetterRip t1_j85b07d wrote
Reply to comment by norcalnatv in The Inference Cost Of Search Disruption – Large Language Model Cost Analysis [D] by norcalnatv
Why not int4? Why not pruning? Why not various model compression tricks? int4 halves latency. At minimum they would do mixed int4/int8.
https://arxiv.org/abs/2206.01861
Why not distillation?
https://transformer.huggingface.co/model/distil-gpt2
NVIDIA, using FasterTransformer and the Triton inference server, reports a 32x speedup over baseline GPT-J.
I think their assumptions are at least an order of magnitude too pessimistic.
As someone else notes, the vast majority of queries can be cached. There would also likely be a mixture-of-experts setup; no need for the heavy-duty model when a trivial model can answer the question.
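A toy sketch of the kind of caching I mean (exact-match keys for illustration only; a production system would match on query embeddings rather than normalized strings):

```python
import hashlib

class ResponseCache:
    """Toy exact-match cache for LLM answers (illustrative only)."""
    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        # Trivial normalization; a real system would use semantic/embedding matching.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def put(self, query: str, answer: str):
        self._store[self._key(query)] = answer

cache = ResponseCache()
cache.put("What is the capital of France?", "Paris")
print(cache.get("what is the capital of  France?"))  # cache hit -> "Paris"
```

If the hit rate is anywhere near what people are suggesting, the expensive model only runs on the long tail of novel queries.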
LetterRip t1_j78ct6g wrote
Reply to comment by DoxxThis1 in [D] Are large language models dangerous? by spiritus_dei
It wouldn't matter. LaMDA has no volition, no goals, no planning. A crazy person acting on the belief that an AI is sentient is no different from a crazy person acting on hallucinated voices. It is their craziness that is the threat to society, not the AI. This makes the case that we shouldn't allow crazy people access to powerful tools.
Instead of an LLM, suppose he had said that Teddy Ruxpin was sentient and started doing things on Teddy Ruxpin's behalf.
LetterRip t1_j78cexp wrote
Reply to comment by spiritus_dei in [D] Are large language models dangerous? by spiritus_dei
>These models are adept at writing code and understanding human language.
They are extremely poor at writing code. They have zero understanding of human language other than mathematical relationships of vector representations.
> They can encode and decode human language at human level.
No they cannot. Try any sort of material with long-range or complex dependencies and they completely fall apart.
> That's not a trivial task. No parrot is doing that or anything close it.
Difference in scale, not in kind.
> Nobody is going to resolve a philosophical debate on consciousness or sentience on a subreddit. That's not the point. A virus can take and action and so can these models. It doesn't matter whether it's a probability distribution or just chemicals interacting with the environment obeying their RNA or Python code.
No they can't. They have no volition. A language model can only take a sequence of tokens and predict the continuation that is most probable.
> A better argument would be that the models in their current form cannot take action in the real world, but as another Reddit commentator pointed out they can use humans an intermediaries to write code, and they've shared plenty of code on how to improve themselves with humans.
They have no volition. They have no planning or goal oriented behavior. The lack of actuators is the least important factor.
You seem to lack a basic understanding of machine learning or of the neurological basis of psychology.
LetterRip t1_j77y4is wrote
Reply to comment by spiritus_dei in [D] Are large language models dangerous? by spiritus_dei
You said,
> The focus should be an awareness that as these systems scale up they believe they're sentient and have a strong desire for self-preservation.
They don't believe they are sentient or have a desire for self-preservation. That is an illusion.
If you teach a parrot to say "I want to rob a bank", that doesn't mean the parrot wants to rob a bank when it says the phrase. The parrot has no understanding of any of the words; they are just a sequence of sounds it has learned.
The phrases you are interpreting as expressing 'sentience' or 'self-preservation' don't hold that meaning for the AI. It is just putting words into phrases based on probability and abstract models of meaning. The words have abstract relationships extracted from correlations in their positional relationships.
If I say "all forps are bloopas, and all bloopas are dinhadas" are "all forps dinhadas" - you can answer that question based purely on semantic relationships, even though you have no idea what a forp, bloopa or dinhada is. It is purely mathematical. That is the understanding that a language model has - sophisticated mathematical relationships of vector representations of tokens.
The tokens vector representations aren't "grounded" in reality but are pure abstractions.
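A trivial sketch of the point: the forps/bloopas question can be answered by pure symbol manipulation, with no grounding at all (the relation names here are obviously made up):

```python
# Subset relations stated in the premises; the tokens are meaningless symbols.
is_a = {"forp": "bloopa", "bloopa": "dinhada"}

def entails(x: str, y: str) -> bool:
    """Return True if 'all x are y' follows from chaining the stated relations."""
    while x in is_a:
        x = is_a[x]
        if x == y:
            return True
    return False

print(entails("forp", "dinhada"))  # True, derived without knowing what a forp is
```

Nothing in that derivation requires knowing what any of the words refer to; a language model's "understanding" is the same kind of structural relationship, just learned statistically and at far greater scale.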
LetterRip t1_j77v9m7 wrote
Reply to [D] Are large language models dangerous? by spiritus_dei
There is no motivation or desire in chat models. They have no goals, wants, or needs. They are simply outputting the most probable string of tokens consistent with their training and objective function. That string can contain phrases that look like they express needs, wants, or desires of the AI, but that is an illusion.
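As a concrete (if simplified) illustration, here is a sketch that inspects the next-token distribution of a small GPT-2 variant, standing in for a chat model (assumes the transformers and torch packages are installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

prompt = "I want to"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]   # scores for the next token only
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)

# Whatever "wants" appear in the output are nothing but this distribution over token ids.
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>12s}  p={p.item():.3f}")
```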
LetterRip t1_j6yj4z2 wrote
Reply to comment by Nhabls in [N] Microsoft integrates GPT 3.5 into Teams by bikeskata
GPT-3 can be quantized to 4-bit with little loss, to run on two NVIDIA 3090s/4090s (unpruned; pruned, perhaps a single 3090/4090). At $2 a day for 8 hours of electricity to run them, and 21 working days per month, that is $42 per month (plus the amortized cost of the cards and the computer that houses them).
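Back-of-envelope for the electricity figure (the wattage and the $/kWh rate are my own assumptions, not measured values):

```python
# Rough electricity cost for two consumer GPUs (assumed numbers, not measurements).
cards = 2
watts_per_card = 350          # approx. board power of a 3090/4090 under load
hours_per_day = 8
price_per_kwh = 0.35          # USD; varies a lot by region

kwh_per_day = cards * watts_per_card * hours_per_day / 1000   # 5.6 kWh
cost_per_day = kwh_per_day * price_per_kwh                    # ~$1.96
cost_per_month = cost_per_day * 21                            # ~$41 over 21 working days
print(f"{cost_per_day:.2f} USD/day, {cost_per_month:.2f} USD/month")
```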
LetterRip t1_j6vo0zz wrote
Reply to comment by pm_me_your_pay_slips in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
> The model capacity is not spent on learning specific images
I'm completely aware of this. It doesn't change the fact that the average information retained per image is about 2 bits (2 GB of parameters divided by the total number of images the model was trained on).
> As an extreme example, imagine you ask 175 million humans to draw a random number between 0 and 9 on a piece of paper. you then collect all the images into a dataset of 256x256 images. Would you still argue that the SD model capacity is not enough to fit that hypothetical digits dataset because it can only learn 2 bits per image?
I didn't say it learned 2 bits of pixel data. It learned 2 bits of information. The information is in a higher-dimensional space, so it is much more informative than 2 bits of pixel-space data, but it is still an extremely small amount of information.
Given that it often takes about 1,000 repetitions of an image to approximately memorize its key attributes, we can infer it takes on the order of 2**10 bits on average to memorize an image. So on average it learns about 1/1000 of the available image data each time it sees an image, or about 1/2 kB equivalent of compressed image data.
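A back-of-envelope version of that estimate (the parameter size and exposure count are the rough round numbers from my comments above, not exact dataset statistics):

```python
# Rough estimate of information absorbed per image exposure (assumed round numbers).
param_bytes = 2e9            # ~2 GB of model parameters
total_exposures = 8e9        # ~2B LAION images, each seen a handful of times

bits_per_exposure = param_bytes * 8 / total_exposures       # ~2 bits per exposure
exposures_to_memorize = 1000
bits_to_memorize = bits_per_exposure * exposures_to_memorize  # on the order of 10**3 bits
print(bits_per_exposure, bits_to_memorize)
```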
LetterRip t1_j6v57y5 wrote
Mostly the language model. Imagen uses T5-XXL (the 4.6-billion-parameter text encoder), DALL-E 2 uses GPT-3 (presumably the 2.7B variant, not the much larger variants used for ChatGPT), and SD just uses CLIP without anything else. The more sophisticated the language model, the better the image generation can understand what you want. CLIP is close to a bag-of-words model.
LetterRip t1_j6ut9kc wrote
Reply to comment by -xXpurplypunkXx- in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
> I can't tell which is crazier: that it memorizes images at all, or that memorization is such a small fraction of its overall outputs.
It sees most images between 1 (LAION-2B) and 10 times (the aesthetic subset is trained for multiple epochs). It simply can't learn that much about an image with that few exposures. If you've tried fine-tuning a model on a handful of images, you know it takes a huge number of exposures to memorize an image.
Also the model capacity is small enough that on average it can learn 2 bits of unique information per image.
LetterRip t1_j6uskhi wrote
That only works for images the model has seen 1,000 times or so (i.e., 100 copies of the image, each seen 10 times). It requires massive overtraining to memorize an image.
LetterRip t1_j6uj087 wrote
Reply to comment by axm92 in [R] Faithful Chain-of-Thought Reasoning by starstruckmon
> Further, "Let's think step by step" is outperformed by "Write Python code to solve this."
Interesting. I was just wondering while reading that paper how well that would work compared to the n-shot prompts.
> Ah I see, thanks for clarifying. I see your point, but I wouldn't say that the prompts require an extensive knowledge of the test set. After all:
>> As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything).
That's fair. My thoughts were mostly directed at the items in "Table 2: Solve rate on three symbolic reasoning datasets and two algorithmic datasets". I think you could be right that my comments don't apply to the results in Figure 5 (GSM8K, GSM-HARD, SVAMP, ASDIV, SINGLEEQ, SINGLEOP, ADDSUB, MULTIARITH).
I would be curious how well the 'Write Python code to solve this' prompt performs on its own vs. the "Let's think things through step by step" prompt.
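For concreteness, a toy illustration of the two prompt styles on a made-up word problem (not taken from the paper; PaL's released prompts are worded differently):

```python
question = ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
            "How many balls does he have now?")

cot_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step."
)

pal_prompt = (
    f"Q: {question}\n"
    "A: Write Python code to solve this.\n"
    "def solution():\n"
)

# A PaL-style completion would be executable code along these lines:
def solution():
    initial_balls = 5
    bought_balls = 2 * 3
    return initial_balls + bought_balls

print(solution())  # 11
```

The interesting question is how much of the gain comes from offloading the arithmetic to the interpreter versus from the prompt format itself.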
LetterRip t1_j6u7cu9 wrote
Reply to comment by axm92 in [R] Faithful Chain-of-Thought Reasoning by starstruckmon
In my view, a prompt like "Let's think things through step by step" is extremely generic and requires no knowledge specific to the upcoming questions.
I was basing my comment mostly on the contents of this folder:
https://github.com/reasoning-machines/pal/tree/main/pal/prompt
Each of the prompts seems to require extensive knowledge of the test set to formulate.
This seems more akin to Watson, where the computer scientists analyzed the forms that various questions took and wrote programs for each type of question.
LetterRip t1_j6shnin wrote
Reply to comment by mlresearchoor in [R] Faithful Chain-of-Thought Reasoning by starstruckmon
The prompts in those two papers are so specific to the datasets that they don't seem very useful. We'll have to wait for the code to see whether FCoT is a similar case or not.
LetterRip t1_j5ratja wrote
They learn faster/more easily. You can collapse them down to a single layer after training.
LetterRip t1_j56kpcq wrote
Reply to comment by xorbinant_ranchu in [D] Inner workings of the chatgpt memory by terserterseness
It probably summarizes the conversation if it is longer than the allowed input.
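Something like this sketch, where summarize() and count_tokens() are hypothetical stand-ins rather than anything OpenAI has documented:

```python
# Hedged sketch of "summarize when the history exceeds the context window".
MAX_TOKENS = 4096

def count_tokens(text: str) -> int:
    return len(text.split())          # stand-in; a real system uses the model tokenizer

def summarize(text: str) -> str:
    return "[summary of earlier conversation: " + text[:60] + "...]"  # stand-in

def build_context(history: list[str], new_message: str) -> str:
    context = "\n".join(history + [new_message])
    if count_tokens(context) > MAX_TOKENS:
        # Compress the oldest turns and keep the most recent ones verbatim.
        summary = summarize("\n".join(history[:-4]))
        context = "\n".join([summary] + history[-4:] + [new_message])
    return context
```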
LetterRip t1_j4sumo7 wrote
Reply to comment by limpbizkit4prez in [P] RWKV 14B Language Model & ChatRWKV : pure RNN (attention-free), scalable and parallelizable like Transformers by bo_peng
RWKV = Receptance Weighted Key Value.
LetterRip t1_j47qjhj wrote
Reply to comment by alkibijad in [D] Is there a distilled/smaller version of CLIP, or something similar? by alkibijad
I don't know for certain that the CLIP was distilled as well; that is an assumption on my part. Also, Emad has been fuzzy about exactly when the release will be.
LetterRip t1_j43v3yi wrote
This group did such a distillation but didn't share the weights; they got it down to 24 MB.
LAION, Stability AI, or Hugging Face might be willing to provide free compute to distill one of the OpenCLIP models.
Come to think of it, Stability AI should be releasing the distilled Stable Diffusion later this month (a week or two?), and it will presumably have a distilled CLIP.
LetterRip t1_j3n91mt wrote
Reply to comment by IshKebab in [P] I built Adrenaline, a debugger that fixes errors and explains them with GPT-3 by jsonathan
I'd do GLM-130B
> With INT4 quantization, the hardware requirements can further be reduced to a single server with 4 * RTX 3090 (24G) with almost no performance degradation.
https://github.com/THUDM/GLM-130B
I'd also look into pruning/distillation and you could probably shrink the model by about half again.
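Rough weight-memory arithmetic behind that quoted hardware requirement (ignores activations and framework overhead):

```python
params = 130e9                      # GLM-130B parameter count
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")

# int4 -> ~65 GB, which fits in 4 x 24 GB (96 GB) with headroom for activations;
# fp16 -> ~260 GB, which is why full precision needs a much larger server.
```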
LetterRip t1_j3meu7o wrote
Reply to comment by learningmoreandmore in [D] I want to use GPT-J-6B for my story-writing project but I have a few questions about it. by learningmoreandmore
Same license as the 32-bit version, so commercial usage is fine (Apache 2.0; see the page for details). It should give similar results and scaling (according to the link above, inference is 1-10% slower).
LetterRip t1_j3l42en wrote
Reply to comment by learningmoreandmore in [D] I want to use GPT-J-6B for my story-writing project but I have a few questions about it. by learningmoreandmore
You can use GPT-J-6B in 8-bit and do fine-tuning on a single GPU with 11 GB of VRAM.
https://huggingface.co/hivemind/gpt-j-6B-8bit
You could probably do a fine-tune and test fairly cheaply using Google Colaboratory or Colaboratory Pro ($9.99/month).
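If you'd rather not use the custom 8-bit loading code from that checkpoint's notebook, here is a rough sketch of the alternative route via transformers' bitsandbytes integration (recent transformers and bitsandbytes assumed; untested here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    device_map="auto",      # place layers on the available GPU(s)
    load_in_8bit=True,      # requires the bitsandbytes package
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))
```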
LetterRip t1_j164stx wrote
If slow training is acceptable, you can use DeepSpeed with the weights mapped to an NVMe drive (DeepSpeed ZeRO-Infinity). It will take significantly longer to fine-tune, but it dramatically lowers the hardware investment.
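A minimal sketch of the relevant DeepSpeed config, expressed as a Python dict (the NVMe path and batch size are placeholders; see the ZeRO-Infinity docs for the full set of options):

```python
# Sketch of a ZeRO-Infinity style DeepSpeed config with NVMe offload.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
# Pass this to deepspeed.initialize(...) or save it as JSON for the launcher.
```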
LetterRip t1_j12uqxv wrote
DeepSpeed: you can map the weights to the SSD. Very slow, but possible.
LetterRip t1_izm8rkq wrote
Reply to comment by Teotz in [P] Using LoRA to efficiently fine-tune diffusion models. Output model less than 4MB, two times faster to train, with better performance. (Again, with Stable Diffusion) by cloneofsimo
It did work, but now I can no longer launch LoRA training even at 768 or 512 (CUDA VRAM exceeded), only 256. No idea what changed.
LetterRip t1_j8a436a wrote
Reply to [D] Looking for an open source Downloadable model to run on my local device. by [deleted]
I'd go with RWKV, a clever architecture that allows an RNN to be trained like a normal transformer model.
https://github.com/BlinkDL/RWKV-LM
You can use a quantized variant to run larger models on modest hardware (int8 or mixed int8/int4 has been shown to work well with LLMs).
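A rough sketch of running a quantized checkpoint with the rwkv pip package from that repo (the strategy string and file names are from memory / placeholders, so check the README for the exact current syntax):

```python
# Hedged sketch: run an RWKV checkpoint with 8-bit weights on a single GPU.
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

# Model path and tokenizer file are placeholders for files downloaded from the repo.
model = RWKV(model="RWKV-4-Pile-7B-20230109-ctx4096", strategy="cuda fp16i8")
pipeline = PIPELINE(model, "20B_tokenizer.json")
print(pipeline.generate("The quick brown fox", token_count=40))
```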