Submitted by Simusid t3_11okrni in MachineLearning
Comments
imaginethezmell t1_jbszsey wrote
OpenAI is 8k.
How about SentenceTransformer?
Simusid OP t1_jbt13iy wrote
8K? I'm not sure what you're referring to.
krishnakumar3096 t1_jbt3nk4 wrote
Why not try Davinci instead?? Why is it on ada??
Simusid OP t1_jbt4y5s wrote
I was lazy and used the model they show in their code example found here https://platform.openai.com/docs/guides/embeddings/what-are-embeddings.
Also on that page, they show that Ada outperforms Davinci (BEIR score) and is cheaper to use.
jobeta t1_jbt54u8 wrote
What are we looking at though? t-SNE?
VarietyElderberry t1_jbt5zkd wrote
I'm assuming u/imaginethezmell is referring to the context length. Indeed, if there is a need for longer context lengths, then OpenAI outcompetes SentenceTransformer which has a default context length of 128.
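For reference, a quick way to check the truncation limit on the SentenceTransformer side (untested sketch; the attribute below is standard in the sentence-transformers library, but the default value depends on the model):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
print(model.max_seq_length)  # inputs longer than this (in tokens) are silently truncated
# it can be raised, but only up to the underlying transformer's positional limit, e.g.:
# model.max_seq_length = 512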
rajanjedi t1_jbt625q wrote
Number of tokens in the input perhaps?
ID4gotten t1_jbt63ni wrote
Maybe I'm being "dense", but what task was your network trained to accomplish? That wasn't clear to me from your description.
ITagEveryone t1_jbt76r3 wrote
Definitely t-SNE
tdgros t1_jbt7dy0 wrote
OP said UMAP above
[deleted] t1_jbt7rkv wrote
[deleted]
Simusid OP t1_jbt91tb wrote
My main goal was to just visualize the embeddings to see if they are grossly different. They are not. That is just a qualitative view. My second goal was to use the embeddings with a trivial supervised classifier. The dataset is labeled with four labels, so I made a generic network to see if there was any consistency in the training. And regardless of hyperparameters, the OpenAI embeddings seemed to always outperform the SentenceTransformer embeddings, slightly but consistently.
This was not meant to be rigorous. I did this to get a general feel of the quality of the embeddings, plus to get a little experience with the OpenAI API.
Simusid OP t1_jbt962j wrote
[deleted] t1_jbtcsig wrote
[deleted]
montcarl t1_jbtexjk wrote
This is an important point. The performance similarities indicate that the sentence lengths of the 20k dataset were mostly within the SentenceTransformer max length cutoff. It would be nice to confirm this and also run another test with longer examples. This new test should result in a larger performance gap.
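Something like this would confirm it (untested sketch; `sentences` stands in for the 20k input strings, which aren't part of this thread):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
lengths = [len(model.tokenizer.encode(s)) for s in sentences]  # token count per input
too_long = sum(n > model.max_seq_length for n in lengths)
print(f"{too_long} of {len(sentences)} inputs exceed the cutoff and would be truncated")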
rshah4 t1_jbtfzig wrote
Two quick tips for finding the best embedding models:
Sentence Transformers documentation compares models: https://www.sbert.net/docs/pretrained_models.html
Massive Text Embedding Benchmark (MTEB) Leaderboard has 47 different models: https://huggingface.co/spaces/mteb/leaderboard
These will help you compare different models across a lot of benchmark datasets so you can figure out the best one for your use case.
DigThatData t1_jbthgbj wrote
I think it might be easier to compare if you flip the vertical axis on one of them. You can just negate the values of the component; it won't change the topology (the relations of the points relative to each other).
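Something like this (sketch; `emb` stands in for whichever fitted UMAP object you want to mirror):

import matplotlib.pyplot as plt

# negating one component mirrors the plot; relative positions are unchanged
plt.scatter(emb.embedding_[:, 0], -emb.embedding_[:, 1], s=1)
plt.show()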
deliciously_methodic t1_jbtl53y wrote
What are embeddings? I've watched videos but still don't fully understand them, and then I see these pictures and I'm even more confused.
wikipedia_answer_bot t1_jbtl62p wrote
**In mathematics, an embedding (or imbedding) is one instance of some mathematical structure contained within another instance, such as a group that is a subgroup. When some object X is said to be embedded in another object Y, the embedding is given by some injective and structure-preserving map f : X → Y.**
More details here: <https://en.wikipedia.org/wiki/Embedding>
This comment was left automatically (by a bot). If I don't get this right, don't get mad at me, I'm still learning!
LetterRip t1_jbtn573 wrote
The total number of tokens in the input + output.
Simusid OP t1_jbtp8wr wrote
Given three sentences:
- Tom went to the bank to make a payment on his mortgage.
- Yesterday my wife went to the credit union and withdrew $500.
- My friend was fishing along the river bank, slipped and fell in the water.
Reading those, you immediately know that the first two are related because they are both about banks/money/finance. You also know that they are unrelated to the third sentence, even though the first and third share the word "bank". If we naively used a strictly word-based model, it might incorrectly associate the first and third sentences.
What we want is a model that can represent the "semantic content" or idea behind a sentence in a way that we can make valid mathematical comparisons. We want to create a "metric space". In that space, each sentence will be represented by a vector. Then we use standard math operations to compute the distances between the vectors. In other words, the first two sentences will have vectors that point basically in the same direction, and the third vector will point in a very different direction.
The job of the language models (BERT, RoBERTa, all-mpnet-base-v2, etc.) is to do the best job possible turning sentences into vectors. The outputs of these models are very high-dimensional, 768 dimensions and higher. We cannot visualize that, so we use tools like UMAP, t-SNE, PCA, and eigendecomposition to find the 2 or 3 most important components and then display them as pretty 2D or 3D point clouds.
In short, the embedding is the vector that represents the sentence in a (hopefully) valid metric space.
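For a concrete feel, here is a minimal sketch using sentence-transformers (the model name matches what I used elsewhere in this thread; the exact similarity values will vary):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
sentences = [
    "Tom went to the bank to make a payment on his mortgage.",
    "Yesterday my wife went to the credit union and withdrew $500.",
    "My friend was fishing along the river bank, slipped and fell in the water.",
]
emb = model.encode(sentences)      # three 768-dimensional vectors
print(util.cos_sim(emb, emb))      # 3x3 cosine similarity matrix
# expect the (0, 1) entry to be clearly higher than (0, 2) or (1, 2)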
quitenominal t1_jbtptri wrote
An embedding is a numerical representation of some data. In this case the data is text.
These representations (read list of numbers) can be learned with some goal in mind. Usually you want the embeddings of similar data to be close to one another, and the embeddings of disparate data to be far.
Often these lists of numbers representing the data are very long - I think the ones from the model above are 768 numbers. So each piece of text is transformed into a list of 768 numbers, and similar text will get similar lists of numbers.
What's being visualized above is a 2 number summary of those 768. This is referred to as a projection, like how a 3D wireframe casts a 2D shadow. This lets us visualize the embeddings and can give a qualitative assessment of their 'goodness' - a.k.a are they grouping things as I expect? (Similar texts are close, disparate texts are far)
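In code, the projection step is roughly this (sketch; `embeddings` stands in for an (N, 768) array produced by one of the models above):

from umap import UMAP              # pip install umap-learn
import matplotlib.pyplot as plt

coords = UMAP(n_components=2).fit_transform(embeddings)   # (N, 768) -> (N, 2)
plt.scatter(coords[:, 0], coords[:, 1], s=1)
plt.show()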
quitenominal t1_jbtqio0 wrote
Nice explainer! I think this is good for those with some linear algebra familiarity. I added a further explanation going one level more simple again
quitenominal t1_jbtr6g7 wrote
fwiw this has also been my finding when comparing these two embeddings for classification tasks. Better, but not enough to justify the cost
pyepyepie t1_jbtsc3n wrote
Your plot doesn't mean much - when you use UMAP you can't even measure the explained variance, and the differences can be more nuanced than what you see in the results. I would evaluate with some semantic similarity or ranking task.
For the "91% vs 89%": you need to pick the classification task very carefully, and if you don't describe what it was, then it also literally means nothing.
That being said, thanks for the efforts.
rshah4 t1_jbtsl7o wrote
Also, not sure about a recent comparison, but Nils Reimers also tried to empirically analyze OpenAI's embeddings here: https://twitter.com/Nils_Reimers/status/1487014195568775173
He found across 14 datasets that the OpenAI 175B model is actually worse than a tiny MiniLM 22M parameter model that can run in your browser.
utopiah t1_jbtx8iv wrote
> What we want is a model that can represent the "semantic content" or idea behind a sentence
We do, but is that what embeddings actually provide, or rather some kind of distance between items, i.e. how they might relate (or not) to each other? I'm not sure that would be sufficient for most people to count as providing the "idea" behind a sentence, just relatedness. I'm not saying it's not useful, but I'm arguing against the semantic aspect here, at least from my understanding of that explanation.
[deleted] t1_jbtztzc wrote
[deleted]
Simusid OP t1_jbu0bkv wrote
>We do but is it what embedding actually provide or rather some kind of distance between items,
A single embedding is a single vector, encoding a single sentence. To identify a relationship between sentences, you need to compare vectors. Typically this is done with cosine distance between the vectors. The expectation is that if you have a collection of sentences that all talk about cats, the vectors that represent them will exist in a related neighborhood in the metric space.
utopiah t1_jbu0qpa wrote
Still says absolutely nothing if you don't know what a cat is.
Simusid OP t1_jbu229y wrote
Regarding the plot, the intent was not to measure anything, nor identify any specific differences. UMAP is an important tool for humans to get a sense of what is going on at a high level. I think if you ever use a UMAP plot for analytic results, you're using it incorrectly.
At a high level I wanted to see if there were very distinct clusters or amorphous overlapping blobs and to see if one embedding was very distinct. I think these UMAPs clearly show good and similar clustering.
Regarding the classification task: again, this is a notional task, not an attempt to solve a concrete problem. The goal was to use nearly identical models with both sets of embeddings to see if there were consistent differences. There were. The OpenAI model marginally outperformed the SentenceTransformer model every single time (several hundred runs with various hyperparameters). Whether it's a "carefully chosen" task or not is immaterial. In this case "carefully chosen" means softmax classification accuracy on the 4 labels in the curated dataset.
Simusid OP t1_jbu2n5w wrote
That was not the point at all.
Continuing the cat analogy, I have two different cameras. I take 20,000 pictures of the same cats with both. I have two datasets of 20,000 cats. Is one dataset superior to the other? I will build a model that tries to predict cats and see if the "quality" of one dataset is better than the other.
In this case, the OpenAI dataset appears to be slightly better.
Non-jabroni_redditor t1_jbu2shx wrote
That’s to be expected, no? No model is going to be perfect regardless of how it performs on a set (of datasets) as a whole
polandtown t1_jbu2zqe wrote
Learning here, but how are your axes defined? Some kind of factor(s) or component(s) extracted from each individual embedding? Thanks for the visualization, as it made me curious and interested! Good work!
pyepyepie t1_jbu3245 wrote
I think you misunderstood my comment. What I say is, that since you have no way to measure how well UMAP worked and how much of the variance of the data this plot contains, the fact that it "seems similar" means nothing (I am really not an expert on it, if I get it wrong feel free to correct me). Additionally, I am not sure how balanced the dataset you used for classification is, and if sentence embeddings are even the right approach for that specific task.
It might be the case, for example, that the OpenAI embeddings + the FFW network classify the data perfectly, or as well as anything can, because the dataset is very imbalanced and the annotation is imperfect or the categories are very similar. In this case, 89% vs 91% could be a huge difference. In fact, for some datasets a "majority classifier" would yield high accuracy; I would start by reporting precision & recall.
Again, I don't want to be "the negative guy" but there are serious flaws that make me unable to make any conclusion based on it (I find the project very important and interesting). Could you release the data of your experiments (vectors, dataset) so other people (I might as well) can look into it more deeply?
Simusid OP t1_jbu3q8m wrote
Here is some explanation about UMAP axes and why they should usually be ignored: https://stats.stackexchange.com/questions/527235/how-to-interpret-axis-of-umap
Basically it's because they are nonlinear.
Geneocrat t1_jbu4law wrote
Thanks for asking the seemingly obvious questions so that I don't have to wonder.
Simusid OP t1_jbu5594 wrote
Actually the curated dataset (ref github in original post) is almost perfectly balanced. And yes, sentence embeddings are probably the SOTA approach today.
I agree that when I say the graphs "seem similar", that is a very qualitative label. However I would not say it "means nothing". At the far extreme, if you plot:
import numpy as np, matplotlib.pyplot as plt
from umap import UMAP
x = UMAP().fit(np.random.random((10000, 75)))  # 10k random 75-dimensional points
plt.scatter(x.embedding_[:, 0], x.embedding_[:, 1], s=1)
You will get "hot garbage", a big blob. My goal, and my only goal was to visually see how "blobby" OpenAI was vs ST. And clearly they are visually similar.
polandtown t1_jbu56lb wrote
Thanks!
pyepyepie t1_jbu75ec wrote
Let's agree to disagree. Your example shows random data while I talk about how much of the information your plot actually shows after dimensionality reduction (you can't know).
Honestly, I am not sure what your work actually means since the details are kept secret - I think you can shut my mouth by reporting a little more or releasing the data, but more importantly - it would make your work a significant contribution.
Edit: I would like to see a comparison of the plot with a very simple method, e.g. mean of word embeddings. My hypothesis is that it will look similar as well.
Kyle-Boi t1_jbug5j7 wrote
Wtf does that even mean?
lppier2 t1_jbvhlwk wrote
Thanks, I'm more interested in real-world examples of how these two models compare. If sentence transformers can give me the same kind of embeddings, why am I paying OpenAI for the ada embeddings?
Simusid OP t1_jbvrbnu wrote
Well that was pretty much the reason I did this test. And right now I'm leaning toward SentenceTransformers.
lppier2 t1_jbvwweu wrote
Thanks. To be really convinced, though, I would want to see the real-world examples, like a sample of where OpenAI did well and where sentence transformers did well. Frankly, if it doesn't outperform sentence transformers I would be a bit disappointed, given the larger size and all.
phys_user t1_jbw7i59 wrote
Looks like text-embedding-ada-002 is already on the MTEB leaderboard! It comes in at #4 overall, and has the highest performance for clustering.
You might also want to look into SentEval, which can help you test the embedding performance on a variety of tasks: https://github.com/facebookresearch/SentEval
onkus t1_jbwftny wrote
Doesn’t this also make it essentially impossible to compare the two figures you’ve shown?
JClub t1_jbwu3lx wrote
More than that, GPT is unidirectional, which is really not great as a sentence embedder.
[deleted] t1_jbyaq18 wrote
[deleted]
Thog78 t1_jbyh4w1 wrote
What you're looking for when comparing UMAPs is whether the local relationships are the same. Try to recognize clusters and see their neighbors, or whether they are distinct or not. A much finer clustering, computed on another reduction (typically PCA) and used to color the points, helps with that. Without clustering, you can only try to recognize landmarks by their size and shape.
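A sketch of that idea (the names `emb_st`, `umap_openai`, and `umap_st` are placeholders for the arrays from the original post, and the cluster/component counts are arbitrary):

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# cluster in a PCA space of ONE embedding set, then color BOTH UMAP plots by
# those labels so matching neighborhoods are easy to spot
labels = KMeans(n_clusters=20, n_init=10).fit_predict(PCA(50).fit_transform(emb_st))
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].scatter(umap_openai[:, 0], umap_openai[:, 1], c=labels, s=1, cmap="tab20")
axes[1].scatter(umap_st[:, 0], umap_st[:, 1], c=labels, s=1, cmap="tab20")
plt.show()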
dhruv-kadam t1_jbyw9t7 wrote
I love reading these geeky comments even though I don't understand a thing. I love this!!
vintage2019 t1_jbzzadd wrote
Has anyone ranked models with that and published the results?
deliciously_methodic t1_jcifdxa wrote
Thanks, very informative. Can we dumb this down further? What would a 3-dimensional embedding table look like for the following sentences? And how do we go from words to numbers? What is the algorithm?
- Bank deposit.
- Bank withdrawal.
- River bank.
Simusid OP t1_jciguq5 wrote
"words to numbers" is the secret sauce of all the models including the new GPT-4. Individual words are tokenized (sometimes into "word pieces") and a mapping from the tokens to numbers via a vocabulary is made. Then the model is trained on pairs of sentences A and B. Sometimes the model is shown a pair where B correctly follows A, and sometimes not. Eventually the model learns to predict what is most likely to come next.
"he went to the bank", "he made a deposit"
B probably follows A
"he went to the bank", "he bought a duck"
Does not.
That is one type of training to learn valid/invalid text. Another is "leave one out" training. In this case the input is a full sentence minus one word (typically).
"he went to the convenience store and bought a gallon of _____"
and the model should learn that the most common answer will probably be "milk"
Back to your first question: in 3D, your first two embeddings should be closer together because they are similar, and they should both be "far" from the third encoding.
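A toy sketch of what that looks like (purely illustrative; real embeddings have hundreds of dimensions, and the PCA step below just squeezes them into 3 numbers each to match your question):

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

phrases = ["Bank deposit.", "Bank withdrawal.", "River bank."]
vectors = SentenceTransformer("all-mpnet-base-v2").encode(phrases)  # shape (3, 768)
table_3d = PCA(n_components=3).fit_transform(vectors)               # shape (3, 3)
for phrase, row in zip(phrases, table_3d):
    print(f"{phrase:20s} {row}")
# expect the first two rows to sit closer to each other than either does to the third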
Simusid OP t1_jbsyp5n wrote
Yesterday I set up a paid account at OpenAI. I have been using the free sentence-transformers library and models for many months with good results. I compared the performance of the two by encoding 20K vectors from this repo https://github.com/mhjabreel/CharCnn_Keras. I did no preprocessing or cleanup of the input text. The OpenAI model is text-embedding-ada-002 and the SentenceTransformer model is all-mpnet-base-v2. The plots are simple UMAP(), with all defaults.
I also built a very generic model with 3 dense layers, nothing fancy. I ran each model ten times for the two embeddings, fitting with EarlyStopping and evaluating with hold-out data. The average results were HF 89% and OpenAI 91.1%. This is not rigorous or conclusive, but for my purposes I'm happy sticking with SentenceTransformers. If I need to chase decimal points of performance, I will use OpenAI.
Edit - The second graph should be titled "SentenceTransformer" not HuggingFace.
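For anyone who wants to reproduce the general shape of this, here is a rough sketch of the pipeline (not the exact code: the dataset loader, batch size, and layer widths are placeholders, and the OpenAI call uses the pre-1.0 style client):

import numpy as np
import openai
from sentence_transformers import SentenceTransformer
from tensorflow import keras

texts, labels = load_ag_news_subset()   # placeholder for loading the 20K samples / 4 labels
labels = np.array(labels)

# SentenceTransformer embeddings (768-d)
st_emb = SentenceTransformer("all-mpnet-base-v2").encode(texts, show_progress_bar=True)

# OpenAI embeddings (1536-d), batched
def openai_embed(batch):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=batch)
    return [d["embedding"] for d in resp["data"]]

oa_emb = np.array([v for i in range(0, len(texts), 500)
                   for v in openai_embed(texts[i:i + 500])])

def evaluate(emb):
    """Generic 3-dense-layer softmax classifier with EarlyStopping."""
    model = keras.Sequential([
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    hist = model.fit(emb, labels, validation_split=0.2, epochs=100, verbose=0,
                     callbacks=[keras.callbacks.EarlyStopping(patience=3)])
    return max(hist.history["val_accuracy"])

print("SentenceTransformer:", evaluate(st_emb))
print("OpenAI ada-002:     ", evaluate(oa_emb))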