pyepyepie t1_jbu3245 wrote on March 11, 2023 at 6:47 PM

Reply to comment by Simusid in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid

I think you misunderstood my comment. What I say is, that since you have no way to measure how well UMAP worked and how much of the variance of the data this plot contains, the fact that it "seems similar" means nothing (I am really not an expert on it, if I get it wrong feel free to correct me). Additionally, I am not sure how balanced the dataset you used for classification is, and if sentence embeddings are even the right approach for that specific task.

It might be the case - for example, that the OpenAI embeddings + the FFW network classify the data perfectly/as well as you can since the dataset is very imbalanced and the annotation is imperfect/categories are very similar. In this case, 89% vs 91% could be a huge difference. In fact, for some datasets the "majority classifier" would yield high accuracy, I would start by reporting precision & recall.

Again, I don't want to be "the negative guy" but there are serious flaws that make me unable to make any conclusion based on it (I find the project very important and interesting). Could you release the data of your experiments (vectors, dataset) so other people (I might as well) can look into it more deeply?

Simusid OP t1_jbu5594 wrote on March 11, 2023 at 7:02 PM

Actually the curated dataset (ref github in original post) is almost perfectly balanced. And yes, sentence embeddings is probably the SOTA approach today.

I agree that when I say the graphs "seems similar", that is a very qualitative label. However I would not say it "means nothing". At the far extreme if you plot:

x = UMAP().fit(np.random.random((10000,75)))
plt.scatter(x.embedding_[:,0], x.embedding_[:,1], s=1)

You will get "hot garbage", a big blob. My goal, and my only goal was to visually see how "blobby" OpenAI was vs ST. And clearly they are visually similar.

pyepyepie t1_jbu75ec wrote on March 11, 2023 at 7:16 PM

Let's agree to disagree. Your example shows random data while I talk about how much of the information your plot actually shows after dimensionality reduction (you can't know).

Honestly, I am not sure what your work actually means since the details are kept secret - I think you can shut my mouth by reporting a little more or releasing the data, but more importantly - it would make your work a significant contribution.

Edit: I would like to see a comparison of the plot with a very simple method, e.g. mean of word embeddings. My hypothesis is that it will look similar as well.