Devinco001

Devinco001 OP t1_ix08pyr wrote

I am going to use the embeddings to cluster the text in an unsupervised manner and pull out the popular intents (a rough sketch of what I mean is below).

1, 2. I would be fine with a bit of a trade-off in accuracy. Time is the main concern, since I don't want it to take more than a day. Maybe I'll have to use something other than BERT.

  1. Googled them, and RoBERTa seems to be the best choice, much better than base BERT or the larger BERT variants.

  2. I actually asked this because Google Colab has some restrictions on free usage.

  3. Thanks, really good article
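
Roughly what I have in mind for the clustering step, as a minimal sketch (assuming sentence-transformers and scikit-learn; the model name, example texts, and cluster count are just placeholders):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Toy messages standing in for the real conversation data.
texts = [
    "how do I reset my password",
    "forgot my password, please help",
    "where is my order",
    "my package has not arrived yet",
]

# Encode each message into a dense vector with any sentence-embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)

# Cluster the vectors; each cluster should roughly correspond to one intent.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for label, text in zip(kmeans.labels_, texts):
    print(label, text)
```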

1

Devinco001 OP t1_iwmvton wrote

Yeah, I googled BK-trees. The data structure is amazing and cuts down the computation time a lot. While searching, I found another algorithm, SymSpell, which is even faster with high accuracy but doesn't use Levenshtein, so I'm currently going to use that.

But BK-trees are the go-to method when using pure Levenshtein or similar, more accurate, string distances, so I will compare the accuracy of both algorithms and choose the better one. Thanks
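
For reference, a minimal BK-tree sketch over Levenshtein distance, assuming the python-Levenshtein package (the word list is made up):

```python
import Levenshtein

class BKTree:
    """Minimal BK-tree: nodes are (word, {distance: child}) tuples."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})
        for word in it:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = Levenshtein.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]          # descend along the existing edge
            else:
                node[1][d] = (word, {})    # attach a new child at distance d
                return

    def search(self, word, max_dist):
        # Triangle-inequality pruning: a child at edge distance e can only
        # contain matches if |d - e| <= max_dist.
        results, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = Levenshtein.distance(word, node[0])
            if d <= max_dist:
                results.append((node[0], d))
            for edge, child in node[1].items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree(["band", "bend", "bond", "brand", "hello"])
print(tree.search("baend", max_dist=1))  # [('band', 1), ('bend', 1)]
```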

2

Devinco001 OP t1_iwmv475 wrote

Yeah, I looked at some LMs on Hugging Face for filling the mask. They looked good, but require significant memory and computational resources to train.

This approach is the best one, but due to resource constraints I might have to fall back on simpler algorithms. Currently I am going to use SymSpell; it is surprisingly accurate and fast.

But I will keep looking for a less resource-hungry LM, since the lookup time is low and they capture context and grammar better. Combining Levenshtein with the model output should increase the accuracy further. Ultimately, I will shift to that, thanks
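
Roughly what I mean by combining Levenshtein with the model output, as a rough sketch (assuming the transformers and python-Levenshtein packages; distilbert-base-uncased is just a small example model and the scoring heuristic is made up):

```python
import Levenshtein
from transformers import pipeline

# Small masked LM; any fill-mask model would work here.
fill = pipeline("fill-mask", model="distilbert-base-uncased")

sentence = "the [MASK] played a great concert last night"
misspelled = "baend"

# Rank the model's candidates: prefer low edit distance to the misspelling,
# using the LM score to break ties (distance dominates because score < 1).
candidates = fill(sentence, top_k=20)
best = min(
    candidates,
    key=lambda c: Levenshtein.distance(misspelled, c["token_str"]) - c["score"],
)
print(best["token_str"])
```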

1

Devinco001 OP t1_iwmebpe wrote

Sure, but it's just a for loop going through the words in the dictionary and using the Python library 'python-Levenshtein' to calculate the distance between each dictionary word and the misspelled word.
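
Something like this minimal version of the loop (assuming the python-Levenshtein package and a plain word list):

```python
import Levenshtein

def closest_word(misspelled, dictionary):
    # Brute force: compute the edit distance to every dictionary word.
    best_word, best_dist = None, None
    for word in dictionary:
        d = Levenshtein.distance(misspelled, word)
        if best_dist is None or d < best_dist:
            best_word, best_dist = word, d
    return best_word, best_dist

print(closest_word("baend", ["band", "bend", "bond", "hello"]))  # ('band', 1)
```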

For now, I am skipping Levenshtein in favour of a faster approximate distance using the SymSpell algorithm. It is highly accurate and much faster, and it reduced the computation time from 21 days to 13 hours.
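
Roughly how the SymSpell lookup looks, as a sketch (assuming the symspellpy package; the tiny in-memory dictionary is just for illustration, in practice a frequency dictionary file is loaded instead):

```python
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

# Tiny in-memory dictionary; real usage would call sym_spell.load_dictionary().
for term, count in [("band", 50000), ("bend", 30000), ("bond", 20000)]:
    sym_spell.create_dictionary_entry(term, count)

# CLOSEST returns only the suggestions at the smallest edit distance found.
for s in sym_spell.lookup("baend", Verbosity.CLOSEST, max_edit_distance=2):
    print(s.term, s.distance, s.count)
```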

0

Devinco001 OP t1_iwmaat8 wrote

I actually saw this very example first. Yeah, it requires a good amount of computational power, which my PC currently lacks. I can make API calls, but there are rate limits, and extending usage has to be paid for, which is why I have to drop that approach.

I was actually looking for a non-language-model approach for now, since language models are computation-heavy. I am currently going to use the SymSpell Python library, since it is faster, though less accurate. Once I upgrade my RAM, I will definitely start using LMs, since they are far better in accuracy. Thanks

1

Devinco001 OP t1_iwgaljf wrote

Yes, for example, if I have the word 'baend' and run it through Soundex + Levenshtein, it gives me 'band' and 'bend', both at a distance of 1. So I basically want to decide which of the two words would be the better choice.

Yes, the LM idea is awesome, but I am a bit low on memory and disk space. On Hugging Face, the LM that pops up for fill-mask is quite large, with significant computation time.

Can this be done without an LM, with something like frequency tables? Or is there an LM sort of thing where I can input the highest-ranked Soundex words and get a confidence score for each? Or is there an LM optimized for this task? I tried to find one but haven't found it yet.
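
Something like this is what I mean by a frequency table, as a rough sketch (the toy corpus and counts are made up; a real word-frequency list would replace them):

```python
from collections import Counter

# Toy corpus; in practice this would be a large frequency dictionary.
corpus = "the band played and the band toured while the road began to bend".split()
unigram_counts = Counter(corpus)

# Both candidates came out of Soundex + Levenshtein at edit distance 1.
candidates = ["band", "bend"]
best = max(candidates, key=lambda w: unigram_counts[w])
print(best)  # 'band' (count 2) beats 'bend' (count 1)
```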

1