Submitted by Devinco001 t3_105la5f in MachineLearning
Devinco001 OP t1_ix710w3 wrote
Reply to comment by pagein in [D] BERT related questions by Devinco001
This looks really interesting, thanks. Is it open source?
Devinco001 OP t1_ix2ewbe wrote
Reply to comment by LetterRip in [D] BERT related questions by Devinco001
Yes, they are short and conversational, with business intent. Average token length is around 10, and there are approximately 2.5 million sentences in total.
Devinco001 OP t1_ix08pyr wrote
Reply to comment by skelly0311 in [D] BERT related questions by Devinco001
I am actually going to use the embeddings to cluster the text in an unsupervised manner and surface the popular intents (rough sketch at the end of this comment).
1,2. I would be fine with a bit of a trade-off in accuracy. Time is the main concern, since I don't want it to take more than a day. Maybe I'll have to use something other than BERT.

3. Googled them, and RoBERTa seems to be the best choice. Much better than BERT-base or BERT-large.

4. I actually asked this because Google Colab has some restrictions on free usage.

5. Thanks, really good article
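Roughly what I have in mind, sketched with assumed tooling (sentence-transformers for the embeddings, scikit-learn KMeans for the clustering); the model name, sample sentences, and cluster count are placeholders, not the actual setup:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# ~2.5M short conversational sentences in practice; two toy ones here
sentences = ["I want to cancel my order", "How do I reset my password?"]

# small/fast model as a placeholder; a RoBERTa-based checkpoint would be
# the heavier, more accurate option
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, batch_size=256, show_progress_bar=True)

# cluster the embeddings; n_clusters is a guess to be tuned against the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)
print(labels)
```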
Submitted by Devinco001 t3_yzh6v1 in MachineLearning
Devinco001 OP t1_iwmvton wrote
Reply to comment by sameersoi in [D] Spellcheck and Levenshtein distance by Devinco001
Yeah, I googled BK trees. The data structure is amazing and cuts down computation time a lot. While searching for that, I found another algorithm, SymSpell, which is even faster with high accuracy but doesn't use Levenshtein, so I'm currently going to use that.
But BK trees are the go-to method when using pure Levenshtein or similar, more accurate string distances. So I will compare the accuracy of both algorithms and choose the better one. Thanks
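For reference, a minimal hand-rolled BK-tree sketch (not any particular library), using python-Levenshtein for the distance; lookups only descend into subtrees the triangle inequality allows:

```python
import Levenshtein

class BKTree:
    """Burkhard-Keller tree keyed by Levenshtein distance."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # (word, {distance: child_node})
        for word in it:
            self._add(word)

    def _add(self, word):
        node = self.root
        while True:
            d = Levenshtein.distance(word, node[0])
            if d == 0:
                return  # word already in the tree
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, word, tolerance):
        matches, stack = [], [self.root]
        while stack:
            candidate, children = stack.pop()
            d = Levenshtein.distance(word, candidate)
            if d <= tolerance:
                matches.append((candidate, d))
            # triangle inequality: only subtrees keyed within d +/- tolerance can match
            for key, child in children.items():
                if d - tolerance <= key <= d + tolerance:
                    stack.append(child)
        return matches

tree = BKTree(["band", "bend", "bond", "brand", "banned"])
print(tree.search("baend", 1))  # [('band', 1), ('bend', 1)] in some order
```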
Devinco001 OP t1_iwmv475 wrote
Reply to comment by [deleted] in [D] Spellcheck and Levenshtein distance by Devinco001
Yeah, I looked at some LMs on Hugging Face for mask filling. They looked good, but required significant memory and computational resources to run.
That approach is the best, but due to resource constraints I might have to fall back on simpler algorithms. Currently I am going to use SymSpell; it is surprisingly accurate and fast.
But I will keep looking for a less resource-hungry LM, since lookup time is low and they capture context and grammar better. Combining Levenshtein distance with the model's candidate outputs should increase accuracy considerably. Ultimately I will shift to that, thanks
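Roughly the hybrid I mean, as a sketch (the model choice and the distance weight are placeholders, not a tuned setup): mask the suspect word, take the LM's in-context candidates, and re-rank them by Levenshtein distance to the original spelling:

```python
import Levenshtein
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # placeholder model

def correct(sentence, suspect):
    # mask the suspect word and ask the LM for in-context candidates
    masked = sentence.replace(suspect, fill_mask.tokenizer.mask_token, 1)
    candidates = fill_mask(masked, top_k=20)
    # reward contextual probability, penalize spelling distance (weight is a guess)
    scored = [
        (c["token_str"], c["score"] - 0.3 * Levenshtein.distance(c["token_str"], suspect))
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])[0]

print(correct("I found a bog in the code", "bog"))  # likely "bug"
```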
Devinco001 OP t1_iwmez9k wrote
Reply to comment by threehofive in [D] Spellcheck and Levenshtein distance by Devinco001
Thanks, I am using a different algorithm now, SymSpell. But I haven't used multithreading so far. A really good idea; it would speed things up several times (rough sketch below).
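Something like this sketch, assuming a process pool (threads won't help much for CPU-bound edit distances in Python because of the GIL); correct_word is a stand-in for whatever correction function is used:

```python
from concurrent.futures import ProcessPoolExecutor

def correct_word(word):
    # placeholder for the real correction (SymSpell lookup, BK-tree search, ...)
    return word.lower()

if __name__ == "__main__":
    words = ["Baend", "Mispelled", "Grammer"]  # millions of tokens in practice
    with ProcessPoolExecutor() as pool:
        corrected = list(pool.map(correct_word, words, chunksize=10_000))
    print(corrected)
```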
Devinco001 OP t1_iwmebpe wrote
Reply to comment by goedel777 in [D] Spellcheck and Levenshtein distance by Devinco001
Sure, but it's just a for loop iterating through the words in the dictionary and using the Python library python-Levenshtein to calculate the distance between each dictionary word and the misspelled word.
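That brute-force loop, roughly:

```python
import Levenshtein

def nearest(misspelled, dictionary):
    # scan every dictionary word and keep the closest one
    best_word, best_dist = None, float("inf")
    for word in dictionary:
        d = Levenshtein.distance(misspelled, word)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word, best_dist

print(nearest("mispelled", ["misspelled", "dispelled", "compelled"]))
```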
For now, I am skipping pure Levenshtein in favor of a faster approximate approach, the SymSpell algorithm. It is highly accurate and much faster; it reduced computation time from 21 days to 13 hours.
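Usage via the symspellpy package looks roughly like this (the dictionary filename is a placeholder for a "term count" frequency file):

```python
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# one "term count" pair per line; path is a placeholder
sym_spell.load_dictionary("frequency_dictionary_en.txt", term_index=0, count_index=1)

for s in sym_spell.lookup("mispelled", Verbosity.CLOSEST, max_edit_distance=2):
    print(s.term, s.distance, s.count)
```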
Devinco001 OP t1_iwmdmgy wrote
Reply to comment by afireohno in [D] Spellcheck and Levenshtein distance by Devinco001
Thanks! A BK tree is much faster. While researching BK trees, I found the SymSpell algorithm, which is even faster. So I'm going to use that for now, because of its high accuracy and speed.
Devinco001 OP t1_iwmaat8 wrote
Reply to comment by cautioushedonist in [D] Phonetic Algorithm Spellcheck Metric by Devinco001
I actually saw this very example first. Yeah, it requires a good amount of computational power, which my PC currently lacks. I can make API calls, but they come with rate limits that have to be paid for to extend usage, which is why I had to drop that approach.
I was actually looking for a non-language-model-based approach for now, since language models are computation heavy. I am currently going to use the SymSpell Python library, since it is faster, though less accurate. Once I upgrade my RAM, I will surely start using LMs, since they are far better in accuracy. Thanks
Devinco001 OP t1_iwjxuul wrote
Reply to comment by goedel777 in [D] Spellcheck and Levenshtein distance by Devinco001
Yes, I have done that. Even after dropping the duplicates, the count still comes to 10M.
Submitted by Devinco001 t3_ywjd26 in MachineLearning
Devinco001 OP t1_iwgaljf wrote
Reply to comment by cautioushedonist in [D] Phonetic Algorithm Spellcheck Metric by Devinco001
Yes, for example if I have the word 'baend' and run it through Soundex + Levenshtein, it gives me 'band' and 'bend', both at a distance of 1. So I basically want to decide which of those words would be the better choice.
Yes, the LM idea is awesome. But I am a bit low on memory and disk space, and on Hugging Face the LM that pops up for mask filling is quite large, with significant computational time.
Can this be done without an LM, e.g. with some frequency tables? Or is there an LM-like tool where I can input the highest-ranked Soundex words and get a confidence score for each? Or an LM optimized for this task? I tried to find one but haven't yet (a frequency-based sketch follows).
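To make the frequency-table idea concrete, a rough sketch of what I mean (the counts and word list are made up): group dictionary words by Soundex code using the jellyfish package, shortlist by Levenshtein distance, then break ties by corpus frequency:

```python
import jellyfish

freq = {"band": 120_000, "bend": 45_000, "banned": 30_000}  # made-up counts
dictionary = list(freq)

def suggest(word):
    # keep dictionary words that sound alike, per Soundex
    code = jellyfish.soundex(word)
    candidates = [w for w in dictionary if jellyfish.soundex(w) == code] or dictionary
    # shortlist the ones at minimal Levenshtein distance
    best = min(jellyfish.levenshtein_distance(word, w) for w in candidates)
    shortlist = [w for w in candidates if jellyfish.levenshtein_distance(word, w) == best]
    # frequency table as the tie-breaker instead of a language model
    return max(shortlist, key=freq.get)

print(suggest("baend"))  # 'band' beats 'bend' on frequency
```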
Submitted by Devinco001 t3_yvngga in MachineLearning
Devinco001 OP t1_ix75z7w wrote
Reply to comment by pagein in [D] BERT related questions by Devinco001
Will surely check them out, thanks