Submitted by Devinco001 t3_105la5f in MachineLearning
Devinco001 OP t1_ix710w3 wrote
Reply to comment by pagein in [D] BERT related questions by Devinco001
This looks really interesting, thanks. Is it open source?
Devinco001 OP t1_ix2ewbe wrote
Reply to comment by LetterRip in [D] BERT related questions by Devinco001
Yes, they are short and conversational, with business intent. Average token length is around 10, and there are approximately 2.5 million sentences in total.
Devinco001 OP t1_ix08pyr wrote
Reply to comment by skelly0311 in [D] BERT related questions by Devinco001
I am actually going to use the embeddings to cluster the text in an unsupervised manner and surface the popular intents (rough sketch at the end of this comment).
1,2. I would be fine with a bit of a trade-off in accuracy. Time is the main concern, since I don't want it to take more than a day. Maybe I'll have to use something other than BERT.

3. Googled them, and RoBERTa seems to be the best choice. Much better than BERT-base or BERT-large.

4. I actually asked this because Google Colab has some restrictions on free usage.

5. Thanks, really good article
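Roughly what I have in mind, sketched with assumed tooling (sentence-transformers for the embeddings, scikit-learn KMeans for the clustering); the model name, sample sentences, and cluster count are placeholders, not the actual setup:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# ~2.5M short conversational sentences in practice; two toy ones here
sentences = ["I want to cancel my order", "How do I reset my password?"]

# small/fast model as a placeholder; a RoBERTa-based checkpoint would be
# the heavier, more accurate option
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, batch_size=256, show_progress_bar=True)

# cluster the embeddings; n_clusters is a guess to be tuned against the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)
print(labels)
```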
Submitted by Devinco001 t3_yzh6v1 in MachineLearning
Devinco001 OP t1_iwmvton wrote
Reply to comment by sameersoi in [D] Spellcheck and Levenshtein distance by Devinco001
Yeah, I googled BK trees. The data structure is amazing and cuts down computation time a lot. While searching for that, I found another algorithm, SymSpell, which is even faster with high accuracy but doesn't use Levenshtein, so I'm currently going to use that.
But BK trees are the go-to method when using pure Levenshtein or similar, more accurate string distances. So I will compare the accuracy of both algorithms and choose the better one. Thanks
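For reference, a minimal hand-rolled BK-tree sketch (not any particular library), using python-Levenshtein for the distance; lookups only descend into subtrees the triangle inequality allows:

```python
import Levenshtein

class BKTree:
    """Burkhard-Keller tree keyed by Levenshtein distance."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # (word, {distance: child_node})
        for word in it:
            self._add(word)

    def _add(self, word):
        node = self.root
        while True:
            d = Levenshtein.distance(word, node[0])
            if d == 0:
                return  # word already in the tree
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, word, tolerance):
        matches, stack = [], [self.root]
        while stack:
            candidate, children = stack.pop()
            d = Levenshtein.distance(word, candidate)
            if d <= tolerance:
                matches.append((candidate, d))
            # triangle inequality: only subtrees keyed within d +/- tolerance can match
            for key, child in children.items():
                if d - tolerance <= key <= d + tolerance:
                    stack.append(child)
        return matches

tree = BKTree(["band", "bend", "bond", "brand", "banned"])
print(tree.search("baend", 1))  # [('band', 1), ('bend', 1)] in some order
```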
Devinco001 OP t1_iwmv475 wrote
Reply to comment by [deleted] in [D] Spellcheck and Levenshtein distance by Devinco001
Yeah, I looked at some LMs on Hugging Face for mask filling. They looked good, but required significant memory and computational resources to run.
That approach is the best, but due to resource constraints I might have to fall back on simpler algorithms. Currently I am going to use SymSpell; it is surprisingly accurate and fast.
But I will keep looking for a less resource-hungry LM, since lookup time is low and they capture context and grammar better. Combining Levenshtein distance with the model's candidate outputs should increase accuracy considerably. Ultimately I will shift to that, thanks
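Roughly the hybrid I mean, as a sketch (the model choice and the distance weight are placeholders, not a tuned setup): mask the suspect word, take the LM's in-context candidates, and re-rank them by Levenshtein distance to the original spelling:

```python
import Levenshtein
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # placeholder model

def correct(sentence, suspect):
    # mask the suspect word and ask the LM for in-context candidates
    masked = sentence.replace(suspect, fill_mask.tokenizer.mask_token, 1)
    candidates = fill_mask(masked, top_k=20)
    # reward contextual probability, penalize spelling distance (weight is a guess)
    scored = [
        (c["token_str"], c["score"] - 0.3 * Levenshtein.distance(c["token_str"], suspect))
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])[0]

print(correct("I found a bog in the code", "bog"))  # likely "bug"
```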
Devinco001 OP t1_iwmez9k wrote
Reply to comment by threehofive in [D] Spellcheck and Levenshtein distance by Devinco001
Thanks, I am using a different algorithm now, SymSpell. But I haven't used multithreading so far. A really good idea; it would speed things up several times (rough sketch below).
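Something like this sketch, assuming a process pool (threads won't help much for CPU-bound edit distances in Python because of the GIL); correct_word is a stand-in for whatever correction function is used:

```python
from concurrent.futures import ProcessPoolExecutor

def correct_word(word):
    # placeholder for the real correction (SymSpell lookup, BK-tree search, ...)
    return word.lower()

if __name__ == "__main__":
    words = ["Baend", "Mispelled", "Grammer"]  # millions of tokens in practice
    with ProcessPoolExecutor() as pool:
        corrected = list(pool.map(correct_word, words, chunksize=10_000))
    print(corrected)
```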
Devinco001 OP t1_iwmebpe wrote
Reply to comment by goedel777 in [D] Spellcheck and Levenshtein distance by Devinco001
Sure, but it's just a for loop iterating through the words in the dictionary and using the Python library python-Levenshtein to calculate the distance between each dictionary word and the misspelled word.
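That brute-force loop, roughly:

```python
import Levenshtein

def nearest(misspelled, dictionary):
    # scan every dictionary word and keep the closest one
    best_word, best_dist = None, float("inf")
    for word in dictionary:
        d = Levenshtein.distance(misspelled, word)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word, best_dist

print(nearest("mispelled", ["misspelled", "dispelled", "compelled"]))
```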
For now, I am skipping pure Levenshtein in favor of a faster approximate approach, the SymSpell algorithm. It is highly accurate and much faster; it reduced computation time from 21 days to 13 hours.
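Usage via the symspellpy package looks roughly like this (the dictionary filename is a placeholder for a "term count" frequency file):

```python
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# one "term count" pair per line; path is a placeholder
sym_spell.load_dictionary("frequency_dictionary_en.txt", term_index=0, count_index=1)

for s in sym_spell.lookup("mispelled", Verbosity.CLOSEST, max_edit_distance=2):
    print(s.term, s.distance, s.count)
```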
Devinco001 OP t1_iwmdmgy wrote
Reply to comment by afireohno in [D] Spellcheck and Levenshtein distance by Devinco001
Thanks! A BK tree is much faster. While researching BK trees, I found the SymSpell algorithm, which is even faster. So I'm going to use that for now, because of its high accuracy and speed.
Devinco001 OP t1_iwmaat8 wrote
Reply to comment by cautioushedonist in [D] Phonetic Algorithm Spellcheck Metric by Devinco001
I actually saw this very example first. Yeah, it requires a good amount of computational power, which my PC currently lacks. I can make API calls, but they come with rate limits that have to be paid for to extend usage, which is why I had to drop that approach.
I was actually looking for a non-language-model-based approach for now, since language models are computation heavy. I am currently going to use the SymSpell Python library, since it is faster, though less accurate. Once I upgrade my RAM, I will surely start using LMs, since they are far better in accuracy. Thanks
Devinco001 OP t1_iwjxuul wrote
Reply to comment by goedel777 in [D] Spellcheck and Levenshtein distance by Devinco001
Yes, I have done that. Even after dropping the duplicates, the count still comes to 10M.
Submitted by Devinco001 t3_ywjd26 in MachineLearning
Devinco001 OP t1_iwgaljf wrote
Reply to comment by cautioushedonist in [D] Phonetic Algorithm Spellcheck Metric by Devinco001
Yes, for example if I have the word 'baend' and run it through Soundex + Levenshtein, it gives me 'band' and 'bend', both at a distance of 1. So I basically want to decide which of those words would be the better choice.
Yes, the LM idea is awesome. But I am a bit low on memory and disk space, and on Hugging Face the LM that pops up for mask filling is quite large, with significant computational time.
Can this be done without an LM, e.g. with some frequency tables? Or is there an LM-like tool where I can input the highest-ranked Soundex words and get a confidence score for each? Or an LM optimized for this task? I tried to find one but haven't yet (a frequency-based sketch follows).
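To make the frequency-table idea concrete, a rough sketch of what I mean (the counts and word list are made up): group dictionary words by Soundex code using the jellyfish package, shortlist by Levenshtein distance, then break ties by corpus frequency:

```python
import jellyfish

freq = {"band": 120_000, "bend": 45_000, "banned": 30_000}  # made-up counts
dictionary = list(freq)

def suggest(word):
    # keep dictionary words that sound alike, per Soundex
    code = jellyfish.soundex(word)
    candidates = [w for w in dictionary if jellyfish.soundex(w) == code] or dictionary
    # shortlist the ones at minimal Levenshtein distance
    best = min(jellyfish.levenshtein_distance(word, w) for w in candidates)
    shortlist = [w for w in candidates if jellyfish.levenshtein_distance(word, w) == best]
    # frequency table as the tie-breaker instead of a language model
    return max(shortlist, key=freq.get)

print(suggest("baend"))  # 'band' beats 'bend' on frequency
```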
Submitted by Devinco001 t3_yvngga in MachineLearning
Devinco001 OP t1_ix75z7w wrote
Reply to comment by pagein in [D] BERT related questions by Devinco001
Will surely check them out, thanks