Can you please elaborate your answers and quantify?
I'm most interested in the effort for bullets 2 and 3. In your own experience, did it take hours, days, weeks?
That was not the answer I was hoping for, but very helpful :)
Do you have any code/repo to share? I'm only able to find the DistilBERT implementation in Apple's repo; I'd like to see some other examples.
I was hoping to just fine-tune the model and let the training last days at most. Seems like my best chance is to wait for distilled Stable Diffusion and use their CLIP encoder, as u/LetterRip mentions.
This may not be the direct answer, but it's applicable to many problems:
1. Use the simplest approach first. Here that would be a simple model: a flat fully-connected network.
2. Measure the results.
3. If the results aren't good enough, think about what could improve them: a different model architecture, a different training procedure, obtaining more data...
4. Iterate (go to step 2).
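The "simplest approach" in step 1 could look something like this minimal PyTorch sketch. All dimensions here are made up for illustration (a tabular input of ~1K features, one hidden layer):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: ~1K flat input features, binary target.
IN_FEATURES = 1024

# The flat fully-connected baseline: no per-group structure, just one MLP.
model = nn.Sequential(
    nn.Linear(IN_FEATURES, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

x = torch.randn(32, IN_FEATURES)  # a batch of 32 examples
logits = model(x)                 # shape: (32, 1)
```

Train this with any standard loop, measure, and only add complexity if the numbers demand it.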
Also:
`creating linear or embedding layers for each feature group before combining them together` - this injects prior knowledge into the network, so it may help. In theory, though, the network should be able to discover that structure on its own: combinations that don't make sense will end up with weights close to zero. That's why I advise you to start without it (and only try adding it afterwards).
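For reference, the per-group variant being discussed could be sketched like this in PyTorch. The feature groups, vocabulary sizes, and dimensions are all hypothetical, just to show the pattern of embedding each group before concatenation:

```python
import torch
import torch.nn as nn

# Hypothetical setup: two categorical feature groups plus a numeric block.
# Each categorical group gets its own embedding before everything is combined.
class GroupedNet(nn.Module):
    def __init__(self, n_cat_a=100, n_cat_b=50, n_numeric=20, emb_dim=16):
        super().__init__()
        self.emb_a = nn.Embedding(n_cat_a, emb_dim)
        self.emb_b = nn.Embedding(n_cat_b, emb_dim)
        self.head = nn.Sequential(
            nn.Linear(emb_dim * 2 + n_numeric, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cat_a, cat_b, numeric):
        # Concatenate the per-group representations, then apply the shared head.
        x = torch.cat([self.emb_a(cat_a), self.emb_b(cat_b), numeric], dim=-1)
        return self.head(x)

model = GroupedNet()
out = model(torch.tensor([3, 7]), torch.tensor([1, 4]), torch.randn(2, 20))
# out has shape (2, 1)
```

Compared to the flat baseline, this only pays off if the grouping really does encode useful structure, which is exactly why measuring both is the safer path.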
1K+ features: in some settings that's a lot of features, in others it isn't a big number... but it may make sense to reduce the feature count with a dimensionality-reduction technique.
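As one concrete option, PCA via an SVD of the centered data is a common starting point. A toy NumPy sketch with made-up sizes (500 samples, 1024 features, reduced to 64):

```python
import numpy as np

# Toy illustration: project 1024-dimensional features down to 64 with PCA.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1024))      # 500 samples, 1K+ features each

X_centered = X - X.mean(axis=0)
# Right singular vectors of the centered data are the principal axes.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:64].T    # shape: (500, 64)
```

Whether 64 (or any other target dimension) is appropriate depends on how much variance the leading components actually capture, so it's worth checking the singular values before committing.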
alkibijad OP t1_j7f1b2e wrote
Reply to comment by vade in [D] Apple's ane-transformers - experiences? by alkibijad
Looking forward to hearing their experiences!