jiamengial t1_j6sj3l2 wrote
Reply to [D] Audio segmentation - Machine Learning algorithm to segment a audio file into multiple class by PlayfulMenu1395
Using something like a CTC loss might be a good shout - you could basically say you're doing "speech recognition", but instead of recognising (sub)words you're recognising classes
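Something like this is what I mean - a rough sketch with PyTorch's `nn.CTCLoss`, where the "vocabulary" is your segment classes plus the blank (the class count, shapes, and label sequences here are all made up):

```python
import torch
import torch.nn as nn

# "Vocabulary" = segment classes instead of (sub)words; index 0 is the CTC blank.
NUM_CLASSES = 4          # e.g. speech / music / noise / applause
BLANK = 0

T, N = 100, 2            # frames per utterance, batch size
ctc = nn.CTCLoss(blank=BLANK)

# Frame-level logits from whatever acoustic encoder you're using: (T, N, C+1)
logits = torch.randn(T, N, NUM_CLASSES + 1, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)

# Targets are per-utterance class sequences, e.g. [speech, music, speech];
# the padding in the second row is ignored thanks to target_lengths.
targets = torch.tensor([[1, 3, 1],
                        [2, 1, 1]])
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([3, 2])

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Greedy decoding (argmax per frame, collapse repeats, drop blanks) then gives you the class sequence, and the frame spans of each run give you rough segment boundaries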
jiamengial t1_j6mdcrj wrote
Don't think so - diffusion models are based entirely on sampling methods. If anything, what's exciting is taking the "traditional" methods and, instead of replacing the whole thing with neural nets, replacing only a component of it
jiamengial OP t1_j6mc7vs wrote
Reply to comment by the_Wallie in [D] What's stopping you from working on speech and voice? by jiamengial
To challenge this a little though: surely at some point people thought free-form text was unstructured data?
jiamengial OP t1_j6mbv97 wrote
Reply to comment by babua in [D] What's stopping you from working on speech and voice? by jiamengial
That's a good point - CTC and attention mechanisms work on the basis that you've got the whole segment of audio
jiamengial OP t1_j6j95fq wrote
Reply to comment by psma in [D] What's stopping you from working on speech and voice? by jiamengial
Presumably this would be through certain protocols like WebSockets and WebRTC? Or more like direct integration with Zoom?
jiamengial OP t1_j6j8c8c wrote
Reply to comment by jiamengial in [D] What's stopping you from working on speech and voice? by jiamengial
To go into your question further, one area that might be really interesting is open standards or formats for speech data - like the MLF formats in HTK and Kaldi but, like, modern - so that (to the point others here have made w.r.t. data storage costs) datasets can be hosted more centrally and people don't have to reformat them into their own data storage structures (which, let's face it, are basically someone's folder structure)
jiamengial OP t1_j6j6ruc wrote
Reply to comment by blackkettle in [D] What's stopping you from working on speech and voice? by jiamengial
If anything this is what's motivating me; getting Kaldi (or any of these other repos) to compile and run on your own data is usually painful enough that it puts off anyone who isn't already knowledgeable in the area. Wrappers such as pykaldi and Montreal Forced Aligner try to resolve a lot of these problems, but only really add to them.
I've personally had great experiences with repos like NeMo, though that was mainly through nailing myself to a specific commit on the main branch and heavily wrapping the various classes I needed to use (I still have no idea what a manifest file format should look like)
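For reference, the manifests I ended up with were JSON-lines files along these lines - treat the exact field names as an assumption and check them against whatever NeMo commit you're pinned to (the paths here are made up):

```python
import json

# One JSON object per line; "audio_filepath", "duration" and "text" are the
# fields I believe NeMo's ASR recipes expect, but verify against your version.
examples = [
    {"audio_filepath": "/data/clips/utt_0001.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "/data/clips/utt_0002.wav", "duration": 1.7, "text": "goodbye"},
]

with open("train_manifest.json", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```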
The field is still incredibly recipe-heavy in terms of setting up systems and running them; if you're someone testing the waters with speech processing (especially if you want to go beyond STT or vanilla TTS), there's little to nothing that compares to the likes of HuggingFace on the text side
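To illustrate what I mean by the text side: with HuggingFace, kicking the tyres on a model is a couple of lines (using the pipeline's default model here), and there's no real speech equivalent of this once you step outside basic ASR:

```python
from transformers import pipeline

# Two lines to a working text classifier; the model is just the pipeline default.
classifier = pipeline("sentiment-analysis")
print(classifier("Getting Kaldi to compile on my machine was painful."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```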
jiamengial t1_j6iiux3 wrote
Reply to [D] DL university research PC suggestions? by seanrescs
Where do you plan to put the machine? If it's anywhere near where you (or anyone else) work I'd recommend getting it liquid cooled if you want to save your hearing.
The A6000s don't have active cooling of their own and are definitely meant to last a whole lot longer than the 4090s, so they'll be better if you plan to use the machine for quite a while or want to retain resale value for the future
Submitted by jiamengial t3_10p66zc in MachineLearning
jiamengial t1_j6t854s wrote
Reply to comment by uhules in [D] Audio segmentation - Machine Learning algorithm to segment a audio file into multiple class by PlayfulMenu1395
That's true - I was thinking that flat frame-wise predictions could lead to incorrect mid-segment predictions, which might be an annoying model error to get