jiamengial t1_j6sj3l2 wrote
Reply to [D] Audio segmentation - Machine Learning algorithm to segment a audio file into multiple class by PlayfulMenu1395
Using something like a CTC loss might be a good shout - you could basically say you're doing "speech recognition", but instead of recognising (sub)words you're recognising classes
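Something like this is what I mean - a rough sketch with PyTorch's `nn.CTCLoss`, where the "vocabulary" is your segment classes plus the blank (the class count, shapes, and label sequences here are all made up):

```python
import torch
import torch.nn as nn

# "Vocabulary" = segment classes instead of (sub)words; index 0 is the CTC blank.
NUM_CLASSES = 4          # e.g. speech / music / noise / applause
BLANK = 0

T, N = 100, 2            # frames per utterance, batch size
ctc = nn.CTCLoss(blank=BLANK)

# Frame-level logits from whatever acoustic encoder you're using: (T, N, C+1)
logits = torch.randn(T, N, NUM_CLASSES + 1, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)

# Targets are per-utterance class sequences, e.g. [speech, music, speech];
# the padding in the second row is ignored thanks to target_lengths.
targets = torch.tensor([[1, 3, 1],
                        [2, 1, 1]])
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([3, 2])

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Greedy decoding (argmax per frame, collapse repeats, drop blanks) then gives you the class sequence, and the frame spans of each run give you rough segment boundaries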
jiamengial t1_j6mdcrj wrote
Don't think so - diffusion models are based entirely on sampling methods. If anything, what's exciting is taking the "traditional" methods and, instead of replacing the whole thing with neural nets, replacing only a component of it
jiamengial OP t1_j6mc7vs wrote
Reply to comment by the_Wallie in [D] What's stopping you from working on speech and voice? by jiamengial
To challenge this a little though: surely at some point people thought free-form text was unstructured data?
jiamengial OP t1_j6mbv97 wrote
Reply to comment by babua in [D] What's stopping you from working on speech and voice? by jiamengial
That's a good point - CTC and attention mechanisms work on the basis that you've got the whole segment of audio
jiamengial OP t1_j6j95fq wrote
Reply to comment by psma in [D] What's stopping you from working on speech and voice? by jiamengial
Presumably this would be through certain protocols like WebSockets and WebRTC? Or more like direct integration with Zoom?
jiamengial OP t1_j6j8c8c wrote
Reply to comment by jiamengial in [D] What's stopping you from working on speech and voice? by jiamengial
To go into your question further, one area that might be really interesting is open standards or formats for speech data - like the MLF formats in HTK and Kaldi but, like, modern - so that (to the point others here have made w.r.t. data storage costs) datasets can be hosted more centrally and people don't have to reformat them into their own data storage structures (which, let's face it, are basically someone's folder structure)
jiamengial OP t1_j6j6ruc wrote
Reply to comment by blackkettle in [D] What's stopping you from working on speech and voice? by jiamengial
If anything this is what's motivating me; getting Kaldi (or any of these other repos) to compile and run on your own data is usually painful enough that it puts off anyone who isn't already knowledgeable in the area. Wrappers such as pykaldi and Montreal Forced Aligner try to resolve a lot of these problems, but only really add to them.
I've personally had great experiences with repos like NeMo, though that was mainly through nailing myself to a specific commit on the main branch and heavily wrapping the various classes I needed to use (I still have no idea what a manifest file format should look like)
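For reference, the manifests I ended up with were JSON-lines files along these lines - treat the exact field names as an assumption and check them against whatever NeMo commit you're pinned to (the paths here are made up):

```python
import json

# One JSON object per line; "audio_filepath", "duration" and "text" are the
# fields I believe NeMo's ASR recipes expect, but verify against your version.
examples = [
    {"audio_filepath": "/data/clips/utt_0001.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "/data/clips/utt_0002.wav", "duration": 1.7, "text": "goodbye"},
]

with open("train_manifest.json", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```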
The field is still incredibly recipe-heavy in terms of setting up systems and running them; if you're someone testing the waters with speech processing (especially if you want to go beyond STT or vanilla TTS), there's little to nothing that compares to the likes of HuggingFace on the text side
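To illustrate what I mean by the text side: with HuggingFace, kicking the tyres on a model is a couple of lines (using the pipeline's default model here), and there's no real speech equivalent of this once you step outside basic ASR:

```python
from transformers import pipeline

# Two lines to a working text classifier; the model is just the pipeline default.
classifier = pipeline("sentiment-analysis")
print(classifier("Getting Kaldi to compile on my machine was painful."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```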
jiamengial t1_j6iiux3 wrote
Reply to [D] DL university research PC suggestions? by seanrescs
Where do you plan to put the machine? If it's anywhere near where you (or anyone else) work I'd recommend getting it liquid cooled if you want to save your hearing.
The A6000s don't have active cooling of their own and are definitely meant to last a whole lot longer than the 4090s, so they'll be better if you plan to use the machine for quite a while or want to retain resale value for the future
Submitted by jiamengial t3_10p66zc in MachineLearning
jiamengial t1_j6t854s wrote
Reply to comment by uhules in [D] Audio segmentation - Machine Learning algorithm to segment a audio file into multiple class by PlayfulMenu1395
That's true - I was thinking that flat frame-wise predictions could lead to incorrect mid-segment predictions, which might be an annoying model error to get