HGFlyGirl t1_j2ewwry wrote on December 31, 2022 at 7:29 PM

Reply to comment by Apprehensive_Maize_4 in [D] Have you built ML models for your own use? by Lintaar

Tried a few of these things. The problem was that a lot of the songs had been ripped from CDs using different software. So some would be called things like track01.mp3, with a duplicate under a completely different file name. These could also have different byte lengths and durations. Then there are the ones that come from the original recording, the live version and/or the compilation album, which often differ a bit in all the parameters.

HGFlyGirl t1_j2cpl6v wrote on December 31, 2022 at 7:06 AM

Reply to comment by iantimmis in [D] Have you built ML models for your own use? by Lintaar

For each pair of files, I took their filename lengths, the Levenshtein distance between the filenames, their sizes in bytes and their durations in ticks.

I used the ML.NET AutoML API to train a binary classifier.
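A rough sketch of what the pair features described above might look like, written in Python rather than the ML.NET setup actually used. The `AudioFile` record, the `pair_features` helper and the feature order are hypothetical, and passing the raw values for both files (rather than differences between them) is an assumption, since the comment does not say which was used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioFile:
    # Hypothetical record for illustration only.
    # duration_ticks assumes .NET-style ticks (100 ns each).
    name: str
    size_bytes: int
    duration_ticks: int


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so each row is small
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


def pair_features(x: AudioFile, y: AudioFile) -> List[int]:
    """Build one feature row for a candidate duplicate pair (assumed layout)."""
    return [
        len(x.name),
        len(y.name),
        levenshtein(x.name, y.name),
        x.size_bytes,
        y.size_bytes,
        x.duration_ticks,
        y.duration_ticks,
    ]
```

Each row would then be labeled duplicate or not and handed to whatever binary classifier trainer is in use.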

HGFlyGirl t1_j2al2u4 wrote on December 30, 2022 at 9:05 PM

Reply to comment by nexflatline in [D] Protecting your model in a place where models are not intellectual property? by nexflatline

Whatever solution you find, be mindful of how it impacts the bottom line. It's easy to spend more on protecting against theft than you could lose from a theft.

It may be impossible to make it completely safe from theft, but it can be made difficult and, as you say, your customers have little knowledge of computers. I once had a customer actually pay a hacker to steal my software. I caught them at it, and a letter from the legal team was all I needed. I caught it because I had legitimate remote access.

Can you encrypt the model and make your software temporarily decrypt it at the point of inference? This might make the model useless in isolation.

HGFlyGirl t1_j2aen6d wrote on December 30, 2022 at 8:23 PM

Reply to [D] Have you built ML models for your own use? by Lintaar

I trained a model to find duplicate music files in my brother's huge collection of digital music. He was frustrated by so many duplicates that still had different file names, file sizes and tags. We couldn't find any existing software that could do it, because they all just looked for matches on those parameters. The model ended up working quite well.