
tripple13 t1_j8r8f7i wrote

This reads like one of those posts criticising open-source frameworks that don't always behave intuitively.

While I don't disagree that there are bugs, Hugging Face is doing more for Open ML than many large tech companies are doing.

HuggingFace, FastAI and similar frameworks are designed to lower the barrier to ML, such that any person with programming skills can harness the power of SoTA ML progress.

I think that's a great mission tbh, even if there are some inevitable bumps in the road.

152

narsilouu t1_j8sckwb wrote

Hijacking the top answer.
Disclaimer: I work at HF.

First of all, thanks for stating what goes wrong. This is the only way we have to get better (we work with our own tools ourselves, but we cannot possibly use them in all the various ways our community does, so we can't fix issues we're simply not aware of).

For all the issues you mention above, did you try opening issues when you encountered the problems? We're usually keen on answering promptly, and while I can't promise things will move your way (there are many tradeoffs in our libs), it at least helps inform the relevant people.

Just to give you an overview, there are three things we're trying to achieve:

- Never introduce breaking changes. (Or only very rarely: when something is brand new and we realize it's hurting users rather than helping, we feel OK breaking it. If something is old, we cannot break it, because people rely on it even when it's somewhat buggy.)
- Add SoTA models as fast as possible (and with as many options as possible). That requires help from the community, but also reusing tools that already exist, which sometimes requires creativity on our end to present widely different codebases in a somewhat consistent way. Most research codebases don't try to support widely different architectures (there's only a handful), so many things are hardcoded and have to be changed, and some bugs live in the original code, which we have to copy into our codebase to stay consistent (like position_ids starting at 2 for RoBERTa: https://github.com/huggingface/transformers/issues/10736).

- And have a very hackable codebase. Contrary to most "beautiful" code where DRY is the dogma, transformers tries to be hackable instead. This comes from our origins with research-heavy users, who don't want to spend two hours understanding class inheritance and hunting for the code that does X to the input tensor just to create a new layer. That means transformers (at least) contains a lot of deliberately duplicated code (we even have an internal cookie-cutter tool to maintain the copies as easily as possible).

The consequence of this is that if you have a clever idea X to improve upon, let's say, Whisper, you should be able to copy-paste the whisper folder and get going. While it might seem odd to some, it is still a design choice, which comes with pros and cons like any design choice.
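To make the RoBERTa example concrete, the quirk boils down to roughly this (a pure-Python sketch of the fairseq behavior, not our actual implementation; the token ids are illustrative):

```python
def create_position_ids(input_ids, padding_idx=1):
    """Fairseq-style position ids: real tokens are numbered starting from
    padding_idx + 1, so the first token sits at position 2, while padding
    tokens keep position padding_idx. Ports must copy this quirk for the
    pretrained position embeddings to line up."""
    positions, count = [], 0
    for tok in input_ids:
        if tok == padding_idx:
            positions.append(padding_idx)
        else:
            count += 1
            positions.append(count + padding_idx)
    return positions

print(create_position_ids([0, 31414, 232, 2, 1, 1]))  # -> [2, 3, 4, 5, 1, 1]
```

A port that numbers positions from 0 instead would still load the pretrained weights, but would silently look up the wrong embedding rows.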

And just to set things straight: we don't try to shovel our hub into our tools. We have a lot of testing to make sure local models work all the time, and we actually rely on that in several internal projects.
Breaking changes are a very big concern of ours. Subtle breaking changes are most likely unintentional (please report them!).

As for reinventing things that already exist in other libraries, do you have examples in mind? We're very careful about the use of our time, and also about the number of dependencies we rely on. Adding a dependency for an is_pair function is not something we like to do. If a dependency is too large for what we need, we skip it. If we can't build the functionality ourselves in reasonable time, then it becomes a mostly optional dependency.

Thanks for reading this to the end.
And for all readers, please rest assured we are continuously trying to write the best code we can given our three constraints above. Any issue or pain, no matter how trivial: please report it, it does help us improve. Our open-source and free code may not be the best (we're aware of some warts), but please, please never doubt that we're trying to do our best. And do not hesitate to contribute to make it better if you feel you know better than us (and you could definitely be right!).

71

dojoteef t1_j8sqm4i wrote

I commend what Huggingface is trying to do (be the source for the latest models that is consistent and easy to use), but every time I've used the library I've had to tackle bugs that were very time consuming to pinpoint, which is exacerbated by the structure of the code. The worst bugs have been subtle heisenbugs: the code seemed to work most of the time, but failed at other times. The heisenbugs are what made me stop using Huggingface altogether, unless it's my only option.

For example, I ran into a bug that only manifested when downloading a specific pretrained model for a task, which in turn downloaded a config file that contained the actual bug. As a user it was super difficult to locate the source of the bug without extensive spelunking. I've hit many similarly hard-to-diagnose issues each time I've used the Huggingface ecosystem.

I understand that what you're tasked with as a company is a huge undertaking for such a small team. Maybe splitting the package into a "stable" package and a "nightly" package could help (with stable being extensively bug tested more like an Ubuntu LTS release). My guess is that your team is likely too small to support that approach while adding new features at the same speed.

14

drinkingsomuchcoffee OP t1_j8sm1y7 wrote

Thank you for replying. I apologize for the harsh tone; I meant it as a wake-up call that people are reading the code and they do care about quality.

Do continue to avoid inheritance. In fact, probably ban inheritance unless it's only one layer deep and inherits from an abstract base class.

But don't misunderstand DRY. DRY is not about compressing code as much as possible; that's code golfing. DRY is about having one place for each piece of information to live, that's it. If you see a dev creating a poorly named function or abstraction just to deduplicate 5 lines, that's not DRY, that's just bad code.

You can achieve DRY with code generators, as you mention, but splitting things into separate modules is also fine. A code generator is DRY because the generator is the single point of truth for the information, even if it emits "duplicate" code. That is what a real understanding of DRY looks like.
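As a toy illustration (hypothetical names, nothing from the HF codebase), a generator like this is DRY even though its output looks duplicated:

```python
# The template is the one place the information lives, so this is DRY
# even though each emitted class looks like a copy-paste of the others.
TEMPLATE = """\
class {name}Attention:
    def forward(self, x):
        return x * {scale}  # {name}-specific scaling, inline and editable
"""

namespace = {}
for name, scale in [("Bert", 0.125), ("Gpt", 0.25)]:
    # A real tool would write each rendered source to a per-model file.
    exec(TEMPLATE.format(name=name, scale=scale), namespace)

print(namespace["BertAttention"]().forward(8.0))  # -> 1.0
print(namespace["GptAttention"]().forward(8.0))   # -> 2.0
```

Fix a bug in the template and every generated copy gets the fix on the next run; that's the single point of truth doing its job.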

People who want to "hack" on code don't mind copying a few folders. But if you have a beautiful module of pure functions for calculating statistics, it is flat-out stupid to copy-paste it into every folder to be more "hackable". Don't do this. Factor these out into simple pure modules instead.
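Something like this hypothetical module costs nothing in hackability, because nothing in it ever needs editing when you fork a model folder:

```python
# stats.py: pure functions with no model-specific state. Import it from
# everywhere; copying it into every model folder buys you nothing.
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Population variance: average squared deviation from the mean.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def stddev(xs):
    return sqrt(variance(xs))

print(stddev([2, 4, 4, 4, 5, 5, 7, 9]))  # -> 2.0
```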

6

fasttosmile t1_j8tudd4 wrote

You don't need to explain what DRY is. You need to understand that there is an unavoidable trade-off between centralizing a codebase (creating shared functions/classes in modules that many other modules import from) and keeping it hackable.

They have a blog post on this.

10

drinkingsomuchcoffee OP t1_j8unlkp wrote

Alright, I have a bit of time so I'll address a few things.

>You need to understand that there is a trade-off between centralizing [...] versus keeping it hackable that is unavoidable.

I don't know what "hackable" means; you haven't defined it. I'm going to use the most generous interpretation: you can modify the code without impacting other places. Well, you can do that even when it's centralized: just copy-paste the code into your file and edit it. That's no excuse to completely ban centralization! Alternatively, decompose the centralized function further and use only the pieces you need.

Now onto the blog post.

>If a bug is found in one of the model files, we want to make it as easy as possible for the finder to fix it. There is little that is more demotivating than fixing a bug only to see that it caused 100 failures of other models.

Maybe it _should_ cause hundreds of failures if it's a breaking change (a bug). That's a pretty good sign you really did screw something up.

>Similarly, it's easier to add new modeling code and review the corresponding PR if only a single new model file is added.

No, it's not. If new code uses a battle-tested core, I don't have to review those parts as thoroughly. If it's copy-pasted, I still have to review it and make sure they didn't copy an old version with bugs, or slightly modify it and break something. Sounds like this is common, since many people have complained about dozens of bugs!

>We assume that a significant amount of users of the Transformers library not only read the documentation, but also look into the actual modeling code and potentially modify it. This hypothesis is backed by the Transformers library being forked over 10,000 times and the Transformers paper being cited over a thousand times.

Maybe you should check your assumptions before you make a fundamental decision (you know, basic engineering). Plenty of libraries are forked purely for archival purposes and never modified. Nor should you cater to a small minority if most people _aren't_ doing this.

> Providing all the necessary logical components in order in a single modeling file helps a lot to achieve improved readability and adaptability.

It can, _sometimes_. But not always. Having one massive file named `main.py` is not more readable than a well-split program. This seems like basic common sense to me, but here's a whole book chapter on the subject, from The Art of Unix Programming: http://www.catb.org/esr/writings/taoup/html/ch04s01.html

>Every time we would have to have asked ourselves whether the "standard" attention function should be adapted or whether it would have been better to add a new attention function to attention.py. But then how do we name it? attention_with_positional_embd, reformer_attention, deberta_attention?

Yep, you've identified a place where you shouldn't try to fit every idea under a single "Attention" class. That's just common-sense programming, not an argument against writing good shared functions or classes.

>Once a machine learning model is published, it is rarely adapted or changed afterward.

Then why does the Bert module have changes as recent as this week, with commits from dozens of authors going back years?

https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert

This is irrefutable hard evidence against your argument.

> Sylvain Gugger found a great mechanism that respects both the single file policy and keeps maintainability cost in bounds. This mechanism, loosely called "the copying mechanism", allows us to mark logical components, such as an attention layer function, with a `# Copied from <predecessor_model>.<function>` statement

Ok, so the programmer you mentioned before is still going to "break 100s of tests" when she changes this ad-hoc C-preprocessor knock-off. You're still doing DRY; you're just doing it the way C programmers did it 30 years ago, in a much more complicated manner.
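For readers who haven't seen it, that mechanism amounts to roughly this (a toy sketch with illustrative names and format, not HF's actual tooling): every marked block must match its declared source verbatim, or the check fails.

```python
import re

# Canonical definitions, keyed by "module.function".
SOURCES = {
    "bert.attention": "def attention(q, k, v):\n    return (q @ k.T) @ v\n",
}

COPY_RE = re.compile(r"# Copied from (\S+)\n((?:.*\n)*)")

def drifted_copies(text):
    """Return the keys whose '# Copied from' block no longer matches
    its declared source byte for byte."""
    bad = []
    for key, body in COPY_RE.findall(text):
        if SOURCES.get(key) != body:
            bad.append(key)
    return bad

ok = "# Copied from bert.attention\n" + SOURCES["bert.attention"]
stale = "# Copied from bert.attention\ndef attention(q, k, v):\n    return q @ k.T\n"
print(drifted_copies(ok), drifted_copies(stale))  # -> [] ['bert.attention']
```

Which is exactly my point: it's a single source of truth enforced by a text-matching tool, i.e. DRY with extra steps.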

If anyone here works at HuggingFace, please forward this to the author of that article.

5

hpstring t1_j8x5iyh wrote

I'm a beginner in this field and I was wondering what it means for code to be "centralized" and "DRY". Does "centralized" mean putting a lot of code in a single file, and does "DRY" mean raw code that is not very easy to read but is efficient or has some other advantage?

1

baffo32 t1_j8zbmua wrote

dry is a very basic software engineering principle that means keeping only one copy of every sequence of code. it looks like machine learning people did not learn this, as they weren’t trained as software engineers. DRY stands for “don’t repeat yourself”; when it is not respected, software gets harder and slower to maintain, improve, or bugfix the larger and older it gets.

2

baffo32 t1_j8zbuup wrote

i think by centralized they mean what they imagine dry looking like: putting code in one place rather than spreading it out. that’s not how the word is usually used, but it’s a reasonable expression; people usually centralize components so there is one organized place to go to in order to access them.

2

hpstring t1_j8zw7i5 wrote

Thanks a lot! I didn't receive training from a software engineering perspective, which seems to be an important aspect of machine learning.

1

baffo32 t1_j90uucx wrote

it’s important if you’re publishing large software packages. of course, lots of hobbyists also learn it in the field.

1

fasttosmile t1_j8v03xu wrote

> I don't know what hackable means. You haven't defined it. I'm going to use the most generous interpretation to mean, you can modify it without impacting other places. Well you can do that if it's centralized, just copy paste it into your file and then edit it- that's no excuse to completely ban centralization! Alternatively decompose the centralized function more and only use the pieces you need.

Your definition of hackable is almost it. What’s missing is that decentralized code is much, much easier to understand, because each file is straightforward and doesn’t have to take 10 different things into account.

You can’t just copy-paste one file if the code is centralized; you’ll have to copy several, and the main issue is that it’s going to take a while to understand which ones (and you’ll have to modify the imports etc., unless you copy the entire repo; are you seriously suggesting that? lmao) and what’s safe to modify inside them. Decomposing is just going to make things more complicated for no gain.

Deep learning is about the details, and whenever you start breaking things apart and scattering the details into different corners, that’s how you end up with code that is hard to understand, and with people making mistakes and not understanding what’s going on.

> Maybe it should cause 100s of failures if it's a breaking change (a bug). That's a pretty good sign you really did screw something up.

It's a syntax/interface/some-other-not-fundamental bug. A real bug would already have been spotted when checking the test-set performance.

> No it's not. If new code uses a battle tested core, I don't have to review those parts as thoroughly. If it's copy pasted, I still have to review it and make sure they didn't copy an old version with bugs or slightly modified it and broke something. Sounds like this is common as many people have complained about dozens of bugs!

The way code is shown to be correct is by getting SoTA results. If it does that, it is "battle tested". If it didn't do that, no one would even think of merging it in the first place.

> Yep, you've identified a place where you shouldn't try to fit every idea under a single "Attention" class. That's just common sense programming, not an argument against writing good shared functions or classes.

It is an argument against having shared classes. At the same time, sure, you can have some shared code, and Huggingface does that.

> It can sometimes. But not always. Having one massive file named main.py is not more readable than a well split program. This seems like basic common sense to me, but here's an actual paper on the subject:

There is an important distinction that you're ignoring here. Having semantically separate objects in one file is indeed confusing. But if you put everything related to the model in one file, that simplifies things and reduces the working memory people need to read your code.

> Then why does the Bert module have changes as recent as this week with changes from dozens of authors going back years?

The recent change to Bert is some inference interface code which has to be kept common across all models. That’s their decision; I wouldn’t even do that, just make kwargs mandatory imo.

> Maybe you should check your assumptions before you make a fundamental decision (you know, basic engineering). There's plenty of forked libraries that are not modified and are forked for archival purposes. Nor should you cater to a small minority if most people aren't doing this.

Everyone in deep learning likes to gamble on making some tweaks to the model, hoping they’ll get the next ICLR oral. Why else would they care about modifying the model code?

--

I suggest you go read some modeling code from different frameworks; one example is fairseq. I like fairseq, I think it's well done considering its aims and constraints. But you're crazy if you think it's easier to understand and modify the code for some specific model there than in huggingface. Here's the link to fairseq's roberta: you'll need to look at a dozen files to see what's happening. In contrast, huggingface is one file.

Spent too much time on this already, not gonna reply anymore.

0

drinkingsomuchcoffee OP t1_j8v3y80 wrote

>You cant just copy paste a file if it’s centralized, you’ll have to copy paste multiple, and the main issue is it’s gonna take a while to understand which ones (and you'll have to modify the imports etc., unless you copy the entire repo! are you seriously suggesting that lmao)

Yep, apparently they themselves claim to do this for every module. Thank you for pointing out how crazy this is and proving my point.

>Your definition of hackable is almost it. What’s missing is that being decentralized makes things much, much easier to understand because the code is very straightforward and doesn’t have to take 10 different things into account.

Oh really? Those files depend on pytorch functions and on numpy too. Should they copy those entire libraries into the file to be more "hackable"? Lmao

3

drinkingsomuchcoffee OP t1_j8tw3yt wrote

There are so many contradictions and fallacies in that blog post, I don't even know where to begin. I think I'll let empirical evidence do the talking for me: the many people agreeing with my post.

3

baffo32 t1_j8vsq9s wrote

looks like there is emotional or funded influence here: counterintuitive votes, strange statements stated as facts

Duplicated code makes a very, very _unhackable project_, because one has to learn the code-duplicating systems and add functionality to them for every factorization. It does make _hackable examples_, but the codebase doesn’t seem to understand where to draw the line at all.

The library looks like it was made entirely without an experienced lead software engineer. As a corporation, they should have one.


HuggingFace, please understand that software developers find DRY to be hackable; the two terms usually go together. Stating it the other way around reads like a contradiction, like fake news trying to manipulate people by ignoring facts.

4

drinkingsomuchcoffee OP t1_j8zael9 wrote

I am the "bad guy" of the thread, so anything I say will be seen negatively, even if it's correct. This is typical human behavior, unfortunately.

I have a feeling most people here do not understand DRY done well, and are used to confusing inheritance hierarchies and incredibly deep function chains. Essentially they have conflated DRY with bad code, simple as that.

2

baffo32 t1_j8zd3ge wrote

You’re not the bad guy. I’m guessing it’s a community of data workers who’ve never had a reason to value DRY.

2

Shinsekai21 t1_j8wrcvo wrote

>HuggingFace, FastAI and similar frameworks are designed to lower the barrier to ML, such that any person with programming skills can harness the power of SoTA ML progress.

I started out with FastAI and am now learning PyTorch. I agree.

I'm more of a top-down student (learn the practical stuff first, then the fundamentals). FastAI does a great job of showing me what is possible and interesting with their lectures.

I moved to PyTorch because I wanted to understand more about what's underneath FastAI. I'm currently doing the ZeroToMastery PyTorch course and have found that the knowledge I gained from FastAI helps a lot.

−1