Submitted by d0cmorris t3_10xxxpa in MachineLearning
Clearly, large-scale deep learning approaches in image classification or NLP use all sorts of regularization mechanisms, but the parameters are typically unconstrained (i.e., every weight can theoretically attain any real value). In many other machine learning domains, constrained optimization (e.g., via Projected Gradient Descent or Frank-Wolfe) plays a huge role.
I was wondering whether there are large-scale deep learning applications that rely on constrained optimization approaches. By large-scale I mean large CNNs, transformers, diffusion models, or the like. Are there settings where constrained optimization would even be the preferred approach but isn't efficient/stable enough in practice?
Happy for any paper suggestions or thoughts! Thanks!
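For concreteness, here is roughly what I have in mind, as a minimal PyTorch sketch (not from any particular paper; the `max_norm` radius is an arbitrary illustration): an ordinary SGD step followed by a Euclidean projection of each parameter tensor back onto an L2 ball.

```python
import torch

def project_l2_ball(param: torch.Tensor, max_norm: float) -> None:
    """Project a parameter tensor onto the L2 ball of radius max_norm (in place)."""
    norm = param.norm(p=2)
    if norm > max_norm:
        param.mul_(max_norm / norm)

model = torch.nn.Linear(128, 10)           # stand-in for a large model
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)

opt.zero_grad()
loss.backward()
opt.step()                                 # unconstrained gradient step
with torch.no_grad():
    for p in model.parameters():
        project_l2_ball(p, max_norm=3.0)   # projection back onto the feasible set
```

So the question is whether anything like this projection step (or a Frank-Wolfe-style linear minimization oracle) shows up in training models at the scale of modern CNNs/transformers/diffusion models.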
tdgros t1_j7vdocr wrote
With constrained optimization you usually have a feasible set for the variables you optimize. But when training an NN you optimize millions of weights that aren't directly meaningful, so in general it's not clear how you would even define a feasible set for each of them.
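To make that concrete: the projection itself is mechanically trivial, e.g. clamping every weight into a box [-c, c] after each optimizer step, but the bound c is essentially arbitrary because an individual hidden weight has no interpretation on its own. Toy sketch (the value of c is made up):

```python
import torch

model = torch.nn.Linear(128, 10)  # stand-in for whatever network is being trained

# Choosing the feasible set is the hard part: nothing about the problem tells us
# what a sensible bound on an individual hidden weight would be.
c = 0.1  # arbitrary per-weight bound

with torch.no_grad():
    for p in model.parameters():
        p.clamp_(-c, c)  # Euclidean projection onto the box [-c, c]^d: easy to do, hard to justify
```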