Submitted by MichelMED10 t3_ysah21 in MachineLearning
Hey,
In timm's implementation of stochastic depth (https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/layers/drop.py), the tensor is scaled by the probability of keeping the block. I don't understand why this is done, especially since it isn't mentioned in the paper.
Can anyone explain this to me, please?
Thanks!
The code:
def drop_path(x, drop_prob: float = 0., training: bool = False, scale_by_keep: bool = True):
    # Identity at inference time or when nothing is dropped (early return as in the linked file).
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    # One Bernoulli draw per example in the batch, broadcast over all remaining dims.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = x.new_empty(shape).bernoulli_(keep_prob)
    # Rescale the surviving paths by 1/keep_prob.
    if keep_prob > 0.0 and scale_by_keep:
        random_tensor.div_(keep_prob)
    return x * random_tensor
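For context, the scaling looks like the same trick as inverted dropout: dividing the kept samples by keep_prob keeps the expected magnitude of the output equal to the input, so nothing has to be rescaled at inference. A minimal sketch to check this numerically (assumes PyTorch and the drop_path above; the numbers are illustrative):

import torch

# With scale_by_keep=True the sample mean of the output stays near the input
# mean, since E[mask / keep_prob] = 1.
torch.manual_seed(0)
x = torch.ones(100000, 4)
out = drop_path(x, drop_prob=0.2, training=True, scale_by_keep=True)
print(out.mean().item())   # ~1.0: kept rows are scaled by 1/0.8 = 1.25

# Without the rescale the mean shrinks by keep_prob.
out = drop_path(x, drop_prob=0.2, training=True, scale_by_keep=False)
print(out.mean().item())   # ~0.8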
killver t1_ivzoqe1 wrote
Why don't you ask in his repo?