Microsoft Improves Transformer Stability to Successfully Scale Extremely Deep Models to 1000 Layers
Source: syncedreview.com
A Microsoft research team proposes DeepNorm, a novel normalization function that improves the stability of transformers to enable scaling that is an order of magnitude deeper (more than 1,000 layers) than previous deep transformers.
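The core idea behind DeepNorm is a modified post-layer-norm residual connection: the residual branch is up-weighted by a constant α before normalization, while the sublayer's weights are initialized with a complementary down-scaling factor β. Below is a minimal NumPy sketch of that residual rule; the function names are illustrative, and the α formula shown is the one reported in the paper for an N-layer encoder-only model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard layer normalization over the last (feature) dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def deepnorm_residual(x, sublayer_out, alpha):
    # DeepNorm residual connection: scale the skip path by alpha,
    # then apply post-layer-norm, i.e. LN(alpha * x + G(x)).
    return layer_norm(alpha * x + sublayer_out)

# For an encoder-only model with N layers, the paper sets
# alpha = (2N)^(1/4); beta scales the sublayer weight init.
N = 1000
alpha = (2 * N) ** 0.25

# Illustrative use on random activations standing in for one layer.
x = np.random.randn(4, 16)
sublayer_out = np.random.randn(4, 16)  # stand-in for attention/FFN output
y = deepnorm_residual(x, sublayer_out, alpha)
```

With α > 1, the residual path dominates each layer's update, which is what bounds the magnitude of per-layer changes and keeps training stable at extreme depth.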