
DoRA: Weight-Decomposed Low-Rank Adaptation

Liu, S. Y., Yen, C. Y., Nag, N., ... & Cheng, K. T. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. arXiv preprint arXiv:2402.09353.


The efficiency of fine-tuning large language models has long been governed by the trade-off between parameter count and expressive power. Low-Rank Adaptation (LoRA) lowered the computational barrier by confining weight updates to a low-dimensional subspace, yet a performance gap persisted between these low-rank updates and full-parameter fine-tuning. Analysis of the two methods' update patterns revealed that LoRA is limited by a rigid coupling between magnitude and direction: any significant change in the orientation of a weight vector is almost always accompanied by a proportional change in its magnitude, a constraint that does not exist in full-parameter optimization. This lack of flexibility prevents the model from making nuanced adjustments, such as precise directional shifts that require minimal changes in scale.
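As a concrete illustration, here is a minimal pure-Python sketch of the LoRA update (the helper names are hypothetical, and real implementations use GPU tensor libraries): the frozen weight W is never modified; only the low-rank factors A and B are trained, and the effective weight is W + alpha * (B @ A).

```python
def matmul(X, Y):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, A, B, alpha=1.0):
    """Effective LoRA weight W' = W + alpha * (B @ A).

    W (d_out x d_in) stays frozen; only the rank-r factors
    B (d_out x r) and A (r x d_in) are trained, so the update costs
    r * (d_out + d_in) parameters instead of d_out * d_in.
    """
    BA = matmul(B, A)
    return [[w + alpha * u for w, u in zip(rw, ru)]
            for rw, ru in zip(W, BA)]

# Rank-1 example: a 2x2 frozen weight, factors B (2x1) and A (1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
print(lora_weight(W, A, B))  # → [[4.0, 4.0], [6.0, 9.0]]
```

Note that in this toy the scalar product B @ A couples scale and orientation: scaling the learned factors changes both at once, which is exactly the rigidity DoRA targets.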

Weight Decomposition and Decoupling


Figure: Magnitude and direction updates of (a) full fine-tuning (FT), (b) LoRA, and (c) DoRA across different layers.

Weight-Decomposed Low-Rank Adaptation (DoRA) addresses this coupling by reparameterizing the pre-trained weight matrix as $W = m \frac{V}{\lVert V \rVert_c}$, with a magnitude vector $m$ and a directional matrix $V$, where $\lVert \cdot \rVert_c$ denotes the column-wise vector norm. Fine-tuning then produces the merged weight $W' = m \frac{V + \Delta V}{\lVert V + \Delta V \rVert_c}$, where $\Delta V = BA$ is the standard LoRA update. By applying the LoRA update only to the directional component while the magnitude is learned independently, the model can navigate the optimization landscape with a level of granularity previously reserved for full fine-tuning. This mathematical decoupling also conditions the gradient more effectively, as directional shifts no longer force an unintended growth in weight magnitude. The finding suggests that the bottleneck in low-rank adaptation was not the number of parameters being updated, but the structural constraints on how those parameters were allowed to move through the function space.
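The decomposition can be sketched in pure Python as follows (a minimal illustration with hypothetical helper names, not the paper's implementation): the column-wise normalization strips any scale introduced by the LoRA product, so the trainable vector m alone controls each column's length.

```python
import math

def col_norms(W):
    """Column-wise L2 norms of a matrix stored as a list of rows."""
    return [math.sqrt(sum(row[j] ** 2 for row in W))
            for j in range(len(W[0]))]

def dora_weight(W0, A, B, m):
    """Merged DoRA weight W' = m * (W0 + B @ A) / ||W0 + B @ A||_col.

    The LoRA product B @ A can only steer the direction: the
    column-wise normalization removes whatever scale it introduces,
    and the magnitude vector m sets each column's length independently.
    """
    BA = [[sum(b * a for b, a in zip(rb, ca)) for ca in zip(*A)]
          for rb in B]
    V = [[w + u for w, u in zip(rw, ru)] for rw, ru in zip(W0, BA)]
    norms = col_norms(V)
    return [[m[j] * v / norms[j] for j, v in enumerate(row)] for row in V]

# At initialization B @ A = 0 and m holds the column norms of W0,
# so the merged weight reproduces the pre-trained W0 exactly.
W0 = [[3.0, 0.0], [0.0, 4.0]]
A, B = [[0.0, 0.0]], [[0.0], [0.0]]
m = col_norms(W0)
print(dora_weight(W0, A, B, m))  # → [[3.0, 0.0], [0.0, 4.0]]
```

Because the normalized product can be merged back into a single dense matrix after training, the decomposition adds no extra multiplications at inference time, which is why the paper reports no added latency.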

Learning Stability and Performance

Empirical analysis of DoRA's learning trajectories shows a striking similarity to full fine-tuning, characterized by a complex and often negative correlation between magnitude and direction updates. This behavioral alignment enables the model to achieve superior learning stability and accuracy across diverse reasoning and visual instruction tasks without introducing any additional latency during the inference phase. The results suggest that the intelligence of an update is a function of its ability to independently modify the structural properties of the network. This raises an open question about whether other fundamental components of neural architectures, such as normalization layers or attention heads, can be further decomposed to reveal similar hidden dimensions of optimization efficiency.
