
STEM: Scaling Transformers with Embedding Modules

Source: arXiv
Original Author: Ranajoy Sadhukhan et al.


STEM (Scaling Transformers with Embedding Modules) offers a novel approach to fine-grained sparsity in neural networks by replacing the feed-forward up-projection with a static, token-indexed embedding lookup. Because the lookup is indexed by token identity rather than computed, it reduces runtime cost, improves training stability, and lends itself to efficient CPU offloading. STEM achieves accuracy improvements of up to 4% on knowledge-intensive tasks while cutting parameter accesses and per-token FLOPs by about one-third. Its architecture also enables better interpretability and knowledge editing without extra computation, with benefits that hold in long-context scenarios and across diverse model scales.

STEM: A New Approach to Scaling Transformers

A novel method called STEM (Scaling Transformers with Embedding Modules) has been introduced to enhance the performance and efficiency of Transformer models. STEM replaces the feed-forward network (FFN) up-projection with a static, layer-local embedding lookup indexed by token identity, which improves both stability and efficiency during training.
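To make the idea concrete, here is a minimal NumPy sketch contrasting a standard dense FFN with a STEM-style layer in which the up-projection is a table lookup. This is an illustrative simplification, not the paper's implementation: the table shape, activation, and the assumption that the lookup depends only on the token id are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab = 16, 64, 100  # toy sizes, for illustration only

# Dense baseline FFN: up-project, activate, down-project.
W_up = rng.standard_normal((d_ff, d_model)) * 0.02
W_down = rng.standard_normal((d_model, d_ff)) * 0.02

def ffn_dense(x):
    # x: (d_model,) hidden state for one token; costs a d_ff x d_model matmul.
    return W_down @ np.maximum(W_up @ x, 0.0)  # ReLU activation

# STEM-style layer (sketch): the up-projection output is replaced by a
# static, layer-local embedding table indexed by the token id, so the
# up-projection matmul disappears entirely.
E = rng.standard_normal((vocab, d_ff)) * 0.02  # token-indexed embeddings

def ffn_stem(token_id):
    # In this minimal sketch the "up-projection" depends only on the token
    # identity: a single row lookup instead of a matrix multiply.
    u = E[token_id]
    return W_down @ np.maximum(u, 0.0)

x = rng.standard_normal(d_model)
print(ffn_dense(x).shape, ffn_stem(42).shape)  # both outputs have shape (d_model,)
```

Because the lookup table is static per layer, it can live in slower memory (e.g. CPU RAM) and be fetched per token, which is what makes the CPU offloading described above plausible.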

Empirical Performance and Improvements

Empirical results indicate that STEM trains stably even under extreme sparsity and outperforms dense baselines on downstream tasks. By eliminating approximately one-third of FFN parameters, it substantially reduces per-token FLOPs and parameter accesses, making more efficient use of compute and memory bandwidth.
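The "one-third" figure is easy to sanity-check if one assumes a gated FFN (e.g. a SwiGLU-style block) with three equal-sized weight matrices, of which the up-projection is removed; the sizes below are illustrative assumptions, not from the paper.

```python
# Hypothetical gated FFN with three equal-sized matrices: gate, up, down.
# Replacing the up-projection with a lookup drops one of the three.
d_model, d_ff = 4096, 11008  # assumed sizes, for illustration only

gated_ffn = 3 * d_model * d_ff  # W_gate + W_up + W_down
stem_ffn = 2 * d_model * d_ff   # W_gate + W_down (up replaced by lookup)

saving = 1 - stem_ffn / gated_ffn
print(f"FFN parameters removed: {saving:.0%}")
```

Under that assumption the saving is exactly one in three matrices, i.e. roughly 33% of FFN parameters, consistent with the reduction reported above.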

Accuracy Gains Across Model Scales

Across 350M- and 1B-parameter models, STEM achieves overall accuracy improvements of 3% to 4%, with particularly notable gains on knowledge- and reasoning-intensive benchmarks, including ARC-Challenge, OpenBookQA, GSM8K, and MMLU.

Related Topics:

STEM, Scaling Transformers, fine-grained sparsity, knowledge capacity, interpretability

📰 Original Source: https://arxiv.org/abs/2601.10639v1

All rights and credit belong to the original publisher.
