Assignment 1 · Text

Text track

RNN vs Transformer

Completed · Background · EDA · Metrics Finalized

Owner

Nguyen Quoc Hieu

Models

LSTM, DistilBERT

Sections

5 report blocks

Repository

GitHub Repo

Streamlit

Streamlit App

RNN vs Transformer report

This section explores the performance and architectural differences between sequential and attention-based deep learning models. Specifically, we compare an LSTM (Long Short-Term Memory) network, a type of RNN designed to mitigate the vanishing gradient problem and capture long-term dependencies in sequential data, against DistilBERT, a smaller, faster, and distilled version of the BERT Transformer that leverages bidirectional self-attention to understand context and relationships within text.

Dataset

20_newsgroups

18,846 docs, 20 classes

Bidirectional attention-based LSTM result

69.72%

Accuracy

Fine-tuned BERT best result

79.86%

Accuracy

Background Knowledge

Sequential Model

LSTM (Long Short-Term Memory)

❏ Architecture of a traditional RNN
Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states[1].

Traditional RNN

Image 1: Architecture of a traditional RNN

For each timestep \( t \), the activation \( a^{\langle t \rangle} \) and the output \( y^{\langle t \rangle} \) are expressed as follows:

\( a^{\langle t \rangle} = g_1( W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a ) \) and \( y^{\langle t \rangle} = g_2( W_{ya} a^{\langle t \rangle} + b_y ) \)
where \( W_{ax}, W_{aa}, W_{ya}, b_{a}, b_{y} \) are coefficients shared across time steps and \( g_{1}, g_{2} \) are activation functions.
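
As a concrete illustration, the two equations above can be sketched in NumPy. This is a minimal sketch assuming the common choices \( g_1 = \tanh \) and \( g_2 = \operatorname{softmax} \); it is not the report's implementation:

```python
import numpy as np

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One RNN timestep: compute a<t> from a<t-1> and x<t>, then the output y<t>."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)   # g1 = tanh
    z = W_ya @ a_t + b_y
    z = z - z.max()                                   # shift logits for numerical stability
    y_t = np.exp(z) / np.exp(z).sum()                 # g2 = softmax
    return a_t, y_t

# Example: hidden size 4, input size 3, output size 2 (all sizes illustrative)
rng = np.random.default_rng(0)
W_aa, W_ax, W_ya = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
a, y = rnn_step(np.zeros(4), rng.normal(size=3), W_aa, W_ax, W_ya, np.zeros(4), np.zeros(2))
```

The same weight matrices are reused at every timestep, which is exactly the temporal weight sharing the text describes.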
❏ Loss function
In the case of a recurrent neural network, the loss function \( \mathcal{L} \) over all time steps is defined from the loss at every time step[2]:
\( \displaystyle \mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_y} \mathcal{L}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) \)
❏ Backpropagation through time
Backpropagation is done at each point in time. At timestep \( T \), the derivative of the loss \( \mathcal{L} \) with respect to the weight matrix \( W \) is expressed as[2]:
\( \displaystyle \frac{\partial \mathcal{L}^{(T)}}{\partial W} = \sum_{t=1}^{T} \left. \frac{\partial \mathcal{L}^{(T)}}{\partial W} \right|_{(t)} \)
❏ Types of gates
To remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted \( \Gamma \) and are equal to[1]:
\( \Gamma = \sigma(W x^{\langle t \rangle} + U a^{\langle t-1 \rangle} + b) \)
where \( W, U, b \) are coefficients specific to the gate and \( \sigma \) is the sigmoid function. The main ones are summed up in the table below:
Type of gate Role Value range
Update gate \( \Gamma_u \) How much past should matter now? 0 (ignore) → 1 (update fully)
Relevance gate \( \Gamma_r \) Drop previous information? 0 (drop) → 1 (keep)
Forget gate \( \Gamma_f \) Erase a cell or not? 0 (erase) → 1 (remember)
Output gate \( \Gamma_o \) How much to reveal of a cell? 0 (hide) → 1 (reveal)
❏ LSTM
Long Short-Term Memory (LSTM) units deal with the vanishing gradient problem encountered by traditional RNNs[3].
LSTM Architecture

Image 2: Architecture of an LSTM memory cell

The multiplicative nodes in an LSTM memory cell act as gates that control the flow of information through the network, letting it decide how much of each signal from the input data should pass through. These multiplicative interactions occur at three main points: an LSTM cell contains three gates and one cell state.

The forget gate multiplies the previous cell state by another value between 0 and 1, deciding how much information from the previous time step to retain or discard.
\( f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \)
The input gate multiplies the candidate cell state by a value produced by a sigmoid function as the activation function, between 0 and 1, determining how much new information from the current input to add to the memory.
\( i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \)
The input gate then generates a new candidate cell state:
\( \tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \)
The cell state \( C_t \) is updated by the long-term memory:
\( C_t = f_t \star C_{t-1} + i_t \star \tilde{C}_t \)
Finally, the output gate produces a gating value \( o_t \); multiplying the updated cell state (after passing through a tanh) by \( o_t \) determines what part of the internal state becomes visible as the new hidden state \( h_t \):
\( o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \) and \( h_t = o_t \star \tanh(C_t) \)
Below is a table summing up the characterizing equations of the architecture:
Characterization Definition Equation (LSTM)
\( \tilde{C}_t \) Candidate cell state \( \tanh(W_c [h_{t-1}, x_t] + b_c) \)
\( C_t \) Updated cell state (Long-term memory) \( f_t \star C_{t-1} + i_t \star \tilde{C}_t \)
\( h_t \) Hidden state (Output/Short-term memory) \( o_t \star \tanh(C_t) \)
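
The gate and state equations above can be collected into a single NumPy time-step function. This is an illustrative sketch, not the report's model; the concatenated weight shapes are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep; W[k], b[k] are the parameters of gate k in {f, i, c, o}."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])       # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde           # updated cell state (long-term memory)
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate
    h_t = o_t * np.tanh(c_t)                     # hidden state (short-term memory)
    return h_t, c_t

# Example: hidden size 4, input size 3 (sizes illustrative)
rng = np.random.default_rng(0)
H, D = 4, 3
W = {k: rng.normal(size=(H, H + D)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```

Each line corresponds one-to-one with a row of the table above, which makes it easy to check the gating logic against the equations.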

Attention Model

DistilBERT

❏ Knowledge distillation
Knowledge distillation is a compression technique in which a compact model (the student) is trained to reproduce the behaviour of a larger model (the teacher) or an ensemble of models[4].
❏ Training loss
The student is trained with a distillation loss over the soft target probabilities of the teacher:
\( \displaystyle L_{ce} = \sum_{i} t_i \cdot \log(s_i) \)
where \( t_i \) (resp. \( s_i \)) is a probability estimated by the teacher (resp. the student). This objective results in a rich training signal by leveraging the full teacher distribution, using a softmax-temperature:
\( \displaystyle p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)} \)
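
A minimal NumPy sketch of these two formulas, written with the conventional negative sign on the cross-entropy; the temperature T is a free hyperparameter:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax: a larger T flattens the distribution."""
    z = logits / T
    z = z - z.max()                       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between the teacher's soft targets t_i and student probs s_i."""
    t = softmax_T(teacher_logits, T)
    s = softmax_T(student_logits, T)
    return -np.sum(t * np.log(s))
```

By Gibbs' inequality this loss is minimized exactly when the student's distribution matches the teacher's, which is why the full teacher distribution gives a richer training signal than hard labels alone.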

❏ Student architecture
The student, DistilBERT, has the same general architecture as BERT. The token-type embeddings and the pooler are removed, while the number of layers is halved. Most of the operations used in the Transformer architecture (linear layers and layer normalisation) are highly optimized in modern linear algebra frameworks, and the DistilBERT authors found that variations in the last dimension of the tensor (the hidden size) have a smaller impact on computation efficiency (for a fixed parameter budget) than variations in other factors such as the number of layers[4].

DistilBERT Architecture

Image 3: DistilBERT architecture overview

References

Background Knowledge Sources

  • [1] Shervine Amidi. Recurrent Neural Networks cheatsheet. CS 230 Deep Learning, Stanford University. Link
  • [2] Schmidt, R. M. (2019). Recurrent Neural Networks (RNNs): A gentle Introduction and Overview. arXiv preprint arXiv:1912.05911. Link
  • [3] Sherstinsky, Alex (2020). Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network. Physica D: Nonlinear Phenomena, 404, 132306. Elsevier BV. Link
  • [4] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2020). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Link

1. Problem Statement - Exploratory Data Analysis

Problem statement

Text Classification

Documents such as papers, newsletters, and books span many genres, and manually categorizing them is slow and impractical at scale. We therefore evaluate an RNN-based LSTM and a Transformer-based DistilBERT on the task of classifying each document's category from its natural-language content, and compare their accuracy.

  • RNN baseline: LSTM
  • Transformer model: DistilBERT

Dataset summary

Dataset information

  • Dataset Name: 20_newsgroups
  • Number of classes (corpus): 20
  • Total documents: 18,846
    • Training samples: 11,314
    • Test samples: 7,532
  • Documents per corpus:
    • alt.atheism: 799
    • comp.graphics: 973
    • comp.os.ms-windows.misc: 985
    • comp.sys.ibm.pc.hardware: 982
    • ...

Corpus Statistics

Statistic Value / Info
Total words 3,423,145
Total characters 22,043,554
Min docs/class 628 (talk.religion.misc)
Max docs/class 999 (rec.sport.hockey)
Mean docs/class 942.3
Std 97.0

Per-document Word Count

Metric Value
Mean 181.6
Median 83.0
Std 501.3
Min 0
Max 11,765
Q25 40
Q75 166

Word Analysis

Word Frequency

Word Frequency

Word Analysis

Word Correlation / Corpus

Word Correlation

2. Dataset, DataLoader, and augmentation setup

1. Loading & EDA

Dataset overview

  • Loaded from HuggingFace: SetFit/20_newsgroups
  • 18,846 total docs, 20 classes
  • Train: 11,314 | Test: 7,532
  • Class imbalance: 628 (min) to 999 (max) docs per class

2. Text Preprocessing

Cleaning and Truncation

  • Text cleaning: removed quoted text, email headers, and signatures
  • Short doc removal: dropped < 50 chars (~544 docs, 2.9%)
  • Truncation: max 400 words (aligned with BERT 512 token limit)
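
A simplified sketch of the short-document filter and truncation steps (the report's full cleaning also strips quoted text, email headers, and signatures, which is omitted here; the thresholds follow the bullets above):

```python
def filter_and_truncate(text, max_words=400, min_chars=50):
    """Drop documents shorter than min_chars; keep at most max_words words."""
    if len(text) < min_chars:
        return None                               # short-doc removal
    return " ".join(text.split()[:max_words])     # truncation to 400 words

doc = "word " * 1000
kept = filter_and_truncate(doc)
```

Truncating at ~400 words keeps most documents under BERT's 512 wordpiece-token limit, since one word usually maps to one or a few subword tokens.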

3. Train/Val/Test Split

Splitting strategy

Split Samples Note
Train 8,584 Stratified
Validation 2,146 Stratified
Test 7,112 Stratified
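
Stratified splitting keeps each class's share of documents equal across splits; a pure-Python sketch of the idea (the actual split was presumably done with a library helper such as scikit-learn's train_test_split with stratify):

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.2, seed=42):
    """Return (train_idx, val_idx), sampling val_frac of each class for validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = int(round(len(idxs) * val_frac))      # per-class validation quota
        val_idx.extend(idxs[:k])
        train_idx.extend(idxs[k:])
    return train_idx, val_idx

labels = [0] * 100 + [1] * 300                    # illustrative imbalanced labels
train_idx, val_idx = stratified_split(labels)
```

With val_frac=0.2 this matches the report's 8,584/2,146 train/validation proportions while preserving the class imbalance in both splits.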

4. Tokenization

Tokenizer details

  • Tokenizer: distilbert-base-uncased
  • Max length per sentence: 512 tokens
  • Dynamic padding via DataCollatorWithPadding
  • ~90% of docs fit under 512 tokens

5. Class Balancing

Weighted loss

  • Computed class weights: 1.0 / class_counts
  • Normalized weights to sum to num_classes
  • Applied via CrossEntropyLoss(weight=class_weights)
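
The three bullets above amount to a few lines; a sketch with hypothetical per-class counts (the real counts come from the 20_newsgroups training split):

```python
import numpy as np

class_counts = np.array([628, 999, 850, 942])          # hypothetical counts, 4 classes
weights = 1.0 / class_counts                           # inverse-frequency weights
weights = weights * len(class_counts) / weights.sum()  # normalize: sum == num_classes
```

The resulting array can then be passed as the weight argument of torch.nn.CrossEntropyLoss so that rarer classes contribute proportionally more to the loss.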

6. DataLoader

Batch setup

  • Batch size: 12 (train), 36 (val/test)
  • Dynamic padding per batch
  • No .set_format('torch') on the dataset; the DataCollator handles tensor conversion
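
Dynamic padding pads each batch only to its own longest sequence rather than to the global 512-token maximum; a pure-Python sketch of what DataCollatorWithPadding does for input_ids and attention_mask (token ids here are illustrative):

```python
def collate_with_dynamic_padding(batch, pad_id=0):
    """Pad token-id sequences to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

out = collate_with_dynamic_padding([[101, 7592, 102], [101, 102]])
```

Since ~90% of documents are well under 512 tokens, per-batch padding avoids wasting compute on padding positions.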

3. Model building, training, evaluation, and comparison

RNN pipeline

BiLSTM text classification architecture


Transformer pipeline

DistilBERT text classification architecture


RNN model

BiLSTM with Attention

Architecture

  • Embedding: 300d GloVe (43.3% vocab coverage) | TF-IDF weighted
  • Backbone: 2-layer BiLSTM (128 units/dir)
  • Attention: Self-attention over all LSTM timesteps
  • Dropout: 0.3 | Total Params: 15.8M
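
The self-attention step can be sketched as a learned weighted average over the BiLSTM's timestep outputs. This is a minimal additive-attention sketch; the report does not spell out its exact attention form, so the scoring function here is an assumption:

```python
import numpy as np

def attention_pool(states, w):
    """states: (T, d) BiLSTM outputs; w: (d,) learned scoring vector.
    Score each timestep, softmax over time, return the weighted sum."""
    scores = np.tanh(states) @ w               # (T,) unnormalized timestep scores
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                # attention weights, sum to 1
    return alpha @ states                      # (d,) context vector for the classifier

states = np.ones((5, 8))                       # 5 timesteps with identical outputs
context = attention_pool(states, np.zeros(8))
```

The context vector replaces the usual "last hidden state" summary, letting the classifier weight informative timesteps anywhere in the sequence.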

Training

  • Optimizer: Adam (1e-3) | Batch size: 16
  • Scheduler: Cosine with warmup | Epochs: 20
  • Loss: Weighted CrossEntropy | Seq length: 256

Evaluation

  • Evaluation Metrics: Accuracy, F1 Macro, F1 Weighted, and a per-class classification_report.

Comparison

  • Systematic evaluation across Val Acc, Test Acc, F1 Macro, Train Time (s), Inference (ms), and Params.

Transformer model

DistilBERT

Model building

  • Pretrained checkpoint: distilbert-base-uncased
  • Classification head: pre_classifier (768×768) → ReLU → classifier (768×20)
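
The head maps DistilBERT's 768-dimensional [CLS] representation to 20 class logits; a NumPy sketch of the forward pass (random weights for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_classes = 768, 20
W1, b1 = rng.normal(0, 0.02, size=(hidden, hidden)), np.zeros(hidden)        # pre_classifier
W2, b2 = rng.normal(0, 0.02, size=(n_classes, hidden)), np.zeros(n_classes)  # classifier

def head_forward(cls_embedding):
    """[CLS] embedding -> pre_classifier (768x768) -> ReLU -> classifier (768x20)."""
    h = np.maximum(0.0, W1 @ cls_embedding + b1)
    return W2 @ h + b2

logits = head_forward(rng.normal(size=hidden))
```

In the real model these logits feed the weighted cross-entropy loss described in the data-preparation section.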

Training

  • Fine-tuning epochs: 15 (Phase 1: only head unfrozen, Phase 2: all layers and head unfrozen)
  • Optimizer: AdamW
  • Learning rate: layerwise differential learning rates (detailed in Extension 1)
  • Scheduler: Linear Warmup and Cosine Annealing
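
The scheduler's shape can be sketched in a few lines. This is a generic linear-warmup/cosine-annealing curve; the report does not state the warmup length, so warmup_steps here is illustrative:

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warmup avoids large early updates to the pretrained weights; the cosine tail shrinks the step size as training converges.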

Evaluation

  • Evaluation Metrics: Accuracy, F1 Macro, F1 Weighted, and a per-class classification_report.

Comparison

  • Comprehensive evaluation across Val Acc, Test Acc, F1 Macro, Train Time (s), Inference (ms), and Params.

4. Experimental results, figures, analysis, and discussion

Evaluation

DistilBERT Performance

After training, the model was evaluated on the held-out test set (7,112 samples) using multiple metrics to provide a comprehensive view of classification performance.

Test Results
Metric Value
Accuracy 73.88%
F1 Macro 0.7268
F1 Weighted 0.7400
Precision 0.7299
Recall 0.7258
Best class rec.sport.hockey (F1: 0.93)
Worst class talk.religion.misc (F1: 0.32)
DistilBERT Confusion Matrix

Image 4: DistilBERT Confusion Matrix

DistilBERT F1 Scores Bar Chart

Image 5: DistilBERT F1 Scores Bar Chart

RNN Performance

BiLSTM with Attention

The BiLSTM model was evaluated after 20 epochs, reaching its peak validation accuracy of 69.72% at epoch 18.

Final Test Results
Metric Value
Accuracy 62.90%
F1 Macro 0.6179
F1 Weighted 0.6311
Precision 0.6251
Recall 0.6155
Best class rec.sport.hockey (F1: 0.89)
Worst class talk.religion.misc (F1: 0.19)
BiLSTM+Attention Confusion Matrix

Image 6: BiLSTM with Attention Confusion Matrix

BiLSTM+Attention F1 Scores Bar Chart

Image 7: BiLSTM with Attention F1 Scores Bar Chart

Discussion

Key Insights and Discussion

Model Behavior and Overfitting

  • The gap between validation accuracy and test accuracy (e.g., DistilBERT 78.89% vs 73.88%) suggests mild overfitting. This is partly attributed to the relatively small training set (~8,500 samples) distributed across 20 diverse classes.
  • Class weighting in the loss function helped improve recall for underrepresented categories like talk.religion.misc and talk.politics.misc, though they still remain the most challenging for both architectures.

Semantic Analysis

  • Classes with clear topical boundaries (e.g., rec.sport.hockey, sci.space) consistently achieved higher F1 scores (~0.80+ for DistilBERT, ~0.70+ for BiLSTM) due to unique technical vocabularies.
  • Semantically overlapping classes (e.g., politics vs guns, or different computer hardware categories) showed higher confusion rates, which is expected given their shared terminology.

Error Patterns (Confusion Matrix Analysis)

Specific misclassifications reveal systemic issues in text understanding:

  • Religion: talk.religion.misc is frequently confused with alt.atheism and soc.religion.christian due to high lexical overlap in philosophical discussions.
  • Politics: talk.politics.misc is often misidentified as guns or mideast subcategories, as general political discussions often touch on these specific themes.
  • Technical brevity: Short documents in technical categories like comp.graphics are sometimes confused with mac.hardware when they lack specific keywords but share general technical terms.

Comparison

Systematic Comparison

A systematic comparison will be conducted across three dimensions:

  1. Architecture comparison: DistilBERT (67M params) vs LSTM — comparing transformer-based models against a recurrent baseline to evaluate the impact of self-attention and pretrained representations on text classification performance.
  2. Fine-tuning strategy comparison: For each transformer model, three strategies are evaluated:
    • Freeze backbone (train classification head only)
    • Full fine-tuning (train all layers from start with layerwise LR)
    • Hybrid (train head first, then unfreeze backbone with differential LR)
  3. Model efficiency comparison: Accuracy vs F1 Macro, Training time, Inference latency and Number of Parameters — to determine whether the larger BERT-base model justifies its additional computational cost over DistilBERT.

Results will be reported in the following format after all experiments are completed:

Model Val Acc Test Acc F1 Macro Train Time (s) Inference (ms) Params
DistilBERT (Hybrid) 78.89% 73.88% 0.7268 735s 44.1ms 67.0M
BiLSTM + Attention 69.72% 62.90% 0.6179 85s - 15.8M

Key Findings: Transformer vs RNN

The ~11% accuracy gap between the BiLSTM (62.90%) and DistilBERT (73.88%) demonstrates that:

  • Contextualized representations are essential: Static GloVe embeddings cannot capture the polysemy and context-dependent meanings that DistilBERT/BERT handle via self-attention.
  • Attention benefit: While attention improved BiLSTM baseline (~54% → 63%), it cannot fully compensate for the lack of bidirectional contextual pretraining.
  • Efficiency Trade-off: The BiLSTM with Attention trains roughly 9-16x faster than the transformer models (85s vs 735-1,337s), making it a viable candidate for resource-constrained environments where an ~11% accuracy drop is acceptable.

5. Other extension reports

Extension 1

Fine-tuning Strategy Comparison

Three fine-tuning strategies were evaluated on both DistilBERT (67M params) and BERT-base (110M params):

  • Freeze backbone: Train only the classification head; backbone weights remain frozen
  • Full fine-tune: Train all layers from the start with layerwise differential learning rates
  • Hybrid: Train classification head first (5 epochs), then unfreeze backbone with differential LR (15 epochs)

Learning Rate Strategy

A layerwise differential learning rate strategy was applied to both the Hybrid and Full fine-tune approaches to prevent catastrophic forgetting and ensure stable convergence:

  • Phase 1 (Warmup): The learning rate is set to 5e-5 for the classifier layer to adapt it to the 20_newsgroups vocabulary.
  • Phase 2 (Full Tuning): The classifier remains at 5e-5. The backbone uses a progressive LR: 1e-6 for embedding layers, 2e-6 for initial encoder layers, and 5e-5 for the final layers closest to the output.
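
One way to realize such a progressive schedule is geometric interpolation from the embedding LR to the top-layer LR. This is a sketch under assumptions: the report pins only the endpoints (1e-6 for embeddings, 5e-5 for the top) and the 2e-6 value for early encoder layers, so the interpolation rule is illustrative:

```python
def layerwise_lrs(num_encoder_layers=6, lr_embed=1e-6, lr_top=5e-5):
    """LR per depth: embeddings first, then each encoder layer, increasing geometrically."""
    ratio = (lr_top / lr_embed) ** (1.0 / num_encoder_layers)
    return [lr_embed * ratio ** i for i in range(num_encoder_layers + 1)]

lrs = layerwise_lrs()   # DistilBERT has 6 encoder layers
```

Each entry would become one optimizer parameter group (e.g. with AdamW), keeping updates small near the pretrained lower layers to prevent catastrophic forgetting.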

Results

Model Strategy Val Acc Test Acc Train Time (s) Inference (ms) Params
DistilBERT Freeze backbone 67.38% 66.15% 292 44.3 66,968,852
DistilBERT Hybrid 78.89% 73.88% 735 44.1 66,968,852
DistilBERT Full fine-tune 79.33% 74.06% 644 44.4 66,968,852
BERT Freeze backbone 64.80% 62.93% 501 88.2 109,497,620
BERT Hybrid 79.86% 74.77% 1,337 88.3 109,497,620
BERT Full fine-tune 79.96% 74.17% 1,168 87.9 109,497,620
Training Loss Curves

Image 8: Training Loss Comparison

Validation Accuracy Curves

Image 9: Validation Accuracy Comparison (Transformers)

Analysis

  • The hybrid strategy achieved the best test accuracy for BERT (74.77% vs 74.17% for full fine-tuning) and was nearly tied with full fine-tuning for DistilBERT (73.88% vs 74.06%). Training the classification head first provides a stable initialization before the pretrained backbone is updated.
  • Full fine-tuning reached slightly higher validation accuracy for both models but lower test accuracy for BERT, suggesting that updating all layers simultaneously without a warm-up phase can leave the classification head marginally less well adapted.
  • Freeze backbone performed significantly worse (roughly 8-12% lower test accuracy than hybrid), confirming that the pretrained representations alone are insufficient for this 20-class task and that backbone adaptation is necessary.
  • Training time scales linearly with model complexity: hybrid takes ~15% longer than full fine-tune due to the additional Phase 1 epochs, but the accuracy gain justifies the cost.

Extension 2

Model Efficiency Comparison

A comparison of model efficiency across DistilBERT and BERT-base, evaluating accuracy vs model size and inference time.

Metric DistilBERT BERT-base Ratio
Parameters 67.0M 109.5M 1.63x
Best Test Accuracy 73.88% 74.77% +0.89%
Training Time 735s 1,337s 1.82x
Inference Time 44.1ms 88.3ms 2.0x

Analysis

  • BERT-base provides only a marginal accuracy improvement (+0.89%) over DistilBERT while requiring 1.63x more parameters, 1.82x longer training time, and 2.0x slower inference.
  • DistilBERT offers a significantly better accuracy-to-efficiency trade-off: it achieves 98.8% of BERT's accuracy at significantly lower computational cost (1.8x faster training, 2x faster inference).
  • This result aligns with the design goal of knowledge distillation — DistilBERT retains ~98% of BERT's performance while being significantly smaller and faster.

Simple Compression: DistilBERT as a Compressed Model

DistilBERT itself serves as a compressed version of BERT-base, produced through knowledge distillation during pretraining:

Aspect BERT-base DistilBERT Reduction
Encoder layers 12 6 50%
Parameters 109.5M 67.0M 39%
Inference time 88.3ms 44.1ms 50%
Test accuracy (hybrid) 74.77% 73.88% -0.89%

The experiment demonstrates that model compression via distillation is an effective strategy for reducing model size and inference latency with minimal accuracy loss. For this text classification task, the 39% parameter reduction and 50% inference speedup come at a cost of less than 1% accuracy — a favorable trade-off for most production applications.