Assignment 1 · Text

Text track

RNN vs Transformer

Completed · Background · EDA · Metrics Finalized

Owner

Nguyen Quoc Hieu

Models

LSTM, DistilBERT

Sections

5 report blocks

Repository

GitHub Repo

Streamlit

Streamlit App

RNN vs Transformer report

This section explores the performance and architectural differences between sequential and attention-based deep learning models. Specifically, we compare an LSTM (Long Short-Term Memory) network, a type of RNN designed to mitigate the vanishing gradient problem and capture long-term dependencies in sequential data, against DistilBERT, a smaller, faster, and distilled version of the BERT Transformer that leverages bidirectional self-attention to understand context and relationships within text.

Dataset

20_newsgroups

18,846 docs, 20 classes

Bidirectional attention-based LSTM result

69.72%

Accuracy

Fine-tuned BERT best result

79.86%

Accuracy

Background Knowledge

Sequential Model

LSTM (Long Short-Term Memory)

❏ Architecture of a traditional RNN
Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states[1].

Traditional RNN

Image 1: Architecture of a traditional RNN

For each timestep \( t \), the activation \( a^{\langle t \rangle} \) and the output \( y^{\langle t \rangle} \) are expressed as follows:

\( a^{\langle t \rangle} = g_1( W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a ) \) and \( y^{\langle t \rangle} = g_2( W_{ya} a^{\langle t \rangle} + b_y ) \)
where \( W_{ax}, W_{aa}, W_{ya}, b_{a}, b_{y} \) are coefficients shared across time steps and \( g_{1}, g_{2} \) are activation functions.
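
As a concrete illustration, the two equations above can be sketched in NumPy. This is a minimal sketch assuming the common choices \( g_1 = \tanh \) and \( g_2 = \operatorname{softmax} \); it is not the report's implementation:

```python
import numpy as np

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One RNN timestep: compute a<t> from a<t-1> and x<t>, then the output y<t>."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)   # g1 = tanh
    z = W_ya @ a_t + b_y
    z = z - z.max()                                   # shift logits for numerical stability
    y_t = np.exp(z) / np.exp(z).sum()                 # g2 = softmax
    return a_t, y_t

# Example: hidden size 4, input size 3, output size 2 (all sizes illustrative)
rng = np.random.default_rng(0)
W_aa, W_ax, W_ya = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
a, y = rnn_step(np.zeros(4), rng.normal(size=3), W_aa, W_ax, W_ya, np.zeros(4), np.zeros(2))
```

The same weight matrices are reused at every timestep, which is exactly the temporal weight sharing the text describes.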
❏ Loss function
In the case of a recurrent neural network, the loss function \( \mathcal{L} \) over all time steps is defined from the loss at every time step[2]:
\( \displaystyle \mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_y} \mathcal{L}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) \)
❏ Backpropagation through time
Backpropagation is done at each point in time. At timestep \( T \), the derivative of the loss \( \mathcal{L} \) with respect to the weight matrix \( W \) is expressed as[2]:
\( \displaystyle \frac{\partial \mathcal{L}^{(T)}}{\partial W} = \sum_{t=1}^{T} \left. \frac{\partial \mathcal{L}^{(T)}}{\partial W} \right|_{(t)} \)
❏ Types of gates
To remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted \( \Gamma \) and are equal to[1]:
\( \Gamma = \sigma(W x^{\langle t \rangle} + U a^{\langle t-1 \rangle} + b) \)
where \( W, U, b \) are coefficients specific to the gate and \( \sigma \) is the sigmoid function. The main ones are summed up in the table below:
Type of gate Role Value range
Update gate \( \Gamma_u \) How much past should matter now? 0 (ignore) → 1 (update fully)
Relevance gate \( \Gamma_r \) Drop previous information? 0 (drop) → 1 (keep)
Forget gate \( \Gamma_f \) Erase a cell or not? 0 (erase) → 1 (remember)
Output gate \( \Gamma_o \) How much to reveal of a cell? 0 (hide) → 1 (reveal)
❏ LSTM
Long Short-Term Memory (LSTM) units deal with the vanishing gradient problem encountered by traditional RNNs[3].
LSTM Architecture

Image 2: Architecture of an LSTM memory cell

The multiplicative nodes in an LSTM memory cell act as gates that control the flow of information through the network, letting it decide how much of each signal from the input data should pass through. These multiplicative interactions occur at three main points: an LSTM cell contains three gates and one cell state.

The forget gate multiplies the previous cell state by another value between 0 and 1, deciding how much information from the previous time step to retain or discard.
\( f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \)
The input gate multiplies the candidate cell state by a value produced by a sigmoid function as the activation function, between 0 and 1, determining how much new information from the current input to add to the memory.
\( i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \)
The input gate then generates a new candidate cell state:
\( \tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \)
The cell state \( C_t \) is updated by the long-term memory:
\( C_t = f_t \star C_{t-1} + i_t \star \tilde{C}_t \)
Finally, the output gate produces a gating value \( o_t \); multiplying the updated cell state (after passing through a tanh) by \( o_t \) determines what part of the internal state becomes visible as the new hidden state \( h_t \):
\( o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \) and \( h_t = o_t \star \tanh(C_t) \)
Below is a table summing up the characterizing equations of the architecture:
Characterization Definition Equation (LSTM)
\( \tilde{C}_t \) Candidate cell state \( \tanh(W_c [h_{t-1}, x_t] + b_c) \)
\( C_t \) Updated cell state (Long-term memory) \( f_t \star C_{t-1} + i_t \star \tilde{C}_t \)
\( h_t \) Hidden state (Output/Short-term memory) \( o_t \star \tanh(C_t) \)
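
The gate and state equations above can be collected into a single NumPy time-step function. This is an illustrative sketch, not the report's model; the concatenated weight shapes are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep; W[k], b[k] are the parameters of gate k in {f, i, c, o}."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])       # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde           # updated cell state (long-term memory)
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate
    h_t = o_t * np.tanh(c_t)                     # hidden state (short-term memory)
    return h_t, c_t

# Example: hidden size 4, input size 3 (sizes illustrative)
rng = np.random.default_rng(0)
H, D = 4, 3
W = {k: rng.normal(size=(H, H + D)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```

Each line corresponds one-to-one with a row of the table above, which makes it easy to check the gating logic against the equations.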

Attention Model

DistilBERT

❏ Knowledge distillation
Knowledge distillation is a compression technique in which a compact model (the student) is trained to reproduce the behaviour of a larger model (the teacher) or an ensemble of models[4].
❏ Training loss
The student is trained with a distillation loss over the soft target probabilities of the teacher:
\( \displaystyle L_{ce} = \sum_{i} t_i \cdot \log(s_i) \)
where \( t_i \) (resp. \( s_i \)) is a probability estimated by the teacher (resp. the student). This objective results in a rich training signal by leveraging the full teacher distribution, using a softmax-temperature:
\( \displaystyle p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)} \)
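
A minimal NumPy sketch of these two formulas, written with the conventional negative sign on the cross-entropy; the temperature T is a free hyperparameter:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax: a larger T flattens the distribution."""
    z = logits / T
    z = z - z.max()                       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between the teacher's soft targets t_i and student probs s_i."""
    t = softmax_T(teacher_logits, T)
    s = softmax_T(student_logits, T)
    return -np.sum(t * np.log(s))
```

By Gibbs' inequality this loss is minimized exactly when the student's distribution matches the teacher's, which is why the full teacher distribution gives a richer training signal than hard labels alone.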

❏ Student architecture
The student, DistilBERT, has the same general architecture as BERT. The token-type embeddings and the pooler are removed, while the number of layers is halved. Most of the operations used in the Transformer architecture (linear layers and layer normalisation) are highly optimized in modern linear algebra frameworks, and the DistilBERT authors found that variations in the last dimension of the tensor (the hidden size) have a smaller impact on computation efficiency (for a fixed parameter budget) than variations in other factors such as the number of layers[4].

DistilBERT Architecture

Image 3: DistilBERT architecture overview

References

Background Knowledge Sources

  • [1] Shervine Amidi. Recurrent Neural Networks cheatsheet. CS 230 Deep Learning, Stanford University. Link
  • [2] Schmidt, R. M. (2019). Recurrent Neural Networks (RNNs): A gentle Introduction and Overview. arXiv preprint arXiv:1912.05911. Link
  • [3] Sherstinsky, Alex (2020). Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network. Physica D: Nonlinear Phenomena, 404, 132306. Elsevier BV. Link
  • [4] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2020). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Link

1. Problem Statement - Exploratory Data Analysis

Problem statement

Text Classification

Documents such as papers, newsletters, and books span many genres, and manually categorizing them is slow and impractical at scale. We therefore evaluate an RNN-based LSTM and a Transformer-based DistilBERT on the task of classifying each document's category from its natural-language content, and compare their accuracy.

  • RNN baseline: LSTM
  • Transformer model: DistilBERT

Dataset summary

Dataset information

  • Dataset Name: 20_newsgroups
  • Number of classes (corpus): 20
  • Total documents: 18,846
    • Training samples: 11,314
    • Test samples: 7,532
  • Documents per corpus:
    • alt.atheism: 799
    • comp.graphics: 973
    • comp.os.ms-windows.misc: 985
    • comp.sys.ibm.pc.hardware: 982
    • ...

Corpus Statistics

Statistic Value / Info
Total words 3,423,145
Total characters 22,043,554
Min docs/class 628 (talk.religion.misc)
Max docs/class 999 (rec.sport.hockey)
Mean docs/class 942.3
Std 97.0

Per-document Word Count

Metric Value
Mean 181.6
Median 83.0
Std 501.3
Min 0
Max 11,765
Q25 40
Q75 166

Word Analysis

Word Frequency

Word Frequency

Word Analysis

Word Correlation / Corpus

Word Correlation

2. Dataset, DataLoader, and augmentation setup

1. Loading & EDA

Dataset overview

  • Loaded from HuggingFace: SetFit/20_newsgroups
  • 18,846 total docs, 20 classes
  • Train: 11,314 | Test: 7,532
  • Class imbalance: 628 (min) to 999 (max) docs per class

2. Text Preprocessing

Cleaning and Truncation

  • Text cleaning: removed quoted text, email headers, and signatures
  • Short doc removal: dropped < 50 chars (~544 docs, 2.9%)
  • Truncation: max 400 words (aligned with BERT 512 token limit)
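
A simplified sketch of the short-document filter and truncation steps (the report's full cleaning also strips quoted text, email headers, and signatures, which is omitted here; the thresholds follow the bullets above):

```python
def filter_and_truncate(text, max_words=400, min_chars=50):
    """Drop documents shorter than min_chars; keep at most max_words words."""
    if len(text) < min_chars:
        return None                               # short-doc removal
    return " ".join(text.split()[:max_words])     # truncation to 400 words

doc = "word " * 1000
kept = filter_and_truncate(doc)
```

Truncating at ~400 words keeps most documents under BERT's 512 wordpiece-token limit, since one word usually maps to one or a few subword tokens.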

3. Train/Val/Test Split

Splitting strategy

Split Samples Note
Train 8,584 Stratified
Validation 2,146 Stratified
Test 7,112 Stratified
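
Stratified splitting keeps each class's share of documents equal across splits; a pure-Python sketch of the idea (the actual split was presumably done with a library helper such as scikit-learn's train_test_split with stratify):

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.2, seed=42):
    """Return (train_idx, val_idx), sampling val_frac of each class for validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = int(round(len(idxs) * val_frac))      # per-class validation quota
        val_idx.extend(idxs[:k])
        train_idx.extend(idxs[k:])
    return train_idx, val_idx

labels = [0] * 100 + [1] * 300                    # illustrative imbalanced labels
train_idx, val_idx = stratified_split(labels)
```

With val_frac=0.2 this matches the report's 8,584/2,146 train/validation proportions while preserving the class imbalance in both splits.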

4. Tokenization

Tokenizer details

  • Tokenizer: distilbert-base-uncased
  • Max length per sentence: 512 tokens
  • Dynamic padding via DataCollatorWithPadding
  • ~90% of docs fit under 512 tokens

5. Class Balancing

Weighted loss

  • Computed class weights: 1.0 / class_counts
  • Normalized weights to sum to num_classes
  • Applied via CrossEntropyLoss(weight=class_weights)
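
The three bullets above amount to a few lines; a sketch with hypothetical per-class counts (the real counts come from the 20_newsgroups training split):

```python
import numpy as np

class_counts = np.array([628, 999, 850, 942])          # hypothetical counts, 4 classes
weights = 1.0 / class_counts                           # inverse-frequency weights
weights = weights * len(class_counts) / weights.sum()  # normalize: sum == num_classes
```

The resulting array can then be passed as the weight argument of torch.nn.CrossEntropyLoss so that rarer classes contribute proportionally more to the loss.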

6. DataLoader

Batch setup

  • Batch size: 12 (train), 36 (val/test)
  • Dynamic padding per batch
  • No .set_format('torch') on the dataset; the DataCollator handles tensor conversion
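
Dynamic padding pads each batch only to its own longest sequence rather than to the global 512-token maximum; a pure-Python sketch of what DataCollatorWithPadding does for input_ids and attention_mask (token ids here are illustrative):

```python
def collate_with_dynamic_padding(batch, pad_id=0):
    """Pad token-id sequences to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

out = collate_with_dynamic_padding([[101, 7592, 102], [101, 102]])
```

Since ~90% of documents are well under 512 tokens, per-batch padding avoids wasting compute on padding positions.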

3. Model building, training, evaluation, and comparison

RNN pipeline

BiLSTM text classification architecture


Transformer pipeline

DistilBERT text classification architecture


RNN model

BiLSTM with Attention

Architecture

  • Embedding: 300d GloVe (43.3% vocab coverage) | TF-IDF weighted
  • Backbone: 2-layer BiLSTM (128 units/dir)
  • Attention: Self-attention over all LSTM timesteps
  • Dropout: 0.3 | Total Params: 15.8M
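
The self-attention step can be sketched as a learned weighted average over the BiLSTM's timestep outputs. This is a minimal additive-attention sketch; the report does not spell out its exact attention form, so the scoring function here is an assumption:

```python
import numpy as np

def attention_pool(states, w):
    """states: (T, d) BiLSTM outputs; w: (d,) learned scoring vector.
    Score each timestep, softmax over time, return the weighted sum."""
    scores = np.tanh(states) @ w               # (T,) unnormalized timestep scores
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                # attention weights, sum to 1
    return alpha @ states                      # (d,) context vector for the classifier

states = np.ones((5, 8))                       # 5 timesteps with identical outputs
context = attention_pool(states, np.zeros(8))
```

The context vector replaces the usual "last hidden state" summary, letting the classifier weight informative timesteps anywhere in the sequence.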

Training

  • Optimizer: Adam (1e-3) | Batch size: 16
  • Scheduler: Cosine with warmup | Epochs: 20
  • Loss: Weighted CrossEntropy | Seq length: 256

Evaluation

  • Evaluation Metrics: Accuracy, F1 Macro, F1 Weighted, and a per-class classification_report.

Comparison

  • Systematic evaluation across Val Acc, Test Acc, F1 Macro, Train Time (s), Inference (ms), and Params.

Transformer model

DistilBERT

Model building

  • Pretrained checkpoint: distilbert-base-uncased
  • Classification head: pre_classifier (768×768) → ReLU → classifier (768×20)
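
The head maps DistilBERT's 768-dimensional [CLS] representation to 20 class logits; a NumPy sketch of the forward pass (random weights for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_classes = 768, 20
W1, b1 = rng.normal(0, 0.02, size=(hidden, hidden)), np.zeros(hidden)        # pre_classifier
W2, b2 = rng.normal(0, 0.02, size=(n_classes, hidden)), np.zeros(n_classes)  # classifier

def head_forward(cls_embedding):
    """[CLS] embedding -> pre_classifier (768x768) -> ReLU -> classifier (768x20)."""
    h = np.maximum(0.0, W1 @ cls_embedding + b1)
    return W2 @ h + b2

logits = head_forward(rng.normal(size=hidden))
```

In the real model these logits feed the weighted cross-entropy loss described in the data-preparation section.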

Training

  • Fine-tuning epochs: 15 (Phase 1: only head unfrozen, Phase 2: all layers and head unfrozen)
  • Optimizer: AdamW
  • Learning rate: layerwise differential learning rates (detailed in Extension 1)
  • Scheduler: Linear Warmup and Cosine Annealing
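
The scheduler's shape can be sketched in a few lines. This is a generic linear-warmup/cosine-annealing curve; the report does not state the warmup length, so warmup_steps here is illustrative:

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then cosine annealing down to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warmup avoids large early updates to the pretrained weights; the cosine tail shrinks the step size as training converges.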

Evaluation

  • Evaluation Metrics: Accuracy, F1 Macro, F1 Weighted, and a per-class classification_report.

Comparison

  • Comprehensive evaluation across Val Acc, Test Acc, F1 Macro, Train Time (s), Inference (ms), and Params.

4. Experimental results, figures, analysis, and discussion

Evaluation

DistilBERT Performance

After training, the model was evaluated on the held-out test set (7,112 samples) using multiple metrics to provide a comprehensive view of classification performance.

Test Results
Metric Value
Accuracy 73.88%
F1 Macro 0.7268
F1 Weighted 0.7400
Precision 0.7299
Recall 0.7258
Best class rec.sport.hockey (F1: 0.93)
Worst class talk.religion.misc (F1: 0.32)
DistilBERT Confusion Matrix

Image 4: DistilBERT Confusion Matrix

DistilBERT F1 Scores Bar Chart

Image 5: DistilBERT F1 Scores Bar Chart

RNN Performance

BiLSTM with Attention

The BiLSTM model was evaluated after 20 epochs, reaching its peak validation accuracy of 69.72% at epoch 18.

Final Test Results
Metric Value
Accuracy 62.90%
F1 Macro 0.6179
F1 Weighted 0.6311
Precision 0.6251
Recall 0.6155
Best class rec.sport.hockey (F1: 0.89)
Worst class talk.religion.misc (F1: 0.19)
BiLSTM+Attention Confusion Matrix

Image 6: BiLSTM with Attention Confusion Matrix

BiLSTM+Attention F1 Scores Bar Chart

Image 7: BiLSTM with Attention F1 Scores Bar Chart

Discussion

Key Insights and Discussion

Model Behavior and Overfitting

  • The gap between validation accuracy and test accuracy (e.g., DistilBERT 78.89% vs 73.88%) suggests mild overfitting. This is partly attributed to the relatively small training set (~8,500 samples) distributed across 20 diverse classes.
  • Class weighting in the loss function helped improve recall for underrepresented categories like talk.religion.misc and talk.politics.misc, though they still remain the most challenging for both architectures.

Semantic Analysis

  • Classes with clear topical boundaries (e.g., rec.sport.hockey, sci.space) consistently achieved higher F1 scores (~0.80+ for DistilBERT, ~0.70+ for BiLSTM) due to unique technical vocabularies.
  • Semantically overlapping classes (e.g., politics vs guns, or different computer hardware categories) showed higher confusion rates, which is expected given their shared terminology.

Error Patterns (Confusion Matrix Analysis)

Specific misclassifications reveal systemic issues in text understanding:

  • Religion: talk.religion.misc is frequently confused with alt.atheism and soc.religion.christian due to high lexical overlap in philosophical discussions.
  • Politics: talk.politics.misc is often misidentified as guns or mideast subcategories, as general political discussions often touch on these specific themes.
  • Technical brevity: Short documents in technical categories like comp.graphics are sometimes confused with mac.hardware when they lack specific keywords but share general technical terms.

Comparison

Systematic Comparison

A systematic comparison will be conducted across three dimensions:

  1. Architecture comparison: DistilBERT (67M params) vs LSTM — comparing transformer-based models against a recurrent baseline to evaluate the impact of self-attention and pretrained representations on text classification performance.
  2. Fine-tuning strategy comparison: For each transformer model, three strategies are evaluated:
    • Freeze backbone (train classification head only)
    • Full fine-tuning (train all layers from start with layerwise LR)
    • Hybrid (train head first, then unfreeze backbone with differential LR)
  3. Model efficiency comparison: Accuracy vs F1 Macro, Training time, Inference latency and Number of Parameters — to determine whether the larger BERT-base model justifies its additional computational cost over DistilBERT.

Results will be reported in the following format after all experiments are completed:

Model Val Acc Test Acc F1 Macro Train Time (s) Inference (ms) Params
DistilBERT (Hybrid) 78.89% 73.88% 0.7268 735s 44.1ms 67.0M
BiLSTM + Attention 69.72% 62.90% 0.6179 85s - 15.8M

Key Findings: Transformer vs RNN

The ~11% accuracy gap between the BiLSTM (62.90%) and DistilBERT (73.88%) demonstrates that:

  • Contextualized representations are essential: Static GloVe embeddings cannot capture the polysemy and context-dependent meanings that DistilBERT/BERT handle via self-attention.
  • Attention benefit: While attention improved BiLSTM baseline (~54% → 63%), it cannot fully compensate for the lack of bidirectional contextual pretraining.
  • Efficiency Trade-off: The BiLSTM with Attention trains roughly 9-16x faster than the transformer models (85s vs 735-1,337s), making it a viable candidate for resource-constrained environments where an ~11% accuracy drop is acceptable.

5. Other extension reports

Extension 1

Fine-tuning Strategy Comparison

Three fine-tuning strategies were evaluated on both DistilBERT (67M params) and BERT-base (110M params):

  • Freeze backbone: Train only the classification head; backbone weights remain frozen
  • Full fine-tune: Train all layers from the start with layerwise differential learning rates
  • Hybrid: Train classification head first (5 epochs), then unfreeze backbone with differential LR (15 epochs)

Learning Rate Strategy

A layerwise differential learning rate strategy was applied to both the Hybrid and Full fine-tune approaches to prevent catastrophic forgetting and ensure stable convergence:

  • Phase 1 (Warmup): The learning rate is set to 5e-5 for the classifier layer to adapt it to the 20_newsgroups vocabulary.
  • Phase 2 (Full Tuning): The classifier remains at 5e-5. The backbone uses a progressive LR: 1e-6 for embedding layers, 2e-6 for initial encoder layers, and 5e-5 for the final layers closest to the output.
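
One way to realize such a progressive schedule is geometric interpolation from the embedding LR to the top-layer LR. This is a sketch under assumptions: the report pins only the endpoints (1e-6 for embeddings, 5e-5 for the top) and the 2e-6 value for early encoder layers, so the interpolation rule is illustrative:

```python
def layerwise_lrs(num_encoder_layers=6, lr_embed=1e-6, lr_top=5e-5):
    """LR per depth: embeddings first, then each encoder layer, increasing geometrically."""
    ratio = (lr_top / lr_embed) ** (1.0 / num_encoder_layers)
    return [lr_embed * ratio ** i for i in range(num_encoder_layers + 1)]

lrs = layerwise_lrs()   # DistilBERT has 6 encoder layers
```

Each entry would become one optimizer parameter group (e.g. with AdamW), keeping updates small near the pretrained lower layers to prevent catastrophic forgetting.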

Results

Model Strategy Val Acc Test Acc Train Time (s) Inference (ms) Params
DistilBERT Freeze backbone 67.38% 66.15% 292 44.3 66,968,852
DistilBERT Hybrid 78.89% 73.88% 735 44.1 66,968,852
DistilBERT Full fine-tune 79.33% 74.06% 644 44.4 66,968,852
BERT Freeze backbone 64.80% 62.93% 501 88.2 109,497,620
BERT Hybrid 79.86% 74.77% 1,337 88.3 109,497,620
BERT Full fine-tune 79.96% 74.17% 1,168 87.9 109,497,620
Training Loss Curves

Image 8: Training Loss Comparison

Validation Accuracy Curves

Image 9: Validation Accuracy Comparison (Transformers)

Analysis

  • The hybrid strategy achieved the best test accuracy for BERT (74.77% vs 74.17% for full fine-tuning) and was nearly tied with full fine-tuning for DistilBERT (73.88% vs 74.06%). Training the classification head first provides a stable initialization before the pretrained backbone is updated.
  • Full fine-tuning reached slightly higher validation accuracy for both models but lower test accuracy for BERT, suggesting that updating all layers simultaneously without a warm-up phase can leave the classification head marginally less well adapted.
  • Freeze backbone performed significantly worse (roughly 8-12% lower test accuracy than hybrid), confirming that the pretrained representations alone are insufficient for this 20-class task and that backbone adaptation is necessary.
  • Training time scales linearly with model complexity: hybrid takes ~15% longer than full fine-tune due to the additional Phase 1 epochs, but the accuracy gain justifies the cost.

Extension 2

Model Efficiency Comparison

A comparison of model efficiency across DistilBERT and BERT-base, evaluating accuracy vs model size and inference time.

Metric DistilBERT BERT-base Ratio
Parameters 67.0M 109.5M 1.63x
Best Test Accuracy 73.88% 74.77% +0.89%
Training Time 735s 1,337s 1.82x
Inference Time 44.1ms 88.3ms 2.0x

Analysis

  • BERT-base provides only a marginal accuracy improvement (+0.89%) over DistilBERT while requiring 1.63x more parameters, 1.82x longer training time, and 2.0x slower inference.
  • DistilBERT offers a significantly better accuracy-to-efficiency trade-off: it achieves 98.8% of BERT's accuracy at significantly lower computational cost (1.8x faster training, 2x faster inference).
  • This result aligns with the design goal of knowledge distillation — DistilBERT retains ~98% of BERT's performance while being significantly smaller and faster.

Simple Compression: DistilBERT as a Compressed Model

DistilBERT itself serves as a compressed version of BERT-base, produced through knowledge distillation during pretraining:

Aspect BERT-base DistilBERT Reduction
Encoder layers 12 6 50%
Parameters 109.5M 67.0M 39%
Inference time 88.3ms 44.1ms 50%
Test accuracy (hybrid) 74.77% 73.88% -0.89%

The experiment demonstrates that model compression via distillation is an effective strategy for reducing model size and inference latency with minimal accuracy loss. For this text classification task, the 39% parameter reduction and 50% inference speedup come at a cost of less than 1% accuracy — a favorable trade-off for most production applications.