Assignment 1 · Multimodal

ViT-B/16 priority with RN50 extension

Multimodal classification

Zero-shot vs Few-shot on Image + Text

This report is written in the required order: problem → dataset → method → experiments → results → analysis → conclusion. The primary focus is ViT-B/16, with an additional ResNet50 comparison under augmentation and no-augmentation settings.

Owner

Vu Hai Tuan

Instructor

Le Thanh Sach

Primary model

ViT-B/16

Extension model

ResNet50

Overall Pipeline (Zero-shot & Few-shot)

A unified, box-based overview of the full multimodal pipeline.

ZERO-SHOT PIPELINE

CLIP multimodal zero-shot classification

Input: image + paired text
Backbone: CLIP Encoder (ViT-B/16 or RN50, frozen)
Head: Prompt Matching, z = (f_img + f_txt)/2
Output: predicted class
  • Prompt bank: class prompts are encoded once (e.g., a photo of <class>).
  • Decision rule: choose the class with maximum cosine similarity.

FEW-SHOT PIPELINE — LINEAR HEAD

CLIP multimodal few-shot adaptation (Linear)

Input: K-shot support + query
Backbone: CLIP Encoder (ViT-B/16 or RN50, frozen)
Head: Linear Classifier, z = [f_img ; f_txt], o = Wz + b
Output: class + Accuracy/Precision/F1
  • Training: only the linear head is trained on support samples.
  • Data setting: evaluated with 5-shot and 10-shot under no_aug / aug.

FEW-SHOT PIPELINE — PROTOTYPE HEAD

CLIP multimodal few-shot adaptation (Prototype)

Input: K-shot support + query
Backbone: CLIP Encoder (ViT-B/16 or RN50, frozen)
Head: Prototype Classifier, p_c = (1/K) Σ z_i
Output: class + Accuracy/Precision/F1
  • Inference: predict with argmax cos(z, p_c) over class prototypes.
  • Data setting: evaluated with 5-shot and 10-shot under no_aug / aug.

Multimodal Classification

Follow these steps

Problem Definition

Multimodal classification with paired image-text inputs, comparing zero-shot and few-shot performance on the main CLIP ViT-B/16 backbone and the CLIP RN50 extension backbone.

Problem statement

CLIP ViT-B/16: zero-shot vs few-shot

The task is multimodal classification where each sample contains a paired image and text. We compare zero-shot and few-shot under the same protocol, with CLIP ViT-B/16 as the main backbone and CLIP RN50 as an extension.

  • Main backbone: CLIP ViT-B/16
  • Extension backbone: CLIP RN50
  • Metrics: Accuracy, Precision, and F1-score

Dataset summary

Why UPMC-Food101 fits this assignment

  • True image-text pairs from the same recipe/page
  • About 100,000 samples
  • 101 original classes (10-class subset in this report)
  • Web-collected variability for realistic evaluation

Dataset Description and EDA

Dataset summary

  • Uses genuine image-text pairs (same entity/event), not random pairing.
  • Supports train/validation/test and class-wise few-shot sampling.
  • Suitable for zero-shot and few-shot evaluation protocols.

EDA checklist

  • Class distribution across train/val/test.
  • Example samples showing correct image-text alignment.
  • Text length statistics (mean/median/percentiles).
  • Class imbalance notes for per-class analysis.

EDA

Train/Val/Test statistics and class set

  • Classes (10): donuts, french_fries, hamburger, hot_dog, ice_cream, pho, pizza, steak, sushi, tacos
  • Train titles: 67,988
  • Test titles: 22,716
  • Train images: 6,866 (missing titles: 0)
  • Test images: 2,295 (missing titles: 0)
  • Final split: train 6,179 · val 687 · test 2,295

10-class subset

UPMC-Food101 categories used in the multimodal task

donuts french_fries hamburger hot_dog ice_cream pho pizza steak sushi tacos

Dataset link: UPMC-Food101 on Kaggle

Dataset, DataLoader, and Augmentation Setup

The preprocessing pipeline follows the implementation in the ViT-B/16 notebook, with explicit image transforms, text mapping from CSV files, and reproducible split/DataLoader setup.

Image preprocessing

  • CLIP normalization is used with fixed mean/std for all runs.
  • no_aug: baseline transform pipeline (resize + normalize).
  • aug: RandomResizedCrop(224), RandomHorizontalFlip, and ColorJitter before normalization.
  • Validation/Test always use fixed transform (no random augmentation).
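The fixed-normalization step can be sketched in plain NumPy. The mean/std values below are the standard CLIP preprocessing constants; the function name is illustrative, not taken from the notebook:

```python
import numpy as np

# Standard CLIP preprocessing constants (per RGB channel).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def clip_normalize(image):
    """Normalize an HxWx3 float image in [0, 1] with CLIP's mean/std."""
    return (image - CLIP_MEAN) / CLIP_STD

# Example: a mid-gray 224x224 image maps to channel-wise constants.
gray = np.full((224, 224, 3), 0.5)
out = clip_normalize(gray)
```

In the actual pipeline this normalization is applied after the resize (no_aug) or after RandomResizedCrop/flip/ColorJitter (aug), and identically for validation and test.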

Text preprocessing

  • Text is loaded from train_titles.csv and test_titles.csv.
  • Each image filename is mapped to its paired title to keep true image-text alignment.
  • If a title is missing, fallback text is created: a photo of <class>.
  • Labels are encoded with label2id / id2label for consistent training and evaluation.
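The filename-to-title mapping with the fallback prompt can be sketched as below. The CSV column names (`image`, `title`) are assumptions for illustration; the real files may use different headers:

```python
import csv
import io

def build_title_map(csv_text):
    """Map each image filename to its paired title (column names assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["image"]: row["title"] for row in reader}

def get_text(title_map, filename, class_name):
    """Return the paired title, or the fallback prompt if the title is missing."""
    title = title_map.get(filename, "").strip()
    return title if title else f"a photo of {class_name}"

# Toy example data in place of train_titles.csv.
demo_csv = "image,title\npizza_001.jpg,homemade margherita pizza\n"
titles = build_title_map(demo_csv)
```

The fallback keeps every sample usable without breaking the true image-text alignment of the rest of the data.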

Split and DataLoader

  • Dataset is downloaded, unzipped, and loaded from fixed train/test directories.
  • Train is split into train/val with VAL_RATIO = 0.1, random_state = 42, and stratified labels.
  • MultimodalDataset returns (image, text, label) for each sample.
  • DataLoaders use batch size 64, shuffle in train only, and fixed evaluation loaders for val/test.
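A minimal sketch of the stratified train/val split with VAL_RATIO = 0.1 and a fixed seed. The notebook presumably uses sklearn's `train_test_split`; this pure-Python version mirrors the per-class proportional behaviour:

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, val_ratio=0.1, seed=42):
    """Split samples into train/val, keeping class proportions."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    rng = random.Random(seed)
    train, val = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        n_val = max(1, round(len(items) * val_ratio))
        val += [(s, label) for s in items[:n_val]]
        train += [(s, label) for s in items[n_val:]]
    return train, val

# Toy data: 10 classes, 10 samples each.
samples = [f"img_{i}" for i in range(100)]
labels = [i % 10 for i in range(100)]
train, val = stratified_split(samples, labels)
```

Each class contributes the same fraction to validation, which is what keeps per-class few-shot sampling and evaluation comparable across runs.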

CLIP-Core Backbone

Core model

What is CLIP?

CLIP (Contrastive Language-Image Pretraining) learns aligned visual and textual representations from large-scale image-text pairs in a shared embedding space. This enables image-text similarity matching and strong prompt-based zero-shot classification without task-specific training.

  • Learns from large-scale paired image-text data rather than relying only on task-specific fixed class labels
  • Uses natural-language prompts for zero-shot classification
  • Serves as the pretrained backbone for both zero-shot and few-shot experiments

Architecture overview

Image encoder + Text encoder + similarity matching

CLIP architecture for multimodal image-text learning

CLIP matches strong supervised vision baselines in zero-shot settings and serves as the shared pretrained backbone for all experiments in this report.

Zero-shot Classification

  • We use pretrained CLIP ViT-B/16 and CLIP RN50 as the zero-shot baselines.
  • Each class label is converted into a natural-language text prompt.
  • For each test sample, the image and its paired text are encoded into the shared CLIP embedding space.
  • The image and text embeddings are fused to form a multimodal representation.
  • Cosine similarity is computed between the fused sample representation and the class prompt embeddings, and the highest-scoring class is selected.
  • Advantage: no task-specific training is required.
  • Limitation: performance is sensitive to prompt design, fusion strategy, and domain mismatch.

Few-shot Adaptation

Few-shot setup

  • Few-shot adaptation is built on pretrained CLIP ViT-B/16 and CLIP RN50 backbones.
  • We evaluate low-data settings with labeled support samples in 5-shot and 10-shot configurations.
  • Two lightweight adaptation strategies are compared: a Linear classifier and a Prototype-based classifier.
  • Both no_aug and aug settings are evaluated under the same protocol.

Fusion / adaptation note

  • In this implementation, few-shot adaptation uses both image and paired text embeddings extracted from CLIP.
  • The image and text features are concatenated to form a multimodal representation.
  • Let f_img(x) be the image embedding and f_txt(t) be the text embedding.
  • The fused multimodal feature is defined as z = [f_img(x) ; f_txt(t)].

Method 1

Linear Classifier

  • The fused feature z is fed into a trainable linear classifier.
  • Logits are computed as o = Wz + b.
  • Class probabilities are obtained by p(y|x,t) = softmax(Wz + b).
  • The model is trained with cross-entropy loss: L = -log p(y = y_true | x,t).
  • In this setting, CLIP encoders are frozen and only the lightweight classifier head is trained.
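A minimal NumPy sketch of the linear head: logits o = Wz + b, softmax cross-entropy, and plain gradient descent on W and b only (the encoders stay frozen, so Z is fixed). Dimensions are illustrative, not the real CLIP embedding sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 8))       # fused features z = [f_img ; f_txt] (toy dim 8)
y = rng.integers(0, 3, size=30)    # toy class labels (3 classes)
W = np.zeros((8, 3))
b = np.zeros(3)

def softmax(o):
    e = np.exp(o - o.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(Z, y, W, b):
    """Cross-entropy L = -log p(y_true | x, t)."""
    p = softmax(Z @ W + b)
    return -np.log(p[np.arange(len(y)), y]).mean()

loss0 = ce_loss(Z, y, W, b)
for _ in range(50):                # gradient descent on the head only
    p = softmax(Z @ W + b)
    p[np.arange(len(y)), y] -= 1.0 # dL/do for softmax + cross-entropy
    W -= 0.1 * Z.T @ p / len(y)
    b -= 0.1 * p.mean(axis=0)
loss1 = ce_loss(Z, y, W, b)
```

With W = 0 the initial loss is ln(3) (uniform predictions over 3 classes), and the loss decreases monotonically under a small step size.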

Method 2

Prototype-based Classifier

  • For each class c, a prototype is computed from the support set.
  • The class prototype is defined as p_c = (1 / K) ∑ z_i, where z_i are fused support features from class c.
  • For a query sample, prediction is based on similarity between z and each class prototype p_c.
  • A common decision rule is ŷ = argmax cos(z, p_c).
  • This method does not require learning a large classifier head and is effective in low-shot settings.
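The prototype head can be sketched in a few lines of NumPy: average the fused support features per class, then predict by argmax cosine similarity. The 2-D toy features below are for clarity only:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_prototypes(support, labels, n_classes):
    """p_c = (1/K) sum of fused support features z_i of class c."""
    return np.stack([support[labels == c].mean(axis=0) for c in range(n_classes)])

def predict(query, prototypes):
    """ŷ = argmax_c cos(z, p_c)."""
    sims = l2norm(query) @ l2norm(prototypes).T
    return sims.argmax(axis=1)

support = np.array([[1.0, 0.1], [0.9, 0.0],    # class 0 support features
                    [0.0, 1.0], [0.1, 0.9]])   # class 1 support features
labels = np.array([0, 0, 1, 1])
protos = build_prototypes(support, labels, n_classes=2)
pred = predict(np.array([[1.0, 0.0], [0.0, 1.0]]), protos)
```

No parameters are learned, which is why this head is stable when only K = 5 or 10 support samples per class are available.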

Experimental Setup

This section describes the implementation and evaluation protocol used for zero-shot and few-shot multimodal classification.

Implementation setting

  • The experiments are implemented in PyTorch using pretrained CLIP ViT-B/16 and CLIP RN50 backbones.
  • All experiments are conducted in a multimodal setting, where each sample contains an image, its paired text, and a class label.
  • A fixed random seed (seed = 42) is used for reproducible few-shot sampling.
  • The same train, validation, and test protocol is applied across all compared settings for a fair evaluation.

Zero-shot protocol

  • For zero-shot classification, each class name is converted into a prompt of the form "a photo of <class>".
  • Class prompt embeddings are computed once using the CLIP text encoder.
  • For each test sample, the image and its paired text are separately encoded and normalized.
  • The two embeddings are fused by averaging: z = (f_img + f_txt) / 2, followed by L2 normalization.
  • Prediction is obtained by comparing the fused sample feature with class prompt embeddings using scaled cosine similarity.
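The zero-shot decision rule above can be sketched in NumPy. The embeddings are toy values, and the scale of 100 is an assumption standing in for CLIP's learned logit scale (it does not change the argmax):

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Class prompt embeddings ("a photo of <class>"), computed once (toy values).
prompt_emb = l2norm(np.array([[1.0, 0.0], [0.0, 1.0]]))

def zero_shot_predict(f_img, f_txt, prompt_emb, scale=100.0):
    """Fuse by averaging, L2-normalize, then scaled cosine vs class prompts."""
    z = l2norm((l2norm(f_img) + l2norm(f_txt)) / 2.0)
    logits = scale * z @ prompt_emb.T
    return logits.argmax(axis=-1)

f_img = np.array([[0.9, 0.1]])   # toy image embedding, close to class 0
f_txt = np.array([[0.8, 0.2]])   # toy paired-text embedding
pred = zero_shot_predict(f_img, f_txt, prompt_emb)
```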

Few-shot protocol

  • Few-shot support sets are constructed with class-wise K-shot sampling, where exactly K samples are selected per class.
  • We evaluate 5-shot and 10-shot settings under both no_aug and aug image pipelines.
  • In the linear few-shot model, CLIP encoders are frozen and only a lightweight classifier head is trained.
  • Image and text embeddings are normalized and concatenated to form the multimodal feature: z = [f_img ; f_txt].
  • The classifier head is defined as a linear layer on top of the fused feature representation.
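Class-wise K-shot sampling with the fixed seed can be sketched as follows (exactly K support indices drawn per class; the function name is illustrative):

```python
import random
from collections import defaultdict

def sample_k_shot(labels, k, seed=42):
    """Select exactly k support indices per class, reproducibly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    support = []
    for label in sorted(by_class):
        support += rng.sample(by_class[label], k)
    return support

# Toy data: 10 classes, 20 samples each.
labels = [i % 10 for i in range(200)]
support = sample_k_shot(labels, k=5)
```

Because the seed is fixed, the 5-shot and 10-shot support sets are identical across the no_aug and aug runs, so only the pipeline and head differ between compared settings.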

Prototype setting and evaluation

  • In the prototype-based setting, class prototypes are computed from the few-shot support set using fused multimodal features.
  • For each class, the prototype is obtained by averaging support features of that class, and prediction is based on similarity to the query feature.
  • Validation data is used to monitor the few-shot setting, while final results are reported on the held-out test set.
  • The main reported metrics are Accuracy and Weighted F1-score; Precision can also be included in the comparison table for consistency.
  • This setup allows a fair comparison between zero-shot and few-shot multimodal classification under limited labeled data.

Setting group     | Compared configurations
Zero-shot         | CLIP ViT-B/16, CLIP RN50
Few-shot (no_aug) | 5-shot no_aug, 10-shot no_aug
Few-shot (aug)    | 5-shot aug, 10-shot aug
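The Weighted F1 metric used throughout the results is the support-weighted mean of per-class F1 scores. A minimal NumPy implementation, equivalent in spirit to sklearn's `f1_score(average='weighted')`:

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    classes = np.unique(y_true)
    total, n = 0.0, len(y_true)
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += (np.sum(y_true == c) / n) * f1   # weight by class support
    return total

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])
score = weighted_f1(y_true, y_pred)
```

Weighting by support makes the metric robust to the mild class imbalance noted in the EDA.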

Results

The results are reported separately for the two CLIP backbones: ViT-B/16 as the main model and RN50 as the extension model.

Main model

CLIP ViT-B/16

For the ViT-B/16 backbone, we report one zero-shot baseline and two few-shot adaptation strategies: Linear classifier and Prototype-based classifier.

Method    | Head            | Setting   | Shot | Accuracy | Precision | F1
Zero-shot | Prompt matching | zero_shot | 0    | 0.8959   | 0.8981    | 0.8769
Few-shot  | Linear          | no_aug    | 5    | 0.8501   | 0.8604    | 0.8419
Few-shot  | Prototype       | no_aug    | 5    | 0.8837   | 0.8944    | 0.8855
Few-shot  | Linear          | no_aug    | 10   | 0.9163   | 0.9169    | 0.9131
Few-shot  | Prototype       | no_aug    | 10   | 0.9207   | 0.9263    | 0.9218
Few-shot  | Linear          | aug       | 5    | 0.8044   | 0.8380    | 0.7840
Few-shot  | Prototype       | aug       | 5    | 0.8806   | 0.8907    | 0.8824
Few-shot  | Linear          | aug       | 10   | 0.9190   | 0.9204    | 0.9159
Few-shot  | Prototype       | aug       | 10   | 0.9224   | 0.9262    | 0.9231
Zero-shot vs Few-shot results for CLIP ViT-B/16 with Linear and Prototype heads
  • Zero-shot already provides a strong baseline on ViT-B/16 with Accuracy = 0.8959 and F1 = 0.8769.
  • At 5-shot, the Linear head remains below the zero-shot baseline in both no_aug and aug settings, while the Prototype head stays much closer to the zero-shot result.
  • Increasing the support set from 5-shot to 10-shot consistently improves both heads, and all 10-shot variants outperform the zero-shot baseline in Accuracy and F1.
  • The Prototype head consistently outperforms the Linear head across all few-shot settings, showing better robustness in both low-shot and augmented conditions.
  • For the Linear head, the best result is 10-shot + aug with Accuracy = 0.9190, Precision = 0.9204, and F1 = 0.9159, slightly better than 10-shot + no_aug.
  • The best overall ViT-B/16 result is Prototype + 10-shot + aug, with Accuracy = 0.9224, Precision = 0.9262, and F1 = 0.9231.
  • Augmentation is not beneficial for the Linear head at 5-shot, but becomes slightly helpful at 10-shot. In contrast, the Prototype head remains strong and stable in both no_aug and aug settings.

Extension model

CLIP RN50

Method    | Head            | Setting   | Shot | Accuracy | Precision | F1
Zero-shot | Prompt matching | zero_shot | 0    | 0.8680   | 0.8996    | 0.8438
Few-shot  | Linear          | no_aug    | 5    | 0.8039   | 0.8330    | 0.8058
Few-shot  | Prototype       | no_aug    | 5    | 0.8806   | 0.8919    | 0.8825
Few-shot  | Linear          | no_aug    | 10   | 0.8514   | 0.8548    | 0.8489
Few-shot  | Prototype       | no_aug    | 10   | 0.8540   | 0.8727    | 0.8574
Few-shot  | Linear          | aug       | 5    | 0.8410   | 0.8504    | 0.8362
Few-shot  | Prototype       | aug       | 5    | 0.7917   | 0.8128    | 0.7943
Few-shot  | Linear          | aug       | 10   | 0.8784   | 0.8793    | 0.8774
Few-shot  | Prototype       | aug       | 10   | 0.8571   | 0.8751    | 0.8607
Zero-shot vs Few-shot results for CLIP RN50
  • RN50 zero-shot provides a strong baseline with Accuracy = 0.8680 and F1 = 0.8438.
  • The best RN50 result is Prototype + 5-shot + no_aug, which achieves Accuracy = 0.8806, Precision = 0.8919, and F1 = 0.8825, slightly outperforming the zero-shot baseline in Accuracy and F1.
  • Unlike ViT-B/16, RN50 does not show a consistent improvement when moving from 5-shot to 10-shot; the effect depends on the classifier head and augmentation setting.
  • For the Linear head, augmentation is beneficial in both shot settings, improving performance from 0.8039 → 0.8410 at 5-shot and from 0.8514 → 0.8784 at 10-shot.
  • For the Prototype head, augmentation is harmful at 5-shot but slightly helpful at 10-shot, indicating less stable behavior under low-shot augmentation.
  • Overall, RN50 is more sensitive to the adaptation setting than ViT-B/16, with the strongest RN50 performance obtained from the Prototype head in the 5-shot no_aug setting.

Discussion and Error Analysis

This section interprets the quantitative results using learning curves, calibration analysis, and confusion matrices for both ViT-B/16 and RN50.

ViT-B/16 discussion

Strong few-shot gains, but head behavior differs

  • Zero-shot ViT-B/16 is already a strong baseline with Accuracy = 0.8959 and F1 = 0.8769, showing that pretrained CLIP alignment transfers well to the multimodal food classification task.
  • Moving from 5-shot to 10-shot clearly improves performance for both heads, confirming that a small increase in labeled support data is enough to improve class separation.
  • The Prototype head is consistently more stable than the Linear head, especially in low-shot settings. The best overall ViT-B/16 result is Prototype + 10-shot + aug with Accuracy = 0.9224, Precision = 0.9262, and F1 = 0.9231.
  • For the Linear head, augmentation is not helpful at 5-shot, but becomes beneficial at 10-shot. This suggests that stronger variation is useful only when the support set is large enough to stabilize the classifier.

RN50 discussion

More sensitive to setting and less consistent than ViT-B/16

  • RN50 zero-shot is also competitive with Accuracy = 0.8680 and F1 = 0.8438, but it remains below the ViT-B/16 baseline.
  • Unlike ViT-B/16, RN50 does not improve consistently from 5-shot to 10-shot. Its behavior depends strongly on the classifier head and whether augmentation is applied.
  • The best RN50 setting is Prototype + 5-shot + no_aug with Accuracy = 0.8806 and F1 = 0.8825, slightly outperforming RN50 zero-shot.
  • For RN50, augmentation helps the Linear head in both shot settings, but it hurts the Prototype head at 5-shot. This indicates that RN50 is more sensitive to support-set quality and transformation noise than ViT-B/16.

Learning dynamics

Training curves and head comparison

10-shot few-shot summary for ViT-B/16 including training loss, validation accuracy, and head comparison

For ViT-B/16, the training-loss curves decrease smoothly and validation accuracy quickly saturates, indicating stable convergence for the 10-shot linear setting. The final comparison also shows that the Prototype head slightly outperforms the Linear head in both no_aug and aug settings.

10-shot few-shot summary for RN50 including training loss, validation accuracy, and head comparison

For RN50, optimization is also stable, but the final ranking is less consistent across settings. This supports the observation that RN50 is more sensitive to the adaptation strategy and augmentation pipeline than ViT-B/16.

Calibration analysis

Prototype is reliable; Linear is under-confident

Calibration summary for ViT-B/16 comparing Zero-shot, 10-shot Linear, and 10-shot Prototype
Reliability diagrams for ViT-B/16 comparing Zero-shot, 10-shot Linear, and 10-shot Prototype

For ViT-B/16, both Zero-shot and 10-shot aug Prototype are well calibrated, with low ECE values (around 0.0284 and 0.0232). In contrast, the 10-shot aug Linear model has a very large calibration error (ECE ≈ 0.6441), meaning that it often predicts the correct class but assigns poorly calibrated confidence scores.
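ECE can be computed as the support-weighted gap between accuracy and mean confidence over equal-width confidence bins. A minimal NumPy sketch of the standard formulation (binning details may differ slightly from the notebook):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    conf = np.asarray(conf)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Perfectly calibrated toy case: 3/4 correct at confidence 0.75, so ECE = 0.
conf = np.array([0.75, 0.75, 0.75, 0.75])
correct = np.array([1, 1, 1, 0])
ece = expected_calibration_error(conf, correct)
```

Under this definition, a model like the 10-shot Linear head that is often correct but assigns mismatched confidence accumulates a large per-bin gap, which is exactly the pattern reported above.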

Calibration summary for RN50 comparing Zero-shot, 10-shot Linear, and 10-shot Prototype
Reliability diagrams for RN50 comparing Zero-shot, 10-shot Linear, and 10-shot Prototype

The same pattern appears for RN50: Zero-shot and Prototype remain reasonably calibrated, while the Linear model has a much larger ECE. The Confidently Correct scores also show that the linear models rarely produce highly confident predictions even when they are correct.

Error analysis from confusion matrices

Hard classes are visually similar fast-food categories

Confusion matrix comparison for ViT-B/16 between Zero-shot, 10-shot Linear, and 10-shot Prototype

For ViT-B/16, the main source of error is the hot_dog class, which is often confused with visually similar categories such as hamburger, pizza, and steak. The 10-shot aug Prototype model reduces these confusions more effectively than the linear head.

Confusion matrix comparison for RN50 between Zero-shot, 10-shot Linear, and 10-shot Prototype

For RN50, few-shot adaptation improves several classes over zero-shot, but visually similar food items remain a major source of confusion. This is especially visible when augmentation is applied under low-shot conditions.

  • Classes such as ice_cream, pizza, and sushi are consistently easier, with strong diagonal values across most settings.
  • Overall, the error analysis supports the main result: Prototype-based few-shot adaptation is more robust than the Linear head, especially when class similarity is high and labeled data is limited.

Main takeaway

What the evaluation tells us

  • ViT-B/16 is the stronger and more stable backbone overall.
  • Prototype heads provide the best balance between accuracy, robustness, and calibration.
  • Linear heads can reach competitive accuracy, but their confidence estimates are unreliable and should be interpreted carefully.
  • The main remaining challenge is distinguishing fine-grained, visually similar food categories, especially under low-shot conditions.

Augmentation vs No-Augmentation Across Backbones

Since the RN50 results have already been reported above, this section focuses on how data augmentation changes performance for both ViT-B/16 and RN50 under the same few-shot settings.

Main backbone

ViT-B/16: augmentation helps only when support size is large enough

ViT-B/16 comparison of zero-shot, no-augmentation, and augmentation settings
  • For ViT-B/16 Linear, augmentation is harmful at 5-shot, where Accuracy drops from 0.8501 (no_aug) to 0.8044 (aug), and F1 drops from 0.8419 to 0.7840.
  • However, for ViT-B/16 Linear at 10-shot, augmentation becomes slightly beneficial: Accuracy improves from 0.9163 to 0.9190, and F1 improves from 0.9131 to 0.9159.
  • For ViT-B/16 Prototype, augmentation has only a small effect. At 5-shot, it slightly lowers Accuracy from 0.8837 to 0.8806, while at 10-shot it slightly improves Accuracy from 0.9207 to 0.9224.
  • This suggests that for ViT-B/16, augmentation is not universally helpful. It becomes more useful when the support set is larger, especially for the Linear head, while the Prototype head is already robust even without heavy augmentation.

Extension backbone

RN50: augmentation helps Linear, but is unstable for Prototype

RN50 comparison of zero-shot, no-augmentation, and augmentation settings
  • For RN50 Linear, augmentation is consistently beneficial. At 5-shot, Accuracy increases from 0.8039 to 0.8410, and at 10-shot, it increases from 0.8514 to 0.8784.
  • For RN50 Prototype, augmentation is harmful at 5-shot, where Accuracy drops sharply from 0.8806 to 0.7917. At 10-shot, augmentation gives only a very small gain, from 0.8540 to 0.8571.
  • This means that for RN50, augmentation interacts strongly with the classifier head. It clearly supports the Linear head, but it is unstable for the Prototype head, especially when the support set is very small.
  • Compared with ViT-B/16, RN50 shows higher sensitivity to the training setting, which indicates that the effect of augmentation depends not only on shot size but also on the backbone architecture itself.

Cross-backbone takeaway

Data augmentation is backbone- and head-dependent

  • For ViT-B/16, augmentation is most useful in the 10-shot setting and gives only small gains because the pretrained features are already strong and stable.
  • For RN50, augmentation is much more important for the Linear head, but less reliable for the Prototype head.
  • Therefore, augmentation should not be treated as a universally positive choice. Its effectiveness depends on the combination of backbone, classifier head, and support-set size.

Conclusion

  • Based on the reported accuracy and F1 scores, CLIP ViT-B/16 is the stronger backbone overall. Its best result is Prototype + 10-shot + aug with Accuracy = 0.9224 and F1 = 0.9231, which is clearly higher than the best RN50 result.
  • For ViT-B/16, the Prototype head is consistently the most reliable few-shot strategy, while the Linear head becomes competitive only in the 10-shot setting.
  • For RN50, the best result is Prototype + 5-shot + no_aug with Accuracy = 0.8806 and F1 = 0.8825, showing that RN50 can still perform well, but it is less stable than ViT-B/16 across different settings.
  • The augmentation analysis shows that data augmentation is not always beneficial. For ViT-B/16, it is mainly helpful at 10-shot; for RN50, it helps the Linear head but can hurt the Prototype head in low-shot conditions.
  • Overall, the experiments show that the best-performing configuration for this multimodal task is ViT-B/16 with Prototype-based few-shot adaptation, while the impact of augmentation depends strongly on the backbone and adaptation method.