Assignment 1 · Image

Dedicated image report page for the Stanford Dogs comparison

Image track

CNN vs ViT on Stanford Dogs.

This page contains the full image report for Assignment 1, including the required five report categories and the completed extension work.

Owner

Chu Nguyen Tuan Anh

Dataset

Stanford Dogs

Models

ResNet-50, ViT-B/16

Results

Benchmark, Calibration, Ablations

Transfer-learning report navigator

Read the image report in the same order as the transfer-learning workflow. Each pipeline node activates one major content block below.

Pipeline-first reading mode

Start from the workflow, then open the matching report section

This page still contains the full five-part image report, but the navigation is now organized by Input → Preprocessing → Backbone → Head → Output. That makes the report easier to follow for readers who want to understand the full transfer-learning process rather than jump between disconnected sections.

x → T(x) → z → g(z) → ŷ  (Input → Preprocessing → Backbone → Head → Output)

1. Problem and dataset exploration (EDA)

This stage corresponds to the raw input image x, before any preprocessing is applied. It covers Stanford Dogs as a dataset, the official split reconstruction, the exported metadata files built by the notebook, and the EDA evidence — class balance, image size, and illumination variation — that later justifies resizing, normalization, and augmentation.

Problem statement

ResNet-50 vs ViT-B/16 on Stanford Dogs

The goal is to classify each image into one of 120 dog breeds and compare two major model families under the same dataset and evaluation protocol.

  • CNN model: ResNet-50
  • Transformer model: ViT-B/16
  • Both initialized from ImageNet-pretrained weights

Dataset summary

Why Stanford Dogs fits this assignment

  • 120 classes
  • 20,580 total images
  • 12,000 official train images and 8,580 official test images
  • 10,200 train / 1,800 validation / 8,580 test after the internal stratified split
  • Backed by metadata_with_quality.csv and split_metadata.csv

Stanford Dogs is a fine-grained benchmark with enough classes and enough training samples to make the CNN-versus-Transformer comparison meaningful and non-trivial.

Raw input context

What belongs to node 1 before preprocessing starts

For this report, node x stays focused on the raw Stanford Dogs input and the metadata analysis done before transform design. The notebook reconstructs the official split, exports per-image metadata, enriches it with brightness and color statistics, and only then moves on to tensor conversion in node 2.

  • Each sample starts as a raw RGB image with its original width, height, and aspect ratio still intact
  • The notebook parses train_list.mat and test_list.mat to reconstruct the official Stanford Dogs split
  • build_metadata(...) exports path, label, class name, width, height, aspect ratio, and split labels for each sample
  • A second pass writes metadata_with_quality.csv with brightness_mean, contrast_std, saturation_mean, r_mean, g_mean, and b_mean
  • split_metadata.csv records the internal stratified split: 10,200 train, 1,800 val, and 8,580 test
  • The later tensor target is (3, 224, 224) per image, but the actual resize-to-tensor logic is intentionally deferred to node 2
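
As a concrete illustration, a minimal build_metadata(...) could look like the sketch below. The real notebook version is not shown on this page, so the function body is an assumption; the column names mirror the code and CSV previews elsewhere on this page. The demo writes a synthetic 360×300 image so the snippet runs without the dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd
from PIL import Image

def build_metadata(paths, labels, split_name):
    """Hypothetical sketch of the notebook's build_metadata(...):
    one row per image with identity, geometry, and split fields."""
    rows = []
    for path, label in zip(paths, labels):
        with Image.open(path) as img:
            width, height = img.size
        rows.append({
            "image_path": str(path),
            "image_id": Path(path).stem,
            "label": int(label),
            "width": width,
            "height": height,
            "aspect_ratio": width / height,
            "official_split": split_name,
        })
    return pd.DataFrame(rows)

# Demo on a synthetic 360x300 image (same geometry as the CSV preview's first row).
tmp = Path(tempfile.mkdtemp())
Image.new("RGB", (360, 300)).save(tmp / "n02085620_0001.jpg")
meta = build_metadata([tmp / "n02085620_0001.jpg"], [0], "official_train")
print(meta.loc[0, "aspect_ratio"])  # 1.2
```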

Official split files

What train_list.mat and test_list.mat actually are

These two files are not image folders. They are MATLAB annotation files shipped with Stanford Dogs and used to define the original benchmark partition. Each file stores sample-level metadata such as relative image paths and integer breed labels, so the notebook can recover the official train/test split instead of inventing a new one from scratch.

  • train_list.mat: official training-image entries and labels
  • test_list.mat: official test-image entries and labels
  • The notebook reads them with the Stanford Dogs root as context, then resolves each relative path into a real image under Images/
  • Only after that reconstruction step does the notebook create the internal train and val split from official_train

Dataset layout

How the Stanford Dogs dataset is organized on disk

The notebook downloads three archives: image files, breed annotations, and the official split lists. In practice, the code resolves TRAIN_LIST_MAT and TEST_LIST_MAT either from the dataset root or from the extracted lists/ folder.

dataset tree
stanford_dogs/
├── images.tar
├── annotation.tar
├── lists.tar
├── Images/
│   ├── n02085620-Chihuahua/
│   │   ├── n02085620_10074.jpg
│   │   └── ...
│   ├── n02085782-Japanese_spaniel/
│   └── ...
├── Annotation/
│   ├── n02085620-Chihuahua/
│   │   ├── n02085620_10074
│   │   └── ...
│   └── ...
├── lists/
│   ├── train_list.mat
│   └── test_list.mat
├── train_list.mat   # sometimes resolved from the root
└── test_list.mat    # sometimes resolved from the root
python
# Reconstruct the official split, then carve a stratified val set out of it.
train_meta_full = build_metadata(train_rel_paths, train_labels, "official_train")
test_meta = build_metadata(test_rel_paths, test_labels, "official_test")
meta = pd.concat([train_meta_full, test_meta], ignore_index=True)

# VAL_FROM_TRAIN_RATIO and SEED are notebook-level constants
# (1,800 / 12,000 = 0.15 given the reported split sizes).
splitter = StratifiedShuffleSplit(
    n_splits=1,
    test_size=VAL_FROM_TRAIN_RATIO,
    random_state=SEED,
)
train_idx, val_idx = next(splitter.split(train_meta_full, train_meta_full["label"]))
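
The indices returned by the splitter become the final split column recorded in split_metadata.csv. A self-contained toy version of that assignment step (synthetic labels and an illustrative ratio, not the notebook's actual data):

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for train_meta_full: 10 samples over 2 breeds.
train_meta_full = pd.DataFrame({
    "image_id": [f"img_{i:02d}" for i in range(10)],
    "label": [i % 2 for i in range(10)],
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(train_meta_full, train_meta_full["label"]))

# Record the final split the way split_metadata.csv does.
# (.loc works with the positional val_idx because the index is a default RangeIndex.)
train_meta_full["split"] = "train"
train_meta_full.loc[val_idx, "split"] = "val"
print(train_meta_full["split"].value_counts().to_dict())  # {'train': 8, 'val': 2}
```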
python
import numpy as np
import pandas as pd
from PIL import Image
from matplotlib.colors import rgb_to_hsv
from tqdm import tqdm

quality_rows = []
for image_path in tqdm(meta["image_path"], desc="Computing image-quality metadata"):
    with Image.open(image_path) as raw:
        image = raw.convert("RGB")  # convert() returns a copy, so the file handle closes here
    rgb_arr = np.asarray(image, dtype=np.float32) / 255.0
    gray_arr = np.asarray(image.convert("L"), dtype=np.float32) / 255.0
    hsv_arr = rgb_to_hsv(rgb_arr)
    quality_rows.append({
        "brightness_mean": float(gray_arr.mean()),
        "contrast_std": float(gray_arr.std()),
        "saturation_mean": float(hsv_arr[..., 1].mean()),
        "r_mean": float(rgb_arr[..., 0].mean()),
        "g_mean": float(rgb_arr[..., 1].mean()),
        "b_mean": float(rgb_arr[..., 2].mean()),
    })

quality_df = pd.DataFrame(quality_rows)
meta = pd.concat([meta.reset_index(drop=True), quality_df], axis=1)
meta.to_csv(EDA_ARTIFACT_ROOT / "metadata_with_quality.csv", index=False)

Artifact references used in this panel come directly from the notebook export: metadata_with_quality.csv, split_metadata.csv, quality_distributions.png, dataset_distributions.png, image_size_distribution.png, rgb_channel_summary.png, random_class_samples.png, darkest_examples.png, and brightest_examples.png.

CSV artifacts

What the two exported metadata files actually represent

Artifact 1

metadata_with_quality.csv

This is the master per-image metadata table for the whole Stanford Dogs dataset. It starts from the reconstructed official split, then appends geometry and image-quality statistics so the EDA can be reproduced from a single CSV.

  • One row per image across all 20,580 samples
  • Includes identity fields such as image_id, class_name, and official_split
  • Includes geometry fields such as width, height, and aspect_ratio
  • Includes quality/color fields such as brightness_mean, contrast_std, saturation_mean, r_mean, g_mean, and b_mean

Web source: metadata_with_quality.csv

Artifact 2

split_metadata.csv

This is the split-assignment table used after the internal stratified split. It keeps the same image identity and geometry fields, but its extra job is to mark whether each sample belongs to the final train, val, or test partition used by the loaders.

  • One row per image in the final benchmark split
  • Preserves official_split so the original Stanford Dogs partition is still traceable
  • Adds the final split column used by build_loaders(...)
  • Acts as the bridge between EDA metadata and actual training/validation/test data processing

Web source: split_metadata.csv
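
How build_loaders(...) consumes this table is not shown on this page; one plausible sketch is a dataset that filters split_metadata.csv by its split column. The class name, column handling, and returned items below are assumptions, and image decoding and transforms are omitted to keep the focus on split filtering.

```python
import pandas as pd
from torch.utils.data import Dataset

class SplitMetadataDataset(Dataset):
    """Hypothetical dataset built on split_metadata.csv: keeps only the rows
    whose split column matches the requested partition."""
    def __init__(self, meta: pd.DataFrame, split: str):
        self.rows = meta[meta["split"] == split].reset_index(drop=True)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        row = self.rows.iloc[i]
        # A real loader would decode row["image_path"] here and apply transforms.
        return row["image_id"], int(row["label"])

# Toy demo: three rows, two splits.
meta = pd.DataFrame({
    "image_id": ["a", "b", "c"],
    "label": [0, 1, 0],
    "split": ["train", "train", "val"],
})
train_ds = SplitMetadataDataset(meta, "train")
print(len(train_ds), train_ds[0])  # 2 ('a', 0)
```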

Preview of metadata_with_quality.csv

image_id        class_name  official_split  width  height  aspect_ratio  brightness_mean  contrast_std  saturation_mean
n02085620_5927  Chihuahua   official_train    360     300         1.200            0.348         0.221            0.445
n02085620_4441  Chihuahua   official_train    375     500         0.750            0.538         0.099            0.223
n02085620_1502  Chihuahua   official_train    500     333         1.502            0.543         0.215            0.267

Preview of split_metadata.csv

image_id        class_name                  official_split  split  width  height  aspect_ratio
n02107142_4013  Doberman                    official_train  train    236     350         0.674
n02091244_5818  Ibizan hound                official_train  train    375     500         0.750
n02107574_2912  Greater Swiss Mountain dog  official_train  train    500     334         1.497

Final split counts from split_metadata.csv

split  count
train  10,200
val     1,800
test    8,580

EDA highlights

Quality variation, split structure, and random breed samples

The exported EDA confirms that Stanford Dogs is visually diverse in brightness, contrast, saturation, image size, aspect ratio, and scene composition. The notebook artifacts also show that the class distribution is only moderately imbalanced, which makes a stratified train/validation split practical without distorting the benchmark too much.

  • metadata_with_quality.csv: brightness mean 0.4522, contrast std mean 0.2261, saturation mean 0.3047
  • metadata_with_quality.csv: RGB means are R = 0.4761, G = 0.4518, B = 0.3910
  • metadata_with_quality.csv: average width 442.5 px, average height 385.9 px, average aspect ratio 1.19
  • metadata_with_quality.csv: breed counts range from 148 to 252 images, with a mean of 171.5
  • split_metadata.csv: internal split is 10,200 train / 1,800 val / 8,580 test
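
The headline numbers above can be recomputed from metadata_with_quality.csv with a few pandas calls. The toy frame below stands in for the exported CSV so the snippet runs on its own; in the notebook the first line would be a read_csv on the real artifact.

```python
import pandas as pd

# In the notebook: meta = pd.read_csv("metadata_with_quality.csv")
# Toy rows here, taken from the CSV preview, so the snippet is self-contained.
meta = pd.DataFrame({
    "class_name": ["Chihuahua", "Chihuahua", "Japanese_spaniel"],
    "brightness_mean": [0.348, 0.538, 0.543],
    "width": [360, 375, 500],
    "height": [300, 500, 333],
})

# Dataset-level means (brightness, geometry) and per-breed counts.
stats = meta[["brightness_mean", "width", "height"]].mean()
breed_counts = meta["class_name"].value_counts()
print(breed_counts.min(), breed_counts.max())  # 1 2
```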
Stanford Dogs quality distributions

Brightness, contrast, and saturation distributions confirm a broad range of lighting and color conditions.

Stanford Dogs split and breed distributions

The split and breed-distribution export shows a healthy sample count for both official partitions, while the breed histogram remains only moderately imbalanced.

Stanford Dogs image size and aspect ratio distributions

Width, height, and aspect-ratio distributions show that the dataset contains varied image geometries before resizing to the common model input size.

Stanford Dogs RGB channel summary

The RGB summary complements the quality CSV by showing that the dataset is not perfectly color-balanced, which helps justify later channel-wise normalization.

Stanford Dogs random breed samples

Random breed samples visualize the raw input domain before any preprocessing: different poses, backgrounds, framing, and lighting conditions appear immediately.

Extreme brightness examples

Darkest and brightest samples in the dataset

To complement the summary statistics, we also inspected the darkest and brightest images in the dataset. This helps verify that the measured brightness range corresponds to meaningful visual variation rather than only small numerical differences.

The darkest examples are mostly low-light indoor images, shadow-heavy scenes, or photos with strong exposure limitations. In several cases, the dog occupies only part of the frame or blends into a dark background, which makes recognition harder even for humans. These samples motivate brightness-robust preprocessing and moderate augmentation.

Darkest Stanford Dogs examples

The darkest images in Stanford Dogs illustrate low-light scenes, dark backgrounds, and reduced visibility of breed-specific features.

Brightest Stanford Dogs examples

The brightest images often contain white backgrounds or bright fur, showing the opposite end of the illumination range.

Together, these two galleries confirm that Stanford Dogs includes substantial illumination diversity. This is the last stop of the report before node 2 begins the actual data processing pipeline that standardizes geometry and channel scaling.