Image track

CNN vs ViT on Stanford Dogs.

This page contains the full image report for Assignment 1, including the required five report categories and the completed extension work.


Owner

Chu Nguyen Tuan Anh

Dataset

Stanford Dogs

Models

ResNet-50, ViT-B/16

Results

Benchmark, Calibration, Ablations

Transfer-learning report navigator

Read the image report in the same order as the transfer-learning workflow. Each pipeline node activates one major content block below.

Pipeline-first reading mode

Start from the workflow, then open the matching report section

This page still contains the full five-part image report, but the navigation is now organized by Input → Preprocessing → Backbone → Head → Output. That makes the report easier to follow for readers who want to understand the full transfer-learning process rather than jump between disconnected sections.

x → T(x) → z → g(z) → ŷ

Current stage

1. Problem and dataset exploration (EDA)

This stage corresponds to the raw input image x. It focuses on Stanford Dogs as a dataset, the official split reconstruction, metadata building, and EDA findings such as class balance, image size, and illumination variation.

1. Problem and dataset exploration (EDA)

This stage stays on the raw input side of the workflow: Stanford Dogs images before preprocessing, the exported metadata files reconstructed by the notebook, and the EDA evidence that later justifies resizing, normalization, and augmentation.

Problem statement

ResNet-50 vs ViT-B/16 on Stanford Dogs

The goal is to classify each image into one of 120 dog breeds and compare two major model families under the same dataset and evaluation protocol.

CNN model: ResNet-50
Transformer model: ViT-B/16
Both initialized from ImageNet-pretrained weights

Dataset summary

Why Stanford Dogs fits this assignment

120 classes
20,580 total images
12,000 official train images and 8,580 official test images
10,200 train / 1,800 validation / 8,580 test after the internal stratified split
Backed by metadata_with_quality.csv and split_metadata.csv

Stanford Dogs is a fine-grained benchmark with enough classes and enough training samples to make the CNN-versus-Transformer comparison meaningful and non-trivial.

Raw input context

What belongs to node 1 before preprocessing starts

For this report, node x stays focused on the raw Stanford Dogs input and the metadata analysis done before transform design. The notebook reconstructs the official split, exports per-image metadata, enriches it with brightness and color statistics, and only then moves on to tensor conversion in node 2.

Each sample starts as a raw RGB image with its original width, height, and aspect ratio still intact
The notebook parses train_list.mat and test_list.mat to reconstruct the official Stanford Dogs split
build_metadata(...) exports path, label, class name, width, height, aspect ratio, and split labels for each sample
A second pass writes metadata_with_quality.csv with brightness_mean, contrast_std, saturation_mean, r_mean, g_mean, and b_mean
split_metadata.csv records the internal stratified split: 10,200 train, 1,800 val, and 8,580 test
The later tensor target is (3, 224, 224) per image, but the actual resize-to-tensor logic is intentionally deferred to node 2

Official split files

What `train_list.mat` and `test_list.mat` actually are

These two files are not image folders. They are MATLAB annotation files shipped with Stanford Dogs and used to define the original benchmark partition. Each file stores sample-level metadata such as relative image paths and integer breed labels, so the notebook can recover the official train/test split instead of inventing a new one from scratch.

train_list.mat: official training-image entries and labels
test_list.mat: official test-image entries and labels
The notebook reads them with the Stanford Dogs root as context, then resolves each relative path into a real image under Images/
Only after that reconstruction step does the notebook create the internal train and val split from official_train
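The path-resolution step described above can be sketched as follows. This is a minimal illustration, not the notebook's code: describe_entry is a hypothetical helper, the class-name cleanup (underscores to spaces) is inferred from the CSV previews on this page, and the shift from 1-based to 0-based labels is an assumption about how the .mat labels are stored.

```python
from pathlib import Path, PurePosixPath


def describe_entry(rel_path: str, label_1based: int, images_root: Path) -> dict:
    """Turn one list entry into a metadata row (hypothetical helper)."""
    rel = PurePosixPath(rel_path)
    folder = rel.parent.name                    # e.g. "n02085620-Chihuahua"
    wnid, _, raw_name = folder.partition("-")   # split only at the first hyphen
    return {
        "image_path": images_root.joinpath(*rel.parts),
        "image_id": rel.stem,
        "wnid": wnid,
        "class_name": raw_name.replace("_", " "),
        "label": label_1based - 1,              # assumption: file labels run 1..120
    }


row = describe_entry(
    "n02085620-Chihuahua/n02085620_10074.jpg", 1, Path("stanford_dogs/Images")
)
print(row["image_id"], row["class_name"], row["label"])
```

Splitting at the first hyphen only keeps breed names that themselves contain a hyphen intact.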

Dataset layout

How the Stanford Dogs dataset is organized on disk

The notebook downloads three archives: image files, breed annotations, and the official split lists. In practice, the code resolves TRAIN_LIST_MAT and TEST_LIST_MAT either from the dataset root or from the extracted lists/ folder.

dataset tree

stanford_dogs/
├── images.tar
├── annotation.tar
├── lists.tar
├── Images/
│   ├── n02085620-Chihuahua/
│   │   ├── n02085620_10074.jpg
│   │   └── ...
│   ├── n02085782-Japanese_spaniel/
│   └── ...
├── Annotation/
│   ├── n02085620-Chihuahua/
│   │   ├── n02085620_10074
│   │   └── ...
│   └── ...
├── lists/
│   ├── train_list.mat
│   └── test_list.mat
├── train_list.mat   # sometimes resolved from the root
└── test_list.mat    # sometimes resolved from the root
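The two possible locations of the list files can be handled with a small lookup. The sketch below is an assumption about how such a lookup might work: resolve_list_file is a hypothetical name, and checking the root before lists/ is an arbitrary order chosen for illustration.

```python
import tempfile
from pathlib import Path


def resolve_list_file(dataset_root: Path, filename: str) -> Path:
    """Return the first existing copy of a split list (hypothetical helper)."""
    candidates = (dataset_root / filename, dataset_root / "lists" / filename)
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{filename} not found under {dataset_root}")


# Demo on a throwaway directory that only contains lists/train_list.mat.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lists").mkdir()
    (root / "lists" / "train_list.mat").write_bytes(b"")
    found_relative = resolve_list_file(root, "train_list.mat").relative_to(root).as_posix()

print(found_relative)  # lists/train_list.mat
```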

python

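import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Rebuild the official partition, then stack both parts into one metadata frame.
train_meta_full = build_metadata(train_rel_paths, train_labels, "official_train")
test_meta = build_metadata(test_rel_paths, test_labels, "official_test")
meta = pd.concat([train_meta_full, test_meta], ignore_index=True)

# Carve the validation set out of official_train only, stratified by breed label.
splitter = StratifiedShuffleSplit(
    n_splits=1,
    test_size=VAL_FROM_TRAIN_RATIO,
    random_state=SEED,
)
train_idx, val_idx = next(splitter.split(train_meta_full, train_meta_full["label"]))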

python

import numpy as np
import pandas as pd
from matplotlib.colors import rgb_to_hsv  # assumed source of rgb_to_hsv
from PIL import Image
from tqdm import tqdm  # assumed import; a notebook may use tqdm.auto instead

quality_rows = []
for image_path in tqdm(meta["image_path"], desc="Computing image-quality metadata"):
    with Image.open(image_path).convert("RGB") as image:
        # Scale pixels to [0, 1] so all statistics share one range.
        rgb_arr = np.asarray(image, dtype=np.float32) / 255.0
        gray_arr = np.asarray(image.convert("L"), dtype=np.float32) / 255.0
        hsv_arr = rgb_to_hsv(rgb_arr)
        quality_rows.append({
            "brightness_mean": float(gray_arr.mean()),
            "contrast_std": float(gray_arr.std()),
            "saturation_mean": float(hsv_arr[..., 1].mean()),
            "r_mean": float(rgb_arr[..., 0].mean()),
            "g_mean": float(rgb_arr[..., 1].mean()),
            "b_mean": float(rgb_arr[..., 2].mean()),
        })

# Row order matches meta, so the two frames can be joined column-wise.
quality_df = pd.DataFrame(quality_rows)
meta = pd.concat([meta.reset_index(drop=True), quality_df], axis=1)
meta.to_csv(EDA_ARTIFACT_ROOT / "metadata_with_quality.csv", index=False)

Artifact references used in this panel come directly from the notebook export: metadata_with_quality.csv, split_metadata.csv, quality_distributions.png, dataset_distributions.png, image_size_distribution.png, rgb_channel_summary.png, random_class_samples.png, darkest_examples.png, and brightest_examples.png.

CSV artifacts

What the two exported metadata files actually represent

Artifact 1

`metadata_with_quality.csv`

This is the master per-image metadata table for the whole Stanford Dogs dataset. It starts from the reconstructed official split, then appends geometry and image-quality statistics so the EDA can be reproduced from a single CSV.

One row per image across all 20,580 samples
Includes identity fields such as image_id, class_name, and official_split
Includes geometry fields such as width, height, and aspect_ratio
Includes quality/color fields such as brightness_mean, contrast_std, saturation_mean, r_mean, g_mean, and b_mean

Web source: metadata_with_quality.csv

Artifact 2

`split_metadata.csv`

This is the split-assignment table used after the internal stratified split. It keeps the same image identity and geometry fields, but its extra job is to mark whether each sample belongs to the final train, val, or test partition used by the loaders.

One row per image in the final benchmark split
Preserves official_split so the original Stanford Dogs partition is still traceable
Adds the final split column used by build_loaders(...)
Acts as the bridge between EDA metadata and actual training/validation/test data processing

Web source: split_metadata.csv

Preview of `metadata_with_quality.csv`
image_id	class_name	official_split	width	height	aspect_ratio	brightness_mean	contrast_std	saturation_mean
`n02085620_5927`	Chihuahua	`official_train`	360	300	1.200	0.348	0.221	0.445
`n02085620_4441`	Chihuahua	`official_train`	375	500	0.750	0.538	0.099	0.223
`n02085620_1502`	Chihuahua	`official_train`	500	333	1.502	0.543	0.215	0.267

Preview of `split_metadata.csv`
image_id	class_name	official_split	split	width	height	aspect_ratio
`n02107142_4013`	Doberman	`official_train`	`train`	236	350	0.674
`n02091244_5818`	Ibizan hound	`official_train`	`train`	375	500	0.750
`n02107574_2912`	Greater Swiss Mountain dog	`official_train`	`train`	500	334	1.497

Final split counts from `split_metadata.csv`
split	count
`train`	10,200
`val`	1,800
`test`	8,580
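The split counts above follow from simple arithmetic on the official partition. The sketch below reproduces them under one assumption: the value 0.15 for the validation ratio is inferred from 1,800 / 12,000 and is not quoted from the notebook's VAL_FROM_TRAIN_RATIO setting.

```python
OFFICIAL_TRAIN = 12_000
OFFICIAL_TEST = 8_580
VAL_RATIO = 0.15  # assumption: inferred from 1,800 / 12,000

val_count = round(OFFICIAL_TRAIN * VAL_RATIO)
train_count = OFFICIAL_TRAIN - val_count
split_counts = {"train": train_count, "val": val_count, "test": OFFICIAL_TEST}

print(split_counts)                # {'train': 10200, 'val': 1800, 'test': 8580}
print(sum(split_counts.values()))  # 20580
```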

EDA highlights

Quality variation, split structure, and random breed samples

The exported EDA confirms that Stanford Dogs is visually diverse in brightness, contrast, saturation, image size, aspect ratio, and scene composition. The notebook artifacts also show that the class distribution is only moderately imbalanced (148 to 252 images per breed), so a stratified train/validation split can keep every breed proportionally represented in both internal partitions.

metadata_with_quality.csv: brightness mean 0.4522, contrast std mean 0.2261, saturation mean 0.3047
metadata_with_quality.csv: RGB means are R = 0.4761, G = 0.4518, B = 0.3910
metadata_with_quality.csv: average width 442.5 px, average height 385.9 px, average aspect ratio 1.19
metadata_with_quality.csv: breed counts range from 148 to 252 images, with a mean of 171.5
split_metadata.csv: internal split is 10,200 train / 1,800 val / 8,580 test
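As a quick check on the "moderately imbalanced" reading, the per-breed figures above can be recomputed from the totals. This is only arithmetic on numbers already listed on this page.

```python
total_images = 20_580
num_classes = 120
min_count, max_count = 148, 252

mean_per_class = total_images / num_classes
imbalance_ratio = max_count / min_count

print(f"mean images per breed: {mean_per_class:.1f}")      # 171.5
print(f"largest / smallest breed: {imbalance_ratio:.2f}")  # 1.70
```

A max-to-min ratio below 2 means no breed is starved of examples, which is consistent with using plain stratification rather than resampling.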

Brightness, contrast, and saturation distributions confirm a broad range of lighting and color conditions.

Stanford Dogs split and breed distributions

The split and breed-distribution export shows the size of both official partitions (12,000 train and 8,580 test images), while the breed histogram remains only moderately imbalanced.

Stanford Dogs image size and aspect ratio distributions

Width, height, and aspect-ratio distributions show that the dataset contains varied image geometries before resizing to the common model input size.

The RGB summary complements the quality CSV by showing that the dataset is not perfectly color-balanced, which helps justify later channel-wise normalization.

Random breed samples visualize the raw input domain before any preprocessing: different poses, backgrounds, framing, and lighting conditions appear immediately.

Extreme brightness examples

Darkest and brightest samples in the dataset

To complement the summary statistics, we also inspected the darkest and brightest images in the dataset. This helps verify that the measured brightness range corresponds to meaningful visual variation rather than only small numerical differences.

The darkest examples are mostly low-light indoor images, shadow-heavy scenes, or underexposed photos. In several cases, the dog occupies only part of the frame or blends into a dark background, which makes recognition harder even for humans. These samples motivate brightness-robust preprocessing and moderate augmentation.

The darkest images in Stanford Dogs illustrate low-light scenes, dark backgrounds, and reduced visibility of breed-specific features.

The brightest images often contain white backgrounds or bright fur, showing the opposite end of the illumination range.

Together, these two galleries confirm that Stanford Dogs includes substantial illumination diversity. This is the last stop of the report before node 2 begins the actual data processing pipeline that standardizes geometry and channel scaling.

2. Data processing and preprocessing pipeline

This stage expands T(x) into the actual data-processing workflow used by the notebook: resizing, model-specific augmentation, normalization, and the Dataset/DataLoader handoff that produces fair 224x224 mini-batches for both ResNet-50 and ViT-B/16.

Preprocessing pipeline overview

Resizing, standardization, and the path to `(N, C, H, W)`

Pipeline overview

The preprocessing pipeline has three core objectives: make image sizes consistent, create useful training variation through augmentation, and normalize tensors so pretrained models receive inputs in a familiar scale.

In this notebook, the exported W&B run summary shows the same input contract for all four benchmark runs: image_size = 224 and batch_size = 32. That shared contract is what keeps the later ResNet-50 versus ViT-B/16 comparison fair at the data-processing level.

Image resizing and standardization

The standard evaluation flow is Resize((256, 256)) → CenterCrop(224) → ToTensor(). This ensures that all images can be stacked into one batch tensor even though the original Stanford Dogs images have different sizes and aspect ratios.

The final tensor shape per image is (3, 224, 224), and with the benchmark setting BATCH_SIZE = 32 a typical mini-batch becomes (32, 3, 224, 224).

Result tensor shape

In PyTorch, image batches are stored in channel-first format (N, C, H, W), where N is batch size, C is channel count, H is height, and W is width.

This layout is the standardized input representation that both backbones receive after preprocessing finishes.
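To make the (N, C, H, W) contract concrete, the sketch below counts the elements in one benchmark mini-batch and its size in memory. It assumes float32 tensors (4 bytes per element), which is the usual ToTensor() output type.

```python
from math import prod

batch_shape = (32, 3, 224, 224)   # (N, C, H, W)
per_image = prod(batch_shape[1:])
elements = prod(batch_shape)
size_mib = elements * 4 / 2**20   # assumption: float32, 4 bytes per element

print(per_image)  # 150528
print(elements)   # 4816896
print(size_mib)   # 18.375
```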

Complete resize-to-tensor flow

The notebook standardizes geometry before normalization: first resize, then crop, then convert to tensor, and finally normalize. This order is important because normalization should be applied after the pixel data is already in tensor form.

The same deterministic evaluation path is reused for validation and test so the reported metrics stay stable.

Batching and loader handoff

After preprocessing is defined, the Dataset and DataLoader wrap these per-image transforms into mini-batches. The notebook uses BATCH_SIZE = 32, which is the same value recorded for every benchmark run in wandb_export/runs_summary.csv.

The notebook output prints ResNet loaders: 319 57 269 and ViT loaders: 319 57 269, corresponding to train, validation, and test batches under the shared split.
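These loader lengths follow from the split sizes and the batch size. With drop_last left at False, the number of batches is the ceiling of the sample count divided by the batch size, and the last batch is simply smaller.

```python
from math import ceil

BATCH_SIZE = 32
split_sizes = {"train": 10_200, "val": 1_800, "test": 8_580}

# Number of batches per split when the final partial batch is kept.
loader_lengths = {name: ceil(n / BATCH_SIZE) for name, n in split_sizes.items()}
# Size of that final batch (a full batch if the split divides evenly).
last_batch = {name: n % BATCH_SIZE or BATCH_SIZE for name, n in split_sizes.items()}

print(loader_lengths)  # {'train': 319, 'val': 57, 'test': 269}
print(last_batch)      # {'train': 24, 'val': 8, 'test': 4}
```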

Overall objective: convert raw Stanford Dogs images into a fixed tensor representation that supports batching, augmentation, and a fair four-run benchmark.

Artifact references used in this panel come from notebook outputs plus exports in wandb_export, especially runs_summary.csv and augmented_batch_preview.png.

Data augmentation

Model-specific augmentation for the training path

Increase effective data diversity
Improve generalization
Reduce overfitting
Simulate realistic pose, framing, and illumination changes

Train: RandomResizedCrop, horizontal flip, rotation

These three operations make the CNN branch see the same breed under slightly different framing and orientation. RandomResizedCrop(scale=(0.72, 1.0)) changes how tightly the dog is framed, horizontal flip adds left-right variation, and RandomRotation(15) makes the model less sensitive to small camera tilt.

This is especially useful for dog photos because pose, camera angle, and subject placement vary a lot in the dataset.

Train: ColorJitter, Normalize, RandomErasing

ColorJitter(brightness=0.15, contrast=0.15, saturation=0.15) simulates lighting variation, while Normalize(IMAGENET_MEAN, IMAGENET_STD) aligns the input distribution with ImageNet-pretrained weights.

RandomErasing(p=0.10) acts like a mild occlusion simulation: part of the image is masked so the model learns not to depend on only one local patch such as one ear, one eye, or a small fur texture region.

Val/Test: Resize, CenterCrop, ToTensor, Normalize

Validation and test preprocessing is deterministic. We do not use random augmentation there because evaluation should measure the model itself, not randomness from the input pipeline.

Resize + CenterCrop + ToTensor + Normalize creates a stable and repeatable input path for every evaluation run.

The ResNet-50 branch uses the stronger augmentation policy, but it is still moderate enough to preserve breed identity for a fine-grained dataset.

Train vs validation

A milder ViT training policy with the same deterministic eval path

Train: Milder crop, horizontal flip, color jitter

The ViT training pipeline still uses crop and flip, but the crop range is narrower: RandomResizedCrop(scale=(0.78, 1.0)). This means the model sees less aggressive spatial distortion than the ResNet setup.

The reason is practical: Stanford Dogs is a fine-grained task, so overly strong augmentation can destroy subtle breed cues such as ear shape, muzzle geometry, or fur structure that the Transformer needs to separate similar classes.

Train: Normalize, RandomErasing

Just like ResNet, ViT uses ImageNet normalization so the pretrained backbone receives inputs in a familiar scale. The erasing step is also milder at p=0.08, because the augmentation policy is intentionally less aggressive overall.

In other words, the ViT pipeline still regularizes training, but it does so with more restraint to avoid hurting fine-grained recognition.

Val/Test: Resize, CenterCrop, ToTensor, Normalize

The evaluation transform is shared with the CNN setup. This is important because model comparison is more meaningful when both backbones are tested under the same deterministic input pipeline.

Using the same validation/test preprocessing ensures that any performance gap is mainly due to the model family and training strategy, not different evaluation transforms.

The ViT notebook keeps augmentation lighter so that fine-grained breed cues are preserved while still improving generalization.
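The difference between the two training policies can be put into numbers. The sketch below is a rough comparison under stated simplifications: it treats the smallest crop as a square (RandomResizedCrop also samples an aspect ratio), it uses the TorchVision convention that a jitter strength b draws a factor from [1 - b, 1 + b], and the summarize helper and policy dictionary are illustrative names, not notebook code.

```python
from math import sqrt

RESIZED_SIDE = 256   # side length after Resize((256, 256))
BATCH_SIZE = 32

policies = {
    "ResNet-50": {"crop_scale_min": 0.72, "jitter": 0.15, "erase_p": 0.10},
    "ViT-B/16": {"crop_scale_min": 0.78, "jitter": 0.10, "erase_p": 0.08},
}


def summarize(policy: dict) -> dict:
    """Summarize how strong one augmentation policy is (illustrative helper)."""
    return {
        # Side of the smallest crop if it were square, before resizing to 224.
        "min_crop_side": sqrt(policy["crop_scale_min"]) * RESIZED_SIDE,
        # Range of the multiplicative brightness/contrast/saturation factor.
        "jitter_range": (1 - policy["jitter"], 1 + policy["jitter"]),
        # Expected number of images per batch that receive an erased patch.
        "expected_erased": policy["erase_p"] * BATCH_SIZE,
    }


summary = {name: summarize(policy) for name, policy in policies.items()}
for name, stats in summary.items():
    low, high = stats["jitter_range"]
    print(
        f"{name}: smallest crop ~{stats['min_crop_side']:.1f}px, "
        f"jitter factor {low:.2f}-{high:.2f}, "
        f"~{stats['expected_erased']:.1f} erased images per batch"
    )
```

Under these assumptions the ViT policy never crops tighter than roughly 226 px of the 256 px image, while the ResNet policy can go down to roughly 217 px.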

Data normalization

Normalize tensors for stable training and pretrained-model compatibility

Normalization shifts and rescales each RGB channel (subtract the channel mean, then divide by the channel standard deviation) so optimization becomes more stable and inputs better match the distribution used during model pretraining. In this notebook, both backbones use ImageNet statistics, even though the exported Stanford Dogs quality CSV shows slightly different dataset-specific RGB means.

Normalization goals

Why this step matters

Stabilize training
Help optimization converge faster
Standardize feature scales across channels
Match pretrained model expectations

Reference notes

ImageNet statistics used in the notebook

python

from torchvision import transforms

# normalization formula
# x_norm = (x - mean) / std

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

normalize = transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD)

Notebook choice: ImageNet mean/std is used directly for both ResNet-50 and ViT-B/16.
Dataset evidence: metadata_with_quality.csv reports dataset RGB means of approximately (0.4761, 0.4518, 0.3910).
Practical reason: the pretrained checkpoint expectation matters more here than matching the raw dataset exactly.
Extension idea: custom mean/std estimation is still useful when training from scratch or switching to a different checkpoint family.
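To see how small the mismatch between dataset and ImageNet statistics is, the normalization formula can be applied to the dataset's own mean color. The sketch below only uses the numbers quoted in the notes above.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
DATASET_MEAN = (0.4761, 0.4518, 0.3910)  # from metadata_with_quality.csv

# x_norm = (x - mean) / std, applied to the average Stanford Dogs pixel.
normalized_mean = [
    (value - mean) / std
    for value, mean, std in zip(DATASET_MEAN, IMAGENET_MEAN, IMAGENET_STD)
]

for channel, value in zip("RGB", normalized_mean):
    print(f"{channel}: {value:+.3f}")
```

Each channel lands within about 0.07 standard deviations of zero, so ImageNet normalization leaves Stanford Dogs inputs close to centered.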

python

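# reference pattern for custom mean/std estimation
# note: train_loader here must yield ToTensor() outputs in [0, 1] without Normalize
# and without random augmentation, otherwise the statistics describe transformed data
import torch

channel_sum = torch.zeros(3)
channel_sq_sum = torch.zeros(3)
pixel_count = 0

for images, _ in train_loader:
    # Reduce over batch, height, and width so one value per channel remains.
    channel_sum += images.sum(dim=(0, 2, 3))
    channel_sq_sum += (images ** 2).sum(dim=(0, 2, 3))
    pixel_count += images.size(0) * images.size(2) * images.size(3)

mean = channel_sum / pixel_count
# Var[x] = E[x^2] - E[x]^2; clamp guards against tiny negative values from rounding.
std = (channel_sq_sum / pixel_count - mean ** 2).clamp_min(0).sqrt()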

Complete augmentation pipeline

Train and evaluation transforms used in the notebook

Shared evaluation transform

Validation and test preprocessing

Resize((256, 256))
CenterCrop(224)
ToTensor()
Normalize(IMAGENET_MEAN, IMAGENET_STD)
The same eval transform is reused for both models to keep comparison fair.

ResNet-50 train transform

Moderate augmentation for CNN robustness

Resize((256, 256))
RandomResizedCrop(224, scale=(0.72, 1.0))
RandomHorizontalFlip(0.5)
RandomRotation(15)
ColorJitter(brightness=0.15, contrast=0.15, saturation=0.15)
ToTensor()
Normalize(IMAGENET_MEAN, IMAGENET_STD)
RandomErasing(p=0.10), applied after normalization

ViT-B/16 train transform

Slightly lighter augmentation for fine-grained breed cues

Resize((256, 256))
RandomResizedCrop(224, scale=(0.78, 1.0))
RandomHorizontalFlip(0.5)
ColorJitter(brightness=0.10, contrast=0.10, saturation=0.10)
ToTensor()
Normalize(IMAGENET_MEAN, IMAGENET_STD)
RandomErasing(p=0.08), applied after normalization
No extra rotation is used here because the ViT setup is intentionally milder.

python

resnet_train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomResizedCrop(IMAGE_SIZE, scale=(0.72, 1.0)),
    transforms.RandomHorizontalFlip(0.5),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.15),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
    transforms.RandomErasing(p=0.10),
])

vit_train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomResizedCrop(IMAGE_SIZE, scale=(0.78, 1.0)),
    transforms.RandomHorizontalFlip(0.5),
    transforms.ColorJitter(brightness=0.10, contrast=0.10, saturation=0.10),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
    transforms.RandomErasing(p=0.08),
])

common_eval_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(IMAGE_SIZE),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

Important note: augmentation belongs only to the training path. Validation and test remain deterministic so evaluation measures the model rather than randomness from the preprocessing pipeline.

Custom dataset and best practices

Dataset wrapper, DataLoader settings, and implementation tips

After resize, augmentation, and normalization are defined, the notebook wraps the metadata frame in a custom Dataset. Each image is loaded lazily from disk, converted to RGB, transformed on the fly, and then batched by the DataLoader. This is where raw image files finally become consistent mini-batches for both model families.

BATCH_SIZE = 32
NUM_WORKERS = 0 on Windows, otherwise 4
PIN_MEMORY = torch.cuda.is_available()
PERSISTENT_WORKERS = NUM_WORKERS > 0
drop_last is left at the PyTorch default False
Train uses shuffle=True; validation and test use shuffle=False
Notebook output confirms identical loader lengths for both families: 319 train, 57 val, 269 test
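The platform-dependent settings in the list above can be derived in one place. This is a sketch rather than the notebook's code: loader_settings is a hypothetical helper, and CUDA availability is passed in as a plain boolean so the example runs without PyTorch (the notebook reads it from torch.cuda.is_available()).

```python
import platform


def loader_settings(system_name: str, cuda_available: bool) -> dict:
    """Derive DataLoader keyword settings from the environment (hypothetical helper)."""
    num_workers = 0 if system_name == "Windows" else 4
    return {
        "batch_size": 32,
        "num_workers": num_workers,
        "pin_memory": cuda_available,
        # persistent_workers is only valid when worker processes exist.
        "persistent_workers": num_workers > 0,
    }


print(loader_settings(platform.system(), cuda_available=False))
```

Tying persistent_workers to NUM_WORKERS > 0 matters because PyTorch's DataLoader rejects persistent_workers=True when num_workers is 0.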

python

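import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class StanfordDogsFrameDataset(Dataset):
    """Serve (image, label) pairs from a metadata frame, loading images lazily."""

    def __init__(self, frame: pd.DataFrame, transform=None):
        # Reset the index so positional lookups match __len__.
        self.frame = frame.reset_index(drop=True).copy()
        self.transform = transform

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        # Force three channels so grayscale or RGBA files still yield (3, H, W).
        image = Image.open(row["image_path"]).convert("RGB")
        label = int(row["label"])
        if self.transform is not None:
            image = self.transform(image)
        return image, label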

python

# complete loader example
def build_loaders(train_transform):
    train_ds = StanfordDogsFrameDataset(train_meta, transform=train_transform)
    val_ds = StanfordDogsFrameDataset(val_meta, transform=common_eval_transform)
    test_ds = StanfordDogsFrameDataset(test_meta, transform=common_eval_transform)

    # Settings shared by all three loaders; only shuffle differs.
    loader_kwargs = dict(
        batch_size=BATCH_SIZE,
        num_workers=NUM_WORKERS,
        pin_memory=PIN_MEMORY,
        persistent_workers=PERSISTENT_WORKERS,
    )
    train_loader = DataLoader(train_ds, shuffle=True, **loader_kwargs)
    val_loader = DataLoader(val_ds, shuffle=False, **loader_kwargs)
    test_loader = DataLoader(test_ds, shuffle=False, **loader_kwargs)
    return train_loader, val_loader, test_loader

resnet_train_loader, resnet_val_loader, resnet_test_loader = build_loaders(resnet_train_transform)
vit_train_loader, vit_val_loader, vit_test_loader = build_loaders(vit_train_transform)

print("ResNet loaders:", len(resnet_train_loader), len(resnet_val_loader), len(resnet_test_loader))
print("ViT loaders:", len(vit_train_loader), len(vit_val_loader), len(vit_test_loader))
# ResNet loaders: 319 57 269
# ViT loaders: 319 57 269

Training vs validation: use stochastic augmentation only during training.
Performance: tune batch_size, num_workers, and pin_memory based on hardware.
Implementation tip: keep normalization identical across train, validation, and test once the statistics are chosen.
Implementation tip: visually inspect a denormalized preview batch before starting long training runs.
Fair comparison: runs_summary.csv shows all four benchmark runs keep the same input size of 224 and batch size of 32.
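The denormalized preview mentioned in the tips relies on inverting the normalization formula. The sketch below shows the inverse on a single RGB pixel in plain Python; the helper names are hypothetical, and in a notebook the same arithmetic would typically be broadcast over a whole image tensor.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Apply x_norm = (x - mean) / std per channel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def denormalize_pixel(rgb_norm):
    """Invert the normalization and clamp to the displayable [0, 1] range."""
    return tuple(
        min(1.0, max(0.0, v * s + m))
        for v, m, s in zip(rgb_norm, IMAGENET_MEAN, IMAGENET_STD)
    )


original = (0.20, 0.55, 0.90)
restored = denormalize_pixel(normalize_pixel(original))
print([round(v, 6) for v in restored])
```

The clamp matters for previews of erased or heavily jittered images, where the inverse can fall slightly outside [0, 1].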

The two artifact references mentioned earlier are now surfaced directly here. The W&B export file runs_summary.csv confirms that every benchmark run keeps the same preprocessing contract, while the notebook image artifact augmented_batch_preview.png shows what the transformed training batch actually looks like.

Preview of `runs_summary.csv` for node 2 data-processing settings
model_name	family	strategy	batch_size	image_size
ResNet-50	`cnn`	Full fine-tuning for 12 epochs	32	224
ResNet-50	`cnn`	Head 3 + full fine-tune 8 epochs	32	224
ViT-B/16	`vit`	Full fine-tuning for 12 epochs	32	224
ViT-B/16	`vit`	Head 3 + full fine-tune 8 epochs	32	224

The exported notebook artifact augmented_batch_preview.png confirms that the training transform is active, but still preserves the main breed identity after resize, crop, normalization, and augmentation.

3. Backbone architecture and transfer learning

This stage focuses on z, the learned feature representation produced by the pretrained backbone. It explains how ResNet-50 and ViT-B/16 process the shared 224x224 tensors, how transfer learning is configured for each backbone, and how the fair four-run benchmark is grounded in the exported artifacts.

Backbone overview

From 224x224 tensors to transferable feature representations

After node 1 and node 2 finish data processing, both model families receive the same minibatch format and then diverge internally. ResNet-50 keeps a convolutional feature grid all the way to global pooling, while ViT-B/16 turns the image into a sequence of patch tokens and summarizes it through the CLS token.

Every fair-benchmark run starts from the same input contract (32, 3, 224, 224)

The notebook output and runs_summary.csv agree on the shared preprocessing contract: batch_size = 32 and image_size = 224 for all four benchmark runs.

ResNet-50 keeps spatial feature maps, then compresses them to a 2048-dimensional pooled vector

Architecturally, a 224x224 input becomes deep convolutional maps near (N, 2048, 7, 7) before global average pooling reduces them to (N, 2048). The classifier head later reads that pooled descriptor.

ViT-B/16 converts the image into patch tokens, then uses self-attention to mix information

ViT splits a 224x224 image into 16x16 patches, so one image becomes 14 x 14 = 196 patch tokens. With one extra class token, the transformer processes a sequence of 197 tokens and learns a feature representation with hidden size 768. The final class token embedding is then passed to the classification head.

The shape notes here are architecture-derived; params and strategy evidence come from exported artifacts

The notebook does not print intermediate backbone tensors directly, so the shape flow below is an architecture explanation. The parameter counts, training strategies, and benchmark outcomes are taken from model_comparison.csv, runs_summary.csv, and the staged-run history files.

python

# Conceptual shape flow used in this report
images, labels = next(iter(resnet_train_loader))
# images.shape -> torch.Size([32, 3, 224, 224])

# ResNet-50
# backbone feature maps: (32, 2048, 7, 7)
# pooled features:       (32, 2048)
# logits after fc:       (32, 120)

# ViT-B/16
# 224x224 with patch size 16 -> 14 x 14 = 196 patches
# token sequence:        (32, 197, 768)   # 196 patches + 1 cls token
# cls embedding:         (32, 768)
# logits after head:     (32, 120)

Backbone evidence files used in this section: model_comparison.csv, runs_summary.csv, resnet50_staged_history.csv, and vit_b16_staged_history.csv.
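The shape notes in this section reduce to two small formulas. The sketch below assumes the standard designs: ViT-B/16 uses non-overlapping 16x16 patches plus one class token, and ResNet-50 downsamples by a total stride of 32. Both helper names are hypothetical.

```python
def vit_token_count(image_size: int, patch_size: int = 16, extra_tokens: int = 1) -> int:
    """Sequence length seen by the transformer for a square input."""
    if image_size % patch_size:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + extra_tokens


def resnet_grid_side(image_size: int, total_stride: int = 32) -> int:
    """Side length of the final feature map for inputs that are multiples of the stride."""
    if image_size % total_stride:
        raise ValueError("this sketch only covers multiples of the total stride")
    return image_size // total_stride


print(vit_token_count(224))   # 197
print(resnet_grid_side(224))  # 7
```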

CNN backbone

ResNet-50: residual CNN feature extraction with a deeper pooled representation

The CNN branch of the refactored notebook uses ResNet-50 as its benchmark backbone. Its residual stages end in global average pooling, which gives a 2048-dimensional pooled representation before the final classifier.

create_resnet50(...) keeps the pretrained backbone and swaps only the classifier

TorchVision already packages the full residual backbone, global pooling, and original ImageNet head. The notebook simply replaces model.fc so the output width matches 120 Stanford Dogs classes.

Parameter count exported by both W&B and model_comparison.csv is 23,753,912

This makes ResNet-50 a substantial CNN backbone, but it is still much lighter than ViT-B/16. It is the CNN counterpart used throughout the current fair benchmark.

The staged ResNet-50 schedule slightly outperforms full fine-tuning in the exported benchmark

The staged run reaches 0.8655 test accuracy versus 0.8557 for full fine-tuning, and it also finishes about 10% faster (205.0 s versus 227.1 s).

python

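import torch.nn as nn
from torchvision import models


def create_resnet50(num_classes: int) -> nn.Module:
    # Load ImageNet-pretrained weights, then swap the 1000-way head for num_classes outputs.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


# count_total_params is a notebook helper that is not shown in this excerpt.
resnet_model = create_resnet50(NUM_CLASSES).to(DEVICE)
print("ResNet-50 params:", f"{count_total_params(resnet_model):,}")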

Artifact-backed backbone summary: ResNet-50 is the 23.75M-parameter CNN baseline used in both the full and staged schedules of the fair benchmark.

Transformer backbone

ViT-B/16: patch tokens, CLS aggregation, and stronger transfer behavior

The transformer branch keeps ViT-B/16 with ImageNet-pretrained weights. Its backbone remains much heavier than ResNet-50, but the exported history files show that its pretrained representation is already extremely strong before full fine-tuning even starts.

create_vit_b16(...) replaces the classification head on top of the CLS embedding

TorchVision exposes the final classifier as model.heads.head. The notebook swaps that linear layer while leaving the rest of the transformer stack intact.

Parameter count exported by the benchmark is 85,890,936

ViT-B/16 is roughly 3.6 times larger than ResNet-50 by parameter count, which partly explains its longer training times, but it also yields the strongest fine-grained representation in the final benchmark.

The head-only warmup history already starts very high on validation accuracy

In vit_b16_staged_history.csv, the head-only phase moves from 0.9456 to 0.9522 validation accuracy within just three epochs, showing how strong the pretrained backbone is before the full-model stage.

python

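import torch.nn as nn
from torchvision import models


def create_vit_b16(num_classes: int) -> nn.Module:
    # Load ImageNet-pretrained weights, then replace the linear head that reads the CLS embedding.
    model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
    model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)
    return model


# count_total_params is a notebook helper that is not shown in this excerpt.
vit_model = create_vit_b16(NUM_CLASSES).to(DEVICE)
print("ViT-B/16 params:", f"{count_total_params(vit_model):,}")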

Artifact-backed backbone summary: ViT-B/16 is the 85.89M-parameter transformer baseline, and its staged schedule reaches the best test accuracy in the current fair benchmark.
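Because both factory functions swap only a single linear layer, the size of the new head can be computed by hand and compared with the exported totals. The sketch below assumes the replaced head is one nn.Linear with bias (2048 inputs for ResNet-50, 768 for ViT-B/16) and that a head-only stage trains nothing else.

```python
NUM_CLASSES = 120


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


backbones = {
    # name: (head input width, exported total parameter count)
    "ResNet-50": (2048, 23_753_912),
    "ViT-B/16": (768, 85_890_936),
}

head_sizes = {}
for name, (in_features, total) in backbones.items():
    head = linear_params(in_features, NUM_CLASSES)
    head_sizes[name] = head
    print(f"{name}: head {head:,} params = {head / total:.2%} of {total:,}")
```

Under these assumptions the head-only stage updates about 1% of ResNet-50 and about 0.1% of ViT-B/16, which is why those warmup epochs are comparatively cheap.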

Transfer learning

The refactored notebook now evaluates both backbones under the same two schedules

This is the biggest change from the old report. Instead of giving the CNN and ViT different main strategies, the current notebook evaluates both under full fine-tuning for 12 epochs and head 3 + full fine-tune 8 epochs. That makes the backbone comparison much cleaner.

python

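def set_trainable_stage(model: nn.Module, family: str, mode: str) -> None:
    """Freeze or unfreeze parameters for the requested training stage."""
    if mode == 'head_only':
        # Freeze everything first, then re-enable only the classifier head.
        for param in model.parameters():
            param.requires_grad = False
        if family == 'cnn':
            for param in model.fc.parameters():
                param.requires_grad = True
        elif family == 'vit':
            for param in model.heads.parameters():
                param.requires_grad = True
        else:
            raise ValueError(f"unknown family: {family!r}")
    elif mode == 'full_finetune':
        for param in model.parameters():
            param.requires_grad = True
    else:
        raise ValueError(f"unknown mode: {mode!r}")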

Model	Family	Strategy	Params(M)	Test accuracy	Macro F1	Train time (s)
ResNet-50	CNN	Full fine-tuning for 12 epochs	23.75	0.8557	0.8485	227.1
ResNet-50	CNN	Head 3 + full fine-tune 8 epochs	23.75	0.8655	0.8599	205.0
ViT-B/16	Transformer	Full fine-tuning for 12 epochs	85.89	0.9077	0.9026	943.4
ViT-B/16	Transformer	Head 3 + full fine-tune 8 epochs	85.89	0.9348	0.9311	755.2

The staged ResNet-50 run slightly beats full fine-tuning on both accuracy and macro F1.
The staged ViT-B/16 run is clearly best overall and is also faster than the full 12-epoch ViT run.
The staged history files show immediate backbone reuse: ResNet-50 head-only val accuracy rises from 0.8122 to 0.8483, while ViT-B/16 rises from 0.9456 to 0.9522 in 3 epochs.
This node is therefore grounded in exported strategy evidence, not only architecture descriptions.

Exported fair-benchmark overview for ResNet-50 and ViT-B/16 on Stanford Dogs

The downloaded comparison_overview.png is the report-ready benchmark figure generated after the fair four-run experiment. It visually confirms the same ranking shown in model_comparison.csv: staged ViT-B/16 first, then full ViT-B/16, then staged ResNet-50, and finally full ResNet-50.

Direct web references for this node: model_comparison.csv, best_per_family_comparison.csv, comparison_overview.png, resnet50_staged_run_metadata.json, and vit_b16_staged_run_metadata.json.

Artifact snapshots

Every backbone file referenced above is now previewed directly on the page

The notebook and W&B exports do not just provide final metrics. They also record the run contract, staged history, and best-per-family summary used to justify the backbone comparison. The tables below are compact previews of those files so the reader does not need to open raw CSV or JSON first.

Preview of `runs_summary.csv`
Run ID	Model	Family	Strategy	Batch size	Image size	Total params
`4vbmbqdj`	ResNet-50	`cnn`	Full fine-tuning for 12 epochs	32	224	23,753,912
`82qpkftg`	ResNet-50	`cnn`	Head 3 + full fine-tune 8 epochs	32	224	23,753,912
`qt797v8c`	ViT-B/16	`vit`	Full fine-tuning for 12 epochs	32	224	85,890,936
`kqlnnxbe`	ViT-B/16	`vit`	Head 3 + full fine-tune 8 epochs	32	224	85,890,936
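
The two totals in this preview can be cross-checked with simple arithmetic. The sketch below assumes the stock TorchVision parameter counts for the 1000-class ImageNet models (25,557,032 for ResNet-50 and 86,567,656 for ViT-B/16; these two numbers are not read from the notebook exports) and swaps the 1000-way linear head for a 120-way one. The helper names are illustrative only.

```python
# Cross-check of the exported parameter counts (plain arithmetic, no PyTorch needed).
# ASSUMPTION: the ImageNet totals below are the stock TorchVision counts,
# not values read from the notebook exports.

def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a single linear layer."""
    return in_features * out_features + out_features


def params_after_head_swap(imagenet_total: int, feat_dim: int, num_classes: int) -> int:
    """Remove the 1000-class head and add a num_classes head with the same input width."""
    return imagenet_total - linear_params(feat_dim, 1000) + linear_params(feat_dim, num_classes)


NUM_CLASSES = 120
resnet_total = params_after_head_swap(25_557_032, 2048, NUM_CLASSES)
vit_total = params_after_head_swap(86_567_656, 768, NUM_CLASSES)

# Share of parameters that stay trainable during the head-only phase.
resnet_head_share = linear_params(2048, NUM_CLASSES) / resnet_total
vit_head_share = linear_params(768, NUM_CLASSES) / vit_total

print(f"ResNet-50: {resnet_total:,} total, head share {resnet_head_share:.2%}")
print(f"ViT-B/16:  {vit_total:,} total, head share {vit_head_share:.2%}")
```

Under those assumptions the computed totals match the 23,753,912 and 85,890,936 values in the table, and the head-share figures show how small the trainable portion is during the head-only phase.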

Preview of `best_per_family_comparison.csv`
Model	Family	Best strategy	Params(M)	Accuracy	Macro F1	ECE
ResNet-50	CNN	Head 3 + full fine-tune 8 epochs	23.75	0.8655	0.8599	0.0408
ViT-B/16	Transformer	Head 3 + full fine-tune 8 epochs	85.89	0.9348	0.9311	0.0198

Snapshot from `resnet50_staged_history.csv` and `vit_b16_staged_history.csv`
Run	Phase	Epoch in phase	LR	Train acc	Val acc	Val loss
ResNet-50 staged	`head_only`	1	0.00075	0.5808	0.8122	1.1028
ResNet-50 staged	`head_only`	3	0.00000	0.8609	0.8483	0.6702
ViT-B/16 staged	`head_only`	1	0.00075	0.8743	0.9456	0.1834
ViT-B/16 staged	`head_only`	3	0.00000	0.9562	0.9522	0.1589

Snapshot from the staged `run_metadata.json` files
Model	Run ID	Group	Job type	Phase plan	Training seconds	W&B run
ResNet-50	`82qpkftg`	`stanford-dogs-fair-benchmark`	`train`	`head_only(3, 1e-3)` -> `full_finetune(8, 1e-4)`	205.03	`82qpkftg`
ViT-B/16	`kqlnnxbe`	`stanford-dogs-fair-benchmark`	`train`	`head_only(3, 1e-3)` -> `full_finetune(8, 3e-5)`	755.21	`kqlnnxbe`
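
The LR column in the history snapshot (0.00075 after the first head-only epoch, 0.00000 after the third) is consistent with a cosine decay from the 1e-3 head-only rate listed in the phase plan. The notebook's scheduler is not shown in this section, so the following is only a sketch under that assumption; `cosine_lr` is a hypothetical helper, not a notebook function.

```python
import math


def cosine_lr(base_lr: float, epochs_done: int, total_epochs: int) -> float:
    """Cosine-annealed learning rate after `epochs_done` of `total_epochs` (decays to 0)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epochs_done / total_epochs))


# ASSUMPTION: the head-only phase runs 3 epochs from 1e-3 and is annealed once per epoch.
head_only_lrs = [cosine_lr(1e-3, epoch, 3) for epoch in range(4)]
print([f"{lr:.5f}" for lr in head_only_lrs])
```

Matching two logged points does not confirm the scheduler; other schedules could pass through the same values.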

4. Classifier head and prediction mapping

This stage focuses on g(z), the adapted classifier head that maps backbone features to 120 Stanford Dogs logits, then converts those logits into probabilities, labels, and report-ready artifacts such as classification reports and metrics JSON files.

Classifier head overview

Minimal linear heads keep the comparison centered on the backbone

Once the backbone has produced a compact feature representation, the classifier head is the final component that converts those features into breed scores. In this notebook, both models use simple adapted heads: ResNet-50 sends a pooled feature vector to model.fc, while ViT-B/16 sends its class-token embedding to model.heads.head.

ResNet-50 head input is the pooled backbone vector (N, 2048)

The convolutional backbone first reduces spatial information with global average pooling, so the classifier head receives one 2048-dimensional feature vector per image instead of a full feature map.
ViT-B/16 head input is the class-token representation (N, 768)

The transformer backbone summarizes the image through the CLS token. That final token embedding is the representation passed into the adapted linear head for breed classification.
The output of both heads is a logits tensor with shape (N, 120)

Each row contains one score for each Stanford Dogs breed. These are raw logits, not probabilities yet, which is exactly what the training loss expects.
Training uses logits directly, while inference adds softmax and argmax

During training, the notebook uses CrossEntropyLoss on raw logits. During prediction, it applies softmax(dim=1) to obtain probabilities and argmax(dim=1) to choose the final breed.
The input widths are read from the pretrained models, not hard-coded by hand

The notebook queries model.fc.in_features for ResNet-50 and model.heads.head.in_features for ViT-B/16 so the adapted head always matches the backbone output.

python

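# Conceptual head interface in this report (shape sketch with stand-in layers)
import torch
from torch import nn

N = 4  # illustrative batch size

# ResNet-50: pooled backbone features -> breed logits
pooled_features = torch.randn(N, 2048)
resnet_head = nn.Linear(2048, 120)          # stands in for model.fc
logits = resnet_head(pooled_features)       # (N, 120)

# ViT-B/16: CLS embedding -> breed logits
cls_embedding = torch.randn(N, 768)
vit_head = nn.Linear(768, 120)              # stands in for model.heads.head
logits = vit_head(cls_embedding)            # (N, 120)

# Inference
probs = logits.softmax(dim=1)               # (N, 120)
preds = probs.argmax(dim=1)                 # (N,)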

The notebook does not manually print these intermediate tensors before calling the head, but the widths above follow directly from model.fc.in_features and model.heads.head.in_features in the current code.

TorchVision head for CNN

ResNet-50 keeps the standard pooling path and swaps only the final linear layer

In TorchVision ResNet, global average pooling is already built into the forward pass, so the notebook only needs to replace the final fully connected layer. This is the simplest and most common classifier-head adaptation for transfer learning with CNNs.

model.fc.in_features provides the correct input width automatically

Instead of hard-coding the feature size, the notebook reads model.fc.in_features directly from the pretrained model, which ensures the replacement layer matches the backbone output. In the current ResNet-50 benchmark this width is 2048.
The new output width is num_classes = 120

This is the task-specific adaptation step: ImageNet has 1000 output classes, but Stanford Dogs has 120 breeds, so the head must be resized to the new label space.
The notebook intentionally keeps the CNN head simple

No extra hidden MLP, dropout stack, or custom projection layer is inserted for ResNet-50. That keeps the experiment focused on the backbone comparison rather than on head engineering.

python

def create_resnet50(num_classes: int) -> nn.Module:
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


resnet_model = create_resnet50(NUM_CLASSES).to(DEVICE)

For ResNet-50, the effective classifier head is a simple mapping from a 2048-dimensional pooled feature vector to 120 breed logits.

TorchVision head for ViT

ViT-B/16 replaces the classifier on top of the CLS representation

The Vision Transformer head is also adapted with a single linear layer, but its input comes from the transformer class token rather than from pooled convolutional features. In the notebook, this head is trained first by itself before the whole model is unfrozen.

model.heads.head.in_features defines the classifier input width

For ViT-B/16, the final token embedding size is 768, and TorchVision exposes that width through model.heads.head.in_features.
The adapted ViT head is still a single linear layer to 120 classes

Just like ResNet-50, the notebook avoids introducing a deeper custom head so the comparison remains centered on the pretrained backbone rather than on extra classifier capacity.
Head-only training is used as the first stage of ViT fine-tuning

The notebook first freezes the backbone and optimizes only the classifier head, then later unfreezes the whole model for full fine-tuning. In the exported staged history, this head-only phase already reaches 0.9522 validation accuracy at epoch 3.

python

def create_vit_b16(num_classes: int) -> nn.Module:
    model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
    model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)
    return model


vit_model = create_vit_b16(NUM_CLASSES).to(DEVICE)

for param in vit_model.parameters():
    param.requires_grad = False
for param in vit_model.heads.parameters():
    param.requires_grad = True

For ViT-B/16, the classifier head maps the final 768-dimensional CLS embedding to 120 Stanford Dogs logits.

Prediction pipeline

Logits become probabilities, predicted labels, and report-ready artifacts

The adapted head outputs raw logits. Those logits are consumed in two different ways: during training they go directly into CrossEntropyLoss, and during inference they are converted into probabilities with softmax so the notebook can derive final predictions and export evaluation files. The staged history files also show how quickly the new head starts aligning with Stanford Dogs labels before the full backbone is unfrozen.

CrossEntropyLoss expects raw logits, not softmax probabilities

This is why the forward pass returns unnormalized scores. The loss function internally applies the correct log-softmax behavior in a numerically stable way.
softmax(dim=1) converts logits into a 120-way probability vector

Each probability vector sums to one and can be reused for calibration analysis, confidence histograms, and other reporting artifacts generated later in the workflow.
argmax(dim=1) selects the top-1 breed prediction for each image

The predicted class index is compared against the ground-truth label to compute accuracy, F1, confusion matrices, and the final classification report.
evaluate_model(...) reuses the head outputs to save CSV, JSON, and image artifacts

The notebook exports classification reports, count-based confusion matrices, row-normalized confusion matrices, top-confusion tables, and summary metrics from the same prediction outputs.
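
The relationship between logits, CrossEntropyLoss, softmax, and argmax described above can be illustrated without PyTorch. The sketch below is a pure-Python stand-in for a single image with three classes instead of 120; the function names and the example logits are made up for illustration and are not notebook code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_sum_exp for v in logits]


def cross_entropy_from_logits(logits, target):
    """Loss for one sample: negative log-probability of the true class."""
    return -log_softmax(logits)[target]


def softmax(logits):
    """Probabilities obtained by exponentiating the stable log-softmax."""
    return [math.exp(v) for v in log_softmax(logits)]


logits = [2.0, 0.5, -1.0]   # made-up raw scores for a 3-class toy problem
target = 0                  # made-up ground-truth class index

loss = cross_entropy_from_logits(logits, target)        # training path: logits go straight in
probs = softmax(logits)                                 # inference path: probabilities
pred = max(range(len(probs)), key=probs.__getitem__)    # argmax -> top-1 class index

print(f"loss={loss:.4f}  probs sum={sum(probs):.4f}  pred={pred}")
```

Applying softmax before the loss would normalize the scores twice, which is why the training path passes logits through unchanged.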

python

def predict_model(model, loader):
    model.eval()
    all_targets, all_preds, all_probs = [], [], []
    with torch.no_grad():
        for images, targets in tqdm(loader, leave=False):
            images = images.to(DEVICE, non_blocking=True)
            outputs = model(images)
            probs = outputs.softmax(dim=1)
            preds = probs.argmax(dim=1)
            all_targets.extend(targets.numpy())
            all_preds.extend(preds.cpu().numpy())
            all_probs.append(probs.cpu())
    return np.array(all_targets), np.array(all_preds), torch.cat(all_probs, dim=0).numpy()


def evaluate_model(model, loader, class_names, model_name, artifact_dir):
    targets, preds, probs = predict_model(model, loader)
    report = classification_report(targets, preds, target_names=class_names, zero_division=0, output_dict=True)
    report_df = pd.DataFrame(report).transpose()
    cm = confusion_matrix(targets, preds)
    cm_norm = confusion_matrix(targets, preds, normalize='true')

Model	Head-only epoch 1 val acc	Head-only epoch 3 val acc	First full-finetune val acc	History source
ResNet-50 staged run	0.8122	0.8483	0.8456	`resnet50_staged_history.csv`
ViT-B/16 staged run	0.9456	0.9522	0.9256	`vit_b16_staged_history.csv`

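The staged histories make the head adaptation behavior visible on the web report itself: with a frozen backbone, ResNet-50 reaches 0.8483 validation accuracy while ViT-B/16 already reaches 0.9522, which points to a stronger transferred representation. Both runs dip at the first full fine-tuning epoch (to 0.8456 and 0.9256), when the backbone is unfrozen.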

Artifact-backed head outputs

The adapted heads feed the downloaded report files used later on the web page

The downloaded artifacts make node 4 concrete: once logits and probabilities are produced, the notebook writes classification reports, metrics JSON files, confusion matrices, calibration plots, and qualitative galleries. For the backbone/head discussion here, the staged runs are the most relevant exports because they reflect the current fair benchmark setup.

Model	Feature fed to head	Head layer	Accuracy	Macro F1	Weighted F1	Exported files
ResNet-50	Pooled 2048-d feature vector	`nn.Linear(2048, 120)`	0.8655	0.8599	0.8659	`classification_report.csv` `metrics.json` `confusion_matrix_counts.png` `calibration.png`
ViT-B/16	CLS token 768-d embedding	`nn.Linear(768, 120)`	0.9348	0.9311	0.9350	`classification_report.csv` `metrics.json` `confusion_matrix_counts.png` `calibration.png`

Excerpt from the downloaded staged `classification_report.csv` files
Breed	Support	ResNet precision	ResNet recall	ResNet F1	ViT precision	ViT recall	ViT F1
Chihuahua	52	0.8000	0.8462	0.8224	0.8475	0.9615	0.9009
Japanese spaniel	85	0.9176	0.9176	0.9176	0.9512	0.9176	0.9341
Maltese dog	152	0.9054	0.8816	0.8933	0.9724	0.9276	0.9495
Pekinese	49	0.8696	0.8163	0.8421	0.9038	0.9592	0.9307
Shih-Tzu	114	0.7280	0.7982	0.7615	0.8595	0.9123	0.8851

Match the head input width to the backbone output dimension instead of guessing it manually.
Set the output width to the dataset class count, which is 120 for Stanford Dogs.
Use raw logits for loss computation and reserve softmax probabilities for inference and reporting.
The excerpt above comes directly from the staged per-class reports, so node 4 now shows concrete breed-level evidence from the downloaded artifacts.
Keep the head simple first; the benchmark files above were all generated with these minimal linear heads.
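
Each F1 value in the excerpt is the harmonic mean of the precision and recall in the same row. A small check, using a few rows copied from the table above, is sketched below. The helper name is illustrative, and tiny differences can appear because the table's precision and recall are already rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (breed, model, precision, recall, reported F1) copied from the excerpt above.
rows = [
    ("Chihuahua", "ResNet-50", 0.8000, 0.8462, 0.8224),
    ("Chihuahua", "ViT-B/16", 0.8475, 0.9615, 0.9009),
    ("Shih-Tzu", "ResNet-50", 0.7280, 0.7982, 0.7615),
    ("Shih-Tzu", "ViT-B/16", 0.8595, 0.9123, 0.8851),
]

for breed, model, precision, recall, reported in rows:
    recomputed = f1_score(precision, recall)
    print(f"{breed:10s} {model:10s} recomputed={recomputed:.4f} reported={reported:.4f}")
```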

ResNet-50 staged confusion matrix with raw counts on Stanford Dogs

The staged ResNet-50 head exports this count-based confusion matrix directly from the same logits used to create classification_report.csv and metrics.json.

ViT-B/16 staged confusion matrix with raw counts on Stanford Dogs

The staged ViT-B/16 confusion matrix is the transformer counterpart to the CNN export. Its cleaner top-1 prediction structure reflects the stronger backbone features feeding the head, since both models use the same kind of linear head.

ResNet-50 staged calibration plot on Stanford Dogs

This calibration export reuses the probability vectors produced after softmax(dim=1), so it is a direct visualization of how reliable the ResNet-50 head confidence is on the test set.

ViT-B/16 staged calibration plot on Stanford Dogs

The ViT-B/16 calibration figure is also generated from the same head probabilities and complements the lower ECE reported in the staged metrics.json.

Node 4 ends at head outputs and prediction mapping. Node 5 is where those exported files are interpreted as final benchmark results, diagnostics, and recommendations.

5. Results, insights, and recommendations

This stage interprets the final Stanford Dogs outputs at report level: the fair four-run benchmark, W&B-backed provenance, downloaded notebook diagnostics, and practical recommendations for choosing between the ResNet-50 and ViT-B/16 pipelines.

Quick summary

Fair benchmark results at a glance

The current report now reflects the matched four-run benchmark exported by the rerun notebook: both ResNet-50 and ViT-B/16 are evaluated under full fine-tuning for 12 epochs and head 3 + full fine-tune 8 epochs. The summary below is grounded in model_comparison.csv, best_per_family_comparison.csv, comparison_overview.png, and the matching W&B export files.

Best accuracy / macro F1

ViT-B/16 staged

The strongest overall run is ViT-B/16 with head 3 + full fine-tune 8 epochs at 93.48% test accuracy, 0.9311 macro F1, and 0.9350 weighted F1.

Best overall 755.21 s

Best calibration

ViT-B/16 full

The lowest exported ECE is 0.0178, achieved by ViT-B/16 full fine-tuning for 12 epochs. It is not the most accurate run, but it is the most confidence-aligned one.

Calibration winner ECE 0.0178

Best CNN / fastest rerun

ResNet-50 staged

The fastest exported benchmark is ResNet-50 with head 3 + full fine-tune 8 epochs at 205.03 seconds, while still delivering 86.55% accuracy and 0.8599 macro F1.

Fastest run 23.75M params

Fair Stanford Dogs comparison overview for ResNet-50 and ViT-B/16

The exported comparison figure summarizes the matched benchmark directly from the notebook. It visually confirms the same ranking shown in model_comparison.csv: staged ViT-B/16 first, then full ViT-B/16, then staged ResNet-50, and finally full ResNet-50.

Model	Strategy	Accuracy	Macro F1	Weighted F1	ECE	Train time (s)	Params (M)
ResNet-50	Full fine-tuning for 12 epochs	0.8557	0.8485	0.8562	0.0468	227.08	23.75
ResNet-50	Head 3 + full fine-tune 8 epochs	0.8655	0.8599	0.8659	0.0408	205.03	23.75
ViT-B/16	Full fine-tuning for 12 epochs	0.9077	0.9026	0.9080	0.0178	943.43	85.89
ViT-B/16	Head 3 + full fine-tune 8 epochs	0.9348	0.9311	0.9350	0.0198	755.21	85.89

Preview of `best_per_family_comparison.csv`
Model	Family	Best strategy	Accuracy	Macro F1	ECE	Train time (s)
ResNet-50	CNN	Head 3 + full fine-tune 8 epochs	0.8655	0.8599	0.0408	205.03
ViT-B/16	Transformer	Head 3 + full fine-tune 8 epochs	0.9348	0.9311	0.0198	755.21

Result references: model_comparison.csv, best_per_family_comparison.csv, comparison_overview.png, and runs_summary.csv.

Fair comparison

The matched benchmark still clearly favors ViT-B/16

The best ViT run reaches 0.9348 accuracy, which is 6.93 percentage points higher than the best ResNet-50 run at 0.8655.
Both ViT runs outperform both CNN runs, so the ranking is stable across the two transfer-learning schedules.
The report can now attribute the gap to backbone behavior more confidently because the schedules are matched.
Simple linear heads are sufficient to expose a meaningful backbone gap on this 120-class fine-grained task.
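
The 6.93-point gap quoted above, and the equivalent view in terms of error rate, follow directly from the two best-per-family accuracies. A short worked check, with illustrative variable names:

```python
best_vit_acc = 0.9348      # ViT-B/16, head 3 + full fine-tune 8 epochs
best_resnet_acc = 0.8655   # ResNet-50, head 3 + full fine-tune 8 epochs

gap_pp = (best_vit_acc - best_resnet_acc) * 100      # gap in percentage points
vit_error = 1 - best_vit_acc                         # top-1 error rate
resnet_error = 1 - best_resnet_acc
relative_error_cut = 1 - vit_error / resnet_error    # share of ResNet-50 errors removed

print(f"accuracy gap: {gap_pp:.2f} percentage points")
print(f"top-1 error: ResNet-50 {resnet_error:.2%}, ViT-B/16 {vit_error:.2%}")
print(f"relative error reduction: {relative_error_cut:.1%}")
```

Read this way, the best ViT-B/16 run makes roughly half as many top-1 errors as the best ResNet-50 run on the same test set.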

Schedule effect

The staged recipe helps both model families

ResNet-50 improves from 0.8557 to 0.8655 accuracy when the head-only warmup is added before full fine-tuning.
ViT-B/16 improves even more strongly, from 0.9077 to 0.9348 accuracy under the same staged recipe.
The staged recipe is also faster than the corresponding full 12-epoch run for both backbones.
W&B histories show that ViT already starts extremely strong in the head-only phase, while ResNet-50 gains are more gradual.
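
The timing side of the schedule comparison can be quantified from the exported training seconds. The sketch below uses the values from the benchmark table. Note that the staged recipe runs 11 epochs in total (3 head-only plus 8 full) against 12, so part of the saving is simply one fewer epoch.

```python
# Training seconds copied from the benchmark table (full 12 epochs vs staged 3 + 8).
train_seconds = {
    "ResNet-50": {"full": 227.08, "staged": 205.03},
    "ViT-B/16": {"full": 943.43, "staged": 755.21},
}

time_saved = {}
for model, seconds in train_seconds.items():
    # Fraction of wall-clock training time saved by the staged schedule.
    time_saved[model] = 1 - seconds["staged"] / seconds["full"]
    print(f"{model}: staged run takes {time_saved[model]:.1%} less time than the full run")
```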

Accuracy vs calibration

The most accurate run is not the most calibrated run

ViT-B/16 full fine-tuning has the lowest ECE at 0.0178, even though staged ViT-B/16 is more accurate overall.
ResNet-50 staged improves both accuracy and ECE relative to full fine-tuning, so it dominates inside the CNN family.
If confidence reliability matters most, full ViT is a meaningful alternative to the staged ViT winner.
If training cost matters more, staged ResNet-50 is the easiest fair-benchmark baseline to rerun.
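
Expected calibration error (ECE) compares confidence with accuracy inside confidence bins. The notebook's exact binning is not shown in this section, so the sketch below assumes equal-width bins over the top-1 confidence and uses made-up toy predictions; `expected_calibration_error` is an illustrative helper, not the notebook function.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin share) * |bin accuracy - bin mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)   # confidence 1.0 goes in the last bin
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Toy data (not from the benchmark): top-1 confidence and whether the prediction was right.
toy_confidences = [0.9, 0.9, 0.6, 0.6]
toy_correct = [1, 1, 1, 0]
toy_ece = expected_calibration_error(toy_confidences, toy_correct)
print(f"toy ECE = {toy_ece:.4f}")
```

A lower value means the predicted confidence tracks the observed accuracy more closely, which is the sense in which the full ViT-B/16 run is described as better calibrated.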

Benchmark provenance

Notebook exports and W&B runs agree on the four-run benchmark record

The notebook writes comparison CSVs and summary figures, while W&B records the matching run IDs, strategies, and training seconds. This gives the report both clean presentation artifacts and exact experiment provenance.

Preview of `runs_summary.csv`
Run ID	Model	Family	Strategy	Batch size	Image size	Total params
`4vbmbqdj`	ResNet-50	`cnn`	Full fine-tuning for 12 epochs	32	224	23,753,912
`82qpkftg`	ResNet-50	`cnn`	Head 3 + full fine-tune 8 epochs	32	224	23,753,912
`qt797v8c`	ViT-B/16	`vit`	Full fine-tuning for 12 epochs	32	224	85,890,936
`kqlnnxbe`	ViT-B/16	`vit`	Head 3 + full fine-tune 8 epochs	32	224	85,890,936

Model	Full acc	Staged acc	Delta acc	Full macro F1	Staged macro F1	Full time (s)	Staged time (s)
ResNet-50	0.8557	0.8655	+0.0098	0.8485	0.8599	227.08	205.03
ViT-B/16	0.9077	0.9348	+0.0272	0.9026	0.9311	943.43	755.21

Model	Strategy	Run ID	Training seconds	Metadata file	W&B run
ResNet-50	Full fine-tuning for 12 epochs	`4vbmbqdj`	227.08	`resnet50_full_run_metadata.json`	`4vbmbqdj`
ResNet-50	Head 3 + full fine-tune 8 epochs	`82qpkftg`	205.03	`resnet50_staged_run_metadata.json`	`82qpkftg`
ViT-B/16	Full fine-tuning for 12 epochs	`qt797v8c`	943.43	`vit_b16_full_run_metadata.json`	`qt797v8c`
ViT-B/16	Head 3 + full fine-tune 8 epochs	`kqlnnxbe`	755.21	`vit_b16_staged_run_metadata.json`	`kqlnnxbe`

python

comparison_df = pd.DataFrame([experiment_to_row(result) for result in benchmark_results])
comparison_df['Params(M)'] = comparison_df['Params'] / 1_000_000
comparison_df.to_csv(ARTIFACT_ROOT / 'model_comparison.csv', index=False)

best_per_family_df = pd.DataFrame([experiment_to_row(best_cnn_result), experiment_to_row(best_vit_result)])
best_per_family_df['Params(M)'] = best_per_family_df['Params'] / 1_000_000
best_per_family_df.to_csv(ARTIFACT_ROOT / 'best_per_family_comparison.csv', index=False)

plt.savefig(ARTIFACT_ROOT / 'comparison_overview.png', bbox_inches='tight')
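
The export code above receives `best_cnn_result` and `best_vit_result` already chosen. How the notebook picks them is not shown here; one plausible rule, selecting the highest test accuracy per family, is sketched below with plain dictionaries built from the benchmark table. The row layout and the `best_per_family` helper are assumptions for illustration.

```python
# Rows copied from the four-run benchmark table (accuracy only, for brevity).
runs = [
    {"model": "ResNet-50", "family": "cnn", "strategy": "full", "accuracy": 0.8557},
    {"model": "ResNet-50", "family": "cnn", "strategy": "staged", "accuracy": 0.8655},
    {"model": "ViT-B/16", "family": "vit", "strategy": "full", "accuracy": 0.9077},
    {"model": "ViT-B/16", "family": "vit", "strategy": "staged", "accuracy": 0.9348},
]


def best_per_family(rows, metric="accuracy"):
    """Keep the row with the highest metric for each family (ASSUMED selection rule)."""
    best = {}
    for row in rows:
        current = best.get(row["family"])
        if current is None or row[metric] > current[metric]:
            best[row["family"]] = row
    return best


winners = best_per_family(runs)
for family, row in winners.items():
    print(family, row["model"], row["strategy"], row["accuracy"])
```

With the table values, this rule selects the staged run for both families, which agrees with best_per_family_comparison.csv.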

Provenance files visualized in this panel: runs_summary.csv, resnet50_full_run_metadata.json, resnet50_staged_run_metadata.json, vit_b16_full_run_metadata.json, and vit_b16_staged_run_metadata.json.

Downloaded diagnostics

Best-run artifacts show where the remaining errors come from

The most useful downloaded diagnostics come from the staged best-per-family runs: benchmark/cnn/restnet50_head_then_full and benchmark/vit/vit_b16_head_then_full. These folders contain the normalized confusion matrices, misclassified galleries, and interpretability views used below.

ResNet-50 staged normalized confusion matrix on Stanford Dogs

The normalized ResNet-50 confusion matrix highlights which breeds remain systematically difficult even after the improved staged schedule.

ViT-B/16 staged normalized confusion matrix on Stanford Dogs

The staged ViT-B/16 normalized matrix is cleaner and more diagonal, matching the stronger benchmark accuracy and macro F1.
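
A row-normalized confusion matrix divides each row of counts by that class's support, so the diagonal holds per-class recall and the off-diagonal cells show where the remaining errors go. The pure-Python sketch below uses a made-up 3-class matrix; the class names, counts, and helper names are illustrative and are not benchmark data. It mirrors what `confusion_matrix(..., normalize='true')` computes in the notebook excerpt.

```python
def normalize_rows(counts):
    """Divide each row by its sum so every non-empty row adds up to 1."""
    normalized = []
    for row in counts:
        support = sum(row)
        normalized.append([c / support if support else 0.0 for c in row])
    return normalized


def top_confusions(normalized, labels, k=3):
    """Largest off-diagonal rates as (true label, predicted label, rate)."""
    pairs = [
        (labels[i], labels[j], rate)
        for i, row in enumerate(normalized)
        for j, rate in enumerate(row)
        if i != j and rate > 0
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)[:k]


labels = ["breed_a", "breed_b", "breed_c"]      # placeholder class names
counts = [[8, 2, 0], [1, 9, 0], [0, 5, 15]]     # rows = true class, columns = predicted

normalized = normalize_rows(counts)
recall_per_class = [normalized[i][i] for i in range(len(labels))]
worst_pairs = top_confusions(normalized, labels, k=2)
print(recall_per_class)
print(worst_pairs)
```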

ResNet-50 staged misclassified Stanford Dogs examples

The ResNet-50 staged misclassified gallery is useful for discussing the hard visual cases that remain even under the better CNN training recipe.

ViT-B/16 staged misclassified Stanford Dogs examples

The staged ViT-B/16 gallery shows the smaller set of hard examples that remain even in the strongest benchmark run, which already classifies most of the easier cases correctly.

ResNet-50 staged Grad-CAM gallery on Stanford Dogs

Grad-CAM for the staged ResNet-50 run shows which image regions the CNN relies on most strongly when making its breed predictions.

ViT-B/16 staged attention visualization gallery on Stanford Dogs

The attention gallery provides the transformer counterpart to Grad-CAM and shows how the strongest benchmark run distributes attention across patches and object parts.

The three panels above are rendered directly from the staged CNN and ViT diagnostic PNG exports in the benchmark folders, so the web report is showing the same downloaded artifacts produced by the notebook rerun.

Recommendation

For best accuracy and macro F1

Choose ViT-B/16 with head 3 + full fine-tune 8 epochs. It is the strongest fair-benchmark configuration on both accuracy and macro F1.

Recommendation

For a cheaper and faster rerun

Choose ResNet-50 with the staged schedule when you want the most practical benchmark rerun. It is much faster than either ViT run while still delivering a solid CNN baseline.

Recommendation

For calibration-sensitive deployment

Choose ViT-B/16 full fine-tuning if your deployment cares more about probability quality than absolute top-1 accuracy. It has the lowest exported ECE among the four runs.
