We’re excited to share our first preprint, “PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers”. Read it on arXiv.
Overview
The core idea is simple: in vision-only in-context learning (ICL), the quality of the embedding model, both its architecture and its pretraining, is critical. We show that frozen ResNet and ViT backbones pretrained with triplet-loss objectives enable PictSure to classify robustly from the labeled examples supplied in context, with no backward passes at inference time.
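To make the pretraining side concrete, here is a minimal PyTorch sketch of triplet-loss training for an image embedding backbone. It illustrates the general technique only; the backbone configuration, margin, learning rate, and the way `anchors`/`positives`/`negatives` are mined are assumptions, not the exact recipe from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Embedding backbone: a ResNet18 whose classification head is replaced by an
# identity, so forward() returns a 512-d embedding instead of class logits.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()

# Triplet loss pulls anchor/positive pairs (same class) together and pushes
# anchor/negative pairs (different classes) at least `margin` apart.
criterion = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

def training_step(anchors, positives, negatives):
    """One triplet update; each argument is a (B, 3, 224, 224) image batch.
    How triplets are sampled is an assumption left out of this sketch."""
    loss = criterion(backbone(anchors), backbone(positives), backbone(negatives))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once pretraining is done, the backbone is frozen, and only its embeddings are handed to the ICL transformer.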
On benchmarks spanning general, agricultural, and medical imagery, PictSure variants match or outperform much larger CLIP-based ICL models. In particular, they excel in out-of-domain tasks (e.g. Brain Tumor, OrganCMNIST), where language-aligned embeddings fall short.
PictSure at a glance
- Transformer with asymmetric attention masks for in-context classification (sketched after this list)
- Frozen ResNet18 / ViT backbones with supervised and triplet-loss pretraining recipes
- Compact design (53M–128M params) that achieves state-of-the-art OOD generalization without gradient updates
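The asymmetric masking mentioned above can be pictured with a small, hypothetical construction: labeled context tokens attend to one another but never to the query, while the query token attends to the full sequence. The exact scheme PictSure uses is described in the paper; this sketch only shows the general shape.

```python
import torch

def icl_attention_mask(num_context: int) -> torch.Tensor:
    """Boolean attention mask for `num_context` labeled examples plus one query
    token in the last position. True marks pairs that may NOT attend, matching
    the convention of torch.nn.MultiheadAttention's attn_mask.

    Assumed scheme (illustrative only): context tokens see all context tokens
    but not the query; the query sees everything.
    """
    n = num_context + 1
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_context, -1] = True  # context rows are blocked from the query column
    return mask

# 3 context examples + 1 query -> a 4x4 mask, asymmetric across the diagonal.
print(icl_attention_mask(3))
```

Blocking attention from context to query keeps the context representation independent of which query is asked, which is one common way such asymmetry is used.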
Why it matters
Few-shot image classification usually relies on fine-tuning or language semantics. Both are brittle: fine-tuning requires costly adaptation, and text alignment often fails in domains with weak or ambiguous labels.
PictSure instead:
- Uses no backward pass at inference — instant deployment (see the sketch after this list)
- Stays in the visual space only — no reliance on CLIP-style semantics
- Achieves higher stability and accuracy under distribution shift
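Here is what that deployment story looks like in practice: a minimal sketch assuming a hypothetical `model` callable with a frozen backbone inside. This is not the pictsure-library API, just an illustration of inference without any gradient computation.

```python
import torch

@torch.inference_mode()  # no autograd graph: classification is a single forward pass
def classify(model, context_images, context_labels, query_image):
    """context_images: (N, 3, H, W); context_labels: (N,); query_image: (1, 3, H, W).
    `model` is a hypothetical ICL classifier, not the actual library interface."""
    logits = model(context_images, context_labels, query_image)
    return logits.argmax(dim=-1)  # predicted class index for the query

# Moving to a new domain (e.g. medical scans) only requires swapping the
# labeled context examples; no fine-tuning, no optimizer, no new checkpoint.
```

Because nothing is updated, the same frozen weights serve every new domain; only the in-context examples change.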
Next steps
- Broaden beyond 10-way classification
- Larger-scale OOD benchmarks across specialized domains
- Hugging Face demo + open library release: github.com/PictSure/pictsure-library
Citation
```bibtex
@article{schiesser2025pictsure,
  title={PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers},
  author={Schiesser, Lukas and Wolff, Cornelius and Haas, Sophie and Pukrop, Simon},
  journal={arXiv preprint arXiv:2506.14842},
  year={2025}
}
```