# PictSure

> PictSure is an open-source research project providing few-shot, vision-only in-context learning (ICL) image classifiers that require no fine-tuning or gradient updates at inference time.

PictSure is a compact transformer that classifies images using only visual embeddings and a handful of labeled context examples: no gradient updates, no language supervision. The core research finding is that the quality of the embedding model (architecture and pretraining objective) is the decisive factor for out-of-domain generalization.

Two pretrained models are available on Hugging Face:

- PictSure-ResNet18: 53M parameters, frozen ResNet18 backbone with supervised pretraining
- PictSure-ViT-Triplet: 128M parameters, ViT backbone pretrained with triplet loss for a more structured embedding space

Both models match or outperform much larger CLIP-based ICL models on general, agricultural, and medical imaging benchmarks, especially on out-of-domain tasks (e.g. Brain Tumor, OrganCMNIST) where language-aligned embeddings fall short.
## Key Resources

- [Homepage](https://pictsure.eu/): Project overview, architecture explanation, and quick-start code
- [Research Paper (arXiv:2506.14842)](https://arxiv.org/abs/2506.14842): "PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers"
- [GitHub Library](https://github.com/PictSure/pictsure-library): Open-source Python library for PictSure inference
- [PictSure-ResNet18 on Hugging Face](https://huggingface.co/pictsure/pictsure-resnet): 53M parameter compact model
- [PictSure-ViT-Triplet on Hugging Face](https://huggingface.co/pictsure/pictsure-vit): 128M parameter ViT variant
- [Hugging Face Organization](https://huggingface.co/pictsure): All released PictSure models

## Architecture

PictSure uses a 4-block Transformer encoder with asymmetric attention masks:

- Support tokens (image + label) attend to other support tokens only
- The query token attends to all support tokens
- Support tokens do NOT attend to the query token
- The query representation feeds a linear classification head

Embedding backbones (ResNet18 or ViT) are frozen at inference. The model performs classification purely via forward pass: no backward passes, no parameter updates.
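The masking rules above can be sketched as a boolean attention mask over the token sequence. This is a minimal illustration, not the library's implementation; the helper name `icl_attention_mask` and the choice of leaving out query self-attention are assumptions.

```python
import numpy as np

def icl_attention_mask(n_support: int) -> np.ndarray:
    """Boolean mask (True = may attend) for n_support support tokens
    followed by one query token, following the asymmetric rules:
    support tokens see only other support tokens; the query token
    sees all support tokens; nothing attends to the query column.

    Illustrative sketch only; whether the query also attends to
    itself is an assumption not specified here.
    """
    n = n_support + 1                     # support tokens + 1 query token
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_support, :n_support] = True   # support -> support
    mask[n_support, :n_support] = True    # query -> support
    return mask                           # query column stays all-False
```

For example, `icl_attention_mask(4)` yields a 5x5 mask whose last column is entirely `False`, so no support token ever conditions on the query during the forward pass.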
## Quick Start

```python
from PIL import Image

from PictSure import PictSure

# Load a pretrained PictSure model from Hugging Face
model = PictSure.from_pretrained("pictsure/pictsure-vit")

# A handful of labeled context examples (two classes: 0 = cat, 1 = dog)
context_images = [
    Image.open("cat1.jpg"),
    Image.open("cat2.jpg"),
    Image.open("dog1.jpg"),
    Image.open("dog2.jpg"),
]
context_labels = [0, 0, 1, 1]
model.set_context_images(context_images, context_labels)

# Classify a new image in a single forward pass -- no fine-tuning
prediction = model.predict(Image.open("unknown.jpg"))
```

## Citation

```bibtex
@article{schiesser2025pictsure,
  title={PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers},
  author={Schiesser, Lukas and Wolff, Cornelius and Haas, Sophie and Pukrop, Simon},
  journal={arXiv preprint arXiv:2506.14842},
  year={2025}
}
```

## Articles

- [Models now on Hugging Face](https://pictsure.eu/articles/modelrelased/): Our pretrained PictSure models are now live on Hugging Face.
- [Paper Preprint published](https://pictsure.eu/articles/firstpaper/): A quick overview of the vision-only ICL classifier and what we're releasing.

## Optional

- [Full content](https://pictsure.eu/llms-full.txt): Complete plain-text content of all articles for LLM consumption