Few-Shot, Vision-Only ICL
A compact transformer that classifies images in-context using only visual embeddings — no gradient updates, no language supervision. Strong out-of-domain performance depends critically on how the embedding model is pretrained.
A sequence of support tokens (image + label) and a query token enters a 4-block Transformer encoder with asymmetric attention: support tokens attend to other support tokens, the query attends to all support tokens, and support tokens do not attend to the query. The query's output representation is fed to a classification head that predicts its label.
Details in the paper.
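The asymmetric attention pattern above can be expressed as a per-token mask. The sketch below is illustrative only, not the released implementation: it assumes PyTorch, a convention where True marks an allowed attention edge, and that the query may attend to itself (the paper summary above does not specify this).

import torch

def build_icl_attention_mask(num_support: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed) for num_support support
    tokens followed by a single query token."""
    n = num_support + 1
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_support, :num_support] = True  # support tokens attend to each other
    mask[num_support, :num_support] = True   # query attends to all support tokens
    mask[num_support, num_support] = True    # query attends to itself (assumption)
    # Support rows leave the query column False: support never attends to the query.
    return mask

print(build_icl_attention_mask(4))

Depending on the attention implementation, such a mask may need to be inverted or converted to an additive form before use.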
# Quick start: few-shot classification from in-context examples
from PictSure import PictSure
from PIL import Image

# Load the pretrained PictSure model with a ViT embedding backbone
model = PictSure.from_pretrained("pictsure/pictsure-vit")

# Support set: a handful of labeled example images
context_images = [
    Image.open("cat1.jpg"),
    Image.open("cat2.jpg"),
    Image.open("dog1.jpg"),
    Image.open("dog2.jpg"),
]
context_labels = [0, 0, 1, 1]  # 0 = cat, 1 = dog

# Register the in-context examples with the model
model.set_context_images(context_images, context_labels)

# Classify a new image against the support set, with no gradient updates
test_image = Image.open("unknown_animal.jpg")
prediction = model.predict(test_image)
print(f"Predicted class: {prediction}")
All PictSure models and inference code are fully open source and available for use and modification. The training script will be released soon.
Each model differs only in the embedding backbone and pretraining method; all share the same ICL transformer and label-in-token design.
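A plausible reading of the label-in-token design is that each support token fuses the backbone's image embedding with an embedding of its label before entering the ICL transformer, while the query token carries the image embedding alone. The sketch below illustrates that reading under stated assumptions; the class name, embedding dimension, and additive fusion are hypothetical and not taken from the released code.

import torch
import torch.nn as nn

class SupportTokenizer(nn.Module):
    """Fuses a visual embedding with a label embedding into one support token.
    Dimensions and additive fusion are illustrative assumptions."""
    def __init__(self, embed_dim: int = 768, num_classes: int = 10):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, embed_dim)

    def forward(self, image_embeds: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # image_embeds: (num_support, embed_dim) from the frozen backbone
        # labels:       (num_support,) integer class ids
        return image_embeds + self.label_embed(labels)

# Example: 4 support images, 768-d embeddings, labels 0/0/1/1
tok = SupportTokenizer()
support = tok(torch.randn(4, 768), torch.tensor([0, 0, 1, 1]))
print(support.shape)  # torch.Size([4, 768])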
Hugging Face Models
GitHub Repository