A curious boy encountered a unique challenge when he collected several unlabeled images with his smartphone in the Amazon jungle. Tasked with identifying the diverse bird species within these images, he faced a daunting task, especially without the prior knowledge of species names that ornithologists typically provide.
To address this challenge, we introduce the FineR system. This novel solution empowers the boy not only to identify but also to effectively classify the various bird species captured by his smartphone camera. FineR is designed to democratize fine-grained visual recognition (FGVR), freeing it from its dependence on specialized expert knowledge.
Identifying subordinate-level categories from images is a longstanding task in computer vision, referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications, since an average layperson does not excel at differentiating species of birds or mushrooms, given the subtle differences among the species. A major bottleneck in developing FGVR systems is the need for high-quality, paired expert annotations.
To circumvent the need for expert knowledge, we propose Fine-grained Semantic Category Reasoning (FineR), which internally leverages the world knowledge of large language models (LLMs) as a proxy to reason about fine-grained category names. In detail, to bridge the modality gap between images and the LLM, we extract part-level visual attributes from images as text and feed that information to the LLM. Based on the visual attributes and its internal world knowledge, the LLM reasons about the subordinate-level category names.
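To make this step concrete, below is a minimal sketch of attribute-to-name reasoning, assuming an OpenAI-style chat API; the prompt wording, model choice, and attribute format are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch (not FineR's exact prompts) of reasoning about candidate
# fine-grained class names from part-level visual attributes expressed as text.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reason_category_names(attributes: dict, superclass: str = "bird") -> str:
    """Ask the LLM for candidate subordinate-level category names,
    given per-part visual attributes extracted from an image."""
    attribute_text = "\n".join(f"- {part}: {desc}" for part, desc in attributes.items())
    prompt = (
        f"The following visual attributes describe a {superclass}:\n"
        f"{attribute_text}\n"
        "Based on these attributes and your world knowledge, list the three "
        "most likely fine-grained species names, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: attributes a VQA model might return for a bird photo
candidates = reason_category_names(
    {"beak": "short, conical, bright orange", "wings": "black with white wing bars"}
)
print(candidates)
```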
Our training-free FineR outperforms several state-of-the-art FGVR methods and language-and-vision assistant models, and shows promise for working in the wild and in new domains where gathering expert annotations is arduous.
Our goal is to discover and recognize fine-grained categories from unlabelled images, without access to expert annotations or the ground-truth categories present in the incoming images. The challenges in such an FGVR task are thus to first identify (or discover) the classes that exist in the unlabelled images from a few observations, and then to assign a class name to each incoming instance.
The idea underpinning our proposed FineR system is that LLMs, which already encode world knowledge about fine-grained categories such as species of animals and plants, can be leveraged to reason about candidate class names. Subsequently, the discovered candidate class names and the images are used to build a multi-modal classifier that classifies test instances using a vision-language model (VLM) such as CLIP.
FineR operates in three phases: (i) Translating Useful Visual Information from Visual to Textual Modality; (ii) Fine-grained Semantic Category Reasoning in Language; and (iii) Multi-modal Classifier Construction. An overview of the FineR system is shown below.
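As an illustration of phase (iii), here is a minimal sketch of zero-shot classification over the discovered names with CLIP via HuggingFace Transformers; the candidate names, prompt template, and checkpoint are placeholders, and this omits the multi-modal fusion that FineR additionally performs.

```python
# A minimal sketch of building a zero-shot classifier from discovered class
# names with CLIP. Placeholder names and prompts; not the full FineR classifier.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidate_names = ["Blackberry Lily", "Tiger Lily", "Orange Daylily"]  # from the LLM
prompts = [f"a photo of a {name}, a type of flower" for name in candidate_names]

image = Image.open("test_flower.jpg")  # a test instance
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity over candidates
prediction = candidate_names[probs.argmax().item()]
```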
We employ two complementary metrics: Clustering Accuracy (cACC) and Semantic Similarity (sACC). cACC evaluates how well the model clusters images from the same category together, but does not consider the semantics of the cluster labels. This gap is filled by sACC, which uses Sentence-BERT to measure the semantic similarity between the name assigned to each cluster and the ground-truth category name.
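A minimal sketch of how these two metrics are typically computed: cACC via Hungarian matching between cluster assignments and ground-truth labels, and sACC as Sentence-BERT cosine similarity between predicted and ground-truth names; the encoder checkpoint and exact averaging scheme here are assumptions.

```python
# Sketch of cACC (Hungarian matching) and sACC (Sentence-BERT similarity).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer, util

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """cACC: best one-to-one mapping between predicted clusters and classes."""
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1  # count co-occurrences of cluster p and class t
    rows, cols = linear_sum_assignment(cost, maximize=True)
    return cost[rows, cols].sum() / len(y_true)

def semantic_accuracy(pred_names: list, gt_names: list) -> float:
    """sACC: mean cosine similarity of predicted vs. ground-truth names."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
    pred_emb = encoder.encode(pred_names, convert_to_tensor=True)
    gt_emb = encoder.encode(gt_names, convert_to_tensor=True)
    return util.cos_sim(pred_emb, gt_emb).diagonal().mean().item()
```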
We benchmarked our FineR system against constructed baselines and state-of-the-art methods on the task of FGVR without expert knowledge, using five fine-grained datasets: Caltech-UCSD Bird-200, Stanford Car-196, Stanford Dog-120, Flower-102, and Oxford-IIIT Pet-37.
Echoing our initial motivation of democratizing FGVR, we conducted a human study to establish layperson-level baselines on the Car-196 and Pet-37 datasets, while the ground-truth class names serve as the expert-level baseline (or upper bound).
We visualize and analyze the predictions of our FineR system and the compared methods.
To further investigate the FGVR capability of FineR on more novel concepts, we introduce a new Pokemon dataset comprising 10 Pokemon characters, sourced from Pokedex and Google Image Search. In stark contrast to the compared methods, FineR successfully discovers 7/10 ground-truth Pokemon categories, nearly reaching the upper-bound performance.
Reverse comparison of prediction results for the "Blackberry Lily" image (upper-left corner) in Flower-102. We evaluate the visual counterparts associated with the predicted semantic concepts. To conduct this comparison, we employ two distinct methods for inversely identifying their visual counterparts: (i) Google Image Search: we query and fetch images paired with the predicted class names from Google; (ii) Stable Diffusion: we use the predicted semantic class names as text prompts to generate semantically-conditioned images with Stable Diffusion. Partially correct and wrong predictions are color-coded. None of the methods correctly predicts the ground-truth label. However, the visual counterparts inversely predicted by FineR are highly similar to the ground-truth ones, because FineR captures the useful semantic attribute "Orange-spotted".
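For the Stable Diffusion side of this reverse comparison, a minimal sketch using the `diffusers` library is shown below; the checkpoint, prompt template, and predicted name are assumptions, not necessarily what was used for the figure.

```python
# Sketch: generate a semantically-conditioned image from a predicted class name.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

predicted_name = "Orange-spotted Lily"  # hypothetical predicted class name
image = pipe(f"a photo of a {predicted_name}, a type of flower").images[0]
image.save("reverse_prediction.png")
```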
We conducted an ablation analysis of the main components of the proposed FineR system.
We explore the impact of the hyperparameter α on multi-modal fusion during classifier construction.
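As a hedged illustration of what this fusion might look like, the sketch below blends text-based and image-based similarity scores with a convex combination weighted by α; the exact fusion in FineR may differ, and `alpha`, `text_sims`, and `image_sims` are illustrative names.

```python
# Sketch of alpha-weighted multi-modal score fusion (illustrative, not exact).
import torch

def fuse_scores(text_sims: torch.Tensor, image_sims: torch.Tensor,
                alpha: float = 0.7) -> torch.Tensor:
    """Blend text and image similarity scores with weight alpha."""
    return alpha * text_sims + (1.0 - alpha) * image_sims

# e.g., per-candidate similarities for one test image
text_sims = torch.tensor([0.31, 0.28, 0.24])
image_sims = torch.tensor([0.26, 0.33, 0.22])
prediction = fuse_scores(text_sims, image_sims).argmax().item()
```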
Additionally, we examine the effect of the number of sample augmentations K in mitigating visual bias caused by limited sample sizes.
We also analyze the system's performance under varying numbers of unlabeled images per category for class name discovery.
Finally, we assess the influence of the CLIP VLM model size on system performance, showing that FineR scales with larger models.
@inproceedings{liu2024democratizing,
title={Democratizing Fine-grained Visual Recognition with Large Language Models},
author={Mingxuan Liu and Subhankar Roy and Wenjing Li and Zhun Zhong and Nicu Sebe and Elisa Ricci},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=c7DND1iIgb}
}