A curious boy encountered a unique challenge when he collected several unlabeled images with his smartphone in the Amazon jungle. Tasked with identifying the diverse bird species within these images, he faced a daunting task, especially without the prior knowledge of species names that ornithologists typically provide.
To address this challenge, we introduce the FineR system. This novel solution empowers the boy not only to identify but also to effectively classify the various bird species captured by his smartphone camera. FineR is designed to democratize fine-grained visual recognition (FGVR), freeing it from its dependence on specialized expert knowledge.
Identifying subordinate-level categories from images is a longstanding task in computer vision, referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications, since an average layperson does not excel at differentiating species of birds or mushrooms, given the subtle differences among the species. A major bottleneck in developing FGVR systems is the need for high-quality, paired expert annotations.
To circumvent the need for expert knowledge, we propose Fine-grained Semantic Category Reasoning (FineR), which internally leverages the world knowledge of large language models (LLMs) as a proxy to reason about fine-grained category names. In detail, to bridge the modality gap between images and the LLM, we extract part-level visual attributes from images as text and feed that information to the LLM. Based on the visual attributes and its internal world knowledge, the LLM reasons about the subordinate-level category names.
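To make this step concrete, below is a minimal sketch of attribute-to-name reasoning, assuming an OpenAI-style chat API; the prompt wording, model choice, and attribute format are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch (not FineR's exact prompts) of reasoning about candidate
# fine-grained class names from part-level visual attributes expressed as text.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reason_category_names(attributes: dict, superclass: str = "bird") -> str:
    """Ask the LLM for candidate subordinate-level category names,
    given per-part visual attributes extracted from an image."""
    attribute_text = "\n".join(f"- {part}: {desc}" for part, desc in attributes.items())
    prompt = (
        f"The following visual attributes describe a {superclass}:\n"
        f"{attribute_text}\n"
        "Based on these attributes and your world knowledge, list the three "
        "most likely fine-grained species names, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: attributes a VQA model might return for a bird photo
candidates = reason_category_names(
    {"beak": "short, conical, bright orange", "wings": "black with white wing bars"}
)
print(candidates)
```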
Our training-free FineR outperforms several state-of-the-art FGVR methods and language-and-vision assistant models, and shows promise for working in the wild and in new domains where gathering expert annotations is arduous.
Our goal is to discover and recognize fine-grained categories from unlabelled images, without access to expert annotations or the ground-truth categories present in the incoming images. The challenges in such an FGVR task are thus to first identify (or discover) the classes that exist in the unlabelled images from a few observations, and then to assign a class name to each incoming instance.
The idea underpinning our proposed FineR system is that LLMs, which already encode world knowledge about fine-grained categories such as species of animals and plants, can be leveraged to reason about candidate class names. Subsequently, the discovered candidate class names and the images are used to build a multi-modal classifier that classifies test instances using a vision-language model (VLM) such as CLIP.
FineR operates in three phases: (i) Translating Useful Visual Information from Visual to Textual Modality; (ii) Fine-grained Semantic Category Reasoning in Language; and (iii) Multi-modal Classifier Construction. An overview of the FineR system is shown below.
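As an illustration of phase (iii), here is a minimal sketch of zero-shot classification over the discovered names with CLIP via HuggingFace Transformers; the candidate names, prompt template, and checkpoint are placeholders, and this omits the multi-modal fusion that FineR additionally performs.

```python
# A minimal sketch of building a zero-shot classifier from discovered class
# names with CLIP. Placeholder names and prompts; not the full FineR classifier.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidate_names = ["Blackberry Lily", "Tiger Lily", "Orange Daylily"]  # from the LLM
prompts = [f"a photo of a {name}, a type of flower" for name in candidate_names]

image = Image.open("test_flower.jpg")  # a test instance
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity over candidates
prediction = candidate_names[probs.argmax().item()]
```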
We employ two complementary metrics: Clustering Accuracy (cACC) and Semantic Similarity (sACC). cACC evaluates how well the model clusters images from the same category together, but does not consider the semantics of the cluster labels. This gap is filled by sACC, which uses Sentence-BERT to measure the semantic similarity between the name assigned to each cluster and the ground-truth category name.
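A minimal sketch of how these two metrics are typically computed: cACC via Hungarian matching between cluster assignments and ground-truth labels, and sACC as Sentence-BERT cosine similarity between predicted and ground-truth names; the encoder checkpoint and exact averaging scheme here are assumptions.

```python
# Sketch of cACC (Hungarian matching) and sACC (Sentence-BERT similarity).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer, util

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """cACC: best one-to-one mapping between predicted clusters and classes."""
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1  # count co-occurrences of cluster p and class t
    rows, cols = linear_sum_assignment(cost, maximize=True)
    return cost[rows, cols].sum() / len(y_true)

def semantic_accuracy(pred_names: list, gt_names: list) -> float:
    """sACC: mean cosine similarity of predicted vs. ground-truth names."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
    pred_emb = encoder.encode(pred_names, convert_to_tensor=True)
    gt_emb = encoder.encode(gt_names, convert_to_tensor=True)
    return util.cos_sim(pred_emb, gt_emb).diagonal().mean().item()
```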
We benchmarked our FineR system against constructed baselines and state-of-the-art methods on the task of FGVR without expert knowledge, using five fine-grained datasets: Caltech-UCSD Bird-200, Stanford Car-196, Stanford Dog-120, Flower-102, and Oxford-IIIT Pet-37.
Echoing our initial motivation of democratizing FGVR, we conducted a human study to establish layperson-level baselines on the Car-196 and Pet-37 datasets, while the ground-truth class names serve as the expert-level baseline (or upper bound).
We visualize and analyze the predictions of our FineR system and the compared methods.
To further investigate the FGVR capability of FineR on more novel concepts, we introduce a new Pokemon dataset comprising 10 Pokemon characters, sourced from Pokedex and Google Image Search. In stark contrast to the compared methods, FineR successfully discovers 7/10 ground-truth Pokemon categories, nearly reaching the upper-bound performance.
Reverse comparison of prediction results for the "Blackberry Lily" image (upper-left corner) in Flower-102. We evaluate the visual counterparts associated with the predicted semantic concepts. To conduct this comparison, we employ two distinct methods for inversely identifying their visual counterparts: (i) Google Image Search: we query and fetch images paired with the predicted class names from Google; (ii) Stable Diffusion: we use the predicted semantic class names as text prompts to generate semantically-conditioned images with Stable Diffusion. Partially correct and wrong predictions are color-coded. None of the methods correctly predicts the ground-truth label. However, the visual counterparts inversely predicted by FineR are highly similar to the ground-truth ones, because FineR captures the useful semantic attribute "Orange-spotted".
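For the Stable Diffusion side of this reverse comparison, a minimal sketch using the `diffusers` library is shown below; the checkpoint, prompt template, and predicted name are assumptions, not necessarily what was used for the figure.

```python
# Sketch: generate a semantically-conditioned image from a predicted class name.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

predicted_name = "Orange-spotted Lily"  # hypothetical predicted class name
image = pipe(f"a photo of a {predicted_name}, a type of flower").images[0]
image.save("reverse_prediction.png")
```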
We conducted an ablation analysis of the main components of the proposed FineR system.
We explore the impact of the hyperparameter α on multi-modal fusion during classifier construction.
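As a hedged illustration of what this fusion might look like, the sketch below blends text-based and image-based similarity scores with a convex combination weighted by α; the exact fusion in FineR may differ, and `alpha`, `text_sims`, and `image_sims` are illustrative names.

```python
# Sketch of alpha-weighted multi-modal score fusion (illustrative, not exact).
import torch

def fuse_scores(text_sims: torch.Tensor, image_sims: torch.Tensor,
                alpha: float = 0.7) -> torch.Tensor:
    """Blend text and image similarity scores with weight alpha."""
    return alpha * text_sims + (1.0 - alpha) * image_sims

# e.g., per-candidate similarities for one test image
text_sims = torch.tensor([0.31, 0.28, 0.24])
image_sims = torch.tensor([0.26, 0.33, 0.22])
prediction = fuse_scores(text_sims, image_sims).argmax().item()
```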
Additionally, we examine the effect of the number of sample augmentations K in mitigating visual bias caused by limited sample sizes.
We also analyze the system's performance under varying numbers of unlabeled images per category for class name discovery.
Finally, we assess the influence of the CLIP VLM model size on system performance, showing that FineR scales with larger models.
@inproceedings{liu2024democratizing,
title={Democratizing Fine-grained Visual Recognition with Large Language Models},
author={Mingxuan Liu and Subhankar Roy and Wenjing Li and Zhun Zhong and Nicu Sebe and Elisa Ricci},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=c7DND1iIgb}
}