Multimodal Conditional Retrieval with High Controllability

Abstract

Text-based image search is limited because language struggles to express certain abstract intentions; artistic styles, for example, are difficult for non-experts to describe. Image-based image search has the opposite problem: a query image can convey abstract intentions but cannot state a specific purpose, so most existing image-based search methods serve only a single function, such as content search or style search. Our work aims to combine the strengths of both, merging the ability of text to express specific ideas with the ability of images to convey abstract concepts, and thereby to capture the user's intentions more accurately. To this end, we propose CCSR, a multimodal conditional content-style joint retrieval model. Our model is the first to apply contrastive learning to conditional retrieval, and it introduces a novel mixture-of-experts (MoE) system that enables collaboration between multiple expert systems. We also adopt a novel prompt learning strategy that allows the model to adaptively select task-specific prompts, thereby sharpening its focus on the current task. In addition, to evaluate the joint content-style retrieval capability of our model, we present a new dataset, StyleCoco, which contains rich content categories and style categories. Experimental results indicate that CCSR achieves state-of-the-art performance in conditional style retrieval, content retrieval, and style-content retrieval. The dataset and code will be publicly available on this site.

Note: The dataset and core code will be released after publication.