Description

The Text and Vision (TVGraz) dataset is an annotated multi-modal dataset that currently contains 10 visual object categories and 4030 images with associated text. The visual appearance of the objects in the dataset is challenging and offers a less biased benchmark. The objective of the dataset is to provide a common means of evaluating object categorization research based on text and vision.

The archive "TVGraz_script.tar.gz" contains a Python script named "download_TVGRAZ_dataset.py". When executed, it downloads the TVGraz images and text from their respective URLs, according to the "category_list.txt" file. After downloading, the textual data is in raw format, per category and per image.

Download: TVGraz dataset capturing tool

TVGraz: Multi-Modal Learning of Object Categories by Combining Textual and Visual Features
Inayatullah Khan, Amir Saffari, and Horst Bischof
In Proc. Workshop of the Austrian Association for Pattern Recognition, 2009
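
The download step described above (fetch each image and its text from a URL, grouped by category) can be sketched roughly as follows. This is not the code of "download_TVGRAZ_dataset.py", and the real layout of "category_list.txt" is not documented here. The tab-separated line format, the function names (parse_manifest, fetch_url, download_dataset) and the output file naming are all assumptions made for illustration.

```python
import os
import urllib.request


def parse_manifest(lines):
    """Parse lines of the ASSUMED form 'category<TAB>image_url<TAB>text_url'.

    The real category_list.txt format may differ; adjust this parser to match it.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        category, image_url, text_url = line.split("\t")
        entries.append((category, image_url, text_url))
    return entries


def fetch_url(url, timeout=30):
    """Return the raw bytes served at url."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_dataset(entries, out_dir, fetch=fetch_url):
    """Save each image and its raw text under out_dir/<category>/.

    Files are named by a per-category running index (hypothetical naming).
    Returns the list of URLs that could not be fetched.
    """
    failed = []
    counters = {}
    for category, image_url, text_url in entries:
        index = counters.get(category, 0)
        counters[category] = index + 1
        category_dir = os.path.join(out_dir, category)
        os.makedirs(category_dir, exist_ok=True)
        for url, suffix in ((image_url, ".img"), (text_url, ".txt")):
            try:
                data = fetch(url)
            except OSError:  # urllib's URLError and timeouts derive from OSError
                failed.append(url)
                continue
            path = os.path.join(category_dir, f"{index:04d}{suffix}")
            with open(path, "wb") as handle:
                handle.write(data)  # text is kept raw, exactly as served
    return failed
```

Under these assumptions, a run would look like `download_dataset(parse_manifest(open("category_list.txt")), "TVGraz")`. Because the content is fetched from third-party URLs, some entries may fail, so the sketch records failed URLs and continues instead of aborting.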