Ingemar J. Cox
Thomas V. Papathomas - Joumana Ghosn
Peter N. Yianilos - Matt L. Miller
Systems that retrieve images based on their content must in some way codify these images so that judgments and inferences may be made in a systematic fashion. The ultimate encoding would somehow capture an image's semantic content in a way that corresponds well to human interpretation. By contrast, the simplest encoding consists of the image's raw pixel values. Intermediate between these two extremes is a spectrum of possibilities, with most work in the area focusing on low-level features, i.e., straightforward functions of the raw pixel values (see [13,15,3,4,6,9,8,10] and many others [11,16,17,18]). Some such features, such as color, begin to capture an image's semantics, but at best they represent a dim reflection of the image's true meaning.
The ultimate success of content-based image retrieval systems will likely depend on the discovery of effective and practical approaches at a much higher level. In this paper we report conceptual and experimental progress towards this objective.
Any attempt to codify image semantics inevitably leads to design of a language with which to express them. If a human operator is required to formulate a query using this language, and interpret a database image's description in terms of the language, two serious problems arise. First, the language must not only be effective in theory, but must also serve as a natural tool with which a human can express a query. Second, inaccurate or inconsistent expression of each database image in terms of the language can lead to confusion on the part of the user, and ultimately undermine the effectiveness of, and confidence in, the system. The need for accurate and consistent expression can also limit the language's design.
For these reasons we are led to study hidden languages for semantic encoding, and in particular hidden boolean attributes affixed to each database image.
Our ability to follow this research direction is made possible by the general navigational paradigm introduced in  and used by the PicHunter image retrieval system (see  for other learning-based work). With this approach a user navigates through a database by selecting similar images from the set currently displayed. No explicit query is formulated. Instead, the system chooses the next display set based on the user's earlier selections. All earlier selections influence the system's next choice, not just the most recent user response. This takes place within a simple Bayesian relevance-feedback framework: starting from a uniform prior, the system evaluates the probability that an image is the user's target given the user's actions by instead learning to predict those actions conditioned on a presumptive target.
Thus the focus is shifted entirely to the task of learning a predictive model to explain the user's selections. The significance of this shift is that this model can rely on information beyond that which the user sees. In particular, the system's model can rely on hidden attributes affixed to each image.
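The Bayesian update described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the user model `toy_likelihood` and its index-based similarity are hypothetical stand-ins for the learned predictive model.

```python
import numpy as np

def bayes_update(posterior, displayed, selected, likelihood):
    """One round of relevance feedback: multiply the current belief over
    targets by the likelihood of the observed selection, then renormalize."""
    for t in range(len(posterior)):
        posterior[t] *= likelihood(displayed, selected, t)
    posterior /= posterior.sum()
    return posterior

# Hypothetical user model: images whose index is "close" to the target's
# index are more likely to be selected (purely illustrative similarity).
def toy_likelihood(displayed, selected, t):
    p = 1.0
    for i in displayed:
        sim = 1.0 / (1.0 + abs(i - t))
        p *= sim if i in selected else (1.0 - 0.5 * sim)
    return p

N = 100                               # toy database size
posterior = np.full(N, 1.0 / N)       # uniform prior over targets
posterior = bayes_update(posterior, displayed=[10, 50, 90],
                         selected={50}, likelihood=toy_likelihood)
```

After the user selects image 50 out of the three displayed, the posterior mass concentrates on targets that make that selection likely; iterating this update over successive displays is what drives the navigation.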
As a result, we are free to consider attribute schemes that might not work well in a traditional non-hidden approach. We might, for example, use a scheme that employs 10,000 attributes, far more than a human operator could reasonably be expected to deal with. Moreover some of these attributes might correspond to complex semantic concepts that are not easily explained, or to overlapping concepts that do not fit well into the kind of hierarchies that humans frequently prefer. They might even include entirely artificial attributes that arise from a machine learning algorithm. Because the attributes are hidden, it may be that the system performs well despite considerable error in the assignment of attributes. For this reason we are free to consider attributes even if their proper identification seems very difficult.
The overall implementation of a hidden attribute approach may be divided into two components: the design of the schema of attributes, and the approach taken to assigning attribute values to each database image. Both of these contain rich opportunities for future work. PicHunter's  use of low-level image statistics may be viewed as a hidden attribute approach. This paper represents a first step intended to help establish the general approach's potential at a higher semantic level by focusing on a particularly simple case.
A set of approximately 125 semantic attributes was chosen and values were assigned manually to each image in our experimental database. In some sense this might be viewed as a best-case scenario since the schema is hand-designed, and the values are assigned by humans. However some existing commercial collections of images include such schemes and annotations, so beyond providing justification for future work, our positive experimental results may be of immediate practical significance.
We remark that there are errors and inconsistencies even in attributes assigned by humans. Here, the fact that the attribute values are hidden can result in more robust performance in the presence of error. We also observe that in some settings, such as the emerging area of Internet Web publication, authors are implicitly annotating their images by their choice of text to surround them. Exploiting this textual proximity represents an immediate and interesting direction for future work. This general direction is explored in [14,1].
It is not clear how high in the semantic sense our approach of hidden attributes might reach. It is certainly conceivable that a large portion of an image's semantic content might be captured by a sufficiently large and rich collection of attributes - obviating the need to produce a single succinct and coherent expression of an image's meaning.
Section 2 of this paper describes our set of attributes, the manner in which their values were assigned, and other aspects of the experimental setup. Section 3 summarizes the results. In section 4 final remarks are made regarding these experiments and broader issues as well.
Our experiments compare the performance of PicHunter based on low-level non-verbal features only (such as color content, contrast, brightness, edge content, etc.), with a new version that incorporates a vector of verbal semantic attributes (such as ``sky'', ``hill'', ``person'', ``city'', ``bird'', etc.).
A system of approximately 125 keywords was identified based on knowledge of our experimental database of 1,500 images. Each image was then visually examined and all relevant keywords identified. An additional set of category keywords was then assigned automatically. For example, the ``lion'' attribute causes the category attribute ``animal'' to be present. Altogether there are 134 attributes. These supplement the 18 low-level features used by the basic PicHunter version, and described in . The 134 semantic attributes are regarded as a boolean vector, and normalized Hamming distance combines their influence to form, in effect, a 19th PicHunter feature.
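The two mechanisms in this paragraph, the normalized Hamming distance over the boolean attribute vector and the automatic category closure, can be sketched as follows. The toy 8-bit vectors and the `IMPLIES` table are illustrative only; the paper uses 134 attributes and its own keyword-to-category rules.

```python
import numpy as np

def semantic_distance(a, b):
    """Normalized Hamming distance between two boolean attribute vectors:
    the fraction of attributes on which the two images disagree, in [0, 1]."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return np.count_nonzero(a != b) / a.size

# Illustrative keyword -> category rules (the paper's example: lion -> animal).
IMPLIES = {"lion": {"animal"}, "rose": {"flower"}}

def close_categories(keywords):
    """Add the category keywords implied by the specific keywords present."""
    out = set(keywords)
    for k in keywords:
        out |= IMPLIES.get(k, set())
    return out

# Two toy 8-attribute annotations (the real vectors have 134 entries).
x = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=bool)
y = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=bool)
d = semantic_distance(x, y)   # 2 differing bits out of 8
```

Because the distance is normalized by the vector length, it can be weighted against the 18 low-level features on a comparable scale, which is what lets it act as a 19th feature.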
The PicHunter user interface is particularly Spartan. Nine candidate images are displayed along with three buttons used to abort the search, signal that the search is successful, or request that the system display another nine candidates. Prior to requesting additional candidates the user selects the subset of the nine visible images that he/she regards as most similar to the target image.
Our experiments implement the target testing model of  in which the user seeks to locate a given target under the user interface described above. Performance is measured by the number of display iterations required to locate the target image. That is, how many nine-image displays the system had to present before the sought-after image appeared.
The primary purpose of our experiments is to compare the performance of the original version of PicHunter and an annotated version. The secondary goal is to examine whether user performance improves after the user receives an explanation of the particular features in use. For notational purposes, we refer to the original version as ``Orig.'' The version using semantic attributes is denoted ``Sem.'' The experimental step consisting of explaining a feature set to the user is denoted ``Expl.''
All experiments were conducted on 1280x1024-pixel color monitors, driven by Silicon Graphics Indigo2 workstations. The monitor screen measured 38 cm by 29 cm, and was viewed from a distance of 70 cm. Individual images were either in ``portrait'' (4.83 x 7.25 cm on the screen) or in ``landscape'' (7.25 x 4.83 cm) format. They were padded with dark pixels either horizontally or vertically to form square icons. The images in the database  were copied from a set of CDs by Corel Inc., each CD containing 100 images. Each image is referred to by its unique identification number, which is denoted by ``ID'' in this paper.
Eight users, labeled A to H, participated in this experiment. Users were tested for color blindness using Ishihara test plates and found to have normal color vision. All users also had normal or corrected-to-normal vision with regard to acuity.
There were two major phases in this experiment. Each phase involved the same 17 target images that users had to converge to. In the first phase, users were told to select images that they thought were similar to the target, without being told what to base their judgment of similarity on. There were two groups of four users in this first phase. The first group, G1 = {A,B,C,D}, used the original (``standard'') PicHunter first and then the semantic (``word'') version, while this order was reversed for the other group, G2 = {E,F,G,H}. Before embarking on the second phase, users were divided into two new groups of four, G3 and G4, chosen to balance performance based on the first phase. Toward this goal, we first constrained the new groups so that each had exactly two users from each of G1 and G2, to balance previous exposure. Second, among all partitions satisfying this constraint, we selected the one whose two groups differed as little as possible in both mean performance and the standard deviation around that mean. Thus, the new groups were G3 = {1,2,5,6} and G4 = {3,4,7,8}, where 1,2,3,4 and 5,6,7,8 are permutations of A,B,C,D and E,F,G,H, respectively, that minimized the differences of means and standard deviations between G3 and G4.
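The constrained balancing step can be written as a small exhaustive search, since only 6 × 6 = 36 candidate partitions satisfy the two-from-each-group constraint. This is a sketch under assumptions: the per-user score (e.g., mean displays to target in phase 1) and the equal-weight cost combining the mean and standard-deviation differences are our illustrative choices, not necessarily the authors' exact criterion.

```python
from itertools import combinations
from statistics import mean, stdev

def balance_groups(scores_g1, scores_g2):
    """Split 8 users into two groups of 4, taking exactly two users from
    each original group, so that the groups' mean scores and standard
    deviations differ as little as possible.

    scores_g1, scores_g2 : dicts mapping user label -> performance score.
    Returns the pair of groups (as sets of labels) minimizing the cost."""
    users = {**scores_g1, **scores_g2}
    best, best_cost = None, float("inf")
    for pick1 in combinations(scores_g1, 2):      # two users from G1
        for pick2 in combinations(scores_g2, 2):  # two users from G2
            g3 = set(pick1) | set(pick2)
            g4 = set(users) - g3
            s3 = [users[u] for u in g3]
            s4 = [users[u] for u in g4]
            cost = abs(mean(s3) - mean(s4)) + abs(stdev(s3) - stdev(s4))
            if cost < best_cost:
                best, best_cost = (g3, g4), cost
    return best

# Hypothetical phase-1 scores (lower = faster searches).
scores_g1 = {"A": 10, "B": 20, "C": 30, "D": 40}
scores_g2 = {"E": 12, "F": 22, "G": 32, "H": 42}
g3, g4 = balance_groups(scores_g1, scores_g2)
```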
Subsequently, the second phase consisted of first giving each user instructions for judging image similarity, based on the algorithm's user model, and then letting them go through the picture-search process as before. Both the original and the semantic versions were used in the second phase as well. The sequence of versions was selected for each observer so as to obtain an overall balanced experimental design. Table 1 below shows the sequence of experiments for each observer. As can be seen from this table, pairs of users were subjected to the same experimental conditions. Phase 1 consisted of steps 1 and 2, whereas phase 2 comprised steps 3-6. Again, half of the users used the original (``standard'') PicHunter first and then the semantic (``word'') version, each preceded by an explanation, while this order was reversed for the rest of the users.
Before a session with the original version of PicHunter in the second phase, users were asked to base similarity on image appearance (color, brightness, contrast, sharpness, etc.) and to ignore the image's semantic contents, i.e., the objects, animals, people, flowers, trees, cities, buildings, etc. They were told to look at the image as if they were a machine that cannot extract any meaning from images: one with a good camera and a computer that can estimate color content, brightness, contrast, sharpness, etc., but that cannot express in words what the image contains. They were also made aware of the priority of the features in the user model, from the most to the least important, according to .
Similarly, before a session with the semantic version of PicHunter in the second phase, users were told to base similarity not only on image appearance, but also on image semantic contents, as one would describe them by words. In addition, they were given the list of representative semantic labels shown in Table 2, to suggest the level of semantic ``resolution''.
Our experimental results are given in tables 3 and 4. Rows correspond to target images. Columns correspond to the 8 users. Each matrix entry in position (i, j) is the number of 9-image displays it took the user corresponding to column j to converge to the target corresponding to row i. The smaller the entry, the better the performance for the corresponding row-column combination. For each matrix, we provide the row and column sums, aligned with the corresponding rows and columns. The inverse of a given row's sum indicates how well observers performed collectively on that row's image; similarly, the inverse of a column sum is a measure of the corresponding user's performance across all the images. We also show the sum of all the matrix elements as a figure of merit for the users' collective performance across all images under the conditions represented by the given matrix.
Although the experiments were designed with PicHunter in mind, their results can be applied to any image retrieval system and, more generally, to any system that involves judgment of image similarity by humans.
Searching the database linearly until the desired image is located requires about 1500/(2 x 9), i.e., roughly 83 nine-image displays on average. It is apparent that the table entries are in almost all cases much smaller than this. Moreover, the reduction in search times with the introduction of hidden semantic attributes (32% and 28%) is immediately apparent, and significant as verified by analysis of variance.
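For reference, the linear-scan baseline for the 1,500-image database works out as follows (the expected-case figure assumes the target is equally likely to be anywhere in the scan order):

```python
db_size = 1500        # images in the experimental database
per_display = 9       # images shown per iteration

worst_case = -(-db_size // per_display)   # ceiling division: displays to see every image
average = db_size / (2 * per_display)     # expected displays until the target appears
```

Any navigation scheme whose typical entry in tables 3 and 4 is well below this average is doing substantially better than an unguided scan.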
It is clear that humans pay a lot of attention to semantic content when judging image similarity - but the criteria used and the nature of the composite judgment is complex indeed. All eight users were interviewed by one of the authors following completion of the experiments. In addition, eleven other users participated in shorter PicHunter searches and related pilot studies. Without exception all reported that semantic features played a key role in their judgment. For this reason we are not surprised that performance with the annotated version of PicHunter is superior to that of the non-semantic version.
Semantically annotated images are appearing in structured environments such as medical image databases, news organization archives - and the trend seems to extend to generic electronic collections. In addition to using these annotations in a hidden fashion, mature image search systems may be hybrids that include an explicit query mechanism that corresponds to the space of available annotations. Even in query-based systems learning may play a role as illustrated by related work in the field of textual information retrieval .
Finally, the issue of feature relevancy must be addressed. Watching the 8 users' strategies in Experiment 4, we observed that test images were sometimes selected because of similarity with the target in terms of, say, color (``it has as much blue as the target''), and other times because of similarity in, say, overall brightness. To the extent that a user relies on a small number of features during a session, it may be possible to learn which features are being used and thereby improve performance. Hybrid systems might also allow explicit identification of relevant features.
The authors thank Bob Krovetz for useful discussions regarding text database search.