Scatter/Gather is a document browsing technique aimed at supporting such exploratory learning [3, 4]. The emphasis in this browsing technique is to present users with an automatically computed overview of the contents of a document collection, and to provide a method for navigating through this summary at different levels of granularity. The central primitive operation in Scatter/Gather is document clustering based on pairwise document similarity. The technique aims to place similar documents into the same cluster. Recursively clustering a collection produces a cluster hierarchy.
For each cluster, at each level of this hierarchy, the user is presented with summary information about the cluster that presumably communicates something about the kinds of documents it contains. The user may then select (gather) those clusters that seem interesting or relevant. This subset of clusters can then be reclustered (scattered) to reveal more fine-grained clusters of documents. With each successive iteration of scattering and gathering, the clusters become smaller and more detailed, eventually bottoming out at the level of individual documents.
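The scatter/gather loop can be made concrete with a minimal sketch. This is an illustration, not the authors' implementation; the cluster, summarize, and choose functions are hypothetical placeholders for the clustering algorithm, the cluster-digest computation, and the user's selection.

```python
# Minimal sketch of the Scatter/Gather interaction loop (illustrative only).
# `cluster`, `summarize`, and `choose` are hypothetical helper functions.

def scatter_gather(documents, cluster, summarize, choose, k=5, min_size=10):
    """Repeatedly scatter the working set into k clusters and gather the
    clusters the user selects, until the working set is small enough to read."""
    working_set = documents
    while len(working_set) > min_size:
        clusters = cluster(working_set, k)          # scatter: partition into k clusters
        digests = [summarize(c) for c in clusters]  # topical words + typical titles
        selected = choose(clusters, digests)        # gather: user picks relevant clusters
        working_set = [doc for c in selected for doc in c]
    return working_set
```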
In this paper we present experimental evidence to support the view that Scatter/Gather interaction induces the development of a better understanding of the contents of large text collections, refines users' formulations of search queries, and is a feasible information retrieval technique in its own right when compared to standard search techniques.
Although there are some subtle issues involved in the assessment of precision and recall measures (such as deciding how to evaluate documents as being "relevant"), even more difficulties arise when trying to assess whether a browsing system is communicating knowledge to a user about the structure of a document collection. As will become clear, this was a major challenge for our research. We had to develop multiple methodologies and analyses in order to converge on an assessment of the mental models that people form when interacting with Scatter/Gather.
< img src= "pp_fg1.gif" alt="Figure 1" >
Figure 1. The Scatter/Gather interface presents a user with clusters of like documents.
Clustering depends on some measure of inter-document similarity. A common approach [10], with a number of variants, is to represent documents as vectors of equal length, where each component of a vector is associated with one of the unique content words in the document collection. In some schemes, the component may contain a value indicating the presence of a word in the document (i.e., a binary coding). In other schemes, a vector component may indicate the frequency, or some normalized frequency, of a word in the document. The similarity of two documents can then be computed by the cosine measure: the cosine of the angle between the two vectors, sometimes also known as a normalized correlation.
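The following sketch shows the cosine measure over simple term-frequency vectors. The tokenization and weighting are illustrative assumptions; the original system uses its own term-weighting scheme.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two term-frequency vectors, built here from
    whitespace-tokenized documents (illustrative weighting, not the system's)."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```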
Single-linkage hierarchical clustering is a commonly used method outside of information access; for instance, we use it below in our data analysis. It is, however, too slow for even moderately large document collections. Beginning with individual documents, single-linkage hierarchical clustering iteratively agglomerates the most similar pair of clusters built so far into a new cluster. The global consideration of all pairwise similarities at each stage of clustering leads to running times of O(n^2) in the number, n, of documents.
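A deliberately naive sketch of single-linkage agglomeration is given below to show why the pairwise comparisons dominate the cost; this version is, if anything, slower than the O(n^2) algorithms referenced above. The similarity function and stopping criterion are assumptions for illustration.

```python
def single_linkage(similarity, n, n_clusters):
    """Naive single-linkage agglomeration: repeatedly merge the two clusters
    that contain the most similar pair of documents. `similarity(i, j)` is an
    assumed pairwise document-similarity function; runtime grows at least
    quadratically in n because all pairs are reconsidered at every step."""
    clusters = [{i} for i in range(n)]
    while len(clusters) > n_clusters:
        best = (-1.0, 0, 1)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                link = max(similarity(i, j) for i in clusters[x] for j in clusters[y])
                if link > best[0]:
                    best = (link, x, y)
        _, x, y = best
        clusters[x] |= clusters[y]   # merge the most similar pair of clusters
        del clusters[y]
    return clusters
```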
Speedier algorithms [4] were developed for Scatter/Gather based on a nonagglomerative, partitional clustering scheme. The essence of this approach is to flatly partition a collection into k subsets and recursively partition the subsets as needed to induce a hierarchical structure. At each partitioning, k document seeds are selected and the remaining documents are grouped with the most similar seed. At each stage the procedure may be iterated to improve the selection of seeds and thereby improve the clustering. The resulting schemes have running times of O(kn).
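The seed-and-assign structure of such a partitional pass is sketched below. The seed selection, similarity, and centroid functions are placeholders; the published algorithms choose seeds more carefully than random sampling.

```python
import random

def partition(documents, k, similarity, centroid, iterations=3):
    """Seed-based partitional clustering in the spirit of the O(kn) schemes:
    pick k seeds, assign every document to its most similar seed, then
    re-derive seeds from the partitions and repeat. `similarity` and
    `centroid` are assumed helper functions."""
    seeds = random.sample(documents, k)
    for _ in range(iterations):
        partitions = [[] for _ in range(k)]
        for doc in documents:                                # one pass over n documents,
            scores = [similarity(doc, s) for s in seeds]     # k comparisons each
            partitions[scores.index(max(scores))].append(doc)
        seeds = [centroid(p) if p else s for p, s in zip(partitions, seeds)]
    return partitions
```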
Linear O(kn) running times are still too slow for very large collections. Constant-time interaction costs for Scatter/Gather navigation are achieved by precomputing the cluster hierarchy used in the interaction [3]. The off-line precomputation uses the linear-time algorithms and summarizes document clusters by meta-documents containing profiles of topical words and the most typical titles.
These topical words and typical titles are also used to present users with a summary of the documents in a cluster. Topical words are those that occur most frequently in a cluster, and typical titles are those with the highest similarity to the centroid of the cluster. Together, the topical words and typical titles form a cluster digest. The cluster digest scheme, combined with the constant-time interaction cost, satisfies the two requirements desired of a cluster-based browsing method.
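A cluster digest can be computed roughly as follows. The document representation (a dict with "title" and "words" fields) and the similarity and centroid inputs are assumptions for the sketch, not the paper's data structures.

```python
from collections import Counter

def cluster_digest(cluster_docs, centroid, similarity, n_words=10, n_titles=3):
    """Build a cluster digest: the most frequent content words in the cluster
    plus the titles of the documents most similar to the cluster centroid.
    Each document is assumed to be a dict with 'title' and 'words' fields."""
    counts = Counter(w for doc in cluster_docs for w in doc["words"])
    topical_words = [w for w, _ in counts.most_common(n_words)]
    ranked = sorted(cluster_docs, key=lambda d: similarity(d, centroid), reverse=True)
    typical_titles = [d["title"] for d in ranked[:n_titles]]
    return topical_words, typical_titles
```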
Users can also select one or more clusters and then press the Show Titles button at the top of Figure 1, which displays all of the document titles in the selected clusters in a separate Show Titles window. The user can scan the displayed list of titles, seeking those that appear relevant to the task at hand. Relevant documents can be selected by pointing and then saved to a file by cut-and-paste techniques.
The experiment used the 2.2-gigabyte TIPSTER text collection created for the TREC conference [7]. Twelve topics were drawn from the first 100 topics used in the TREC conference. These twelve were chosen based on a level of difficulty measured by the number of relevant documents in the TIPSTER collection, as identified by information retrieval experts associated with TREC [7]. The four topics with the fewest expert-identified relevant documents (M = 46) were placed in the "hard" group, the four topics with the most relevant documents (M = 865) were placed in the "easy" group, and the four topics closest to the median number of relevant documents (M = 303) were placed in the "medium" group.
Participants spent between two and five hours total on study activities, with two phases a day over the course of two days. Four blocks of topics were constructed for presentation over the four phases of tasks. Each topic-block contained one easy topic, one medium topic, and one hard topic, in that order. The order of blocks across phases was counterbalanced over sets of four participants, within groups, according to a randomized Latin square.
Participants were randomly assigned to one of three study conditions: Scatter/Gather Speeded (SGS, N = 4), Scatter/Gather with Relevance Ratings (SGR, N = 4), and SimSearch (SS, N = 8). Phases 1 and 3 varied by condition, whereas Phases 2 and 4 were identical across conditions. In the SGS condition, participants were given one hour to find articles in each of Phases 1 and 3. In the SGR condition, participants were not given a time limit and were asked to complete additional classification and relevance activities: given worksheets, they were asked to indicate how they would classify each presented cluster (i.e., using words or short phrases) and to estimate what percentage of the texts in a cluster seemed relevant to the topic. In the SS condition, participants used SimSearch rather than Scatter/Gather to find relevant articles in Phases 1 and 3. Note that the SGR and SGS conditions were combined into one group (SG) for some analyses (see below).
Although there was not a significant difference in the total number of documents saved by SG (M = 12.26) versus SS (M = 16.44) participants [F(1, 14) = .24, MSE = 2355.77], there was a difference in the number of relevant documents saved, F(1, 14) = 5.44, MSE = 108.55, p = .04, with SS participants (M = 10.40) saving more relevant documents than SG participants (M = 5.44). This is reflected in the differences in precision scores in Table 1.
The SG group showed substantially more variation in the number of documents saved (SD = 65.21; SD = 17.94 with the largest outlier removed) than the SS group (SD = 11.27). Furthermore, 27% of SG queries (13 of 48) resulted in no documents saved, whereas all the SS participants saved at least one document on each query. These results suggest that finding relevant documents is somewhat more "hit and miss" for participants using a pure SG system.
Table 1. Per-query averages for SimSearch (SS) and Scatter/Gather (SG) groups (SGS = Speeded, SGR = Rating conditions).
< img src= "pp_tb1.gif" alt="Table 1" >
As an approximation, however, we were able to formulate the expected distribution of relevant documents across clusters and to compare this against the observed distribution of ratings. The distribution of relevant documents across clusters was examined for each of 29 queries used in TREC. Sets of the 200 most similar documents were retrieved using the SimSearch technique, and these retrieved sets were then clustered into five partitions using the Scatter/Gather clustering algorithm. The clustering, however, proceeded without any information about the nature of the query used to select the initial 200 documents. Clusters larger than a criterion size were clustered again, and this process recursed. For each query, the clusters were ranked i = 1, 2, ..., 5 by the number of relevant documents u_i assigned to cluster i, out of the U relevant documents contained in the original set that was clustered. Among regressions of linear, exponential, and power functions, the best fit was obtained by an exponential distribution of relevant documents across clusters,
< img src= "pp_eq1.gif" alt="Equation 1" >
Using Equation 1, we estimated the distribution of relevant documents (those identified by TREC experts) across the 10 topmost clusters and compared this against the mean precision estimates given by SGR participants; these data are presented in Figure 2. (For the sake of clarity, Figure 2 omits estimates of zero precision by our participants.) This makes the assumption that the rank ordering of people's ratings mirrors the rank ordering of the number of relevant documents in the clusters.
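For readers who wish to reproduce this kind of fit, the sketch below fits an exponential of the form u_i ≈ a·exp(b·i) by ordinary least squares on the log-transformed counts. This is an illustrative reconstruction under that assumption; the exact form and fitting procedure behind Equation 1 are the paper's, and the example counts are made up.

```python
import math

def fit_exponential(ranked_counts):
    """Fit u_i ~ a * exp(b * i) to the number of relevant documents u_i in the
    i-th ranked cluster, via least squares on log(u_i) versus i.
    (Illustrative reconstruction; the paper's fitting procedure may differ.)"""
    points = [(i, math.log(u)) for i, u in enumerate(ranked_counts, start=1) if u > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / \
        sum((x - mean_x) ** 2 for x, _ in points)
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Example with invented counts of relevant documents in clusters ranked 1..5:
a, b = fit_exponential([40, 12, 5, 2, 1])
```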
The linear relationships apparent in the log-log coordinates of Figure 2 suggest that Scatter/Gather users perceive a distribution of relevant documents across clusters that has a power-law relation to the expected distribution of relevant documents. Such relations are common in many domains where people must assess the strength of evidence to judge the probability of events [9]. Users apparently show the same biases in estimating how many relevant documents are in a collection as they do when estimating events in other domains, such as sports or health risks.
It seems, then, that three conclusions can be drawn about how well Scatter/Gather communicates the location of relevant documents among the clusters presented to users: (1) even though the clustering algorithm obtains no information about users' queries, most of the documents relevant to a typical (TREC) query will tend to be grouped into a few clusters, (2) users judge most of the relevant documents to be grouped into a few clusters, and (3) the users' judgments appear to have a well-defined power-law relationship to the actual distribution of relevant documents.
< img src= "pp_fg2.gif" alt="Figure 2" >
Figure 2. The match of the perceived precision (No. relevant / No. total documents) of clusters to the predicted distribution of relevant documents for different degrees of Query Difficulty.
Table 2. Search terms used, Phases 2 and 4.
< img src= "pp_tb2.gif" alt="Table 2" >
If it is the browsing interaction with Scatter/Gather that induces the topic structure in users, then we should expect the richness of the diagrams drawn by SG participants to increase with increased browsing. Figure 3 shows that, indeed, the breadth of SG diagrams was highly correlated with the number of Scatter/Gather cluster windows (windows such as Figure 1) that participants saw during a block of queries. The number of cluster windows was not correlated at all, however, with the depth of the diagrams or the number of nodes or links.
Regarding content, the SGR diagrams seemed to show more diversity of topics, listing topics beyond the terms used in the given query topics. The SS group listed items more closely tied to their specific query topics, as well as more general nodes like "news," "AP," "WSJ," etc. Finally, the SGS diagrams contained a few new topics not given in the queries (e.g., "medicine," "science"), but not as many as the SGR diagrams. Two judges rated the topics appearing in the diagrams on a three-point "specific-general" scale to determine the diversity of topics contained in the diagrams (a rating of one was assigned to a topic judged to be very specific, e.g., "corrupt government officials," and a rating of three to a very general topic, e.g., "law"). There was 82% inter-rater agreement overall and 85% agreement on the most general topics (i.e., those rated three). Forty diagram topics (20% of the distinct topics used) were rated as most general by both judges, and the groups were compared on the proportion of "general" topics their diagrams contained. Overall, the SG subjects used a significantly greater percentage of general topics than the SS subjects (52% and 31%, respectively; t(15) = 2.56, p < .05). The SGS and SGR subjects did not differ significantly in the proportion of general topics they included in their diagrams (54% and 49%, respectively). There were no significant between- or within-group differences in the number of general topics for diagrams drawn in Phase 2 versus Phase 4.
< img src= "pp_fg3.gif" alt="Figure 3" >
Figure 3. Relation of No. Scatter/Gather windows seen to the breadth of a subject's topic structure diagram.
Table 3. Diagram analyses, Phases 2 and 4.
< img src= "pp_tb3.gif" alt="Table 3" >
A multi-dimensional scaling analysis was used to assess the similarity of participants' diagrams based on the number of shared topics. Such an analysis attempts to arrange entities in a space such that the distances between the entities correspond, as closely as possible, to some measured differences (or similarities) between all pairs of entities. We examined the output of such an analysis to see which users (the entities in our analysis) clustered together in terms of their conception of the topics contained in the text collection. That is, multi-dimensional scaling was used to lay out a kind of semantic space for users' conceptions of topics in the collection.
The similarity matrix we used for this analysis indicated the proportion of shared topics between any two diagrams (the number of topics in common divided by the number of topic nodes in the larger diagram). One SS participant did not use topics and was not included in the analysis. On average, SGS and SGR participants' diagrams were more similar to each other than they were to SS participants' diagrams (sharing 10% versus 5% of topics, respectively). SS participants' diagrams were no more similar to each other than they were to SG participants' diagrams (sharing only 5% of topics). The SGS diagrams were the most similar of the three groups, sharing an average of 14% of their topics. Figure 4 shows a three-dimensional multi-dimensional scaling solution for these diagram topic similarity data. The circles represent SG participants and the boxes represent SS participants. In addition, a minimal spanning tree has been laid over the points in Figure 4 to highlight clusterings. Overall, the SG participants appear more central and closer to one another in Figure 4 than the SS participants. As a measure of incoherence, we computed the root mean squared (RMS) distances among the points in Figure 4. This incoherence measure was lower for the SGS (RMS = 1.08) and SGR (RMS = 1.22) participants than for the SS participants (RMS = 1.68). The people using Scatter/Gather seemed to be closer to one another in terms of their conception of the topics contained in the collection.
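The analysis pipeline can be sketched as follows: compute the shared-topic similarity for each pair of diagrams, convert to dissimilarities, and scale. The sketch uses classical (Torgerson) MDS as a generic stand-in; the specific MDS variant and software used for Figure 4 are not specified here, so treat this as an assumption.

```python
import numpy as np

def diagram_similarity(topics_a, topics_b):
    """Proportion of shared topics: topics in common divided by the number of
    topic nodes in the larger diagram (as defined in the text)."""
    shared = len(set(topics_a) & set(topics_b))
    return shared / max(len(set(topics_a)), len(set(topics_b)))

def classical_mds(similarity_matrix, dims=3):
    """Classical (Torgerson) MDS on a similarity matrix converted to distances.
    A generic reconstruction, not necessarily the scaling method behind Figure 4."""
    d = 1.0 - np.asarray(similarity_matrix)        # dissimilarity = 1 - similarity
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                    # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:dims]       # keep the top `dims` dimensions
    coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
    return coords                                  # one row of coordinates per participant
```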
< img src= "pp_fg4.gif" alt="Figure 4" >
Figure 4. Multi-dimensional scaling solution based on diagram content similarity for the Scatter/Gather (circles) and SimSearch (boxes) participants.
Figure 5 shows a hierarchical clustering based on the similarity data for the diagrams from Phase 4. This is another way of seeing how individual people cluster together in terms of their conception of the topics in the collection. This tree representation clearly shows the strong similarity grouping of the SGS and SGR diagrams. Overall then, it appears that the Scatter/Gather interface is inducing a more coherent view of the text collection than SimSearch.
< img src= "pp_fg5.gif" alt="Figure 5" >
Figure 5. Hierarchical clustering of diagram topic similarity data from Phase 4 (single linkage method, nearest neighbor). SG diagrams (SGS and SGR) appear to be the most similar grouping.
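A clustering like the one in Figure 5 can be reproduced from the same shared-topic similarity matrix. The sketch below is an assumption-laden illustration using SciPy's standard single-linkage routines rather than whatever software produced the original figure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def cluster_diagrams(sim, labels):
    """Single-linkage (nearest-neighbor) clustering of diagram similarity data.
    `sim` is an n x n matrix of shared-topic proportions; `labels` names the
    participants. Returns the dendrogram structure without plotting."""
    dist = squareform(1.0 - np.asarray(sim), checks=False)  # condensed distance vector
    tree = linkage(dist, method="single")                   # nearest-neighbor merging
    return dendrogram(tree, labels=labels, no_plot=True)
```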
In conducting this research, we also faced numerous methodological problems. It is easy to collect measures based on counts of documents retrieved, but not so easy to assess the knowledge about a collection that is communicated by a browsing technique. We developed several convergent assessment measures in this regard, though clearly a great deal of refinement remains to be done.
2. Buckley, C., Allan, J., and Salton, G. Automatic routing and ad-hoc retrieval using SMART. In Second Text Retrieval Conference (TREC-2) (1994), National Institute of Standards and Technology.
3. Cutting, D.R., Karger, D.R., and Pedersen, J.O. Constant interaction-time Scatter/Gather browsing of very large document collections. In SIGIR '93 (1993).
4. Cutting, D.R., Karger, D.R., Pedersen, J.O., and Tukey, J.W. Scatter/Gather: A cluster-based approach to browsing large document collections. In SIGIR '92 (1992), pp. 318-329.
5. Froehlich, T.J. Relevance reconsidered -- Towards an agenda for the 21st century. Journal of the American Society for Information Science, 45 (1994), 124-134.
6. Harman, D. Evaluation issues in information retrieval. Journal of the American Society for Information Science, 28 (1992), 439-440.
7. Harman, D. Overview of the first text retrieval conference. In 16th Annual International ACM/SIGIR Conference (1993), Pittsburgh, PA, ACM, pp. 36-38.
8. Pirolli, P. and Card, S. Information foraging in information access environments. In Conference on Human Factors in Computing Systems, CHI '95 (1995), Association for Computing Machinery.
9. Tversky, A. and Fox, C.R. Weighing risk and uncertainty. Psychological Review, 102 (1995), 269-283.
10. van Rijsbergen, C.J. Information Retrieval. Butterworth & Co., Boston, MA, 1979.
11. Wildemuth, B.M. and Bliek, R.D. Measures of searcher performance: A psychometric evaluation. Information Processing and Management, 29 (1993), 533-550.