Visualizing the contents of social bookmarking systems

I’ve decided to post a few visualizations I made some time ago as part of my PhD work. The method has been published, but more as a side note [1]; I’ve also since applied it to additional datasets, so I thought it might be interesting to share those images.

In my PhD thesis, I try to make sense of the large bodies of data that accumulate when people save resources online and tag them with whatever words they choose in order to find them later on. Each time a user u saves a document d using the tags t1, t2, and t3, three triples (d,u,t1), (d,u,t2), (d,u,t3) are created. In this way, the users of large social bookmarking sites like Delicious have created datasets of several hundred million triples. Over the last few years, a whole body of literature has emerged that’s concerned with making sense of this data (there’s an intuition that something valuable is in there; after all, each of these millions of triples means that somebody has thought *something*!). Here, I want to show a few visualizations of various social bookmarking datasets, meant to give a quick idea of the complexity and the approximate content of these networks.
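To make the triple structure concrete, here is a minimal Python sketch of how a single bookmarking action expands into triples. The function name `bookmark_to_triples` and the placeholder values are purely illustrative; they are not taken from any of the datasets.

```python
def bookmark_to_triples(document, user, tags):
    """Expand one bookmarking action into (document, user, tag) triples."""
    return [(document, user, tag) for tag in tags]


# User u saves document d with the tags t1, t2 and t3.
triples = bookmark_to_triples("d", "u", ["t1", "t2", "t3"])
for triple in triples:
    print(triple)
```

A whole dataset is then just a very long list (or set) of such triples.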

Here’s what I’ve done:

  • Take the 1000 most popular documents from each dataset
  • Compute the similarity between them in terms of associated tags
  • Draw connections between the 3000 closest pairs of documents
  • Draw connections between each document and its most frequently associated tag
  • Scale the tags by frequency
  • Remove everything that’s not connected to the largest component of the graph
  • Run the resulting network through GraphViz, a fantastic piece of graph visualization software
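The steps above could be sketched roughly as follows. This is not the original code, just a minimal illustration under several assumptions: popularity is taken to be the number of distinct users who saved a document, tag frequency is counted over the selected documents only, the similarity measure is passed in as a function, and the names `build_graph` and `to_dot` are made up for this example. The output is plain DOT text that GraphViz can read.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph(triples, similarity, n_docs=1000, n_pairs=3000):
    """Return (edges, tag_freq), restricted to the largest connected component."""
    triples = set(triples)  # each (document, user, tag) counts once

    # Assumption: popularity = number of distinct users who saved the document.
    users = defaultdict(set)
    for doc, user, _ in triples:
        users[doc].add(user)
    top = sorted(users, key=lambda d: (-len(users[d]), d))[:n_docs]
    top_set = set(top)

    # Tag vector per document: tag -> number of users who applied it.
    vectors = defaultdict(Counter)
    for doc, _, tag in triples:
        if doc in top_set:
            vectors[doc][tag] += 1

    # Connect the n_pairs most similar document pairs.
    scored = sorted(
        ((similarity(vectors[a], vectors[b]), a, b) for a, b in combinations(top, 2)),
        reverse=True,
    )[:n_pairs]
    edges = {(("doc", a), ("doc", b)) for _, a, b in scored}

    # Connect each document to its most frequent tag (ties broken alphabetically).
    tag_freq = Counter()
    for doc in top:
        best = min(vectors[doc], key=lambda t: (-vectors[doc][t], t))
        edges.add((("doc", doc), ("tag", best)))
        tag_freq.update(vectors[doc])

    # Keep only the largest connected component.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, largest = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            for neighbour in adjacency[stack.pop()]:
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        seen |= component
        if len(component) > len(largest):
            largest = component
    return {(a, b) for a, b in edges if a in largest}, tag_freq


def to_dot(edges, tag_freq):
    """Render the graph as DOT text; tag labels are scaled by frequency."""
    lines = ["graph bookmarks {"]
    for kind, name in sorted({node for edge in edges for node in edge}):
        if kind == "tag":
            size = 10 + tag_freq[name]  # arbitrary linear scaling
            lines.append(f'  "tag:{name}" [label="{name}", fontsize={size}];')
        else:
            lines.append(f'  "doc:{name}" [label="", shape=point];')
    for (kind_a, a), (kind_b, b) in sorted(edges):
        lines.append(f'  "{kind_a}:{a}" -- "{kind_b}:{b}";')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be saved to a file and handed to one of GraphViz's layout programs, for example `neato -Tpng graph.dot -o graph.png`. Which layout program and options produced the images here is not stated, so treat that command as an example only. Note also that this sketch does not escape quotes in document or tag names.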

For the technically inclined, the similarity between documents is computed as a so-called cosine similarity: Each document is represented by a large vector containing, for each tag, the number of users that have used that tag to describe the document. Because these counts are never negative, the cosine similarity between two such vectors is a number between 0 (no similarity) and 1 (complete match). If you imagine two vectors (0,1) and (1,0), the angle between them is 90 degrees and the cosine of that angle is 0. This can be scaled up to many dimensions and is frequently used in information retrieval as a basic similarity measure between documents. Also, the choice of the numbers 1000 and 3000 is somewhat arbitrary. They just happen to create what I find the most expressive visualizations over all datasets. The actual visualization, as noted, is outsourced to GraphViz; for a quick intuition about what it does, just imagine that each connection is a spring between two nodes: the visualization is a stable state of the resulting system where the different spring forces are in balance.
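For readers who prefer code to prose, the cosine similarity described above can be sketched in a few lines of Python. The vectors are stored sparsely as dictionaries mapping tags to user counts, which is an implementation choice made for this example; the tag names and counts below are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse tag vectors (dicts: tag -> user count)."""
    dot = sum(count * b.get(tag, 0) for tag, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for documents without any tags
    return dot / (norm_a * norm_b)


# The two-dimensional example from the text: (0,1) and (1,0) share no tag.
orthogonal = cosine_similarity({"t2": 1}, {"t1": 1})

# Two hypothetical documents with largely overlapping tags.
doc_a = {"python": 12, "tutorial": 5}
doc_b = {"python": 8, "tutorial": 3, "video": 1}
similar = cosine_similarity(doc_a, doc_b)

print(orthogonal, similar)
```

Documents without a shared tag score exactly 0, and the more their tag distributions point in the same direction, the closer the score gets to 1, regardless of how many users tagged each document in total.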

Visualize.us, a social bookmarking site for images.

One month of Delicious (12/2007) [2], a social bookmarking service for URLs

CiteULike [3], a social bookmarking service for scientific articles

Bibsonomy [4], a social bookmarking service for scientific articles and URLs

Bibsonomy, without the spam removed

Because of the way they are created, these visualizations are heavily biased towards the most popular documents, so a lot of content is lost. The original point of these visualizations was just to demonstrate the brutal way in which spammers destroy inherent patterns in the data (even though I think there are more things to find). For work aimed more at a complete overview of a particular dataset, please check this nice visualization [5].

Higher-resolution or vectorized versions are available on request; if you would like to cite this work in an academic context, please refer to [1].

[1] Nicolas Neubauer and Klaus Obermayer: Hyperincident Connected Components in Tagging Networks. In Proceedings of the 20th ACM Conference on Hypertext and Hypermedia.
[2] Robert Wetzker, Carsten Zimmermann, Christian Bauckhage: Analyzing Social Bookmarking Systems: A del.icio.us Cookbook. In Proceedings of the ECAI 2008 Mining Social Data Workshop (2008), pp. 26-30.
[3] http://www.citeulike.org/faq/data.adp
[4] Knowledge & Data Engineering Group, University of Kassel: Benchmark Folksonomy Data from BibSonomy, version of June 30th, 2008.
[5] Nianli Ma, Russell J. Duhon, Elisha F. Hardy, Katy Börner (2009) Bibsonomy Anatomy, Sunbelt Viszards Map. Online at http://cns.iu.edu/research/09-Bibsomony.jpg
