1 Motivation and Background
This work stems from the challenges we encountered in finding datasets tailored to testing graph layout algorithms. When developing a graph layout algorithm that handles specific features (such as layers or clusters), it is essential to have a benchmark dataset that reflects those features. While conducting our own evaluations, we found it difficult to locate datasets that incorporated the features we needed. This submission is part of an ongoing effort to maintain a curated list of datasets.
Our work focuses on providing a graph benchmark collection that categorizes datasets by how they organize their graphs and emphasizes their features. We aim to facilitate researchers’ choice of benchmarks that reflect real use cases or allow comparisons with other algorithms in their respective fields. Although a number of graph repositories exist, their aims and objectives are not always aligned with the needs of graph drawing researchers. While the scope of this work revolves around compiling graphs and networks used in the graph drawing literature, we note that adjacent fields have also created similar repositories tailored to different needs. For example, the Network Repository is a comprehensive collection of datasets that contain many attributes and are used for benchmarking in machine learning, data mining, and many other network applications [@nr]. In biology, the Kyoto Encyclopedia of Genes and Genomes (KEGG) contains network information relevant to biological pathways [@kegg]. Some general-purpose collections used in network science are also relevant to our discussion. Among the best known, the SuiteSparse Matrix Collection, the Stanford Network Analysis Project (SNAP), and the Pajek collection stand out, as they offer large compilations of datasets that often come from diverse sources. The Open Graph Benchmark collection from @hu2020 is also worth mentioning: it provides an important infrastructure for evaluating machine learning methods on graph-structured data, including datasets, tasks, and evaluation metrics, but it is not focused on layout quality or the human-perceived readability of graph visualizations. Our work complements such efforts by specifically targeting datasets designed to evaluate the perceptual and aesthetic dimensions of layout algorithms, which are not addressed in @hu2020.
Much simpler examples of similar collections can be found in curated lists of links on GitHub, often referred to as awesome lists, where a short comment usually accompanies every entry (for instance, here and here). The purpose and scope of such repositories are discussed in more depth in ?@sec-established-network-repositories.
Such lists can serve as great tools for finding particular case studies, but they do not serve the same purpose as a uniform collection like Rome-Lib: a collection of graphs with similar features that can be used to test an algorithm on thousands of graphs with an increasing number of nodes. Rome-Lib is hosted on the Graph Drawing website, as proof of its usefulness as a benchmark dataset, together with the AT&T graphs and the random DAGs. Another example is the Graph Partitioning Archive, also known as the Walshaw Collection, which compiles relevant graphs and partitioning algorithms from disparate sources in the literature [@walshaw2000]. The uniformity of such collections allows scientists to easily run thousands of tests on similar graphs, making it possible to test the scalability of an algorithm while varying density, number of nodes, and number of edges, in contrast to the previously mentioned collections, where the focus is on the diversity of the graphs. See ?@sec-uniform-benchmark-datasets for more information on this topic.
We care particularly about the reproducibility of past and future research. A dataset that was used in an evaluation and is now inaccessible greatly hinders the reproducibility of that evaluation; in the worst case, the evaluation becomes impossible to reproduce and, as such, much less meaningful. Losing a dataset to link rot is an unfortunately common problem in the digital age, as URLs change, websites go down, and data is lost. One example is the Open Graph Archive from @bachmaier2012, a project to create a graph database that categorizes, analyzes, and visualizes graphs uploaded by the community, a laudable effort that is now, unfortunately, inaccessible. For more discussion, see ?@sec-lost.
We tried to mitigate the problem of lost datasets by documenting what we could still find about them. For every dataset we found to be lost or inaccessible, we documented every detail we could find about it in the literature, including descriptions and pictures of the rendered graphs, so as to preserve a hint of what the dataset contained. We also reached out to colleagues at other universities who we knew had worked with certain datasets in the past, and asked whether they could check their internal storage (for example, shared drives or old project folders) to see if the data was still available. In a couple of cases, this led to successful outcomes: see Storylines (Movie Plots).
We store our collection on the Open Science Framework (OSF), which is the currently recommended solution in the VIS community for long-term archival of research data. As per OSF’s backup and preservation policy, storage and open access are guaranteed for the next 50 years.
In this context, it is also worth mentioning that in recent years there have been several initiatives aimed at encouraging care for replicability in research. The Graphics Replicability Stamp is one of these: an endorsement of the replicability of the results presented in a paper, granted through an additional review process. Similar initiatives include the ACM badges and the SIGMOD availability and reproducibility initiative, which goes one step further and publishes full reports commenting on how reproducible a paper is.
Contributions to the dataset collection (corrections, integrations, replacements) are most welcome and strongly encouraged. However, to ensure data quality and avoid accidental overwriting or inconsistency, we do not allow direct edits to the files. Instead, there are two main ways to contribute:
Pull requests: If you prefer to fill in all the information yourself, you can submit a pull request directly to the repository. The data for both the papers and the datasets is stored in CSV format, available here.
GitHub issues: If you’d rather just point us to a new dataset or share some additional info (e.g., missing metadata, clarifications, or links), we’ve created an issue template to make that easier. We’ll then take care of adding the dataset and filling in whatever information we can find or infer.
Even without external contributions, we actively monitor the space and try to keep the repository up to date as new datasets emerge. And if none of the options above work for you, feel free to just reach out to us; we are happy to handle things more informally as well.
That said, it is worth noting a clear limitation: this collection does not aim to be exhaustive. The starting point for the dataset list was a literature review covering a few hundred papers, which means it is entirely possible that some benchmarks were missed, especially if they were rarely cited or were introduced in more obscure venues. For this reason, contributions from the community are especially valuable to help fill in the gaps and keep the resource as useful and complete as possible.
This work proposes a working classification of datasets and collections based on their structure, while placing greater emphasis on their features and usage within the literature of our field. Our collection is offered as a complement to the previously mentioned collections: we intend to aid researchers in finding graphs in the context of layout algorithms and network visualization, with a focus on encouraging replicability.