1 Random generation
We discussed above that in a few cases, authors have generated their own datasets to test their algorithms. We call these generated datasets synthetic or custom, as opposed to datasets found in the wild. Generating a synthetic dataset can be essential for several reasons, particularly in the context of algorithm development and validation:
- Privacy and Publication Constraints: Imagine you’ve created an innovative algorithm using a dataset that, due to privacy concerns, cannot be shared publicly. To validate and share your work without compromising data confidentiality, creating a synthetic dataset that mirrors the essential features of the original data allows you to showcase your algorithm’s effectiveness while adhering to privacy restrictions.
- Benchmarking and Comparative Analysis: Suppose you have access to a single dataset that can be shared and wish to prove your algorithm’s robustness across various scenarios. Generating synthetic datasets with comparable characteristics provides a controlled environment to conduct benchmark studies. This approach enables scientists to demonstrate that your algorithm performs consistently well across datasets with analogous features, thereby reinforcing its applicability and reliability.
- Addressing Data Availability Issues: In certain situations, you might design an algorithm to process graphs with specific attributes only to discover the absence of publicly available datasets showcasing these features. Synthetic datasets become invaluable here, allowing you to create tailored data that incorporates the necessary characteristics. This approach not only facilitates the testing of your algorithm in a relevant context but also helps in illustrating its potential in hypothetical yet plausible scenarios.
We use the word synthetic too in cases where the dataset has been altered, sliced, or other modifications have been applied to it.
The following is a practical example of a case where you would need to edit a network:
Imagine you are building a visualization to deal with the entirety of the Enron Corpus dataset , which has hundreds of thousands of nodes and edges. Because of its size, you decide to slice up the large dataset in many smaller files, so you can run tests. This particular dataset, in addition to being a challenging problem because of its size, also has an interesting distribution of connection densities: some nodes are extremely well connected, while others are much less connected. Indeed, the dataset is comprised of a collection of emails sent by Enron executives - between themselves or between other employees of the company. Because of how the dataset is constructed - where only the emails from the executives are taken into account - it has a distinctive skew in the connectedness of the data: 158 nodes are extremely well connected, while the rest of the nodes are much less connected. Because of this, there’s uncountable mistakes that can be done through slicing this graph: for instance, slicing it so that the subset includes a less dense section of the entire graph will fail to provide a representative section.
While the creation of a synthetic dataset is a perfectly viable way to produce a benchmark set, in addition to replicability criteria (as discussed above), we also have to pay particular attention to a number of statistics related to the structure of graphs — both when generating new datasets, and when slicing up existing datasets to reduce their size, or to create more graphs from a single, larger graph.
Comparing two synthetic graph collections meaningfully is not straightforward. Without shared generation procedures or metadata, it’s difficult to claim that they represent comparable input conditions for layout evaluation. A number of structural and contextual features may affect comparability — including, for example, graph size (nodes and edges), density, and distribution of components or motifs. Even small changes in these properties can have a significant impact on layout results.
While a full discussion of all relevant factors is beyond the scope of this paper, we highlight this issue to caution against assuming that two synthetic datasets are interchangeable just because they were randomly generated. A more careful analysis of their structural characteristics is often necessary.
The generation of random graphs that accurately mimic specific features presents a complex challenge, that has been explored abundantly in the past, but has not resulted in a universal solution. Two examples of popular models used to generate synthetic datasets are the Erdős-Rényi (ER) @erdHos1960evolution and the Barabási-Albert (BA) @barabasi1999 models, each one with their distinct focus and limitations:
The ER model excels in creating graphs with uniform edge distribution @erdHos1960evolution, ideal for testing an algorithm’s ability to evenly space nodes and reduce edge crossings. However, it falls short in replicating the complex, non-uniform connections seen in real-world networks, limiting its applicability to scenarios requiring simplicity and randomness. Conversely, the BA model produces scale-free networks with a few highly connected nodes @barabasi1999, reflecting the hierarchical structure of many natural and human-made systems. It challenges layout algorithms to effectively display these hubs without cluttering. The limitation here is its focus on growth and preferential attachment, which might not suit all types of networks, particularly those without clear hub structures.
As there is no universal solution for random graph generation, our recommendation is to try as much as possible to pay attention to research replicability criteria, such as redistributing the generated dataset as supplemental material in the paper.