1 Data Processing
The replicability and reproducibility issues outlined previously are further complicated by the problem of storing data in a consistent format. Datasets cited in different papers often link to downloads from which the graph structure is very difficult to piece together, for instance because the files are in an uncommon format. Most commonly, paper authors write code to extract this structure for their specific needs. That code is tedious to recreate even when the authors went to the trouble of meticulously describing their process in the paper, and it is difficult and time-consuming to repurpose when it is contained in supplemental materials that still exist on the web. In the worst case, both the description and the code are lost over time, and the original graphs become impossible to recreate. Hence, storing and organizing this data in multiple accessible formats is of great importance for the replicability of these works and for further research in the field.
For each dataset included in our repository, we have either performed the work required to extract the graph structure per the original authors' specifications, or linked to an established network repository, such as SNAP or the SuiteSparse Matrix Collection, that hosts the files. Writing custom code for this purpose was necessary for 32 of the 49 included datasets. Our code is available at github.com/VisDunneRight/benchmark_sets_analysis_data/.
We chose to convert and store several of the datasets in a uniform JSON representation because it is a versatile and widely supported format that allowed us to represent the large variety of characteristics that different graphs might have, such as nodes having associated timestamps or labels or belonging to clusters, and edges having weights. The style of JSON representation we used for graphs is based on several d3 examples, such as the one used in the d3 documentation, and is the same format used by the JSON read/write functions in NetworkX. For ease of use and accessibility, we have also converted and made available all graphs in three additional commonly used formats: GraphML, GEXF, and GML. All of the formats contain the same information, including any additional attributes that nodes or edges might have.
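To make the node-link layout concrete, the following minimal sketch builds a toy graph in the d3-style structure described above, round-trips it through JSON, and rebuilds an adjacency map from the parsed data. It is an illustration only and is not taken from our repository: the graph, the attribute names (`label`, `cluster`, `timestamp`, `weight`), and the helper `to_adjacency` are hypothetical, and the top-level keys `nodes` and `links` are assumed to follow the d3 convention.

```python
import json

# Hypothetical toy graph in a d3-style node-link layout. The attribute
# names (label, cluster, timestamp, weight) are illustrative assumptions.
graph = {
    "directed": False,
    "multigraph": False,
    "graph": {"name": "toy-example"},
    "nodes": [
        {"id": "a", "label": "Alice", "cluster": 0, "timestamp": 1},
        {"id": "b", "label": "Bob", "cluster": 0, "timestamp": 2},
        {"id": "c", "label": "Carol", "cluster": 1, "timestamp": 3},
    ],
    "links": [
        {"source": "a", "target": "b", "weight": 2.5},
        {"source": "b", "target": "c", "weight": 1.0},
    ],
}


def to_adjacency(data):
    """Build an adjacency map {node: {neighbor: edge_attrs}} from node-link data."""
    adj = {node["id"]: {} for node in data["nodes"]}
    for link in data["links"]:
        # Everything except the endpoints is treated as an edge attribute.
        attrs = {k: v for k, v in link.items() if k not in ("source", "target")}
        adj[link["source"]][link["target"]] = attrs
        if not data.get("directed", False):
            # Undirected graphs store each edge in both directions.
            adj[link["target"]][link["source"]] = attrs
    return adj


# Serialize to JSON text and parse it back, as a file write/read would.
text = json.dumps(graph, indent=2)
restored = json.loads(text)
adjacency = to_adjacency(restored)
```

Because node and edge attributes are ordinary key-value pairs, graphs with different sets of attributes fit the same layout without schema changes. With NetworkX installed, a dictionary of this shape can typically be loaded with `networkx.node_link_graph` and exported with `write_graphml`, `write_gexf`, or `write_gml`; note that the name of the key holding the edge list is configurable there, and its default has differed between NetworkX releases.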