Real-World Data Set Descriptions

PyTorch Geometric Signed Directed provides data loaders for various real-world data sets. With these data loaders, we can load:

Directed Unsigned Real-World Data Sets

Blog: from the paper The political blogosphere and the 2004 U.S. election: divided they blog, which records 19,024 directed edges between 1,212 political blogs from the 2004 US presidential election.
Migration: from the paper State-to-state migration Flows, 1995 to 2000, which reports the number of people that migrated between pairs of counties in the US during 1995-2000. It involves 3,075 counties and 721,432 directed edges after obtaining the largest weakly connected component. Since the original directed network has a few extremely large entries, we preprocess the input network with normalization to cope with these outliers; see the descriptions in the paper DIGRAC: Digraph Clustering Based on Flow Imbalance.
WikiTalk: from the paper Signed networks in social media, which contains all users and discussions from the inception of Wikipedia until Jan. 2008. The 2,388,953 nodes in the network represent Wikipedia users, and a directed edge from node \(v_i\) to node \(v_j\) denotes that user \(i\) edited a talk page of user \(j\) at least once. We extract the largest weakly connected component.
Telegram: from the paper The Activity of the Far Right on Telegram, which is a pairwise influence network between 245 Telegram channels with 8,912 edges. Labels are generated from the method discussed in the paper, with a total of four classes.
Cora-ML: from the paper Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking, which is a popular citation network with node labels based on paper topics with seven classes. In this citation network, nodes represent papers, edges denote citations of one paper by another, and node features are the bag-of-words representation of papers. The resulting network has 2,995 nodes and 8,416 edges.
CiteSeer: from the paper CiteSeer: An automatic citation indexing system, which is a citation data set from an automatic citation indexing system. In this citation network, nodes represent papers, and edges denote citations of one paper by another. Node features are the bag-of-words representation of papers, and node labels are determined by the academic topic of a paper. The resulting network has 3,312 nodes and 4,715 edges.
Texas, Wisconsin, and Cornell: from the paper Geom-GCN: geometric graph convolutional networks, which are WebKB data sets extracted from the CMU World Wide Knowledge Base (Web->KB) project. They record hyperlinks between websites at different universities. WebKB is a webpage data set collected from computer science departments of various universities by Carnegie Mellon University. In these networks, nodes represent web pages, and edges are hyperlinks between them. Node features are the bag-of-words representation of web pages. The web pages are manually classified into five categories: student, project, course, staff, and faculty. The resulting networks have 183, 251, and 183 nodes, along with 325, 515, and 298 edges, respectively.
Chameleon and Squirrel: from the paper Multi-scale attributed node embedding, which represent links between Wikipedia pages related to chameleons and squirrels. In these networks, nodes denote web pages and edges represent mutual links between them. Node features correspond to several informative nouns in the Wikipedia pages. The nodes are classified by the authors of the paper Geom-GCN: geometric graph convolutional networks into five categories according to the average monthly traffic of the web page. The resulting networks have 2,277 and 5,201 nodes, along with 36,101 and 222,134 edges, respectively.
WikiCS: from the paper Wiki-cs: A wikipedia-based benchmark for graph neural networks, which is a directed network whose nodes correspond to Computer Science articles, and edges are based on hyperlinks. This network has 10 classes representing different branches of the field. The resulting network has 11,701 nodes and 297,110 edges.
Lead-Lag: from the paper Detection and clustering of lead-lag networks for multivariate time series with an application to financial markets, which contains yearly lead-lag matrices from 269 stocks from 2001 to 2019. Each lead-lag matrix is built from a time series of daily price log returns. The lead-lag metric for entry (i,j) in the network encodes a measure of the extent to which stock i leads stock j, and is obtained by applying a functional that computes the signed normalized area under the curve (auc) of the standard cross-correlation function (ccf). The resulting matrix is skew-symmetric, and entry (i,j) quantifies the extent to which stock i leads or lags stock j, thus leading to a directed network interpretation. Starting from the skew-symmetric matrix, the authors of the paper DIGRAC: Digraph Clustering Based on Flow Imbalance further convert negative entries to zero, so that the resulting directed network can be directly fed into other methods; note that this step does not throw away any information, and is pursued only to render the representation of the directed network consistent with the format expected by all methods compared. The average number of edges is 29,159.
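
The conversion described for the Lead-Lag data set (setting the negative entries of a skew-symmetric matrix to zero) can be illustrated with a minimal NumPy sketch. This is not the code used by the DIGRAC authors or by the data loader; the function name to_directed_network and the random 5-node matrix are hypothetical and serve only to show why the step loses no information.

```python
import numpy as np


def to_directed_network(S):
    """Zero out the negative entries of a skew-symmetric matrix S.

    Hypothetical helper for illustration: a positive entry (i, j) is kept
    as a directed edge i -> j, meaning that node i leads node j.
    """
    return np.maximum(S, 0.0)


rng = np.random.default_rng(0)

# Hypothetical skew-symmetric lead-lag matrix for 5 stocks: S[i, j] == -S[j, i].
M = rng.normal(size=(5, 5))
S = M - M.T

A = to_directed_network(S)

# The original matrix is recovered exactly as A - A^T, because each pair
# (i, j) keeps its magnitude in exactly one direction.
S_recovered = A - A.T

num_edges = int(np.count_nonzero(A))
```

Because every unordered pair of nodes contributes its magnitude in exactly one direction, the adjacency matrix A has at most one directed edge per pair, and A - A.T reproduces the skew-symmetric input.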

Signed Real-World Data Sets

Sampson: the Sampson monastery data from the paper A novitiate in a period of change: An experimental and case study of social relationships, which covers 4 social relationships, each of which could be positive or negative. We combine these relationships into a network of 25 nodes. For this data set, we use as the node attribute whether or not each novice attended the minor seminary of “Cloisterville”. As ground truth we take Sampson’s division of the novices into four groups: Young Turks, Loyal Opposition, Outcasts, and an interstitial group. The numbers of positive and negative edges are 148 and 182, respectively.
Rainfall: from the paper Climate inference on daily rainfall across the Australian continent, 1876–2015 and further processed by the authors of the paper SSSNET: semi-supervised signed network clustering, which contains pairwise correlations of Australian rainfall. This data set is based on the analysis of over 294 million daily rainfall measurements since 1876, spanning 17,606 sites across continental Australia. The resulting network has 306 nodes, and the numbers of positive and negative edges are 64,408 and 29,228, respectively.
Fin-YNet: from the paper SSSNET: semi-supervised signed network clustering, which consists of yearly correlation matrices for 451 stocks for 2000-2020 (21 distinct networks), using so-called market excess returns; that is, we compute each correlation matrix from overnight (previous close to open) and intraday (open-to-close) daily price returns, from which we subtract the market return of the S&P500 index. The resulting networks have on average 148,527 positive edges and 54,313 negative edges.
S&P1500: from the paper SSSNET: semi-supervised signed network clustering, which considers daily prices for 1,193 stocks, in the S&P 1500 Index, between 2003 and 2015, and builds correlation matrices also from market excess returns. The result is a fully-connected weighted network, with stocks as nodes and correlations as edge weights. The resulting network has 1,069,319 positive edges and 353,930 negative edges.
PPI: from the paper Integrating protein-protein interaction networks with phenotypes reveals signs of interactions, which is a signed protein-protein interaction (PPI) network. The edge signs represent activation-inhibition relationships. This is a Drosophila melanogaster signed PPI network consisting of 6,125 signed PPIs connecting 3,352 proteins that can be used to identify positive and negative regulators of signaling pathways and protein complexes. The data set is further processed by the authors of the paper SSSNET: semi-supervised signed network clustering to keep the largest connected component. The resulting network has 3,058 nodes, 7,996 positive edges, and 3,864 negative edges.
Wiki-Rfa: from the paper Exploiting social network structure for person-to-person sentiment analysis, which is a signed network describing voting information for electing Wikipedia administrators. Positive edges represent supporting votes, while negative edges represent opposing votes. The data set is further processed by the authors of the paper SSSNET: semi-supervised signed network clustering to keep the largest connected component and remove nodes with very low degrees. The resulting network has 7,634 nodes, 135,753 positive edges, and 37,579 negative edges.
BitCoin-Alpha and BitCoin-OTC: from the paper Edge weight prediction in weighted signed networks, which describe bitcoin trading. As a cryptocurrency, Bitcoin is used to trade anonymously over the web. The resulting counterparty risk has led to the emergence of several exchanges where Bitcoin users rate the level of trust they have in other users. Two such exchanges are OTC (for short) and Alpha (for short). Both exchanges enable users to rate others on a scale of -10 to 10 (excluding zero), where a rating of -10 should be given to fraudsters while 10 means to trust the person as one trusts oneself. The rating values in between have intermediate meanings. The resulting networks have 3,783 and 5,881 nodes, respectively. BitCoin-Alpha has 22,650 positive edges and 1,536 negative edges, while BitCoin-OTC has 32,029 positive edges and 3,563 negative edges.
Slashdot: from the paper Finding large balanced subgraphs in signed networks, which relates to a technology-related news website. This network contains friend/foe links between the users of Slashdot. The resulting network has 82,140 nodes, 380,933 positive edges, and 119,548 negative edges.
Epinions: from the paper Controversial users demand local trust metrics: An experimental study on epinions.com community, which describes trust-distrust consumer reviews on epinions.com. Epinions.com is a website on which users can write reviews about products and assign them a rating. This website also allows the users to express their Web of Trust, i.e. “reviewers whose reviews and ratings they have consistently found to be valuable”, and their Block List, i.e. a list of authors whose reviews they find consistently offensive, inaccurate, or in general not valuable. Inserting a user in the Web of Trust is the same as issuing a trust statement, which is recorded as a positive edge between the user and the reviewer, while inserting them in the Block List means issuing a distrust statement, which is recorded as a negative edge. The resulting network has 131,580 nodes, 589,888 positive edges, and 121,322 negative edges.
FiLL: from the paper Msgnn: A spectral graph neural network based on a novel magnetic signed laplacian, which contains financial lead-lag relationship data sets. For each year in the data set, the authors build a signed directed graph (FiLL-pvCLCL) based on the price return of 444 stocks at market close times on consecutive days. The authors also build another graph (FiLL-OPCL), based on the price return of 430 stocks from market open to close. The lead-lag metric that is captured by the entry (i,j) in each network encodes a measure that quantifies the extent to which stock i leads stock j, and is obtained by computing the linear regression coefficient when regressing the time series (of length 245) of daily returns of stock i against the lag-one version of the time series (of length 245) of the daily returns of stock j. Specifically, the paper uses the beta coefficient of the corresponding simple linear regression to serve as the one-day lead-lag metric. The resulting matrix is asymmetric and signed, rendering it amenable to a signed directed network interpretation. The data matrices stored in PyGSD are dense, but there is a sparsification parameter provided for the data loader; the MSGNN paper uses a parameter value of 0.2. The resulting annual graphs FiLL-OPCL have on average 84,467 positive edges and 100,013 negative edges, while the resulting annual graphs FiLL-pvCLCL have on average 84,677 positive edges and 112,015 negative edges.
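
The regression-based lead-lag metric described for FiLL can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the MSGNN authors' code: the function name lead_lag_beta is hypothetical, the returns are synthetic, and the time alignment (stock i on day t paired with stock j on day t+1, so that a large entry (i,j) means stock i leads stock j) is one reading of the description above. The sparsification step is not shown.

```python
import numpy as np


def lead_lag_beta(returns):
    """One-day lead-lag matrix from simple linear regressions.

    Hypothetical helper for illustration. `returns` has shape (T, n):
    daily returns of n stocks over T days. Entry (i, j) is the slope
    obtained when regressing stock j's return on day t+1 on stock i's
    return on day t (the alignment is an assumption of this sketch).
    """
    x = returns[:-1]                  # returns on day t
    y = returns[1:]                   # returns on day t+1
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    cov = xc.T @ yc / (len(x) - 1)    # cov[i, j] = Cov(r_i[t], r_j[t+1])
    var = xc.var(axis=0, ddof=1)      # Var(r_i[t])
    return cov / var[:, None]         # OLS slope = Cov(x, y) / Var(x)


rng = np.random.default_rng(0)
T, n = 245, 3

# Synthetic returns in which stock 0 leads stock 1 by one day.
returns = rng.normal(size=(T, n))
returns[1:, 1] = 0.5 * returns[:-1, 0] + 0.01 * rng.normal(size=T - 1)

beta = lead_lag_beta(returns)

# Split the signed, asymmetric matrix into positive and negative parts.
positive_part = np.maximum(beta, 0.0)
negative_part = np.maximum(-beta, 0.0)
```

In this synthetic example the entry beta[0, 1] is close to the planted coefficient 0.5, while the matrix as a whole is neither symmetric nor sign-constrained, which is what makes a signed directed interpretation natural.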