LEN: Large Engagement Networks

Social media users and inauthentic accounts, such as bots, may coordinate to promote their topics. Such topics may appear to be organically popular among the public even though they are centrally managed astroturfing campaigns. Predicting whether a topic is organic or a coordinated campaign is challenging because reliable ground truth is lacking.

Ground truth

We create a ground truth by detecting the campaigns promoted by ephemeral astroturfing attacks. These attacks push any topic to Twitter's trends list by employing bots that tweet in a coordinated manner within a short period and then immediately delete their tweets. We also manually curate a dataset of organic Twitter trends. We then create engagement networks out of these datasets and present them as graph classification datasets, where the task is to distinguish between campaigns and organic trends. In an engagement network, nodes are users and edges indicate interactions (retweets, replies, and quotes) between users.
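The engagement-network structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(source, target, type)` record format, the user names, and the choice to treat edges as directed and to merge repeated interactions between the same pair are all hypothetical and are not taken from the released files.

```python
from collections import defaultdict

# Interaction types named in the dataset description.
INTERACTION_TYPES = {"retweet", "reply", "quote"}


def build_engagement_network(interactions):
    """Build a network from (source_user, target_user, interaction_type) records.

    Returns (nodes, edges), where edges maps a directed (source, target)
    pair to the set of interaction types observed between the two users.
    The record layout and edge direction are assumptions for illustration.
    """
    nodes = set()
    edges = defaultdict(set)
    for source, target, kind in interactions:
        if kind not in INTERACTION_TYPES:
            raise ValueError(f"unknown interaction type: {kind!r}")
        nodes.update((source, target))
        edges[(source, target)].add(kind)
    return nodes, dict(edges)


# Made-up sample records, not real data.
sample = [
    ("alice", "bob", "retweet"),
    ("carol", "bob", "reply"),
    ("alice", "bob", "quote"),
    ("dave", "alice", "retweet"),
]
nodes, edges = build_engagement_network(sample)
print(len(nodes), len(edges))
```

Merging repeated interactions into a set keeps one edge per user pair; if the released graphs store parallel edges instead, a list per pair would be the closer representation.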

New public dataset

We release the data of 170 campaigns and 135 non-campaigns. Our graph dataset poses a challenge for large graph classification. Traditional graph classification datasets are small, with tens of nodes and hundreds of edges at most. Our graphs are considerably larger than these standard benchmarks: on average, each engagement network in our dataset contains ~11K nodes and ~23K edges. We show that state-of-the-art GNN methods give only mediocre results on our datasets, so they offer a new challenge for the graph classification problem. We believe that our dataset will help advance the frontiers of graph classification techniques on large networks and provide an interesting use case in distinguishing coordinated campaigns from organic trends.

Dataset Properties

Our dataset (LEN) comprises graphs in which nodes denote users and edges indicate the type of interaction between users: retweet, reply, or quote. We have a total of 305 engagement graphs: 170 campaign graphs and 135 non-campaign graphs. There are 7 sub-types in campaign and 8 in non-campaign, as indicated in the Overall Graph Description below.

Network statistics and label descriptions for the complete dataset

| Class | Sub-type | # graphs | Nodes (min) | Nodes (max) | Nodes (avg) | Edges (min) | Edges (max) | Edges (avg) | Explanation |
|---|---|---|---|---|---|---|---|---|---|
| Campaign | Politics | 62 | 100 | 50,286 | 6,570 | 203 | 71,704 | 10,210 | Political promotions, slogans, misinformation campaigns. |
| Campaign | Reform | 58 | 131 | 19,578 | 1,229 | 540 | 1,105,918 | 25,268 | People organized for political reforms. |
| Campaign | News | 24 | 581 | 54,996 | 10,368 | 942 | 80,784 | 15,582 | News pumped up by bots and trolls for more attention. |
| Campaign | Finance | 14 | 273 | 9,976 | 1,802 | 243 | 10,725 | 2,334 | Finance marketing (mostly cryptocurrency). |
| Campaign | Noise | 9 | 454 | 55,933 | 12,180 | 473 | 48,937 | 10,882 | Cannot be put in any type. |
| Campaign | Cult | 6 | 313 | 7,880 | 2,303 | 637 | 11,615 | 3,431 | Slogans by a famous cult with immense access to bots. |
| Campaign | Entertainment | 3 | 678 | 4,220 | 2,237 | 3,806 | 132,013 | 48,767 | Celebrities attempting to promote themselves. |
| Campaign | Common | 3 | 3,487 | 9,974 | 5,919 | 2,818 | 9,470 | 7,066 | Common sub-strings combined without known reasons. |
| Campaign | Overall | 170 | 100 | 55,933 | 5,157 | 203 | 1,105,918 | 16,006 | |
| Non-Campaign | News | 52 | 818 | 95,575 | 24,834 | 709 | 213,444 | 43,201 | Popular events, sourced outside Twitter. |
| Non-Campaign | Sports | 30 | 469 | 75,653 | 9,530 | 403 | 101,656 | 12,948 | Popular sports events. |
| Non-Campaign | Festival | 17 | 885 | 119,952 | 35,466 | 803 | 199,305 | 55,947 | About festivals, holidays, special days. |
| Non-Campaign | Internal | 11 | 4,188 | 87,720 | 33,061 | 4,374 | 196,103 | 54,442 | Popular events, sourced inside Twitter. |
| Non-Campaign | Common | 10 | 1,214 | 64,320 | 17,079 | 1,270 | 99,306 | 24,869 | Common substrings combined by people. |
| Non-Campaign | Entertainment | 8 | 1,477 | 20,060 | 7,289 | 1,712 | 45,211 | 12,578 | Popular TV shows and YouTube videos. |
| Non-Campaign | Announced cam. | 4 | 6,650 | 26,358 | 13,382 | 14,362 | 50,864 | 24,817 | Official campaigns launched by major political parties. |
| Non-Campaign | Sports cam. | 3 | 2,880 | 4,661 | 3,654 | 4,451 | 7,367 | 5,534 | Hashtags launched by popular sports teams. |
| Non-Campaign | Overall | 135 | 469 | 119,952 | 20,632 | 403 | 213,444 | 33,765 | |
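
Per-sub-type statistics of the kind reported in the table above (number of graphs, then minimum, maximum, and average node and edge counts) could be recomputed from per-graph metadata roughly as follows. This is a sketch under stated assumptions, not the authors' script: the field names `label`, `sub_type`, `num_nodes`, and `num_edges` are hypothetical, and the toy records are made-up values rather than entries from LEN.

```python
from collections import defaultdict
from statistics import mean


def summarize(graphs):
    """Group graph records by (label, sub_type) and compute size statistics.

    Each record is a dict with the hypothetical keys 'label', 'sub_type',
    'num_nodes', and 'num_edges'. Averages are rounded to whole numbers.
    """
    groups = defaultdict(list)
    for record in graphs:
        groups[(record["label"], record["sub_type"])].append(record)

    summary = {}
    for key, members in groups.items():
        node_counts = [m["num_nodes"] for m in members]
        edge_counts = [m["num_edges"] for m in members]
        summary[key] = {
            "graphs": len(members),
            # (min, max, avg) tuples, mirroring the table columns
            "nodes": (min(node_counts), max(node_counts), round(mean(node_counts))),
            "edges": (min(edge_counts), max(edge_counts), round(mean(edge_counts))),
        }
    return summary


# Toy records with made-up sizes, for illustration only.
toy = [
    {"label": "campaign", "sub_type": "Finance", "num_nodes": 300, "num_edges": 250},
    {"label": "campaign", "sub_type": "Finance", "num_nodes": 500, "num_edges": 750},
    {"label": "non-campaign", "sub_type": "Sports", "num_nodes": 1000, "num_edges": 1200},
]
stats = summarize(toy)
print(stats[("campaign", "Finance")])
```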


We also have a smaller, balanced version of LEN (LEN-Small), which has 51 campaign and 49 non-campaign graphs. Its statistics are given in the Small Graph Description Table below.

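Statistics of the engagement networks for the small dataset with 100 networks. We omit the label descriptions here as they remain the same.

| Class | Sub-type | # graphs | Nodes (min) | Nodes (max) | Nodes (avg) | Edges (min) | Edges (max) | Edges (avg) |
|---|---|---|---|---|---|---|---|---|
| Campaign | Politics | 14 | 100 | 1,908 | 805 | 203 | 2,000 | 1,108 |
| Campaign | Reform | 16 | 131 | 634 | 297 | 540 | 2,027 | 1,192 |
| Campaign | News | 3 | 581 | 1,671 | 1,123 | 942 | 1,726 | 1,410 |
| Campaign | Finance | 9 | 273 | 1,590 | 775 | 243 | 1,862 | 1,024 |
| Campaign | Noise | 5 | 454 | 2,520 | 1,060 | 473 | 1,634 | 1,074 |
| Campaign | Cult | 4 | 313 | 705 | 512 | 637 | 1,035 | 843 |
| Campaign | Overall | 51 | 100 | 2,520 | 661 | 203 | 2,027 | 1,113 |
| Non-Campaign | News | 10 | 818 | 6,169 | 3,757 | 709 | 9,076 | 4,578 |
| Non-Campaign | Sports | 23 | 469 | 8,355 | 3,357 | 403 | 9,998 | 3,994 |
| Non-Campaign | Festival | 2 | 885 | 5,982 | 3,433 | 803 | 6,509 | 3,656 |
| Non-Campaign | Internal | 1 | 4,188 | 4,188 | 4,188 | 4,374 | 4,374 | 4,374 |
| Non-Campaign | Common | 5 | 1,214 | 4,962 | 2,989 | 1,270 | 6,277 | 3,559 |
| Non-Campaign | Entertainment | 5 | 1,477 | 7,739 | 4,391 | 1,712 | 10,608 | 6,021 |
| Non-Campaign | Sports cam. | 3 | 2,880 | 4,661 | 3,654 | 4,451 | 7,367 | 5,534 |
| Non-Campaign | Overall | 49 | 469 | 8,355 | 3,545 | 403 | 10,608 | 4,364 |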

Download

Our original dataset comprises 101 GB of graph data and 5.4 GB of small-graph data. Click here to download all the data (including metadata) separately.