Quantifying social media's social structure

How online society is fragmented and polarized
by Isaac Waller and Ashton Anderson

Is the internet bringing us closer together? Or pushing us further apart? Despite its ubiquity in modern life, there’s still a lot we don’t understand about the internet and social media. We can now connect with anyone from around the world, and find communities centered around any topic, no matter how niche. This has undoubtedly been immensely positive in some aspects, like giving people places to connect with others around their hobbies, or providing support groups for people like LGBT youth. But it has also arguably led some to unhealthy isolation, cults, and conspiracy theories. Despite having the ability to connect with anyone, are we instead connecting with people more and more similar to ourselves? To understand this phenomenon, we first need to be able to measure it.

If you’re reading this, you likely participate in many different communities online. And since nobody’s interests are exactly the same as their friends’, in doing so you leave behind evidence of how your communities relate to others. In our paper, we create a highly detailed map of communities, where communities are similar if they’re engaged with by “similar” people. (We don’t quantify individual identities, only communities.)

This means that communities with similar topics are similar in our model. But this data doesn’t just reveal interests. It also can reveal differences in social groups, identities, and culture – because these are all ways that people (and their online communities) differ. For example, let’s take a look at the age dimension, which ranges from younger to older.

Finding social dimensions

Reddit is divided into tens of thousands of distinct communities called subreddits. Here you can see a few hundred of them positioned in a community embedding (hover over them to see their names). Each community has a vector representation based on who is active in it.

These communities are each individual and distinct, and are composed of many thousands of users interacting in an unstructured way. It's not clear how to quantify the relationships between these communities.

In addition, Reddit posts and comments are made under pseudonymous usernames. This makes it difficult to quantify how these communities relate to concepts like age or partisan leaning.

Our method finds social dimensions in a community embedding that map onto important social constructs such as age, gender, and US political partisanship.

We construct these dimensions by first identifying a seed pair of communities that differ primarily in the target concept and are otherwise similar. For example, the age dimension is seeded by r/teenagers and r/RedditForGrownups, personal discussion forums for teenagers and adults respectively.

We then algorithmically augment the seed pair with other extremely similar pairs of communities. For example, the algorithm finds r/teenrelationships and r/relationship_advice, communities for teenage and general relationship advice respectively, as a similar pair. The resulting set of pairs is then averaged together to form the final age dimension.
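The seed-and-augment procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the three-dimensional vectors are made-up toy values, `r/example_hobby` is a hypothetical community, the function name `build_dimension` and the parameter `n_extra` are invented for this sketch, and "similar pairs" is taken to mean pairs whose difference vector has high cosine similarity to the seed pair's difference vector.

```python
import math

def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    n = norm(v)
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def build_dimension(vectors, seed, n_extra):
    """Average the seed pair's direction with the n_extra most similar pairs.

    vectors: dict mapping community name -> embedding vector
    seed:    (first_pole, second_pole) community names
    """
    seed_dir = sub(vectors[seed[1]], vectors[seed[0]])
    scored = []
    for a in vectors:
        for b in vectors:
            # Assumption of this sketch: skip pairs that reuse a seed community.
            if a == b or a in seed or b in seed:
                continue
            diff = sub(vectors[b], vectors[a])
            scored.append((cosine(diff, seed_dir), a, b))
    scored.sort(reverse=True)  # most seed-like directions first
    pairs = [seed] + [(a, b) for _, a, b in scored[:n_extra]]
    # Normalize each pair's direction, then average and re-normalize.
    diffs = [unit(sub(vectors[b], vectors[a])) for a, b in pairs]
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    return unit(mean), pairs

# Toy embedding: values are illustrative only, not real data.
vectors = {
    "r/teenagers": [-1.0, 0.0, 0.2],
    "r/RedditForGrownups": [1.0, 0.0, 0.2],
    "r/teenrelationships": [-0.9, 1.0, 0.0],
    "r/relationship_advice": [0.9, 1.0, 0.1],
    "r/example_hobby": [0.0, -1.0, 0.5],  # hypothetical community
}
seed = ("r/teenagers", "r/RedditForGrownups")
dimension, pairs = build_dimension(vectors, seed, n_extra=1)
print(pairs)
print(dimension)
```

With this toy data, the pair most similar to the seed is r/teenrelationships and r/relationship_advice, mirroring the example in the text. A real implementation would likely restrict candidate pairs to near neighbours in the embedding rather than scanning all pairs.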

By projecting every community vector onto this dimension, we quantify every community's social association with age. Communities are closer to the younger pole if and only if their members are more active in other young communities. Users ‘vote with their feet’ to decide the social orientation of communities: only action across large numbers of people matters.
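As a rough illustration of the projection step, the sketch below scores each community by the cosine similarity between its vector and a dimension vector, then ranks the communities from one pole to the other. The two-dimensional vectors, the community names, and the variable `age_dimension` are all hypothetical, and using cosine similarity as the projection is an assumption of this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical dimension: negative scores lean younger, positive lean older.
age_dimension = [1.0, 0.0]

# Toy community vectors (illustrative values only).
communities = {
    "community_a": [-0.8, 0.6],
    "community_b": [0.0, 1.0],
    "community_c": [0.6, 0.8],
}

# Project every community onto the dimension.
scores = {name: cosine(vec, age_dimension) for name, vec in communities.items()}

# Order communities from the younger pole to the older pole.
ranked = sorted(scores, key=scores.get)
print(ranked)
```

Because the score depends only on where a community's vector points, it reflects who is active in the community rather than anything stated about its topic.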

Younger communities appear on the left while older communities appear on the right.

Note that this technique quantifies the social makeup of entire communities, not individuals.

Our method takes advantage of how community embeddings [1] encode social relationships between communities in a high-dimensional vector space. In a community embedding, communities are close together if their userbases are surprisingly similar. By finding dimensions in this space that correspond to social concepts, we can measure how every one of the 10,006 communities in our embedding scores on social dimensions like age, ranging from younger to older; gender, ranging from masculine to feminine; and US partisanship, ranging from left-wing to right-wing. Much like how word embeddings have been shown to encode cultural dimensions [2], we show that community embeddings accurately encode a variety of social dimensions.

These scores represent how behaviourally aligned a community’s user base is with one end or the other of each dimension. By examining these scores, we can understand how individual communities relate to these dimensions, and by looking at how entire groups of communities vary on these axes, we can understand how the entire platform is organized with respect to age, gender, and US political partisanship. In the interactive plot below, each point is a community, and the y-axis groups communities into clusters. Entire top-level clusters of communities skew strongly towards the poles of the dimensions. For example, programming communities skew towards the masculine and old poles, and personal matters communities skew towards the feminine and left-wing poles. Politics communities exhibit a bimodal distribution on the partisan axis. Take a look for yourself!


This stratification along social dimensions demonstrates the importance of age, gender, and US partisanship to the high-level organization of activity on Reddit. Furthermore, this method quantifies how this platform is socially organized, both in individual community scores and in the skews of each cluster.

Using this new-found information, we can study many micro- and macroscale phenomena. We turn our attention to online political polarization on Reddit.

The dynamics and mechanisms of online political polarization

A particularly widespread concern is whether online populations increasingly sort into politically homogeneous ‘echo chambers’ and whether social media platforms tend to shift users towards ideological extremes. Did activity on Reddit ‘polarize’ by increasingly taking place in homogeneous partisan-leaning communities? And if so, did users themselves migrate towards the partisan extremes?

Using social dimensions, we can answer these questions. The partisan dimension accurately measures the political association of every subreddit, and by focusing only on political subreddits we can measure how political activity changed over time.

We find that Reddit became substantially more polarized around the 2016 US presidential election, with polarization peaking in November 2016, when the election took place. For example, the percentage of political activity that took place in far-left and far-right communities was only 2.8% in January 2015, but reached 24.8% in November 2016. The platform never returned to pre-2016 polarization levels.
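To make the "share of activity in far-left and far-right communities" measure concrete, here is a small sketch. Everything in it is illustrative: the community names, partisan scores, and comment counts are invented, and the cutoff `threshold` for what counts as "far" is a hypothetical parameter of this sketch rather than the definition used in the paper.

```python
from collections import defaultdict

def extreme_share(activity, partisan_score, threshold):
    """Fraction of each month's activity that falls in extreme communities.

    activity:       list of (month, community, n_comments) records
    partisan_score: dict mapping community -> partisan score (sign = side)
    threshold:      hypothetical cutoff on |score| for "far-left/far-right"
    """
    total = defaultdict(int)
    extreme = defaultdict(int)
    for month, community, n_comments in activity:
        total[month] += n_comments
        if abs(partisan_score[community]) >= threshold:
            extreme[month] += n_comments
    return {month: extreme[month] / total[month] for month in total}

# Toy data (not real measurements).
partisan_score = {"left_fringe": -2.5, "center_hub": 0.1, "right_fringe": 2.8}
activity = [
    ("2015-01", "center_hub", 90),
    ("2015-01", "right_fringe", 10),
    ("2016-11", "center_hub", 60),
    ("2016-11", "right_fringe", 30),
    ("2016-11", "left_fringe", 10),
]

share = extreme_share(activity, partisan_score, threshold=2.0)
print(share)
```

In this toy example the extreme share rises from 10% to 40% between the two months; the numbers only demonstrate the calculation and say nothing about Reddit itself.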

Does this mean individual users polarize over time on Reddit? Overall increases in platform-level polarization could be driven either by individual-level change, with existing users moving towards the partisan extremes, or by population-level turnover, with new users entering the platform in more extreme communities. To quantify this, we split users into “cohorts” by the year in which they first had political activity on Reddit (for example, the 2013 cohort is made up of users whose first political activity was in 2013) and measure how their polarization changed over time.
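The cohort analysis can be sketched as follows. This is a simplified illustration under assumptions: each record is a hypothetical `(user, year, polarization)` triple, where the polarization value stands in for however extreme the community hosting that piece of activity is, and the user names and numbers are invented.

```python
from collections import defaultdict

def cohort_polarization(events):
    """Average polarization per (cohort, year).

    events: list of (user, year, polarization) records, one per piece of
            political activity. A user's cohort is the year of their first
            political activity.
    """
    # First pass: find each user's first year of political activity.
    first_year = {}
    for user, year, _ in events:
        first_year[user] = min(year, first_year.get(user, year))

    # Second pass: average polarization within each (cohort, year) cell.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user, year, polarization in events:
        key = (first_year[user], year)
        sums[key] += polarization
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Toy activity log (illustrative values only).
events = [
    ("u1", 2013, 0.5),
    ("u1", 2016, 1.5),
    ("u2", 2016, 2.0),
    ("u2", 2017, 1.0),
    ("u3", 2013, 1.5),
]

result = cohort_polarization(events)
for (cohort, year), avg in sorted(result.items()):
    print(cohort, year, avg)
```

Reading along a fixed cohort shows individual-level change over time, while comparing different cohorts in the same year shows the effect of population turnover.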

Political polarization on Reddit

This yellow line shows the average polarization of political activity for users from 2012 or earlier.

In 2016, the average polarization of activity for this set of older users went up significantly, at the same time that the platform polarized overall.

Otherwise, polarization remained steady.

We see a similar pattern for the 2013, 2014, and 2015 cohorts.

Average polarization is steady other than a spike in 2016.

The 2016 cohort is different from the others.

Activity of users with first political activity in 2016 was significantly more polarized right from the start.

It declined after 2016, but remained at an elevated level.

In the post-2016 period, newer users were less and less polarized.

The 2017 and 2018 cohorts were less polarized than the 2016 cohort, but still more polarized than older cohorts.

So what should we take away from the above? Individual users generally do not polarize over time. Aside from 2016, within-cohort polarization is either stable or decreases over time. During 2016, all active cohorts were remarkably synchronized in their polarization trends. The intense increase in polarization in 2016 was disproportionately driven by new (and newly political) users. Despite accounting for only 38% of political activity, they made the change in polarization 2.17 times higher than it otherwise would’ve been. Individual polarization level is unrelated to previous activity on the platform.

Examining polarization over time separately for left- and right-wing communities reveals a stark ideological asymmetry. Activity on the right was substantially more polarized than activity on the left in every month between 2012 and 2018. In 2016, discussion on the right shifted significantly rightward. During the same period, discussion on the left and in the center did not polarize at all. The overall platform shift in polarization in 2016 is thus entirely driven by the change in activity on the right, despite the fact that the right is the smallest group by discussion volume. Similar to the analogous findings for overall polarization, new users on the right in 2016 were significantly more polarized than all prior cohorts and disproportionately drove the observed polarization of the right wing on Reddit, consistent with the rise of large right-wing communities such as r/The_Donald. Changes in polarization on the left are small by comparison.

These findings suggest that an observed change, such as increased polarization, may sometimes reflect the changing makeup of a specific population rather than a society-level change in beliefs.

More to come!

We hope that others will find this method useful to study a wide variety of phenomena in online platforms. Sociologists dating back to Simmel, who pioneered the notion of “the web of group affiliations,” have employed complex characterizations of group membership to understand social identity. We have shown that by harnessing mass co-membership data, we can use high-dimensional representations of online communities to produce valid, fine-grained, and semantically meaningful measurements of their social alignment. Furthermore, aggregating these measurements provides a macroscale description of how platforms are organized along key social dimensions. Our methodology can be generally applied to quantify the social organization of online discussion, to situate important content and behaviours in the context of the platform, and to understand the nature of individual- and platform-level online polarization and the mechanisms that drive it.

Full methodological and other details can be found in our paper and our GitHub repo.

  1. Martin T. community2vec: Vector representations of online communities encode semantic relationships. Proceedings of the Second Workshop on NLP and Computational Social Science. 2017. doi:10.18653/v1/W17-2904 

  2. Kozlowski AC, Taddy M, Evans JA. The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings. American Sociological Review. 2019. doi:10.1177/0003122419877135