COMPUTATIONAL SOCIAL SCIENCE



Modern Kazakhstan faces a number of challenges whose solutions require new, data-driven approaches and evidence-based policy. Thanks to the digitalization of recent decades, large amounts of data about society have quickly become available to sociologists without the need to conduct specialized surveys or collect statistics. Social media, digital footprints, the Internet of Things, and government and business platforms generate countless datasets every day, which let researchers answer their questions on a larger scale and with greater accuracy. Sociology is becoming computational.


In this context, one of the vanguard trends in contemporary social research is the use of Computational Social Science.

What is Computational Social Science
Computational Social Science (CSS) is an interdisciplinary field that develops theories of human behavior by applying computational methods to large datasets from social media, the Internet, or other digitized archives (e.g., administrative records).

Where are the boundaries that let us say a given scholarly work belongs to computational sociology?

We understand CSS not as a theoretical framework but as a broad methodological approach to working with data of a social nature. Thus, work on any sociological topic can be categorized as CSS if it relies on the range of methods described below.

CSS METHODS

Social Network Analysis

SNA (social network analysis) is a method based on mathematical graph theory and aimed at studying the structure of interactions between objects within a network. A network consists of nodes (individual actors in the network) and edges (relationships between nodes). SNA tools are used to analyze patterns of relationships between people in groups. Visualizing the network makes it possible to see the structure of relationships, to identify non-obvious links and their nature, and to detect groups of nodes that have formed, known as clusters.

To understand SNA, it is necessary to define its object of research: the network. A network is a set of objects (agents, or nodes) and the links (edges) between them. We can think of online social networks in this way, as a set of users (nodes) and their subscriptions to each other (edges). We can then investigate who is the most popular in the network and why, and ask a number of other questions. Many other phenomena can be represented as a network: relationships in a school class (who is friends with whom), career trajectories of employees in a company (who has the most authority and gets promoted), the literary market (which presses publish which authors), and so on.
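The "users and subscriptions" view above can be sketched in a few lines of Python. This is a minimal illustration, not a full SNA workflow: the user names and the `follows` list are invented, and counting incoming edges (in-degree) is only one of several possible measures of popularity.

```python
from collections import Counter

# Hypothetical follow relationships: (follower, followed). Invented for illustration.
follows = [
    ("aida", "bolat"),
    ("dana", "bolat"),
    ("erlan", "bolat"),
    ("bolat", "aida"),
    ("dana", "aida"),
]

# Nodes are all users that appear at either end of an edge.
nodes = {user for edge in follows for user in edge}

# In-degree: how many followers each user has.
in_degree = Counter(followed for _, followed in follows)

most_followed, follower_count = in_degree.most_common(1)[0]
print(sorted(nodes))
print(most_followed, follower_count)
```

Other centrality measures (for example, betweenness) answer different questions, such as who connects otherwise separate groups.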

For example, Andrew Beveridge and Jie Shan used social network analysis to investigate the relationships between characters in the fantasy novel series “A Song of Ice and Fire” by George R.R. Martin. The authors applied graph theory algorithms to build a network from the third book of the series (“A Storm of Swords”), because it is in this book that the characters split into distinct social circles.


The resulting network has 107 nodes (characters) and 353 weighted edges, where each weight is the number of times two characters are mentioned within 15 words of each other. Seven communities (clusters) were identified, which revealed the hidden political map of the narrative without requiring a close reading of the plot.
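The edge-weighting rule described above (shared mentions within 15 words) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `cooccurrence_edges`, the whitespace tokenization, and the toy sentence are all hypothetical, and the original study's exact counting rules may differ.

```python
from collections import Counter

def cooccurrence_edges(text, characters, window=15):
    """Count how often two different character names occur within `window` words."""
    tokens = text.lower().split()
    # Positions of every character mention in the token stream.
    mentions = [(i, tok) for i, tok in enumerate(tokens) if tok in characters]
    weights = Counter()
    for a in range(len(mentions)):
        for b in range(a + 1, len(mentions)):
            pos_a, name_a = mentions[a]
            pos_b, name_b = mentions[b]
            if pos_b - pos_a > window:
                break  # mentions are sorted by position, so later ones are farther
            if name_a != name_b:
                weights[tuple(sorted((name_a, name_b)))] += 1
    return weights

# Invented toy sentence (not a quotation from the novel); small window for the demo.
toy_text = ("jon met sam at the wall then sam wrote to jon "
            "while arya trained far away in another city")
edges = cooccurrence_edges(toy_text, {"jon", "sam", "arya"}, window=4)
print(dict(edges))
```

Each key of `edges` is an undirected pair of characters and each value is the edge weight, which is the form of input that community-detection algorithms typically expect.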

The figure shows a visualization of the characters' social network. The color of the nodes signifies their belonging to one of the clusters, and the size of the node is proportional to the importance of the character in terms of the network structure. The font size is an indicator of how often a character mediated between others.

  • Figure from Andrew Beveridge and Jie Shan (2016). The social network generated from "A Storm of Swords"
Natural language processing
NLP (natural language processing) is a group of methods that allow any kind of text to be analyzed using computer algorithms. NLP is also called computational linguistics. One of the main advantages of the method is that it automates the processing of text data and makes it possible to work with volumes of text that no person could read in a lifetime.

Although researchers are likely to encounter NLP every day (both message autocorrection and ChatGPT are based on such methods), work in this framework is still not common in sociology. NLP is usually used in cultural studies, political sociology, inequality studies, and related areas.

For example, Kozlowski et al. used text corpus analysis and word embeddings to study how semantic associations related to social class are formed in mass discourse. By analyzing large text corpora, they identified how different occupations, values, and social practices are associated with ideas about poverty, wealth, masculinity, and femininity.

In the figure, the authors plot sports along the “poor-rich” and “feminine-masculine” axes. The axes show how texts relate particular sports to social status and gender representations. Boxing, hockey, and basketball are associated with poverty and masculinity, tennis and golf with wealth and masculinity, while softball and volleyball appear as “rich” and “feminine” sports.

  • Figure from Kozlowski et al. (2019). Associations by sport with class are presented on the X-axis, with gender on the Y-axis
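The idea of placing words on a “poor-rich” axis can be sketched with toy vectors. This is only an illustration: the three-dimensional vectors below are invented, the function name `affluence_score` is hypothetical, and building the axis from a single word pair is a simplifying assumption rather than the authors' procedure.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real word vectors have hundreds of dimensions and are learned from text.
vectors = {
    "rich":   [1.0, 0.0, 0.2],
    "poor":   [-1.0, 0.0, 0.2],
    "golf":   [0.8, 0.3, 0.1],
    "boxing": [-0.7, 0.6, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The "poor-rich" axis is the difference between the two pole words.
affluence_axis = [r - p for r, p in zip(vectors["rich"], vectors["poor"])]

def affluence_score(word):
    """Positive values lean toward "rich", negative toward "poor"."""
    return cosine(vectors[word], affluence_axis)

for word in ("golf", "boxing"):
    print(word, round(affluence_score(word), 2))
```

Repeating the same projection on a second axis (for example, “feminine-masculine”) gives each word two coordinates, which is how a two-dimensional plot like the one in the figure can be produced.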
Causal inference
Causal Inference (CI) is a data analysis technique that seeks to identify causal relationships between phenomena. In simple terms, CI helps answer the question, “Did one event affect another?” For example, does a job retraining program increase a person's chances of getting a job? Or does the introduction of a new drug reduce the likelihood of hospitalization?

To answer such questions, simply analyzing correlation is not enough. Although two variables may be related, this does not mean that one causes the other. This phenomenon is often framed in the phrase “correlation doesn't imply causation”. Why is this so?

One reason is that when we compute a correlation between two variables and try to establish causality (that one variable causes the other), we often miss the influence of a third, omitted variable: something that we did not take into account in the calculation but that in reality has an effect.

A frequent demonstration of such a third variable is the example of ice cream and sunburn. If we compute the correlation between the amount of ice cream consumed and the number of sunburn cases, the coefficient would indicate a fairly strong relationship. However, we cannot logically say that ice cream causes sunburn or vice versa. What we are missing is a third variable, the season of the year. Hot summer weather leads people to buy ice cream and, at the same time, causes sunburn.
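The ice cream example can be reproduced with simulated data. In the sketch below, all numbers are made up: temperature drives both variables, and neither affects the other. Removing the part of each variable explained by temperature (a simple residual-based adjustment chosen here for illustration) should make the correlation largely disappear.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def residuals(y, x):
    """Part of y not explained by a straight-line fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

rng = random.Random(42)
# Simulated data: temperature drives both variables; neither affects the other.
temperature = [rng.uniform(0, 35) for _ in range(5000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
sunburn = [0.5 * t + rng.gauss(0, 3) for t in temperature]

raw = pearson(ice_cream, sunburn)
adjusted = pearson(residuals(ice_cream, temperature),
                   residuals(sunburn, temperature))
print(f"raw correlation: {raw:.2f}")
print(f"after removing temperature: {adjusted:.2f}")
```

In real studies the omitted variable is usually not observed so conveniently, which is why the research designs discussed in this section matter.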

Causal inference is widely used in economics, medicine, social sciences, and business analytics to evaluate the effectiveness of policies, drugs, educational programs, and marketing strategies.

One example is a study by Eric Chyn, in which the author examined the long-term effects of the forced relocation of children from high-crime neighborhoods to less disadvantaged ones as a result of the demolition of public housing in Chicago. Using data on employment, income, and education, the author compared two groups with similar characteristics:

– Those who were forced to move (experimental group);
– Those who remained living in the same neighborhood (control group).

This quasi-experimental design makes it possible to assess the causal effect of relocation, since the move was not initiated by the residents themselves but was forced by the dilapidated condition of the buildings. This reduces the likelihood of systematic differences between the groups and allows the difference in living standards to be interpreted as a result of relocation.

The figure shows the results of the comparison of the two groups, where those who moved are further divided by age at the time of the move (7-12 years old and 13-18 years old). The graph consists of two panels: the left panel shows employment and the right panel shows income. According to the study, forced relocation has a positive effect, especially in the younger age group.
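The logic of comparing the two groups can be sketched with a simulation. Everything below is hypothetical: the outcome scale, the assumed `TRUE_EFFECT`, and the coin-flip assignment are chosen only to show why a difference in group means recovers the effect when assignment does not depend on household characteristics. It is not the study's data or its actual estimator.

```python
import random

rng = random.Random(7)
TRUE_EFFECT = 2.0  # assumed effect of moving, in arbitrary outcome units

def outcome(moved):
    """Simulated adult outcome: shared baseline plus noise, plus the effect if moved."""
    return 10.0 + rng.gauss(0, 5) + (TRUE_EFFECT if moved else 0.0)

# Assignment does not depend on household traits, mimicking demolition-driven moves.
moved_flags = [rng.random() < 0.5 for _ in range(10000)]
outcomes = [outcome(m) for m in moved_flags]

treated = [y for y, m in zip(outcomes, moved_flags) if m]
control = [y for y, m in zip(outcomes, moved_flags) if not m]

# Difference in means between those who moved and those who stayed.
estimate = sum(treated) / len(treated) - sum(control) / len(control)
print(f"estimated effect: {estimate:.2f} (true effect: {TRUE_EFFECT})")
```

If the decision to move depended on, say, family income, the same difference in means would mix the effect of moving with the effect of income.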

  • Figure from Eric Chyn (2018). Effects on employment and earnings as a function of measurement age, where the X-axis is age and the Y-axis is the treatment effect
Agent-based modeling
Agent-based modeling (ABM) makes it possible to study complex social systems through computer simulation. This method involves creating a dynamic environment and placing agents in it.

Agents are autonomous entities with a given set of characteristics, whose behavior (e.g., communication with other agents) is programmed in a certain way. The advantage of the method is simulation: the analysis is not limited to the available data, so researchers can create their own hypothetical scenarios, study their dynamics, and identify new patterns of behavior.

An example is urban traffic modeling, where each car is an agent with specified characteristics, responding to traffic signals and other traffic participants. By changing the parameters, it is possible to evaluate the impact of new roads or traffic patterns on congestion and safety. Another example is modeling the spread of disease, where agents (people) have unique attributes and interact according to rules such as social distancing or vaccination. This helps to evaluate the effectiveness of public health interventions.

A classic example of agent-based modeling is Schelling's work (1971), in which the author uses simulation to build a model of racial segregation. The main idea of the model is to test the extent to which people's mild preferences for living among those “similar” to them lead to strong segregation in cities.

The figure shows the results of the simulation: agents that were initially arranged at random move to other cells, creating a near-perfect partition into neighborhoods of the same color. The agents moved according to their preference to live next to others who share their traits, and they keep moving until each agent is satisfied with its location.

  • Figure from Luca Mingarelli (2021). Visualization of Schelling's segregation model showing the four stages of segregation
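A compact version of a Schelling-style simulation is sketched below. It is a simplified variant, not a reproduction of the 1971 model or of the figure: the grid size, the share of empty cells, the wrap-around edges, the `threshold` of 0.4, and the random-relocation rule are all assumptions chosen for brevity.

```python
import random

def neighbors(grid, r, c):
    """Return the types of occupied cells among the 8 neighbours (grid wraps around)."""
    n = len(grid)
    found = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            value = grid[(r + dr) % n][(c + dc) % n]
            if value is not None:
                found.append(value)
    return found

def share_similar(grid, r, c):
    """Share of occupied neighbours with the same type as the agent at (r, c)."""
    nb = neighbors(grid, r, c)
    if not nb:
        return 1.0  # no neighbours: treat the agent as satisfied
    return sum(1 for v in nb if v == grid[r][c]) / len(nb)

def mean_similarity(grid):
    """Average share of like neighbours over all agents (a simple segregation index)."""
    n = len(grid)
    vals = [share_similar(grid, r, c)
            for r in range(n) for c in range(n) if grid[r][c] is not None]
    return sum(vals) / len(vals)

def simulate(size=20, empty_share=0.1, threshold=0.4, max_rounds=100, seed=0):
    """Run the model and return the segregation index before and after."""
    rng = random.Random(seed)
    n_empty = int(size * size * empty_share)
    n_agents = size * size - n_empty
    cells = [0] * (n_agents // 2) + [1] * (n_agents - n_agents // 2) + [None] * n_empty
    rng.shuffle(cells)
    grid = [cells[i * size:(i + 1) * size] for i in range(size)]
    before = mean_similarity(grid)
    for _ in range(max_rounds):
        unhappy = [(r, c) for r in range(size) for c in range(size)
                   if grid[r][c] is not None and share_similar(grid, r, c) < threshold]
        if not unhappy:
            break  # every agent is satisfied
        rng.shuffle(unhappy)
        for r, c in unhappy:
            empties = [(i, j) for i in range(size) for j in range(size)
                       if grid[i][j] is None]
            i, j = rng.choice(empties)
            grid[i][j], grid[r][c] = grid[r][c], None  # move to a random empty cell
    return before, mean_similarity(grid)

before, after = simulate()
print(f"mean share of like neighbours: {before:.2f} -> {after:.2f}")
```

Varying `threshold` is a natural way to explore the question the model poses: how weak a preference can be while still producing visibly separated neighborhoods.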
Machine learning
Machine learning (ML) is, simply put, the field that studies how computer algorithms can learn from real-world data and improve with experience instead of following hand-written rules; it is a core part of artificial intelligence (AI). Taking datasets as inputs, machine learning algorithms can perform complex tasks including prediction, classification, and clustering. One of the advantages of ML is its ability to handle large amounts of data and find complex dependencies that are difficult to detect with traditional statistical methods.

ML is now widely applied in business and the technical sciences. In sociology, it is used more as a tool for explaining the patterns that the algorithm has “learned”. This approach is called interpretable machine learning (IML), or explainable artificial intelligence (XAI).
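Classification, one of the tasks named above, can be illustrated with a k-nearest-neighbors sketch. This is a deliberately small example: the training points, the labels, and the function name `knn_predict` are invented, and k-nearest neighbors is just one simple algorithm among many.

```python
import math
from collections import Counter

# Invented training data: (feature_1, feature_2) -> label.
training = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.3), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.7), "b"), ((4.8, 5.3), "b"),
]

def knn_predict(point, data, k=3):
    """Label a point by majority vote among its k nearest training examples."""
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.5, 1.2), training))
print(knn_predict((4.9, 5.1), training))
```

The algorithm contains no rule that says what makes an "a" or a "b"; the labels of new points follow entirely from the examples it was given.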

For example, Wim Bernasco and colleagues analyzed video surveillance footage to study how residents of Amsterdam, the Netherlands, maintained a social distance of 1.5 m during the COVID-19 pandemic in 2020-2021. The authors developed an ML algorithm that analyzed the video, measured the distance between pedestrians, and counted the number of social distancing violations during different waves of quarantine.

  • Figure from Bernasco et al. (2021) with mean number of violations of the 1.5-m social distance rule on the studied camera data (Y-axis) and weeks 2020-2021 (X-axis)
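Once pedestrians have been located in a frame, counting violations is a matter of simple geometry. The sketch below covers only this last step and assumes that positions are already available as ground-plane coordinates in meters; the detection step, the function name `count_violations`, and the sample coordinates are hypothetical and not taken from the study.

```python
import math
from itertools import combinations

def count_violations(positions, min_distance=1.5):
    """Count pedestrian pairs standing closer than `min_distance` metres."""
    return sum(
        1
        for p, q in combinations(positions, 2)
        if math.dist(p, q) < min_distance
    )

# Hypothetical ground-plane coordinates (in metres) for one video frame.
frame = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.0, 1.2)]
print(count_violations(frame))
```

Averaging such per-frame counts over a week would give a weekly series of the kind plotted in the figure.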
Large language models
Large language models (LLMs) are a type of machine learning model that allows computers to understand and generate text. Such models are trained on huge amounts of text (books, articles, web pages, and other sources) and learn to predict what words or phrases will come next in a sentence. One of the most famous examples of a large language model is ChatGPT, which generates responses based on the context of a query.
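The core idea of predicting the next word can be shown with a toy count-based model. This is only an analogy: real LLMs are neural networks trained on vastly more text, whereas the sketch below merely counts which word most often follows another in an invented sentence. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

# Invented toy corpus.
model = train_bigram("the cat sat on the mat the cat ate the fish")
print(predict_next(model, "the"))
```

Unlike this sketch, an LLM conditions its prediction on a long preceding context rather than on a single word, which is what allows it to respond to the context of a query.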

When talking about LLMs, the focus is often on their ability to write texts. But that is just the tip of the iceberg. They help doctors process medical records, lawyers prepare documents, journalists gather facts, and marketers write creative copy.

Sociologists, on the other hand, use LLMs as an “imprint” of society, because large models are trained on a vast share of the texts available on the Internet. By analyzing LLMs, one can therefore try to model the results of public opinion polls, as Argyle et al. (2023) do.

Another example comes from Park et al. (2023), where the authors combined LLMs with agent-based modeling. Within a single cooperative game, they created several agents that could communicate with each other via an LLM. They then observed the interactions among the agents, which revealed how a social hierarchy emerges and how information spreads within the team.

  • Figure from Park et al. (2023). Example of the game in which the simulation took place, with examples of agents interacting with each other

Kazakhstan Sociology Lab is actively developing in the direction of computational social science and popularizing these approaches among Kazakhstani researchers and students. The program of our School includes training courses and modules, master classes, and research projects based on a sociological approach to the analysis of big data, social networks, and digital traces of behavior. We discuss new sources of open data and the modeling of social processes using machine learning, Natural Language Processing (NLP), Social Network Analysis (SNA), and experimental methods. These methods allow us to better understand social processes in Kazakhstan and to offer better-informed recommendations to decision makers and society as a whole.

© 2023 Kazakhstan Sociology Lab