COMPUTATIONAL SOCIAL SCIENCE



Modern Kazakhstan faces a number of challenges whose solution requires new approaches grounded in data and evidence-based policy. With the digitalization of recent decades, large volumes of data about society have quickly become available to sociologists, without the need to conduct specialized surveys or collect statistical information. Social media, digital footprints, the Internet of Things, and government and business platforms generate new datasets every day, enabling researchers to address their questions at a larger scale and, in many cases, with greater accuracy. As a result, sociology is becoming computational.


In this context, one of the leading trends in contemporary social research is the use of Computational Social Science.

What Is Computational Social Science?
Computational Social Science (CSS) is an interdisciplinary field that develops theories of human behavior by applying computational methods to large datasets from social media, the Internet, and other digitized archives (e.g., administrative records).

What are the boundaries that allow us to claim that a scholarly work belongs to computational sociology?

We understand CSS not as a theoretical framework, but as a broad methodological approach to working with social data. Thus, research on any sociological topic conducted using a range of computational methods can be categorized as CSS.

Kazakhstan Sociology Lab is implementing the program "Computational Social Sciences: New Horizons of Methodology and Evidence-Based Policy", supported by program-targeted funding for scientific and scientific-technical programs for 2025-2027 from the Ministry of Higher Education and Science of the Republic of Kazakhstan.
Research team
  • Adil Rodionov
    Principal Investigator
  • Olessia Koltsova
    Research Collaborator
    (Advisory Board Member)
  • Katerina Guba
    Research Collaborator
    (Advisory Board Member)
  • Ivan Smirnov
    Research Collaborator
    (Advisory Board Member)
  • Aliya Sarsekeyeva
    Co-Principal Investigator
  • Dmitriy Serebrennikov
    Co-Principal Investigator
  • Yadviga Sinyavskaya
    Visiting Researcher
  • Ksenia Tenisheva
    Visiting Researcher
  • Darkhan Medeuov
    Researcher
  • Zhaniya Aubakirova
    Junior Researcher
  • Arsen Avsatkarinov
    Junior Researcher

CSS METHODS

Social Network Analysis
Natural Language Processing
Causal Inference
Agent-Based Modeling
Machine Learning
Large Language Models

Social Network Analysis

Social Network Analysis (SNA) is a method grounded in graph theory that studies the structure of interactions between objects within a network. A network consists of nodes (individual actors) and edges (relationships between nodes). SNA tools are used to analyze patterns of relationships between individuals or groups. Network visualization makes it possible to observe structural patterns, identify non-obvious connections and their characteristics, and detect groups of closely connected nodes, known as clusters.

To understand SNA, it is essential to define its primary object of analysis: the network. A network can be described as a set of objects (agents, or nodes) together with the links that connect them. An online social network, for example, can be represented as a set of users (nodes) and their subscriptions or connections to one another (edges). This representation allows researchers to examine questions such as who is the most central or influential actor in the network and why. Many social phenomena can be modeled as networks: friendships within a school class, career trajectories within an organization, or relationships in the literary market (for example, which publishers work with which authors).
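The question of who is most central can be made concrete with a small sketch. The example below uses an invented friendship network (all names and ties are hypothetical) and computes degree centrality, one of the simplest centrality measures: the share of other nodes to which a node is directly connected. Other measures of influence exist and may rank actors differently.

```python
from collections import defaultdict

# Hypothetical friendship ties in a small school class (undirected edges).
edges = [
    ("Aigerim", "Bolat"),
    ("Aigerim", "Dana"),
    ("Aigerim", "Yerlan"),
    ("Bolat", "Dana"),
    ("Yerlan", "Timur"),
]

def build_adjacency(edge_list):
    """Map each node to the set of nodes it is directly connected to."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)

def degree_centrality(adjacency):
    """Share of the other nodes that each node is directly tied to."""
    n = len(adjacency)
    return {node: len(neigh) / (n - 1) for node, neigh in adjacency.items()}

adjacency = build_adjacency(edges)
centrality = degree_centrality(adjacency)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

In this toy network the node with the most direct ties comes out on top; in real studies, dedicated network libraries are normally used instead of hand-written functions.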

As an illustrative example, Andrew Beveridge and Jie Shan applied social network analysis to study relationships between characters in the fantasy novel series A Song of Ice and Fire by George R. R. Martin. Using graph-theoretical algorithms, they constructed a network based on the third book of the series, A Storm of Swords, where characters become clearly divided into distinct social circles.

Their analysis identified a network consisting of 107 nodes (characters) and 353 weighted edges, with edge weights corresponding to the number of times two characters were mentioned within 15 words of each other. Seven communities (clusters) were detected, revealing the underlying political structure of the narrative without requiring direct engagement with the plot.
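As a rough sketch of how such weighted edges could be derived from text, the function below counts how often two names occur within 15 words of each other. It is a simplified illustration, not the authors' actual pipeline: it assumes single-word names, and the function name `cooccurrence_edges` and the sample sentence are invented for this example.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_edges(text, characters, window=15):
    """Count how often each pair of characters is mentioned within `window` words."""
    words = re.findall(r"[A-Za-z']+", text)
    # Positions in the word sequence where a character name occurs.
    mentions = [(i, w) for i, w in enumerate(words) if w in characters]
    weights = Counter()
    for (i, a), (j, b) in combinations(mentions, 2):
        # Skip repeated mentions of the same character and distant pairs.
        if a != b and j - i <= window:
            weights[tuple(sorted((a, b)))] += 1
    return weights

# Invented sample text, used only to exercise the function.
sample = (
    "Jon spoke with Sam at the Wall. Later Sam wrote a letter to Jon. "
    "Far away in the south, beyond many rivers and hills and castles "
    "and long roads, Arya trained alone."
)
weights = cooccurrence_edges(sample, {"Jon", "Sam", "Arya"})
for pair, weight in weights.items():
    print(pair, weight)
```

Each key of `weights` is an edge and each value is its weight. A character who is never mentioned near another one, like the third name in the sample, ends up with no edges at all.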

The figure presents a visualization of the characters’ social network. Node color indicates cluster membership, node size reflects structural importance within the network, and font size represents the frequency with which a character acts as an intermediary between others.

  • Figure from Andrew Beveridge and Jie Shan (2016). The social network generated from "A Storm of Swords"
Natural Language Processing
Natural Language Processing (NLP) is a group of methods that enable the analysis of textual data using computational algorithms. The field is closely related to, and sometimes referred to as, computational linguistics. Its main advantages are the automation of text processing and the ability to work with volumes of text that no person could read in a lifetime.

Although researchers encounter NLP on a daily basis (for example, both message autocorrection and ChatGPT are based on such methods), studies that rely on it remain relatively uncommon in sociology. Where it is used, it appears most often in cultural studies, political sociology, and research on social inequality, among other fields.



For example, Kozlowski et al. used text corpus analysis and word embeddings to study how meaning associations related to social class are formed in mass discourse. By analyzing large text corpora, the authors demonstrated how different occupations, values, and social practices become associated with ideas of poverty, wealth, masculinity, and femininity.
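The general idea of reading social meaning from word embeddings can be sketched as follows. A "dimension" such as class is approximated by the difference between two word vectors, and other words are scored by their cosine similarity to that difference. The vectors below are hand-made three-dimensional toys chosen for illustration only; real embeddings are learned from large corpora and have hundreds of dimensions, and the authors' actual procedure may differ from this single-pair version.

```python
import math

# Hand-made toy vectors (illustrative assumptions, not learned embeddings).
vectors = {
    "rich":      [1.0, 0.0, 0.2],
    "poor":      [-1.0, 0.0, 0.2],
    "masculine": [0.0, 1.0, 0.2],
    "feminine":  [0.0, -1.0, 0.2],
    "golf":      [0.8, 0.3, 0.1],
    "boxing":    [-0.7, 0.4, 0.1],
    "softball":  [0.5, -0.6, 0.1],
}

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(word, positive, negative):
    """Score `word` on the axis running from `negative` to `positive`."""
    axis = subtract(vectors[positive], vectors[negative])
    return cosine(vectors[word], axis)

for sport in ("golf", "boxing", "softball"):
    class_score = project(sport, "rich", "poor")
    gender_score = project(sport, "masculine", "feminine")
    print(sport, round(class_score, 2), round(gender_score, 2))
```

A positive class score places a word toward the "rich" end of the axis and a negative one toward the "poor" end; the gender score works the same way.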

In the figure, the authors place sports along “poor-rich” and “feminine-masculine” axes. The visualization illustrates how texts associate particular sports with social status and gender. Boxing, hockey, and basketball are associated with poverty and masculinity; tennis and golf with wealth and masculinity; and softball and volleyball with wealth and femininity.

  • Figure from Kozlowski et al. (2019). Class associations of sports are shown on the X-axis and gender associations on the Y-axis
Causal Inference
Causal Inference (CI) is a data analysis approach that seeks to identify causal relationships between phenomena. In simple terms, CI helps answer the question: “Did one event affect another?” For example, does a job retraining program increase a person’s chances of finding employment? Or does the introduction of a new drug reduce the likelihood of hospitalization?

To answer such questions, analyzing correlation alone is not sufficient. Although two variables may be related, this does not necessarily mean that one causes the other—a point commonly summarized by the phrase “correlation does not imply causation.” Why is this the case?

One reason is that when we measure a correlation between two variables and interpret it causally (as one variable causing the other), we often overlook a third, omitted variable: something that was left out of the calculation but that, in reality, affects both.

A common demonstration of a third variable at work is the example of ice cream and sunburn. If we compute the correlation between the amount of ice cream consumed and the number of sunburn cases, the coefficient would indicate a fairly strong relationship. However, it would make no sense to conclude that ice cream causes sunburn or vice versa. In this case, we are missing a third variable: the season of the year. It is summertime, with hot weather, that encourages people to buy ice cream and, at the same time, increases the likelihood of sunburn.
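A small simulation can illustrate this. In the sketch below, all numbers are invented: the season raises both ice cream consumption and sunburn, while the two have no direct effect on each other. The correlation is strong across the whole year but close to zero once the season is held fixed.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is reproducible
days = []
for _ in range(2000):
    summer = rng.random() < 0.5
    base = 5.0 if summer else 0.0        # the season shifts both variables
    ice_cream = base + rng.gauss(0, 1)   # no direct link to sunburn
    sunburn = base + rng.gauss(0, 1)     # no direct link to ice cream
    days.append((summer, ice_cream, sunburn))

def correlation(rows):
    return pearson([r[1] for r in rows], [r[2] for r in rows])

overall = correlation(days)
summer_only = correlation([d for d in days if d[0]])
winter_only = correlation([d for d in days if not d[0]])
print(round(overall, 2), round(summer_only, 2), round(winter_only, 2))
```

Comparing within a single season is a crude way of controlling for the omitted variable; with real data, the relevant third variables are rarely this obvious.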

Causal inference is widely used in economics, medicine, the social sciences, and business analytics to evaluate the effectiveness of policies, drugs, educational programs, and marketing strategies.

One example is a study by Eric Chyn, in which the author examined the long-term effects on children of forced relocation from high-crime neighborhoods to less disadvantaged ones following the demolition of public housing in Chicago. Using data on employment, income, and education, the author compared two groups with similar characteristics:

– Those who were forced to move (experimental group);
– Those who remained living in the same neighborhood (control group).

This quasi-experimental design made it possible to assess causal effects, since the relocation was not initiated by the residents themselves but was driven by the unsafe condition of the buildings. This reduces the likelihood of systematic differences between the groups and allows differences in later outcomes to be interpreted as the result of relocation.

The figure shows the results of the comparison between the two groups, where the group of those who moved is further divided by age at the time of relocation (7–12 years old and 13–18 years old). The graph consists of two panels: the left panel represents employment, and the right panel represents income. According to the study, forced relocation has a positive effect, especially among the younger age group.

  • Figure from Eric Chyn (2018). Effects on employment and earnings as a function of measurement age, where the X-axis is age and the Y-axis is the treatment effect
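
Why it matters that the relocation was not chosen by residents can be shown with a simulated comparison. The sketch below is not a reconstruction of the study: the effect size, sample size, and outcome scale are arbitrary assumptions. It contrasts a simple difference in group means when relocation is assigned independently of family background with the same difference when better-off families select themselves into moving.

```python
import random

TRUE_EFFECT = 2.0  # assumed effect of moving on a later outcome (arbitrary units)

def mean(values):
    return sum(values) / len(values)

def difference_in_means(people):
    """Mean outcome of those who moved minus mean outcome of those who stayed."""
    moved = [outcome for was_moved, outcome in people if was_moved]
    stayed = [outcome for was_moved, outcome in people if not was_moved]
    return mean(moved) - mean(stayed)

def simulate(assign, n=4000, seed=0):
    """Generate (was_moved, outcome) pairs under a given assignment rule."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        baseline = rng.gauss(10, 2)  # family background, unobserved by the analyst
        was_moved = assign(baseline, rng)
        outcome = baseline + (TRUE_EFFECT if was_moved else 0.0)
        people.append((was_moved, outcome))
    return people

# Relocation imposed from outside: unrelated to family background.
external = simulate(lambda baseline, rng: rng.random() < 0.5)
# Relocation chosen by residents: better-off families are the ones who move.
self_selected = simulate(lambda baseline, rng: baseline > 10)

estimate_external = difference_in_means(external)
estimate_self_selected = difference_in_means(self_selected)
print(round(estimate_external, 2), round(estimate_self_selected, 2))
```

Under external assignment the estimate stays close to the assumed effect, while under self-selection it is inflated because the two groups already differed before anyone moved.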
Agent-Based Modeling
Agent-Based Modeling (ABM) makes it possible to study complex social systems through computer simulation. The method involves creating a dynamic environment and placing agents within it.

Agents are autonomous entities with a given set of characteristics, whose behavior (e.g., communication with other agents) is programmed in a specific way. One key advantage of this method is the possibility of simulation: the analysis of processes is not limited to available data but allows researchers to create hypothetical scenarios, study their dynamics, and identify new behavioral patterns.

One example is urban traffic modeling, where each car is treated as an agent with specified characteristics that responds to traffic signals and other road users. By changing model parameters, it is possible to evaluate the impact of new roads or traffic patterns on congestion and safety. Another example is modeling the spread of disease, where agents (people) have individual attributes and interact according to rules such as social distancing or vaccination. This makes it possible to assess the effectiveness of public health interventions.

A classic example of agent-based modeling is Schelling's work (1971), in which the author applies simulation modeling to study racial segregation. The central idea of the model is to examine how even small individual preferences to live among people who are “similar” can lead to strong patterns of segregation in cities.

The figure shows the results of the simulation: agents that are initially arranged in a random order move across cells, eventually producing a near-perfect partition into neighborhoods of the same color. This outcome reflects agents’ preferences to live near others who share similar traits. Cells are rearranged until each agent is satisfied with its location.

  • Figure from Luca Mingarelli (2021). Visualization of Schelling's segregation model showing the four stages of segregation
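
A minimal version of such a model can be sketched in a few dozen lines. The variant below is one of many possible implementations and is not taken from Schelling's paper or from the figure: the grid size, the share of empty cells, the 40% tolerance threshold, and the rule that unhappy agents jump to a random empty cell are all assumptions made for illustration.

```python
import random

EMPTY = None

def make_grid(size, n_empty, rng):
    """Square grid with two equally sized groups and some empty cells, shuffled."""
    n_agents = size * size - n_empty
    cells = [EMPTY] * n_empty + ["A"] * (n_agents // 2) + ["B"] * (n_agents - n_agents // 2)
    rng.shuffle(cells)
    return [cells[r * size:(r + 1) * size] for r in range(size)]

def similar_share(grid, r, c):
    """Share of occupied neighboring cells holding the same group as cell (r, c)."""
    size = len(grid)
    same = total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr, dc) != (0, 0) and 0 <= rr < size and 0 <= cc < size:
                if grid[rr][cc] is not EMPTY:
                    total += 1
                    same += grid[rr][cc] == grid[r][c]
    return same / total if total else 1.0

def occupied(grid):
    size = len(grid)
    return [(r, c) for r in range(size) for c in range(size) if grid[r][c] is not EMPTY]

def mean_similarity(grid):
    cells = occupied(grid)
    return sum(similar_share(grid, r, c) for r, c in cells) / len(cells)

def step(grid, threshold, rng):
    """Move each unhappy agent to a random empty cell; return how many moved."""
    size = len(grid)
    empties = [(r, c) for r in range(size) for c in range(size) if grid[r][c] is EMPTY]
    agents = occupied(grid)
    rng.shuffle(agents)
    moved = 0
    for r, c in agents:
        if similar_share(grid, r, c) < threshold:
            i = rng.randrange(len(empties))
            er, ec = empties[i]
            grid[er][ec], grid[r][c] = grid[r][c], EMPTY
            empties[i] = (r, c)
            moved += 1
    return moved

rng = random.Random(1)
grid = make_grid(size=20, n_empty=40, rng=rng)
before = mean_similarity(grid)
for _ in range(100):  # stop early once every agent is satisfied
    if step(grid, threshold=0.4, rng=rng) == 0:
        break
after = mean_similarity(grid)
print(round(before, 2), round(after, 2))
```

The two printed numbers are the average share of same-group neighbors before and after the simulation. Changing `threshold` shows how sensitive the final pattern is to individual preferences.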
Machine Learning
Machine Learning (ML) is, simply put, a field concerned with building algorithms that learn patterns from data and improve as they are given more of it, rather than following rules written by hand. Using datasets as inputs, machine learning algorithms can perform complex tasks, including prediction, classification, and clustering. One of the main advantages of ML is its ability to handle large volumes of data and identify complex dependencies that are difficult to detect using traditional statistical methods.

ML is now widely applied in business and the technical sciences. In sociology, however, this method is more often used as a tool for explaining patterns that an algorithm has “learned.” This approach is known as interpretable machine learning (IML) or explainable artificial intelligence (XAI).

For example, Wim Bernasco et al. used video surveillance data to study how well residents of Amsterdam, the Netherlands, maintained a social distance of 1.5 meters during the COVID-19 pandemic in 2020–2021. The authors developed an ML algorithm that analyzed video footage, measured distances between pedestrians, and counted social-distance violations during different waves of quarantine.

  • Figure from Bernasco et al. (2021). Mean number of violations of the 1.5-m social distance rule in the studied camera footage (Y-axis) by week, 2020-2021 (X-axis)
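
The final counting step of such a pipeline can be sketched on its own. The example assumes that an upstream model has already detected pedestrians in a video frame and converted their positions to ground coordinates in meters; that detection step is the actual machine learning part and is not shown. The function name and the sample coordinates are invented for illustration.

```python
import math
from itertools import combinations

MIN_DISTANCE_M = 1.5  # the social distance rule, in meters

def count_violations(positions, min_distance=MIN_DISTANCE_M):
    """Count pairs of pedestrians standing closer than `min_distance`."""
    return sum(
        1
        for a, b in combinations(positions, 2)
        if math.dist(a, b) < min_distance
    )

# Hypothetical (x, y) positions in meters for one video frame.
frame = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (4.0, 1.2), (10.0, 10.0)]
violations = count_violations(frame)
print(violations)
```

Averaging such per-frame counts over a week would give a weekly series of the kind plotted in the figure, although the authors' exact aggregation is not described here.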
Large Language Models
Large Language Models (LLMs) are a type of machine learning model that enables computers to understand and generate text. Such models are trained on huge amounts of text (books, articles, web pages, and other sources) and learn to predict which words or phrases are likely to come next in a sentence. One of the most well-known examples of a large language model is ChatGPT, which generates responses based on the context of a user’s query.
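The idea of predicting the next word can be illustrated with a deliberately tiny model. The sketch below counts which word follows which in a made-up corpus and predicts the most frequent continuation. Real LLMs use neural networks trained on vastly more text and take long contexts into account, so this is only an analogy for the training objective, not a description of how ChatGPT works internally.

```python
from collections import Counter, defaultdict

# Made-up miniature corpus; punctuation is treated as a separate token.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."

def train_bigrams(text):
    """For every word, count the words that directly follow it."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

model = train_bigrams(corpus)
print(predict_next(model, "the"))
print(predict_next(model, "sat"))
```

Because the model only knows what it has counted, a word that never appeared in the corpus yields no prediction at all, which hints at why the scale of the training data matters so much.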

When discussing LLMs, the focus is often on their ability to write texts. However, this is only the tip of the iceberg. LLMs help doctors process medical records, lawyers prepare documents, journalists gather facts, and marketers produce creative content.

Sociologists, on the other hand, use LLMs as a kind of “cast” of society, since large models are trained on vast amounts of text available on the Internet. As a result, by analyzing LLMs, researchers can attempt to model the outcomes of public opinion polls, as demonstrated by Argyle et al. (2023).

Another example comes from Park et al. (2023), where the authors combined LLMs with agent-based modeling. Within a cooperative game, they created several agents that could communicate with one another via an LLM. By observing interactions among these agents, the researchers identified processes related to the emergence of social hierarchies and examined how information is distributed within a team.

  • Figure from Park et al. (2023). Example of the game environment in which the simulation took place, with examples of agents interacting with each other

Kazakhstan Sociology Lab is actively developing its work in computational social science and promoting these approaches among Kazakhstani researchers and students. The School’s program includes training courses and modules, master classes, and research projects based on a sociological approach to the analysis of big data, social networks, and digital traces of behavior. We discuss new sources of open data and the modeling of social processes using machine learning, Natural Language Processing (NLP), Social Network Analysis (SNA), and experimental methods. These methods allow us to better understand social processes in Kazakhstan and to propose better-informed decisions for policymakers and society as a whole.

Contacts
+7 775 476 33 77
SociologyLab.kz@gmail.com
Kazakhstan, Astana, Kabanbay Batyr Ave., 11/5, 12th floor
© 2023 Kazakhstan Sociology Lab
Bank details
CORPORATE FUND "EL UMITI FUND"
Legal address:
010000, Republic of Kazakhstan, Astana, Bokeikhan St., 1
Actual address:
010000, Republic of Kazakhstan, Astana, Kabanbay Batyr Ave., 11/5, 12th floor
BIN: 190940020707
Account: KZ60601A871005960391
Bank name: HalykBank JSC, Astana branch
BIC: HSBKKZKX