
Social Science Dashboard Specification: a tool for supporting social scientists in their research by using MUHAI technologies

[M3.1] A specification for supporting social scientists in their research by using MUHAI technologies. (VUA) (M16)

“The ultimate purpose of the social sciences is to furnish causal explanations of classes of observable events, which are, at least in part, generated by individual and collective agency/action.” [1] The aim of a digital assistant for social history research is therefore to support social scientists in constructing such causal explanations for observable events, which we also refer to as theories or hypotheses. Social history researchers can then test these explanations on various aspects of society to see whether a newly found hypothesis holds. Because few structured, easily accessible hypotheses exist in the social domain to learn from, we have focused first on causal narratives in the medical domain, where we have access to a dataset of ~4000 such narratives. The aim is to analyse this dataset and transfer the insights we gain to the social domain, provided that they are transferable between domains. In this specification, we first briefly describe a few ways in which social history researchers discover and test hypotheses, and potential things to look out for when developing a digital research assistant. We then briefly discuss the medical domain and the hypothesis generation technologies we are developing in that area of research.

Biomedical research: 
In the field of biomedicine, the common process for generating a new hypothesis that can be tested in a clinical trial is to intervene in a given biochemical process with a specific treatment. This potential outcome framework has been used for a long time, fueled by new discoveries in the lab, e.g. the discovery of a new protein. As in social science research, the generation of a new hypothesis can include the following steps:

Step 1. Protein-pathway discovery.
Example: ‘The discovery of a new oncogene participating in a cellular pathway’

Step 2. Drug discovery.
Example: ‘The development of a chemical molecule to target the new oncogene.’

Step 3. Clinical trial.
Example: ‘A significant effect on tumor growth was found administering the chemical molecule to patients with liver cancer.’

Step 4. Drug repurposing by analogy.
Example: ‘The chemical molecule treats liver cancer, which resembles kidney cancer. Can the molecule treat kidney cancer too?’

Hence, hypothesis generation can be fueled by a new scientific discovery such as the discovery of a new gene participating in a pathway, as well as by analogy, through already performed trials and their results. 

A digital assistant for biomedical research: 
Scientific discovery in the biomedical domain can greatly benefit from automated hypothesis generation, as finding new and interesting research questions is challenging and requires considerable background knowledge about trials, drugs, conditions and their various causal mechanisms. 

Two main requirements for automated scientific discovery: 

  1. A human should be able to follow the reasoning.
  2. A human should be made aware of the potential biases in the data.

The task is often formulated as a link prediction task, in which a new link is predicted between a disease and an existing treatment, such as insulin treats → diabetes. Several studies argue for the integration of a model with structured background knowledge about known cause and effect relationships within the problem domain, to support both the generation of hypotheses and their explanation.
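As a rough illustration of explainable link prediction over structured background knowledge, the sketch below proposes missing "treats" links in a toy knowledge graph and keeps the paths that support each proposal. The entity names (drug_A, protein_X, disease_1, ...), the relation names and the path pattern drug → protein → disease are all assumptions made for this example; they are not taken from the dataset or from any specific method described here.

```python
from collections import defaultdict

# Toy knowledge graph as (subject, relation, object) triples.
# All entities and relations are illustrative placeholders, not curated data.
TRIPLES = [
    ("drug_A", "targets", "protein_X"),
    ("drug_B", "targets", "protein_Y"),
    ("protein_X", "involved_in", "disease_1"),
    ("protein_X", "involved_in", "disease_2"),
    ("protein_Y", "involved_in", "disease_2"),
    ("drug_A", "treats", "disease_1"),
]


def build_index(triples):
    """Index outgoing edges per node: node -> list of (relation, target)."""
    index = defaultdict(list)
    for subj, rel, obj in triples:
        index[subj].append((rel, obj))
    return dict(index)


def predict_treats(triples):
    """Propose (drug, disease) links that are not yet in the graph.

    Each proposal carries the list of paths that support it, so a human
    can follow the reasoning behind the suggested link.
    """
    index = build_index(triples)
    known = {(s, o) for s, r, o in triples if r == "treats"}
    candidates = defaultdict(list)
    for drug, edges in index.items():
        for rel, protein in edges:
            if rel != "targets":
                continue
            for rel2, disease in index.get(protein, []):
                if rel2 == "involved_in" and (drug, disease) not in known:
                    path = [drug, "targets", protein, "involved_in", disease]
                    candidates[(drug, disease)].append(path)
    # Rank by number of supporting paths, then alphabetically for stability.
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


if __name__ == "__main__":
    for (drug, disease), paths in predict_treats(TRIPLES):
        print(f"{drug} treats -> {disease}? supported by {paths}")
```

Returning the supporting paths together with each candidate link is one simple way to meet the first requirement above, that a human should be able to follow the reasoning.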

Explainable link prediction methods have proved very successful in pointing out new, interesting drug-treatment pairs, specifically in being able to focus the attention of medical practitioners on those hypotheses that are explainable with current knowledge of biochemical processes. While these developments are paramount in producing explainable medical AI, such hypotheses are subject to simplification. Bodily processes are complex in nature, and by reducing hypothesis generation to a single link prediction task, a system risks missing out on interesting hypotheses. For example, adults with diabetes mellitus as well as diabetic ketoacidosis might require a completely different treatment than children without diabetic ketoacidosis. Such a task can be formulated as a graph generation task, where one predicts not only a link between a drug and a disease, but the entirety of the hypothesis: age groups, symptoms, modes of drug delivery, and other attributes. Even though explainable link prediction is a much-researched topic, research into subgraph generation is scarce, and the research that exists focuses on machine-learned methods that are often non-transparent in their reasoning.
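To make the difference between a single link and a full hypothesis more concrete, the sketch below represents a hypothesis as a small subgraph. The Hypothesis class, its fields, the relation names and the placeholder treatment_A are assumptions chosen for illustration; the specification does not prescribe a schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A hypothesis as a small subgraph instead of a single drug-disease link."""
    treatment: str
    condition: str
    age_group: str
    comorbidities: tuple = ()
    delivery_mode: str = "unspecified"

    def to_triples(self, node_id="h1"):
        """Serialise the hypothesis as triples centred on one hypothesis node."""
        triples = [
            (node_id, "has_treatment", self.treatment),
            (node_id, "has_condition", self.condition),
            (node_id, "has_age_group", self.age_group),
            (node_id, "has_delivery_mode", self.delivery_mode),
        ]
        triples += [(node_id, "has_comorbidity", c) for c in self.comorbidities]
        return triples


def as_link(hypothesis):
    """Collapse a hypothesis to the single link used in plain link prediction."""
    return (hypothesis.treatment, "treats", hypothesis.condition)


# Two hypothetical hypotheses that differ only in patient context.
adults = Hypothesis("treatment_A", "diabetes mellitus", "adults",
                    comorbidities=("diabetic ketoacidosis",))
children = Hypothesis("treatment_A", "diabetes mellitus", "children")

if __name__ == "__main__":
    print(as_link(adults) == as_link(children))        # same single link
    print(adults.to_triples() == children.to_triples())  # different subgraphs
```

Under these assumptions, both hypotheses collapse to the same link, while their subgraphs keep the age group and comorbidity apart. That distinction is what a graph generation formulation would need to predict.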

Social science/social history research: 

In the field of social history, the discovery of a causal narrative often arises first and foremost by the generation or discovery of a grand theory. Such grand theories can arise in a multitude of ways: from ‘rocking chair sociology’, to the discovery of certain patterns when zooming in on certain groups in society, be it inequality amongst people in a small town, or social cohesion amongst followers of a certain religion. Theories related to the latter therefore come about in a more fortuitous way. Here it is interesting to note that most finer-grained questions can be divided into three big questions or themes: those related to cohesion, inequality or rationalisation (the effect of technological developments on a society).

When a devised theory is to be tested, or an interesting use case has come to light, the process of constructing causal narratives can be roughly subdivided into three sub-questions and their outputs. Each step ingests the output of the previous step:

Start: either a theory or a use case, e.g., the theory that social cohesion can explain certain outcomes among social groups, or the use case in which the suicide rates and religious beliefs of those living in town X have been recorded.

  1. A descriptive question, e.g.: are there more suicides among Protestants than among Catholics? Intended output: a temporally organised description of related events.
  2. An explanatory question, e.g.: can social cohesion explain the different statistics related to suicide among these different religious groups? Intended output: a causal narrative.

Even though the overarching ‘grand’ questions remain the same, branching questions are prone to grow in a certain direction. Knowledge of social history can therefore reveal only part of the picture.

A digital assistant for social science research: 
We argue that a digital assistant for scientific discovery in the social sciences or social history domain can aid in the data-driven generation of the outputs of points 1 and 2 described in the section above. By ingesting structured data, such as the data available at the International Institute of Social History (IISH), a digital assistant can, first and foremost, discover trends over time (longitudinal) or among groups, and present them to the researcher in question. An example of such a trend is described in point 1 above. Illuminating bias in datasets is an important component here, as bias limits the scope of a hypothesis: for instance, the hypothesis mentioned in the previous section might apply only to people who earn more than the marginal income.
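The descriptive step could, for instance, start from a simple aggregation such as the sketch below, which computes a rate per group and year and reports how much of the recorded population each group contributes. The record layout, the numbers and the function names are hypothetical; they do not come from IISH data and only illustrate the kind of trend and coverage summary a digital assistant might present.

```python
from collections import defaultdict

# Hypothetical records: (year, religious group, population, recorded suicides).
# The values are invented for illustration only.
RECORDS = [
    (1880, "protestant", 1000, 4),
    (1880, "catholic", 1000, 1),
    (1890, "protestant", 1200, 6),
    (1890, "catholic", 800, 1),
]


def rates_per_1000(records):
    """Events per 1000 people for every (year, group), ordered by year and group."""
    totals = defaultdict(lambda: [0, 0])  # (year, group) -> [population, events]
    for year, group, population, events in records:
        totals[(year, group)][0] += population
        totals[(year, group)][1] += events
    return {key: 1000 * ev / pop for key, (pop, ev) in sorted(totals.items())}


def coverage(records):
    """Share of the total recorded population contributed by each group.

    Skewed shares are a first hint that a trend may only hold for the
    part of society that is well represented in the data.
    """
    population = defaultdict(int)
    for _, group, pop, _ in records:
        population[group] += pop
    total = sum(population.values())
    return {group: pop / total for group, pop in population.items()}


if __name__ == "__main__":
    for (year, group), rate in rates_per_1000(RECORDS).items():
        print(year, group, rate)
    print(coverage(RECORDS))
```

Reporting coverage next to the rates is one modest way to make the researcher aware of potential bias before a trend is turned into an explanatory question.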

A digital assistant for hypothesis generation in the social sciences, be it social history or social science research in general, should take note of the following: 

Explainable. Humanities researchers increasingly turn their data into Linked Data [3,4], both to interlink their own data and to link social science data to knowledge from other domains available in the LOD cloud.


References:
[1] Abell, P. (2009). History, case studies, statistics, and causal inference. European Sociological Review. https://doi.org/10.1093/esr/jcn072

Example literature related to a comparative question, as well as a data ecosystem supporting the search for causal narratives:

[2] van den Berg, N., van Dijk, I. K., Mourits, R. J., Slagboom, P. E., Janssens, A. A. P. O., & Mandemakers, K. (2021). Families in comparison: An individual-level comparison of life-course and family reconstructions between population and vital event registers. Population Studies, 75(1), 91–110. https://doi.org/10.1080/00324728.2020.1718186

[3] Hoekstra, R., Meroño-Peñuela, A., Rijpma, A., Zijdeman, R., Ashkpour, A., Dentler, K., Zandhuis, I., & Rietveld, L. (2018). The dataLegend ecosystem for historical statistics. Journal of Web Semantics, 50, 49–61. https://doi.org/10.1016/j.websem.2018.03.001

[4] Zapilko, B., et al. (2016). Applying linked data technologies in the social sciences. KI - Künstliche Intelligenz, 30(2), 159–162.