
Commit bb77ad8

Merge pull request #81 from vtraag/update/causality_intro
Update causality intro
2 parents 86d6d29 + a589944 commit bb77ad8

3 files changed: +13 −36 lines


sections/0_causality/causal_intro/article/intro-causality.qmd

Lines changed: 9 additions & 32 deletions
@@ -56,11 +56,13 @@ Some are historical, examining a *single* history, while others are contemporary
 Some fields already have a long tradition with causal inference, while other fields have paid less attention to causal inference.
 We believe that science studies, regardless of whether that is scientometrics, science of science, science and technology studies, or sociology of science, have paid relatively little attention to questions of causality, with some notable exceptions [e.g., @aagaard_considerations_2017; @glaser_governing_2016].
 
-We here provide an introduction to causal inference for science studies.
+We here provide an introduction to causal inference for science studies, with a particular focus on effects on the impact of Open Science.
 Multiple introductions to structural causal modelling of varying complexity already exist [@rohrer2018; @arif2023; @elwert2013].
 @dong_beyond_2022 introduce matching strategies to information science.
 We believe it is beneficial to introduce causal thinking using familiar examples from science studies, making it easier for researchers in this area to learn about causal approaches.
 We avoid technicalities, so that the core ideas can be understood even with little background in statistics.
+We first introduce the general approach, which we then briefly illustrate in three short case studies.
+In addition, we provide some extensive descriptions of approaching causality in three specific case studies in academic impact (on the effect of [Open Data on citations](../../open_data_citation_advantage.qmd)), in [societal impact](../../social_causality.qmd) and in economic impact (on the effect of [Open Data on Cost Savings](../../open_data_cost_savings.qmd)).
 
 ## The fundamental problem
 
@@ -82,12 +84,10 @@ For instance, non-compliance in experimental settings might present difficulties
 Additionally, scholars might be interested in identifying mediating factors when running experiments, which further complicates identifying causality [@rohrer2022].
 In other words, causal inference presents a continuum of challenges, where experimental settings are typically easiest for identifying causal effects---but certainly no panacea---and observational settings are more challenging---but certainly not impossible.
 
-In this paper we introduce a particular view on causal inference, namely that of structural causal models [@pearl_causality_2009].
+In this Open Science Impact Indicator Handbook we introduce a particular view on causal inference, namely that of structural causal models [@pearl_causality_2009].
 This is a relatively straightforward approach to causal inference with a clear visual representation of causality.
 It should allow researchers to reason and discuss about their causal thinking more easily.
-In the next section, we explain structural causal models in more detail.
-We then cover some case studies based on simulated data to illustrate how causal estimates can be obtained in practice.
-We close with a broader discussion on causality.
+We explain structural causal models in more detail in the next section.
 
 # Causal inference - a brief introduction {#sec-causal-inference}
 
@@ -614,6 +614,9 @@ The example highlights that relatively simple DAGs are often sufficient to uncov
 For instance, if we had not measured *Field*, controlling for it and identifying the causal effect would become impossible.
 In that case, it is irrelevant whether there are any other confounding effects between *Citations* and *Open data*, since those effects do not alleviate the problem of being unable to control for *Field*.
 
+The discussion here focuses specifically on illustrating the general principles.
+In the case study on the effect of [Open Data on citations](../../open_data_citation_advantage.qmd) we examine this in greater detail and with a higher degree of realism.
+
 ## The effect of Open data on Reproducibility {#sec-open-data-on-repro}
 
 Suppose we are interested in the causal effect of *Open data* on *Reproducibility*.
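The role of a confounder such as *Field* in the hunk above can be illustrated with a short simulation. This is an editorial sketch, not code from the commit or the repository: the variable names (`field`, `open_data`, `citations`) and all effect sizes are invented for illustration. A field that both boosts citations and makes data sharing more likely biases the naive comparison; stratifying on the field recovers the assumed effect.

```python
# Hypothetical illustration (not from the repository): "field" drives both
# data sharing and citations. Adjusting for it recovers the true effect;
# omitting it biases the naive estimate upward.
import random

random.seed(42)

TRUE_EFFECT = 2.0  # assumed causal effect of open data on citations

rows = []
for _ in range(50_000):
    field = random.choice([0, 1])  # 0 = low-citation field, 1 = high-citation field
    open_data = 1 if random.random() < (0.2 + 0.5 * field) else 0
    citations = 5.0 * field + TRUE_EFFECT * open_data + random.gauss(0, 1)
    rows.append((field, open_data, citations))

def mean_diff(data):
    """Mean citations of open-data papers minus closed-data papers."""
    shared = [c for _, d, c in data if d == 1]
    closed = [c for _, d, c in data if d == 0]
    return sum(shared) / len(shared) - sum(closed) / len(closed)

naive = mean_diff(rows)  # confounded: mixes the field effect into the comparison
adjusted = sum(mean_diff([r for r in rows if r[0] == f]) for f in (0, 1)) / 2

print(f"naive estimate:    {naive:.2f}")     # noticeably larger than the true effect
print(f"adjusted estimate: {adjusted:.2f}")  # close to TRUE_EFFECT
```

If *Field* were unmeasured, the adjusted comparison would be unavailable, which is exactly the problem the text describes.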
@@ -800,7 +803,7 @@ Taking measurement seriously can expose additional challenges that need to be ad
 
 The study of science is a broad field with a variety of methods.
 Academics have employed a range of perspectives to understand science's inner workings, driven by the field's diversity in researchers' disciplinary backgrounds [@sugimoto2011; @liu2023].
-In this paper we highlight why causal thinking is important for the study of science, in particular for quantitative approaches.
+In this chapter we highlight why causal thinking is important for the study of science, in particular for quantitative approaches.
 In doing so, we do not mean to suggest that we always need to estimate causal effects.
 Descriptive research is valuable in itself, providing context for uncharted phenomena.
 Likewise, studies that predict certain outcomes are very useful.
@@ -877,32 +880,6 @@ For example, when developing an interview guide to study a particular phenomenon
 Furthermore, even if qualitative data cannot easily quantify the precise strength of a causal relationship, it may corroborate the structure of a causal model.
 Ultimately, combining quantitative causal identification strategies with direct qualitative insights on mechanisms can lead to more comprehensive evidence [@munafò2018; @tashakkori2021], strengthening and validating our collective understanding of science.
 
-# Acknowledgements {.unnumbered}
-
-We thank Ludo Waltman, Tony Ross-Hellauer, Jesper W. Schneider and Nicki Lisa Cole for valuable feedback on an earlier version of the manuscript.
-TK used GPT-4 and Claude v2.1 to assist in language editing during the final revision stage.
-
-# Author contributions {.unnumbered}
-
-Thomas Klebel: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing - original draft, and Writing - review & editing.
-Vincent Traag: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing - original draft, and Writing - review & editing.
-
-# Competing interests {.unnumbered}
-
-The authors have no competing interests.
-
-# Funding information {.unnumbered}
-
-The authors received funding from the European Union’s Horizon Europe framework programme under grant agreement Nos. 101058728 and 101094817.
-Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Executive Agency.
-Neither the European Union nor the European Research Executive Agency can be held responsible for them.
-The Know-Center is funded within COMET—Competence Centers for Excellent Technologies—under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria.
-COMET is managed by the Austrian Research Promotion Agency FFG.
-
-# Data and code availability {.unnumbered}
-
-All data and code, as well as a reproducible version of the manuscript, are available at [@klebel_code].
-
 # Theoretical effect of Rigour on Reproducibility {#sec-appendix-rigour-on-reproducibility .appendix}
 
 There is a direct effect of *Rigour* on *Reproducibility* and a indirect effect, mediated by *Open data*.

sections/0_causality/open_data_citation_advantage.qmd

Lines changed: 2 additions & 4 deletions
 # The effect of Open Data on Citations {#open-data-citation-advantage .unnumbered}
 
 ::: {.callout collapse="true"}
-
 ## History
 
 | Version | Revision date | Revision | Author |
 |---------|---------------|-------------|------------|
 | 1.1 | 2024-11-27 | Revisions | V.A. Traag |
 | 1.0 | 2024-11-13 | First draft | V.A. Traag |
-
 :::
 
-We here provide some idea of what it would take to try to infer the causal effect of one specific Open Science practice on citation impact. In particular, we consider the effect of Open Data on citation impact. That is, papers that share their data might be more likely to be cited. This is something that has been called the "Open Data Citation Advantage", and in the PathOS scoping review of the academic impact of Open Science [@klebel_academic_2024], evidence was found for a small positive effect of sharing data.
+We here provide some idea of what it would take to try to infer the causal effect of one specific Open Science practice on citation impact. In particular, we consider the effect of [Open Data](../1_open_science/prevalence_open_fair_data_practices.qmd) on [citation impact](../2_academic_impact/citation_impact.qmd). That is, papers that share their data might be more likely to be cited. This is something that has been called the "Open Data Citation Advantage", and in the PathOS scoping review of the academic impact of Open Science [@klebel_academic_2024], evidence was found for a small positive effect of sharing data.
 
 Inferring the causal effect of open data on citation impact is not straightforward and cannot easily be done in an experimental setting. Although an experiment study could in principle be done, it would require researchers to participate and follow the experimental, randomised "treatment" of sharing data or not, which will be challenging, especially where more and more data policies mandate that data should be shared. This means that, barring such experiments, we have to rely on observational studies of citations to publications that have (not) shared their data. Note that we here only focus on whether data was shared or not, not whether the data is FAIR or not, or the extent to which it is FAIR, although that might be a relevant confounder to consider.
 
@@ -35,7 +33,7 @@ We will try to produce a relevant structural causal model by going through the f
 1. Consider causal factors that affect or are affected by X or Y
 2. Consider effects between the identified factors.
 
-Let us start by considering factors that have a causal effect on the number of citations to a paper. As suggested above, there are many factors that correlate with citations [@onodera2015]. The scientific field and the year of publications are two very clear causal factors. One other relevant aspect is obviously something like the quality or relevance of the research: higher quality or research that is more relevant to more researchers, will be more likely to be cited. Unfortunately, such a quality or relevance is not directly observable. Where something is published, i.e. which journal, is likely to have a causal effect on the citations [@traag2021]. In addition, there are most likely some reputational effects of the author and the institution [@way2019]. Finally, (international) collaboration might be likely to have some effect on citations as well, potentially mediated by network effects.
+Let us start by considering factors that have a causal effect on the number of citations to a paper. As suggested above, there are many factors that correlate with citations [@onodera2015]. The scientific field and the year of publications are two very clear causal factors, and are usually also considered when [normalising citations](../2_academic_impact/citation_impact.qmd#avg.-total-normalised-citations-mncs-tncs). One other relevant aspect is obviously something like the quality or relevance of the research: higher quality or research that is more relevant to more researchers, will be more likely to be cited. Unfortunately, such a quality or relevance is not directly observable. Where something is published, i.e. which journal, is likely to have a causal effect on the citations [@traag2021]. In addition, there are most likely some reputational effects of the author and the institution [@way2019]. Finally, (international) collaboration might be likely to have some effect on citations as well, potentially mediated by network effects.
 
 Let us then consider factors that have a causal effect on the sharing of open data. One clearly relevant factor is the open data policy of the journal where the publication is published: if a journal has a clear open data policy that requires authors to make data available (e.g. [PLOS’ Data Policy](https://journals.plos.org/plosone/s/data-availability)), publications in that journal might be more likely to be make their data available. Similarly, if research is funded by a funder that has a clear open data policy (e.g. [Wellcome Trust’s Data Policy](https://wellcome.org/grant-funding/guidance/policies-grant-conditions/data-software-materials-management-and-sharing-policy)), the data might be more likely to be made available. Funding might also make it more likely that authors make their data openly available due to an increase in resources (e.g. data support). Similarly, institutional resources (e.g. data support or training) might help make data open. Some fields may have an academic culture in which scholars are more accustomed to making their data openly available. In addition, some research approaches in a field might be more likely to make their data available than others (e.g. it might be easier to share anonymised quantitative data from surveys as opposed to thick interview data). Lastly, open data has increasingly become a standard, meaning that more recent publications might be more likely to share their data.

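The two-step recipe in the hunk above can be sketched in code. This is a hypothetical editorial illustration, not code from the repository: the node names are simplified stand-ins for the factors the text discusses, and only a subset of them is included. It encodes the factors as a small DAG and lists the common causes of *Open data* and *Citations*, i.e. the confounders one would need to control for.

```python
# Hypothetical sketch: encode (a subset of) the factors discussed above as a
# DAG, then list confounders of OpenData -> Citations, i.e. common causes
# that also affect Citations other than through OpenData itself.
edges = {
    "Field":    ["OpenData", "Citations"],  # field norms affect sharing and citation rates
    "Year":     ["OpenData", "Citations"],  # sharing became more common; citations accumulate
    "Quality":  ["Citations"],              # not directly observable
    "Journal":  ["OpenData", "Citations"],  # journal data policy and journal visibility
    "Funder":   ["OpenData"],               # funder data policies
    "OpenData": ["Citations"],              # the effect of interest
}

def ancestors(node, graph):
    """All nodes with a directed path into `node`."""
    parents = {p for p, children in graph.items() if node in children}
    found = set(parents)
    for p in parents:
        found |= ancestors(p, graph)
    return found

# Drop OpenData's outgoing edges so paths through the treatment don't count:
# a cause of Citations *only via* OpenData (here, Funder) is not a confounder.
graph_without_treatment = {k: v for k, v in edges.items() if k != "OpenData"}
confounders = ancestors("OpenData", edges) & ancestors("Citations", graph_without_treatment)
print(sorted(confounders))
```

Under these assumed edges, *Field*, *Journal* and *Year* come out as confounders, while the funder's policy affects citations only through data sharing, which is why it is not one.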
sections/2_academic_impact/citation_impact.qmd

Lines changed: 2 additions & 0 deletions
 
 The citation impact of publications reflects the degree to which they have been taken up by other researchers in their publications. There are long-standing discussions about the interpretation of citations, where two theories can be discerned [@bellis2009]: a normative theory, proposing citations reflect acknowledgements of previous work [@merton1973]; and a constructivist theory, proposing citations are used as tools for argumentation [@latour1988]. Overall, citation impact seems to be most closely related to the relevance of the work for the academic community and should be distinguished from other considerations of scientific quality, where the relationship is less clear [@aksnes2019].
 
+Although it is out of scope to provide suggestions for causal inference of all possible Open Science aspects on citation, we discuss one case on the effect of [Open Data on citations](../0_causality/open_data_citation_advantage.qmd).
+
 ## Metrics
 
 Citations are affected by two major factors, that we expect to be irrelevant for considerations of impact: the field of research, and the year of publication[^pub-year]. That is, some fields, such as Cell Biology, are much more citation intensive than other fields, such as Mathematics. Additionally, publications that were published in 2010 have had more time to accumulate citations than publications published in 2020. Controlling for these factors[^normalisation-factors] is resulting in what are often called “normalised” citation indicators [@waltman2019]. Although such normalised citation indicators are more comparable across time and field, they are sometimes also more opaque. For that reason, we explain both normalised metrics and “raw”, non-normalised, citation metrics.
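The field- and year-normalisation described in the hunk above can be sketched as follows. This is an illustrative example with made-up numbers, not an implementation from the handbook: each paper's citation count is divided by the average citations of papers in the same field and publication year, so a score of 1 means "average for its field and year".

```python
# Illustrative sketch (assumed data): field- and year-normalised citation
# scores, dividing each paper's citations by the mean citations of papers
# published in the same field and year.
from collections import defaultdict

# (field, year, citations) for a handful of hypothetical papers
papers = [
    ("Cell Biology", 2010, 40),
    ("Cell Biology", 2010, 20),
    ("Mathematics",  2010, 4),
    ("Mathematics",  2010, 2),
]

totals = defaultdict(lambda: [0, 0])  # (field, year) -> [citation sum, paper count]
for field, year, cites in papers:
    totals[(field, year)][0] += cites
    totals[(field, year)][1] += 1

normalised = [
    cites / (totals[(field, year)][0] / totals[(field, year)][1])
    for field, year, cites in papers
]
print(normalised)  # scores relative to each paper's own field-year average
```

Note how the Mathematics paper with 4 citations gets the same normalised score as the Cell Biology paper with 40, since each is equally far above its own field-year average.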
