diff --git a/software_project_management/fair/01_understanding_fair.md b/software_project_management/fair/01_understanding_fair.md new file mode 100644 index 00000000..49be7886 --- /dev/null +++ b/software_project_management/fair/01_understanding_fair.md @@ -0,0 +1,397 @@ +--- +name: Understanding FAIR +dependsOn: [] +tags: [] +learningOutcomes: + - Understand the 4 foundational principles of FAIR + - Be able to identify actionable items to improve the FAIRness of a project +attribution: + - citation: The FAIR Guiding Principles for scientific data management and stewardship + url: https://doi.org/10.1038/sdata.2016.18 + image: https://media.springernature.com/full/nature-cms/uploads/product/sdata/header-87021870c315c48063927b82055c12bc.svg + license: CC-BY-4.0 + - citation: GO-FAIR + url: https://www.go-fair.org/fair-principles/ + image: https://www.go-fair.org/wp-content/themes/go-fair/images/logo.svg + license: CC-BY-4.0 + - citation: FAIRsharing + url: https://fairsharing.org + image: https://fairsharing.org/assets/fairsharing-logo.svg + license: CC-BY-4.0 + - citation: OpenAIRE + url: https://www.openaire.eu/how-to-make-your-data-fair + image: https://www.openaire.eu/templates/yootheme/cache/19/Logo_Horizontal-1910c000.webp + license: CC-BY-4.0 + - citation: FAIR toolkit + url: https://fairtoolkit.pistoiaalliance.org + image: https://fairtoolkit.pistoiaalliance.org/wp-content/uploads/2020/05/FAIR-Toolkit-logo.png + license: CC-BY-4.0 + - citation: How to FAIR a Danish website to guide researchers on making research data more FAIR + url: https://doi.org/10.5281/zenodo.3712065 + image: https://www.howtofair.dk/media/e5md4otg/htfair_logo.svg + license: CC-BY-4.0 +--- +## Introduction + +This module aims to give you an overview of the FAIR principles for scientific +data. We will start with some scenarios that researchers frequently face, then +look at the motivation behind the FAIR principles and their implications for your research. 
By +the end of this module, you will be able to explain the FAIR principles in your +own words and identify any existing FAIRness gaps in your data with some +recommendations to improve it. + +## Common Scenarios + +As a researcher, the following scenarios may be familiar to you: + +- You want to replicate some analyses by other researchers but the data is + either lost or incomplete. +- You are not sure whether you can reuse a dataset, or how to access it, even + though it is available. +- There is no documentation or metadata associated with the dataset which you + are going to reuse. +- You do not know how other researchers obtained or generated the data + (provenance). +- You try to explore datasets similar to your research but you have no clue + about how to locate relevant datasets. +- You have your paper published but are not sure where or how to put the + associated data for others to verify or reuse. + +The FAIR principles aim to alleviate some of the pain arising from the above +scenarios. + +## FAIR Guiding Principles + +FAIR is composed of 4 foundational principles: + +1. To be Findable +2. To be Accessible +3. To be Interoperable +4. To be Reusable + +Modern scientific research often involves handling data that far exceed the +processing capacity of humans in terms of their volume, throughput and +complexity. As a result, these principles emphasise machine-actionability by +requiring explicit structures, i.e. not only should humans be able to find and +interpret the (meta)data, but computational agents should also be able to +autonomously discover and act upon them, so that vast amounts of data can be +processed continuously and reproducibly. + +### To be Findable + +This is to make your data discoverable by both humans and machines. 
+ +> - (meta)data are assigned a globally unique and persistent identifier +> - data are described with rich metadata +> - metadata clearly and explicitly include the identifier of the data it +> describes +> - (meta)data are registered or indexed in a searchable resource + +#### What Findable means + +It is important to assign the data a *globally unique* and *persistent* +identifier: 'globally unique' means this identifier is associated with this +data only; 'persistent' means it remains valid indefinitely. This makes it +resolvable, so computational agents can use this identifier to automatically +retrieve the data. Persistent identifiers are one of the core parts of FAIR and +will be discussed in an upcoming module. + +There is no universal standard that defines when metadata have attained a +certain level of richness, because what counts as *rich* varies by scientific +domain. The consensus is that the metadata should provide enough information +for both humans and, critically, computational agents to determine the data's +usage restrictions, quality, collection and analysis methodology, provenance, +and other information specific to a domain (ideally defined by a standard). + +To promote your data, it is important to make your (meta)data discoverable by +having them 'registered or indexed in a searchable resource'. This means +(meta)data should be deposited in systems that enable discovery, which could be +a general repository, a domain-specific database, an institutional server or a +registry. They should be searchable by both humans and computational agents, +which usually means they offer programmatic access and protocols. We will look +at this in detail in an upcoming module. + +#### Be more Findable + +- Assign persistent identifiers to all of your published (meta)data. +- Deposit data in a reputable repository rather than on a personal website or + server. 
+- Use metadata templates from a data repository or a domain-specific standard + to maximise the information you can provide. + +#### Check your Understanding about Findable + +::::challenge{id=findable-q1 title="Findable Q1"} +Who is the Findable principle designed to help discover data, and why does this +matter? +:::: + +:::solution +The Findable principle (in fact all the FAIR principles) aims to help both +humans and computational agents to discover data so we can leverage the agents' +computational power to autonomously process the data and obtain reproducible +results. +::: + +::::challenge{id=findable-q2 title="Findable Q2"} +Why is there no universal standard about the richness of metadata? +:::: + +:::solution +How much metadata you should provide depends on your scientific domain, but you +should aim to provide as much metadata as possible. +::: + +### To be Accessible + +This is to provide clear instructions on how to access the data, even if +access is restricted. + +> - (meta)data are retrievable by their identifier using a standardized +> communications protocol +> - the protocol is open, free, and universally implementable +> - the protocol allows for an authentication and authorization procedure, +> where necessary +> - metadata are accessible, even when the data are no longer available + +#### What Accessible means + +This principle focuses on how humans or computational agents can access the +(meta)data. Given a persistent identifier, a communication protocol should be +available to retrieve the (meta)data. The *protocol* should have a *freely +available, open* specification and widespread adoption, and should be +documented well enough that anyone with the necessary technical knowledge can +implement it. + +:::callout{variant="note"} +As researchers, we seldom need to implement the authentication/authorisation +protocol itself but it is important to select one that adheres to the +'Accessible' principle. 
+::: + +Necessary but minimal authentication/authorisation procedures can be included +for data such as proprietary or private information. Those procedures should +allow computational agents to act on them automatically: extract the +requirements for accessing the data from the metadata, and authorise the human +researchers if valid credentials are present, or alert the human researchers if +a more manual authentication/authorisation process is required (such as +approval from a committee/company). + +:::callout{variant="note"} +Data that are not 'freely available' or 'open' can be 'accessible' if the +procedures to retrieve them are clearly documented and understood by both +humans and computational agents, so sensitive or private data can still be +accessible by the definition in FAIR. +::: + +It is inevitable that some scientific data will no longer be available over +time owing to the ongoing storage and maintenance costs of infrastructure or +permission changes mandated by legal/commercial requirements. Metadata, on the +other hand, are generally much easier to store and maintain as they are smaller +in size and, unlike the data themselves, generally do not contain sensitive +information. Access to metadata should still be maintained even if the actual +data are gone, as metadata are still valuable for future researchers to track +down what research has been done and how, which is particularly important for +literature review and provenance purposes. + +#### Be more Accessible + +What you can do varies a lot with the nature of your data and below are +some recommendations that you can explore to make your data more 'accessible': + +##### Open Data + +- Make (meta)data retrievable via HTTP/HTTPS. +- Use repositories that implement a public API for computational agents to + query and download (meta)data. +- Use reputable repositories that guarantee long-term availability. 
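As a minimal sketch of what 'retrievable via HTTP/HTTPS' can look like for a computational agent, the Python below builds a request that asks the public DOI resolver for machine-readable metadata rather than a web page. It assumes the DOI's registration agency supports content negotiation for the CSL JSON media type; the function names are illustrative and not part of any repository's API.

```python
import json
import urllib.request

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver
# Media type requested through content negotiation (assumed to be supported
# by the registration agency behind the DOI).
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) an HTTPS request for machine-readable metadata."""
    return urllib.request.Request(DOI_RESOLVER + doi, headers={"Accept": CSL_JSON})


def fetch_metadata(doi: str) -> dict:
    """Send the request and parse the JSON response (needs network access)."""
    with urllib.request.urlopen(build_metadata_request(doi), timeout=30) as response:
        return json.load(response)
```

With network access, and if the assumption above holds, `fetch_metadata("10.1038/sdata.2016.18")` would return a dictionary of metadata for the FAIR principles paper cited in this module.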
+ +##### Controlled-Access Data + +- Include public metadata about what data exist, who collected them, and how to + request access etc. +- For sensitive data, include the access requirements such as fees, licence, + institutional/committee/company approval, and information about restricted + usage etc. +- For embargoed data, include the embargo period and use repositories that + automatically update the access permissions once the embargo ends. + +#### Check your Understanding about Accessible + +::::challenge{id=accessible-q1 title="Accessible Q1"} +Is there any conflict between open and freely available data and data that is +'Accessible' in the context of FAIR? +:::: + +:::solution +No. Data that are not open and freely available can still be considered +'Accessible' in FAIR: as long as the procedures to access the data are clearly +documented and both humans and computational agents can act on them, the data +are 'Accessible'. +::: + +::::challenge{id=accessible-q2 title="Accessible Q2"} +Why should metadata remain accessible even when the actual data is no longer +available? +:::: + +:::solution +Metadata provide valuable information such as who conducted the research, the +methodologies and the analysis pipeline used, and these can be a tremendous +help for future researchers. +::: + +### To be Interoperable + +This is to enable data to work with other data. + +> - (meta)data use a formal, accessible, shared, and broadly applicable +> language for knowledge representation. +> - (meta)data use vocabularies that follow FAIR principles +> - (meta)data include qualified references to other (meta)data + +#### What Interoperable means + +A computational agent does not naturally possess a semantic understanding of +(meta)data as humans do, so it is not able to infer the underlying meaning of +the (meta)data to decide what appropriate analyses or actions to perform. If +(meta)data from different sources can work together automatically, we can +maximise the benefits provided by computational agents. 
+ +It is important to use *formal* languages which have clearly defined structures +and interpretable rules so a computational agent can leverage them for +autonomous data processing, in contrast to free-text descriptions. +'*Accessible*' in this context ensures the specification of such languages is +publicly available and easily retrievable. The specification should also be +widely adopted by the scientific community (*shared*) and be general (*broadly +applicable*), in contrast to a specification that is only used by a small +number of research groups in a niche area. + +The vocabularies in the language specification should themselves be FAIR: +individual terms should have persistent identifiers and rich metadata that +describe them; the vocabularies should be accessible via a +standardised communication protocol; the relationships among vocabularies +should be clearly documented. + +In modern scientific research, (meta)data seldom exist in isolation and it is +important to establish meaningful connections (*qualified references*) among +them. A standardised way is needed to describe relationships such as +'dataset A is a subset of dataset B' or 'result C from experiment D validates +hypothesis E'. The qualified references can be used by a computational agent to +enable automatic discovery of related resources and trace back to the original +source, which is useful for literature reviews and provenance purposes. + +#### Be more Interoperable + +- Use standard file formats such as JSON, CSV or specialised formats widely + used in a particular scientific domain to facilitate data exchange. +- Replace free-text terms with controlled vocabularies for (meta)data. +- Document relationships among data extensively using an appropriate language + specification. 
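The last two recommendations can be sketched in a few lines of Python: relations between resources are only accepted if they come from a controlled vocabulary, which lets a computational agent follow them reliably. The relation names and identifiers below are hypothetical placeholders chosen for illustration; a real project would take its terms from a published community standard.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary of relation types (illustration only).
RELATION_TYPES = frozenset({"isPartOf", "isDerivedFrom", "validates"})


@dataclass(frozen=True)
class QualifiedReference:
    """A typed link between two identified resources."""

    source: str    # identifier of the resource making the statement
    relation: str  # term taken from the controlled vocabulary
    target: str    # identifier of the resource referred to

    def __post_init__(self) -> None:
        # Reject free-text relations that an agent could not interpret.
        if self.relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.relation!r}")


def related(references, source, relation):
    """Return the targets linked from `source` by the given relation."""
    return [r.target for r in references
            if r.source == source and r.relation == relation]


# Placeholder identifiers, not real records.
references = [
    QualifiedReference("dataset-A", "isPartOf", "dataset-B"),
    QualifiedReference("result-C", "validates", "hypothesis-E"),
]
print(related(references, "dataset-A", "isPartOf"))  # ['dataset-B']
```

A free-text relation such as `"is kind of part of"` raises a `ValueError` here, which mirrors why controlled terms are preferred over free text.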
+ +#### Check your Understanding about Interoperable + +::::challenge{id=interoperable-q1 title="Interoperable Q1"} +Why is it important to use controlled vocabularies rather than free text to +describe your data? +:::: + +:::solution +Free text can vary among researchers while controlled vocabularies use +standardised terms with predefined meanings, which allow computational agents +to process and integrate data from different sources. +::: + +::::challenge{id=interoperable-q2 title="Interoperable Q2"} +What are 'qualified references' and why are they important? +:::: + +:::solution +Qualified references are well-defined connections among (meta)data that +describe the nature of their relationship, and they are essential for +computational agents to automatically discover related resources and establish +provenance. +::: + +### To be Reusable + +This is to provide comprehensive instructions on how to properly reuse the +data. + +> - meta(data) are richly described with a plurality of accurate and relevant +> attributes +> - (meta)data are released with a clear and accessible data usage license +> - (meta)data are associated with detailed provenance +> - (meta)data meet domain-relevant community standards + +#### What Reusable means + +All the previous principles lay the foundation for data reuse by both humans +and computational agents. If future researchers want to reuse the data, it is +helpful for them to be able to retrieve useful information and guidance from +the (meta)data. It is important to provide as much information as possible so +the data can be used in a different context with minimal effort. + +A licence should be attached to the (meta)data so that both humans and +computational agents can readily determine whether they can use the data and +what restrictions apply. 
Provenance should be extensively documented to +provide information about the data sources and the methodologies that generated +the data, so it is possible to assess the quality and fitness-for-purpose when +the data is expected to be applied in a different context. + +A specific domain may already have adopted certain frameworks or agreed on +some conventions, so it is important to follow established domain-specific +practices so those familiar with the domain can quickly understand and assess +your data for its suitability. + +#### Be more Reusable + +- Be generous in providing information related to the data, as something that + looks unnecessary to you may be paramount to others, for example: + - descriptive metadata such as what data was collected, its rationale and + any access restrictions, + - structural metadata such as the relationships among the data, + - technical metadata such as the file format (e.g. the version of a + standardised file format). +- Choose a well-established licence with a standardised identifier. +- Include basic provenance information such as the researchers (e.g. by + persistent identifiers) and funding sources (e.g. by grant numbers). +- Use a widely-adopted standard in your domain. + +#### Check your Understanding about Reusable + +::::challenge{id=reusable-q1 title="Reusable Q1"} +Why is it important to include metadata that may seem unnecessary to you as the +data creator? +:::: + +:::solution +You can never know how future researchers will want to use your data, so it is +essential to include as much metadata as possible, even if it seems +unnecessary to you, to maximise the reusability of your data. +::: + +::::challenge{id=reusable-q2 title="Reusable Q2"} +Why is provenance important for data reusability? +:::: + +:::solution +Provenance is the information about the origin and history of the data, and it +allows future researchers to assess whether the data is fit for their specific +purpose and evaluate its quality. 
+::: + +## Conclusion + +We have looked at the 4 foundational principles in FAIR (Findable, Accessible, +Interoperable and Reusable) and elaborated on their underlying meaning. Some +actionable items were included for each of the principles as well. The +following modules look into some specific areas mentioned in this +overview. + +The FAIR Guiding Principles are intentionally domain-independent and very high +level, and they do not prescribe any particular implementation (such as +technology, software or standard). The principles themselves are not a standard +or specification; they are guidelines that help researchers to better promote +their digital research artefacts. The FAIRness of your data exists on a +continuous spectrum rather than as a binary property and you are encouraged to +incrementally make your data more FAIR. diff --git a/software_project_management/fair/02_data_metadata_nondata.md b/software_project_management/fair/02_data_metadata_nondata.md new file mode 100644 index 00000000..486e5c3b --- /dev/null +++ b/software_project_management/fair/02_data_metadata_nondata.md @@ -0,0 +1,261 @@ +--- +name: Data, metadata and non-data digital assets +dependsOn: ["software_project_management.fair.01_understanding_fair"] +tags: [] +learningOutcomes: + - "Understand 3 main types of digital assets: data, metadata and non-data digital assets" + - Explain what exactly are data, metadata and non-data digital assets +attribution: + - citation: eCFR 200.315 Intangible property + url: https://www.ecfr.gov/current/title-2/subtitle-A/chapter-II/part-200/subpart-D/subject-group-ECFR8feb98c2e3e5ad2/section-200.315 + image: https://huntersquery.byu.edu/wp-content/uploads/2021/03/Screen-Shot-2021-03-11-at-5.15.54-PM.png + - citation: What do we mean by “data”? 
A proposed classification of data types in the arts and humanities + url: https://doi.org/10.1108/JD-07-2022-0146 + image: https://emer.silverchair-cdn.com/data/SiteBuilderAssets/Live/Images/jd/Journal_of_Documentation-1722323768.svg + license: CC-BY-4.0 + - citation: Horizon Europe Programme Guide + url: https://ec.europa.eu/info/funding-tenders/opportunities/docs/2021-2027/horizon/guidance/programme-guide_horizon_en.pdf + image: https://ec.europa.eu/info/funding-tenders/opportunities/portal/assets/ecl/ec/logo/positive/logo-ec--en.svg + license: CC-BY-4.0 +--- +## Introduction + +The FAIR principles are applicable to all research digital assets, and in this +module we are going to look at what constitutes digital assets and what we +can do to achieve FAIRness for them. For most scientific research, +digital assets can be roughly divided into 3 categories: data, metadata and +non-data digital assets. + +## Data + +If we define scientific research as the process of knowledge creation, it is +vital to document everything that leads to it. With this in mind, we can +employ the following definitions of data: + +> 'recorded factual material ... as necessary to validate research' (200.315 +> Intangible property, Code of Federal Regulations, US Federal Government) + +> everything that is 'linked to the workflow of knowledge creation' (Gualandi +> B, Pareschi L, Peroni S (2023)) + +What counts as data varies a lot across different scientific domains and often +depends on context. For instance, a field geologist may treat a rock itself as +data while a geochemist only considers the chemical composition of the rock as +data. It is thus crucial for researchers to identify precisely what entities +should be considered as data in their project. + +::::challenge{id=data-q1 title="Data Q1"} +You are starting a new research project studying bird migration patterns. What +might count as 'data' in the project? 
+:::: + +:::solution +- Date and time of sighting +- Location records (e.g. GPS information from tracking devices) +- Species identification +- Photos at specific locations +- Biological measurements (e.g. weights, age) + +It is important to remember that what counts as data depends on the context. +For example, if you are studying how weather affects the migration route, the +weather conditions (e.g. temperature, wind speed) should be considered as data, +but they could just be metadata if this is not the research focus. +::: + +### Data Management Plan + +A lot of funding agencies require the submission of a document called a data +management plan (DMP) before the grant application process or during the +research project. We will employ this definition for DMP: + +> '... formal documents that outline from the start of the project all aspects +> of the research data lifecycle, which includes its organisation and curation, +> and adequate provisions for its access, preservation, sharing, and eventual +> deletion, both during and after a project.' (Horizon Europe Programme Guide, +> p. 48) + +A DMP considers aspects such as what data should be collected or generated +in different phases of the research project. Even if a DMP is not mandatory in +the grant application process, you are strongly encouraged to create one, as +having even an informal plan about research data management will help you to +identify potential problems earlier. + +There are tools that help with the creation of DMPs: + +- [DMPonline](https://dmponline.dcc.ac.uk) (Digital Curation Centre, UK) +- [DMP Tool](https://dmptool.org) (California Digital Library, US) + +This module will not go into detail about the creation of DMPs but you can +browse some of the public plans available ([from DMPonline](https://dmponline.dcc.ac.uk/public_plans) +or [from DMP Tool](https://dmptool.org/public_plans)) to get an idea about what +information a DMP should include. 
Some core questions that you should +consider are: + +- what data will be collected/generated? +- what is the data collection/generation methodology? +- what are their formats? +- how do you organise the data (e.g. file naming, storage)? +- how will you share your data? +- is there any restriction about the sharing of data? +- how will you measure the quality of the data? +- how will you preserve the data? +- for how long should the data be preserved? +- who will be responsible? +- what metadata will be included? + +Thinking through the above questions will help you to anticipate potential +issues, clarify the whole research project and plan ahead so you can adhere to +the FAIR principles in a systematic way. + +::::challenge{id=data-q2 title="Data Q2"} +Why would it be a good idea to voluntarily create a DMP even if no one mandates +you to do so? +:::: + +:::solution +Even an informal and lightweight DMP is valuable; it helps you: + +- to anticipate problems early regarding data collection +- to clarify the research project by systematically working out the entire + workflow +- to plan for FAIR compliance early +::: + +## Metadata + +Metadata is data about data and it forms a core part of the FAIR principles as +it is metadata that can make data machine-actionable. Without sufficient and +accurate metadata, computational agents cannot understand the data +characteristics correctly and make use of them automatically. + +The line between data and metadata is not clear and again depends on context. +For instance, the energy used to acquire an image in electron microscopy is +metadata if the research focuses on certain features present in the image; +however, if the research is about how the signal-to-noise ratio varies with the +energy used, then the energy itself becomes the data. + +### Dublin Core + +Dublin Core is a simple and universal collection of vocabularies for describing +digital assets, i.e. metadata. 
The wide adoption of Dublin Core is important +for 'To be Interoperable' in the FAIR principles. + +There are 15 core elements in Dublin Core and they are very generic to allow +description of data in a wide range of scientific domains. It is usually a good +starting point to include them as metadata: + +- title +- subject +- description +- type +- source +- relation +- coverage +- creator +- publisher +- contributor +- rights +- date +- format +- identifier +- language + +We recommend looking at [the specification](https://www.dublincore.org/specifications/dublin-core/dces/) +for the above fields. All of the above fields are optional and repeatable, e.g. +'language' is not applicable if the data is an image; 'contributor' can be +repeated if there is more than one. + +### Choosing a Metadata Standard + +While Dublin Core is one of the most popular metadata standards, we often +require some extensions for our research project. Across different domains, +there are a lot of metadata standards in the community and it is important to +select one that is appropriate to your scientific domain and the research +project. Below are some places that could help you to explore metadata +standards in your domain: + +- [Digital Curation Centre](https://www.dcc.ac.uk/guidance/standards/metadata) +- [RDA Metadata Standards Catalog](https://rdamsc.bath.ac.uk) +- [FAIRsharing](https://fairsharing.org/search?fairsharingRegistry=Standard) + +::::challenge{id=metadata-q1 title="Metadata Q1"} +It was stated earlier that without sufficient metadata, 'computational agents +cannot understand the data characteristics correctly.' What does this mean in +practical terms? +:::: + +:::solution +Without sufficient metadata, scripts and workflows cannot reliably determine +what the data actually represent. For instance, given a number that represents +a temperature but has no unit, it is impossible for a computational agent to +know whether it is in Celsius, Kelvin or Fahrenheit. 
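The temperature example can be made concrete with a short Python sketch. The unit labels and the function name are illustrative assumptions, not part of any metadata standard; the point is only that a script can convert a value safely when the unit travels with it, and has to stop when it does not.

```python
from typing import Optional

# Conversions to kelvin keyed by an explicit unit label (labels are illustrative).
_TO_KELVIN = {
    "kelvin": lambda v: v,
    "celsius": lambda v: v + 273.15,
    "fahrenheit": lambda v: (v - 32.0) * 5.0 / 9.0 + 273.15,
}


def to_kelvin(value: float, unit: Optional[str]) -> float:
    """Convert a temperature to kelvin, refusing to guess a missing unit."""
    if unit is None:
        raise ValueError("no unit metadata: the value cannot be interpreted")
    try:
        convert = _TO_KELVIN[unit.lower()]
    except KeyError:
        raise ValueError(f"unrecognised unit: {unit!r}") from None
    return convert(value)
```

Here `to_kelvin(25.0, "celsius")` can be computed, whereas `to_kelvin(25.0, None)` raises an error rather than silently assuming a unit.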
+ +In addition, other computational agents cannot translate the data into a +different format because the structure and meaning are not complete without +sufficient metadata. For example, given a column 'T' that contains temperatures +in Celsius for a research project about bird migration patterns, it could mean +ambient temperature or body temperature of birds. A human expert may be able to +figure it out but to be FAIR compliant, we need to ensure a computational +agent is provided with enough metadata. +::: + +::::challenge{id=metadata-q2 title="Metadata Q2"} +Is a README file containing a detailed description of the data sufficient to +adhere to the FAIR principles? +:::: + +:::solution +No, a README file is for humans and the FAIR principles emphasise +machine-actionability. No matter how much description/metadata you provide in a +README file, it is not enough for FAIR compliance (although providing one is +strongly encouraged!) +::: + +## Non-data Digital Assets + +Modern scientific research often generates outputs that are not traditionally +considered as data but are essential in the FAIR principles and for other +researchers to validate the results, so it is important to recognise those +research outputs and make them FAIR. + +Below is a list of what non-data digital assets could mean: + +- software +- computational workflows +- algorithms +- protocols +- electronic lab notebooks +- computational environment/containers +- AI/ML models +- research plans +- API docs +- preprints + +Extensions to the FAIR principles may exist for some of the above research +outputs, for instance, [FAIR4RS](https://doi.org/10.1038/s41597-022-01710-x) is +an extension to the FAIR principles for research software. Websites such as +[protocols.io](https://www.protocols.io) support sharing of methods and +workflows, and registries like [Docker Hub](https://hub.docker.com) contain a +lot of computational environments that allow researchers to reproduce analysis +in a convenient way. 
+ +::::challenge{id=nondata-q1 title="Nondata Q1"} +Think about your own research (or a research project you are familiar with). +What non-data digital assets does it produce that you might not have previously +considered? +:::: + +:::solution +Easily overlooked items may include software, scripts, protocols, computational +environment, or analysis workflows. Remember that many research outputs that +are not traditionally considered as data/metadata are worth sharing and preserving. +::: + +## Conclusion + +There are many kinds of research digital assets and we can roughly divide them +into 3 categories: data, metadata and non-data digital assets. This module +explains what they are and how you can make them more FAIR, and points to +resources that are useful for further exploration. By identifying all the research outputs from +your project, you can apply the FAIR principles to them early on in your +project. diff --git a/software_project_management/fair/03_persistent_identifiers.md b/software_project_management/fair/03_persistent_identifiers.md new file mode 100644 index 00000000..3aa4744e --- /dev/null +++ b/software_project_management/fair/03_persistent_identifiers.md @@ -0,0 +1,136 @@ +--- +name: Persistent Identifiers +dependsOn: ["software_project_management.fair.02_data_metadata_nondata"] +tags: [] +learningOutcomes: + - Understand what a persistent identifier is and how it works + - Identify different types of persistent identifiers for different objects +attribution: + - citation: Best Practices for Tombstone Pages + url: https://support.datacite.org/docs/tombstone-pages + image: https://files.readme.io/fb9347b-small-DataCite-Logo_secondary.png + license: CC-BY-4.0 + - citation: What is a DOI? 
+ url: https://www.doi.org/the-identifier/what-is-a-doi/ + image: https://www.doi.org/images/logos/header_logo_cropped_registered.svg + license: CC-BY-4.0 + - citation: ORCID + url: https://orcid.org + image: https://orcid.org/assets/vectors/orcid.logo.svg + license: CC0 1.0 + - citation: Research Organization Registry (ROR) + url: https://ror.org + image: https://ror.org/img/ror-logo.svg + license: CC-BY-4.0 +--- +## Introduction + +Almost none of the FAIR guiding principles is feasible without +persistent identifiers, which highlights their importance. In +this module, we are going to look at what they are and how they work. We will +also look at some common types of persistent identifiers that are used to +identify digital assets, researchers and research organisations. + +## What is it? + +An identifier is a unique reference to an object, regardless of where it is. +You can think of it as your passport number. Your passport number is unique to +you and it will remain the same wherever you are. + +Nothing is inherently persistent or permanent, especially in the digital +world. In the case of an identifier, how persistent it is depends entirely on +the commitment from the people or organisations that create it. + +With these in mind, we can define a persistent identifier as + +> a long-lasting reference to a digital or non-digital asset that provides a +> globally unique name independent of its location. + +## How does it work? + +Behind all persistent identifiers, there are _resolution_ processes that +translate identifiers into the current location of the digital asset they refer +to. When you query a resolution system with a persistent identifier, it looks +the identifier up in its registry, the registry returns the current location +(e.g. URL) of the digital asset, and the system finally redirects you to the resource. 
For +non-digital assets, it will redirect you to the metadata, for example, a web +page that shows the name and current affiliation of a researcher. + +When the digital asset is relocated, the information in the registry and all +the related metadata are updated. Human researchers and computational agents +always use the same identifier to reach the same digital asset, no matter where +or how many times it has been moved. This solves the problem of URLs that no +longer resolve once a digital asset has been moved, and it allows computational +agents to discover the digital asset autonomously, one of the foundational +aspects of the FAIR principles. + +If the underlying digital asset needs to be deleted, the convention is to +redirect users to a 'tombstone page' which displays all the metadata and +explains why the asset was removed. This is important for the digital asset +to remain compliant with the 'To be Accessible' FAIR principle. + +## Types of persistent identifiers + +### Digital Object Identifiers (DOIs) + +The DOI is arguably the persistent identifier most familiar to researchers, as +it often appears together with journal articles. However, DOIs are not limited +to publications: they are also used for research datasets, software, protocols, +preprints and more. + +A DOI has a prefix starting with '10.' and a suffix, separated by a slash. It +was standardised as ISO 26324 and, since its inception in 1997, DOIs are +estimated to have been resolved over 123 billion times (at the time of writing). +Besides standard DOIs, [shortdoi](https://shortdoi.org) creates shortened DOIs +that are ideal for sharing research data via social media and mobile platforms. + +### Open Researcher and Contributor ID (ORCID) + +ORCID identifies researchers with 16-digit identifiers in the format +XXXX-XXXX-XXXX-XXXX. It keeps a researcher identifiable even after a change +of name because of marriage or other legal reasons.
It also prevents +confusion when a researcher has a common name (such as the surname Smith or +Wang, or a given name such as Yi or Olivia), or when the same name follows +different conventions (such as the order of surname and given name, or whether +to use a hyphenated form). + +Registering for an ORCID is straightforward. As of 2022, there were over 15 +million registered records and the number continues to grow, showing wide +adoption across the research community. + +### Research Organization Registry (ROR) + +ROR identifies research organisations with a 9-character string +(e.g. 052gg0110). Similarly to ORCID, it resolves ambiguous naming of academic +institutions (e.g. University of Oxford or Oxford University) and renaming +for historical or strategic reasons (e.g. 'UK Centre for Medical +Research and Innovation' was renamed 'The Francis Crick Institute' in 2011). + +ROR was launched in 2019 and, as of 2026, it contains more than 116 thousand +research organisations. + +### Others + +There are other kinds of persistent identifiers, such as [Archival Resource +Keys](https://arks.org), which are similar to DOIs but cheaper and more +decentralised, and [RAiD](https://www.raid.org), which identifies a research +project and serves as a pointer to other persistent +identifiers. + +There are also software-specific persistent identifiers such as +[SoftWare Hash IDentifier (SWHID)](https://www.softwareheritage.org/software-hash-identifier-swhid/) +(standardised as ISO 18670 in 2025) and [bio.tools](https://bio.tools), which is +specific to software tools in bioinformatics and the life sciences. + +## Conclusion + +This module discussed what persistent identifiers are and how they +work, followed by the major types and other kinds of persistent identifiers for +different use cases. Persistent identifiers are the foundation of the FAIR +principles. Without them, (meta)data cannot be discovered in a reliable way +and hence is not findable.
Persistent identifiers also enable access to (meta)data +through a standardised communication protocol (HTTP) and provide an +inter-connected picture of your research outputs, such as linking the +researcher (by ORCID) to a dataset (by DOI) which was analysed by a particular +piece of software (by DOI/SWHID). They also provide information for tracking +the provenance of these outputs reliably and autonomously. diff --git a/software_project_management/fair/index.md b/software_project_management/fair/index.md new file mode 100644 index 00000000..23bf5c3f --- /dev/null +++ b/software_project_management/fair/index.md @@ -0,0 +1,19 @@ +--- +id: fair_principles +name: FAIR Principles +dependsOn: [] +files: + [ + 01_understanding_fair.md, + 02_data_metadata_nondata.md, + 03_persistent_identifiers.md, + 04_data_repository.md, + 05_assess_fair.md, + ] +summary: | + This course provides a walkthrough of the FAIR Guiding Principles for + scientific data and how to progressively improve FAIRness in your research. +--- + +This course provides a walkthrough of the FAIR Guiding Principles for +scientific data and how to progressively improve FAIRness in your research.