Updating data_sources.json #5938

Open
rohitkumarbhagat wants to merge 2 commits into datacommonsorg:master from rohitkumarbhagat:data-release-20260128

@rohitkumarbhagat
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @rohitkumarbhagat, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request expands the data_sources.json file. It introduces a new 'Biomedical' data category and adds data sources to the existing 'Education', 'Demographics', and 'Health' categories.

Highlights

  • New Biomedical Data Category: A comprehensive 'Biomedical' data category has been introduced, incorporating numerous biological and medical data sources from prominent organizations such as ELIXIR, ENCODE, EMBL-EBI, Gene Ontology Consortium, ICTV, Jensen Lab, PharmGKB, Human Protein Atlas, NIH: NCBI, NIH: NLM, and the University of Maryland School of Medicine.
  • Expanded Education Data Sources: Significant additions have been made to education-related data, including new data source groups from the Civil Rights Data Collection (CRDC), New York City Department of Education, Office of Civil Rights Data, United States Census Bureau, and Urban Institute. These cover areas like school attendance, finance, retention, student demographics, and instructional device availability.
  • New Demographic and Health Data: New demographic data from the National Statistical Institutes of Bulgaria, Poland, and Finland have been integrated. Additionally, health-related data from New York University (diabetes), Tennessee Department of Health (death statistics), Texas Department of Health (death statistics), and additional CDC birth count data have been added.
  • Updated WHO COVID-19 Dashboard URL: The URL for the World Health Organization's COVID-19 Dashboard data source has been updated to reflect its new location.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

The pull request introduces a significant number of new data sources across various categories, primarily in 'Biomedical', 'Demographics', 'Economy', and 'Education', along with a URL update in the 'Health' section. The additions are well-structured and enhance the dataset's coverage. However, I've identified a few minor issues: inconsistent URL schemes (HTTP vs. HTTPS), two different license statements in one description, a typo, and a url field containing two URLs where a single string is expected. Addressing these will improve the data's quality and consistency.

"dataSources": [
{
"label": "The Sequence Ontology",
"url": "http://www.sequenceontology.org/",

medium

It's best practice to use https for URLs when available, as it provides a more secure connection. The http://www.sequenceontology.org/ URL can be updated to https://www.sequenceontology.org/.
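
A check like this could be automated. The following is a minimal sketch, assuming the file is a JSON tree in which data source entries are objects with a string `url` field, as in the snippet quoted above. The function name `find_insecure_urls` and the sample data are hypothetical and not part of the repository.

```python
import json
from typing import Any, Iterator


def find_insecure_urls(node: Any, path: str = "$") -> Iterator[tuple[str, str]]:
    """Yield (json_path, url) for every "url" value that starts with http://."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "url" and isinstance(value, str) and value.startswith("http://"):
                yield child, value
            else:
                # Recurse into nested groups and lists of data sources.
                yield from find_insecure_urls(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_insecure_urls(item, f"{path}[{index}]")


# Hypothetical sample shaped like the entries quoted in this review.
sample = json.loads("""
{"dataSources": [
  {"label": "The Sequence Ontology", "url": "http://www.sequenceontology.org/"},
  {"label": "Disease Ontology", "url": "https://disease-ontology.org/"}
]}
""")

for location, url in find_insecure_urls(sample):
    print(f"{location}: {url}")
```

The sketch only reports candidates. Whether a host actually serves HTTPS still needs to be confirmed before changing the scheme.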

Suggested change
- "url": "http://www.sequenceontology.org/",
+ "url": "https://www.sequenceontology.org/",

{
"label": "DISEASES: Textmining",
"url": "https://diseases.jensenlab.org/Search",
"description": "DISEASES is a weekly updated web resource that integrates evidence on disease-gene associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies. This dataset further unifies the evidence by assigning confidence scores that facilitate comparison of the different types and sources of evidence. All files start with the following four columns: gene identifier, gene name, disease identifier, and disease name. The textmining files contain the z-score, the confidence score, and a URL to a viewer of the underlying abstracts. For further details please refer to the following Open Access articles about the database: [DISEASES: Text mining and data integration of disease-gene associations](https://www.sciencedirect.com/science/article/pii/S1046202314003831) and [DISEASES 2.0: a weekly updated database of disease–gene associations from text mining and data integration](https://academic.oup.com/database/article/doi/10.1093/database/baac019/6554833?login=false). The data is made available under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.\n\nData made available under the [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/)."

medium

The description for "DISEASES: Textmining" contains two separate license statements. Please clarify if both licenses apply, or if one supersedes the other, to avoid potential confusion regarding data usage rights.
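
Descriptions that link to more than one license could be flagged mechanically. This is a rough sketch, assuming licenses are referenced as markdown links to creativecommons.org, as in the description quoted above. The helper `license_links` and the shortened sample text are hypothetical.

```python
import re

# Matches markdown links that point at a Creative Commons page.
CC_LINK = re.compile(r"\]\((https://creativecommons\.org/[^)\s]+)\)")


def license_links(description: str) -> list[str]:
    """Return the distinct Creative Commons URLs linked in a description, in order."""
    seen: list[str] = []
    for url in CC_LINK.findall(description):
        if url not in seen:
            seen.append(url)
    return seen


# Shortened, hypothetical description carrying the same two license links
# as the "DISEASES: Textmining" entry.
description = (
    "The data is made available under the "
    "[CC-BY](https://creativecommons.org/licenses/by/4.0/) license.\n\n"
    "Data made available under the [CC0 1.0 Universal (CC0 1.0) Public Domain "
    "Dedication](https://creativecommons.org/publicdomain/zero/1.0/)."
)

links = license_links(description)
if len(links) > 1:
    print("More than one license linked:", links)
```

A hit does not prove a conflict, since a source may legitimately publish under several licenses, so the output is best treated as a prompt for manual review.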

"dataSources": [
{
"label": "NCBI Assembly",
"url": "https://www.ncbi.nlm.nih.gov/assembly",

medium

The url value for the NCBI Assembly database already uses https://, but it omits the trailing slash used by the link in the description. Please update it to https://www.ncbi.nlm.nih.gov/assembly/ for consistency.

            "description": "\"The [NCBI Assembly database](https://www.ncbi.nlm.nih.gov/assembly/) provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Consortium (INSDC) and the assembly represented in the NCBI RefSeq project\" (Kitts et al. 2016). In this import we include the metadata for all genome assemblies documented in `assembly_summary_genbank.txt` and `assembly_summary_refseq.txt`. Assemblies are stored in GenomeAssembly nodes whose information is integrated from both the GenBank and RefSeq datasets."

{
"label": "Disease Ontology",
"url": "https://disease-ontology.org/",
"description": "The Disease Ontology was developed as a project by the Institute of Genome Sciences at the University of Maryland School of Medicine. It \"is a community driven, open source ontology that is designed to link disparate datasets through disease concepts\". It provides a \"standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts\".\n\nThe data is made available under [C0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://disease-ontology.org/resources/citing-do)."

medium

There is a typo in the license reference: the link text "C0 1.0 Universal" should read "CC0 1.0 Universal".

            "description": "The Disease Ontology was developed as a project by the Institute of Genome Sciences at the University of Maryland School of Medicine. It \"is a community driven, open source ontology that is designed to link disparate datasets through disease concepts\". It provides a \"standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts\".\n\nThe data is made available under [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://disease-ontology.org/resources/citing-do)."

},
{
"label": "NITI India Population Projection",
"url": "https://ndap.niti.gov.in/dataset/7208, https://ndap.niti.gov.in/dataset/7209",

medium

The url field is a string and typically expects a single URL. This entry contains two URLs separated by a comma and space. Please consolidate this to a single primary URL, or if both are essential, consider listing the secondary URL within the description field for clarity and consistency with the schema.
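
For illustration, a single-URL check might look like the sketch below. It assumes that a comma or whitespace inside the value indicates several URLs packed into one field, which is a simplification because commas are legal in URLs. The function name `is_single_url` is hypothetical.

```python
from urllib.parse import urlsplit


def is_single_url(value: str) -> bool:
    """Return True if value looks like exactly one absolute http(s) URL."""
    # Assumption: commas or whitespace mean multiple URLs were combined.
    if "," in value or any(ch.isspace() for ch in value):
        return False
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


combined = "https://ndap.niti.gov.in/dataset/7208, https://ndap.niti.gov.in/dataset/7209"
print(is_single_url(combined))                                 # False
print(is_single_url("https://ndap.niti.gov.in/dataset/7208"))  # True
```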

            "url": "https://ndap.niti.gov.in/dataset/7208",
