diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..597d331
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,226 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+SiteSearchData produces and loads data for VEuPathDB site search Solr cores. It consists of a specialized WDK (Web Development Kit) model that represents component database data as Solr documents, along with programs to generate and load these documents.
+
+The data complies with the [VEuPathDB Site Search solr schema](https://github.com/VEuPathDB/SolrDeployment/blob/master/configsets/site-search/conf).
+
+## Architecture
+
+### Core Components
+
+1. **WDK Model** (`Model/lib/wdk/`)
+   - Specialized WDK model describing how component database data is represented as Solr documents
+   - Separated by cohort/project type:
+     - `ApiCommon/` - Genomics sites (genes, ESTs, pathways, organisms, etc.)
+     - `OrthoMCL/` - OrthoMCL-specific records (groups, sequences)
+     - `EDA/` - EDA (Exploratory Data Analysis) sites
+     - `Portal/` - Portal-specific records
+     - `Shared/` - Shared records (datasets)
+   - Each cohort has:
+     - `siteSearchModel.xml` - Main model file (imports other XMLs)
+     - `siteSearchRecords.xml` - Record class definitions
+     - `*Queries.xml` - ID, vocab, attribute, and table queries
+
+2. **Data Generation Scripts** (`Model/bin/`)
+   - `dumpApiCommonWdkBatchesForSolr` - Dumps all genomics WDK record classes
+   - `dumpOrthomclWdkBatchesForSolr` - Dumps OrthoMCL batches
+   - `dumpEdaWdkBatchesForSolr` - Dumps EDA batches
+   - `ssCreateWdkRecordsBatch` - Core batch creation (called by dump scripts)
+   - `ssCreateDocumentCategoriesBatch` - Creates metadata batch for document types
+   - `ssCreateDocumentFieldsBatch` - Creates metadata batch for document fields
+   - `ssCreateWdkMetaBatch` - Creates batch for WDK searches metadata
+
+3. **Data Loading Scripts** (`Model/bin/`)
+   - `ssLoadBatch` - Loads a single batch into Solr with validation
+   - `ssLoadMultipleBatches` - Recursively discovers and loads multiple batches
+   - `ssCommitSuggesterIndex` - Commits the typeahead index
+
+4. **Metadata** (`Model/data/`)
+   - `documentTypeCategories.json` - Hard-coded metadata describing document types and categories
+   - `nonWdkDocumentFields.json` - Field metadata for non-WDK documents (e.g., Jekyll documents)
+
+5. **Configuration Templates** (`Model/config/`)
+   - `gus.config.tmpl` - Template for GUS database configuration
+   - `SiteSearchData/model.prop.tmpl` - Model properties template
+   - `SiteSearchData/model-config.xml.tmpl` - Model database connections template
+
+6. **Java Source** (`Model/src/main/java/org/eupathdb/sitesearch/`)
+   - `wsfplugin/CommunityStudyIdsPlugin.java` - WSF plugin for community studies
+   - `data/model/report/SolrLoaderReporter.java` - WDK reporter that generates Solr JSON
+
+### Batch System
+
+All data is dumped and loaded in **batches** to ensure validity and trackability. Each batch:
+- Lives in a directory: `solr-json-batch_[batch-type]_[batch-name]_[timestamp]`
+  - Example: `solr-json-batch_organism_pfal3D7_1234567890`
+- Contains:
+  - Multiple `[document-type].json` files with Solr documents
+  - Single `batch.json` file describing the batch (metadata)
+  - Single `DONE` file indicating completion
+- Each document includes batch metadata (type, name, timestamp)
+
+### WDK Model Rules (CRITICAL)
+
+The Site Search WDK Model follows strict rules documented in `Model/lib/wdk/README.md`. Key requirements:
+
+**Record Classes must:**
+- Have exactly one associated ``
+- Have `urlName` matching the parallel record class in the website's WDK model
+- Include exactly one reporter: `SolrLoaderReporter`
+- Use sentence case for `displayName`
+- Have a `` with a "batch" property (from [enumsConfig.xml](https://github.com/VEuPathDB/SolrDeployment/blob/master/configsets/site-search/conf/enumsConfig.xml))
+- Use only ``s and ``s
+- Include internal `project` attribute only if records are segmented by project in Solr
+- Include internal `organismsForFilter` table only if searchable by organism
+- Include internal `display_name` attribute
+
+**QuerySets must:**
+- Set `isCacheable="false"`
+
+**AttributeQueryRefs must:**
+- Only include `name` and `displayName` XML properties
+- Never change `name` (invalidates UserDB strategies)
+- Use sentence case for `displayName`
+- May include property lists: `isSummary`, `isSubtitle`, `isSearchable`, `includeProjects`, `boost`
+
+**Tables must:**
+- Follow same rules as attributeQueryRefs
+- Have ``s with only `name` property
+- Only include text-searchable columns
+
+**Questions must:**
+- Have zero or one parameters
+
+## Build and Deployment
+
+### Building
+
+```bash
+# Maven build (compiles Java sources, packages JAR)
+mvn clean install
+
+# Ant build (installs to GUS_HOME)
+ant SiteSearchData-Installation
+
+# Docker build (builds container with dependencies)
+make build
+```
+
+The build system depends on:
+- FgpUtil (https://github.com/EuPathDB/FgpUtil.git)
+- WDK (https://github.com/EuPathDB/WDK.git)
+- WSF (https://github.com/EuPathDB/WSF.git)
+- install (https://github.com/EuPathDB/install.git)
+
+### Running Scripts
+
+Scripts require GUS_HOME setup and the scripts directory in PATH:
+
+```bash
+export GUS_HOME=/path/to/gus_home
+export PATH=$PATH:$GUS_HOME/bin
+```
+
+### Configuration
+
+Before running, configure `$GUS_HOME/config/` with:
+1. `gus.config` - Component database connection
+2. `SiteSearchData/model-config.xml` - appDB, userDB, accountDB connections
+3. `SiteSearchData/model.prop` - Model properties
+
+Templates are in `Model/config/`.
+
+### Testing
+
+Unit test for `ssLoadBatch`:
+```bash
+cd Model/test
+./test_ssLoadBatch [core_url]
+```
+
+Requires empty Solr core. Set `SOLR_USER` and `SOLR_PASSWORD` if using basic auth.
+
+### Tagging
+
+**IMPORTANT:** Create a new git tag every time the model is updated. This is required to rebuild the SiteSearchData container image via the `jenkins_presenter_updater` job.
+
+## Common Workflows
+
+### Dumping Data for a Site
+
+```bash
+# For genomics sites
+dumpApiCommonWdkBatchesForSolr [organism_batch_name] [other_params]
+
+# For OrthoMCL
+dumpOrthomclWdkBatchesForSolr [params]
+
+# For EDA sites
+dumpEdaWdkBatchesForSolr [params]
+```
+
+### Creating Metadata Batches
+
+```bash
+# Document type categories
+ssCreateDocumentCategoriesBatch [output_dir]
+
+# Document fields
+ssCreateDocumentFieldsBatch [wdk_service_url] [output_dir]
+
+# WDK searches metadata
+ssCreateWdkMetaBatch [site_url] [output_dir]
+```
+
+### Loading Data into Solr
+
+```bash
+# Single batch
+ssLoadBatch [solr_core_url] [batch_dir] [--replace]
+
+# Multiple batches (recursive discovery)
+ssLoadMultipleBatches [solr_core_url] [root_dir]
+
+# Commit suggester index
+ssCommitSuggesterIndex [solr_core_url]
+```
+
+### Testing Loaded Data
+
+```bash
+# Test WDK record counts in Solr against component database
+testSiteSearchWdkRecordCounts [site_url] [solr_core_url]
+
+# Test all ApiCommon QA sites (must run from VEuPathDB server)
+testApiCommonQaSites
+```
+
+## Local Development
+
+See `local-loading-notes.adoc` for detailed instructions on:
+- Setting up minimal GUS_HOME
+- Running local Solr instance
+- Loading batches from remote builds
+
+## Dependencies
+
+Maven dependencies include:
+- WDK model and service
+- FgpUtil (core, json, server)
+- Jersey containers (Grizzly2, server)
+- JSON processing
+- Log4j
+
+## File Locations
+
+- WDK Model XMLs: `Model/lib/wdk/[cohort]/`
+- Scripts: `Model/bin/`
+- Java sources: `Model/src/main/java/org/eupathdb/sitesearch/`
+- Metadata: `Model/data/`
+- Config templates: `Model/config/`
+- Tests: `Model/test/`
+- Docker: `dockerfiles/`
diff --git a/Model/bin/dumpApiCommonWdkBatchesForSolr b/Model/bin/dumpApiCommonWdkBatchesForSolr
index 5217ea2..e980ffe 100755
--- a/Model/bin/dumpApiCommonWdkBatchesForSolr
+++ b/Model/bin/dumpApiCommonWdkBatchesForSolr
@@ -28,7 +28,7 @@ my $modelProps = SiteSearchData::Model::Utils::getPropsFromFile("$ENV{GUS_HOME}/
 my $dbh = SiteSearchData::Model::Utils::getDbh($gusProps);
 SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "pathway", $modelProps->{PROJECT_ID});
-SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "popset-isolate", $modelProps->{PROJECT_ID});
+#SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "popset-isolate", $modelProps->{PROJECT_ID});
 SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "compound", $modelProps->{PROJECT_ID});
 SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "dataset-presenter", $modelProps->{PROJECT_ID});
diff --git a/Model/lib/wdk/ApiCommon/compoundQueries.xml b/Model/lib/wdk/ApiCommon/compoundQueries.xml
index 49e6f13..43efe0a 100644
--- a/Model/lib/wdk/ApiCommon/compoundQueries.xml
+++ b/Model/lib/wdk/ApiCommon/compoundQueries.xml
@@ -7,7 +7,7 @@
@@ -21,7 +21,7 @@
@@ -37,7 +37,7 @@
@@ -51,8 +51,8 @@
  \1') else chemical_data end as value
- from apidbTuning.CompoundAttributes ca, chebi.chemical_data cd
+ from webready.CompoundAttributes ca, chebi.chemical_data cd
  where ca.id = cd.compound_id
  ]]>
@@ -74,7 +74,7 @@
@@ -87,7 +87,7 @@
@@ -99,7 +99,7 @@
@@ -114,7 +114,7 @@
@@ -82,7 +83,8 @@
@@ -97,7 +99,8 @@
diff --git a/Model/lib/wdk/ApiCommon/geneQueries.xml b/Model/lib/wdk/ApiCommon/geneQueries.xml
index 0c10ff3..3cc2edf 100644
--- a/Model/lib/wdk/ApiCommon/geneQueries.xml
+++ b/Model/lib/wdk/ApiCommon/geneQueries.xml
@@ -40,6 +40,16 @@
+
+
+
+
+
+
+
+
@@ -47,8 +57,9 @@
@@ -69,9 +80,11 @@
@@ -85,13 +98,12 @@
@@ -101,13 +113,21 @@
@@ -138,7 +158,8 @@
@@ -151,7 +172,8 @@
@@ -162,7 +184,8 @@
@@ -172,38 +195,15 @@
-
-
-
@@ -216,69 +216,12 @@
-
-
-
-
-
@@ -289,8 +232,14 @@
@@ -310,7 +259,8 @@
  , ir.interpro_secondary_id
  , ir.interpro_desc
  , ir.interpro_family_id
- from ApidbTuning.interproresults ir
+ from Webready.interproresults_p ir
+ where ir.org_abbrev IN (%%PARTITION_KEYS%%)
  group by ir.gene_source_id
  , ir.interpro_db_name
  , ir.interpro_primary_id
@@ -344,17 +294,18 @@
  else 'alternate ID'
  end as id_type,
  gi.gene AS source_id
- from apidbTuning.GeneId gi
+ from webready.GeneId_p gi
  where regexp_like(gi.id, '(\D)')
  and gi.database_name not like '%gene2Uniprot_RSRC'
+ and gi.org_abbrev IN (%%PARTITION_KEYS%%)
  -- and gi.union_member != 'same ID'
  )
  select * from alias_query
  union
- select regexp_replace(alias, '(*)\.\d\d?$', '') as alias,
+ select regexp_replace(alias, '\(.*\)\.\d\d?$', '') as alias,
  database_name,'base name ' || id_type as id_type,source_id
  from alias_query
- where regexp_like(alias,'(*)\.\d\d?$')
+ where regexp_like(alias,'\(.*\)\.\d\d?$')
  ]]>
@@ -369,9 +320,10 @@
  from dots.NaFeatureComment nfc,
  (select nfc.na_feature_comment_id,
  ta.gene_source_id as source_id
- from dots.NaFeatureComment nfc, apidbTuning.TranscriptAttributes ta
- where ta.na_feature_id = nfc.na_feature_id
- or ta.gene_na_feature_id = nfc.na_feature_id
+ from dots.NaFeatureComment nfc, webready.TranscriptAttributes_p ta
+ where (ta.na_feature_id = nfc.na_feature_id
+ or ta.gene_na_feature_id = nfc.na_feature_id)
+ and ta.org_abbrev IN (%%PARTITION_KEYS%%)
  group by nfc.na_feature_comment_id, ta.gene_source_id
  ) ci
  where ci.na_feature_comment_id = nfc.na_feature_comment_id
@@ -391,7 +343,8 @@
@@ -406,10 +359,18 @@
@@ -420,7 +381,8 @@
@@ -434,9 +396,9 @@
@@ -505,8 +479,10 @@
  pdb_chain, pdb_id, taxon, pdb_title
  from (select distinct ps.pdb_chain, ps.pdb_title, ps.pdb_id, ps.taxon,
  ta.source_id, ta.gene_source_id
- from apidbTuning.PdbSimilarity ps, apidbTuning.TranscriptAttributes ta
- where ps.source_id = ta.source_id)
+ from webready.PdbSimilarity_p ps, webready.TranscriptAttributes_p ta
+ where ps.source_id = ta.source_id
+ and ta.org_abbrev IN (%%PARTITION_KEYS%%)
+ and ps.org_abbrev IN (%%PARTITION_KEYS%%))
  group by gene_source_id, pdb_chain, pdb_id, taxon, pdb_title
  ]]>
@@ -522,7 +498,8 @@
@@ -584,19 +561,19 @@
  ols.name as cycle_stage, phenotype_post_composition,
  phenotype_comment, pr.chebi_annotation_extension
  from dots.GeneFeature gf, apidb.PhenotypeModel pm,
- apidb.PhenotypeResult pr, sres.OntologyTerm oen,
- sres.OntologyTerm opq, sres.OntologyTerm ols,
  ( select phenotype_model_id, na_feature_id
  from apidb.PhenotypeModel
  union
  select phenotype_model_id, na_feature_id
- from apidb.NaFeaturePhenotypeModel) pmodel_feature
- where gf.na_feature_id = pmodel_feature.na_feature_id
+ from apidb.NaFeaturePhenotypeModel
+ ) pmodel_feature,
+ apidb.PhenotypeResult pr
+ left join sres.OntologyTerm oen on pr.phenotype_entity_term_id = oen.ontology_term_id
+ left join sres.OntologyTerm opq on pr.phenotype_quality_term_id = opq.ontology_term_id
+ left join sres.OntologyTerm ols on pr.life_cycle_stage_term_id = ols.ontology_term_id
+ where gf.na_feature_id = pmodel_feature.na_feature_id
  and pm.phenotype_model_id = pmodel_feature.phenotype_model_id
  and pm.phenotype_model_id = pr.phenotype_model_id
- and pr.phenotype_entity_term_id = oen.ontology_term_id (+)
- and pr.phenotype_quality_term_id = opq.ontology_term_id (+)
- and pr.life_cycle_stage_term_id = ols.ontology_term_id (+)
  ]]>
@@ -627,9 +604,11 @@
diff --git a/Model/lib/wdk/ApiCommon/organismQueries.xml b/Model/lib/wdk/ApiCommon/organismQueries.xml
index fff8bd3..9d2499a 100644
--- a/Model/lib/wdk/ApiCommon/organismQueries.xml
+++ b/Model/lib/wdk/ApiCommon/organismQueries.xml
@@ -39,19 +39,17 @@
@@ -79,7 +77,7 @@
  select oa.source_id, dsp.description
  from apidbTuning.DatasetPresenter dsp, apidbTuning.Datasetdatasource dd, apidbTuning.OrganismAttributes oa
- where dsp.type = 'genome'
+ where dd.category = 'Genomes'
  and dsp.dataset_presenter_id = dd.dataset_presenter_id
  and oa.component_taxon_id = dd.taxon_id
  ]]>
@@ -93,9 +91,9 @@
@@ -26,7 +26,7 @@
  select source_id, pathway_source,
  source_id as old_source_id, pathway_source as old_pathway_source
- from apidbTuning.PathwayAttributes
+ from webready.PathwayAttributes
  ]]>
@@ -39,7 +39,7 @@
@@ -64,16 +64,16 @@
  pr.substrates_text, pr.products_text,
  pc.compound_source_id, pc.chebi_accession,
  cid.id as compound_other_id
- from apidbTuning.PathwayAttributes pa, apidbTuning.PathwayCompounds pc,
- apidbTuning.PathwayReactions pr,
+ from webready.PathwayAttributes pa, webready.PathwayReactions pr,
+ webready.PathwayCompounds pc
+ left join
  (select compound, id
- from apidbTuning.CompoundId
+ from webready.CompoundId
  where compound != id
- ) cid
+ ) cid on pc.compound_source_id = cid.compound
  where pc.pathway_id = pa.pathway_id
  and pr.reaction_id = pc.reaction_id
  and pr.ext_db_name = pc.ext_db_name
- and pc.compound_source_id = cid.compound(+)
  ]]>
diff --git a/Model/lib/wdk/ApiCommon/sequenceQueries.xml b/Model/lib/wdk/ApiCommon/sequenceQueries.xml
index ed35a06..c540845 100644
--- a/Model/lib/wdk/ApiCommon/sequenceQueries.xml
+++ b/Model/lib/wdk/ApiCommon/sequenceQueries.xml
@@ -7,10 +7,9 @@
@@ -41,13 +40,25 @@
+
+
+
+
+
+
+
+
+
@@ -71,7 +82,8 @@
  else organism
  end as description,
  replace(sequence_type, '_', ' ') as sequence_type
- from apidbTuning.GenomicSeqAttributes
+ from webready.GenomicSeqAttributes_p
+ where org_abbrev in (%%PARTITION_KEYS%%)
  ]]>
@@ -86,7 +98,8 @@
@@ -98,7 +111,8 @@
@@ -109,13 +123,12 @@
@@ -125,13 +138,21 @@
diff --git a/Model/lib/wdk/ApiCommon/siteSearchModel.xml b/Model/lib/wdk/ApiCommon/siteSearchModel.xml
index 457e0bc..0d3ee75 100644
--- a/Model/lib/wdk/ApiCommon/siteSearchModel.xml
+++ b/Model/lib/wdk/ApiCommon/siteSearchModel.xml
@@ -30,8 +30,6 @@
-
-
diff --git a/Model/lib/wdk/ApiCommon/siteSearchRecords.xml b/Model/lib/wdk/ApiCommon/siteSearchRecords.xml
index 48a9c41..1c1c28a 100644
--- a/Model/lib/wdk/ApiCommon/siteSearchRecords.xml
+++ b/Model/lib/wdk/ApiCommon/siteSearchRecords.xml
@@ -37,11 +37,6 @@
  recordClassRef="recordClasses.sequence">
-
-
-
@@ -106,7 +101,7 @@
-
+
  organism
@@ -383,7 +378,7 @@
-
+
  organism
@@ -634,93 +629,12 @@
-
- - - - - - - - - - - popset-isolate - - - - source_id - - - - - 100 - - - - - - - - - - - - - - true - - - - - - - - true - - - 1.5 - - - - - true - - - 1.5 - - - - - - - - - - - - -
- - -
diff --git a/Model/lib/wdk/OrthoMCL/groupRecordQueries.xml b/Model/lib/wdk/OrthoMCL/groupRecordQueries.xml
index 5e19a5c..ab075b3 100644
--- a/Model/lib/wdk/OrthoMCL/groupRecordQueries.xml
+++ b/Model/lib/wdk/OrthoMCL/groupRecordQueries.xml
@@ -83,7 +83,7 @@
@@ -116,13 +116,13 @@
diff --git a/Model/lib/wdk/OrthoMCL/sequenceRecordQueries.xml b/Model/lib/wdk/OrthoMCL/sequenceRecordQueries.xml
index e855f65..5cc7538 100644
--- a/Model/lib/wdk/OrthoMCL/sequenceRecordQueries.xml
+++ b/Model/lib/wdk/OrthoMCL/sequenceRecordQueries.xml
@@ -214,7 +214,7 @@ from apidbTuning.SequenceAttributes
  ot.three_letter_abbrev AS abbreviation,
  ot.name AS taxon
  FROM dots.ExternalAaSequence eas, apidb.orthomcltaxon ot
- WHERE NVL(SUBSTR(eas.secondary_identifier, 0, INSTR(eas.secondary_identifier, '|')-1), eas.secondary_identifier) = ot.three_letter_abbrev
+ WHERE COALESCE(SUBSTRING(eas.secondary_identifier, 1, POSITION('|' IN eas.secondary_identifier)-1), eas.secondary_identifier) = ot.three_letter_abbrev
  ]]>
diff --git a/Model/lib/wdk/Shared/datasetQueries.xml b/Model/lib/wdk/Shared/datasetQueries.xml
index 1946ca1..bdbaec0 100644
--- a/Model/lib/wdk/Shared/datasetQueries.xml
+++ b/Model/lib/wdk/Shared/datasetQueries.xml
@@ -5,12 +5,13 @@
+
diff --git a/build.xml b/build.xml
index 70610c3..cac5684 100644
--- a/build.xml
+++ b/build.xml
@@ -12,11 +12,7 @@
-