VEuPathDB · steve-fischer-200 · Jan 23, 2026 · Dec 6, 2025 · Dec 8, 2025 · Jan 5, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,226 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+SiteSearchData produces and loads data for VEuPathDB site search Solr cores. It consists of a specialized WDK (Web Development Kit) model that represents component database data as Solr documents, along with programs to generate and load these documents.
+
+The data complies with the [VEuPathDB Site Search Solr schema](https://github.com/VEuPathDB/SolrDeployment/blob/master/configsets/site-search/conf).
+
+## Architecture
+
+### Core Components
+
+1. **WDK Model** (`Model/lib/wdk/`)
+   - Specialized WDK model describing how component database data is represented as Solr documents
+   - Separated by cohort/project type:
+     - `ApiCommon/` - Genomics sites (genes, ESTs, pathways, organisms, etc.)
+     - `OrthoMCL/` - OrthoMCL-specific records (groups, sequences)
+     - `EDA/` - EDA (Exploratory Data Analysis) sites
+     - `Portal/` - Portal-specific records
+     - `Shared/` - Shared records (datasets)
+   - Each cohort has:
+     - `siteSearchModel.xml` - Main model file (imports other XMLs)
+     - `siteSearchRecords.xml` - Record class definitions
+     - `*Queries.xml` - ID, vocab, attribute, and table queries
+
+2. **Data Generation Scripts** (`Model/bin/`)
+   - `dumpApiCommonWdkBatchesForSolr` - Dumps all genomics WDK record classes
+   - `dumpOrthomclWdkBatchesForSolr` - Dumps OrthoMCL batches
+   - `dumpEdaWdkBatchesForSolr` - Dumps EDA batches
+   - `ssCreateWdkRecordsBatch` - Core batch creation (called by dump scripts)
+   - `ssCreateDocumentCategoriesBatch` - Creates metadata batch for document types
+   - `ssCreateDocumentFieldsBatch` - Creates metadata batch for document fields
+   - `ssCreateWdkMetaBatch` - Creates batch for WDK searches metadata
+
+3. **Data Loading Scripts** (`Model/bin/`)
+   - `ssLoadBatch` - Loads a single batch into Solr with validation
+   - `ssLoadMultipleBatches` - Recursively discovers and loads multiple batches
+   - `ssCommitSuggesterIndex` - Commits the typeahead index
+
+4. **Metadata** (`Model/data/`)
+   - `documentTypeCategories.json` - Hard-coded metadata describing document types and categories
+   - `nonWdkDocumentFields.json` - Field metadata for non-WDK documents (e.g., Jekyll documents)
+
+5. **Configuration Templates** (`Model/config/`)
+   - `gus.config.tmpl` - Template for GUS database configuration
+   - `SiteSearchData/model.prop.tmpl` - Model properties template
+   - `SiteSearchData/model-config.xml.tmpl` - Model database connections template
+
+6. **Java Source** (`Model/src/main/java/org/eupathdb/sitesearch/`)
+   - `wsfplugin/CommunityStudyIdsPlugin.java` - WSF plugin for community studies
+   - `data/model/report/SolrLoaderReporter.java` - WDK reporter that generates Solr JSON
+
+### Batch System
+
+All data is dumped and loaded in **batches** to ensure validity and trackability. Each batch:
+- Lives in a directory: `solr-json-batch_[batch-type]_[batch-name]_[timestamp]`
+  - Example: `solr-json-batch_organism_pfal3D7_1234567890`
+- Contains:
+  - Multiple `[document-type].json` files with Solr documents
+  - A single `batch.json` file describing the batch (metadata)
+  - A single `DONE` file indicating completion
+- Each document includes batch metadata (type, name, timestamp)
+
+### WDK Model Rules (CRITICAL)
+
+The Site Search WDK Model follows strict rules documented in `Model/lib/wdk/README.md`. Key requirements:
+
+**Record Classes must:**
+- Have exactly one associated `<Question>`
+- Have `urlName` matching the parallel record class in the website's WDK model
+- Include exactly one reporter: `SolrLoaderReporter`
+- Use sentence case for `displayName`
+- Have a `<propertyList>` with a "batch" property (from [enumsConfig.xml](https://github.com/VEuPathDB/SolrDeployment/blob/master/configsets/site-search/conf/enumsConfig.xml))
+- Use only `<attributeQueryRef>`s and `<table>`s
+- Include internal `project` attribute only if records are segmented by project in Solr
+- Include internal `organismsForFilter` table only if searchable by organism
+- Include internal `display_name` attribute
+
+**QuerySets must:**
+- Set `isCacheable="false"`
+
+**AttributeQueryRefs must:**
+- Only include `name` and `displayName` XML properties
+- Never change `name` (invalidates UserDB strategies)
+- Use sentence case for `displayName`
+- May include property lists: `isSummary`, `isSubtitle`, `isSearchable`, `includeProjects`, `boost`
+
+**Tables must:**
+- Follow same rules as attributeQueryRefs
+- Have `<columnAttribute>`s with only `name` property
+- Only include text-searchable columns
+
+**Questions must:**
+- Have zero or one parameters
+
+## Build and Deployment
+
+### Building
+
+```bash
+# Maven build (compiles Java sources, packages JAR)
+mvn clean install
+
+# Ant build (installs to GUS_HOME)
+ant SiteSearchData-Installation
+
+# Docker build (builds container with dependencies)
+make build
+```
+
+The build system depends on:
+- FgpUtil (https://github.com/EuPathDB/FgpUtil.git)
+- WDK (https://github.com/EuPathDB/WDK.git)
+- WSF (https://github.com/EuPathDB/WSF.git)
+- install (https://github.com/EuPathDB/install.git)
+
+### Running Scripts
+
+Scripts require `GUS_HOME` to be set and `$GUS_HOME/bin` to be on `PATH`:
+
+```bash
+export GUS_HOME=/path/to/gus_home
+export PATH=$PATH:$GUS_HOME/bin
+```
+
+### Configuration
+
+Before running, configure `$GUS_HOME/config/` with:
+1. `gus.config` - Component database connection
+2. `SiteSearchData/model-config.xml` - appDB, userDB, accountDB connections
+3. `SiteSearchData/model.prop` - Model properties
+
+Templates are in `Model/config/`.
+
+### Testing
+
+Unit test for `ssLoadBatch`:
+```bash
+cd Model/test
+./test_ssLoadBatch [core_url]
+```
+
+Requires an empty Solr core. Set `SOLR_USER` and `SOLR_PASSWORD` if the core uses basic auth.
+
+### Tagging
+
+**IMPORTANT:** Create a new git tag every time the model is updated. This is required to rebuild the SiteSearchData container image via the `jenkins_presenter_updater` job.
+
+## Common Workflows
+
+### Dumping Data for a Site
+
+```bash
+# For genomics sites
+dumpApiCommonWdkBatchesForSolr [organism_batch_name] [other_params]
+
+# For OrthoMCL
+dumpOrthomclWdkBatchesForSolr [params]
+
+# For EDA sites
+dumpEdaWdkBatchesForSolr [params]
+```
+
+### Creating Metadata Batches
+
+```bash
+# Document type categories
+ssCreateDocumentCategoriesBatch [output_dir]
+
+# Document fields
+ssCreateDocumentFieldsBatch [wdk_service_url] [output_dir]
+
+# WDK searches metadata
+ssCreateWdkMetaBatch [site_url] [output_dir]
+```
+
+### Loading Data into Solr
+
+```bash
+# Single batch
+ssLoadBatch [solr_core_url] [batch_dir] [--replace]
+
+# Multiple batches (recursive discovery)
+ssLoadMultipleBatches [solr_core_url] [root_dir]
+
+# Commit suggester index
+ssCommitSuggesterIndex [solr_core_url]
+```
+
+### Testing Loaded Data
+
+```bash
+# Test WDK record counts in Solr against component database
+testSiteSearchWdkRecordCounts [site_url] [solr_core_url]
+
+# Test all ApiCommon QA sites (must be run from a VEuPathDB server)
+testApiCommonQaSites
+```
+
+## Local Development
+
+See `local-loading-notes.adoc` for detailed instructions on:
+- Setting up minimal GUS_HOME
+- Running local Solr instance
+- Loading batches from remote builds
+
+## Dependencies
+
+Maven dependencies include:
+- WDK model and service
+- FgpUtil (core, json, server)
+- Jersey containers (Grizzly2, server)
+- JSON processing
+- Log4j
+
+## File Locations
+
+- WDK Model XMLs: `Model/lib/wdk/[cohort]/`
+- Scripts: `Model/bin/`
+- Java sources: `Model/src/main/java/org/eupathdb/sitesearch/`
+- Metadata: `Model/data/`
+- Config templates: `Model/config/`
+- Tests: `Model/test/`
+- Docker: `dockerfiles/`
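
Note on the "Batch System" section of CLAUDE.md above: the directory naming pattern and the required `batch.json` and `DONE` files can be checked mechanically. The following is a minimal sketch in Python, not the validation that `ssLoadBatch` performs (the source does not describe that logic). The function names `parse_batch_dir_name` and `missing_batch_parts` are hypothetical, and the regular expression assumes that the batch type contains no underscore and that the timestamp is all digits, which the single example `solr-json-batch_organism_pfal3D7_1234567890` suggests but does not prove.

```python
import re
from pathlib import Path
from typing import List, NamedTuple, Optional

# Assumed shape: solr-json-batch_[batch-type]_[batch-name]_[timestamp]
# Assumption: batch-type has no underscore; timestamp is digits only.
BATCH_DIR_RE = re.compile(r"^solr-json-batch_([^_]+)_(.+)_(\d+)$")


class BatchId(NamedTuple):
    batch_type: str
    batch_name: str
    timestamp: int


def parse_batch_dir_name(dir_name: str) -> Optional[BatchId]:
    """Split a batch directory name into type, name, and timestamp."""
    match = BATCH_DIR_RE.match(dir_name)
    if match is None:
        return None
    return BatchId(match.group(1), match.group(2), int(match.group(3)))


def missing_batch_parts(batch_dir: Path) -> List[str]:
    """List the ways a directory falls short of the documented batch layout."""
    problems = []
    if parse_batch_dir_name(batch_dir.name) is None:
        problems.append("directory name does not match the batch pattern")
    if not (batch_dir / "batch.json").is_file():
        problems.append("batch.json is missing")
    if not (batch_dir / "DONE").is_file():
        problems.append("DONE is missing")
    # Every other .json file is treated as a [document-type].json file.
    doc_files = [p for p in batch_dir.glob("*.json") if p.name != "batch.json"]
    if not doc_files:
        problems.append("no [document-type].json files found")
    return problems
```

An empty list from `missing_batch_parts` means the directory has the documented pieces; it says nothing about whether the JSON inside is valid for the Solr schema.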
diff --git a/Model/bin/dumpApiCommonWdkBatchesForSolr b/Model/bin/dumpApiCommonWdkBatchesForSolr
@@ -28,7 +28,7 @@ my $modelProps = SiteSearchData::Model::Utils::getPropsFromFile("$ENV{GUS_HOME}/
 my $dbh = SiteSearchData::Model::Utils::getDbh($gusProps);
 
 SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "pathway", $modelProps->{PROJECT_ID});
-SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "popset-isolate", $modelProps->{PROJECT_ID});
+#SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "popset-isolate", $modelProps->{PROJECT_ID});
 SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "compound", $modelProps->{PROJECT_ID});
 SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "dataset-presenter", $modelProps->{PROJECT_ID});
 

diff --git a/Model/lib/wdk/ApiCommon/compoundQueries.xml b/Model/lib/wdk/ApiCommon/compoundQueries.xml
@@ -7,7 +7,7 @@
       <sql>
         <![CDATA[
             select source_id
-            from apidbTuning.CompoundAttributes
+            from webready.CompoundAttributes
         ]]>
       </sql>
     </sqlQuery>
@@ -21,7 +21,7 @@
       <sql>
         <![CDATA[
           select source_id, source_id as old_source_id
-          from apidbTuning.CompoundAttributes
+          from webready.CompoundAttributes
         ]]>
       </sql>
     </sqlQuery>
@@ -37,7 +37,7 @@
         <![CDATA[
           select source_id, compound_name, definition,
                  other_names, formula, '@PROJECT_ID@' as project
-          from apidbTuning.CompoundAttributes
+          from webready.CompoundAttributes
         ]]>
       </sql>
     </sqlQuery>
@@ -51,8 +51,8 @@
       <column name="value"/>
       <sql>
         <![CDATA[
-            select ca.source_id, struct.type, to_char(struct.structure) as value
-            from apidbTuning.CompoundAttributes ca, chebi.Structures struct
+            select ca.source_id, struct.type, struct.structure::text as value
+            from webready.CompoundAttributes ca, chebi.Structures struct
             where ca.id = struct.compound_id
               and struct.dimension = '1D'
           union
@@ -62,7 +62,7 @@
                        then REGEXP_REPLACE(cd.chemical_data,'(\d)','<sub>\1</sub>')
                      else chemical_data
                    end as value
-            from apidbTuning.CompoundAttributes ca, chebi.chemical_data cd
+            from webready.CompoundAttributes ca, chebi.chemical_data cd
             where ca.id = cd.compound_id
         ]]>
       </sql>
@@ -74,7 +74,7 @@
       <sql>
         <![CDATA[
           select ca.source_id, n.name as value
-          from apidbTuning.CompoundAttributes ca, chebi.names n
+          from webready.CompoundAttributes ca, chebi.names n
           where ca.id = n.compound_id
             and n.type='IUPAC NAME'
         ]]>
@@ -87,7 +87,7 @@
       <sql>
         <![CDATA[
           select source_id, definition
-          from apidbTuning.CompoundAttributes
+          from webready.CompoundAttributes
         ]]>
       </sql>
     </sqlQuery>
@@ -99,7 +99,7 @@
       <sql>
         <![CDATA[
           select cid.compound as source_id, cid.id as value
-          from apidbTuning.CompoundId cid
+          from webready.CompoundId cid
           where cid.type ='synonym'
         ]]>
       </sql>
@@ -114,7 +114,7 @@
         <![CDATA[
            select distinct pc.chebi_accession as source_id, pr.enzyme as ec_number,
                            pr.substrates_text, pr.products_text
-           from apidbTuning.PathwayCompounds pc, apidbTuning.PathwayReactions pr
+           from webready.PathwayCompounds pc, webready.PathwayReactions pr
            where pc.reaction_id = pr.reaction_id
              and pc.ext_db_name = pr.ext_db_name
              and pc.ext_db_version = pr.ext_db_version
@@ -132,8 +132,8 @@
                           pa.source_id as pathway_source_id,
                           pa.name as pathway_name, pa.pathway_source,
                           pr.reaction_source_id
-          from apidbTuning.PathwayCompounds pc, apidbTuning.PathwayReactions pr,
-               apidbTuning.PathwayAttributes pa
+          from webready.PathwayCompounds pc, webready.PathwayReactions pr,
+               webready.PathwayAttributes pa
           where pc.pathway_id = pa.pathway_id
             and pr.reaction_id = pc.reaction_id
             and pc.chebi_accession is not null
@@ -150,16 +150,16 @@
         <![CDATA[
           with tp_ec
           as (select distinct ec_number_gene
-              from apidbTuning.TranscriptPathway),
+              from webready.TranscriptPathway_p),
           wildcarded_pathway_ec
           as (select distinct enzyme
-              from apidbTuning.PathwayReactions
+              from webready.PathwayReactions
               where enzyme like '%.%.%.%'
                 and enzyme like '%-%'
                 and enzyme != '-.-.-.-'),
           unwildcarded_pathway_ec
           as (select distinct enzyme
-              from apidbTuning.PathwayReactions
+              from webready.PathwayReactions
               where enzyme like '%.%.%.%'
                 and enzyme not like '%-%'),
           ec_match
@@ -171,8 +171,8 @@
                 from tp_ec te, unwildcarded_pathway_ec upe
                 where te.ec_number_gene = upe.enzyme)
           select distinct ca.source_id, tp.gene_source_id, ga.product, ga.name as gene_name
-          from apidbTuning.TranscriptPathway tp, apidbTuning.PathwayReactions pr,
-               apidbTuning.PathwayCompounds pc, apidbTuning.CompoundAttributes ca,
+          from webready.TranscriptPathway_p tp, webready.PathwayReactions pr,
+               webready.PathwayCompounds pc, webready.CompoundAttributes ca,
                ec_match em, apidbTuning.GeneAttributes ga
           where tp.ec_number_gene = em.ec_number_gene
             and em.enzyme = pr.enzyme
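
Note on the `compoundQueries.xml` hunks above: they move most tables from `apidbTuning` to `webready`, but the context line `ec_match em, apidbTuning.GeneAttributes ga` is left unchanged. The diff does not say whether that is intentional. The following is a minimal sketch in Python for listing any remaining `apidbTuning.` references in a query file so they can be reviewed one by one. The function name `find_old_schema_refs` is hypothetical and the sample text is copied from the hunk above.

```python
import re
from typing import List, Tuple

# Matches schema-qualified names such as apidbTuning.GeneAttributes.
OLD_SCHEMA_RE = re.compile(r"\bapidbTuning\.(\w+)", re.IGNORECASE)


def find_old_schema_refs(text: str) -> List[Tuple[int, str]]:
    """Return (line number, table name) for each apidbTuning reference."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in OLD_SCHEMA_RE.finditer(line):
            hits.append((line_no, match.group(1)))
    return hits


sample = """\
select distinct ca.source_id, tp.gene_source_id, ga.product, ga.name as gene_name
from webready.TranscriptPathway_p tp, webready.PathwayReactions pr,
     webready.PathwayCompounds pc, webready.CompoundAttributes ca,
     ec_match em, apidbTuning.GeneAttributes ga
"""

print(find_old_schema_refs(sample))  # prints [(4, 'GeneAttributes')]
```

The scan is purely textual, so it also reports references inside SQL comments; each hit still needs a human decision about whether the table has a `webready` counterpart.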

diff --git a/Model/lib/wdk/ApiCommon/estQueries.xml b/Model/lib/wdk/ApiCommon/estQueries.xml
@@ -9,7 +9,7 @@
         <![CDATA[
           select distinct ea.source_id
           from apidb.Organism o,
-               sres.TaxonName tn, apidbTuning.EstAttributes ea,
+               sres.TaxonName tn, webready.EstAttributes_p ea,
                (
                   select ts.taxon_id as representative_taxon_id,
                          ts.species_taxon_id as taxon_id
@@ -41,6 +41,7 @@
             and taxmap.taxon_id = tn.taxon_id
             and tn.name = ea.organism
             and o.abbrev = $$organismAbbrev$$
+            and ea.org_abbrev = $$organismAbbrev$$
         ]]>
       </sql>
     </sqlQuery>
@@ -82,7 +83,8 @@
       <sql>
         <![CDATA[
           select source_id, dbest_name, vector, stage, organism, '@PROJECT_ID@' as project
-          from apidbTuning.EstAttributes
+          from webready.EstAttributes_p
+--	  where org_abbrev in (%%xxxxPARTITION_KEYS%%)
         ]]>
       </sql>
     </sqlQuery>
@@ -97,7 +99,8 @@
       <sql>
         <![CDATA[
           select source_id, source_id as old_source_id
-          from apidbTuning.EstAttributes
+          from webready.EstAttributes_p
+--	  where org_abbrev in (%%xxxxPARTITION_KEYS%%)
         ]]>
       </sql>
     </sqlQuery>