226 changes: 226 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,226 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

SiteSearchData produces and loads data for VEuPathDB site search Solr cores. It consists of a specialized WDK (Web Development Kit) model that represents component database data as Solr documents, along with programs to generate and load these documents.

The data complies with the [VEuPathDB Site Search Solr schema](https://github.com/VEuPathDB/SolrDeployment/blob/master/configsets/site-search/conf).

## Architecture

### Core Components

1. **WDK Model** (`Model/lib/wdk/`)
   - Specialized WDK model describing how component database data is represented as Solr documents
   - Separated by cohort/project type:
     - `ApiCommon/` - Genomics sites (genes, ESTs, pathways, organisms, etc.)
     - `OrthoMCL/` - OrthoMCL-specific records (groups, sequences)
     - `EDA/` - EDA (Exploratory Data Analysis) sites
     - `Portal/` - Portal-specific records
     - `Shared/` - Shared records (datasets)
   - Each cohort has:
     - `siteSearchModel.xml` - Main model file (imports other XMLs)
     - `siteSearchRecords.xml` - Record class definitions
     - `*Queries.xml` - ID, vocab, attribute, and table queries

2. **Data Generation Scripts** (`Model/bin/`)
   - `dumpApiCommonWdkBatchesForSolr` - Dumps all genomics WDK record classes
   - `dumpOrthomclWdkBatchesForSolr` - Dumps OrthoMCL batches
   - `dumpEdaWdkBatchesForSolr` - Dumps EDA batches
   - `ssCreateWdkRecordsBatch` - Core batch creation (called by dump scripts)
   - `ssCreateDocumentCategoriesBatch` - Creates metadata batch for document types
   - `ssCreateDocumentFieldsBatch` - Creates metadata batch for document fields
   - `ssCreateWdkMetaBatch` - Creates batch for WDK searches metadata

3. **Data Loading Scripts** (`Model/bin/`)
   - `ssLoadBatch` - Loads a single batch into Solr with validation
   - `ssLoadMultipleBatches` - Recursively discovers and loads multiple batches
   - `ssCommitSuggesterIndex` - Commits the typeahead index

4. **Metadata** (`Model/data/`)
   - `documentTypeCategories.json` - Hard-coded metadata describing document types and categories
   - `nonWdkDocumentFields.json` - Field metadata for non-WDK documents (e.g., Jekyll documents)

5. **Configuration Templates** (`Model/config/`)
   - `gus.config.tmpl` - Template for GUS database configuration
   - `SiteSearchData/model.prop.tmpl` - Model properties template
   - `SiteSearchData/model-config.xml.tmpl` - Model database connections template

6. **Java Source** (`Model/src/main/java/org/eupathdb/sitesearch/`)
   - `wsfplugin/CommunityStudyIdsPlugin.java` - WSF plugin for community studies
   - `data/model/report/SolrLoaderReporter.java` - WDK reporter that generates Solr JSON

### Batch System

All data is dumped and loaded in **batches** to ensure validity and trackability. Each batch:
- Lives in a directory: `solr-json-batch_[batch-type]_[batch-name]_[timestamp]`
  - Example: `solr-json-batch_organism_pfal3D7_1234567890`
- Contains:
  - Multiple `[document-type].json` files with Solr documents
  - Single `batch.json` file describing the batch (metadata)
  - Single `DONE` file indicating completion
- Each document includes batch metadata (type, name, timestamp)
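
The layout above can be checked mechanically before a load is attempted. The following is a minimal sketch and not part of SiteSearchData: `batch_parts` and `batch_is_complete` are hypothetical helper names, and treating "at least one `[document-type].json` file" as the minimum is an assumption. It is written for POSIX `sh`, so the variables it sets are global.

```bash
# Hypothetical helpers, not shipped with SiteSearchData.

# Print "<batch-type> <batch-name> <timestamp>" for a batch directory path.
# The type ends at the first "_" after the prefix and the timestamp starts
# after the last "_", so a batch name that itself contains "_" stays intact.
batch_parts() {
  name=$(basename "$1")
  case "$name" in
    solr-json-batch_*_*_*) ;;
    *) return 1 ;;
  esac
  rest=${name#solr-json-batch_}
  type=${rest%%_*}
  timestamp=${rest##*_}
  middle=${rest#*_}
  printf '%s %s %s\n' "$type" "${middle%_*}" "$timestamp"
}

# Succeed only if the directory has a well-formed name, a batch.json file,
# a DONE marker, and at least one other .json file (assumed minimum).
batch_is_complete() {
  batch_parts "$1" >/dev/null || return 1
  [ -f "$1/batch.json" ] || return 1
  [ -f "$1/DONE" ] || return 1
  for f in "$1"/*.json; do
    [ -f "$f" ] || continue
    [ "$(basename "$f")" = "batch.json" ] || return 0
  done
  return 1
}
```

For the example directory above, `batch_parts solr-json-batch_organism_pfal3D7_1234567890` prints `organism pfal3D7 1234567890`.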

### WDK Model Rules (CRITICAL)

The Site Search WDK Model follows strict rules documented in `Model/lib/wdk/README.md`. Key requirements:

**Record Classes must:**
- Have exactly one associated `<Question>`
- Have `urlName` matching the parallel record class in the website's WDK model
- Include exactly one reporter: `SolrLoaderReporter`
- Use sentence case for `displayName`
- Have a `<propertyList>` with a "batch" property (from [enumsConfig.xml](https://github.com/VEuPathDB/SolrDeployment/blob/master/configsets/site-search/conf/enumsConfig.xml))
- Use only `<attributeQueryRef>`s and `<table>`s
- Include internal `project` attribute only if records are segmented by project in Solr
- Include internal `organismsForFilter` table only if searchable by organism
- Include internal `display_name` attribute

**QuerySets must:**
- Set `isCacheable="false"`

**AttributeQueryRefs must:**
- Only include `name` and `displayName` XML properties
- Never change `name` (doing so invalidates strategies saved in the UserDB)
- Use sentence case for `displayName`

They may also include property lists: `isSummary`, `isSubtitle`, `isSearchable`, `includeProjects`, `boost`

**Tables must:**
- Follow the same rules as attributeQueryRefs
- Have `<columnAttribute>`s with only a `name` property
- Only include text-searchable columns

**Questions must:**
- Have at most one parameter
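
Some of these rules lend themselves to a quick text search. As an illustration only, the sketch below flags query sets that do not set `isCacheable="false"`. `find_cacheable_querysets` is a hypothetical helper, and it assumes that the opening tag is spelled `<querySet` and that the attribute sits on the same line as the tag; a real check would parse the XML instead.

```bash
# Hypothetical check, not shipped with SiteSearchData.
# Prints every line containing "<querySet" (case-insensitive) that lacks
# isCacheable="false". Returns 0 when no such line is found, 1 otherwise.
find_cacheable_querysets() {
  ! grep -in '<querySet' "$@" | grep -v 'isCacheable="false"'
}
```

A possible invocation is `find_cacheable_querysets Model/lib/wdk/ApiCommon/*Queries.xml`, which prints any offending lines and fails if there are any.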

## Build and Deployment

### Building

```bash
# Maven build (compiles Java sources, packages JAR)
mvn clean install

# Ant build (installs to GUS_HOME)
ant SiteSearchData-Installation

# Docker build (builds container with dependencies)
make build
```

The build system depends on:
- FgpUtil (https://github.com/EuPathDB/FgpUtil.git)
- WDK (https://github.com/EuPathDB/WDK.git)
- WSF (https://github.com/EuPathDB/WSF.git)
- install (https://github.com/EuPathDB/install.git)

### Running Scripts

Scripts require `GUS_HOME` to be set and `$GUS_HOME/bin` to be on `PATH`:

```bash
export GUS_HOME=/path/to/gus_home
export PATH=$PATH:$GUS_HOME/bin
```
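
A quick sanity check of that environment might look like the sketch below. `check_site_search_env` is a hypothetical helper, not a script shipped in `Model/bin/`; it only verifies that `GUS_HOME` is set, that `$GUS_HOME/bin` exists, and that the directory appears in `PATH`.

```bash
# Hypothetical helper, not shipped with SiteSearchData.
# Returns 0 when GUS_HOME is set, $GUS_HOME/bin exists, and PATH contains it.
check_site_search_env() {
  if [ -z "${GUS_HOME:-}" ]; then
    echo "GUS_HOME is not set" >&2
    return 1
  fi
  if [ ! -d "$GUS_HOME/bin" ]; then
    echo "$GUS_HOME/bin does not exist" >&2
    return 1
  fi
  # Wrap PATH in colons so the first and last entries match the same pattern.
  case ":$PATH:" in
    *":$GUS_HOME/bin:"*) return 0 ;;
  esac
  echo "$GUS_HOME/bin is not in PATH" >&2
  return 1
}
```

Running `check_site_search_env && echo ready` before a dump or load gives an early, readable failure instead of a "command not found" later on.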

### Configuration

Before running, configure `$GUS_HOME/config/` with:
1. `gus.config` - Component database connection
2. `SiteSearchData/model-config.xml` - appDB, userDB, accountDB connections
3. `SiteSearchData/model.prop` - Model properties

Templates are in `Model/config/`.
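
One way to seed those files from the templates is sketched below. `install_config_templates` is a hypothetical helper, and it assumes that each target file name is simply the template name with the `.tmpl` suffix removed (as in `gus.config.tmpl` and `gus.config`). Existing files are left untouched, and the copied files still need their placeholder values edited by hand.

```bash
# Hypothetical helper, not shipped with SiteSearchData.
# Copies every *.tmpl file under SRC into DEST, dropping the ".tmpl" suffix
# and preserving subdirectories. Files that already exist are not overwritten.
install_config_templates() {
  src=$1
  dest=$2
  (cd "$src" && find . -type f -name '*.tmpl') | while IFS= read -r tmpl; do
    tmpl=${tmpl#./}
    target="$dest/${tmpl%.tmpl}"
    [ -e "$target" ] && continue
    mkdir -p "$(dirname "$target")"
    cp "$src/$tmpl" "$target"
    echo "created $target"
  done
}
```

From the repository root this could be called as `install_config_templates Model/config "$GUS_HOME/config"`.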

### Testing

Unit test for `ssLoadBatch`:
```bash
cd Model/test
./test_ssLoadBatch [core_url]
```

Requires an empty Solr core. Set `SOLR_USER` and `SOLR_PASSWORD` if using basic auth.

### Tagging

**IMPORTANT:** Create a new git tag every time the model is updated. This is required to rebuild the SiteSearchData container image via the `jenkins_presenter_updater` job.

## Common Workflows

### Dumping Data for a Site

```bash
# For genomics sites
dumpApiCommonWdkBatchesForSolr [organism_batch_name] [other_params]

# For OrthoMCL
dumpOrthomclWdkBatchesForSolr [params]

# For EDA sites
dumpEdaWdkBatchesForSolr [params]
```

### Creating Metadata Batches

```bash
# Document type categories
ssCreateDocumentCategoriesBatch [output_dir]

# Document fields
ssCreateDocumentFieldsBatch [wdk_service_url] [output_dir]

# WDK searches metadata
ssCreateWdkMetaBatch [site_url] [output_dir]
```

### Loading Data into Solr

```bash
# Single batch
ssLoadBatch [solr_core_url] [batch_dir] [--replace]

# Multiple batches (recursive discovery)
ssLoadMultipleBatches [solr_core_url] [root_dir]

# Commit suggester index
ssCommitSuggesterIndex [solr_core_url]
```
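
The commands above can be chained into a single step. This is only a sketch: `load_and_commit` is a hypothetical wrapper, and it assumes that the suggester commit should run after a successful load and that `ssLoadMultipleBatches` exits non-zero on failure. Neither assumption is stated in this file.

```bash
# Hypothetical wrapper, not shipped with SiteSearchData.
# Loads every batch found under ROOT_DIR, then commits the suggester index.
# The commit is skipped if the load step reports failure.
load_and_commit() {
  core_url=$1
  root_dir=$2
  ssLoadMultipleBatches "$core_url" "$root_dir" || return 1
  ssCommitSuggesterIndex "$core_url"
}
```

It would be called with the same arguments as `ssLoadMultipleBatches`, for example `load_and_commit "$SOLR_CORE_URL" ./batches`, where `SOLR_CORE_URL` and `./batches` are placeholders.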

### Testing Loaded Data

```bash
# Test WDK record counts in Solr against component database
testSiteSearchWdkRecordCounts [site_url] [solr_core_url]

# Test all ApiCommon QA sites (must be run from a VEuPathDB server)
testApiCommonQaSites
```

## Local Development

See `local-loading-notes.adoc` for detailed instructions on:
- Setting up minimal GUS_HOME
- Running local Solr instance
- Loading batches from remote builds

## Dependencies

Maven dependencies include:
- WDK model and service
- FgpUtil (core, json, server)
- Jersey containers (Grizzly2, server)
- JSON processing
- Log4j

## File Locations

- WDK Model XMLs: `Model/lib/wdk/[cohort]/`
- Scripts: `Model/bin/`
- Java sources: `Model/src/main/java/org/eupathdb/sitesearch/`
- Metadata: `Model/data/`
- Config templates: `Model/config/`
- Tests: `Model/test/`
- Docker: `dockerfiles/`
2 changes: 1 addition & 1 deletion Model/bin/dumpApiCommonWdkBatchesForSolr
@@ -28,7 +28,7 @@ my $modelProps = SiteSearchData::Model::Utils::getPropsFromFile("$ENV{GUS_HOME}/
my $dbh = SiteSearchData::Model::Utils::getDbh($gusProps);

SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "pathway", $modelProps->{PROJECT_ID});
- SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "popset-isolate", $modelProps->{PROJECT_ID});
+ #SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "popset-isolate", $modelProps->{PROJECT_ID});
SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "compound", $modelProps->{PROJECT_ID});
SiteSearchData::Model::Utils::runWdkReport($wdkServiceUrl, $targetDir, $BATCH_DIR_PREFIX, "dataset-presenter", $modelProps->{PROJECT_ID});

34 changes: 17 additions & 17 deletions Model/lib/wdk/ApiCommon/compoundQueries.xml
@@ -7,7 +7,7 @@
<sql>
<![CDATA[
select source_id
- from apidbTuning.CompoundAttributes
+ from webready.CompoundAttributes
]]>
</sql>
</sqlQuery>
@@ -21,7 +21,7 @@
<sql>
<![CDATA[
select source_id, source_id as old_source_id
- from apidbTuning.CompoundAttributes
+ from webready.CompoundAttributes
]]>
</sql>
</sqlQuery>
@@ -37,7 +37,7 @@
<![CDATA[
select source_id, compound_name, definition,
other_names, formula, '@PROJECT_ID@' as project
- from apidbTuning.CompoundAttributes
+ from webready.CompoundAttributes
]]>
</sql>
</sqlQuery>
@@ -51,8 +51,8 @@
<column name="value"/>
<sql>
<![CDATA[
- select ca.source_id, struct.type, to_char(struct.structure) as value
- from apidbTuning.CompoundAttributes ca, chebi.Structures struct
+ select ca.source_id, struct.type, struct.structure::text as value
+ from webready.CompoundAttributes ca, chebi.Structures struct
where ca.id = struct.compound_id
and struct.dimension = '1D'
union
@@ -62,7 +62,7 @@
then REGEXP_REPLACE(cd.chemical_data,'(\d)','<sub>\1</sub>')
else chemical_data
end as value
- from apidbTuning.CompoundAttributes ca, chebi.chemical_data cd
+ from webready.CompoundAttributes ca, chebi.chemical_data cd
where ca.id = cd.compound_id
]]>
</sql>
@@ -74,7 +74,7 @@
<sql>
<![CDATA[
select ca.source_id, n.name as value
- from apidbTuning.CompoundAttributes ca, chebi.names n
+ from webready.CompoundAttributes ca, chebi.names n
where ca.id = n.compound_id
and n.type='IUPAC NAME'
]]>
@@ -87,7 +87,7 @@
<sql>
<![CDATA[
select source_id, definition
- from apidbTuning.CompoundAttributes
+ from webready.CompoundAttributes
]]>
</sql>
</sqlQuery>
@@ -99,7 +99,7 @@
<sql>
<![CDATA[
select cid.compound as source_id, cid.id as value
- from apidbTuning.CompoundId cid
+ from webready.CompoundId cid
where cid.type ='synonym'
]]>
</sql>
@@ -114,7 +114,7 @@
<![CDATA[
select distinct pc.chebi_accession as source_id, pr.enzyme as ec_number,
pr.substrates_text, pr.products_text
- from apidbTuning.PathwayCompounds pc, apidbTuning.PathwayReactions pr
+ from webready.PathwayCompounds pc, webready.PathwayReactions pr
where pc.reaction_id = pr.reaction_id
and pc.ext_db_name = pr.ext_db_name
and pc.ext_db_version = pr.ext_db_version
@@ -132,8 +132,8 @@
pa.source_id as pathway_source_id,
pa.name as pathway_name, pa.pathway_source,
pr.reaction_source_id
- from apidbTuning.PathwayCompounds pc, apidbTuning.PathwayReactions pr,
- apidbTuning.PathwayAttributes pa
+ from webready.PathwayCompounds pc, webready.PathwayReactions pr,
+ webready.PathwayAttributes pa
where pc.pathway_id = pa.pathway_id
and pr.reaction_id = pc.reaction_id
and pc.chebi_accession is not null
@@ -150,16 +150,16 @@
<![CDATA[
with tp_ec
as (select distinct ec_number_gene
- from apidbTuning.TranscriptPathway),
+ from webready.TranscriptPathway_p),
wildcarded_pathway_ec
as (select distinct enzyme
- from apidbTuning.PathwayReactions
+ from webready.PathwayReactions
where enzyme like '%.%.%.%'
and enzyme like '%-%'
and enzyme != '-.-.-.-'),
unwildcarded_pathway_ec
as (select distinct enzyme
- from apidbTuning.PathwayReactions
+ from webready.PathwayReactions
where enzyme like '%.%.%.%'
and enzyme not like '%-%'),
ec_match
@@ -171,8 +171,8 @@
from tp_ec te, unwildcarded_pathway_ec upe
where te.ec_number_gene = upe.enzyme)
select distinct ca.source_id, tp.gene_source_id, ga.product, ga.name as gene_name
- from apidbTuning.TranscriptPathway tp, apidbTuning.PathwayReactions pr,
- apidbTuning.PathwayCompounds pc, apidbTuning.CompoundAttributes ca,
+ from webready.TranscriptPathway_p tp, webready.PathwayReactions pr,
+ webready.PathwayCompounds pc, webready.CompoundAttributes ca,
ec_match em, apidbTuning.GeneAttributes ga
where tp.ec_number_gene = em.ec_number_gene
and em.enzyme = pr.enzyme
9 changes: 6 additions & 3 deletions Model/lib/wdk/ApiCommon/estQueries.xml
@@ -9,7 +9,7 @@
<![CDATA[
select distinct ea.source_id
from apidb.Organism o,
- sres.TaxonName tn, apidbTuning.EstAttributes ea,
+ sres.TaxonName tn, webready.EstAttributes_p ea,
(
select ts.taxon_id as representative_taxon_id,
ts.species_taxon_id as taxon_id
@@ -41,6 +41,7 @@
and taxmap.taxon_id = tn.taxon_id
and tn.name = ea.organism
and o.abbrev = $$organismAbbrev$$
+ and ea.org_abbrev = $$organismAbbrev$$
]]>
</sql>
</sqlQuery>
@@ -82,7 +83,8 @@
<sql>
<![CDATA[
select source_id, dbest_name, vector, stage, organism, '@PROJECT_ID@' as project
- from apidbTuning.EstAttributes
+ from webready.EstAttributes_p
+ -- where org_abbrev in (%%xxxxPARTITION_KEYS%%)
]]>
</sql>
</sqlQuery>
@@ -97,7 +99,8 @@
<sql>
<![CDATA[
select source_id, source_id as old_source_id
- from apidbTuning.EstAttributes
+ from webready.EstAttributes_p
+ -- where org_abbrev in (%%xxxxPARTITION_KEYS%%)
]]>
</sql>
</sqlQuery>