Skip to content

Archive System: External Data Source Indexing#3047

Draft
jamiepine wants to merge 31 commits intomainfrom
spacedrive-data
Draft

Archive System: External Data Source Indexing#3047
jamiepine wants to merge 31 commits intomainfrom
spacedrive-data

Conversation

@jamiepine
Copy link
Copy Markdown
Member

Summary

Adds the Archive system to Spacedrive v2 - a data archival engine for indexing external sources (emails, notes, messages, etc.) beyond the filesystem.

Key additions:

  • Complete archive system implementation (~9,400 lines of new code)
  • 11 production-ready adapters (Gmail, Slack, Obsidian, Chrome, Safari, GitHub, etc.)
  • Hybrid search (FTS5 + LanceDB vector search + RRF)
  • Safety screening (Prompt Guard 2 for injection detection)
  • Comprehensive documentation (design doc + user guides)
  • License change (AGPL → FSL-1.1-ALv2)
  • README rewrite (cleaner, more focused)

Architecture

Standalone Crate

Built as crates/archive/ (package: sd-archive) for better CI caching and reusability:

crates/archive/
├── engine.rs          # Core coordinator
├── schema/            # TOML → SQL codegen
├── adapter/           # Script runtime (stdin/stdout JSONL)
├── search/            # Hybrid search (FTS5 + vector)
├── safety.rs          # Prompt Guard 2 screening
└── embedding.rs       # FastEmbed vectors

Core Integration

Integrates with v2 via library-scoped manager:

core/src/ops/sources/
├── create/            # CreateSourceAction
├── list/              # ListSourcesQuery
├── sync/              # SyncSourceAction + SourceSyncJob
└── search/            # SearchSourcesQuery

Storage Layout

Sources live alongside VDFS in library:

.sdlibrary/
├── library.db         # VDFS + source metadata
└── sources/
    └── {source-uuid}/
        ├── data.db           # Generated from TOML schema
        ├── embeddings.lance/ # Vector index
        └── schema.toml       # Type definitions

Adapters

Shipped adapters (11 total):

  • Gmail - Emails, threads, labels
  • Slack - Messages, threads, channels
  • Obsidian - Notes, links, tags
  • Chrome Bookmarks - Bookmarks, folders
  • Chrome History - Browsing history
  • Safari History - Browsing history
  • Apple Notes - Notes, attachments
  • Apple Calendar - Events, reminders
  • Apple Contacts - Contacts, groups
  • GitHub - Issues, PRs, commits
  • OpenCode - Code snippets, projects

Adapter protocol:

  • Script-based (Python, Node, Go, Rust - anything)
  • stdin/stdout JSONL communication
  • Auto-discovered from adapters/ directory
  • Schemas defined in TOML, auto-generate SQLite tables

Features

Hybrid Search

Combines two search strategies via Reciprocal Rank Fusion:

  • FTS5 - Fast keyword matching
  • LanceDB - Semantic vector search (FastEmbed)

Safety Screening

Every record passes through Prompt Guard 2 before becoming searchable:

  • Trust tiers - authored (safe) → collaborative → external (strict)
  • Quarantine - Flagged records excluded from search/AI
  • Content fencing - Results include safety metadata

Schema-Driven

Sources defined by TOML schemas, auto-generate:

  • SQLite tables + foreign keys
  • FTS5 indexes
  • Vector embeddings
  • Migration paths

Example schema:

[type]
name = "Email"
fields = [
  { name = "subject", type = "String", indexed = true },
  { name = "body", type = "Text", indexed = true, embedded = true },
  { name = "from", type = "String" },
  { name = "received_at", type = "DateTime" }
]

License Change: AGPL → FSL

Changed from AGPL-3.0 to FSL-1.1-ALv2 (Functional Source License):

Why FSL:

  • Permits all use except competing cloud services
  • Converts to Apache 2.0 after 2 years
  • Protects future Spacedrive Cloud business model
  • More permissive than AGPL for embedded/commercial use

Additional restrictions added:

  1. No managed cloud/SaaS offerings
  2. No commercial Spacedrive hosting services
  3. No competing cloud storage/sync services
  4. No managed AI agent platforms

Still permitted:

  • Internal use
  • Non-commercial research/education
  • Professional services for licensees
  • Embedding in products (non-competing)

README Rewrite

Simplified and modernized the README:

  • Before: 800+ lines, feature list, detailed quickstart
  • After: ~200 lines, clear value prop, focused architecture

New tagline: "One file manager for all your devices and clouds"

New opening:

Spacedrive is a cross-device data platform. Index files, emails, notes, and external sources. Search everything. Sync via P2P. Keep AI agents safe with built-in screening.


Documentation

Design Doc

docs/core/design/archive.md (1,114 lines)

Complete implementation plan:

  • Architecture decisions (standalone crate vs core integration)
  • V2 integration patterns (Library structure, ops registration, job system)
  • Porting catalog (~9,700 lines from prototype)
  • Conflict resolutions (LanceDB, secrets, search types)
  • Atomic implementation phases

User Documentation

docs/archive/README.md (403 lines)

User-facing guide:

  • Quick start examples
  • All 11 adapters
  • Creating custom adapters
  • API reference
  • Safety & trust tiers
  • FAQ

Crate Documentation

crates/archive/README.md (239 lines)

Developer reference:

  • Standalone usage
  • Feature flags
  • Schema format spec
  • Adapter protocol
  • Performance benchmarks

Implementation Status

✅ Completed

Phase 0: Adapters

  • Copy 11 adapters from prototype
  • Verify stdin/stdout protocol

Phase 1: Standalone Crate

  • Create crates/archive/
  • Port schema system (parser, codegen, migration)
  • Port SourceDb (SQLite operations)
  • Port adapter runtime (script subprocess)
  • Port search router (FTS + vector + RRF)
  • Port safety screening (Prompt Guard 2)
  • Port embedding model (FastEmbed)
  • Public API in lib.rs

Phase 2: Core Integration

  • Add sd-archive dependency to core
  • Create core/src/ops/sources/
  • Implement CreateSourceAction
  • Implement ListSourcesQuery
  • Library field for SourceManager (OnceCell pattern)

Documentation:

  • Design doc with v2 patterns
  • User documentation
  • Crate documentation
  • Adapter protocol spec

🚧 Next Steps

Phase 2 (continued):

  • Implement remaining operations (sync, search, delete)
  • Add SourceSyncJob + pipeline jobs
  • Database migration for library_sources table
  • Event bus integration

Phase 3: Jobs & Pipeline

  • SourceSyncJob implementation
  • SourceScreeningJob
  • SourceEmbeddingJob
  • Progress reporting via JobContext

Phase 4: Search

  • Register sources.search query
  • Integrate SearchRouter with Library
  • Safety policy enforcement

Phase 5: UI

  • Source list view
  • Add source flow
  • Sync progress
  • Search interface
  • Quarantine queue

Breaking Changes

License

  • Was: AGPL-3.0
  • Now: FSL-1.1-ALv2 (converts to Apache 2.0 after 2 years)
  • Impact: More permissive for most uses, restricts competing cloud services

Dependencies

  • Added: lancedb = "0.15" (vector search)
  • Added: fastembed = "4" (embeddings)
  • Added: Optional: ort, tokenizers, hf-hub (safety screening)

Testing

Archive Crate

cargo test -p sd-archive
cargo test -p sd-archive --features safety-screening

Core Integration

cargo test -p spacedrive-core -- sources::

Adapters

python3 adapters/gmail/sync.py --test

Performance

Benchmarks (M2 Max, 10k Gmail messages):

  • Adapter sync: ~2,000 records/sec (I/O bound)
  • FTS5 search: ~5ms (p95)
  • Vector search: ~20ms (p95)
  • Hybrid search: ~30ms (p95)
  • Embedding generation: ~100 records/sec (CPU bound)

Memory:

  • Archive crate overhead: ~10MB
  • Per-source overhead: ~5MB
  • LanceDB cache: ~50MB
  • FastEmbed model: ~100MB (shared)

Migration Guide

For Users

No migration needed. Archive is a new feature. Existing VDFS data unaffected.

For Developers

New operations available:

// TypeScript client auto-generated
core.sources.create({ name: "Gmail", adapter_id: "gmail", ... })
core.sources.list()
core.sources.sync({ source_id })
core.sources.search({ query: "..." })

New crate available:

# For external projects
sd-archive = { path = "path/to/crates/archive" }

Related

  • Design doc: docs/core/design/archive.md
  • User guide: docs/archive/README.md
  • Crate docs: crates/archive/README.md
  • Prototype: ~/Projects/spacedriveapp/spacedrive-archive-prototype

🤖 Generated with Claude Code

jamiepine and others added 20 commits March 24, 2026 14:23
…ditions

Reverts the query/response approach from #3037 and fixes the actual bugs
that caused empty ephemeral directories:

- directory_listing.rs: Restore async indexer dispatch (return empty,
  populate via events). Subdirectories from a parent's shallow index now
  correctly fall through to trigger their own indexer job.

- subscriptionManager.ts: Pre-register initial listener before calling
  transport.subscribe() so buffer replay events aren't broadcast to an
  empty listener Set.

- useNormalizedQuery.ts: Seed TanStack Query cache when oldData is
  undefined, so events arriving before the query response aren't silently
  dropped by the setQueryData updater.

Adds bridge test (Rust harness + TS integration) that reproduces the
ephemeral event streaming flow end-to-end.
Updated project description in README.md.
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 26, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5522f79a-4602-4f87-a6c0-e76fe89748c6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch spacedrive-data

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

jamiepine and others added 2 commits March 28, 2026 18:19
- Create core/src/data/ module with SourceManager wrapping sd-archive Engine
- Add Sources to GroupType and Source to ItemType enums
- Add default Sources group to new library creation
- Register source operations: create, list, get, delete, sync, list_items
- Register adapter operations: list, config, update
- Add bundled adapter sync from workspace adapters/ directory
- Add adapter update system with BLAKE3 change detection and backup/rollback
- Frontend: Sources home, source detail with virtualized list, adapters screen
- Frontend: SourcesGroup sidebar, SpaceGroup dispatch, spaceItemUtils
- Frontend: TopBar integration (path bar, search, sync, actions menu)
- Frontend: Tab title sync, adapter icon lookup hook
- Regenerate TypeScript types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant