
Add Haskell language support #572

Merged: buger merged 2 commits into main from add-haskell-language-support on Jun 2, 2026

buger commented Jun 2, 2026 (Collaborator)

Summary

Adds first-class Haskell support across Probe's language-aware paths, following the Crystal integration pattern:

  • adds a tree-sitter-haskell-backed HaskellLanguage implementation for .hs and .lhs
  • wires Haskell into parser pools, query/search language aliases, symbol extraction, test detection, comments, CLI language lists, and supported-language docs
  • wires Haskell into LSP/indexing language detection, registry aliases, parser selection, symbol/FQN handling, UID rules, indexing config, relationship extraction, and workspace/project language support
  • adds focused Haskell fixtures and regression tests for symbols, extraction, query, search filters, source context, parser pool use, find_symbol_at_position() tree-sitter selection, and Haskell operator extraction

Dogfood

Validated on real Haskell projects cloned under /tmp/tmp.WnZvUXswwU and /tmp/tmp.bVINF8U3oI:

  • cabal at a51c4ee (Merge pull request #11479 from omarjatoi/11269-warnings-in-reverse)
  • haskell-language-server at 2a30435 (Avoid relying on OccNames when generating class methods (#4932))
  • shellcheck at 764802b (Merge pull request #3443 from dotysan/printf_dashv_nospace)
  • pandoc at e16f501 (cabal.project: ensure we use latest doclayout for build.)
  • xmonad at a618fb3 (ci: Regenerate haskell-ci)

Edge cases exercised:

  • GADTs and multiline data declarations: HLS Ide/Plugin/Properties.hs, Cabal SetupHooks/Rule.hs
  • type families, data families, type instances, and data instances: HLS/Cabal family-heavy files
  • pattern synonyms: HLS graph key internals
  • foreign imports/exports: Cabal paths modules and Pandoc WASM entrypoint
  • operator signatures and definitions: xmonad ManageHook.hs
  • CPP-gated Haskell modules: HLS HlsPlugins.hs
  • bird-style and LaTeX-style .lhs: HLS eval-plugin/manual literate fixtures
  • search filters and aliases: lang:haskell, lang:hs, lang:lhs
  • query JSON with --with-context on real HLS source

Commands exercised included:

  • probe symbols on Cabal, HLS, xmonad, Pandoc, and .lhs files
  • probe extract ...#ToHsType for a type family
  • probe extract ...#RuleCommands for a multiline GADT declaration
  • probe extract ...#(<+>) and probe extract ...#(-->) for Haskell operators
  • probe extract ...#PluginMethod on HLS typeclass-heavy code
  • probe query ... --language hs --with-context --format json
  • probe query ... --language lhs on bird-style and LaTeX-style literate Haskell
  • probe search 'KnownExtension AND lang:haskell', probe search 'HandlerM AND lang:hs', and probe search 'prod AND lang:lhs'

Dogfood found and fixed a real gap: operator extraction was falling back to text search for xmonad operators and could pick up comments. The fix adds Haskell prefix_id support and lets Haskell signature nodes participate in structural symbol lookup while preferring concrete definitions when available.
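
The "prefer concrete definitions" rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `Match` struct and `prefer_definitions` function are hypothetical names, and the node kinds are plain strings standing in for tree-sitter node kinds.

```rust
/// Hypothetical, simplified view of a node matched during symbol lookup.
#[derive(Debug, Clone, PartialEq)]
pub struct Match {
    /// Node kind, e.g. "signature", "function", or "bind".
    pub kind: String,
    /// 1-based line where the node starts.
    pub start_line: usize,
}

/// Keep concrete definitions when any exist; otherwise fall back to
/// signature-only matches (assumed behavior, based on the PR description).
pub fn prefer_definitions(matches: Vec<Match>) -> Vec<Match> {
    let has_definition = matches.iter().any(|m| m.kind != "signature");
    if has_definition {
        matches
            .into_iter()
            .filter(|m| m.kind != "signature")
            .collect()
    } else {
        matches
    }
}
```

With this shape, an operator that has both a type signature and a definition resolves to the definition, while an operator that only has a signature still resolves structurally instead of falling back to text search.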

Local Haskell LSP tool availability was also checked. This environment does not have haskell-language-server-wrapper, haskell-language-server, ghc, cabal, or stack installed, so live HLS startup/version validation was not possible here. Tree-sitter/LSP integration paths are covered by unit tests.

Validation

  • cargo fmt --all -- --check
  • cargo test --test haskell_language_tests
  • cargo test -p lsp-daemon haskell
  • cargo check -p probe-code
  • cargo check -p lsp-daemon
  • pre-commit hook:
    • cargo fmt --all -- --check
    • cargo clippy --all-targets --all-features -- -D warnings
    • RUST_BACKTRACE=1 cargo test --lib
    • RUST_BACKTRACE=1 cargo test --test integration_tests

probelabs Bot commented Jun 2, 2026 (Contributor)

PR Overview: Add Haskell Language Support

Summary

This PR adds comprehensive first-class Haskell language support to Probe, following the established Crystal integration pattern. It implements complete tree-sitter-based parsing, symbol extraction, and LSP integration for both .hs (Haskell) and .lhs (Literate Haskell) files across 30 files with 586 additions and 21 deletions.

Key Changes

Core Language Implementation

  • New HaskellLanguage implementation (src/language/haskell.rs): 142-line complete implementation of the LanguageImpl trait with:
    • Type declarations: data_type, newtype, type_synonym, type_family, data_family, kind_signature
    • Functions: function, bind, foreign_import, foreign_export, pattern_synonym
    • Typeclasses: class, instance with proper module hierarchy support
    • Module declarations with module_id extraction for FQN support
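
As a rough sketch of what a symbol-node predicate over these kinds could look like: the function names below are hypothetical, the kind strings are copied from the list above rather than checked against the tree-sitter-haskell grammar, and the module rule follows the `header` / `module_id` condition mentioned in the review comments.

```rust
/// Returns true for node kinds the overview lists as Haskell symbols.
/// The kind names are taken from the PR overview and are an assumption here.
pub fn is_haskell_symbol_kind(kind: &str) -> bool {
    matches!(
        kind,
        "data_type"
            | "newtype"
            | "type_synonym"
            | "type_family"
            | "data_family"
            | "kind_signature"
            | "function"
            | "bind"
            | "foreign_import"
            | "foreign_export"
            | "pattern_synonym"
            | "class"
            | "instance"
    )
}

/// A module node counts as a symbol only when it sits under a `header`
/// node and carries a `module_id` child (as described in the review).
pub fn is_module_symbol(
    kind: &str,
    parent_kind: Option<&str>,
    has_module_id_child: bool,
) -> bool {
    kind == "module" && parent_kind == Some("header") && has_module_id_child
}
```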

Dependency & Parser Integration

  • Added tree-sitter-haskell dependency to both Cargo.toml and lsp-daemon/Cargo.toml
  • Wired Haskell into parser pools across 3 locations:
    • Core parser pool (src/language/parser_pool.rs): Added hs, lhs to warm-up lists
    • LSP daemon analyzer pool (lsp-daemon/src/analyzer/tree_sitter_analyzer.rs): Full parser pool integration
    • Relationship extraction pool (lsp-daemon/src/relationship/tree_sitter_extractor.rs): Added Haskell support

Language Detection & Registration

  • Extended language factory (src/language/factory.rs): Returns HaskellLanguage for .hs and .lhs extensions
  • Updated CLI language lists (src/cli.rs): Added haskell, hs, lhs aliases to command completions
  • Enhanced language detector (lsp-daemon/src/language_detector.rs): Recognizes both extensions
  • Added to workspace/project language support (lsp-daemon/src/workspace/config.rs, project.rs)
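
The extension and alias handling above can be illustrated with a small sketch. Both function names are hypothetical and the canonical name "haskell" is an assumption; the real factory returns a language implementation rather than a string.

```rust
use std::path::Path;

/// Map a file path to a canonical language name by extension.
/// Only the Haskell arm is shown; matching is case-sensitive here,
/// which is an assumption about the real detector.
pub fn language_for_path(path: &Path) -> Option<&'static str> {
    match path.extension()?.to_str()? {
        "hs" | "lhs" => Some("haskell"),
        _ => None,
    }
}

/// Normalize a user-supplied alias (as in `--language hs` or `lang:lhs`)
/// to the same canonical name.
pub fn normalize_language_alias(alias: &str) -> Option<&'static str> {
    match alias {
        "haskell" | "hs" | "lhs" => Some("haskell"),
        _ => None,
    }
}
```

Keeping both lookups next to each other makes it easier to see when an extension is accepted in one place but not the other.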

Symbol Extraction & Analysis

  • Symbol kind mappings in tree-sitter analyzer:
    • Functions → SymbolKind::Function
    • Data types/newtypes → SymbolKind::Type
    • Classes → SymbolKind::Class
    • Instances → SymbolKind::TraitImpl
    • Pattern synonyms → SymbolKind::Constant
  • FQN (Fully Qualified Name) support with dot (.) separator for Haskell's module hierarchy
  • UID generation rules (lsp-daemon/src/symbol/language_support.rs): Added haskell() method with:
    • Scope separator: . (dot)
    • Keywords: module, import, data, newtype, type, class, instance, where, foreign, pattern
    • Case-sensitive: true
    • No overloading support
  • AST extractor integration (lsp-daemon/src/indexing/ast_extractor.rs): Added Haskell symbol kind mapping
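
A minimal sketch of the kind mapping and dot-separated FQN described above. The `SymbolKind` enum here is a reduced stand-in for the daemon's own type, the function names are hypothetical, and treating `bind` as a function is an assumption based on the overview's grouping.

```rust
/// Reduced stand-in for the daemon's symbol kind enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SymbolKind {
    Function,
    Type,
    Class,
    TraitImpl,
    Constant,
}

/// Map a Haskell node kind to a symbol kind, following the list above.
pub fn haskell_symbol_kind(node_kind: &str) -> Option<SymbolKind> {
    match node_kind {
        // `bind` grouped with functions: an assumption, see lead-in.
        "function" | "bind" => Some(SymbolKind::Function),
        "data_type" | "newtype" => Some(SymbolKind::Type),
        "class" => Some(SymbolKind::Class),
        "instance" => Some(SymbolKind::TraitImpl),
        "pattern_synonym" => Some(SymbolKind::Constant),
        _ => None,
    }
}

/// Join a module path and a symbol name with Haskell's `.` separator.
pub fn haskell_fqn(module: &str, symbol: &str) -> String {
    if module.is_empty() {
        symbol.to_string()
    } else {
        format!("{module}.{symbol}")
    }
}
```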

Test Detection

  • Haskell test file patterns (src/language/test_detection.rs):
    • *Spec.hs, *Spec.lhs (Hspec)
    • *Test.hs, *Test.lhs (Tasty/QuickCheck)
    • Test*.hs, Test*.lhs prefix patterns
  • Test node detection: Functions starting with prop_, test_, spec_
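
The patterns above could be checked along these lines. This is a sketch under assumptions: both function names are hypothetical, and the file check expects a bare file name rather than a full path.

```rust
/// Check a bare file name against the Haskell test-file patterns above:
/// `*Spec`, `*Test`, and `Test*`, each with a `.hs` or `.lhs` extension.
pub fn is_haskell_test_file(file_name: &str) -> bool {
    let stem = match file_name
        .strip_suffix(".hs")
        .or_else(|| file_name.strip_suffix(".lhs"))
    {
        Some(stem) => stem,
        None => return false,
    };
    stem.ends_with("Spec") || stem.ends_with("Test") || stem.starts_with("Test")
}

/// Check a function name against the test prefixes listed above.
pub fn is_haskell_test_function(name: &str) -> bool {
    ["prop_", "test_", "spec_"]
        .iter()
        .any(|prefix| name.starts_with(prefix))
}
```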

LSP & Indexing Integration

  • LSP database adapter (lsp-daemon/src/lsp_database_adapter.rs):
    • Symbol name extraction with Haskell-specific node types (name, variable, constructor, module_id, field_name)
    • Keyword filtering for Haskell-specific terms (data, newtype, type, where, instance)
    • Support for single-quote (') in identifiers (Haskell prime symbols)
  • Indexing configuration (lsp-daemon/src/indexing/config.rs, pipelines.rs):
    • Language-specific features: extract_typeclasses, extract_instances, extract_signatures
    • Default extensions: hs, lhs
  • LSP registry/server updates for proper language ID mapping
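
To illustrate the prime (') handling and keyword filtering, here is a hedged sketch. The function names and the `FILTERED_KEYWORDS` constant are hypothetical; the keyword list is the one given above, and the identifier rule is a simplification of Haskell's lexical rules.

```rust
/// Keywords the overview says are filtered out of symbol names.
const FILTERED_KEYWORDS: [&str; 5] = ["data", "newtype", "type", "where", "instance"];

/// Simplified Haskell identifier check: a letter or underscore first,
/// then letters, digits, underscores, or primes (as in `foldl'`).
pub fn is_haskell_identifier(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) if first.is_alphabetic() || first == '_' => {}
        _ => return false,
    }
    chars.all(|c| c.is_alphanumeric() || c == '_' || c == '\'')
}

/// A name is a candidate symbol if it is an identifier and not a keyword.
pub fn is_candidate_symbol_name(name: &str) -> bool {
    is_haskell_identifier(name) && !FILTERED_KEYWORDS.contains(&name)
}
```

Without the prime rule, a name such as `foldl'` would be truncated to `foldl` and could collide with the non-strict variant.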

Documentation

  • Updated supported languages documentation (docs/reference/supported-languages.md) with:
    • Haskell feature descriptions
    • Type/class extraction details
    • Comment handling (-- style)
    • Test detection patterns

Architecture & Impact

Component Relationships

graph TB
    A[Haskell Source Files .hs/.lhs] --> B[Parser Pool]
    B --> C[HaskellLanguage Impl]
    C --> D[Symbol Extraction]
    D --> E[TreeSitter Analyzer]
    E --> F[LSP Database Adapter]
    F --> G[UID Generator]
    G --> H[Indexing Pipelines]
    H --> I[Search & Query]
    
    J[CLI Commands] --> K[Language Factory]
    K --> C
    
    L[Test Detection] --> M[Test File Patterns]
    M --> A

Data Flow

sequenceDiagram
    participant User as CLI/User
    participant Factory as Language Factory
    participant Pool as Parser Pool
    participant Haskell as HaskellLanguage
    participant Analyzer as TreeSitter Analyzer
    participant LSP as LSP Adapter
    
    User->>Factory: get_language_impl(".hs")
    Factory->>Haskell: HaskellLanguage::new()
    User->>Pool: parse_with_pool()
    Pool->>Haskell: get_tree_sitter_language()
    Haskell-->>Pool: tree_sitter_haskell::LANGUAGE
    Pool->>Analyzer: analyze_file()
    Analyzer->>Haskell: is_symbol_node(), get_symbol_signature()
    Analyzer->>LSP: store symbols
    LSP->>LSP: generate_uid() with Haskell rules

Files Changed

Core Implementation (1 new file)

  • src/language/haskell.rs (+142 lines): Complete LanguageImpl implementation

Dependency Updates (2 files)

  • Cargo.toml: Added tree-sitter-haskell = "0.23.1"
  • lsp-daemon/Cargo.toml: Added tree-sitter-haskell = "0.23.1"

Language Integration (6 files)

  • src/language/factory.rs: Added Haskell to factory match arms
  • src/language/mod.rs: Added pub mod haskell;
  • src/language/parser_pool.rs: Added hs, lhs to warm-up lists
  • src/cli.rs: Added Haskell aliases to language argument completions
  • src/extract/formatter.rs: Added lhs to extension mapping
  • src/debug_tree_sitter.rs: Added Haskell symbol extraction rules

Symbol Extraction (3 files)

  • src/extract/symbol_finder.rs: Added Haskell identifier node types (name, variable, constructor, module_id, field_name, prefix_id)
  • src/extract/symbols.rs: Added Haskell container nodes and kind normalization
  • src/language/test_detection.rs: Added Haskell test file patterns

LSP Daemon Integration (17 files)

  • lsp-daemon/src/analyzer/language_analyzers/generic.rs: Added lhs extension detection
  • lsp-daemon/src/analyzer/tree_sitter_analyzer.rs: Added parser pool, node mapping, and tests
  • lsp-daemon/src/daemon.rs: Added tree-sitter language selection
  • lsp-daemon/src/fqn.rs: Added Haskell FQN extraction with dot separator
  • lsp-daemon/src/indexing/ast_extractor.rs: Added Haskell symbol kind mapping
  • lsp-daemon/src/indexing/config.rs: Added Haskell to language configs and features
  • lsp-daemon/src/indexing/lsp_enrichment_worker.rs: Added lhs extension support
  • lsp-daemon/src/indexing/pipelines.rs: Added Haskell pipeline configuration
  • lsp-daemon/src/language_detector.rs: Added lhs extension detection
  • lsp-daemon/src/lsp_database_adapter.rs: Added Haskell symbol extraction and keyword filtering
  • lsp-daemon/src/lsp_registry.rs: Added lhs extension mapping
  • lsp-daemon/src/lsp_server.rs: Added lhs language ID mapping
  • lsp-daemon/src/relationship/tree_sitter_extractor.rs: Added Haskell to parser pool
  • lsp-daemon/src/symbol/language_support.rs: Added haskell() method with language rules
  • lsp-daemon/src/symbol/uid_generator.rs: Added Haskell to UID rules initialization
  • lsp-daemon/src/workspace/config.rs: Added Haskell to supported languages
  • lsp-daemon/src/workspace/project.rs: Added Haskell project detection

Documentation (1 file)

  • docs/reference/supported-languages.md: Added Haskell documentation section

Testing & Validation

The PR includes:

  • Unit tests in lsp-daemon/src/analyzer/tree_sitter_analyzer.rs for parser pool and symbol extraction
  • Integration tests for find_symbol_at_position() with Haskell tree-sitter
  • Dogfooding validation on real projects (cabal, haskell-language-server, shellcheck, pandoc, xmonad)
  • Pre-commit hooks passed: cargo fmt, cargo clippy, cargo test

Scope & Impact

Direct Impact

  • 30 files changed across core and LSP daemon crates
  • 586 additions, 21 deletions - focused, surgical changes
  • Zero breaking changes - purely additive feature

Cross-Module Boundaries

  • Core → LSP Daemon: Shared tree-sitter dependency and language enum
  • CLI → Language Factory: Command-line language selection
  • Indexing → Symbol Extraction: Unified symbol kind mapping
  • All → Documentation: Comprehensive feature documentation

Related Components

Based on the Crystal integration pattern, future work may include:

  • Dedicated test suite file (tests/haskell_language_tests.rs)
  • Test fixtures in tests/fixtures/haskell/
  • Outline format tests for Haskell symbol display

References

Core Implementation

  • src/language/haskell.rs:1-142 - HaskellLanguage trait implementation
  • src/language/factory.rs:39 - Factory registration
  • src/language/parser_pool.rs:47,140 - Parser pool warm-up

LSP Integration

  • lsp-daemon/src/analyzer/tree_sitter_analyzer.rs:106,455,598-615 - Parser pool and node mapping
  • lsp-daemon/src/symbol/language_support.rs:364-391 - Language rules
  • lsp-daemon/src/indexing/config.rs:1406,1513-1517 - Indexing features

Testing

  • lsp-daemon/src/analyzer/tree_sitter_analyzer.rs:1321-1418 - Parser pool tests
  • lsp-daemon/src/analyzer/tree_sitter_analyzer.rs:2890-2923 - Symbol position tests
  • lsp-daemon/src/lsp_database_adapter.rs:2890-2923 - Tree-sitter integration tests

Metadata
  • Review Effort: 3 / 5
  • Primary Label: feature

Powered by Visor from Probelabs

Last updated: 2026-06-02T10:38:44.476Z | Triggered by: pr_updated | Commit: be8117b

💡 TIP: You can chat with Visor using /visor ask <your question>

probelabs Bot commented Jun 2, 2026 (Contributor)

Architecture Issues (8)

Severity Location Issue
🟢 Info src/language/parser_pool.rs:47
Haskell is placed in Tier 2 (common languages) but lacks the usage data to justify this tier. The tier system should be based on actual usage frequency. Haskell is a specialized language that would more appropriately be in Tier 3.
💡 Suggestion: Move Haskell to Tier 3 (specialized languages) to align with the tier system's design philosophy, or provide usage metrics to justify Tier 2 placement.
🟢 Info lsp-daemon/src/symbol/language_support.rs:364
The Haskell language rules are minimal compared to other languages. While Rust, TypeScript, Java, etc., have comprehensive type_aliases defined, Haskell has an empty vec. This suggests the implementation may not fully handle Haskell's type system nuances.
💡 Suggestion: Consider adding common Haskell type aliases (e.g., String -> [Char], FilePath -> String) to match the completeness of other language implementations, or document why Haskell doesn't need type aliases.
🟢 Info lsp-daemon/src/symbol/language_support.rs:364
Haskell is marked as supports_overloading: false, but Haskell does support typeclass-based overloading (e.g., one class method with a different implementation per instance). This may be an oversimplification.
💡 Suggestion: Verify that this setting produces correct UIDs for Haskell typeclass methods. If Haskell's typeclass overloading differs from traditional overloading, document this distinction.
🟡 Warning src/extract/symbol_finder.rs:54
The include_symbol_nodes parameter is a Haskell-specific workaround that creates a special case in the generic symbol finding logic. This parameter only exists to handle Haskell's operator definitions that can be represented only by signature nodes. This design couples the generic symbol finder to Haskell-specific AST structure.
💡 Suggestion: Consider moving this logic into the HaskellLanguage implementation's is_symbol_node() method or create a language-specific override mechanism instead of adding a parameter to the generic function.
🟡 Warning src/extract/symbol_finder.rs:425
Hardcoded language check matches!(extension, "hs" | "lhs") introduces a special case directly in the symbol extraction logic. This pattern violates the abstraction layer where language-specific behavior should be handled by the LanguageImpl trait, not by hardcoded extension checks.
💡 Suggestion: Move this logic into the HaskellLanguage.is_symbol_node() method or add a method to LanguageImpl to indicate whether signature nodes should be included in symbol searches.
🟡 Warning src/extract/symbol_finder.rs:513
Post-hoc filtering of signature nodes based on hardcoded language check is a special case that should be handled at the language implementation level. This logic filters out signature nodes after collection, which is inefficient and couples the generic finder to Haskell-specific needs.
💡 Suggestion: Handle this filtering in the HaskellLanguage.is_acceptable_parent() or is_symbol_node() methods to prevent signature nodes from being added to matched_nodes in the first place.
🟡 Warning src/language/haskell.rs:95
The is_symbol_node() method contains complex nested conditions for module detection that checks if a module node has a parent of kind 'header' and has a 'module_id' child. This is overly specific and creates tight coupling to tree-sitter-haskell's exact AST structure.
💡 Suggestion: Simplify this logic. If module nodes need special handling, consider a more general approach that doesn't hardcode the 'header' parent check, or document why this specific AST pattern is necessary.
🟡 Warning lsp-daemon/src/analyzer/tree_sitter_analyzer.rs:106
The extension_to_language_name() function duplicates the mapping logic that exists in the core language factory. This creates two sources of truth for language mappings, which can lead to inconsistencies.
💡 Suggestion: Consider centralizing this mapping or having the LSP daemon use the core language factory's mapping logic to avoid duplication and maintenance burden.
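
One way to read the symbol_finder.rs suggestions above is a trait-level hook instead of hardcoded extension checks. The sketch below is hypothetical: the trait is reduced to a single made-up method (`include_signature_nodes`), `OtherLanguage` is a placeholder, and the real LanguageImpl trait has a different, larger surface.

```rust
/// Reduced, hypothetical version of a language trait with one opt-in hook.
pub trait LanguageImpl {
    /// Should signature nodes participate in structural symbol lookup?
    /// Defaults to false so existing languages are unaffected.
    fn include_signature_nodes(&self) -> bool {
        false
    }
}

pub struct HaskellLanguage;
pub struct OtherLanguage;

impl LanguageImpl for HaskellLanguage {
    // Haskell operators may be reachable only through their signatures.
    fn include_signature_nodes(&self) -> bool {
        true
    }
}

impl LanguageImpl for OtherLanguage {}

/// The generic finder asks the language instead of matching on "hs" | "lhs".
pub fn candidate_kinds(lang: &dyn LanguageImpl) -> Vec<&'static str> {
    let mut kinds = vec!["function"];
    if lang.include_signature_nodes() {
        kinds.push("signature");
    }
    kinds
}
```

The default method keeps the change additive: only the Haskell implementation overrides it, and the generic finder no longer needs to know file extensions.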

Performance Issues (4)

Severity Location Issue
🟡 Warning src/extract/symbols.rs:149-156
Adding 'header', 'declarations', 'class_declarations', 'instance_declarations', and 'ERROR' to recursive collect_symbols calls may cause performance issues. These node types can create deep nesting in Haskell ASTs, and ERROR nodes indicate malformed syntax that should be handled more carefully. The current MAX_SYMBOL_DEPTH limit of 3 may not be sufficient for complex Haskell type hierarchies.
💡 Suggestion: Consider adding special handling for ERROR nodes to avoid recursion into malformed syntax. Evaluate whether MAX_SYMBOL_DEPTH=3 is sufficient for Haskell's type class hierarchies and nested instance declarations. Add metrics to track recursion depth in production.
🟡 Warning src/language/parser_pool.rs:47
Adding Haskell (hs) to Tier 2 common languages increases parser pool memory footprint. Each parser consumes ~10MB of memory, and adding Haskell to the pre-warming tier increases baseline memory usage, especially for projects that don't use Haskell.
💡 Suggestion: Consider keeping Haskell in Tier 3 (specialized languages) unless it's frequently used. Monitor memory usage in production and consider making tier assignment configurable via environment variable.
🟡 Warning lsp-daemon/src/analyzer/tree_sitter_analyzer.rs:617-723
The extract_symbol_name function creates a new cursor with node.walk() and iterates through children on every call. This function is called recursively throughout symbol extraction, causing repeated cursor allocations. For Haskell files with many symbols, this creates significant allocation overhead.
💡 Suggestion: Consider passing a reusable cursor as a parameter or using cursor reuse pattern. Measure allocation overhead using heap profiling. The existing code has a TODO comment about cursor reuse performance.
🟡 Warning src/extract/symbols.rs:272-280
The extract_symbol_name function creates a new cursor and iterates through children using node.children(&mut cursor). This cursor allocation happens for every symbol extraction call. For Haskell files with many type declarations and functions, this creates cumulative allocation overhead.
💡 Suggestion: Consider reusing cursors across calls or using a more efficient iteration pattern. The Vec::with_capacity optimization is good, but cursor reuse would provide additional performance gains.
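
The ERROR-node and depth concerns in the first row can be made concrete with a toy model. This is a sketch only: `Node` is a hypothetical stand-in for a tree-sitter node, the symbol kinds are a small sample, and the way `MAX_SYMBOL_DEPTH` is applied here is an assumption, not Probe's actual traversal.

```rust
/// Toy stand-in for a syntax tree node.
pub struct Node {
    pub kind: &'static str,
    pub children: Vec<Node>,
}

/// Depth limit named in the review; its exact semantics are assumed here.
const MAX_SYMBOL_DEPTH: usize = 3;

/// Collect symbol kinds depth-first, skipping ERROR subtrees entirely
/// and stopping once the depth limit is exceeded.
pub fn collect_symbols(node: &Node, depth: usize, out: &mut Vec<&'static str>) {
    if depth > MAX_SYMBOL_DEPTH || node.kind == "ERROR" {
        return;
    }
    if matches!(node.kind, "function" | "data_type" | "class" | "instance") {
        out.push(node.kind);
    }
    for child in &node.children {
        collect_symbols(child, depth + 1, out);
    }
}
```

Skipping ERROR subtrees bounds the work on malformed input, but it would also drop any symbols inside partially parsed regions, so it trades coverage for predictability.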

Performance Issues (1)

Severity Location Issue
🟠 Error contract:0
Output schema validation failed: must have required property 'issues'

✅ Quality Check Passed

No quality issues found – changes LGTM.


Last updated: 2026-06-02T10:21:30.322Z | Triggered by: pr_updated | Commit: be8117b

buger merged commit 9d8e702 into main on Jun 2, 2026
18 of 19 checks passed
buger deleted the add-haskell-language-support branch on June 2, 2026 10:28