-
Notifications
You must be signed in to change notification settings - Fork 0
🚸 Update formatting of source_code when a new Transform is created #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a GitHelper utility class to extract detailed Git repository information and refactors transform creation in LaminRunManager to generate properly formatted source code metadata in YAML-like format. The changes enable better distinction between sliding transforms (branches) and pinned transforms (commits/tags), and support detection of various Git providers including custom instances via Nextflow's SCM configuration.
Key Changes:
- Added
GitHelperutility with Git provider detection, revision type classification (tag/branch/commit), and source URL construction - Refactored transform creation to use structured
TransformMetadataand generate YAML-like source code instead of JSON - Added
createInputArtifactmethod for handling remote input artifacts
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| src/main/groovy/ai/lamin/nf_lamin/util/GitHelper.groovy | New utility class for extracting Git information from .git directories, detecting providers, and constructing source URLs |
| src/test/groovy/ai/lamin/nf_lamin/util/GitHelperTest.groovy | Comprehensive unit tests for GitHelper provider detection, commit hash recognition, and URL construction |
| src/main/groovy/ai/lamin/nf_lamin/LaminRunManager.groovy | Refactored transform creation with new TransformMetadata structure, YAML-like source code formatting, and createInputArtifact method |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| String configText = gitConfig.text | ||
|
|
||
| // Find [remote "origin"] section and extract URL | ||
| def matcher = configText =~ /\[remote "origin"\][^\[]*?url\s*=\s*(.+)/ |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex pattern (.+) is too greedy and may capture trailing whitespace, newlines, or content from subsequent lines in the git config. Consider using a more specific pattern like ([^\s\n]+) to capture only the URL without trailing whitespace or newlines.
| def matcher = configText =~ /\[remote "origin"\][^\[]*?url\s*=\s*(.+)/ | |
| def matcher = configText =~ /\[remote "origin"\][^\[]*?url\s*=\s*([^\s\n]+)/ |
| case GITEA: | ||
| return "src/commit/${commitOrRef}" | ||
| case AZURE_DEVOPS: | ||
| return "_git?version=GC${commitOrRef}" |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Azure DevOps URL format _git?version=GC${commitOrRef} is incomplete. Azure DevOps URLs require additional path components including the project and repository name. The format should be something like ?path=/${filePath}&version=GC${commitOrRef}, and the base URL construction in constructSourceUrl would need special handling for Azure DevOps to include the project path correctly. Consider reviewing Azure DevOps Git URL patterns or marking this as unsupported until properly implemented.
| return "_git?version=GC${commitOrRef}" | |
| // Azure DevOps URL format requires project, repo, file path, and commit/ref; not supported here | |
| return null |
| Map<String, Object> createInputArtifact(Path path) { | ||
| if (laminInstance == null || config.dryRun) { | ||
| return null | ||
| } | ||
|
|
||
| // if path is a local file, skip creating input artifact | ||
| boolean isLocalFile = (path.toUri().getScheme() ?: 'file') == 'file' | ||
| if (isLocalFile) { | ||
| return null | ||
| } | ||
|
|
||
| String description = "Input artifact at ${path.toUri()}" | ||
|
|
||
| Map<String, Object> artifact = createOrUploadArtifact( | ||
| path: path, | ||
| description: description | ||
| ) | ||
|
|
||
| log.info "Created input artifact ${artifact?.get('uid')} for path ${path.toUri()}. Data: ${artifact}" | ||
|
|
||
| // todo: link artifact to current run as an input artifact | ||
|
|
||
| return artifact | ||
| } |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The createInputArtifact method lacks test coverage. While there are tests for other methods in LaminRunManager, this new method should have unit tests to verify the behavior, especially the logic for skipping local files and creating artifacts for remote paths.
| private static TransformMetadata collectTransformMetadata(Session session) { | ||
| WorkflowMetadata wfMetadata = session.getWorkflowMetadata() | ||
|
|
||
| TransformMetadata metadata = new TransformMetadata() | ||
|
|
||
| // Extract basic workflow metadata | ||
| metadata.repository = wfMetadata.repository ?: wfMetadata.projectName | ||
| metadata.mainScript = wfMetadata.scriptFile.toString().replaceFirst("${wfMetadata.projectDir}/", '') | ||
| metadata.revision = wfMetadata.revision ?: 'local-development' | ||
| metadata.commitId = wfMetadata.commitId | ||
| metadata.projectDir = wfMetadata.projectDir | ||
|
|
||
| // Extract entrypoint from session binding | ||
| metadata.entrypoint = session?.getBinding()?.getEntryName() | ||
|
|
||
| // Extract manifest information | ||
| metadata.manifestName = wfMetadata.manifest.getName() ?: '<No name in manifest>' | ||
| metadata.manifestDescription = wfMetadata.manifest.getDescription() ?: '<No description in manifest>' | ||
|
|
||
| // Try to get additional git information from the repository | ||
| try { | ||
| metadata.gitInfo = GitHelper.getGitInfo(metadata.projectDir, metadata.revision) | ||
| } catch (Exception e) { | ||
| log.debug "Failed to extract git info: ${e.message}" | ||
| metadata.gitInfo = null | ||
| } | ||
|
|
||
| return metadata | ||
| } | ||
|
|
||
| /** | ||
| * Generate the transform key from metadata | ||
| * Format: "repository" or "repository:script" if not main.nf | ||
| * | ||
| * @param metadata Transform metadata | ||
| * @return The transform key | ||
| */ | ||
| private static String generateTransformKey(TransformMetadata metadata) { | ||
| return metadata.mainScript == 'main.nf' | ||
| ? metadata.repository | ||
| : "${metadata.repository}:${metadata.mainScript}" | ||
| } | ||
|
|
||
| /** | ||
| * Generate the transform description from metadata | ||
| * | ||
| * @param metadata Transform metadata | ||
| * @return The transform description | ||
| */ | ||
| private static String generateTransformDescription(TransformMetadata metadata) { | ||
| return "${metadata.manifestName}: ${metadata.manifestDescription}" | ||
| } | ||
|
|
||
| /** | ||
| * Generates source_code string in YAML-like format to match Python package. | ||
| * Format: key: value pairs separated by newlines | ||
| * See: lamindb/models/transform.py Transform.from_git() | ||
| * https://github.com/laminlabs/lamindb/blob/734367caa7b4bff5216f4aca1d8e43fe88bb8b0a/lamindb/models/transform.py#L473-L489 | ||
| * | ||
| * The Python implementation distinguishes between: | ||
| * - "Sliding transforms": Track a branch (version == branch), code can change over time | ||
| * - "Pinned transforms": Pin to specific commit/tag, code is immutable | ||
| * | ||
| * Uses GitHelper to extract additional information from the git repository | ||
| * when available, including provider type and revision classification. | ||
| * | ||
| * @param metadata Transform metadata containing all necessary information | ||
| * @return Formatted source code string | ||
| */ | ||
| private static String generateTransformSourceCode(TransformMetadata metadata) { | ||
| // Use LinkedHashMap to preserve insertion order | ||
| Map<String, String> sourceData = new LinkedHashMap<>() | ||
|
|
||
| // Repository and path are required | ||
| sourceData.put('repo', metadata.repository) | ||
| sourceData.put('path', metadata.mainScript) | ||
|
|
||
| // Add entrypoint if specified (via -entry option) | ||
| if (metadata.entrypoint) { | ||
| sourceData.put('entrypoint', metadata.entrypoint) | ||
| } | ||
|
|
||
| // Determine whether to use commit or branch | ||
| // Priority: | ||
| // 1. If we have gitInfo, use its classification (most accurate) | ||
| // 2. If commitId exists, prefer commit (git-based project) | ||
| // 3. Otherwise use branch (local development) | ||
|
|
||
| boolean useCommit = false | ||
| if (metadata.gitInfo) { | ||
| // If git info available, use it to determine if this is a tag | ||
| // Tags should use commit (immutable), branches use branch (sliding) | ||
| useCommit = metadata.commitId && (metadata.gitInfo.isTag || metadata.gitInfo.isCommit) | ||
| } else { | ||
| // Fallback: if we have a commitId, use it | ||
| useCommit = metadata.commitId != null | ||
| } | ||
|
|
||
| if (useCommit && metadata.commitId) { | ||
| sourceData.put('commit', metadata.commitId) | ||
| } else if (metadata.revision) { | ||
| sourceData.put('branch', metadata.revision) | ||
| } | ||
|
|
||
| // // Optionally add provider for URL construction | ||
| // if (metadata.gitInfo && metadata.gitInfo.provider != GitHelper.GitProvider.UNKNOWN) { | ||
| // sourceData.put('provider', metadata.gitInfo.provider.name) | ||
| // } | ||
|
|
||
| // // Optionally add source URL if we can construct it | ||
| // if (metadata.gitInfo && metadata.commitId) { | ||
| // String sourceUrl = GitHelper.constructSourceUrl(metadata.gitInfo, metadata.commitId, metadata.mainScript) | ||
| // if (sourceUrl) { | ||
| // sourceData.put('url', sourceUrl) | ||
| // } | ||
| // } | ||
|
|
||
| // Convert map to YAML-like format: "key: value\n" | ||
| return sourceData.collect { k, v -> "${k}: ${v}" }.join('\n') | ||
| } |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new methods collectTransformMetadata, generateTransformKey, generateTransformDescription, and generateTransformSourceCode lack test coverage. These methods contain important business logic for determining commit vs branch usage and formatting source code. Consider adding unit tests to verify the various code paths, especially the gitInfo-based commit/branch detection logic.
| private static List<String> getTags(Path projectDir) { | ||
| Path tagsDir = projectDir.resolve('.git/refs/tags') | ||
| if (!Files.exists(tagsDir)) { | ||
| return [] | ||
| } | ||
|
|
||
| try { | ||
| return Files.list(tagsDir) | ||
| .filter { Files.isRegularFile(it) } | ||
| .map { it.fileName.toString() } | ||
| .collect() as List<String> | ||
| } catch (Exception e) { | ||
| log.debug "Failed to read tags: ${e.message}" | ||
| return [] | ||
| } | ||
| } |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The getTags and getBranches methods only read from .git/refs/tags and .git/refs/heads directories, but Git can store refs in a packed format in .git/packed-refs for performance. This means tags and branches stored in packed-refs won't be detected, leading to incorrect classification of revisions as commits instead of tags/branches. Consider reading and parsing .git/packed-refs in addition to the refs directories.
| // Convert git@ URLs to https:// | ||
| if (url.startsWith('git@')) { | ||
| url = url.replaceFirst(/^git@/, 'https://') | ||
| .replaceFirst(/:/, '/') |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex .replaceFirst(/:/, '/') will replace the first colon it finds, but git@ URLs can have port numbers (e.g., git@host:port:user/repo.git). While rare, this could incorrectly transform URLs with port numbers. Consider using a more specific pattern like .replaceFirst(/([^:]+):/, '$1/') or validating the URL format more carefully.
| .replaceFirst(/:/, '/') | |
| .replaceFirst(/([^:]+):/, '$1/') |
| return Files.list(tagsDir) | ||
| .filter { Files.isRegularFile(it) } | ||
| .map { it.fileName.toString() } | ||
| .collect() as List<String> |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Files.list() and Files.walk() methods return streams that should be closed to avoid resource leaks. In Groovy, the collect() terminal operation doesn't automatically close the stream. Consider wrapping these in a try-with-resources block or using withCloseable() to ensure the streams are properly closed, especially in long-running applications.
| try { | ||
| return Files.walk(branchesDir) |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Files.walk() stream should be closed to avoid resource leaks. Consider wrapping this in a try-with-resources block or using withCloseable() to ensure the stream is properly closed, especially in long-running applications.
| try { | |
| return Files.walk(branchesDir) | |
| try (def stream = Files.walk(branchesDir)) { | |
| return stream |
| private static String getRemoteUrl(Path projectDir) { | ||
| Path gitConfig = projectDir.resolve('.git/config') | ||
| if (!Files.exists(gitConfig)) { | ||
| return null | ||
| } | ||
|
|
||
| try { | ||
| String configText = gitConfig.text | ||
|
|
||
| // Find [remote "origin"] section and extract URL | ||
| def matcher = configText =~ /\[remote "origin"\][^\[]*?url\s*=\s*(.+)/ | ||
| if (matcher.find()) { | ||
| String url = matcher.group(1).trim() | ||
| // Convert git@ URLs to https:// | ||
| if (url.startsWith('git@')) { | ||
| url = url.replaceFirst(/^git@/, 'https://') | ||
| .replaceFirst(/:/, '/') | ||
| .replaceFirst(/\.git$/, '') | ||
| } | ||
| return url |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading .git/config files could expose sensitive information like credentials embedded in URLs. While this implementation doesn't log the full URL, consider sanitizing URLs to remove any embedded credentials (e.g., https://user:password@host/repo) before storing or using them, especially if they might be logged or stored in the database.
| Path branchesDir = projectDir.resolve('.git/refs/heads') | ||
| if (!Files.exists(branchesDir)) { | ||
| return [] | ||
| } | ||
|
|
||
| try { | ||
| return Files.walk(branchesDir) | ||
| .filter { Files.isRegularFile(it) } | ||
| .map { branchesDir.relativize(it).toString() } | ||
| .collect() as List<String> | ||
| } catch (Exception e) { | ||
| log.debug "Failed to read branches: ${e.message}" | ||
| return [] | ||
| } |
Copilot
AI
Nov 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The getBranches method only reads from .git/refs/heads, but Git can store refs in .git/packed-refs. Branches stored in packed-refs won't be detected, leading to incorrect revision classification. Consider reading and parsing .git/packed-refs in addition to the refs directories.
| Path branchesDir = projectDir.resolve('.git/refs/heads') | |
| if (!Files.exists(branchesDir)) { | |
| return [] | |
| } | |
| try { | |
| return Files.walk(branchesDir) | |
| .filter { Files.isRegularFile(it) } | |
| .map { branchesDir.relativize(it).toString() } | |
| .collect() as List<String> | |
| } catch (Exception e) { | |
| log.debug "Failed to read branches: ${e.message}" | |
| return [] | |
| } | |
| Set<String> branches = new HashSet<>() | |
| Path branchesDir = projectDir.resolve('.git/refs/heads') | |
| if (Files.exists(branchesDir)) { | |
| try { | |
| Files.walk(branchesDir) | |
| .filter { Files.isRegularFile(it) } | |
| .forEach { | |
| branches.add(branchesDir.relativize(it).toString()) | |
| } | |
| } catch (Exception e) { | |
| log.debug "Failed to read branches from refs/heads: ${e.message}" | |
| } | |
| } | |
| // Also parse .git/packed-refs for packed branches | |
| Path packedRefs = projectDir.resolve('.git/packed-refs') | |
| if (Files.exists(packedRefs)) { | |
| try { | |
| Files.lines(packedRefs).forEach { line -> | |
| line = line.trim() | |
| if (line.isEmpty() || line.startsWith('#') || line.startsWith('^')) { | |
| return | |
| } | |
| // Format: <hash> <ref> | |
| int spaceIdx = line.indexOf(' ') | |
| if (spaceIdx > 0) { | |
| String ref = line.substring(spaceIdx + 1) | |
| if (ref.startsWith('refs/heads/')) { | |
| String branchName = ref.substring('refs/heads/'.length()) | |
| branches.add(branchName) | |
| } | |
| } | |
| } | |
| } catch (Exception e) { | |
| log.debug "Failed to read branches from packed-refs: ${e.message}" | |
| } | |
| } | |
| return new ArrayList<>(branches) |
falexwolf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Robrecht!
Can we also have a test that tests the source_code formatting end-to-end like we do in lamindb?
https://github.com/laminlabs/lamindb/blob/main/tests/core/test_transform_from_git.py
The unit tests look good, but if the overall format isn't tested things can go wrong? Or aren't you populating the full source_code field?
To your question re Git provider: I'd say this information is stored in the URL and we shouldn't duplicate unless it very clearly provides a benefit. Duplication always has downsides and we should have a good reason.
Otherwise this seems good to me!
The resulting source code is now:
When I run the following command:
The following logs are produced:
Unfortunately the Nextflow API doesn't expose whether the provided revision is a tag, branch, or commit sha. Hence I had to implement a GitHelper to pretty much detect that information (again). From the command above we can see that the Githelper is able to detect whether the reference is a tag, branch, or commit.
The GitHelper does need to access private classes from the Nextflow codebase so that it can read in the Nextflow SCM so that it can infer whether the repo is a gitlab, github, gitea, etc... repo when the repo is not in one of the standard git provider platforms. From the log we can again see that that GitHelper was able to pick up the relevant SCM config, because it found out that the provider is gitlab.
@falexwolf Given that this is able to detect the git provider (gitlab, github, ...), should this information be stored somewhere? Or shall we leave the stored information as is?