Skip to content

Conversation

@rcannood
Copy link
Collaborator

@rcannood rcannood commented Nov 20, 2025

The resulting source code is now:

repo: https://github.com/laminlabs/nf-lamin
path: validation/main.nf
branch: main # or commit: abcdef if a tag or commit
entrypoint: foobar

When I run the following command:

NXF_VER=25.10.0 nextflow \
  -trace ai.lamin \
  run https://packages.viash-hub.com/vsh/toolbox \
  -r v0.1.x \
  -main-script target/nextflow/bgzip/main.nf \
  -profile docker \
  -latest \
  -resume \
  --input README.md \
  --output out.tar.gz \
  --publish_dir output

The following logs are produced:

Nov-20 10:21:19.839 [main] DEBUG nextflow.scm.ProviderConfig - Detected SCM config: [providers:[vsh:[server:packages.viash-hub.com, platform:gitlab]]]
Nov-20 10:21:19.846 [main] DEBUG ai.lamin.nf_lamin.util.GitHelper - Extracted git info: GitInfo(provider=gitlab, remoteUrl=https://packages.viash-hub.com/vsh/toolbox.git, isTag=false, isBranch=true, isCommit=false)

Unfortunately the Nextflow API doesn't expose whether the provided revision is a tag, branch, or commit sha. Hence I had to implement a GitHelper to pretty much detect that information (again). From the command above we can see that the Githelper is able to detect whether the reference is a tag, branch, or commit.


The GitHelper does need to access private classes from the Nextflow codebase so that it can read in the Nextflow SCM so that it can infer whether the repo is a gitlab, github, gitea, etc... repo when the repo is not in one of the standard git provider platforms. From the log we can again see that that GitHelper was able to pick up the relevant SCM config, because it found out that the provider is gitlab.


@falexwolf Given that this is able to detect the git provider (gitlab, github, ...), should this information be stored somewhere? Or shall we leave the stored information as is?

@github-actions
Copy link

github-actions bot commented Nov 20, 2025

@github-actions github-actions bot temporarily deployed to pull request November 20, 2025 10:29 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 21, 2025 12:39 Inactive
@rcannood rcannood requested a review from falexwolf November 27, 2025 07:45
@rcannood rcannood changed the title Fix resulting source code when transform is created 🚸 Update formatting of source_code when a new Transform is created Nov 27, 2025
@rcannood rcannood requested a review from Copilot November 27, 2025 20:01
Copilot finished reviewing on behalf of rcannood November 27, 2025 20:05
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a GitHelper utility class to extract detailed Git repository information and refactors transform creation in LaminRunManager to generate properly formatted source code metadata in YAML-like format. The changes enable better distinction between sliding transforms (branches) and pinned transforms (commits/tags), and support detection of various Git providers including custom instances via Nextflow's SCM configuration.

Key Changes:

  • Added GitHelper utility with Git provider detection, revision type classification (tag/branch/commit), and source URL construction
  • Refactored transform creation to use structured TransformMetadata and generate YAML-like source code instead of JSON
  • Added createInputArtifact method for handling remote input artifacts

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.

File Description
src/main/groovy/ai/lamin/nf_lamin/util/GitHelper.groovy New utility class for extracting Git information from .git directories, detecting providers, and constructing source URLs
src/test/groovy/ai/lamin/nf_lamin/util/GitHelperTest.groovy Comprehensive unit tests for GitHelper provider detection, commit hash recognition, and URL construction
src/main/groovy/ai/lamin/nf_lamin/LaminRunManager.groovy Refactored transform creation with new TransformMetadata structure, YAML-like source code formatting, and createInputArtifact method

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

String configText = gitConfig.text

// Find [remote "origin"] section and extract URL
def matcher = configText =~ /\[remote "origin"\][^\[]*?url\s*=\s*(.+)/
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern (.+) is too greedy and may capture trailing whitespace, newlines, or content from subsequent lines in the git config. Consider using a more specific pattern like ([^\s\n]+) to capture only the URL without trailing whitespace or newlines.

Suggested change
def matcher = configText =~ /\[remote "origin"\][^\[]*?url\s*=\s*(.+)/
def matcher = configText =~ /\[remote "origin"\][^\[]*?url\s*=\s*([^\s\n]+)/

Copilot uses AI. Check for mistakes.
case GITEA:
return "src/commit/${commitOrRef}"
case AZURE_DEVOPS:
return "_git?version=GC${commitOrRef}"
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Azure DevOps URL format _git?version=GC${commitOrRef} is incomplete. Azure DevOps URLs require additional path components including the project and repository name. The format should be something like ?path=/${filePath}&version=GC${commitOrRef}, and the base URL construction in constructSourceUrl would need special handling for Azure DevOps to include the project path correctly. Consider reviewing Azure DevOps Git URL patterns or marking this as unsupported until properly implemented.

Suggested change
return "_git?version=GC${commitOrRef}"
// Azure DevOps URL format requires project, repo, file path, and commit/ref; not supported here
return null

Copilot uses AI. Check for mistakes.
Comment on lines 368 to 391
Map<String, Object> createInputArtifact(Path path) {
if (laminInstance == null || config.dryRun) {
return null
}

// if path is a local file, skip creating input artifact
boolean isLocalFile = (path.toUri().getScheme() ?: 'file') == 'file'
if (isLocalFile) {
return null
}

String description = "Input artifact at ${path.toUri()}"

Map<String, Object> artifact = createOrUploadArtifact(
path: path,
description: description
)

log.info "Created input artifact ${artifact?.get('uid')} for path ${path.toUri()}. Data: ${artifact}"

// todo: link artifact to current run as an input artifact

return artifact
}
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The createInputArtifact method lacks test coverage. While there are tests for other methods in LaminRunManager, this new method should have unit tests to verify the behavior, especially the logic for skipping local files and creating artifacts for remote paths.

Copilot uses AI. Check for mistakes.
Comment on lines 671 to 790
private static TransformMetadata collectTransformMetadata(Session session) {
WorkflowMetadata wfMetadata = session.getWorkflowMetadata()

TransformMetadata metadata = new TransformMetadata()

// Extract basic workflow metadata
metadata.repository = wfMetadata.repository ?: wfMetadata.projectName
metadata.mainScript = wfMetadata.scriptFile.toString().replaceFirst("${wfMetadata.projectDir}/", '')
metadata.revision = wfMetadata.revision ?: 'local-development'
metadata.commitId = wfMetadata.commitId
metadata.projectDir = wfMetadata.projectDir

// Extract entrypoint from session binding
metadata.entrypoint = session?.getBinding()?.getEntryName()

// Extract manifest information
metadata.manifestName = wfMetadata.manifest.getName() ?: '<No name in manifest>'
metadata.manifestDescription = wfMetadata.manifest.getDescription() ?: '<No description in manifest>'

// Try to get additional git information from the repository
try {
metadata.gitInfo = GitHelper.getGitInfo(metadata.projectDir, metadata.revision)
} catch (Exception e) {
log.debug "Failed to extract git info: ${e.message}"
metadata.gitInfo = null
}

return metadata
}

/**
* Generate the transform key from metadata
* Format: "repository" or "repository:script" if not main.nf
*
* @param metadata Transform metadata
* @return The transform key
*/
private static String generateTransformKey(TransformMetadata metadata) {
return metadata.mainScript == 'main.nf'
? metadata.repository
: "${metadata.repository}:${metadata.mainScript}"
}

/**
* Generate the transform description from metadata
*
* @param metadata Transform metadata
* @return The transform description
*/
private static String generateTransformDescription(TransformMetadata metadata) {
return "${metadata.manifestName}: ${metadata.manifestDescription}"
}

/**
* Generates source_code string in YAML-like format to match Python package.
* Format: key: value pairs separated by newlines
* See: lamindb/models/transform.py Transform.from_git()
* https://github.com/laminlabs/lamindb/blob/734367caa7b4bff5216f4aca1d8e43fe88bb8b0a/lamindb/models/transform.py#L473-L489
*
* The Python implementation distinguishes between:
* - "Sliding transforms": Track a branch (version == branch), code can change over time
* - "Pinned transforms": Pin to specific commit/tag, code is immutable
*
* Uses GitHelper to extract additional information from the git repository
* when available, including provider type and revision classification.
*
* @param metadata Transform metadata containing all necessary information
* @return Formatted source code string
*/
private static String generateTransformSourceCode(TransformMetadata metadata) {
// Use LinkedHashMap to preserve insertion order
Map<String, String> sourceData = new LinkedHashMap<>()

// Repository and path are required
sourceData.put('repo', metadata.repository)
sourceData.put('path', metadata.mainScript)

// Add entrypoint if specified (via -entry option)
if (metadata.entrypoint) {
sourceData.put('entrypoint', metadata.entrypoint)
}

// Determine whether to use commit or branch
// Priority:
// 1. If we have gitInfo, use its classification (most accurate)
// 2. If commitId exists, prefer commit (git-based project)
// 3. Otherwise use branch (local development)

boolean useCommit = false
if (metadata.gitInfo) {
// If git info available, use it to determine if this is a tag
// Tags should use commit (immutable), branches use branch (sliding)
useCommit = metadata.commitId && (metadata.gitInfo.isTag || metadata.gitInfo.isCommit)
} else {
// Fallback: if we have a commitId, use it
useCommit = metadata.commitId != null
}

if (useCommit && metadata.commitId) {
sourceData.put('commit', metadata.commitId)
} else if (metadata.revision) {
sourceData.put('branch', metadata.revision)
}

// // Optionally add provider for URL construction
// if (metadata.gitInfo && metadata.gitInfo.provider != GitHelper.GitProvider.UNKNOWN) {
// sourceData.put('provider', metadata.gitInfo.provider.name)
// }

// // Optionally add source URL if we can construct it
// if (metadata.gitInfo && metadata.commitId) {
// String sourceUrl = GitHelper.constructSourceUrl(metadata.gitInfo, metadata.commitId, metadata.mainScript)
// if (sourceUrl) {
// sourceData.put('url', sourceUrl)
// }
// }

// Convert map to YAML-like format: "key: value\n"
return sourceData.collect { k, v -> "${k}: ${v}" }.join('\n')
}
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new methods collectTransformMetadata, generateTransformKey, generateTransformDescription, and generateTransformSourceCode lack test coverage. These methods contain important business logic for determining commit vs branch usage and formatting source code. Consider adding unit tests to verify the various code paths, especially the gitInfo-based commit/branch detection logic.

Copilot uses AI. Check for mistakes.
Comment on lines +326 to +341
private static List<String> getTags(Path projectDir) {
Path tagsDir = projectDir.resolve('.git/refs/tags')
if (!Files.exists(tagsDir)) {
return []
}

try {
return Files.list(tagsDir)
.filter { Files.isRegularFile(it) }
.map { it.fileName.toString() }
.collect() as List<String>
} catch (Exception e) {
log.debug "Failed to read tags: ${e.message}"
return []
}
}
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getTags and getBranches methods only read from .git/refs/tags and .git/refs/heads directories, but Git can store refs in a packed format in .git/packed-refs for performance. This means tags and branches stored in packed-refs won't be detected, leading to incorrect classification of revisions as commits instead of tags/branches. Consider reading and parsing .git/packed-refs in addition to the refs directories.

Copilot uses AI. Check for mistakes.
// Convert git@ URLs to https://
if (url.startsWith('git@')) {
url = url.replaceFirst(/^git@/, 'https://')
.replaceFirst(/:/, '/')
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex .replaceFirst(/:/, '/') will replace the first colon it finds, but git@ URLs can have port numbers (e.g., git@host:port:user/repo.git). While rare, this could incorrectly transform URLs with port numbers. Consider using a more specific pattern like .replaceFirst(/([^:]+):/, '$1/') or validating the URL format more carefully.

Suggested change
.replaceFirst(/:/, '/')
.replaceFirst(/([^:]+):/, '$1/')

Copilot uses AI. Check for mistakes.
Comment on lines +333 to +336
return Files.list(tagsDir)
.filter { Files.isRegularFile(it) }
.map { it.fileName.toString() }
.collect() as List<String>
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Files.list() and Files.walk() methods return streams that should be closed to avoid resource leaks. In Groovy, the collect() terminal operation doesn't automatically close the stream. Consider wrapping these in a try-with-resources block or using withCloseable() to ensure the streams are properly closed, especially in long-running applications.

Copilot uses AI. Check for mistakes.
Comment on lines +355 to +356
try {
return Files.walk(branchesDir)
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Files.walk() stream should be closed to avoid resource leaks. Consider wrapping this in a try-with-resources block or using withCloseable() to ensure the stream is properly closed, especially in long-running applications.

Suggested change
try {
return Files.walk(branchesDir)
try (def stream = Files.walk(branchesDir)) {
return stream

Copilot uses AI. Check for mistakes.
Comment on lines +266 to +285
private static String getRemoteUrl(Path projectDir) {
Path gitConfig = projectDir.resolve('.git/config')
if (!Files.exists(gitConfig)) {
return null
}

try {
String configText = gitConfig.text

// Find [remote "origin"] section and extract URL
def matcher = configText =~ /\[remote "origin"\][^\[]*?url\s*=\s*(.+)/
if (matcher.find()) {
String url = matcher.group(1).trim()
// Convert git@ URLs to https://
if (url.startsWith('git@')) {
url = url.replaceFirst(/^git@/, 'https://')
.replaceFirst(/:/, '/')
.replaceFirst(/\.git$/, '')
}
return url
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading .git/config files could expose sensitive information like credentials embedded in URLs. While this implementation doesn't log the full URL, consider sanitizing URLs to remove any embedded credentials (e.g., https://user:password@host/repo) before storing or using them, especially if they might be logged or stored in the database.

Copilot uses AI. Check for mistakes.
Comment on lines +350 to +363
Path branchesDir = projectDir.resolve('.git/refs/heads')
if (!Files.exists(branchesDir)) {
return []
}

try {
return Files.walk(branchesDir)
.filter { Files.isRegularFile(it) }
.map { branchesDir.relativize(it).toString() }
.collect() as List<String>
} catch (Exception e) {
log.debug "Failed to read branches: ${e.message}"
return []
}
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getBranches method only reads from .git/refs/heads, but Git can store refs in .git/packed-refs. Branches stored in packed-refs won't be detected, leading to incorrect revision classification. Consider reading and parsing .git/packed-refs in addition to the refs directories.

Suggested change
Path branchesDir = projectDir.resolve('.git/refs/heads')
if (!Files.exists(branchesDir)) {
return []
}
try {
return Files.walk(branchesDir)
.filter { Files.isRegularFile(it) }
.map { branchesDir.relativize(it).toString() }
.collect() as List<String>
} catch (Exception e) {
log.debug "Failed to read branches: ${e.message}"
return []
}
Set<String> branches = new HashSet<>()
Path branchesDir = projectDir.resolve('.git/refs/heads')
if (Files.exists(branchesDir)) {
try {
Files.walk(branchesDir)
.filter { Files.isRegularFile(it) }
.forEach {
branches.add(branchesDir.relativize(it).toString())
}
} catch (Exception e) {
log.debug "Failed to read branches from refs/heads: ${e.message}"
}
}
// Also parse .git/packed-refs for packed branches
Path packedRefs = projectDir.resolve('.git/packed-refs')
if (Files.exists(packedRefs)) {
try {
Files.lines(packedRefs).forEach { line ->
line = line.trim()
if (line.isEmpty() || line.startsWith('#') || line.startsWith('^')) {
return
}
// Format: <hash> <ref>
int spaceIdx = line.indexOf(' ')
if (spaceIdx > 0) {
String ref = line.substring(spaceIdx + 1)
if (ref.startsWith('refs/heads/')) {
String branchName = ref.substring('refs/heads/'.length())
branches.add(branchName)
}
}
}
} catch (Exception e) {
log.debug "Failed to read branches from packed-refs: ${e.message}"
}
}
return new ArrayList<>(branches)

Copilot uses AI. Check for mistakes.
Copy link
Member

@falexwolf falexwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Robrecht!

Can we also have a test that tests the source_code formatting end-to-end like we do in lamindb?

https://github.com/laminlabs/lamindb/blob/main/tests/core/test_transform_from_git.py

The unit tests look good, but if the overall format isn't tested things can go wrong? Or aren't you populating the full source_code field?

To your question re Git provider: I'd say this information is stored in the URL and we shouldn't duplicate unless it very clearly provides a benefit. Duplication always has downsides and we should have a good reason.

Otherwise this seems good to me!

@github-actions github-actions bot temporarily deployed to pull request November 29, 2025 18:39 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants