Skip to content

Fix: prevent frank() from mutating non-data.table inputs by deep copying atomic and list objects #7072

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

venom1204
Copy link
Contributor

@venom1204 venom1204 commented Jun 17, 2025

closes #5617

This PR resolves the long-standing issue where frank() modified non-data.table inputs as a side effect — particularly named atomic vectors and lists with named components

Changes made:

Atomic inputs (e.g., named vectors, factors):

  • Now explicitly deep copied using copy(x) before converting to list form with as_list(x) to avoid modifying the original input in-place.

List inputs (e.g., list(a = c(a = 1, b = 2), ...)) that are not data.frames:

  • Are now deep copied before further processing, ensuring that internal modifications (like setDT) do not mutate the original list or strip names from its components.

Benefits:

  • Keeps names and attributes on vectors and lists intact.

Hi @tdhock, @joshhwuu, @jangorecki — please take a look when you have time. Thanks!

Copy link

codecov bot commented Jun 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.69%. Comparing base (f5ef6b8) to head (93b49d9).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7072   +/-   ##
=======================================
  Coverage   98.69%   98.69%           
=======================================
  Files          79       79           
  Lines       14677    14680    +3     
=======================================
+ Hits        14486    14489    +3     
  Misses        191      191           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

github-actions bot commented Jun 17, 2025

No obvious timing issues in HEAD=issue_5617
Comparison Plot

Generated via commit 93b49d9

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 2 minutes and 58 seconds
Installing different package versions 39 seconds
Running and plotting the test cases 2 minutes and 28 seconds

@jangorecki
Copy link
Member

Would be good to see worst case scenario benchmark so we can see how severe can performance implications be.
I wonder if it is not more correct to look at the code, where names are being lost, and try to fix it there, rather than copy() upfront.

@tdhock
Copy link
Member

tdhock commented Jun 18, 2025

would be nice to create an atime performance test case so we can see if there are any performance differences.

@venom1204
Copy link
Contributor Author

Would be good to see worst case scenario benchmark so we can see how severe can performance implications be. I wonder if it is not more correct to look at the code, where names are being lost, and try to fix it there, rather than copy() upfront.

Following the suggestion, I've run the benchmark to evaluate the performance impact of the change.
The tests confirm that for worst-case inputs like large lists, there is a performance drop. The fixed version is approximately 1.6 to 1.8 times slower than the baseline.

@joshhwuu
Copy link
Member

Could you include the benchmarking script? Does the performance drop happen with both data.table and non-data.table inputs?

@venom1204
Copy link
Contributor Author

venom1204 commented Jun 18, 2025

Could you include the benchmarking script? Does the performance drop happen with both data.table and non-data.table inputs?

Here is the script I used for the following results. It tests both a list and a data.table as input to show the specific impact of the fix.

library(data.table)
library(microbenchmark)

N <- 1e6
M <- 10

large_ls <- replicate(M, setNames(runif(N), paste0("name_", 1:N)), simplify = FALSE)
names(large_ls) <- paste0("col_", 1:M)

large_dt <- as.data.table(large_ls)

results <- microbenchmark(
  "List_Input"       = frank(large_ls),
  "data.table_Input" = frank(large_dt),
  times = 10,
  unit = "ms"
)

print(results)

@tdhock
Copy link
Member

tdhock commented Jun 19, 2025

what versions are you comparing to see the 1.6-1.8x difference? I compared current master with this PR and I see no time difference but I see a difference in memory usage.
image

pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  # allow compilation on new R versions where 'Calloc' is not defined
  pkg_find_replace(
    file.path("src", "*.c"),
    "\\b(Calloc|Free|Realloc)\\b",
    "R_\\1")
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

expr_list <- list()
for(suffix in c("dt", "ls")){
  var_name <- paste0("large_", suffix)
  suffix_call <- substitute(atime::atime_versions_exprs(
    "~/R/data.table",
    pkg.edit.fun = pkg.edit.fun,
    expr = data.table::frank(VAR),
    master="f5ef6b8da8bf9d60c0c4ae19ad54e06647229593",
    PR7072="eea8cff7dec9e3197f2d6613ad8cd0642f72e4fe"
  ), list(VAR=as.symbol(var_name)))
  some_expr_list <- eval(suffix_call)
  expr_list[paste(suffix, names(some_expr_list))] <- some_expr_list
}
expr_list

library(data.table)
frank_result <- atime::atime(
  setup={
    M <- 10
    large_ls <- replicate(M, setNames(runif(N), paste0("name_", 1:N)), simplify = FALSE)
    names(large_ls) <- paste0("col_", 1:M)
    large_dt <- as.data.table(large_ls)
  },
  expr.list = expr_list)
plot(frank_result)

@venom1204
Copy link
Contributor Author

venom1204 commented Jun 19, 2025

what versions are you comparing to see the 1.6-1.8x difference

I was comparing the latest master branch against my PR branch . In my local runs, I observed a time difference for list inputs, with my patched version being slower.

Given that your more robust atime analysis shows no significant timing difference but increased memory usage instead, I’m wondering if there’s something off in my benchmarking script, or perhaps a mistake on my end.

In any case, if there is indeed no performance regression, that is reassuring. Kindly let me know if we can move forward from here, or if there are any further checks you would recommend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

frank modifies non-data.table inputs as a side effect
4 participants