Fix: prevent frank() from mutating non-data.table inputs by deep copying atomic and list objects #7072

venom1204 · 2025-06-17T20:35:52Z

This PR resolves the long-standing issue where frank() modified non-data.table inputs as a side effect — particularly named atomic vectors and lists with named components

Changes made:

Atomic inputs (e.g., named vectors, factors):

Now explicitly deep copied using copy(x) before converting to list form with as_list(x) to avoid modifying the original input in-place.

List inputs (e.g., list(a = c(a = 1, b = 2), ...)) that are not data.frames:

Are now deep copied before further processing, ensuring that internal modifications (like setDT) do not mutate the original list or strip names from its components.

Benefits:

Keeps names and attributes on vectors and lists intact.

Hi @tdhock, @joshhwuu, @jangorecki — please take a look when you have time. Thanks!

codecov · 2025-06-17T20:43:44Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.69%. Comparing base (f5ef6b8) to head (93b49d9).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #7072   +/-   ##
=======================================
  Coverage   98.69%   98.69%           
=======================================
  Files          79       79           
  Lines       14677    14680    +3     
=======================================
+ Hits        14486    14489    +3     
  Misses        191      191

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2025-06-17T20:53:16Z

No obvious timing issues in HEAD=issue_5617

Generated via commit 93b49d9

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	2 minutes and 58 seconds
Installing different package versions	39 seconds
Running and plotting the test cases	2 minutes and 28 seconds

jangorecki · 2025-06-18T08:02:54Z

Would be good to see worst case scenario benchmark so we can see how severe can performance implications be.
I wonder if it is not more correct to look at the code, where names are being lost, and try to fix it there, rather than copy() upfront.

tdhock · 2025-06-18T14:00:47Z

would be nice to create an atime performance test case so we can see if there are any performance differences.

venom1204 · 2025-06-18T17:42:33Z

Would be good to see worst case scenario benchmark so we can see how severe can performance implications be. I wonder if it is not more correct to look at the code, where names are being lost, and try to fix it there, rather than copy() upfront.

Following the suggestion, I've run the benchmark to evaluate the performance impact of the change.
The tests confirm that for worst-case inputs like large lists, there is a performance drop. The fixed version is approximately 1.6 to 1.8 times slower than the baseline.

joshhwuu · 2025-06-18T17:58:51Z

Could you include the benchmarking script? Does the performance drop happen with both data.table and non-data.table inputs?

venom1204 · 2025-06-18T18:32:59Z

Could you include the benchmarking script? Does the performance drop happen with both data.table and non-data.table inputs?

Here is the script I used for the following results. It tests both a list and a data.table as input to show the specific impact of the fix.

library(data.table)
library(microbenchmark)

N <- 1e6
M <- 10

large_ls <- replicate(M, setNames(runif(N), paste0("name_", 1:N)), simplify = FALSE)
names(large_ls) <- paste0("col_", 1:M)

large_dt <- as.data.table(large_ls)

results <- microbenchmark(
  "List_Input"       = frank(large_ls),
  "data.table_Input" = frank(large_dt),
  times = 10,
  unit = "ms"
)

print(results)

tdhock · 2025-06-19T09:15:15Z

what versions are you comparing to see the 1.6-1.8x difference? I compared current master with this PR and I see no time difference but I see a difference in memory usage.

pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  # allow compilation on new R versions where 'Calloc' is not defined
  pkg_find_replace(
    file.path("src", "*.c"),
    "\\b(Calloc|Free|Realloc)\\b",
    "R_\\1")
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

expr_list <- list()
for(suffix in c("dt", "ls")){
  var_name <- paste0("large_", suffix)
  suffix_call <- substitute(atime::atime_versions_exprs(
    "~/R/data.table",
    pkg.edit.fun = pkg.edit.fun,
    expr = data.table::frank(VAR),
    master="f5ef6b8da8bf9d60c0c4ae19ad54e06647229593",
    PR7072="eea8cff7dec9e3197f2d6613ad8cd0642f72e4fe"
  ), list(VAR=as.symbol(var_name)))
  some_expr_list <- eval(suffix_call)
  expr_list[paste(suffix, names(some_expr_list))] <- some_expr_list
}
expr_list

library(data.table)
frank_result <- atime::atime(
  setup={
    M <- 10
    large_ls <- replicate(M, setNames(runif(N), paste0("name_", 1:N)), simplify = FALSE)
    names(large_ls) <- paste0("col_", 1:M)
    large_dt <- as.data.table(large_ls)
  },
  expr.list = expr_list)
plot(frank_result)

venom1204 · 2025-06-19T11:11:31Z

what versions are you comparing to see the 1.6-1.8x difference

I was comparing the latest master branch against my PR branch . In my local runs, I observed a time difference for list inputs, with my patched version being slower.

Given that your more robust atime analysis shows no significant timing difference but increased memory usage instead, I’m wondering if there’s something off in my benchmarking script, or perhaps a mistake on my end.

In any case, if there is indeed no performance regression, that is reassuring. Kindly let me know if we can move forward from here, or if there are any further checks you would recommend.

updated frank condition

80ec759

tests

eea8cff

Merge branch 'master' into issue_5617

93b49d9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: prevent frank() from mutating non-data.table inputs by deep copying atomic and list objects #7072

Fix: prevent frank() from mutating non-data.table inputs by deep copying atomic and list objects #7072

Uh oh!

venom1204 commented Jun 17, 2025 •

edited

Loading

Uh oh!

codecov bot commented Jun 17, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Jun 17, 2025 •

edited

Loading

Uh oh!

jangorecki commented Jun 18, 2025

Uh oh!

tdhock commented Jun 18, 2025

Uh oh!

venom1204 commented Jun 18, 2025

Uh oh!

joshhwuu commented Jun 18, 2025

Uh oh!

venom1204 commented Jun 18, 2025 •

edited by joshhwuu

Loading

Uh oh!

tdhock commented Jun 19, 2025

Uh oh!

venom1204 commented Jun 19, 2025 •

edited

Loading

Uh oh!

Uh oh!

Fix: prevent frank() from mutating non-data.table inputs by deep copying atomic and list objects #7072

Are you sure you want to change the base?

Fix: prevent frank() from mutating non-data.table inputs by deep copying atomic and list objects #7072

Uh oh!

Conversation

venom1204 commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jangorecki commented Jun 18, 2025

Uh oh!

tdhock commented Jun 18, 2025

Uh oh!

venom1204 commented Jun 18, 2025

Uh oh!

joshhwuu commented Jun 18, 2025

Uh oh!

venom1204 commented Jun 18, 2025 • edited by joshhwuu Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tdhock commented Jun 19, 2025

Uh oh!

venom1204 commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

venom1204 commented Jun 17, 2025 •

edited

Loading

codecov bot commented Jun 17, 2025 •

edited

Loading

github-actions bot commented Jun 17, 2025 •

edited

Loading

venom1204 commented Jun 18, 2025 •

edited by joshhwuu

Loading

venom1204 commented Jun 19, 2025 •

edited

Loading