Fix: prevent frank() from mutating non-data.table inputs by deep copying atomic and list objects #7072
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7072   +/-   ##
=======================================
  Coverage   98.69%   98.69%
=======================================
  Files          79       79
  Lines       14677    14680    +3
=======================================
+ Hits        14486    14489    +3
  Misses        191      191
No obvious timing issues in HEAD=issue_5617
Generated via commit 93b49d9
Download link for the artifact containing the test results: atime-results.zip
It would be good to see a worst-case benchmark so we can see how severe the performance implications can be.
It would be nice to create an atime performance test case so we can see whether there are any performance differences.
Following the suggestion, I've run the benchmark to evaluate the performance impact of the change.
Could you include the benchmarking script? Does the performance drop happen with both data.table and non-data.table inputs?
Here is the script I used for the following results. It tests both a list and a data.table as input to show the specific impact of the fix.
library(data.table)
library(microbenchmark)

N <- 1e6  # elements per column
M <- 10   # number of columns

# List of M named numeric vectors: the non-data.table input affected by the fix
large_ls <- replicate(M, setNames(runif(N), paste0("name_", 1:N)), simplify = FALSE)
names(large_ls) <- paste0("col_", 1:M)

# Same data as a data.table, for comparison
large_dt <- as.data.table(large_ls)

results <- microbenchmark(
  "List_Input" = frank(large_ls),
  "data.table_Input" = frank(large_dt),
  times = 10,
  unit = "ms"
)
print(results)
I was comparing the latest master branch against my PR branch. In my local runs, I observed a time difference for list inputs, with my patched version being slower. Given that your more robust atime analysis shows no significant timing difference but increased memory usage instead, I'm wondering whether there is something off in my benchmarking script, or a mistake on my end. In any case, if there is indeed no performance regression, that is reassuring. Please let me know if we can move forward from here, or if there are any further checks you would recommend.
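One way to put the timing and memory observations side by side is to measure a single deep copy of the list input in isolation. The sketch below assumes, based only on the PR title, that the extra cost of the patch comes from one deep copy of the input; it does not reproduce the atime analysis. `N` is reduced from the script above to keep the run short, and the variable names are illustrative.

```r
library(data.table)

N <- 1e5  # smaller than the benchmark script, only to keep this sketch quick
M <- 10

# Same shape as the list input in the benchmark script: M named numeric vectors
ls_input <- replicate(M, setNames(runif(N), paste0("name_", seq_len(N))), simplify = FALSE)
names(ls_input) <- paste0("col_", seq_len(M))

# Approximate size of the object that a deep copy has to duplicate
print(object.size(ls_input), units = "MB")

# Time one deep copy on its own (done before frank() so the input is still pristine)
copy_time <- system.time(dup <- copy(ls_input))[["elapsed"]]

# Time one frank() call on the same input
frank_time <- system.time(ranks <- frank(ls_input))[["elapsed"]]

cat(sprintf("copy: %.3fs, frank: %.3fs\n", copy_time, frank_time))
```

If the copy time is small relative to the frank() time while the object size is large, that would be consistent with a memory increase showing up more clearly than a timing difference, though a single run like this is far noisier than repeated measurements.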
closes #5617
This PR resolves the long-standing issue where frank() modified non-data.table inputs as a side effect, particularly named atomic vectors and lists with named components.
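A minimal sketch of how the side effect could be checked from user code. `frank_leaves_input_intact` is a hypothetical helper, not part of data.table; it takes a deep copy with `copy()` before the call so that any by-reference change to the input cannot also reach the snapshot. The result depends on the installed data.table version, so no expected output is shown.

```r
library(data.table)

# Hypothetical helper (not part of data.table): TRUE if frank() left `x` unchanged.
frank_leaves_input_intact <- function(x) {
  snapshot <- copy(x)   # deep copy, so it cannot be altered by reference
  invisible(frank(x))   # the call whose side effects are being checked
  identical(x, snapshot)
}

named_vec  <- c(a = 3, b = 1, c = 2)
named_list <- list(u = c(a = 3, b = 1, c = 2), v = c(d = 1, e = 1, f = 2))

print(frank_leaves_input_intact(named_vec))
print(frank_leaves_input_intact(named_list))
```

A plain assignment such as `snapshot <- x` would not be enough here, because R would let both names point at the same object and a by-reference modification would show up in both.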
Changes made:
Atomic inputs (e.g., named vectors, factors):
List inputs (e.g., list(a = c(a = 1, b = 2), ...)) that are not data.frames:
Benefits:
Hi @tdhock, @joshhwuu, @jangorecki — please take a look when you have time. Thanks!