Tidyup 7: Recoding and replacing values in the tidyverse #29

DavisVaughan
Member

@DavisVaughan DavisVaughan commented Jul 30, 2025

Easy to read link:
https://github.com/tidyverse/tidyups/blob/feature/007/007-tidyverse-recoding-and-replacing.md

We’d love to get your thoughts on this proposal to add new column recoding and replacing tools to dplyr. The goal is to fill some important gaps left by case_when() and case_match() by creating a slightly larger family of interconnected functions. Specifically, we wish to improve on:

  • Recoding columns, both interactively and programmatically (i.e. with a precomputed lookup table, like plyr::mapvalues())

    • Existing case_when()
    • New recode_values()
  • Replacing a few values within an existing column, in particular by providing obviously named, easy-to-use, and type-stable tools for doing so, which function as enhanced forms of [<- and base::replace().

    • New replace_when()
    • New replace_values()
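
To make the "precomputed lookup table" idea concrete, here is a rough sketch of programmatic recoding using only functions that exist today (base `match()` plus `dplyr::across()`). It is not the proposed `recode_values()` implementation; the helper `recode_lookup()`, the `lookup` table, and the `survey` data are all hypothetical names made up for illustration.

```r
library(dplyr)

# Hypothetical lookup table: one row per old value / new value pair
lookup <- tibble(
  from = c("disagree", "neutral", "agree"),
  to   = c(1L, 2L, 3L)
)

# Hypothetical survey data with two questions
survey <- tibble(
  q1 = c("agree", "neutral", "disagree"),
  q2 = c("disagree", "agree", NA)
)

# Hypothetical helper: look each value of `x` up in `from` and return the
# corresponding element of `to`. Values not found in `from` become NA.
recode_lookup <- function(x, from, to) {
  to[match(x, from)]
}

# Apply the same lookup table to every question column at once
scored <- survey |>
  mutate(across(q1:q2, \(x) recode_lookup(x, lookup$from, lookup$to)))

scored$q1
#> [1] 3 2 1
scored$q2
#> [1]  1  3 NA
```

Because the mapping lives in a data frame rather than in the call itself, the same table can be reused across many columns or read from a file.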

Please feel free to contribute however you feel comfortable: you're welcome to comment here on individual lines of the tidyup, or open bigger discussion topics in a new issue. If there are things you’d prefer to discuss in private, please feel free to email me. I’ll plan to close the discussion on Aug 18 and we will advance to the implementation stage.

@higgi13425

higgi13425 commented Aug 4, 2025

recode_values is the boss for Likert scale responses. So great for a 100 question questionnaire with 5 item Likerts for every Q.
Rensis would approve. https://en.wikipedia.org/wiki/Rensis_Likert
Can you purrr across 100Q in a questionnaire to do this efficiently?

@JoFrhwld

JoFrhwld commented Aug 5, 2025

I'm not sure if this is intended, but it's currently not possible to change the data type with replace_when()

penguins |>
  mutate(
    size = body_mass |>
      replace_when(
        body_mass > 4750 ~ "large",
        body_mass > 3550 ~ "medium",
        body_mass > 0 ~ "small"
      )
  )

#> Error in `mutate()`:
#> ℹ In argument: `size = replace_when(...)`.
#> Caused by error in `replace_when()`:
#> ! Can't convert `..1 (right)` <character> to <integer>.
#> Run `rlang::last_trace()` to see where the error occurred.

It also looks like, if we wanted to use replace_when() as a sequence of if-else logic, we'd need to go back to how case_when() used to work, with a final TRUE ~ catch-all:

penguins |>
  mutate(
    size = body_mass |>
      replace_when(
        body_mass > 4750 ~ 3,
        body_mass > 3550 ~ 2,
        TRUE ~ 1
      )
  )

The proposal doesn't say that replace_when() is meant to supersede case_when(), so would these be use cases where it would be recommended to use case_when() instead?

@DavisVaughan
Member Author

@JoFrhwld to be extremely clear, case_when() is not going anywhere and is not being superseded.

These 3 functions join case_when() to round out the family; they do not replace it. The intro paragraph above shows how case_when() and recode_values() are on the "recode" side of things, and replace_when() and replace_values() are on the "replace" side of things.

> it's currently not possible to change the data type with replace_when()

And that's exactly the point! replace_when() is type safe. If you want to update a few values in a column using a condition from another column, but you want to guarantee that the type of that column doesn't change out from under you, you use replace_when(). case_when() does not have this safety (and can't, because it's meant for creating new columns, not updating existing ones).

See https://github.com/tidyverse/tidyups/blob/feature/007/007-tidyverse-recoding-and-replacing.md#type-stability
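
For readers who want a feel for what "type safe" means here, the following is a minimal sketch of the idea, assuming a single condition and a single replacement value. It is not how replace_when() is implemented; `replace_when_sketch()` is a hypothetical helper, and the only real API it relies on is `vctrs::vec_cast()`, which errors when a value cannot be converted to the target type.

```r
library(vctrs)

# Hypothetical helper, for illustration only: replace the elements of `x`
# where `condition` is TRUE, but never let the type of `x` change.
replace_when_sketch <- function(x, condition, value) {
  # Cast the replacement to the type of `x`; incompatible types error here
  value <- vec_cast(value, to = x)
  # which() drops NA conditions, so those elements are left untouched
  x[which(condition)] <- value
  x
}

body_mass <- c(5000L, 4000L, 3000L)

# A double that is losslessly convertible to integer is fine
capped <- replace_when_sketch(body_mass, body_mass > 4750, 4750)
capped
#> [1] 4750 4000 3000
typeof(capped)
#> [1] "integer"

# A character replacement is rejected rather than silently changing the
# column type (left commented out because it errors):
# replace_when_sketch(body_mass, body_mass > 4750, "large")
```

The cast happens before any assignment, so a bad replacement fails up front and `x` is never partially modified.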

@EmilHvitfeldt

I'm sure you have thought about it, but I didn't see it explicitly stated. I'm going to assume that if there are duplicate values in from, the first one takes precedence. Is that assumption correct?

@DavisVaughan
Member Author

DavisVaughan commented Aug 5, 2025

@EmilHvitfeldt yep, same idea as case_when() where "first wins". Will be in the official docs for sure.

dplyr::replace_values(1, from = c(1, 1), to = c(2, 3))
#> [1] 2

dplyr::replace_values(1, 1 ~ 2, 1 ~ 3)
#> [1] 2

Created on 2025-08-05 with reprex v2.1.1
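
The same "first wins" behaviour can be checked today with functions that already exist, which may help when reasoning about duplicated `from` values. This is only an analogy using `dplyr::case_when()` and base `match()`, not the new functions themselves.

```r
library(dplyr)

x <- 1

# case_when() evaluates conditions in order and keeps the first that is TRUE
first_case <- case_when(
  x == 1 ~ 2,
  x == 1 ~ 3
)
first_case
#> [1] 2

# match() returns the position of the first match, so a base R lookup with
# duplicated `from` values also resolves to the first pair
from <- c(1, 1)
to <- c(2, 3)
first_match <- to[match(x, from)]
first_match
#> [1] 2
```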

@RichardPatterson

What is the expected use with factors? If the lookup tbl contains a factor in the to column, will the levels be passed on to the recoded variable?

What is the relationship with fct_recode() from forcats?

This all looks really great btw.
