Do we want some common example data files that are used on all the implementations? For example, common files used in the various dataset API recipes. I don't really know how often people will be bouncing between languages or comparing languages though.
One of the potential items on my to-do list is creating a nice tight set of interesting and useful example datasets for the R implementation, so I'd be in favour of this to prevent duplicating efforts. I also am unsure how likely it is that people would be comparing between different languages, but it sounds like a nice thing to have.
It also could help things look tidier in the examples where I'm talking about sharing data between R and Python, if I then link to the Python cookbook and it has the same datasets in the examples.
I am also +1 on having a nice set of example data to re-use in the cookbook examples. Some simple real-world datasets can make it easier to understand an example (compared to using random / dummy data). I don't think consistency for those between R and Python is super important in itself, but it seems like needlessly duplicated effort to do it differently for both.
Another thing to think about here is dataset requirements. I'm currently using some really compact datasets which are created inline so the reader can see their exact contents, and are each around 3 lines long. These won't work for every recipe of course, but the advantage of them is that they don't require the reader to load in data, allow the reader to copy and paste all of the code, and are very easy to reason about.
Here are a few:
Oscars

| actor | awards |
| --- | --- |
| "Katharine Hepburn" | 4 |
| "Meryl Streep" | 3 |
| "Jack Nicholson" | 3 |
Shares

| company | price | date |
| --- | --- | --- |
| "AMZN" | 3463.12 | 2021-09-02 |
| "GOOG" | 2884.38 | 2021-09-02 |
| "BKNG" | 2300.46 | 2021-09-02 |
| "TSLA" | 732.39 | 2021-09-02 |
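For illustration, here is a rough sketch of what "created inline" could look like on the Python side, using only the standard library. The variable names `oscars` and `shares` and the helper `rows` are made up for this example and are not taken from any cookbook; the values are the ones in the tables above.

```python
import datetime

# Column-oriented inline definitions mirroring the Oscars and Shares tables.
oscars = {
    "actor": ["Katharine Hepburn", "Meryl Streep", "Jack Nicholson"],
    "awards": [4, 3, 3],
}

shares = {
    "company": ["AMZN", "GOOG", "BKNG", "TSLA"],
    "price": [3463.12, 2884.38, 2300.46, 732.39],
    "date": [datetime.date(2021, 9, 2)] * 4,
}


def rows(columns):
    """Yield one dict per row from a dict of equal-length columns (hypothetical helper)."""
    names = list(columns)
    for values in zip(*columns.values()):
        yield dict(zip(names, values))


# The whole dataset is visible in the snippet, so there is nothing to load.
for row in rows(oscars):
    print(row)
```

A column dict of this shape should also be usable as input to `pyarrow.table(...)` in a recipe, though that step is left out here so the snippet stays copy-and-paste runnable without extra dependencies.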
In creating datasets, I've tried to come up with topics that would be familiar to most people, are vaguely interesting, and where necessary, contain a few different data types. An example of a dataset that I've used but would like to replace with something more compelling:
| group | score |
| --- | --- |
| "A" | 99 |
| "B" | 97 |
| "C" | 99 |
If expanded versions of the first 2 datasets would be of use to anyone else, let me know and I can try to create something. Alternatively, it'd be good to hear your requirements/ideas and see your existing datasets that could be useful to share.
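If larger shared versions were created, one possible way to make them available to every implementation would be a plain file format such as CSV. The sketch below is only an assumption about how that might work, not an agreed approach: the helper names `write_csv` and `read_csv` are hypothetical, and an in-memory buffer stands in for a real shared file.

```python
import csv
import io

# Shares dataset as rows; dates are kept as ISO strings for portability.
header = ("company", "price", "date")
shares_rows = [
    ("AMZN", 3463.12, "2021-09-02"),
    ("GOOG", 2884.38, "2021-09-02"),
    ("BKNG", 2300.46, "2021-09-02"),
    ("TSLA", 732.39, "2021-09-02"),
]


def write_csv(header, rows, fileobj):
    """Write a header line followed by the data rows (hypothetical helper)."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_csv(fileobj):
    """Read the file back as a list of dicts keyed by column name."""
    return list(csv.DictReader(fileobj))


# Round trip through an in-memory buffer standing in for a shared file.
buffer = io.StringIO()
write_csv(header, shares_rows, buffer)
buffer.seek(0)
loaded = read_csv(buffer)
print(loaded[0])
```

Note that CSV drops type information, so every value comes back as a string; a typed format might suit recipes where the column types matter.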
Activity
thisisnic commented on Sep 2, 2021
jorisvandenbossche commented on Sep 14, 2021
thisisnic commented on Sep 14, 2021