Common example files across implementations #57

@westonpace
Member

Do we want some common example data files that are used on all the implementations? For example, common files used in the various dataset API recipes. I don't really know how often people will be bouncing between languages or comparing languages though.

Activity

Label added on Aug 31, 2021: discussion (This issue is for discussion and not an immediate change)

thisisnic commented on Sep 2, 2021

@thisisnic
Member

One of the potential items on my to-do list is creating a nice tight set of interesting and useful example datasets for the R implementation, so I'd be in favour of this to prevent duplicating efforts. I also am unsure how likely it is that people would be comparing between different languages, but it sounds like a nice thing to have.

It could also help things look tidier in the examples where I'm talking about sharing data between R and Python, if I then link to the Python cookbook and it has the same datasets in the examples.

jorisvandenbossche commented on Sep 14, 2021

@jorisvandenbossche
Member

I am also +1 on having a nice set of example data to re-use in the cookbook examples. Some simple real-world datasets can make it easier to understand an example (compared to using random / dummy data). I don't think consistency between R and Python is super important in itself, but doing it differently for each seems like needless duplicated effort.

thisisnic commented on Sep 14, 2021

@thisisnic
Member

Another thing to think about here is dataset requirements. I'm currently using some really compact datasets which are created inline so the reader can see their exact contents, and are each around 3 lines long. These won't work for every recipe of course, but the advantage of them is that they don't require the reader to load in data, allow the reader to copy and paste all of the code, and are very easy to reason about.

Here are a few:

Oscars

actor                awards
"Katharine Hepburn"  4
"Meryl Streep"       3
"Jack Nicholson"     3

Shares

company  price    date
"AMZN"   3463.12  2021-09-02
"GOOG"   2884.38  2021-09-02
"BKNG"   2300.46  2021-09-02
"TSLA"   732.39   2021-09-02
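
For the Python side, the two datasets above could be declared inline in much the same way. This is only a sketch: the names `OSCARS`, `SHARES`, `as_rows` and `to_arrow` are made up for illustration and do not come from the cookbook. `pyarrow` is imported lazily, so the plain dictionaries stay copy-and-paste friendly even without it installed.

```python
import datetime

# Column-oriented inline datasets, copied from the tables above.
OSCARS = {
    "actor": ["Katharine Hepburn", "Meryl Streep", "Jack Nicholson"],
    "awards": [4, 3, 3],
}

SHARES = {
    "company": ["AMZN", "GOOG", "BKNG", "TSLA"],
    "price": [3463.12, 2884.38, 2300.46, 732.39],
    "date": [datetime.date(2021, 9, 2)] * 4,
}


def as_rows(columns):
    """Return the dataset as a list of row tuples (hypothetical helper)."""
    return list(zip(*columns.values()))


def to_arrow(columns):
    """Build a pyarrow Table from a dict of columns (requires pyarrow)."""
    import pyarrow as pa  # lazy import: only needed when this is called

    return pa.table(columns)


if __name__ == "__main__":
    for row in as_rows(SHARES):
        print(row)
```

Keeping each dataset to a handful of rows means the reader can see the exact contents next to the recipe, which is the property described above.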

In creating datasets, I've tried to come up with topics that would be familiar to most people, are vaguely interesting, and where necessary, contain a few different data types. An example of a dataset that I've used but would like to replace with something more compelling:

group  score
"A"    99
"B"    97
"C"    99

If expanded versions of the first 2 datasets would be of use to anyone else, let me know and I can try to create something. Alternatively, it'd be good to hear your requirements/ideas and see your existing datasets that could be useful to share.
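
If the datasets were to live in shared files rather than inline, one option would be to generate them once and have each implementation read the same file. The sketch below assumes CSV as the common format and uses hypothetical helper names (`write_example_csv`, `read_example_csv`) and a hypothetical file name; nothing here is an agreed layout for the cookbook.

```python
import csv
import tempfile
from pathlib import Path

# Inline dataset copied from the "Oscars" example in this thread.
OSCARS = {
    "actor": ["Katharine Hepburn", "Meryl Streep", "Jack Nicholson"],
    "awards": [4, 3, 3],
}


def write_example_csv(columns, path):
    """Write a dict of equal-length columns to a CSV file with a header row."""
    names = list(columns)
    rows = zip(*(columns[name] for name in names))
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(names)
        writer.writerows(rows)


def read_example_csv(path):
    """Read the CSV back as a list of dicts; every value comes back as a string."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "oscars.csv"  # hypothetical file name
        write_example_csv(OSCARS, target)
        print(target.read_text())
        print(read_example_csv(target))
```

CSV carries no type information, so the `awards` column is read back as text here. A typed format might be a better fit for datasets meant to show off several data types, but that choice is still open in this discussion.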

Common example files across implementations · Issue #57 · apache/arrow-cookbook