| 1 | +--- |
| 2 | +# author: |
| 3 | +# - name: Emilio M. Bruna |
| 4 | +# affiliations: |
| 5 | +# - id: uf |
| 6 | +# - name: University of Florida |
| 7 | +title-block-style: default |
| 8 | +date-modified: last-modified |
| 9 | + |
| 10 | +--- |
| 11 | +# Class Outline: QA/QC 2 - OpenRefine {#sec-openrefine-hw} |
| 12 | + |
| 13 | +::: {.callout-tip} |
| 14 | +## Objectives and Competencies |
| 15 | + |
| 16 | +By the end of this lesson students will: |
| 17 | + |
| 18 | +* Be able to import a data set into OpenRefine, make changes to the data set and its structure, and export the revised data set |
| 19 | + |
| 20 | +* Learn how to automatically track changes made and export the record of changes |
| 21 | + |
| 22 | +* Be able to apply these changes to a different data set |
| 23 | + |
| 24 | +::: |
| 25 | + |
| 26 | + |
| 27 | +## Pre-Class Preparation (Instructor): |
| 28 | + |
| 29 | + * Remind students via email to install OpenRefine |
| 30 | + * Post data sets |
| 31 | + |
| 32 | +**Bring to Class:** |
| 33 | + |
| 34 | + * Snacks |
| 35 | + * Tent cards for student names |
| 36 | + |
| 37 | +## Pre-Class Preparation (Students): |
| 38 | + |
| 39 | +### **Online Lectures:** None |
| 40 | + |
| 41 | +### **Readings**: None |
| 42 | + |
| 43 | + |
| 44 | + |
| 45 | +<!-- * **In-class Exercise:** --> |
| 46 | +<!-- * Data Carpentry Lesson (2:15 h) --> |
| 47 | +<!-- * https://datacarpentry.org/openrefine-socialsci/ --> |
| 48 | +<!-- * https://datacarpentry.org/OpenRefine-ecology-lesson/ --> |
| 49 | + |
| 50 | +### Computer Resources |
| 51 | + |
| 52 | +1. Install OpenRefine on your computer and verify it works by following the [instructions here](https://datacarpentry.org/openrefine-socialsci/index.html). |
| 53 | + |
| 54 | +2. _Optional:_ Read about and watch [how OpenRefine works here](https://openrefine.org/). You can also review the [basic workflow](https://hackmd.io/WnKT7eiMQde7jCF5teVHBQ#Setup) we will learn. |
| 55 | + |
| 56 | + |
| 57 | +<!-- ```{r timer, echo=FALSE} --> |
| 58 | +<!-- part_1 <- 10 --> |
| 59 | +<!-- part_2 <- 10 --> |
| 60 | +<!-- part_3 <- 45 --> |
| 61 | +<!-- snack_break <- 10 --> |
| 62 | +<!-- part_4 <- 30 --> |
| 63 | +<!-- part_5 <- 45 --> |
| 64 | +<!-- part_6 <- 10 --> |
| 65 | + |
| 66 | +<!-- total_time <- --> |
| 67 | +<!-- part_1 + --> |
| 68 | +<!-- part_2 + --> |
| 69 | +<!-- part_3 + --> |
| 70 | +<!-- part_4 + --> |
| 71 | +<!-- part_5 + --> |
| 72 | +<!-- part_6 + --> |
| 73 | +<!-- snack_break --> |
| 74 | + |
| 75 | +<!-- time_remain <- 180 - total_time --> |
| 76 | +<!-- ``` --> |
| 77 | + |
| 78 | + |
| 79 | +## In-Class: Using OpenRefine to clean data |
| 80 | + |
| 81 | +{width=30%} |
| 82 | + |
| 83 | + |
| 84 | +OpenRefine is a powerful, free, and open-source tool for working with and cleaning messy data. We will work through some of OpenRefine's basic features, after which you will try them on your own with a new data set. |
| 85 | + |
| 86 | +<!-- **_Note: _** This is the url for accessing OpenRefine if a new tab/window doesn't open: http://127.0.0.1:3333/ --> |
| 87 | + |
| 88 | +### **Intro to OR** |
| 89 | +<!-- (`r part_1` min) --> |
| 90 | + |
| 91 | +### **Working with OR** |
| 92 | +<!-- (`r part_2` min) --> |
| 93 | + |
| 94 | +### **Filtering and Sorting** |
| 95 | +<!-- (`r part_3` min) --> |
| 96 | + |
| 97 | +### Break |
| 98 | +<!-- (`r snack_break` min) --> |
| 99 | + |
| 100 | +### **Examining Numbers** |
| 101 | +<!-- (`r part_4` min) --> |
| 102 | + |
| 103 | +### **Using Scripts, Exporting, and Saving** |
| 104 | +<!-- (`r part_5` min) --> |
| 105 | + |
| 106 | +### **Wrap-up, Questions** |
| 107 | +<!-- (`r part_6` min) --> |
| 108 | + |
| 109 | +## **Assignment** |
| 110 | +<!-- (`r time_remain` min) --> |
| 111 | + |
| 112 | +Now it's your turn. [Download this csv file](https://github.com/BrunaLab/LAS6292_DataCourseBook/blob/0633beb2b645c046aaea4211ba604a79e806554c/class_materials/class_sessions/06_QAQC2/portal_rodents_or_short.csv) and use OpenRefine to clean it up. After you create a Project, edit the data as follows: |
| 113 | + |
| 114 | +1. Correct and standardize the names of the countries in which the rodents were captured. |
| 115 | + |
| 116 | +2. The column `scientificName` contains two pieces of information (the genus _and_ species of each animal). Split this into two columns, rename them as `genus` and `species`, and then correct and standardize the data in each column as needed. NB: You may run into an obstacle when you try to rename the columns. How can you get around it? |
| 117 | + |
| 118 | +3. Save the clean data as a `.csv` file on your desktop. |
| 119 | + |
| 120 | +<!-- grel: if(value=="AH","Ammospermophilus harrisi","") --> |
| 121 | + |
| 122 | +4. Extract your steps (i.e., the 'operation history') as JSON and save them as a text file. |
| 123 | + |
| 124 | +5. _Bonus Brainteaser:_ Many of the cells in the column for the Latin binomial are blank. How might you go about filling them in based on the column with the abbreviation? |
| 125 | + |
| 126 | +6. **Submission:** Submit your clean `.csv` and the JSON text file as `openrefine_assignment` on Canvas. |
| 127 | + |
| 128 | +### Grading Rubric: |
| 129 | + |
| 130 | +* Data corrected and JSON file can be used on another data set: 35 |
| 131 | +* Most data corrections properly programmed; some require instructor follow-up: 25 |
| 132 | +* Many of the corrections missing; JSON file unable to process new data: 20 |
| 133 | +* Instructor follow-up required to implement most changes: 15 |