---
title: "Data Analysis"
---
Load the required R packages:
```{r}
#| output: false
library(babynames)
library(knitr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(pheatmap)
```
# Writing in Quarto part
Let's have a look at the first few rows of the data:
```{r}
head(babynames) |> kable()
```
Let's create functions to visualise some of the data according to sex:
```{r}
#| code-fold: true
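# Rank the names of one sex by their mean yearly proportion from `from` onwards
get_most_frequent <- function(babynames, select_sex, from = 1950) {
  most_freq <- babynames |>
    # `from` is inclusive, matching the `year >= from_year` filter used later
    filter(sex == select_sex, year >= from) |>
    group_by(name) |>
    summarise(average = mean(prop)) |>
    arrange(desc(average))
  # Return the inputs alongside the ranking so plot_top() can reuse them
  return(list(
    babynames = babynames,
    most_frequent = most_freq,
    sex = select_sex,
    from = from))
}

# Plot the yearly proportion of the `top` highest-ranked names
plot_top <- function(x, top = 10) {
  # head() avoids NA entries when fewer than `top` names are available
  topx <- head(x$most_frequent$name, top)
  p <- x$babynames |>
    filter(name %in% topx, sex == x$sex, year >= x$from) |>
    ggplot(aes(x = year, y = prop, color = name)) +
    geom_line() +
    scale_color_brewer(palette = "Paired") +
    theme_classic()
  return(p)
}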
```
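The ranking step inside `get_most_frequent()` can be hard to picture on the full data set. The chunk below is a minimal sketch of the same filter, group and average idea on a made-up tibble; `toy`, `toy_ranked` and the names in it are hypothetical and not part of `babynames`, and the year filter is left out to keep the example short.

```{r}
library(dplyr)

# Hypothetical toy data with the columns the functions above rely on
toy <- tibble(
  year = c(2000, 2001, 2000, 2001, 2000),
  sex  = c("F", "F", "F", "F", "M"),
  name = c("Ada", "Ada", "Bea", "Bea", "Cal"),
  prop = c(0.01, 0.03, 0.02, 0.01, 0.05)
)

# Keep one sex, average the proportion per name, sort from high to low
toy_ranked <- toy |>
  filter(sex == "F") |>
  group_by(name) |>
  summarise(average = mean(prop)) |>
  arrange(desc(average))

toy_ranked
```

The first rows of the ranked table are the names that `plot_top()` would pick up for plotting.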
We are going to look at how the popularity of baby names changes over time. @fig-girls shows the ten most frequent names for girls, ranked by their average yearly proportion, and @fig-boys shows the same for boys.
```{r}
#| label: fig-girls
#| echo: false
#| fig-cap: Distribution of the top ten female names over time.
get_most_frequent(babynames, select_sex = "F") |>
  plot_top()
```
```{r}
#| label: fig-boys
#| echo: false
#| fig-cap: Distribution of the top ten male names over time.
get_most_frequent(babynames, select_sex = "M") |>
  plot_top()
```
# Git and GitHub part
We want to plot multiple panels in a single figure. An example for the most popular girls' names is shown in @fig-mult-girls.
```{r}
#| label: fig-mult-girls
#| layout: [[50,50], [100]]
#| fig-cap: "Most popular girls' names: a closer look!"
#| fig-subcap:
#|   - "Top 5 girls' names over the years"
#|   - "Top 10 girls' names over the years"
#|   - "Heatmap of top 30 girls' names versus years"
# get most frequent girl names from 2010 onwards
from_year <- 2010
most_freq_girls <- get_most_frequent(babynames, select_sex = "F",
                                     from = from_year)

# plot top 5 girl names
most_freq_girls |>
  plot_top(top = 5)

# plot top 10 girl names
most_freq_girls |>
  plot_top(top = 10)

# get top 30 girl names in a matrix
# with names in rows and years in columns
prop_df <- babynames |>
  filter(name %in% most_freq_girls$most_frequent$name[1:30] & sex == "F") |>
  filter(year >= from_year) |>
  select(year, name, prop) |>
  pivot_wider(names_from = year,
              values_from = prop)
prop_mat <- as.matrix(prop_df[, 2:ncol(prop_df)])
rownames(prop_mat) <- prop_df$name

# create heatmap
pheatmap(prop_mat, cluster_cols = FALSE, scale = "row")
```
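The `scale = "row"` argument in the `pheatmap()` call centres and scales the values within each row before they are mapped to colours. As a rough illustration of that idea, the chunk below computes row-wise z-scores in base R on a made-up matrix; `toy_mat` and `row_scaled` are hypothetical names, and this is a sketch of the concept rather than pheatmap's own code.

```{r}
# Hypothetical toy matrix: names in rows, years in columns (values are made up)
toy_mat <- matrix(c(1, 2, 3,
                    10, 20, 30),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("A", "B"), c("2010", "2011", "2012")))

# scale() works column-wise, so transpose, scale, and transpose back
# to subtract each row's mean and divide by its standard deviation
row_scaled <- t(scale(t(toy_mat)))

row_scaled
```

After scaling, every row has mean 0 and standard deviation 1, so the heatmap colours compare years within a name rather than names against each other.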