docs/day3/big_data.rst: 71 additions & 44 deletions
@@ -14,16 +14,32 @@ Big data with Python

 .. admonition:: "For teacher"

-   Preliminary timings
+   Preliminary timings. Starting at 13.00

    - Intro 10 min
    - Files 5
    - Exercise files 10
    - Memory 5
    - Exercise Allocation 10
    - Dask 10
+   - BREAK 15min 13.50-14.05
    - Exercise Dask 30

+Prepare environment!
+--------------------
+
+.. admonition::
+
+   - We recommend a desktop environment for speed of the graphics.
+   - Connecting from a local terminal with "ssh -X" (X11 forwarding) can be used but is slower.
+
+1. Log in to a desktop (ThinLinc or OnDemand) (see :ref:`common-login`)
+
+- Tetralith ThinLinc
+- Dardel (ThinLinc)
+- Alvis (
+
+
 High-Performance Data Analytics (HPDA)
 --------------------------------------

@@ -38,6 +54,10 @@ High-Performance Data Analytics (HPDA)

 - Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.” (from Wikipedia)

+.. discussion::
+
+   Do you already work with large data sets?
+
 Why we need to take special actions
 -----------------------------------

@@ -50,47 +70,37 @@ Remember this one?

 - What can limit us?

-.. admonition:: What do we need to cover??
-   :class: dropdown
-
-   - File formats
-   - Methods
-   - RAM allocation
-   - chunking
-
-scenario
-::::::::
-
-- use dataset (10 GB)
-- fails in pandas or is slow
-- Load with dask + xarray
-
 What the constraints are
 ------------------------

 - storage
-- reading into memory
+- memory

+.. admonition:: What do we need to cover??
+   :class: dropdown

-Memory, nodes
+   - storage --> make more effective files
+   - reading into memory
+   - --> read just parts of files into memory
+   - --> chunking
+   - allocate more memory

 Solutions and tools
 -------------------

 - Choose file format for reading and writing
 - Allocate enough RAM
-- Choose the Python package
+- Choose the right Python package
 - Is chunking suitable?

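
The last bullet above asks whether chunking is suitable. As a minimal sketch of the idea, and not part of the course material itself, the example below computes the mean of one CSV column while holding only a limited number of rows in memory at a time. The function name ``chunked_mean``, the column names and the in-memory sample data are hypothetical and chosen only for illustration; only the Python standard library is used.

```python
import csv
import io
from itertools import islice


def chunked_mean(lines, column, chunk_size=1000):
    """Mean of a numeric CSV column, holding at most chunk_size rows in memory."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    while True:
        # islice pulls at most chunk_size rows from the reader per iteration
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        # Reduce the chunk to two small numbers, then let the rows go
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data rows found")
    return total / count


# Hypothetical stand-in for a large file on disk
sample = io.StringIO("x,y\n1,10\n2,20\n3,30\n4,40\n5,50\n")
result = chunked_mean(sample, "y", chunk_size=2)
print(result)
```

Chunking works here because a mean can be built from per-chunk sums and counts. Operations that need all rows at once, such as a full sort, do not split up this easily, which is why the bullet is phrased as a question. With a real file, an open file object can be passed in place of the ``io.StringIO`` sample; pandas offers a comparable pattern through the ``chunksize`` argument of ``read_csv``.
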
 File formats
 ------------

-Bit and Byte
-............
+.. admonition:: Bits and Bytes

-- The smallest building block of storage and memory (RAM) in the computer is a bit, which stores either a 0 or 1.
-- Normally a number of 8 bits are combined in a group to make a byte.
-- One byte (8 bits) can represent/hold at most 2^8 distinct values. Organising bytes in different ways can represent different types of information, i.e. data.
+   - The smallest building block of storage and memory (RAM) in the computer is a bit, which stores either a 0 or 1.
+   - Normally a number of 8 bits are combined in a group to make a byte.
+   - One byte (8 bits) can represent/hold at most 2^8 distinct values. Organising bytes in different ways can represent different types of information, i.e. data.

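
The bits-and-bytes bullets above can be checked with a short sketch. It is an illustration under stated assumptions, not taken from the course material: the byte values in ``raw`` are chosen arbitrarily for the example, and little-endian byte order is assumed. It shows that one byte has 2^8 possible values and that the same four bytes give different data depending on how they are interpreted.

```python
import struct

BITS_PER_BYTE = 8

# One byte holds 2**8 distinct values (0..255 when read as an unsigned integer)
values_per_byte = 2 ** BITS_PER_BYTE

# The same four bytes, interpreted in two different ways
raw = bytes([0, 0, 128, 63])
as_uint32 = struct.unpack("<I", raw)[0]   # little-endian unsigned 32-bit integer
as_float32 = struct.unpack("<f", raw)[0]  # little-endian IEEE 754 32-bit float

print(values_per_byte)        # 256
print(as_uint32, as_float32)  # 1065353216 1.0
```

The bit pattern is identical in both cases; only the agreed interpretation differs. This is why a file format has to record, or imply, the data type of what it stores.
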
 .. admonition:: Numerical data
    :class: dropdown
@@ -341,11 +351,21 @@ An overview of common data formats
 - 🟨 : Ok / depends on a case
 - ❌ : Bad

-Adapted from Aalto university's `Python for scientific computing <https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/#what-is-a-data-format>`__... seealso::
+Adapted from Aalto University's `Python for scientific computing <https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/#what-is-a-data-format>`__
+
+.. seealso::

    - ENCCS course "HPDA-Python": `Scientific data <https://enccs.github.io/hpda-python/scientific-data/>`_