Commit b06bbfe

big_data.rst restruct first part
1 parent 20ffa8b commit b06bbfe

1 file changed

+71
-44
lines changed

docs/day3/big_data.rst

Lines changed: 71 additions & 44 deletions
@@ -14,16 +14,32 @@ Big data with Python
1414

1515
.. admonition:: "For teacher"
1616

17-
Preliminary timings
17+
Preliminary timings. Starting at 13.00
1818

1919
- Intro 10 min
2020
- Files 5
2121
- Exercise files 10
2222
- Memory 5
2323
- Exercise Allocation 10
2424
- Dask 10
25+
- BREAK 15min 13.50-14.05
2526
- Exercise Dask 30
2627

28+
Prepare environment!
29+
--------------------
30+
31+
.. admonition:: Recommendation
32+
33+
- We recommend a desktop environment for speed of the graphics.
34+
- Connecting from a local terminal with ``ssh -X`` (X11 forwarding) can be used but is slower.
35+
36+
1. Log in to a desktop (ThinLinc or OnDemand) (see :ref:`common-login`)
37+
38+
- Tetralith ThinLinc
39+
- Dardel (ThinLinc)
40+
- Alvis (
41+
42+
2743
High-Performance Data Analytics (HPDA)
2844
--------------------------------------
2945

@@ -38,6 +54,10 @@ High-Performance Data Analytics (HPDA)
3854

3955
- Big data analysis challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.” (from Wikipedia)
4056

57+
.. discussion::
58+
59+
Do you already work with large data sets?
60+
4161
Why we need to take special actions
4262
-----------------------------------
4363

@@ -50,47 +70,37 @@ Remember this one?
5070

5171
- What can limit us?
5272

53-
.. admonition:: What do we need to cover??
54-
:class: dropdown
55-
56-
- File formats
57-
- Methods
58-
- RAM allocation
59-
- chunking
60-
61-
scenario
62-
::::::::
63-
64-
- use dataset (10 GB)
65-
- fails in pandas or is slow
66-
- Load with dask + xarray
67-
6873
What the constraints are
6974
------------------------
7075

7176
- storage
72-
- reading into memory
77+
- memory
7378

79+
.. admonition:: What do we need to cover??
80+
:class: dropdown
7481

75-
Memory, nodes
82+
- storage --> use more efficient file formats
83+
- reading into memory
84+
- --> read just parts of files into memory
85+
- --> chunking
86+
- allocate more memory
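
The idea of reading just parts of a file into memory can be sketched with the ``chunksize`` argument of ``pandas.read_csv``. This is a minimal illustration, not the lesson's exercise code: the in-memory CSV text, the column name ``value`` and the chunk size of 4 rows are all made up for the example.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk:
# one column called "value" holding the integers 0..9.
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
n_rows = 0

# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk (here 4 rows) is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    n_rows += len(chunk)

# Combine the per-chunk partial results into the final answer.
mean = total / n_rows
print(mean)
```

This pattern only works directly for computations that can be combined from partial results, such as sums, counts and means.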
7687

7788
Solutions and tools
7889
-------------------
7990

8091
- Choose file format for reading and writing
8192
- Allocate enough RAM
82-
- Choose the Python package
93+
- Choose the right Python package
8394
- Is chunking suitable?
8495

8596
File formats
8697
------------
8798

88-
Bit and Byte
89-
............
99+
.. admonition:: Bits and Bytes
90100

91-
- The smallest building block of storage and memory (RAM) in the computer is a bit, which stores either a 0 or 1.
92-
- Normally a number of 8 bits are combined in a group to make a byte.
93-
- One byte (8 bits) can represent/hold at most 2^8 distinct values. Organising bytes in different ways can represent different types of information, i.e. data.
101+
- The smallest building block of storage and memory (RAM) in the computer is a bit, which stores either a 0 or 1.
102+
- Normally, 8 bits are grouped together to make a byte.
103+
- One byte (8 bits) can represent/hold at most 2^8 distinct values. Organising bytes in different ways can represent different types of information, i.e. data.
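
The statements above can be checked with a few lines of standard-library Python. This is only a small sketch: the variable names are chosen for the example, and the ``struct`` format codes ``<i`` and ``<d`` stand for a standard 32-bit integer and a 64-bit float.

```python
import struct

BITS_PER_BYTE = 8

# Number of distinct values one byte can hold.
n_values = 2 ** BITS_PER_BYTE
print(n_values)

# 255 is the largest unsigned integer that fits in a single byte.
one_byte = (255).to_bytes(1, byteorder="little")

# 256 needs a ninth bit, so packing it into one byte fails.
try:
    (256).to_bytes(1, byteorder="little")
    fits = True
except OverflowError:
    fits = False

# Organising several bytes together gives larger data types.
size_int32 = struct.calcsize("<i")    # bytes in a 32-bit integer
size_float64 = struct.calcsize("<d")  # bytes in a 64-bit float
print(size_int32, size_float64)
```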
94104

95105
.. admonition:: Numerical data
96106
:class: dropdown
@@ -341,11 +351,21 @@ An overview of common data formats
341351
- 🟨 : Ok / depends on a case
342352
- ❌ : Bad
343353

344-
Adapted from Aalto university's `Python for scientific computing <https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/#what-is-a-data-format>`__... seealso::
354+
Adapted from Aalto University's `Python for scientific computing <https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/#what-is-a-data-format>`__
355+
356+
.. seealso::
345357

346358
- ENCCS course "HPDA-Python": `Scientific data <https://enccs.github.io/hpda-python/scientific-data/>`_
347359
- Aalto Scientific Computing course "Python for Scientific Computing": `Xarray <https://aaltoscicomp.github.io/python-for-scicomp/xarray/>`_
348360

361+
Exercise file formats (10 minutes)
362+
----------------------------------
363+
364+
Go over file formats and see if some are more relevant for your work.
365+
366+
.. discussion::
367+
368+
- Would you look at other file formats and why?
349369

350370

351371
Computing efficiency with Python
@@ -378,28 +398,9 @@ XARRAY Package
378398

379399
- Explore these in the exercise below!
380400

381-
Exercise file formats
382-
---------------------
383-
384-
Go over file formats and see if some are more relevant for your work.
385-
386-
.. discussion::
387-
388-
- Would you look at other file formats and why?
389-
390401
Allocating RAM
391402
--------------
392403

393-
- Mention memory per core considerations.
394-
- Show SLURM options for memory and time.
395-
- Briefly explain what happens when a Dask job runs on multiple cores.
396-
397-
398-
399-
.. admonition:: Keywords
400-
401-
OOM
402-
403404
- Storing the data in an efficient way is one thing!
404405

405406
- Using the data in a program is another.
@@ -414,6 +415,18 @@ Allocating RAM
414415
- Note that shared memory among the cores works within node only.
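
Before requesting memory, it can help to make a rough estimate of how much RAM a dataset needs once loaded. The sketch below assumes a dense table of 64-bit values; the row and column counts and the 4 GiB of memory per core are hypothetical numbers, not the specification of any particular cluster.

```python
import math


def array_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Raw size in GiB of a dense table of fixed-width values."""
    return n_rows * n_cols * bytes_per_value / 1024**3


# Hypothetical dataset: 100 million rows, 10 columns of float64.
size_gib = array_size_gib(100_000_000, 10)

# Hypothetical memory available per core on the cluster.
mem_per_core_gib = 4

# Minimum number of cores to request on one node so that
# the raw data alone fits in the allocated memory.
cores_needed = math.ceil(size_gib / mem_per_core_gib)

print(f"{size_gib:.2f} GiB -> at least {cores_needed} cores")
```

This is a lower bound only: intermediate copies made during processing usually need additional memory on top of the raw data size.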
415416

416417

418+
.. admonition:: To cover
419+
420+
- Mention memory per core considerations.
421+
- Show SLURM options for memory and time.
422+
- Briefly explain what happens when a Dask job runs on multiple cores.
423+
424+
.. admonition:: Keywords
425+
426+
OOM
427+
428+
429+
417430
.. discussion::
418431

419432
- Take some time to find out the answers to the questions below, using the table of hardware
@@ -581,6 +594,20 @@ Data source → Format choice → Load/Chunk → Process → Write
581594
Exercises
582595
---------
583596

597+
Start an interactive session with 4 cores
598+
599+
.. admonition:: Compute allocations in this workshop
600+
:class: dropdown
601+
602+
- Pelle: ``uppmax2025-2-393``
603+
- Kebnekaise: ``hpc2n2025-151``
604+
- Cosmos: ``lu2025-7-106``
605+
- Alvis: ``naiss2025-22-934``
606+
- Tetralith: ``naiss2025-22-934``
607+
- Dardel: ``naiss2025-22-934``
608+
609+
610+
584611
- Pandas
585612
- xarray
586613
- dask
