- Abstract (2-3 lines)
As Data Scientists, we face a few challenges while dealing with large volumes of data:
- Popular Python libraries like NumPy & Pandas are not designed to scale beyond a single processor/core
- NumPy, Pandas, and Scikit-Learn are not designed to scale beyond a single machine
- If the data is bigger than RAM, these libraries can't be used
In this session, I will discuss how these challenges can be addressed using the parallel computing library Dask.
- Brief Description and Contents to be covered
The talk is divided into two parts:
- Understanding the challenges of large data (will be delivered through a presentation)
a. Fundamentals of computer architecture (with a focus on the computing unit & memory unit)
b. Why parallelism is necessary in a multi-core architecture
c. Challenges with large data (data that doesn't fit in RAM) & how to address them
d. Introduction to distributed computing
- How does Dask handle large data? (Code walkthrough; minimal sketches follow this outline)
a. What is Dask and why is it needed?
b. How does Dask parallelize jobs across cores/processors?
c. How does Dask handle larger-than-memory data using out-of-core and distributed computing?
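To give a flavour of the out-of-core part of the walkthrough, here is a minimal sketch (not taken from the linked workshop notebook; the file pattern and column names are hypothetical) of how a Dask DataFrame processes a larger-than-RAM CSV dataset in partitions:

```python
import dask.dataframe as dd

# Lazily point Dask at a directory of CSV files; nothing is loaded yet.
# Each file (or chunk of a file) becomes one pandas partition.
df = dd.read_csv("data/transactions-*.csv")

# Familiar pandas-style operations only build a task graph.
avg_amount = df.groupby("user_id")["amount"].mean()

# compute() executes the graph, streaming partitions through memory,
# so the full dataset never has to fit in RAM at once.
print(avg_amount.compute())
```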
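And a sketch of how the same API parallelizes work across cores via the distributed scheduler (worker counts and array sizes are illustrative, not from the talk material):

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# A local "cluster": one worker process per core on this machine.
# Pointing Client at a real cluster address scales the same code out.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# ~80 GB of random float64 data, split into 10,000 x 10,000 chunks.
x = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))

# Chunks are reduced in parallel across the workers; only a few
# chunks are materialised in memory at any one time.
print(x.mean().compute())

client.close()
cluster.close()
```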
- Pre-requisites for the talk
Basic knowledge of Python-based data science libraries such as Pandas, NumPy, and Scikit-Learn
- Time required for the talk
45 minutes to 1 hour. This talk can also be extended into a 2-hour-long workshop.
- Link to slides
https://speakerdeck.com/arnabbiswas1/scale-up-your-data-science-work-flow-using-dask
- Will you be doing a hands-on demo as well?
Yes.
- Link to ipython notebook (if any)
https://github.com/arnabbiswas1/dask_workshop
- About yourself
- Are you comfortable if the talk is recorded and uploaded to PyData Delhi's YouTube channel?
Yes
- Any query?
This talk (45 minutes) has been delivered recently to the Bangalore Python User Group, BangPypers. Here is the recording for your reference: Link