Homework for Week 5: Streaming Pipelines
Submissions are processed automatically through GitHub Classroom, so we cannot accommodate individual requests for changes. Please read all instructions carefully.
- Sync fork: Ensure your fork is up-to-date with the upstream repository. You can do this through the GitHub UI or by running the following commands in your local forked repository:
    git pull upstream main --rebase
    git push origin <main/your_homework_repo> -f
- Open a PR in the upstream repository to submit your work:
  - Open a Pull Request (PR) to merge changes from your `main` or custom `homework` branch in your forked repository into the main branch in the upstream repository, i.e., the repository your fork was created from.
  - Ensure your PR is opened correctly to trigger the GitHub workflow. Submitting to the wrong branch or repository will cause the GitHub action to fail.
  - See the Steps to open a PR section below if you need help with this.
- Complete assignment prompts: Write your query files corresponding to the prompt number in the `submission` folder. Do not change or rename these files!
- Lint your SQL code for readability. Ensure your code is clean and easy to follow.
- Add comments to your queries. Use the `--` syntax to explain each step and help the reviewer understand your thought process.
A link to your PR will be shared with our TA team after the homework deadline.
Steps to open a PR:
- Go to the upstream `streaming-pipelines` repository.
- Click the Pull Requests tab.
- Click the "New pull request" button on the top-right. This will take you to the "Compare changes" page.
- Click the "compare across forks" link in the text.
- Leave the base repository as is. For the "head repository", select your forked repository and then the name of the branch you want to compare from your forked repo.
- Click the "Create pull request" button to open the PR
- Grades are pass or fail, used solely for certification.
- Final grades will be submitted by a TA after the deadline.
- An approved PR means a Pass grade. If changes are requested, the grade will be marked as Fail.
- Reviewers may provide comments or suggestions with requested changes. These are optional and intended for your benefit. Any changes you make in response will not be re-reviewed.
Your task is to write a streaming pipeline to understand DataExpert.io user behavior using sessionization.
Use the `glue_job_runner.py` connectors and examples in the airflow-dbt-project repo to get started. Copy and paste the DDL into `submission/session_ddl.sql` and copy the job into `submission/session_job.py`!
- Write a DDL query (`session_ddl.sql`) that creates a table that tracks Data Expert sessions (see the sketch after this list). Make sure to have:
  - a unique identifier for each session
  - the start and end of the session
  - how many events happened in that session
  - the date of the beginning of the session
  - what city, country, and state are associated with this session
  - what operating system and browser are associated with this session
  - whether this session is for logged-in or logged-out users
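Below is a minimal sketch of what the table behind `session_ddl.sql` could look like, written as a SQL string inside a Python snippet so it can be run from a Glue/Spark job. The schema name, table name, column names, and partitioning (`your_schema.user_web_sessions`, `is_logged_in`, and so on) are illustrative assumptions, not a required layout; adapt them to whatever columns your job actually produces.

```python
# A minimal sketch of a sessions table, assuming an Iceberg table in a Glue catalog.
# "your_schema.user_web_sessions" and every column name below are placeholders --
# rename them to match your own submission.
SESSION_DDL = """
CREATE TABLE IF NOT EXISTS your_schema.user_web_sessions (
    session_id        STRING,     -- unique identifier for the session
    user_id           STRING,     -- null for logged-out users
    ip                STRING,
    session_start     TIMESTAMP,  -- start of the session
    session_end       TIMESTAMP,  -- end of the session
    event_count       BIGINT,     -- how many events happened in the session
    session_date      DATE,       -- date of the beginning of the session
    city              STRING,
    country           STRING,
    state             STRING,
    operating_system  STRING,
    browser           STRING,
    is_logged_in      BOOLEAN     -- whether the session is for a logged-in user
)
USING iceberg
PARTITIONED BY (session_date)
"""

# Inside a Glue/Spark job the statement can be executed with spark.sql(SESSION_DDL);
# the raw SQL itself is what goes into submission/session_ddl.sql.
```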
- Write a Spark Streaming Job (`session_job.py`) that reads from Kafka (see the sketch after this list).
  - It groups the Kafka events with a `session_window` with a 5-minute gap (sessions end after 5 minutes of inactivity).
  - It is also grouped by `user_id` and `ip` to track each user's behavior.
  - Make sure to increase the `timeout` of your job to `1 hour` so you can capture real sessions.
  - Create a unique identifier for each session called `session_id` (you should think about using `hash` and the necessary columns to uniquely identify the session).
  - If `user_id` is not null, that means the event is from a logged-in user.
  - If a user logs in, that creates a new session (if you group by `ip` and `user_id`, that solves the problem pretty elegantly).
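For orientation, here is a hedged PySpark Structured Streaming sketch of the shape such a job could take. The broker address, topic name, JSON field paths, watermark, and console sink are placeholder assumptions, not the course's actual settings; a real submission should reuse the connectors and options from `glue_job_runner.py` and write into the table defined in `session_ddl.sql`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("session_job_sketch").getOrCreate()

# Placeholder Kafka settings -- replace with the broker/topic from glue_job_runner.py.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "web-events")
    .load()
)

# Assume each Kafka value is a JSON event carrying at least user_id and ip.
# Geo (city/country/state) and user-agent (OS/browser) fields are omitted for brevity.
events = raw.select(
    F.get_json_object(F.col("value").cast("string"), "$.user_id").alias("user_id"),
    F.get_json_object(F.col("value").cast("string"), "$.ip").alias("ip"),
    F.col("timestamp"),
)

sessions = (
    events
    .withWatermark("timestamp", "10 minutes")
    # One session per (user_id, ip), closed after 5 minutes of inactivity.
    .groupBy(
        F.session_window(F.col("timestamp"), "5 minutes"),
        F.col("user_id"),
        F.col("ip"),
    )
    .agg(F.count(F.lit(1)).alias("event_count"))
    .select(
        # Hashing the grouping keys plus the window start gives a stable session_id.
        F.hash(F.col("user_id"), F.col("ip"), F.col("session_window.start"))
            .cast("string").alias("session_id"),
        F.col("session_window.start").alias("session_start"),
        F.col("session_window.end").alias("session_end"),
        F.to_date(F.col("session_window.start")).alias("session_date"),
        F.col("user_id"),
        F.col("ip"),
        F.col("event_count"),
        F.col("user_id").isNotNull().alias("is_logged_in"),
    )
)

# Console sink as a stand-in; a real job would write to the table from session_ddl.sql.
query = sessions.writeStream.format("console").outputMode("append").start()
query.awaitTermination(timeout=60 * 60)  # keep the job alive ~1 hour to capture real sessions
```

Grouping by both `ip` and `user_id` means the grouping key changes when a user logs in, so a login naturally starts a new session, and hashing those keys together with the window start keeps `session_id` deterministic.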