Welcome to the official repository of Team FVNWL for the European Statistics Awards' Web Intelligence Classification Challenge. This project focuses on the classification of job advertisements into standardized occupational categories, utilizing advanced web content processing techniques to extract valuable data for statistical analysis.
The challenge revolves around developing cutting-edge methodologies to process web content efficiently and accurately. Our primary goal is to assign class labels from a predefined taxonomy to job advertisements, facilitating more precise and scalable data extraction for statistical purposes.
The project leverages state-of-the-art Large Language Models (LLMs), both locally and through APIs, to enhance the performance and reliability of our classification system.
For a deep dive into our approach, please refer to our comprehensive report: classification_approach_description.docx.
To set up the environment and reproduce our results, follow these steps:
Install Conda using the Miniforge installation instructions.
On Linux x86_64 (amd64), run the following commands to install Miniforge:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge-pypy3-Linux-x86_64.sh
sh Miniforge-pypy3-Linux-x86_64.sh
# Follow the on-screen instructions
Once Conda is installed, set up the required environment by running:
mamba env create -f env.yml
conda activate fvnwl
Due to privacy and legal constraints, we are unable to share the original dataset used in this challenge. To facilitate reproducibility, we have synthesized part of the data for public use. This synthetic data mimics the structure and characteristics of the real data while ensuring compliance with privacy regulations.
If you would like to reproduce the full results with the original pre-run outputs, please contact us directly to request access to the necessary archived artifacts. Send an email to teamfvnwl@example.com with your request, and we will review it based on data-sharing agreements and guidelines.
If you'd like to reproduce our pre-run results, simply run:
sh scripts/reproduce.sh
Ensure that:
- The fvnwl environment is activated.
- The artifacts/archive folder is available.
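The two prerequisites above can also be checked programmatically before launching the reproduction script. The following is a minimal sketch, not part of the repository: the function name `preflight_problems` is hypothetical, and it relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` exports.

```python
import os
from pathlib import Path


def preflight_problems(env_name="fvnwl", archive_dir="artifacts/archive", environ=None):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    problems = []
    # `conda activate` exports CONDA_DEFAULT_ENV with the active environment name.
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != env_name:
        problems.append(f"conda environment {env_name!r} is not active (found {active!r})")
    # The archived pre-run outputs must be present as a directory.
    if not Path(archive_dir).is_dir():
        problems.append(f"folder {archive_dir!r} is missing")
    return problems


# Example usage from the repository root:
# for problem in preflight_problems():
#     print("NOT READY:", problem)
```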
For those running the project from scratch, access to an LLM is required.
To set up a local LLM:
- Serve the model locally using the script in the service folder.
- Use the provided udocker script if running on a Slurm cluster without root access.
To confirm the environment setup, run:
sh scripts/test.sh
Once the environment is ready, run the following command to execute inference. This generates classification results for the data subset only; to run on the full dataset, switch to a configuration file with the full_ prefix.
sh scripts/inference.sh
Each configuration file in the config folder corresponds to specific results discussed in our report. Choose the appropriate configuration for your experiment.
We also release the preliminary results of our experiments in the artifacts folder for those who want to skip some steps. For the data subsets, we provide artifacts/subset1000.tar.gz and artifacts/subset7000.tar.gz.
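Unpacking one of these archives can be done with Python's standard `tarfile` module. This is a minimal sketch: the helper name `extract_artifact` and the destination folder are assumptions, and the internal layout of the archives is not described here, so inspect the returned member names before relying on specific paths.

```python
import tarfile
from pathlib import Path


def extract_artifact(archive_path, dest_dir="artifacts"):
    """Extract a .tar.gz archive into dest_dir and return its member names.

    Hypothetical helper; only extract archives from a source you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


# Example usage from the repository root:
# print(extract_artifact("artifacts/subset1000.tar.gz"))
```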
To re-run the preprocessing step, run:
python src/preprocess.py --config config/stella_bm25_gemma2.ini
To re-run the retrieval step, run:
python src/retrieval.py --config config/stella_bm25_gemma2.ini
To re-run the reranking step, run:
python src/reranking.py --config config/stella_bm25_gemma2.ini
To re-run the LLM-based classification step, run:
python src/llm_classification.py --config config/stella_bm25_gemma2.ini
Remember to serve the LLM before running the LLM-based classification step.
To re-run the submission construction step, run:
python src/submission.py --config config/stella_bm25_gemma2.ini --output submission.csv
Repeat these steps for the _mistral model by changing the configuration file.
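The individual steps above can be chained from a single driver. The sketch below only assembles and runs the same command lines shown above; the function names `build_commands` and `run_pipeline` are hypothetical and not part of the repository, and the LLM still has to be served separately before the classification step.

```python
import subprocess
import sys

# Script names and their order are taken from the commands listed above.
STEPS = ["preprocess", "retrieval", "reranking", "llm_classification", "submission"]


def build_commands(config, output="submission.csv", python=sys.executable):
    """Return one argument list per pipeline step, in execution order."""
    commands = []
    for step in STEPS:
        cmd = [python, f"src/{step}.py", "--config", config]
        if step == "submission":
            # Only the submission step takes an --output argument above.
            cmd += ["--output", output]
        commands.append(cmd)
    return commands


def run_pipeline(config, output="submission.csv"):
    """Run every step in order; check=True stops at the first failing step."""
    for cmd in build_commands(config, output):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Example usage from the repository root:
# run_pipeline("config/stella_bm25_gemma2.ini")
```

Stopping at the first failure keeps a later step from consuming stale outputs of an earlier one.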
Ensemble the results by running the following command:
python src/merge.py --ensemble_files \
artifacts/results_stella_gemma2/classification.csv \
artifacts/results_gemma2_gemma2/classification.csv \
artifacts/results_stella_mistral/classification.csv \
artifacts/results_gemma2_mistral/classification.csv \
--compare_file latest_predictions.csv
latest_predictions.csv is the file that contains the latest submission. If it does not exist, you can use any of the result files in the artifacts folder instead, such as artifacts/results_stella_gemma2/classification.csv.
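The logic inside src/merge.py is not described here, so the following is only an illustration of one common ensembling strategy, majority voting, together with a simple agreement score against a comparison file. The column names `id` and `label`, the tie-breaking rule, and all function names are assumptions made for this sketch.

```python
import csv
from collections import Counter


def read_predictions(path, id_col="id", label_col="label"):
    """Load a CSV into {id: label}; the column names are assumed, not confirmed."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[id_col]: row[label_col] for row in csv.DictReader(handle)}


def majority_vote(prediction_sets):
    """Combine {id: label} dicts; ties go to the earliest file in the list."""
    merged = {}
    for item_id in prediction_sets[0]:
        votes = [preds[item_id] for preds in prediction_sets if item_id in preds]
        counts = Counter(votes)
        best = max(counts.values())
        # `votes` keeps file order, so the first label with the top count wins ties.
        merged[item_id] = next(label for label in votes if counts[label] == best)
    return merged


def agreement(a, b):
    """Fraction of ids present in both mappings that carry the same label."""
    shared = a.keys() & b.keys()
    return sum(a[i] == b[i] for i in shared) / len(shared) if shared else 0.0
```

Comparing the merged predictions with a previous submission via `agreement` gives a quick sanity check on how much the ensemble changes the output.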
Interested in contributing? Here’s how to get started:
- Fork the repository.
- Create a new branch, e.g., fix-loss or add-feat.
- Make your changes, add features, or fix bugs.
- Add relevant test cases to the test folder.
- Ensure all test cases pass (run sh scripts/test.sh).
- Document your feature or bug fix in the Pull Request (PR).
- Push your changes and submit a PR to the main repository.
Expected test results upon successful execution:
============================== test session starts ===============================
platform linux -- Python 3.10.15, pytest-8.3.3, pluggy-1.5.0 --
cachedir: .pytest_cache
rootdir: ...
configfile: pyproject.toml
plugins: anyio-4.6.0
collected 10 items
tests/test_api_llm.py::test_generate_content PASSED [ 10%]
tests/test_api_llm.py::test_generate_content_with_different_prompt PASSED [ 20%]
tests/test_api_llm.py::test_generate_content_with_high_temperature PASSED [ 30%]
tests/test_keyword_extraction.py::test_keyword_extraction PASSED [ 40%]
tests/test_langdet.py::test_detect_language PASSED [ 50%]
tests/test_local_llm.py::test_simple PASSED [ 60%]
tests/test_local_llm.py::test_generate_content PASSED [ 70%]
tests/test_search.py::test_search_with_valid_description PASSED [ 80%]
tests/test_search.py::test_search_with_special_characters PASSED [ 90%]
tests/test_search.py::test_search_with_long_description PASSED [100%]
============================== 10 passed in 35.13s ===============================
A huge thanks to all contributors who made this project a success:
We look forward to your feedback and contributions to make this repository even better!