Data Mining assignment

dm2-kaggle - Zalán Tóth - 20102768

Data Mining 2 assignment: Kaggle competition

Approach

My approach was quite simple: create many features that may be relevant, use automatic feature selection to choose the best ones, then train many models and pick the best performers. Those best-performing, diverse models, together with the selected features, are then combined in voting and stacking ensembles for a better final result.
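The select-then-compare part of this approach could be sketched roughly as follows. It is only an illustration on synthetic data with scikit-learn: the selector (`SelectFromModel`), the two candidate models, the scoring metric and the seed value are assumptions made for the example, not necessarily what the notebooks use.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

SEED = 42  # hypothetical seed value

# Synthetic stand-in for the engineered feature table (not the competition data).
X, y = make_regression(n_samples=300, n_features=30, n_informative=8,
                       noise=10.0, random_state=SEED)

# Automatic feature selection: keep features whose importance reaches the median.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=SEED),
    threshold="median",
).fit(X, y)
X_selected = selector.transform(X)

# Compare candidate models on the selected features with 5-fold cross-validation.
candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=SEED),
}
scores = {
    name: cross_val_score(model, X_selected, y, cv=5,
                          scoring="neg_root_mean_squared_error").mean()
    for name, model in candidates.items()
}

# Scores are negative RMSE, so the largest value belongs to the best model.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, X_selected.shape)
```

Fitting the selector on all rows before cross-validation is a simplification that keeps the sketch short; wrapping the selector and model in a `Pipeline` would avoid leaking information between folds.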

The notebooks and how to run them

This codebase has 3 Jupyter notebooks:

01 is the EDA, which is very simple and only there to familiarise myself with the data and explore its shape etc. There is no need to run this notebook, but it runs in about a second.

02 is where most things happen, including model selection, evaluation, data processing, cleanup and feature engineering. This is the notebook where you can set the SEED, which is carried forward to notebook 03 along with other information such as the dataframe and meta information like the selected features. Running it creates or modifies a few files, but the two important ones are saved in the data folder: the dataframe as a feather file and the meta as a JSON file. Technically, this was the experimentation notebook. Running it takes about a minute and a half.
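A minimal sketch of the meta handoff between the two notebooks is shown below. The key names (`seed`, `selected_features`), the helper functions and the example feature names are hypothetical; the real meta file may hold different fields.

```python
import json
import tempfile
from pathlib import Path


def save_meta(path, seed, selected_features):
    """Write the values the next notebook needs into a JSON file."""
    meta = {"seed": seed, "selected_features": list(selected_features)}
    Path(path).write_text(json.dumps(meta, indent=2), encoding="utf-8")


def load_meta(path):
    """Read the metadata back as a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Round trip in a temporary folder, standing in for the data folder.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "meta.json"
    save_meta(meta_path, seed=42, selected_features=["feature_a", "feature_b"])
    meta = load_meta(meta_path)

print(meta["seed"], meta["selected_features"])
```

The dataframe itself would be written next to this file with pandas `DataFrame.to_feather` and read back with `pandas.read_feather`, both of which need the `pyarrow` package.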

03 loads the dataframe and meta created by notebook 02. This is where the ensembling happens: voting and stacking. It uses manually selected models set within the notebook, chosen based on the experimentation in notebook 02. Feature selection is also carried over from the meta.json file. A logarithmic transformation is applied to the target cases here as well, because the cases are unbalanced, with spikes identified in the EDA; the idea itself was suggested by AI. It improved the end result and proved to be a good idea. Its reference is in the ai_ref folder. (Disclaimer: I also used the AI to explain the data to me in Hungarian, and when I later asked about improvements in English in the same chat, it kept answering in Hungarian, so the whole conversation is in that language and I went with it. Sorry about that, although it is easier for me to understand this way anyway.) This notebook also takes about a minute and a half to run.
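To illustrate how a log-transformed target can be combined with voting and stacking, here is a small sketch on synthetic data. The base models, the `log1p`/`expm1` pair, the metric and the fold counts are assumptions chosen for the example; the notebook uses its own manually selected models.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor, VotingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
# Synthetic positive, right-skewed target with occasional large values.
y = np.exp(1.0 + 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.2, size=300))

base_models = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingRegressor(random_state=42)),
    ("ridge", Ridge(alpha=1.0)),
]

# Voting averages the base predictions; stacking fits a final model on them.
voting = VotingRegressor(estimators=base_models)
stacking = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)


def with_log_target(model):
    # Fit on log1p(y) and map predictions back to the original scale with expm1.
    return TransformedTargetRegressor(regressor=model, func=np.log1p,
                                      inverse_func=np.expm1)


results = {}
for name, model in {"voting": voting, "stacking": stacking}.items():
    scores = cross_val_score(with_log_target(model), X, y, cv=3,
                             scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()  # mean absolute error on the original scale

print(results)
```

Wrapping each ensemble in `TransformedTargetRegressor` keeps the transform and its inverse in one place, so the cross-validation error is reported in the original units of the target.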

I commented out some parts that are not important and were only there for testing or verification, so hopefully running the notebooks won't take too much time.

I'm unsure whether anything needs to be installed with pip or conda for the Python environment, but I used these libraries:

xgboost, lightgbm, catboost

I had already installed many packages during the labs and did not need to install anything new for this assignment, so I don't know whether these need to be installed separately.
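One way to check this before running the notebooks is a quick availability test like the sketch below. The helper name `missing_packages` is hypothetical, and the check only covers the three libraries named above.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["xgboost", "lightgbm", "catboost"]
missing = missing_packages(required)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All three libraries are available.")
```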

Model performance

I think the model performance was decent. I'm pretty sure every aspect of the notebooks can be improved, as the time I allocated was limited, but I'm happy with the results regardless. The new features are quite simple, I think, but sometimes simple things are better. Of course, more advanced features, more models and more hyperparameter tuning could be added (I did some tuning, but not a great deal of customisation), yet the results are still good and the notebooks don't take long to run. I think the notebooks are self-explanatory, and the final results (cross-validation scores) of the stacking and voting ensembles are decent.

References

Code comments in the notebooks may include references.

AI reference: ai_ref folder. Sorry about the one chat being in Hungarian; it is primarily about describing the original CSV files and then exploring how the models could be improved. That is where the logarithmic transformation was suggested as a way to improve the results, which really makes sense when looking at the EDA plots of the target cases distribution.

Context-aware Windsurf AI Cascade was also used for autocompletion, to speed up writing, reduce bugs in the code and improve the algorithms (although for these data mining notebooks it was less accurate than it generally is in other codebases). As this is a tab-complete feature, it cannot be referenced directly.

Repository of work: https://git.rifstar.net/Enyzat/dm2-kaggle