Dataset Preparation for File-Level Vulnerability Classification on the CVEfixes Dataset

Client: AI | Published: 24.03.2026
Budget: $750

I need help preparing a dataset from CVEfixes for a file-level vulnerability classification task.

My target programming languages are:
- PHP
- Java
- JavaScript
- Python

My target CWE scope is:
- CWE-20
- CWE-22
- CWE-79
- CWE-89
- CWE-352

Important note: this will be a file-level dataset, and the labels in CVEfixes should be treated as derived / proxy file-level labels, not perfect manual ground truth. I want the preparation pipeline to be strict, realistic, and academically defensible. The dataset has a real class imbalance problem, and I want it handled carefully.

What I want:

I want the dataset to be organized by language first. That means I need 4 separate language-specific datasets:
- Java
- PHP
- JavaScript
- Python

Then, for each language-specific dataset, I want a split into:
- train
- validation
- test

So each language should end up with its own 3 split files.

Important constraints:
- The split must be strict and leakage-aware
- Duplicates and near-duplicates should be handled carefully before or during splitting
- If possible, avoid putting highly similar samples across train/validation/test
- Augmentation must be applied only to the training split
- Validation and test must remain original, real, and unaugmented
- I do not want any synthetic or LLM-generated samples in validation or test
- The final evaluation setting must stay realistic and fair

Augmentation requirement:

To address class imbalance, you may use an LLM to generate augmented variants in a controlled way, but only under these conditions:
- augmentation must be applied after the split
- only the train split may be augmented
- augmented samples must not leak into validation or test
- validation and test should remain fully original and unchanged
- the process should improve training balance without making evaluation unrealistic

What I need help with:

I want help designing and preparing the full data preparation notebook correctly, including:
- filtering CVEfixes to the required languages
- filtering to the required CWE scope
- building a clean file-level dataset
- handling duplicate or pathological repeated samples carefully
- splitting each language dataset into train/validation/test in a strict way
- applying augmentation only on the training split
- improving class balance without contaminating validation or test

Final goal:

The final result should be:
- Java: train / validation / test
- PHP: train / validation / test
- JavaScript: train / validation / test
- Python: train / validation / test

with augmentation used only on train, and with validation/test kept fully real and untouched.
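To make the requested order of operations concrete (filter, deduplicate, split, then plan augmentation on train only), here is a minimal sketch in plain Python. It is an illustration under assumptions, not a finished pipeline: the record fields (`language`, `cwe`, `repo`, `code`, `label`) are a hypothetical schema and not the CVEfixes column names, deduplication is only exact-match after whitespace normalization (real near-duplicate detection would need something stronger), and grouping by repository is just one possible way to keep similar files out of different splits. The split is also not stratified by label, which a real notebook would likely want.

```python
import hashlib
import random
import re
from collections import Counter, defaultdict

# Scope taken from the posting.
TARGET_LANGUAGES = {"PHP", "Java", "JavaScript", "Python"}
TARGET_CWES = {"CWE-20", "CWE-22", "CWE-79", "CWE-89", "CWE-352"}


def content_hash(code):
    """Hash the file text with whitespace collapsed, so reformatted copies collide."""
    normalized = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_and_dedup(records):
    """Keep in-scope records; drop whitespace-insensitive exact duplicates per language."""
    seen, kept = set(), []
    for rec in records:
        if rec["language"] not in TARGET_LANGUAGES or rec["cwe"] not in TARGET_CWES:
            continue
        key = (rec["language"], content_hash(rec["code"]))
        if key in seen:
            continue  # first occurrence wins; later copies are discarded
        seen.add(key)
        kept.append(rec)
    return kept


def split_by_group(records, group_key="repo", ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole groups (here: repositories) to exactly one split."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    names = sorted(groups)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * ratios[0])
    n_val = int(len(names) * ratios[1])
    parts = {
        "train": names[:n_train],
        "validation": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],
    }
    return {split: [rec for g in gs for rec in groups[g]] for split, gs in parts.items()}


def augmentation_budget(train_records):
    """How many extra samples each label would need to match the majority label.

    Computed from the train split only; validation and test are never inspected.
    """
    counts = Counter(rec["label"] for rec in train_records)
    target = max(counts.values())
    return {label: target - count for label, count in counts.items()}


def prepare_language(records, language):
    """Run the sketch for one language and return its splits plus the train budget."""
    subset = [rec for rec in records if rec["language"] == language]
    splits = split_by_group(filter_and_dedup(subset))
    return splits, augmentation_budget(splits["train"])
```

Calling `prepare_language(records, "Python")` (and likewise for the other three languages) would return the three splits for that language plus a per-label count of how many augmented samples the train split could absorb. Any LLM-generated variants would then be appended to `splits["train"]` only, leaving the validation and test lists exactly as they came out of the split.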