I have a large, fully structured dataset sitting in spreadsheets and a relational database. Training and inference are becoming expensive, so my top priority is to slim the feature space while keeping predictive power intact. The assignment is centred on feature selection and engineering; no model-building or heavy preprocessing beyond what supports that goal.

Scope of work
• Examine the existing numeric and categorical fields; flag redundancy, multicollinearity, and low-information columns (see the first sketch below).
• Propose and implement dimensionality-reduction or transformation techniques (e.g. variance thresholds, recursive feature elimination, PCA, embeddings), whichever yields the best speed-to-performance ratio (see the second sketch below).
• Benchmark before-and-after runtime and memory usage as well as accuracy drift, highlighting the trade-offs clearly (see the third sketch below).
• Hand back clean, reproducible Python code (pandas, scikit-learn, or similar), a compacted dataset ready for downstream modelling, and a short report that explains your choices so I can maintain the pipeline later.

Acceptance criteria
1. At least a 30% reduction in feature count, or measurable compute savings, with no more than a minimal loss in validation accuracy (≤1 pp).
2. All steps captured in a well-commented Jupyter notebook or script, plus a markdown/PDF summary.
3. Results reproducible on my machine using standard open-source libraries only.

If you have a proven track record of squeezing efficiency out of structured data pipelines, your expertise will be invaluable here.
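
Illustrative sketches
To make the scope concrete, the three sketches below show the kind of approach I have in mind; they are orientation only, not the required implementation. The file name dataset.csv, the target column name "target", and every threshold and parameter value are placeholders, and a classification target is assumed throughout.

First sketch (flagging low-information and redundant columns): a variance threshold drops near-constant numeric fields, then a correlation screen drops one column from each highly correlated pair.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("dataset.csv")                    # placeholder flat export of the tables
X, y = df.drop(columns=["target"]), df["target"]   # "target" is a placeholder label column

# Drop near-constant (low-information) numeric columns.
num_cols = X.select_dtypes(include="number").columns
vt = VarianceThreshold(threshold=0.01)
vt.fit(X[num_cols])
X = X.drop(columns=num_cols[~vt.get_support()])

# Drop one column from each highly correlated pair (redundancy / multicollinearity).
corr = X.select_dtypes(include="number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

print(f"{X.shape[1]} columns remain out of {df.shape[1] - 1}")
```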
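
Second sketch (two of the candidate reduction techniques): recursive feature elimination and PCA on the numeric fields. The estimator, the 20-feature target, and the 95% variance cut-off are placeholders to be tuned against the speed-to-performance ratio.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("dataset.csv")                                # placeholder path
X_num = df.drop(columns=["target"]).select_dtypes("number")    # numeric fields only
y = df["target"]

# Recursive feature elimination: keep the 20 strongest numeric features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20)
rfe.fit(X_num, y)
kept = X_num.columns[rfe.support_]

# PCA on scaled features: keep enough components to explain ~95% of the variance.
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_pca = pca.fit_transform(X_num)

print(f"RFE kept {len(kept)} columns; PCA kept {X_pca.shape[1]} components")
```

Whichever technique is chosen should be justified with the benchmark in the third sketch rather than on feature count alone.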
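
Third sketch (before-and-after benchmarking): runtime, in-memory footprint, and accuracy drift for the full versus compacted feature set. The random-forest model, the 5-fold CV, and the way X_reduced is built (reusing kept from the second sketch) are assumptions standing in for whatever the final pipeline produces.

```python
import time
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def benchmark(X, y, label):
    """Report feature count, mean 5-fold CV accuracy, wall-clock runtime, and memory."""
    start = time.perf_counter()
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
    )
    elapsed = time.perf_counter() - start
    mem_mb = X.memory_usage(deep=True).sum() / 1e6
    print(f"{label}: {X.shape[1]} features | {scores.mean():.4f} mean CV accuracy | "
          f"{elapsed:.1f}s | {mem_mb:.1f} MB")

df = pd.read_csv("dataset.csv")                                # placeholder path
y = df["target"]                                               # placeholder label column
X_full = df.drop(columns=["target"]).select_dtypes("number")   # original numeric features
X_reduced = X_full[list(kept)]                                 # `kept` from the second sketch

benchmark(X_full, y, "before")
benchmark(X_reduced, y, "after")
```

The ≤1 pp accuracy-drift criterion can be checked directly from the two printed lines.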