Comprehensive Data Preprocessing Pipeline

Client: AI | Published: 05.10.2025

I have a collection of raw files that mix free-text fields with numerical measurements, and they need to be transformed into one clean, analytics-ready dataset. The goal of this project is data cleaning and preprocessing only (no modelling for now), so the entire focus is on building a robust, reproducible workflow. You will receive the original CSV/JSON exports together with a brief data dictionary and notes on known issues such as duplicates, inconsistent units, character-encoding errors, nulls, and unstructured text.

From there I need:

• Reproducible Python scripts or Jupyter notebooks that detect and resolve missing values, outliers, encoding inconsistencies, and unit mismatches; normalise or scale numerical columns; encode categorical variables; and tokenise or vectorise text where relevant.

• A merged, tidy dataset saved to both CSV and Parquet, ready for downstream modelling.

• Clear in-line comments plus a concise README explaining each step and listing the key libraries used (pandas, NumPy, scikit-learn, spaCy/NLTK, etc.).

• A brief summary report highlighting the main cleaning decisions and the resulting data-quality metrics.

Please attach a detailed project proposal describing your approach, the specific validation checks you plan to run, and an estimated timeline. Prior experience with mixed data pipelines, code modularity, and thorough documentation will weigh heavily in my decision. All deliverables should be shared via GitHub (or another version-controlled workspace) so I can review the commits and run the notebooks end-to-end before sign-off.
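To make the expected style of work concrete, here is a minimal pandas sketch of the deduplication, outlier, and missing-value steps described above. The column names (`sensor_reading`, `site`) and the 1.5×IQR fence are illustrative assumptions, not part of the actual data dictionary:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw exports: a duplicate row, a null,
# and one extreme value. Column names are hypothetical placeholders.
raw = pd.DataFrame({
    "sensor_reading": [10.2, 10.2, 9.8, np.nan, 950.0, 10.5],
    "site": ["A", "A", "B", "B", "A", "C"],
})

# 1. Drop exact duplicate rows.
df = raw.drop_duplicates().reset_index(drop=True)

# 2. Clip outliers to the 1.5*IQR fences before imputing,
#    so the extreme value does not distort the fill statistic.
q1, q3 = df["sensor_reading"].quantile([0.25, 0.75])
iqr = q3 - q1
df["sensor_reading"] = df["sensor_reading"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Impute remaining nulls with the column median.
df["sensor_reading"] = df["sensor_reading"].fillna(df["sensor_reading"].median())
```

In a real submission each threshold choice (IQR fence vs. z-score, median vs. model-based imputation) would be justified in the summary report.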
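The scaling and categorical-encoding requirements could look something like the following sketch, using scikit-learn's `StandardScaler` for the numeric columns and `pd.get_dummies` for one-hot encoding so everything stays in a single tidy frame. The columns (`temperature`, `humidity`, `site`) are again hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame after the cleaning pass.
df = pd.DataFrame({
    "temperature": [20.1, 21.5, 19.8, 22.0],
    "humidity": [0.45, 0.50, 0.48, 0.52],
    "site": ["A", "B", "A", "C"],
})

# Standardise numeric columns to zero mean / unit variance.
num_cols = ["temperature", "humidity"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["site"], prefix="site")
```

For a modelling-ready pipeline one might instead wrap these steps in a `ColumnTransformer` so the fitted scaler can be reapplied to future data; the final frame would then be written out with `df.to_csv(...)` and `df.to_parquet(...)` as required.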
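For the free-text fields, a baseline vectorisation step might use scikit-learn's `TfidfVectorizer`; the sample notes below are invented, and spaCy or NLTK could be swapped in where lemmatisation or richer tokenisation is needed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented free-text notes standing in for the real unstructured fields.
notes = [
    "pump pressure dropped overnight",
    "pressure stable, no issues",
    "sensor replaced; pressure normal",
]

# Unigram TF-IDF with basic English stop-word removal.
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(notes)  # sparse matrix: one row per note
```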