Code-Mixed (Hin-Eng) Sentiment Analysis System with Synthetic Data Augmentation

Client: AI | Published: 08.10.2025

I have completed the literature survey for a post-graduate project entitled "Enhancing sentiment analysis for Hindi-English code-mixed text using synthetic data augmentation." Now I need the full system brought to life.

Scope and goals
• Build or adapt a sentiment classifier for code-mixed (hi-en) social-media-style text.
• Inject novelty through synthetic data augmentation so that the final model demonstrably outperforms a plain baseline trained on the raw corpus.
• Deliver clean, well-commented source code and set everything up on my M3 MacBook running macOS 15.6.

What I already have
• A starter dataset downloaded from the web and a PPT that details candidate architectures, augmentation ideas (back-translation, contextual synonym replacement, etc.), and the evaluation protocol.
• Openness to stronger datasets or alternative techniques if you can justify the gain.

What I need from you
1. Data preprocessing tailored to noisy code-switching: tokenisation, transliteration handling, spelling normalisation, and duplicate removal (rough sketch at the end of this brief).
2. A custom or fine-tuned model implementation in TensorFlow, PyTorch, or Scikit-learn, whichever best meets the accuracy target.
3. A robust synthetic data generation pipeline integrated into training (rough sketch at the end of this brief).
4. A lightweight local web interface (Flask, Streamlit, or similar) so I can demo predictions live (rough sketch at the end of this brief).
5. Hands-on assistance to install dependencies, run the notebooks/scripts, and reproduce results on my machine. A quick screen-share session at the end is fine.

Acceptance criteria
• At least one augmentation strategy that lifts F1 or accuracy beyond the non-augmented baseline on a held-out test split.
• A reproducible training script, saved model weights, a README with exact commands, and a requirements.txt or poetry.lock.
• A local web demo reachable at http://localhost:8501 (or similar) returning sentiment labels in real time.

Deadline: 15 October 2025.

When you apply, please highlight past work on multilingual NLP, code-mixing, or data-centric model improvements so I can gauge fit quickly. I'll share the PPT and current dataset once we connect.
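
Illustrative sketches (Python, for orientation only)

For item 1, here is a minimal sketch of the kind of cleaning I have in mind, assuming raw strings as input. The regexes and the tiny transliteration-variant map are placeholders, not a spec; the real pipeline should use a proper normalisation resource.

```python
import re
import unicodedata

# Placeholder map of common Romanised-Hindi spelling variants;
# a real pipeline would use a fuller transliteration/normalisation resource.
TRANSLIT_VARIANTS = {
    "nahin": "nahi",
    "hain": "hai",
    "kyaa": "kya",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"[@#]\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # "sooooo" -> "soo"

def clean_code_mixed(text: str) -> str:
    """Normalise one noisy code-mixed message."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)          # drop links
    text = MENTION_RE.sub(" ", text)      # drop @mentions / #hashtags
    text = text.lower()
    text = REPEAT_RE.sub(r"\1\1", text)   # squash elongated characters
    tokens = [TRANSLIT_VARIANTS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

def deduplicate(rows: list[str]) -> list[str]:
    """Remove near-duplicate messages by comparing their cleaned forms."""
    seen, out = set(), []
    for row in rows:
        key = clean_code_mixed(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```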
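For item 3, one route mentioned in my PPT is contextual synonym replacement with a multilingual masked language model. The sketch below assumes the Hugging Face transformers package is available; the model choice (xlm-roberta-base) and the helper name contextual_replace are illustrative only, and I am open to a different augmentation strategy if it measures better.

```python
import random
from transformers import pipeline  # assumes the `transformers` package is installed

# Multilingual masked LM used to propose in-context word substitutes;
# xlm-roberta-base is an example choice, not a requirement.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

def contextual_replace(sentence: str, n_variants: int = 2) -> list[str]:
    """Mask one random token and let the MLM suggest contextual replacements."""
    tokens = sentence.split()
    if len(tokens) < 4:
        return []
    idx = random.randrange(len(tokens))
    original = tokens[idx]
    tokens[idx] = fill_mask.tokenizer.mask_token
    masked = " ".join(tokens)
    variants = []
    for pred in fill_mask(masked, top_k=n_variants + 1):
        candidate = pred["token_str"].strip()
        if candidate and candidate.lower() != original.lower():
            variants.append(masked.replace(fill_mask.tokenizer.mask_token, candidate))
    return variants[:n_variants]

# Augmented copies keep the original sentiment label, so they can simply be
# appended to the training split before fine-tuning the classifier.
```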
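For item 4, a minimal Streamlit page would satisfy the http://localhost:8501 criterion, since 8501 is Streamlit's default port. The checkpoint path models/codemix-sentiment below is a placeholder for wherever the trained weights end up.

```python
# app.py - run with:  streamlit run app.py  (serves on http://localhost:8501 by default)
import streamlit as st
from transformers import pipeline

MODEL_DIR = "models/codemix-sentiment"  # placeholder path to the fine-tuned checkpoint

@st.cache_resource
def load_classifier():
    # Cached so the model is loaded once per server session, not per click.
    return pipeline("text-classification", model=MODEL_DIR)

st.title("Hindi-English Code-Mixed Sentiment Demo")
text = st.text_area("Enter a code-mixed sentence", "movie ekdum mast thi, loved it!")

if st.button("Predict"):
    result = load_classifier()(text)[0]
    st.write(f"Label: {result['label']}  |  Score: {result['score']:.3f}")
```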