# Intelligent Indian Sign Language Recognition System (ISL-R)

*A unified deep-learning system for real-time recognition of words, alphabets, and numbers in Indian Sign Language.*

## Abstract

Communication for hearing- and speech-impaired individuals relies on sign language. However, existing Indian Sign Language (ISL) systems mostly focus on alphabets or digits, leaving out common conversational words and mixed gestures. This project develops a real-time ISL recognition system that detects words, alphabets, and numbers from live video and converts them into coherent text or speech output. It supports mixed input sequences such as "I love you 3000" → "I" (alphabet) + "love" (word) + "you" (word) + "3000" (numbers). The system uses hand-pose landmark extraction, temporal deep-learning models, and sentence-formation logic to produce readable text and optional voice feedback in real time.

## Objectives

- Build a comprehensive ISL dataset with alphabets (A–Z), numbers (0–9), and daily-use words.
- Train multi-branch neural networks for different gesture types.
- Create a real-time inference system for mixed gesture sequences.
- Implement sentence-builder logic with debounce and text-to-speech.
- Ensure cross-platform compatibility (Web + Embedded).

## Problem Statement

Existing ISL tools have limited vocabulary, poor robustness, no support for mixed sequences, and high latency. This system addresses these gaps by providing a diverse dataset, optimized lightweight models, and a low-latency unified recognition engine for alphabets, numbers, and words.

## System Architecture

**Flow:** Camera → Hand Detection → Pose Estimation → Model Inference (W/C/N) → Token Assembly → Sentence Formation → TTS Output.

**Modules:**

1. **Pose Extraction:** MediaPipe Hands or TFJS HandPose (21 keypoints per hand).
2. **Feature Encoding:** Normalized xyz coordinates, pairwise distances, angles (sketched near the end of this document).
3. **Model Heads:** CNN for characters/numbers, Bi-LSTM for words.
4. **Sequence Builder:** Collects tokens, applies debounce, auto-space, backspace (also sketched below).
5. **Output Layer:** Text display + optional speech via TTS.

## Dataset Design

### Gesture Categories

| Type | Count | Examples |
|------|-------|----------|
| Words (W) | 15–25 | hello, bye, thank_you, sorry, please, yes, no, help, love, you, me, ok, where, what |
| Alphabets (C) | 26 | A–Z |
| Numbers (N) | 10 | 0–9 |

### Dataset Size

- 500–1000 samples per class, 10–15 participants.
- Lighting/background diversity.
- Train/Val/Test = 70/15/15 (no subject overlap).

### Annotation Format

```json
{
  "id": "vid_0045",
  "label": {"type": "W", "name": "love"},
  "fps": 30,
  "frames": 60,
  "landmarks": [[x, y, z] * 21 per frame],
  "subject": "s09",
  "lighting": "outdoor"
}
```
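
As a minimal sketch of how a record in this format might be consumed, the snippet below loads one annotation file and checks the landmark array against the declared `frames` count and 21-keypoint layout. The file path, the `load_annotation` helper name, and the assumption that `landmarks` is stored as a `(frames, 21, 3)` nested list are illustrative, not fixed by the project.

```python
import json
import numpy as np

def load_annotation(path: str) -> dict:
    """Load one annotation record and verify its landmark shape.

    Assumes (not specified above) that landmarks are stored as a nested
    list of shape (frames, 21, 3): one [x, y, z] triplet per keypoint
    per frame for a single hand.
    """
    with open(path, "r", encoding="utf-8") as f:
        record = json.load(f)

    landmarks = np.asarray(record["landmarks"], dtype=np.float32)
    expected = (record["frames"], 21, 3)
    if landmarks.shape != expected:
        raise ValueError(f"{record['id']}: expected {expected}, got {landmarks.shape}")

    record["landmarks"] = landmarks
    return record

# Example (hypothetical path):
# record = load_annotation("data/annotations/vid_0045.json")
```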
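
The Feature Encoding module listed under System Architecture (normalized xyz, pairwise distances, angles) could be implemented roughly as in the sketch below. The normalization scheme — translating to the wrist and scaling by the wrist-to-middle-MCP distance — and the per-finger joint chains are assumptions layered on the standard MediaPipe Hands 21-landmark indexing, not choices specified in this document.

```python
import numpy as np

WRIST, MIDDLE_MCP = 0, 9  # MediaPipe Hands landmark indices

def encode_features(landmarks: np.ndarray) -> np.ndarray:
    """Encode one frame of 21 (x, y, z) hand landmarks into a feature vector.

    Assumed normalization: translate to the wrist and scale by the
    wrist-to-middle-MCP distance so features are position/size invariant.
    """
    pts = landmarks - landmarks[WRIST]               # translate to wrist
    scale = np.linalg.norm(pts[MIDDLE_MCP]) + 1e-6   # hand-size reference
    pts = pts / scale

    # Normalized xyz coordinates (21 * 3 = 63 values).
    coords = pts.flatten()

    # Pairwise distances between all landmark pairs (21 choose 2 = 210 values).
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    pair_dists = dists[np.triu_indices(21, k=1)]

    # Joint angles along each finger chain (wrist -> fingertip), 15 values.
    chains = [[0, 1, 2, 3, 4], [0, 5, 6, 7, 8], [0, 9, 10, 11, 12],
              [0, 13, 14, 15, 16], [0, 17, 18, 19, 20]]
    angles = []
    for chain in chains:
        for a, b, c in zip(chain, chain[1:], chain[2:]):
            v1, v2 = pts[a] - pts[b], pts[c] - pts[b]
            cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
            angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))

    return np.concatenate([coords, pair_dists, np.asarray(angles)])
```

Under these assumptions each frame yields a fixed-length vector (63 coordinates + 210 distances + 15 angles = 288 features) that could feed the CNN head directly or be stacked over time for the Bi-LSTM word head.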
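
The Sequence Builder's debounce behaviour might follow the pattern below: a per-frame prediction is accepted only after it stays the top prediction for a hold period, and releasing the gesture re-arms the builder so repeated tokens (e.g. the zeros in "3000") remain possible. The threshold values and the `BACKSPACE` control token are illustrative assumptions, not part of the specification above.

```python
import time

class SequenceBuilder:
    """Assemble recognized tokens into a sentence with simple debouncing."""

    def __init__(self, hold_seconds: float = 0.8, min_confidence: float = 0.85):
        self.hold_seconds = hold_seconds      # how long a prediction must stay stable
        self.min_confidence = min_confidence  # below this, the frame is ignored
        self._candidate = None
        self._candidate_since = 0.0
        self._last_accepted = None
        self.tokens = []

    def update(self, token: str, confidence: float) -> None:
        """Feed one per-frame prediction; accept it once it has been stable."""
        now = time.monotonic()
        if confidence < self.min_confidence:
            # Gesture released / uncertain frame: re-arm so the same token
            # can be accepted again (needed for repeats such as "3000").
            self._candidate = None
            self._last_accepted = None
            return
        if token != self._candidate:
            self._candidate, self._candidate_since = token, now
            return
        if token != self._last_accepted and now - self._candidate_since >= self.hold_seconds:
            self._accept(token)

    def _accept(self, token: str) -> None:
        if token == "BACKSPACE":              # illustrative control gesture
            if self.tokens:
                self.tokens.pop()
        else:
            self.tokens.append(token)
        self._last_accepted = token

    def sentence(self) -> str:
        """Auto-space: join accepted tokens; merging letters/digits into
        words/numbers is left to a later formatting pass."""
        return " ".join(self.tokens)
```

In the real-time loop, `update(label, probability)` would be called once per inference frame, and `sentence()` would drive the text display and the optional TTS output.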