I’m spinning up a research-grade prototype that fuses a large language model with a computer-vision pipeline, and I need an expert who already lives in that overlap. The immediate goal is to translate text that accompanies visual content, so you’ll be wiring an image encoder and an LLM together and optimising the whole flow on NVIDIA GPUs.

Here’s the shape of the work:

• Build an end-to-end proof-of-concept that accepts an image (or short video snippet) plus its source-language text and returns an accurate translation in the target language.
• Select or fine-tune the vision backbone and the LLM, keeping inference on-device with CUDA / TensorRT for speed (a rough sketch of the kind of wiring I have in mind is at the end of this brief).
• Instrument performance so we can compare different model sizes and quantisation strategies during our experimentation phase (see the benchmark-harness sketch below as well).

Acceptance criteria

1. Demo notebook or small web demo that runs on an RTX- or A100-class card and produces reproducible translations.
2. Clear, commented code (Python, PyTorch preferred) plus a brief README explaining model choices, preprocessing, and how to swap in alternative checkpoints.
3. Short benchmark report covering latency, VRAM footprint, and translation quality on a sample set we’ll provide.

If you have prior work with multimodal transformers, Stable Diffusion + LLM mash-ups, or any experience squeezing models into GPU memory via LoRA or similar tricks, you’ll hit the ground running. Let’s prototype something impressive.
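To make the wiring concrete, here is a minimal, untrained sketch of the LLaVA-style approach I have in mind: a frozen CLIP-style vision encoder whose patch features are projected into a decoder-only LLM’s embedding space. The model IDs, the prompt wording, and the linear projector are placeholder assumptions, not decisions, and the projector would need training before the outputs mean anything; it’s only meant to show the shape of the plumbing.

```python
# Minimal wiring sketch (assumptions: CLIP ViT-L/14 as the vision backbone,
# OPT-1.3B standing in for the real LLM, a single untrained linear projector).
import torch
import torch.nn as nn
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPImageProcessor, CLIPVisionModel)

device = "cuda" if torch.cuda.is_available() else "cpu"

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
image_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
llm = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device).eval()
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

# Projects vision patch features into the LLM's token-embedding space.
# Untrained here; in the real prototype this is what gets fine-tuned (e.g. with LoRA).
projector = nn.Linear(vision.config.hidden_size, llm.config.hidden_size).to(device)

@torch.no_grad()
def translate(image, source_text: str, target_lang: str = "French") -> str:
    # 1. Encode the image into a sequence of patch embeddings.
    pixels = image_proc(images=image, return_tensors="pt").pixel_values.to(device)
    image_embeds = projector(vision(pixels).last_hidden_state)  # (1, n_patches, d_llm)

    # 2. Embed the textual prompt with the LLM's own embedding table.
    prompt = f"Image context above. Translate into {target_lang}: {source_text}\nTranslation:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    text_embeds = llm.get_input_embeddings()(prompt_ids)

    # 3. Prepend the image tokens and let the LLM generate the translation.
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    out_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=128)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)
```

Swapping in alternative checkpoints should amount to changing the two `from_pretrained` IDs and retraining the projector, which is exactly the flexibility the README should document.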
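For the instrumentation bullet, this is roughly the harness I’d expect the benchmark report to come from: per-sample latency timed around `torch.cuda.synchronize()` and peak VRAM read from the CUDA allocator. The `run_fn` callable and the sample set are placeholders; translation-quality scoring (BLEU, COMET, or whatever we settle on) would sit on top of this.

```python
# Benchmark harness sketch (assumes a CUDA device and a `run_fn(sample)` callable
# that performs one full image+text -> translation pass; both are placeholders).
import time
import torch

def benchmark(run_fn, samples, warmup: int = 3):
    # Warm-up passes so kernel compilation and caching don't skew the timings.
    for sample in samples[:warmup]:
        run_fn(sample)

    torch.cuda.reset_peak_memory_stats()
    latencies = []
    for sample in samples:
        torch.cuda.synchronize()
        start = time.perf_counter()
        run_fn(sample)
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "peak_vram_gb": torch.cuda.max_memory_allocated() / 1e9,
    }

# Usage sketch: results = benchmark(lambda s: translate(s["image"], s["text"]), sample_set)
```

Running the same harness across model sizes and quantisation settings is what should feed the comparison table in the benchmark report.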