I’m leading a team that already fine-tunes Hugging Face models, but we’re stalled on the last mile: turning those checkpoints into WebLLM artefacts that run smoothly in the browser via WebGPU/WebAssembly. I need a short-term partner who has actually walked this path before and can sit with us virtually, show exactly how to compile a model into the WebLLM format, debug any hiccups, and prove the result works in-browser with stable latency.

What I expect from you

• A step-by-step script or notebook that converts a standard HF model (think Llama-2, GPT-J, BLOOM or similar) into the WebLLM format.
• A clear explanation of the conversion tools, flags and weight-slicing decisions you use, so my engineers can repeat the process later without you.
• A minimal demo web page (TypeScript or vanilla JS is fine) that loads the converted model, allocates buffers correctly, and serves a prompt via the WebGPU back-end (a rough sketch of what we have in mind is at the end of this post).
• Performance metrics (tokens/s, memory footprint) captured on at least one consumer-grade GPU so we can compare; see the timing sketch below for the kind of numbers we want.

Acceptance criteria

1. The model loads in an evergreen Chromium-based browser with no console errors.
2. First-token latency ≤ 3 s and sustained generation speed comparable to your benchmark notes.
3. Full reproducibility on our hardware following your instructions.

We already have the fine-tuned weights and a dev environment in place; I simply need your expertise to unblock compilation and browser inference. If you have prior commits or public demos with WebLLM, please share a link when you respond so we can hit the ground running.
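
To make the demo-page deliverable concrete, here is roughly the shape we have in mind. This is a sketch, not a spec: it assumes the @mlc-ai/web-llm npm package and its CreateMLCEngine/appConfig API, and the model_id and URLs are placeholders for wherever the converted weight shards and compiled WebGPU wasm actually end up.

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Placeholder locations for our converted artefacts; swap in the real
// weight-shard directory (containing mlc-chat-config.json) and compiled wasm.
const appConfig = {
  model_list: [
    {
      model: "https://example.com/models/our-llama2-7b-q4f16_1-MLC/",
      model_id: "our-llama2-7b-q4f16_1",
      model_lib: "https://example.com/models/our-llama2-7b-webgpu.wasm",
    },
  ],
};

async function main() {
  // CreateMLCEngine fetches the shards, instantiates the model library,
  // and allocates the GPU buffers; progress is surfaced via the callback.
  const engine = await CreateMLCEngine("our-llama2-7b-q4f16_1", {
    appConfig,
    initProgressCallback: (report) => console.log(report.text),
  });

  // One prompt through the OpenAI-style chat API to prove end-to-end inference.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Say hello in five words." }],
  });
  console.log(reply.choices[0].message.content);
}

main().catch(console.error);
```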
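For the metrics bullet, here is a minimal sketch of how we would like first-token latency and sustained tokens/s captured. It assumes the same engine object as above and counts streamed chunks as a proxy for decoded tokens (web-llm streams roughly one content chunk per token, close enough for a sanity check); your own benchmark notes can of course use whatever the runtime itself reports.

```typescript
import type { MLCEngine } from "@mlc-ai/web-llm";

// Measure first-token latency and decode throughput for one prompt.
async function benchmark(engine: MLCEngine, prompt: string): Promise<void> {
  const tStart = performance.now();
  let tFirstToken = 0;
  let tokenCount = 0;

  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  for await (const chunk of chunks) {
    if (chunk.choices[0]?.delta?.content) {
      if (tFirstToken === 0) tFirstToken = performance.now();
      tokenCount += 1; // approximation: ~one streamed chunk per decoded token
    }
  }

  const tEnd = performance.now();
  const firstTokenSec = (tFirstToken - tStart) / 1000;
  const decodeTokS = tokenCount / ((tEnd - tFirstToken) / 1000);
  console.log(`first token: ${firstTokenSec.toFixed(2)} s (target: <= 3 s)`);
  console.log(`sustained decode: ${decodeTokS.toFixed(1)} tokens/s`);
}
```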