Job Description
Role Description
- As a Machine Learning Researcher at Bland, you'll be working on foundational research and development across the core components of our voice stack: speech-to-text, large language models, neural audio codecs, and text-to-speech. Your work will define how our agents understand, reason, and speak in real time at enterprise scale.
- Build and Scale Next-Generation TTS Systems
- Design and train large-scale text-to-speech models capable of expressive, controllable, human-sounding output.
- Develop neural audio codec-based TTS architectures for efficient, high-fidelity generation.
- Improve prosody modeling, question inflection, emotional expression, and multi-speaker robustness.
- Optimize for real-time, low-latency inference in production.
- Advance Speech-to-Text Modeling
- Build and fine-tune large-scale ASR systems robust to accents, noise, telephony artifacts, and code-switching.
- Leverage self-supervised pretraining and large-scale weak supervision.
- Improve transcription accuracy for real-world enterprise scenarios, including structured extraction and conversational nuance.
- Pioneer Neural Audio Codecs
- Research and implement neural audio codecs that achieve extreme compression with minimal perceptual loss.
- Explore discrete and continuous latent representations for scalable speech modeling.
- Design codec architectures that enable downstream generative modeling and controllable synthesis.
- Develop Scalable Training Pipelines
- Curate and process massive audio datasets across languages, speakers, and environments.
- Design staged training curricula and data filtering strategies.
- Scale training across distributed GPU clusters, with a focus on cost, throughput, and reliability.
- Run Rigorous Experiments
- Design ablation studies that isolate the impact of architectural changes.
- Measure improvements using both objective metrics and perceptual evaluations.
- Validate ideas quickly through focused experiments that confirm or eliminate hypotheses.
Qualifications
- Experience with self-supervised learning, multimodal modeling, or generative modeling.
- Hands-on experience building or scaling TTS, STT, or neural audio codec systems.
- Familiarity with large-scale speech datasets and real-world audio variability.
- Experience training and serving large models on modern accelerators.
- Track record of designing controlled experiments and meaningful ablations.
- Comfortable in fast-moving startup environments.
Requirements
- Ability to derive new formulations and implement them efficiently.
- Strong intuition for audio quality, prosody, and conversational dynamics.
- Knowledge of inference optimization techniques, including quantization, kernel optimization, and memory efficiency.
- Understanding of real-time constraints in telephony or streaming environments.
- Ability to move quickly from hypothesis to validation.
- Strong ownership mindset from research through deployment.
- Excited by ambiguous, unsolved problems.
Benefits
- Healthcare, dental, vision, all the good stuff
- Meaningful equity in a fast-growing company
- Every tool you need to succeed
- Beautiful office in Jackson Square, SF with rooftop views
- Competitive salary: $160,000 to $250,000