[Hiring] Machine Learning Researcher, Audio @Bland

🌍 Remote, USA 💹 Full-time

Job Description


Role Description

Build and Scale Next-Generation TTS Systems
Design and train large-scale text-to-speech models capable of expressive, controllable, human-sounding output.
Develop neural audio codec-based TTS architectures for efficient, high-fidelity generation.
Improve prosody modeling, question inflection, emotional expression, and multi-speaker robustness.
Optimize for real-time, low-latency inference in production.
Advance Speech-to-Text Modeling
Build and fine-tune large-scale ASR systems robust to accents, noise, telephony artifacts, and code-switching.
Leverage self-supervised pretraining and large-scale weak supervision.
Improve transcription accuracy for real-world enterprise scenarios, including structured extraction and conversational nuance.
Pioneer Neural Audio Codecs
Research and implement neural audio codecs that achieve extreme compression with minimal perceptual loss.
Explore discrete and continuous latent representations for scalable speech modeling.
Design codec architectures that enable downstream generative modeling and controllable synthesis.
Develop Scalable Training Pipelines
Curate and process massive audio datasets across languages, speakers, and environments.
Design staged training curricula and data filtering strategies.
Scale training across distributed GPU clusters, with a focus on cost, throughput, and reliability.
Run Rigorous Experiments
Design ablation studies that isolate the impact of architectural changes.
Measure improvements using both objective metrics and perceptual evaluations.
Validate ideas quickly through focused experiments that confirm or eliminate hypotheses.

Experience with self-supervised learning, multimodal modeling, or generative modeling.
Hands-on experience building or scaling TTS, STT, or neural audio codec systems.
Familiarity with large-scale speech datasets and real-world audio variability.
Experience training and serving large models on modern accelerators.
Track record of designing controlled experiments and meaningful ablations.
Comfortable in fast-moving startup environments.

Ability to derive new formulations and implement them efficiently.
Strong intuition for audio quality, prosody, and conversational dynamics.
Knowledge of inference optimization techniques, including quantization, kernel optimization, and memory efficiency.
Understanding of real-time constraints in telephony or streaming environments.
Ability to move quickly from hypothesis to validation.
Strong ownership mindset from research through deployment.
Excited by ambiguous, unsolved problems.