THE ROLE:
We are seeking a skilled and motivated Software Development Engineer to join our Training at Scale team. In this role, you will develop tools and automation to support large-scale model training on the latest client GPUs. You'll work closely with engineers across teams to optimize training workloads, manage CI/CD pipelines, and ensure reliable, high-performance releases. This is a hands-on engineering position with a strong focus on distributed systems, performance, and automation at scale.

THE PERSON:
The ideal candidate brings deep experience in open-source software (OSS) release cycles and container-based packaging (e.g., Docker), along with strong debugging skills, particularly around model training workloads. You thrive in fast-paced environments and are passionate about automation, system reliability, and continuous improvement.

KEY RESPONSIBILITIES:
- Manage and maintain nightly builds for multiple training frameworks
- Collaborate on integrating new training workloads and expanding test coverage
- Ensure the stability and releasability of the main branch at all times
- Update and maintain build processes to support biweekly release and performance goals
- Handle and deliver ad-hoc development test builds as requested
- Track build performance and reliability metrics over time

PREFERRED EXPERIENCE:
- Experience with open-source software contributions and release management
- Strong hands-on experience with Docker and container-based workflows
- Excellent problem-solving skills and attention to detail
- Ability to work independently and a willingness to learn new technologies quickly

ACADEMIC CREDENTIALS:
- Bachelor's degree in Computer Science, Engineering, or a related technical field

NOTES:
Onsite/Hybrid