Description:
JOB SUMMARY
Machine Learning Engineer

We are seeking a skilled Machine Learning Engineer (Contractor) to support and maintain critical batch workflows that generate large-scale forecasts. These workflows are orchestrated through a custom in-house scheduling tool and leverage R, Python, Bash, and Spark on a YARN-managed on-premise cluster. The engineer will be responsible for ensuring smooth daily operations, including monitoring, restarting, and troubleshooting jobs to minimize downtime and maintain system reliability. Strong problem-solving skills and the ability to quickly diagnose issues across multiple technologies will be key in this role.

In addition to operations, the engineer will contribute development expertise by enhancing the functionality, stability, and performance of existing jobs. This will include submitting code changes in Python (PySpark), Bash, or Terraform to improve orchestration and infrastructure configurations. The role also involves implementing observability metrics and monitoring solutions, using tools such as OTEL, Kibana, REST APIs, and custom instrumentation. The ideal candidate will be comfortable collaborating via GitHub (PRs), proactive in identifying improvement opportunities, and effective at balancing operational support with development contributions.

Required Skills

Technical Skills:
Proficiency with Python (PySpark), Bash, and working knowledge of R
Experience with Apache Spark on YARN-managed clusters (large-scale, on-premise environments preferred)
Familiarity with workflow orchestration tools (Airflow, Luigi, or custom equivalents)
Experience with Terraform (infrastructure-as-code)
Strong background in job monitoring and troubleshooting in distributed environments
Knowledge of observability/monitoring practices using OTEL, Kibana, REST APIs, and custom metrics instrumentation
Hands-on experience with GitHub workflows (pull requests, branching strategies, code reviews)

Soft Skills:
Strong analytical and troubleshooting skills with attention to detail
Clear and effective communication, especially in cross-functional environments
Ability to prioritize operational stability while driving code improvements
Proactive mindset with a focus on reliability and continuous improvement
Collaborative attitude, able to work effectively with developers, data scientists, and operations staff

TECHNICAL SKILLS

Must Have
Apache Hadoop, Apache Hive, Apache Spark, Apache Spark ecosystem, Big Data
Docker
Git/GitHub
PySpark
Python

Nice To Have
Airflow or Similar Orchestration Tools
Bash Scripting
Grafana
Kibana
MLOps
OpenTelemetry
R
Terraform

Must have (Spark, Hadoop, orchestration, observability, etc.)

VIVA is an equal opportunity employer. All qualified applicants have an equal opportunity for placement, and all employees have an equal opportunity to develop on the job. This means that VIVA will not discriminate against any employee or qualified applicant on the basis of race, color, religion, sex, sexual orientation, gender identity, national origin, disability or protected veteran status.