Remote Site Reliability Engineer - AI/ML Focus
About the Role
Join Mistral AI as a Remote Site Reliability Engineer and play a pivotal role in shaping the reliability, scalability, and performance of our AI-driven platform. You will collaborate closely with software engineers and research teams to ensure our systems not only meet but exceed customer expectations.
What You'll Do
- Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure to support our web services and machine learning workloads.
- Keep our platform, inference, and model training environments highly available, with seamless replication across multiple HPC clusters.
- Operate systems and troubleshoot issues in production environments, including on-call responses and infrastructure scaling.
- Implement and enhance monitoring, alerting, and incident response systems to optimize performance and minimize downtime.
- Drive continuous improvement in infrastructure automation and orchestration using tools like Kubernetes, Flux, and Terraform.
- Collaborate with AI/ML researchers to develop solutions that enable safe and reproducible model-training experiments.
- Document processes and procedures to ensure consistency and knowledge sharing across the team.
- Contribute to open-source projects, research publications, and blog articles.
Requirements
- Master’s degree in Computer Science, Engineering, or a related field.
- 7+ years of experience in a DevOps/SRE role.
- Strong experience with cloud computing and highly available distributed systems.
- Hands-on experience with CI/CD, containerization, and orchestration tools (Docker, Kubernetes).
- Proficiency in programming and scripting languages (Python, Go, Bash) and knowledge of software development best practices.
- Excellent problem-solving and communication skills.
- Self-motivated and able to work well in a fast-paced startup environment.
Nice to Have
- Experience in an AI/ML environment.
- Experience with high-performance computing (HPC) systems and workload managers (Slurm).
- Familiarity with infrastructure-as-code tools like Terraform or CloudFormation.
What We Offer
- Competitive salary and equity options.
- Health insurance coverage.
- Transportation and sport allowances.
- Meal vouchers and a private pension plan.
- Generous parental leave policy.
- Visa sponsorship available.
This Remote Site Reliability Engineer position at Mistral AI offers a unique opportunity to work with cutting-edge AI technology in a collaborative environment. With competitive compensation and a focus on innovation, it's a great chance to make a significant impact.
Who Will Succeed Here
- Proficient in Kubernetes and Terraform for building and managing container orchestration and infrastructure as code, ensuring seamless deployment and scaling of AI/ML applications.
- Strong experience with CI/CD pipelines and Docker to automate the deployment processes, allowing for rapid iteration and deployment of AI-driven features in a remote setting.
- A proactive problem-solver with a deep understanding of monitoring and logging tools, such as Prometheus and Grafana, to maintain system reliability and performance under varying loads.