Senior Software Reliability Engineer, AI (Remote)
About the Role
We are seeking a Senior Software Reliability Engineer for AI to join our team at MixMode in a fully remote position. This role is crucial for improving the reliability and performance of our AI systems, which are at the forefront of cybersecurity technology. Your contributions will directly affect the security of large organizations and critical infrastructure.
What You'll Do
- Own the reliability, performance, and operational health of production AI systems, focusing on improving complex, existing services.
- Lead efforts to refactor and harden the AI codebase to improve observability, maintainability, and resilience.
- Diagnose and resolve issues across distributed systems, including latency, throughput, data pipelines, and resource utilization.
- Design and build monitoring, alerting, and debugging tools for high-availability services.
- Partner with researchers and ML engineers to productionize models at scale.
- Establish best practices for testing, deployment, capacity planning, and incident response.
Requirements
- 5+ years of experience in software engineering, with a focus on reliability engineering.
- Strong understanding of distributed systems, cloud infrastructure, and container orchestration (Kubernetes).
- Experience with AI/ML systems and their operational challenges.
- Proficiency in programming languages such as Python, Go, or Java.
- Familiarity with monitoring tools and frameworks (e.g., Prometheus, Grafana).
- Excellent problem-solving skills and attention to detail.
Nice to Have
- Experience in cybersecurity or related fields.
- Knowledge of data pipelines and ETL processes.
- Familiarity with CI/CD practices and tools.
What We Offer
- Competitive salary ranging from $140,000 to $180,000 per year.
- Fully remote work environment with flexible hours.
- Opportunity to work with cutting-edge AI technologies.
- Collaborative and innovative team culture.
- Health and wellness benefits.
- Professional development opportunities.
This role offers a unique opportunity to work at the intersection of AI and cybersecurity, focusing on reliability and performance in a fully remote environment.
Who Will Succeed Here
- Proficient in Python and Go, with a strong understanding of concurrency and parallelism to optimize AI system performance and reliability.
- Self-motivated and disciplined enough to excel in a fully remote work environment, with effective time management and proactive communication skills.
- Deep experience deploying and managing Kubernetes clusters, using tools like Prometheus and Grafana for monitoring and alerting to ensure system reliability.