Senior Principal Software Engineer - AI Infrastructure Remote
About the Role
We are seeking a highly skilled Senior Principal Software Engineer for AI Infrastructure (remote) to join our GPU Availability and Monitoring team at Oracle. In this role you will design and develop architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services, all of which are essential for running distributed AI/ML/HPC workloads across thousands of GPUs. You will architect scalable monitoring and repair solutions for AI infrastructure components, ensuring peak performance for customer workloads.
What You'll Do
- Architect solutions to scale and optimize monitoring and repair for components such as GPU, CPU, network, and storage.
- Develop best-in-class AI compute infrastructure, ensuring services are modularized, secure, reliable, and actively monitored.
- Collaborate with cross-functional teams to understand requirements and design solutions that meet them.
- Optimize software development processes to improve developer efficiency.
- Mentor junior developers and drive modern software engineering practices.
- Develop benchmark metrics and automation to track performance and reliability across customer workloads.
- Stay updated with industry trends and emerging technologies in distributed systems and AI infrastructure management.
Requirements
- BS in Computer Science, Engineering, or related field.
- 10 years of experience in software development with languages such as C, C++, C#, Java, Go, or Rust.
- 5 years of experience designing large-scale distributed systems.
- 3 years of experience providing technical leadership to cross-functional teams.
- Strong communication skills and a systematic problem-solving approach.
- Experience with cloud infrastructure platforms such as OCI, AWS, Azure, or GCP.
- Familiarity with API design and with containerization technologies such as Docker.
Nice to Have
- Experience with AI-powered tools and platforms.
- Familiarity with data management practices.
- Knowledge of Agile development methodologies.
What We Offer
- Comprehensive benefits package including medical, dental, and vision insurance.
- 401(k) with company match and flexible spending accounts.
- Paid time off and sick leave policies.
- Employee Stock Purchase Plan and financial planning assistance.
- Opportunities for professional growth and development.
This role offers a unique opportunity to lead AI infrastructure projects at Oracle, with a competitive salary and comprehensive benefits.
Who Will Succeed Here
- Expertise in C, C++, and Java with a strong understanding of performance optimization in distributed systems, particularly in AI/ML workloads.
- Proficiency in Docker and cloud infrastructure services (e.g., AWS, Azure) to efficiently manage containerized applications and deployment pipelines in a remote environment.
- A mindset geared towards continuous improvement and scalability, with a focus on architecting robust API designs that support high availability and fault tolerance.