Senior Principal Software Engineer - AI Infrastructure Innovation
About the Role
We are seeking a Senior Principal Software Engineer - AI Infrastructure Innovation to join our team at Oracle. In this remote position, you will be at the forefront of pioneering next-generation AI and HPC networking for GPU superclusters at massive scale. Your expertise will help us design and deliver state-of-the-art RDMA-based networking solutions that enable our customers to achieve high performance for AI training and inference.
What You'll Do
- Lead the architecture, system design, and implementation of high-performance RDMA solutions across OCI’s AI/HPC platforms.
- Innovate on network and TCP performance, identifying necessary changes across kernel, NIC, switch, transport, protocol, storage, and GPU communications.
- Develop production-grade, high-performance software features with a focus on reliability, observability, and security.
- Define performance goals and success metrics; design benchmarks and conduct large-scale experiments to validate throughput, latency, and tail behavior.
- Collaborate with GPU platform, storage, database, and control-plane teams to deliver end-to-end solutions and influence OCI-wide network architecture and standards.
- Mentor engineers, provide technical leadership and reviews, and contribute to long-term roadmap and technical strategy.
Requirements
- Strong software engineering background with a deep understanding of data structures and algorithms.
- Experience in developing, shipping, and operating high-performance production code.
- Demonstrated ability to lead technically, mentor others, and deliver results in complex problem spaces.
- BS/MS in Computer Science, Electrical/Computer Engineering, or equivalent practical experience.
- Experience with RDMA networking (RoCE and/or InfiniBand) is preferred.
- Familiarity with AI/HPC stacks and workloads, including NCCL/RCCL/MPI, Slurm, and GPU communication patterns.
- Hands-on experience with observability and performance tooling (e.g., eBPF, perf, flame graphs).
Nice to Have
- Experience integrating GPUDirect and NVMe-oF access in production.
- Knowledge of SLO-driven operations at scale.
What We Offer
- Comprehensive benefits package including medical, dental, and vision insurance.
- 401(k) Savings and Investment Plan with company match.
- Flexible paid time off with 13 days of vacation annually for the first three years, increasing to 18 days thereafter.
- Paid parental leave and adoption assistance.
- Employee Stock Purchase Plan and financial planning services.
- Voluntary benefits including auto, homeowner, and pet insurance.
This role offers a unique opportunity to lead AI infrastructure innovation at Oracle, with a competitive salary and comprehensive benefits package.
Who Will Succeed Here
- Deep expertise in RDMA (Remote Direct Memory Access) and HPC (High Performance Computing) systems, demonstrating the ability to optimize networking solutions for GPU superclusters and enhance AI training performance.
- Self-motivated and proactive work style suitable for remote environments, with a strong ability to manage time effectively, collaborate asynchronously, and deliver results without direct supervision.
- A results-oriented mindset with a proven track record in performance tuning and observability practices, ensuring that AI applications run efficiently and meet high-performance benchmarks.