Oracle · 10.03.26

Principal Software Engineer - AI Infrastructure Innovation (Remote)

$97K–$223K/year

About the Role

We are seeking a Principal Software Engineer - AI Infrastructure Innovation to join our team at Oracle. This remote position offers the chance to work on pioneering AI and HPC networking solutions for GPU superclusters at massive scale. You will play a critical role in designing and delivering state-of-the-art RDMA-based networking that enables high performance for AI training and inference.

What You'll Do

  • Lead architecture, system design, and implementation for high-performance RDMA solutions across OCI’s AI/HPC platforms.
  • Innovate on network and TCP performance, identifying necessary changes across kernel, NIC, switch, transport, protocol, storage, and GPU communications.
  • Develop production-grade, high-performance software features with a focus on reliability, observability, and security.
  • Define performance goals and success metrics; design benchmarks and conduct large-scale experiments to validate throughput, latency, and tail behavior.
  • Collaborate with GPU platform, storage, database, and control-plane teams to deliver end-to-end solutions and influence OCI-wide network architecture and standards.
  • Mentor engineers, provide technical leadership and reviews, and contribute to the long-term roadmap and technical strategy.

Requirements

  • Strong software engineering background with a deep understanding of data structures and algorithms.
  • Demonstrated ability to optimize for high scale, low latency, and high throughput in large-scale systems.
  • Experience in developing, shipping, and operating high-performance production code.
  • Ability to lead technically, mentor others, and deliver results in complex problem spaces.
  • BS/MS in Computer Science, Electrical/Computer Engineering, or equivalent practical experience.

Nice to Have

  • Experience with RDMA networking (RoCE and/or InfiniBand).
  • Familiarity with AI/HPC stacks and workloads: NCCL/RCCL/MPI, Slurm, and GPU communication patterns.
  • Experience integrating GPU Direct and NVMe-oF access in production.
  • Hands-on experience with observability and performance tooling (e.g., eBPF, perf, flame graphs).

What We Offer

  • Comprehensive benefits package including medical, dental, and vision insurance.
  • 401(k) Savings and Investment Plan with company match.
  • Flexible paid time off and 11 paid holidays.
  • Paid parental leave and adoption assistance.
  • Employee Stock Purchase Plan and financial planning services.

Language Requirements

  • English: C1

Why This Job (9.2 of 10)

This role offers a unique opportunity to lead innovative AI infrastructure projects at Oracle, with a competitive salary and comprehensive benefits.

Who Will Succeed Here

Expert in RDMA and HPC technologies, with a proven track record of optimizing performance and designing scalable networking solutions for AI workloads, particularly in GPU superclusters.

Self-motivated individual who thrives in a remote work environment, demonstrating exceptional time management skills and the ability to independently drive complex projects to completion.

Deep understanding of observability tools and performance tuning techniques, combined with a mindset focused on continuous learning and adaptation to new AI technologies and infrastructure innovations.

Learning Resources

RDMA and HPC Networking Overview (article)

Career Path

  • Principal Software Engineer - AI Infrastructure Innovation (now)
  • Director of AI Infrastructure Engineering (2-4 years)
  • Chief Technology Officer (CTO) (5-7 years)

Market Overview

  • Market Size 2024: $150B
  • Annual Growth: 10.5%
  • AI Adoption in Software Engineering: 75%
  • Investment in AI Infrastructure: +50%
  • Labour Demand for AI Engineers: +25%
  • Avg Salary for Principal Software Engineers: $180K

Skills & Requirements

Required
  • Software Engineering
  • RDMA
  • AI
Growing in Demand
  • Cloud Computing
  • Containerization (Docker, Kubernetes)
  • Machine Learning Operations (MLOps)
Declining
  • Traditional Networking Protocols (e.g., TCP/IP)
  • Legacy Systems Programming (e.g., COBOL)

Domain Trends

Rise of AI-Driven Development
By 2025, 60% of software development will leverage AI tools to enhance productivity and code quality.
Increased Demand for High-Performance Computing (HPC)
The HPC market is expected to grow by 12% annually, driven by AI workloads and data-intensive applications.
Shift Towards Observability in Software Engineering
Companies adopting observability tools have seen a 30% reduction in downtime, emphasizing the importance of performance tuning and monitoring.

All job postings are automatically gathered by algorithms. We do not review or verify listings. Be careful when applying, and do not sign in with iCloud or Google services.