Principal Software Architect
Company: NVIDIA
Location: Redmond
Posted on: April 25, 2024
Job Description:
We are now looking for a Principal Software Architect for AI and
HPC. At NVIDIA, we are advancing the frontiers of AI capabilities.
We seek an expert in high-performance computing and AI to design
and develop software resiliency features for training AI models on
the world's largest and most powerful supercomputers. In this role,
you will outline mission requirements for ultra large-scale AI
supercomputers, thoroughly investigate and evaluate Reliability,
Availability, and Serviceability (RAS) feature
designs, establish software requirements and evaluation metrics,
and oversee the complete implementation of RAS features in
software. As a leader in HPC and AI software development, you will
interact with multiple teams across the organization. Your
responsibilities include conducting regular reviews and check-ins
with execution teams, ensuring the timely delivery of essential RAS
software features such as checkpoint-recovery logic, error
detection and attribution, error containment, silent data corruption (SDC) detection, and
other related RAS elements. Leading cross-organizational efforts
among various stakeholders and teams, you will coordinate
priorities with senior leadership, provide timely updates, and
ensure adequate resourcing for the projects.
What You'll Be Doing:
- Collaborate with both internal and external customers and
partners to define innovative Reliability, Availability, and
Serviceability (RAS) requirements and objectives for present and
future AI supercomputing products.
- Oversee and guide the development of RAS features across the
entire AI stack, encompassing aspects from job-level scheduling and
AI application frameworks (such as PyTorch), down to driver-level
and hardware health monitoring on GPUs.
- Develop and maintain comprehensive software roadmaps, ensuring
alignment with diverse engineering teams and synchronizing with
engineering and product leadership for strategic coherence.
- Drive successful implementation and execution of RAS features
in software, with demonstrable improvements in end-to-end metrics
such as availability during large-scale training runs.
What We Need to See:
- A Master's or Ph.D. in Computer Science, Electrical Engineering, or
Computer Engineering from a reputable university, or equivalent
professional experience.
- 15+ years of industry experience in systems architecture or
related fields, demonstrating a deep understanding of system
complexities.
- Proven ability to work and communicate effectively in a
collaborative environment, bridging multiple engineering
disciplines.
- At least 5 years of hands-on experience in software
development, preferably in high-complexity projects involving HPC
or AI.
Ways to Stand Out From the Crowd:
- Demonstrated experience with large-scale AI supercomputing
applications, particularly in training and inference stages.
- In-depth knowledge of the requirements for large-scale AI
workload training and inference.
- A strong passion for and experience in developing system
architectures tailored for AI applications, encompassing CPU, GPU,
memory, storage, and networking.
- Hands-on involvement in the entire lifecycle - from design to
deployment - of large-scale High-Performance Computing (HPC)
systems.
- Practical experience in adopting and implementing HPC software
development practices in large-scale system environments.
As NVIDIA makes inroads into the Datacenter business, our team plays a
central role in getting the most out of our exponentially growing
datacenter deployments as well as establishing a data-driven
approach to hardware design and system software development. We
collaborate with a broad cross section of teams at NVIDIA, ranging
from DL research teams to CUDA Kernel and DL Framework development
teams to Silicon Architecture teams.
NVIDIA is widely considered
to be one of the technology world's most desirable employers. We
have some of the most forward-thinking and hardworking people on
the planet working for us. If you're creative and autonomous, we
want to hear from you!
The base salary range is 272,000 USD - 419,750 USD. Your base
salary will be determined based on your location, experience, and
the pay of employees in similar positions. You will also be
eligible for equity and benefits.
NVIDIA accepts applications on an ongoing basis.
NVIDIA is committed to fostering a diverse work environment and proud to be
an equal opportunity employer. As we highly value diversity in our
current and future employees, we do not discriminate (including in
our hiring and promotion practices) on the basis of race, religion,
color, national origin, gender, gender expression, sexual
orientation, age, marital status, veteran status, disability status
or any other characteristic protected by law.
Keywords: NVIDIA, Seattle, Principal Software Architect, Education / Teaching, Redmond, Washington