At a glance
- Responsibilities: Design and maintain large-scale HPC/AI clusters while collaborating with researchers and developers.
- Employer: Join NVIDIA, a leader in groundbreaking computing technologies and AI advancements.
- Employee benefits: Enjoy a diverse workplace with opportunities for growth and innovation in cutting-edge tech.
- Why this job: Be at the forefront of AI and HPC, contributing to revolutionary solutions and workflows.
- Desired qualifications: 5+ years in HPC/AI with expertise in Linux, job scheduling, and automation tools required.
- Other information: NVIDIA values diversity and provides accommodations for applicants with disabilities.
The expected salary is between €72,000 and €84,000 per year.
NVIDIA is looking for an experienced HPC Engineer to join the E2E software verification HPC/AI Infrastructure team. We are focused on building supercomputers and HPC clusters based on groundbreaking technologies. We are looking for an outstanding architect for a senior HPC position, to be a key player in the most exciting computing hardware and software, contributing to the latest breakthroughs in artificial intelligence and GPU computing. You will provide insights on at-scale system design and tuning mechanisms for large-scale compute runs. You will work with the latest accelerated computing and deep learning software and hardware platforms, collaborating with many scientific researchers, developers, and customers to craft improved workflows and develop new, differentiated solutions. You will interact with HPC, OS, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.
What you will be doing:
- Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting.
- Manage Linux job/workload schedulers and orchestration tools.
- Develop and maintain continuous integration and delivery pipelines.
- Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
- Deploy monitoring solutions for the servers, network, and storage.
- Perform troubleshooting at every level, from bare metal and operating system to software stack and application.
- As a technical resource, develop, redefine, and document standard methodologies to share with internal teams.
- Support Research & Development activities and engage in POCs/POVs for future improvements.
What we need to see:
- A degree in Computer Science, Engineering, or a related field and 5+ years of experience.
- Knowledge of HPC and AI solution technologies from CPUs and GPUs to high speed interconnects and supporting software.
- Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s).
- Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) internals and networking (sockets, firewalld, iptables, Wireshark, etc.), ACLs and OS-level security protection, and common protocols such as TCP, DHCP, and DNS.
- Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS. Familiarity with newer and emerging storage technologies.
- Python programming and bash scripting experience.
- Comfortable with automation and configuration management tools such as Jenkins, Ansible, Puppet/Chef.
- Deep knowledge of networking protocols such as InfiniBand and Ethernet.
- Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix).
- Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud).
Ways to stand out from the crowd:
- Knowledge of CPU and/or GPU architecture.
- Knowledge of Kubernetes and container-related microservice technologies.
- Experience with GPU-focused hardware/software (DGX, CUDA).
- Background with RDMA (InfiniBand or RoCE) fabrics.
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.
Senior HPC AI Engineer Employer: TN Switzerland
Contact person:
TN Switzerland HR Team
StudySmarter application tips 🤫
How to get the job: Senior HPC AI Engineer
✨Tip Number 1
Make sure to showcase your experience with HPC and AI technologies prominently. Highlight specific projects where you've designed or managed large-scale HPC clusters, as this will resonate well with the team at NVIDIA.
✨Tip Number 2
Familiarize yourself with the latest tools and technologies mentioned in the job description, such as Slurm, Kubernetes, and various storage solutions. Being able to discuss these in detail during your interactions will demonstrate your expertise and enthusiasm for the role.
✨Tip Number 3
Engage with the HPC and AI community online. Participate in forums or discussions related to GPU computing and share your insights. This can help you build a network and may even catch the attention of someone at NVIDIA.
✨Tip Number 4
Prepare to discuss your troubleshooting strategies and experiences in-depth. Given the technical nature of the role, being able to articulate how you've resolved complex issues in past projects will set you apart from other candidates.
These skills make you a top candidate for the position: Senior HPC AI Engineer
Tips for your application 🫡
Tailor Your CV: Make sure your CV highlights relevant experience in HPC and AI technologies. Emphasize your knowledge of job scheduling tools like Slurm and Kubernetes, as well as your programming skills in Python and bash scripting.
Craft a Strong Cover Letter: In your cover letter, express your passion for HPC and AI. Mention specific projects or experiences that demonstrate your ability to design and maintain large-scale HPC/AI clusters, and how you can contribute to NVIDIA's innovative environment.
Showcase Relevant Projects: Include examples of past projects where you implemented or managed HPC systems. Detail your role, the technologies used, and the outcomes achieved. This will help illustrate your hands-on experience and problem-solving skills.
Highlight Collaboration Skills: Since the role involves working with researchers and developers, emphasize your teamwork and communication skills. Provide examples of how you've successfully collaborated on technical projects in the past.
How to prepare for an interview at TN Switzerland
✨Showcase Your Technical Expertise
Be prepared to discuss your experience with HPC and AI technologies in detail. Highlight specific projects where you've designed or maintained large-scale HPC/AI clusters, and be ready to explain the challenges you faced and how you overcame them.
✨Demonstrate Problem-Solving Skills
Expect technical questions that assess your troubleshooting abilities. Prepare examples of how you've diagnosed and resolved issues at various levels, from bare metal to application level, and be ready to walk through your thought process.
✨Familiarize Yourself with Relevant Tools
Make sure you are well-versed in job scheduling and orchestration tools like Slurm and Kubernetes. Be ready to discuss your experience with automation tools such as Jenkins and Ansible, and how you've used them to streamline processes.
✨Engage with the Interviewers
During the interview, ask insightful questions about the team's current projects and future goals. This shows your genuine interest in the role and helps you understand how you can contribute to their success.