Principal Network Development Engineer, ML Networking
Amazon.com
The Performance Assured Networking (PAN) organization delivers high-performance networks for running ML workloads, using specialized network products and a custom control-plane solution to meet the scale, performance, and availability needs of such workloads. The organization owns five inter-related product portfolios. First is the ML network itself and the network connectivity service it provides to ML servers. Second is AWS Intent Driven Networking (AIDN), our control plane in which network routing and forwarding behaviors, called Intents, can be programmed across an entire network using highly available APIs. AIDN uses closed-loop actors to program network devices and keep the network in sync with the specified Intent. Third is SIDR (Scalable Intent Driven Routing), our only AIDN actor in production: a fabric routing protocol and network controller system that leverages the prescriptive nature of our networks, allowing topology, prefixes, and policy to be controlled using Intents. SIDR uses a multi-phase commit (MPC) mechanism with built-in rollback to distribute and atomically enable administrative changes across a single fabric, and it responds rapidly to network events within the fabric, minimizing customer impact. Fourth is a set of safety systems that ensures changes being rolled out to the fabric will not cause customer impact. Fifth is AWACS, a set of off-the-box services that enables WCMP-based traffic engineering in existing DC fabrics to increase the effective capacity of the CLOS network and provide capacity safety for shared failure domains. All of these products and services are operational; each is in a different stage of expansion and new-capability development.
Key job responsibilities
This Principal Engineer will take ownership of ML network performance dependent on the EC2 interface, a critical capability that directly impacts our customers' ability to train and deploy ML models efficiently. In the immediate term, they'll tackle one of our most pressing challenges: building a comprehensive understanding of network performance for ML workloads in production. This means designing and implementing systems that can intelligently measure and baseline performance without direct visibility into customer applications.
Over the next 12-18 months, they'll need to transform how we approach ML networking. This starts with developing new ways to identify and classify network traffic patterns from ML training, building systems that can automatically tune network configurations based on observed workload characteristics. They'll architect flexible abstractions that allow us to quickly adapt to new ML training patterns while maintaining peak performance for existing workloads.
The role requires someone who can move from theoretical understanding to practical implementation. They'll need to deliver a production-grade telemetry system that provides actionable insights about network performance, develop new approaches to baseline measurements, and demonstrate concrete performance improvements for key ML workloads. Success in this role means not just solving today's performance challenges, but building systems flexible enough to handle tomorrow's ML innovations.
This PE will be the technical authority for ML networking performance at AWS, working across teams to drive adoption of their approaches and establishing best practices that will shape how we build and operate our ML infrastructure for years to come.