Sr. DevOps Engineer - Amazon Search

A9 Palo Alto, CA

A9.com, an Amazon company, creates powerful, customer-focused search and advertising solutions and technologies. Whenever a customer visits an Amazon site worldwide and types in a query or browses through product categories, Amazon Search services go to work. We design, develop, and deploy high performance, fault-tolerant distributed search systems used by millions of Amazon customers every day.

Our Search Operations team operates one of the Internet's largest search infrastructures, made up of thousands of servers, supporting millions of customers performing hundreds of millions of queries - all delivered in milliseconds. Members of this team are continuously working to maximize performance, availability and efficiency to help solve complex scaling and growth challenges.

What A9.com Operations Does:

Join us and be part of the team running one of the Internet's largest search infrastructures. This team combines application operations, systems engineering, and software development expertise to run distributed, large-scale and fault-tolerant Tier-1 systems. We need customer-obsessed engineers who relentlessly focus on performance, availability and efficiency to help us solve our complex scaling and growth challenges. Come push the envelope into the next phase of distributed computing and see your work used by millions of customers every day.

Responsibilities:

* Site reliability and application operations of the Amazon search service.

* Project ownership of engineering initiatives from inception, actively engaging during design reviews and development efforts to ensure a sound deployment plan and mitigation of operational burden.

* Represent the Engineering Operations team on key engineering releases and features – ensure operational readiness and communicate deployment and mitigation planning to worldwide Ops team.

* Lead investigations and propose high-impact initiatives and projects, then drive them by working with other ops or search development engineers.

* Diagnose and mitigate critical failures in high-pressure situations, and communicate status on high-severity events while on-call.

* Fleet and application performance analysis and scaling to keep up with business growth and improve efficiency.

* Analyze big data sets to identify optimization opportunities and act on them.

* Perform troubleshooting deep-dives on system and application issues, driving root cause resolution with a sense of urgency.

* Develop tools and scripts to automate manual processes or improve existing frameworks.

* Mentor and lead operations engineers on projects, troubleshooting or large scale events. Train engineers on best practices and drive operational excellence.

* Daytime on-call support, monitoring, and deployment management as part of a worldwide shared rotation. This is 8x7 daytime on-call once every 5-6 weeks.

At A9, you'll experience the benefits of working in a dynamic, entrepreneurial environment, while leveraging the resources of Amazon.com (AMZN), one of the world's leading internet companies. We provide a highly customer-centric, team-oriented environment in our offices located in Palo Alto, California.

Basic Qualifications:

* Bachelor's degree in Computer Science or equivalent experience.

* 5+ years of recent systems administration, site reliability or DevOps experience in a medium- to large-scale production Linux or other UNIX environment.

* 5+ years of experience with Linux, Apache, DNS, monitoring, load-balancing, and caching.

* 2+ years of scripting and development experience in Bash, Perl, or Python.

Preferred Qualifications:

* Experience developing software in C++.

* Experience with Amazon Web Services (AWS), ideally S3, EC2, EMR, and DynamoDB.

* Experience managing large scale systems on the internet with a focus on application operations.

* Experience working with high-availability, distributed systems and services in a hosting environment including hardware, OS, storage, network, and database solutions.

* Experience in the development and rollout of technical operations processes and new services.

* Working knowledge of Agile development methods (Kanban, Scrum, etc.).