Site Reliability Engineer
Olympia, Washington
Job ID R1908404-1
Date posted Aug. 13, 2019
Site Reliability Engineer, Cloud Management
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that internally critical and externally visible systems have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance. SRE is a mindset and a set of engineering approaches focused on optimizing existing systems, building infrastructure, and eliminating work through automation. As a Site Reliability Engineer on the Cloud Management team, you will build and operate cloud management solutions for VMware services offered across multiple public and private clouds.
Our team focuses on common service components across the stack. We develop and operate solutions to support public cloud management, CI/CD, container orchestration, security, and monitoring, closing potential gaps between software and service requirements.
We work with various Software Engineering teams building high-performance, reliable cloud systems. You will tackle a variety of business, infrastructure, security, and application problems in a complex ecosystem. You will collaborate with many SaaS teams across all disciplines. These teams will look to you for support and guidance on how to build and operate complex services. Our team is directly responsible for solutions around cloud management, security, reliability, and visibility into cloud systems.
Because the SaaS business runs 24x7, the role requires rotational on-call availability (weekdays at work, evenings and weekends for service- and system-related incidents).
Success in this role requires very strong technical skills, a broad background in and understanding of every layer of the software development and cloud ecosystem, and an excellent understanding of the cloud and container management stacks. You should be comfortable working independently and as part of a specialized team.
3+ years in various DevOps/SRE roles
3+ years of experience working with AWS
Experience administering Linux systems in a production environment
Experience in building and running large-scale systems and application architectures
Deep understanding of system performance and monitoring
Understanding of containers and container orchestration
Experience in one or more of the following languages: Python, Java, Go and/or NodeJS
Excellent project management skills and the ability to work in a fast-paced and hectic work environment
Demonstrated skills in priority setting, analysis, communication, time management, scheduling, and multitasking
Proven verbal and written communication skills
BS or MS degree in Computer Science or a related field
U.S. citizen able to attain a U.S. government security clearance and pass regular background investigations
Experience with modern container orchestration systems: Kubernetes, Mesos, DC/OS, Swarm
Experience with infrastructure configuration and automation processes and tools: Terraform, Puppet, Ansible, Chef, Fabric
Experience with security in the cloud: intrusion, penetration, and vulnerability scanning
Experience with monitoring solutions: ELK, Splunk, SUMO, Nagios, Prometheus
Experience with various data technologies including relational and non-relational databases and message queues
Good working knowledge of build automation and the continuous integration/delivery ecosystem: Git, Gerrit, Maven/Gradle, Jenkins, Docker, Nexus, Artifactory, Selenium