Site reliability engineers (SREs) bridge development and operations by applying a software engineering mindset to system administration and by building systems and software that improve the reliability, stability, and performance of production sites.
SREs collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability. By building self-service tools for the user groups that rely on their services (e.g., automatic provisioning of test environments, and visualization of logs and statistics), SREs reduce work in progress for all parties and allow developers to focus on feature development.
- Engage in and improve the whole lifecycle of services, from inception and design, through development, capacity planning, and launch reviews, to deployment, operation, and refinement
- Design and implement software platforms and monitoring frameworks for efficient, automated, and intelligent service-oriented architecture (SOA) governance
- Scale systems sustainably through mechanisms such as automation; evolve systems by pushing for changes that improve reliability, efficiency, and velocity
- Practice sustainable user support, incident response, and blameless postmortems.
- Implement automation and industry best practices to run our large-scale, rapidly growing infrastructure with minimum human intervention.
- Design and implement monitoring and recovery tools that help us meet performance and availability SLAs.
- Research, design, and implement solutions for fault tolerance, monitoring, performance enhancements, capacity optimization, disaster recovery, and configuration management of systems and applications.
- Evaluate and implement third-party platforms as core elements of our own solution, such as streaming platforms and ETL platforms.
- Configure and build tools for Continuous Integration and Continuous Deployment
- Set standards with the development team for how code should be optimally structured for easy deployment.
- Recommend new technologies to ensure quality and productivity.
- Bachelor's degree in Computer Science or a related field, with at least 2 years of related work experience
- SRE experience designing and supporting large-scale distributed systems and deployments with high reliability and scalability
- Experience working on teams with a heavy emphasis on DevOps, automation, CI/CD, and quality
- Experience working in an Agile development environment.
- Familiarity with Linux system operations and networking
- Experience with configuration management tools such as Puppet, Chef, or Ansible
- Effective communication skills and a sense of ownership and drive
- Ability to work with minimal supervision while collaborating with colleagues across multiple departments