Roles and responsibilities
Our cloud operations engineers bring Python software-engineering skills and rigour to the operations domain. We practise DevSecOps from bare metal to application. We architect and run OpenStack, Kubernetes and software-defined storage, and we enable DevSecOps for the applications running on that infrastructure.
To join this team, you need to be a software engineer fluent in Python, with a genuine interest in the full open source infrastructure stack from metal to containers and the ability to work in a high-pressure operations environment running mission-critical services for global brand-name customers.
As a member of the team you will gain experience in a broad range of cloud technologies. We evolve our offerings as the state of the art improves, so you get to stay current with the latest capabilities in open source infrastructure. We drive upgrades to keep our customers on the latest, best solutions.
What we are looking for in you
- Degree in Software Engineering or Computer Science
- Experience with Linux and familiarity with Linux networking and storage
- Python software development expertise
- Operational experience
- Excellent interpersonal skills, curiosity, flexibility, and accountability
- Ability to travel internationally twice a year, for company events up to two weeks long
Nice-to-have skills
- Experience with OpenStack or Kubernetes deployment or operations
- Familiarity with public or private cloud management
What we offer colleagues
We consider geographical location, experience, and performance in shaping compensation worldwide. We revisit compensation annually (and more often for graduates and associates) to ensure we recognise outstanding performance. In addition to base pay, we offer a performance-driven annual bonus or commission. We provide all team members with additional benefits, which reflect our values and ideals. We balance our programmes to meet local needs and ensure fairness globally.
- Distributed work environment with twice-yearly team sprints in person
- Annual compensation review
- Recognition rewards
- Annual holiday leave
- Maternity and paternity leave
- Employee Assistance Programme
- Opportunity to travel to new locations to meet colleagues
- Priority Pass and travel upgrades for long-haul company events
Desired candidate profile
1. Reliability Engineering
- Availability and Performance: Ensure that the systems, applications, and services are highly available and performant. Monitor uptime, response times, and system health, taking proactive steps to address potential issues.
- Service Level Objectives (SLOs) and Service Level Indicators (SLIs): Define, monitor, and maintain SLOs, SLIs, and Service Level Agreements (SLAs) for various services, ensuring that reliability goals are met or exceeded.
- Incident Management and Resolution: Act as an escalation point for incidents, diagnose and resolve production issues in real time, and lead postmortem analysis to prevent recurrence.
- Capacity Planning and Scalability: Ensure that systems can scale to meet increased demand, using techniques such as load balancing, auto-scaling, and horizontal scaling.
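As an illustration of the SLO and SLI work described above, here is a minimal Python sketch of how an availability SLI and the remaining error budget might be computed from request counts. The names `SLO`, `availability`, and `error_budget_remaining`, and the example figures in the note below, are hypothetical and not taken from any particular tool or from our own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A hypothetical availability objective for one service."""
    name: str
    target: float  # fraction of requests that must succeed, e.g. 0.999


def availability(good: int, total: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total == 0:
        return 1.0  # no traffic means nothing failed
    return good / total


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo.target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        # Either no traffic or a 100% target: any failure exhausts the budget.
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures
```

For example, with a 99.9% target and 1,000,000 requests, the budget allows roughly 1,000 failed requests, so 500 failures would leave about half of the budget unspent.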
2. Automation and Infrastructure Management
- Infrastructure as Code (IaC): Write and manage infrastructure code (using tools like Terraform, Ansible, or CloudFormation) to automate the provisioning, configuration, and management of cloud resources and infrastructure.
- CI/CD Pipelines: Build, improve, and maintain Continuous Integration and Continuous Deployment pipelines, ensuring that software is deployed quickly, safely, and reliably.
- Configuration Management: Use automation tools to manage system configurations and deployments, reducing manual intervention and improving consistency.
- Monitoring and Alerting: Implement and maintain comprehensive monitoring and alerting systems to detect issues early. Use tools like Prometheus, Grafana, the ELK stack, Datadog, or New Relic so that system health is tracked continuously.
3. Collaboration with Development Teams
- DevOps Practices: Work closely with development teams to bridge the gap between software development and operations. Implement best practices for building and running software in production environments, promoting a culture of DevOps.
- Code Review and Guidance: Participate in code reviews, providing feedback on application code, infrastructure code, and architectural decisions to improve reliability and maintainability.
- Incident Response: Work alongside development teams to identify root causes of incidents, recommend fixes, and ensure future incidents are prevented through improved practices.
4. Security and Compliance
- Security Best Practices: Keep systems secure by following best practices for infrastructure, network, and application security, including access control, encryption, and vulnerability management.
- Compliance: Ensure the systems comply with relevant standards and regulations (e.g., PCI DSS, HIPAA, GDPR) and that security measures are in place to meet compliance requirements.
- Disaster Recovery and Business Continuity: Design and implement disaster recovery plans and business continuity procedures, ensuring that critical systems can be restored quickly in the event of a failure.