Cloud Engineer - Senior I - SRE
TraceLink’s software solutions and Opus Platform help the pharmaceutical industry digitize its supply chain and enable greater compliance, visibility, and decision making. They reduce disruption to the supply of medicines to patients who need them, anywhere in the world.
Founded in 2009 with the simple mission of protecting patients, today TraceLink has 8 offices, over 800 employees and more than 1,300 customers in over 60 countries around the world. Our expanding product suite continues to protect patients and now also enhances multi-enterprise collaboration through innovative new applications such as MINT.
TraceLink is recognized as an industry leader by Gartner and IDC, and for having a great company culture by Comparably.
TraceLink is the company that the world’s most trusted pharmaceutical companies choose for complete connectivity, visibility and traceability of prescription medications across their global supply chains, manufacturing and distribution operations. We’ve built the world's largest cloud-based network to connect the entire Life Sciences supply chain and eliminate counterfeit prescription drugs from the global marketplace, so that people everywhere receive the medicines they need in the safest, most secure, and most timely manner possible.
We're looking for an experienced, driven and passionate engineering team member with a background in programming, distributed systems and Kubernetes to help our SRE team improve its Service Mesh and Kubernetes architecture. The SRE group is building on the critical need to maintain visibility into the TraceLink global platform and keep it scalable. Within SRE, you'll have plenty of opportunities to share your strengths, guide us on how to build a scalable platform and collaborate closely with various engineering stakeholders.
You will work in a global team, in an inclusive environment, with AWS cloud-based deployments. You will focus on ensuring services run smoothly, continuously assess opportunities to reduce toil, help improve service availability and reliability, and optimise AWS resource usage across multiple environments to deliver cost-effective services to the engineering organisation.
- As a member of the SRE core team, ensure the high availability, performance and reliability expected by our customers, and deliver against defined OKRs
- Collaborate with engineering and business stakeholders to maintain and refine the backlog of user epics for prioritized opportunities.
- Design, build, document and test new tools and technologies as part of an Agile development team. Maintain and improve these to eliminate bugs, increase performance/efficiency, or extend capabilities
- Lead code reviews, systems design and architectural sessions to ensure that our platform and supporting services are developed/deployed using best practices.
- Drive testing, diagnosis and troubleshooting as part of a CI/CD pipeline and test automation, always expanding and improving test coverage. Help design and implement self-healing and resiliency patterns.
- Help implement the cloud engineering team’s OKRs to achieve business goals
- Play an active role in the development process: deliver on commitments, communicate issues, and work with others both within the team and in other teams
- Offer suggestions on how to improve tools and/or processes, and help define our sprint epics and stories based on business priorities
- Lead planned releases and any outage triage that impacts the infrastructure, collaborating closely with CloudOps. Lead blameless postmortems and refine playbooks to reduce MTTR
- 8+ years of experience with increasing responsibility as an SRE/DevOps/system engineer
- Strong understanding of cloud deployment and management practices
- Hands-on experience with Terraform, Helm, Docker, Kubernetes, Prometheus and Istio
- Hands-on experience with tools and techniques to diagnose and uncover container and overall system performance issues
- Skilled in AWS (or any cloud) services from both technology and cost optimisation perspectives
- Skilled in DevOps/SRE practices and build/release pipelines
- Familiarity with chaos engineering tools (Chaos Monkey, Gremlin, etc.) and demonstrable experience with increasing overall system resilience
- Experience working with mature development practices and tools for source control, security, and deployment
- Infrastructure/app performance engineering experience is a nice-to-have
- Excellent communication skills, written and verbal
- Strong analytical and problem-solving skills
- Nice to have: experience in, or willingness to learn, Go, Terratest, InSpec and Terraform compliance