Avetta

Site Reliability Engineer

Sydney, New South Wales

Senior level

As a Site Reliability Engineer, lead management of cloud systems, oversee NOC operations, and ensure high availability while collaborating with development and product teams.

Join Avetta as a Site Reliability Engineer in Australia!

Site Reliability Engineers are the pioneers of our production systems. We believe in proactive discovery and analysis of our entire stack, continually optimizing, tuning, and scaling the system to maximize the end-user experience on a globally distributed, cloud-based SaaS platform. Downtime is not in the SRE’s vocabulary. Important skills for the SRE include maintaining highly resilient, distributed systems, integrating uptime monitors through programmatic APIs, and developing intelligent scaling algorithms. In addition, the SRE must communicate effectively with both development and product teams to drive technical discovery and help prioritize features that meet or exceed goals for uptime and end-user experience.

Essential Duties and Responsibilities:

Lead the management and monitoring of highly available replicated cloud systems.
Oversee 24/7 Network Operations Center (NOC) operations, maintaining a minimum 99.9% annual uptime.
Define golden signals for all services in our core SaaS application.
Manage NOC engineer teams, including scheduling and responsibilities.
Design PagerDuty escalation policies across various teams.
Apply expertise in AWS technologies and build dashboards with leading observability platforms.
Automate monitors and dashboards using modern programmatic methods.
Provide regular reports to Engineering leadership and executive teams for continuous improvement.

Minimum Qualifications:

Minimum B.S. or B.A. in Computer Science.
Minimum of 5 years of experience as a Site Reliability Engineer, including some experience in managing teams and leading projects.
Stellar communication and interpersonal skills for effective collaboration with Development & Product teams.
Proficiency in monitoring the networking stack using distributed tracing and profiling tools.
Proficiency in building dashboards with New Relic, Kibana, Grafana, Prometheus, and other observability platforms.
Proficiency with AWS technologies.
Working knowledge of monitoring RESTful microservices and of HTTP protocol basics.
Able to automate monitors and dashboards using REST APIs, GraphQL, and other modern programmatic methods.
Working knowledge of profiling tools for measuring CPU, memory, I/O, and disk usage, and for capturing process thread dumps.
Experience in managing, integrating, and automating alerting and escalation tools.
Must live in Australia with unlimited rights to work. Preference will be given to those living in Sydney or Newcastle areas.

Nice to Haves:

Troubleshooting experience with modern container and networking technologies (Kubernetes, HAProxy, ALB).
Familiarity with scripting languages like Bash, Python, and Go.
Ability to administer and tune load balancer technologies.
Experience in managing, monitoring, and benchmarking distributed file systems.
Proficiency in configuration management tools (SaltStack, Ansible, Terraform).

Metrics That Matter:

System Monitoring: Create and automate system monitors and escalation policies.
System Management: Respond to and resolve internal requests within business hours.
High Availability & Resilience: Maintain 99.95% uptime and be the first responder in emergency situations.
Full-Stack Observability: Build dashboards for end-to-end detection of system anomalies.
Innovation: Propose new ideas and improvements to the team regularly.

Join us at Avetta and be at the forefront of driving technical excellence and ensuring a seamless experience for our users across the globe.

Top Skills

Ansible

AWS

Bash

Grafana

GraphQL

HAProxy

Kibana

Kubernetes

New Relic

Prometheus

Python

REST APIs

SaltStack

Terraform
