Remote SRE Jobs – Senior Site Reliability Engineer (Remote) – $130k‑$170k USD – Full‑Time – Escondido, California – Cloud/DevOps, Kubernetes, Terraform, Prometheus

Worldwide Salaried Open


Who we are

We are a mid‑stage SaaS company that grew from a garage‑side prototype to a platform serving > 200 enterprise customers worldwide. Our flagship product—an API‑driven data‑pipeline—processes ≈ 15 TB of events per day, and we guarantee customers 99.9 % uptime. The engineering culture is built on blunt feedback, data‑driven post‑mortems, and a relentless focus on reliability. While the code lives in the cloud, the heart of our operational decisions is made by a small, tight‑knit crew spread across the globe.

Why this role exists now

In the last 12 months we added three new cloud regions (AWS us‑east‑1, AWS us‑west‑2, and GCP europe‑west1) to shave latency for European clients. That expansion pushed our monthly alert volume from ≈ 2,800 to ≈ 5,200, and our MTTR climbed from 12 minutes to 18 minutes as the on‑call rotation was stretched thin. The leadership team decided it was time to double down on site reliability: we need a senior engineer who can own the reliability roadmap, coach the junior members, and reduce alert fatigue.

Where you’ll sit (virtually)

Although the job is remote, we have a legal entity in Escondido, California that handles payroll, benefits, and compliance. You’ll be part of a “virtual office” that meets daily in a Slack channel called #sre‑hub, in a weekly video‑call huddle, and at a quarterly in‑person meetup hosted in Escondido when travel permits. Being anchored to Escondido helps us stay aligned with local tax regulations and gives you a community of other remote professionals who live in the same time zone.

The team you’ll join

- Size & composition: 12 engineers total: 5 senior SREs, 4 junior reliability engineers, 2 platform developers, and 1 manager.
- Current metrics: 99.92 % uptime over the past quarter, 5,200 alerts processed per month, 18‑minute average MTTR, 0.2 % alert fatigue (defined as > 3 alerts per incident).
- SLA commitments: 99.9 % availability for all customer‑facing APIs, 99.7 % for internal data‑processing pipelines.
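To make the SLA figures concrete, here is a minimal Python sketch that converts an availability target into a downtime budget. The function name and the 30‑day window are assumptions chosen for illustration; they are not taken from our internal tooling.

```python
# Sketch: translate an availability target into an allowed-downtime budget.
# The 30-day default window is an assumption for illustration only.
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability target."""
    return (1 - availability_pct / 100) * days * MINUTES_PER_DAY


if __name__ == "__main__":
    targets = [("customer-facing APIs", 99.9), ("internal pipelines", 99.7)]
    for name, target in targets:
        budget = downtime_budget_minutes(target)
        print(f"{name}: {target}% -> {budget:.1f} minutes per 30 days")
```

Under these assumptions, 99.9 % leaves roughly 43 minutes of downtime per 30 days and 99.7 % leaves roughly 130 minutes. By the same arithmetic, 99.92 % uptime over an assumed 90‑day quarter corresponds to about 104 minutes of downtime.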

What you’ll do day‑to‑day

1. Own reliability initiatives – Define and ship SLOs for new services, write error‑budget policies, and track them in Grafana dashboards.
2. Incident ownership – Lead the response during high‑severity incidents, drive the post‑mortem narrative, and ensure actionable remediation items are filed in JIRA within 24 hours.
3. Automation & tooling – Write Terraform modules to provision Kubernetes clusters, build Helm charts for micro‑services, and shrink manual run‑books into reproducible Ansible playbooks.
4. Capacity planning – Run quarterly load‑tests using Locust, model growth with Python scripts, and present forecasts to product leadership.
5. Mentorship – Pair up with junior SREs for “bug‑hunting” sessions, run monthly reliability workshops, and contribute to our internal “SRE Playbook”.
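As a rough illustration of the kind of growth modelling mentioned under capacity planning, the sketch below projects daily event volume under constant compound growth. The starting point of 15 TB per day comes from this posting; the 5 % monthly growth rate and the function name are purely hypothetical.

```python
# Sketch: project daily event volume under constant compound monthly growth.
# ASSUMED_MONTHLY_GROWTH is a hypothetical figure, not a number from this posting.
CURRENT_DAILY_TB = 15.0
ASSUMED_MONTHLY_GROWTH = 0.05


def forecast_daily_volume_tb(
    current_tb: float, monthly_growth: float, months: int
) -> list[float]:
    """Return projected daily volume for month 0 (today) through `months`."""
    return [current_tb * (1 + monthly_growth) ** m for m in range(months + 1)]


if __name__ == "__main__":
    projection = forecast_daily_volume_tb(
        CURRENT_DAILY_TB, ASSUMED_MONTHLY_GROWTH, months=12
    )
    for month, volume in enumerate(projection):
        print(f"month {month:2d}: {volume:6.1f} TB/day")
```

A real forecast would likely fit the growth rate to historical data and load‑test results instead of assuming a constant.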

Who we think will thrive

- 5+ years of production‑grade experience with Linux/Unix, networking, and cloud infrastructure (AWS or GCP).
- Deep familiarity with monitoring stacks: Prometheus, Grafana, Alertmanager, and log aggregation via Splunk or ELK.
- Infrastructure‑as‑Code fluency: Terraform ≥ 0.13, Helm ≥ 3, and Ansible.
- Container orchestration: Running production workloads on Kubernetes (experience with EKS or GKE).
- Programming: Comfortable writing Python or Go for automation; Bash scripting is a given.
- Incident mindset: You can stay calm under pressure, triage noisy alerts, and keep a clear incident timeline.
- Communication: Able to explain complex reliability concepts to product managers and non‑technical stakeholders in plain language.

Tools & tech stack (the ones we actually use)

- Cloud – AWS (EC2, RDS, S3, Lambda) and GCP (Compute Engine, Cloud SQL, Pub/Sub).
- Container – Docker ≥ 20, Kubernetes ≥ 1.24, Helm ≥ 3.5.
- IaC – Terraform ≥ 1.0, Ansible ≥ 2.9.
- CI/CD – GitHub Actions, Jenkins, CircleCI (for legacy pipelines).
- Monitoring – Prometheus, Grafana, Alertmanager, Datadog (for some legacy services).
- Logging – Splunk, Elasticsearch‑Kibana stack, Loki.
- Incident response – PagerDuty, Opsgenie (we’re migrating fully to PagerDuty).
- Version control – GitHub (private repos, branch protection rules).
- Collaboration – Slack (primary chat), Confluence (knowledge base), JIRA (ticketing).

On‑call rhythm & expectations

Our on‑call schedule is a 7‑day rotation with a 48‑hour backup window. Each engineer handles ≈ 350 alerts per month, averaging ≈ 2 incidents per week. We have a “no‑call‑out‑of‑hours” policy for holidays: the next engineer in the rotation covers the entire period, and the team shares the load. During an incident you’ll have a clear run‑book, but we also encourage “play‑by‑play
