Senior Platform Engineer
Anaqua
Job Description
- Own the GCP infrastructure with the team — GKE clusters, multi-region setup, global load balancing, autoscaling, VPC networking, DNS, firewall rules and IAM.
- Build and maintain GitLab CI/CD pipelines and shared CI templates that every service team consumes — build, scan, deploy, promote across Dev / QA / Staging / Pre-Prod / Production.
- Help shape the company-wide standards for how services get deployed, secured, monitored and rolled back.
- Operate and harden the cluster — node pool upgrades, namespace / RBAC / resource-quota design, rolling updates, health probes, base images and supply-chain security.
- Run the platform security stack — gateway policies, API-key and JWT issuance, secret rotation, OWASP and dependency scanning, workload identity, IAM least-privilege.
- Own observability and incident response on GCP — structured logging, metrics, dashboards, SLIs / SLOs / error budgets, alerting, post-mortems and on-call runbooks.
- Build internal developer tooling — CLIs, self-service workflows and golden-path automation that make the next service easy to ship.
What you will need to be successful:
- Strong production ownership on GCP — operating real workloads, not just standing up demos. GCP is the cloud we run on.
- Kubernetes in production (GKE) — deployments, Helm, namespaces, RBAC, resource quotas, rolling updates, liveness and readiness probes, multi-region setups and rollbacks.
- Terraform as a daily tool — reusable modules with remote state, drift detection and clean management of IAM, networking, Pub/Sub, Cloud SQL and secrets.
- CI/CD pipeline depth — GitLab CI (or equivalent) at scale; reusable templates, fast feedback loops, security and dependency scans as pipeline stages, deploy promotion across Dev / QA / Staging / Pre-Prod / Production.
- Git workflow fluency — GitFlow or trunk-based branching, tagging and release strategies that fit a multi-service org.
- Cloud networking depth — VPC design, load balancing (global and regional), DNS, firewall rules and network security groups.
- Hosting and application security ownership — gateway and edge policies, secret rotation, OWASP and dependency scanning, workload identity, IAM least-privilege hygiene.
- Production observability and reliability on GCP — structured logs, metrics, dashboards, alerting, SLIs / SLOs / error budgets, on-call rotations, post-mortems.
- Performance work — load testing, capacity planning and operational tuning of services under real traffic.
- Operational PostgreSQL — migrations under load, backups, restores, replication basics, query plans and indexing.
- Asynchronous messaging on GCP Pub/Sub — topology, subscriptions, dead-letter handling and operational tuning. Pub/Sub is our primary message bus.
- Scripting and automation — Bash plus one of Python or Go for internal tooling and platform automation.
- Excellent written and spoken English; comfortable working across time zones with engineers in the EU and the US.