GETTR TechOps: Scaling Uptime to 99.99% in a Global, Always-On Platform

TL;DR (for Portfolio Summary)

Project: GETTR TechOps SLA & Security Overhaul

Role: Director of TechOps

Problem: Rapid platform growth led to instability, fragmented triage, and increasing security exposure

Solution:

Designed incident protocols, severity triage, and RCA post-mortem loops
Implemented PagerDuty, a public status page, and a multi-cloud architecture
Integrated Imperva WAF and DDoS protection
Partnered with the cybersecurity team on red teaming, audits, and policy enforcement
Built a full observability stack (SLIs/SLOs) for platform health

Impact:

Achieved 99.99% uptime
Reduced incident response time by 80%
Maintained clear public communications during upstream outages (e.g., the Shopify Merch Store incident in November 2023)
Built a culture of proactive defense, transparency, and systems thinking at scale
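
To put the 99.99% figure in concrete terms, the arithmetic behind an availability target can be sketched as below. This is a minimal illustration, not the measurement method used at GETTR: the function names and the 30-day and 365-day windows are assumptions chosen for the example.

```python
def downtime_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of downtime a service may accrue in the window and still meet the SLO.

    slo is a fraction (0.9999 for "four nines"); window_days is a hypothetical
    reporting window chosen for illustration.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def measured_availability(window_days: int, downtime_minutes: float) -> float:
    """Availability (the SLI) as the fraction of the window the service was up."""
    total_minutes = window_days * 24 * 60
    return (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    for days in (30, 365):
        budget = downtime_budget_minutes(0.9999, days)
        print(f"{days}-day window: {budget:.2f} minutes of allowed downtime")
```

At 99.99%, the budget works out to roughly 4.3 minutes of downtime per 30 days, or about 52.6 minutes per year, which is why a single poorly triaged incident can consume the whole allowance.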

When I became Director of TechOps at GETTR, we were no longer a scrappy startup with a few servers and a single product.

We had grown into a full-stack social platform—with livestreaming, short-form video, direct messaging, public feeds, real-time comments, notifications, and a moderation engine. Our user base spanned multiple continents, our traffic surged with every political event, and downtime wasn't just inconvenient anymore.

It was unacceptable.

We were evolving from a fast-moving tech team into a critical infrastructure provider for real-time expression. My mandate was clear: