Skyro is a rapidly expanding fintech company serving thousands of customers in the Philippines through our lending business. Our mission is to evolve into a full-fledged financial ecosystem, delivering cutting-edge solutions that make financial services more accessible, efficient, and secure for everyone.
What you will do:
Technical Execution:
Design, build, and maintain reliable infrastructure and platform services across multiple regions
Own reliability for assigned services, including SLI/SLO definition, monitoring, and improvement
Troubleshoot and resolve complex production issues across the full stack
Implement and improve Infrastructure as Code, CI/CD pipelines, and automation tooling
Drive architectural improvements that enhance availability, performance, and fault tolerance
Operational Excellence:
Lead incident response for critical production issues and conduct thorough postmortems
Develop and maintain runbooks, operational documentation, and recovery procedures
Contribute to disaster recovery (DR) strategy and participate in DR testing and chaos engineering exercises
Proactively identify reliability risks and implement preventive measures
Collaboration & Mentorship:
Mentor junior and mid-level engineers through code reviews, pairing, and knowledge sharing
Collaborate with product engineering teams on reliability requirements and best practices
Contribute to cross-team standards for observability, alerting, and incident management
Participate in architecture reviews and provide technical guidance on reliability decisions
Continuous Improvement:
Evaluate and recommend new tools, technologies, and methodologies to improve SRE practices
Automate toil and repetitive operational tasks to improve team efficiency
Contribute to capacity planning and cost optimization initiatives
Stay current with industry trends in reliability engineering and cloud infrastructure
What you should have:
5+ years of experience in SRE, infrastructure engineering, or platform engineering
Strong expertise in distributed systems, reliability patterns, and fault-tolerant architecture
Solid programming skills in Python, Go, or similar languages, with scripting proficiency (Bash)
Deep hands-on experience with Kubernetes operations and troubleshooting at scale
Strong experience with public cloud platforms (AWS, GCP) in production environments
Proficiency with Infrastructure as Code (Terraform, Terragrunt), including module development
Strong observability skills: Prometheus, Grafana, distributed tracing, and centralized logging
Experience defining and maintaining SLIs/SLOs for production services
Proven ability to lead incident response and drive meaningful postmortems
Nice to have:
Database operations experience (PostgreSQL, MongoDB, ClickHouse)
GitLab CI/CD pipelines and Helm chart development
Financial services or regulated environment background
Experience with XaaS platforms (S3, DBaaS, VMaaS, CDNaaS)
Familiarity with security and compliance frameworks (ISO 27001, SOC 2, NIST)
Chaos engineering or DR testing experience
What happens after you apply?
We review applications on a rolling basis and aim to get back to you within 2–3 business days. If there’s a fit, we’ll reach out. If you don’t hear from us within 2–3 weeks, consider it a pass. Thanks for taking the time; we appreciate your interest. 🚀
Collaboration Notice
Please note that your workday should start no later than 2 PM (GMT+8) / 7 AM (CET) to ensure effective collaboration within our international team.