About the Role
We're looking for a talented DevOps, DBA, and SRE Engineer to join myZoi and help ensure our cloud-native, regulated fintech application maintains excellent reliability, performance, and security. In this role, you'll bridge the gap between development and operations, implementing robust automation, monitoring, and incident response processes that support our mission of financial inclusion.
Key Responsibilities
Design, implement, and maintain infrastructure as code for our cloud-native platform on AWS
Create and optimize CI/CD pipelines for reliable, consistent deployments using Argo, GitHub, Jenkins, Sonar, and other best-of-breed tooling
Implement comprehensive monitoring and alerting systems to ensure platform health
Drive incident management processes and conduct thorough postmortems
Collaborate with development teams to improve application reliability and performance
Establish SLOs/SLIs and track reliability metrics
Drive continuous improvement of SRE and TechOps processes and rituals
Implement security best practices and ensure compliance with financial regulations
Automate routine operational tasks to improve efficiency and reduce human error
Contribute to capacity planning and cost optimization efforts
Configure and maintain our Datadog and CloudWatch observability stack across AWS environments
Design and implement custom metrics, dashboards, and alerts in Datadog to provide visibility into system health
Establish effective log management and analysis workflows to support troubleshooting
Manage complex multi-account infrastructure and apply access-management best practices
Manage complex networking configurations
Manage AWS infrastructure including EKS, RDS, Direct Connect, DynamoDB, and Control Tower
Required Skills & Experience
8+ years of experience in SRE, DevOps, or platform engineering roles
Strong knowledge of cloud platforms (AWS) and containerization technologies
Knowledge of AWS networking: VPCs, peering, security groups, MPLS, and VPN
Experience with infrastructure as code tools (Terraform, CloudFormation, or equivalent)
Proficiency in at least one programming/scripting language (Python, Go, or TypeScript preferred)
Experience implementing and maintaining CI/CD pipelines with ArgoCD, GitHub, and Jenkins
Strong understanding of monitoring tools and observability practices
Hands-on experience with Datadog for infrastructure monitoring, APM, and log management
Knowledge of security principles for cloud-native applications
Experience with container orchestration systems (Kubernetes)
Background in incident management and problem resolution
Experience with AWS CloudWatch metrics and integration with third-party observability platforms
Experience with Agile delivery and tools such as Jira and Confluence
Preferred Qualifications
Experience working with fintech or other regulated applications
Knowledge of financial compliance and security frameworks (e.g., PCI DSS, CIS Controls v8)
Experience implementing Zero Trust security models
Familiarity with service mesh technologies
Background in performance optimization and capacity planning
Experience with chaos engineering principles
Advanced Datadog skills, including creating custom agents and integrations
Experience implementing distributed tracing across microservices
Experience with Jira