This is a remote position.
Datadog Platform Expert
We are seeking a senior Datadog expert to audit and optimise our leading client’s primary observability platform. This is not a "user" role; we need an expert capable of re-engineering data flows for maximum efficiency.
Datadog Platform Expertise (Must Have)
- Minimum of 4 years of hands-on experience with the Datadog platform in production environments.
- Deep expertise across Datadog’s core product suite: Infrastructure Monitoring, APM (Application Performance Monitoring), Log Management, Synthetics, Network Monitoring, and Real User Monitoring (RUM).
- Proven experience in Datadog cost optimisation, including data ingestion reduction, licence right-sizing, and metric cardinality management.
- Expert-level knowledge of Datadog Agent deployment, configuration, and troubleshooting across bare-metal, VM, and containerised environments (Docker, Kubernetes).
- Strong experience with Datadog’s tagging strategy, service catalogue, and custom metrics (DogStatsD, custom checks).
- Experience with the Datadog API and programmatic management of monitors, dashboards, and SLOs.
- Familiarity with Datadog’s pricing model and ability to forecast and optimise costs based on usage patterns.
Cloud Infrastructure (Must Have)
- Strong AWS experience (minimum 3 years), including EC2, ECS/EKS, Lambda, RDS, S3, CloudWatch, and VPC networking.
- Experience monitoring AWS cost drivers and correlating infrastructure changes with observability cost impact.
- Familiarity with Infrastructure-as-Code (Terraform, CloudFormation) for managing Datadog resources programmatically.
- Understanding of Kubernetes monitoring patterns: DaemonSets, sidecar injection, cluster-level metrics, and container log collection.
Service Management and Automation (Must Have)
- Experience integrating Datadog with Jira Service Management, including webhook-based alert forwarding and bidirectional status sync.
- Knowledge of incident management workflows: escalation policies, runbook automation, and post-incident review processes.
- Experience with PagerDuty, Opsgenie, or similar on-call management tools and their integration with Datadog.
- Ability to design and implement automated remediation workflows triggered by Datadog alerts.
Data Quality and Analytics (Must Have)
- Experience auditing and improving data quality in observability pipelines (metrics, logs, traces).
- Strong analytical skills with the ability to identify patterns, anomalies, and data integrity issues in large-scale telemetry data.
- Experience designing custom dashboards and reports for engineering leadership, focusing on actionable insights.
Preferred and Bonus Skills
- Datadog Fundamentals Certification, Log Management Certification, or APM Certification (highly preferred).
- Datadog Cloud SIEM for AWS Fundamentals certification.
- Experience with FinOps frameworks and cloud cost management tools (AWS Cost Explorer, Trusted Advisor, CloudHealth, Kubecost).
- Experience in financial services or banking environments, particularly with regulatory compliance for data handling and retention.
- Familiarity with Thought Machine (core banking platform) or similar modern banking technology stacks.
- Experience with AI/ML-driven observability features: anomaly detection, forecasting, Watchdog, and intelligent alerting.
- Contributions to or experience with Datadog’s open-source ecosystem (datadog-agent, dd-trace libraries, integrations).
- Experience with log parsing, pipeline processing, and log-to-metric conversion strategies in Datadog.