The Cloud Engineer - Senior (Observability - Datadog) supports the SEC ISS contract by engineering,operating, and continuously improving the enterprise observability platform across hybrid cloud and containerized environments. This role is hands-on: instruments services with distributed tracing, code-level profiling, and custom metrics; builds and tunes Datadog (or comparable) dashboards, alerts, APM, log pipelines, RUM, and synthetic monitors; then uses that telemetry to solve production performance, reliability, and capacity problems. The engineer partners with cloud, platform, and application teams to embed observability into Azure, AWS, and container platforms (OpenShift/Kubernetes), and drives reduction of alert noise, mean time to detect (MTTD), and mean time to resolve (MTTR). This position provides senior technical leadership for APM/distributed tracing strategy, SLO/SLI engineering, and data-driven operational decision-making in a 24x7x365 operating environment.
PRIMARY RESPONSIBILITIES
Observability Platform Engineering
- Engineer andoperatethe enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.
- Build, tune, andmaintaindashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise.
- Instrument services, infrastructure, and containerized workloads using agents,OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3CTraceContextpropagation, and unified service tagging across the estate.
- Develop andmaintainintegrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows.
- Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetryqueryable, attributable, and cost-controlled.
Cloud and Container Monitoring Engineering
- Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services.
- Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB,ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces.
- Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes