Summary
Job title: Senior Cloud Observability Engineer - Datadog
Location: Washington, D.C., 20549 (100% Onsite)
Duration: 6 Months
Salary Range: $58.00 - $60.00/Hour on W2 (Without Benefits). Applicants must be willing to work on W2.
Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required).

Primary Responsibilities:

Observability Platform Engineering:
* Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.
* Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise.
* Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate.
* Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows.
* Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled.

Cloud and Container Monitoring Engineering:
* Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services.
* Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health, correlating database spans with upstream APM traces.
* Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM.
* Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD.
* Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry.

Performance Engineering and Problem Solving:
* Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate.
* Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies.
* Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence.
* Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes.
* Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps.

Capacity, Reliability, and Continuous Improvement:
* Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency.
* Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders.
* Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation.
* Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations.

Required Qualifications:

Education:
* Bachelor's degree in a relevant field (e.g., Information Technology, Computer Science, Engineering).

Required Experience:
* Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering.
* Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered).
* Proven experience building APM and distributed tracing coverage for production multi-tier applications across cloud and containerized workloads, including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation.
* Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation.