Senior Cloud Observability Engineer - Datadog with Security Clearance

Artech Information Systems

Washington, DC 20001 United States

Posted: Jun 18, 2026

Full Time
Federal Government

Summary

Job title: Senior Cloud Observability Engineer - Datadog
Location: Washington, D.C., 20549 (100% Onsite)
Duration: 6 Months
Salary Range: $58.00 - $60.00/Hour on W2 (Without Benefits). Applicants must be willing to work on W2.
Clearance: Ability to obtain and maintain SEC Public Trust (or higher if required).

Primary Responsibilities:

Observability Platform Engineering:
* Engineer and operate the enterprise observability stack (Datadog or comparable), including metrics, logs, traces, APM, RUM, synthetic monitoring, and network performance monitoring.
* Build, tune, and maintain dashboards, monitors, SLOs/SLIs, and alerting policies that produce actionable signal and minimize noise.
* Instrument services, infrastructure, and containerized workloads using agents, OpenTelemetry, and language-specific APM tracers (Java, .NET, Python, Node.js, Go) with consistent span tagging, W3C TraceContext propagation, and unified service tagging across the estate.
* Develop and maintain integrations between observability platforms, ITSM (ServiceNow), CI/CD pipelines, and on-call/paging workflows.
* Define and enforce a unified tagging standard (environment, service, version, team/ownership, data classification, cost center) across metrics, logs, and traces; manage tag cardinality, governance, and custom business tags to keep telemetry queryable, attributable, and cost-controlled.

Cloud and Container Monitoring Engineering:
* Design and deliver monitoring coverage for Microsoft Azure and AWS workloads, including PaaS services, serverless, networking, identity, managed databases, and cloud-native data services.
* Engineer managed database observability across AWS RDS/Aurora (MySQL, PostgreSQL, SQL Server, Oracle), Azure SQL/PostgreSQL/MySQL, and NoSQL/cache services (DynamoDB, Cosmos DB, ElastiCache/Redis), including query-level performance analytics, slow-query and execution-plan capture, lock/deadlock/wait analysis, connection pool and session monitoring, replication lag, storage/IOPS saturation, and backup/HA health -- correlating database spans with upstream APM traces.
* Engineer container-platform observability for OpenShift/Kubernetes, covering cluster health, control plane, nodes, pods, namespaces, ingress, service mesh, and workload APM.
* Build standardized, reusable monitoring modules deployable via infrastructure-as-code (Terraform, Bicep, ARM) and CI/CD.
* Support hybrid visibility across on-premises, cloud, and containerized workloads with correlated telemetry.

Performance Engineering and Problem Solving:
* Lead data-driven investigation and resolution of complex performance, latency, saturation, and reliability issues across the estate.
* Use APM distributed traces, service/dependency maps, continuous code profiling (CPU, memory, lock contention), database query analytics, exception/error tracking, and RUM-to-backend trace correlation to isolate bottlenecks in applications, platforms, middleware, and downstream dependencies.
* Partner with engineering teams to define and implement remediation, tuning, and architectural improvements based on telemetry evidence.
* Define and implement trace-based SLOs, deployment tracking, and change-correlation workflows so performance regressions are detected and attributed to specific releases, versions, or configuration changes.
* Provide senior technical leadership during major incidents, delivering impact analysis, contributing to root-cause analysis, and owning post-incident observability gaps.

Capacity, Reliability, and Continuous Improvement:
* Analyze operational telemetry and trend data to identify capacity risks, recurring constraints, and opportunities for efficiency.
* Build and maintain capacity and performance dashboards and reports that communicate posture, risk, and recommendations to technical and leadership stakeholders.
* Define capacity thresholds, alert baselines, and trigger points for scaling, technology refresh, and resource reallocation.
* Drive continuous improvement of observability coverage, alert quality, runbook linkage, and operational maturity aligned to SEC SLA/KPI expectations.

Required qualifications:

Education:
* Bachelor's degree in a relevant field (e.g., Information Technology, Computer Science, Engineering).

Required Experience:
* Minimum 8 years of experience in IT infrastructure or platform engineering roles, including 5+ years focused on observability, performance engineering, or site reliability engineering.
* Demonstrated experience engineering and operating an enterprise observability platform (Datadog strongly preferred; equivalent experience with Dynatrace, New Relic, Splunk Observability, or Grafana/Prometheus stacks considered).
* Proven experience building APM and distributed tracing coverage for production multi-tier applications -- including language-specific tracer deployment, custom instrumentation of business transactions, service/dependency mapping, continuous profiling, and RUM-to-backend trace correlation -- across cloud and containerized workloads.
* Proven experience leading complex production performance and reliability problem-solving from telemetry to remediation.
ABOUT THE COMPANY
Government Careers
Government jobs offer stability, competitive benefits, and the chance to make a meaningful impact on your community and country.

Whether you’re starting your career or seeking new opportunities, these roles provide pathways for growth, security, and service.

Explore positions across a wide range of fields and take the first step toward a rewarding future in public service.

MORE JOBS

Entry-Level Customs and Border Protection Officer (GS-5/7)

Customs and Border Protection Officer (CBPO) Entry Level New Hire Sign-On and Retention Incentives

Air Interdiction Agent New Hire Sign-On Incentives

Senior Penetration Tester

Criminal Division Prosecutor - Trials & Appeals
