Systems Engineer Grafana Expertise (DEVOPS)

Dallas

Posted 8 months ago

The Observability team is at the forefront of ensuring the health, performance, and reliability of our critical systems and applications. We empower the organization with real-time access to infrastructure and business applications by using innovative monitoring, reporting, and visualization tools.
Our team collects and analyzes metrics, logs, and traces using platforms like Splunk and other telemetry solutions. This data is crucial for assessing application health and availability, and for enabling rapid root cause analysis when issues arise—helping us maintain resilience in a fast-paced, high-volume trading environment.
If you’re passionate about observability, data-driven problem solving, and building systems that make a real-world impact, we’d love to have you on our team.

RESPONSIBILITIES:

As a member of our client's Observability team, you will play a pivotal role in enhancing our monitoring and telemetry capabilities across critical infrastructure and business applications. Your responsibilities will include:
Lead the migration from OpenText monitoring tools to Grafana and other open-source platforms.
Design and deploy monitoring rules for infrastructure and business applications.
Develop and manage alerting rules and notification workflows.
Build real-time dashboards to visualize system health and performance.
Configure and manage OpenTelemetry Collectors and Pipelines.
Integrate observability tools with CI/CD, incident management, and cloud platforms.
Deploy and manage observability agents across diverse environments.
Perform upgrades and maintenance of observability platforms.

QUALIFICATIONS:

6 to 8 years of related experience.
Bachelor’s degree or equivalent experience preferred.
Demonstrable experience designing intuitive, real-time dashboards (e.g., in Grafana) that effectively communicate system health, performance trends, and business-critical metrics.
Expertise in defining and tuning monitoring rules, thresholds, and alerting logic to ensure accurate and actionable incident detection.
Good understanding of both application-level and operating system-level metrics, including CPU, memory, disk I/O, network, and custom business metrics.
Experience with structured log ingestion, parsing, and analysis using tools like Splunk, Fluentd, or OpenTelemetry.
Familiarity with implementing and analyzing synthetic transactions and real user monitoring to assess end-user experience and application responsiveness.
Hands-on experience with application tracing tools and frameworks (e.g., OpenTelemetry, Jaeger, Zipkin) to diagnose performance bottlenecks and service dependencies.
Proficiency in configuring and using AWS CloudWatch for collecting and visualizing cloud-native metrics, logs, and events.
Understanding of containerized environments (e.g., Docker, Kubernetes) and how to monitor container health, resource usage, and orchestration metrics.
Ability to write scripts or small applications in languages such as Python, Java, or Bash to automate observability tasks and data processing.
Experience with automation and configuration management tools such as Ansible, Terraform, Chef, or SCCM to deploy and manage observability components at scale.
