EOP System Reliability Engineer (TS/SCI Required)


Job Location:

Washington, AR - USA

Monthly Salary: Not Disclosed
Posted on: 2 days ago
Vacancies: 1

Job Summary

cFocus Software seeks a System Reliability Engineer to join our program supporting the Executive Office of the President. This position is remote. This position requires TS/SCI clearance.
Qualifications:
  • 5 years of experience and a Bachelor's degree in Computer Programming, Science, Engineering, or a related technical discipline, or an equivalent combination of education, technical training, or work/military experience, including:
  • 3 years of related systems programming experience
  • Technical Skills: Proficiency in programming languages (such as Python, Go, or Bash) is essential for scripting and automation tasks. Knowledge of Linux/Unix systems is also crucial, as SREs often work in these environments.
  • Problem-Solving: Analytical and problem-solving skills are necessary to diagnose and resolve complex system issues effectively.
  • Understanding of SRE Principles: Familiarity with key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets is important for measuring and maintaining system reliability.
  • Reliability and Availability: SRE practices help ensure that services are consistently available and reliable, which is critical for user satisfaction and business success.
  • Scalability: SREs implement strategies that allow systems to scale efficiently as demand increases, ensuring that performance remains optimal even under heavy load.
  • Cost Management: By optimizing resource usage and reducing downtime, SREs contribute to cost savings for organizations.
  • Programming and Scripting: Proficiency in languages like Python, Go, or Ruby is crucial for automating tasks and managing infrastructure.
  • Operating Systems: A strong understanding of Linux/Unix systems is essential for troubleshooting and managing servers.
  • Cloud Computing: Familiarity with cloud platforms like AWS, Azure, or Google Cloud is vital for deploying and managing applications in distributed environments.
  • Containers & Orchestration: Understanding containerization tools like Docker and managing containerized workloads with Kubernetes is crucial for cloud-native applications.
  • Monitoring and Logging: Proficiency in tools like Prometheus, Grafana, or the Elasticsearch, Logstash, and Kibana (ELK) Stack is necessary for tracking metrics, setting up alerts, and analyzing logs.
  • Networking: Knowledge of networking protocols and configurations is essential for maintaining system health and performance.
  • Configuration Management: Skills in managing and maintaining system configurations are critical for ensuring system reliability.
  • Incident Response: Ability to respond quickly and effectively to incidents, including documenting and learning from them.
  • Security Best Practices: Understanding security protocols and best practices to protect systems from vulnerabilities.
  • These skills are essential for SREs to maintain high availability and performance, balancing the demands of development and operations.
  • Support required during core business hours of 8 am to 5 pm, Monday through Friday.
  • On-call for evenings or weekends if needed for outages, application upgrades, security patches, or other unplanned activities.
Duties:
  • Administration of accounts (role-based access and rights).
  • Manage accessibility to the application through EOP's authentication systems.
  • Manage the workflow templates to ensure consistent and predictable task flows.
  • Configure workflow management for new workflows or adjustments based on user requests, while adhering to EOP template standards.
  • Maintain configurations and configurable fields for users and workflows.
  • Maintain the test environment to mimic production and conduct test and evaluation in the environment prior to deployments.
  • Design and maintain a secure and reliable form of backups, ensuring High Availability (HA) and resiliency.
  • Develop a Disaster Recovery (DR) or Incident Response (IR) plan for specific applications and services in the event of a disaster or unexpected downtime.
  • Maintain unique instances that support various offices.
  • Configure and support integrations with complementary systems.
  • Establish and improve system monitoring while maintaining established security protocols within development, test, and production systems.
  • Architect, build, and maintain on-premises and/or cloud infrastructure to support team and customer initiatives.
  • Maintain and improve existing infrastructure (build out autoscaling, support new services, optimize for cost efficiencies/authentication/search, etc.).
  • Administer production, staging, and development environments.
  • Manage and aggregate server logs and monitor for security- and system-related incidents.
  • Monitor and analyze system performance such as server load and resource usage.
  • Maintain and improve existing build and deployment processes using CI/CD tools.
  • Apply configuration management disciplines to maintain software revisions, security patches, hardening, and documentation.
  • Enforce best practices for security and reliability and drive security initiatives like access control and vulnerability testing.
  • Maintain up-to-date documentation of designs/configurations, ensuring team members have continuity of recurring tasks.
  • Maintain status of operations at all times: perform after-action reporting on all outages and work with engineering teams on root cause analysis and solutions. Present findings to management for prioritization and tasking.
  • Create and determine required metrics for dashboards and service health.
  • Follow up on engineering tasks for operational solutions and validate completion.
  • Manage the operational readiness board, present at weekly meetings, and determine if development services are ready for automation based on best practices and maintainability.
  • Track and ensure routine operations maintenance tasks are completed in a timely manner.
  • Align to the customer's strategies for configuration of workflows without compromising the integrity of the workflow tool and templates.
  • Build, maintain, and utilize the customer's enterprise Development, Security, and Operations (DevSecOps) pipeline.
  • Work with other service providers to support areas of common interest.
  • On-call support may be required.

Required Experience:

Senior IC


Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root Cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting

About Company


Our exclusive ATO as a Service™ software & expert services automate FISMA RMF & FedRAMP compliance.
