EOP System Reliability Engineer (TS/SCI Required)

cFocus Software Incorporated

Job Location:

Washington, AR - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

5 years of experience and a Bachelor's degree in Computer Programming, Science, Engineering, or a related technical discipline, or the equivalent combination of education, technical training, or work/military experience, including:
3 years of related systems programming experience
Technical Skills: Proficiency in programming languages (such as Python, Go, or Bash) is essential for scripting and automation tasks. Knowledge of Linux/Unix systems is also crucial, as SREs often work in these environments.
Problem-Solving: Analytical and problem-solving skills are necessary to diagnose and resolve complex system issues effectively.
Understanding of SRE Principles: Familiarity with key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets is important for measuring and maintaining system reliability.
Reliability and Availability: SRE practices help ensure that services are consistently available and reliable, which is critical for user satisfaction and business success.
Scalability: SREs implement strategies that allow systems to scale efficiently as demand increases, ensuring that performance remains optimal even under heavy load.
Cost Management: By optimizing resource usage and reducing downtime, SREs contribute to cost savings for organizations.
Programming and Scripting: Proficiency in languages like Python, Go, or Ruby is crucial for automating tasks and managing infrastructure.
Operating Systems: A strong understanding of Linux/Unix systems is essential for troubleshooting and managing servers.
Cloud Computing: Familiarity with cloud platforms like AWS, Azure, or Google Cloud is vital for deploying and managing applications in distributed environments.
Containers & Orchestration: Understanding containerization tools like Docker and managing containerized workloads with Kubernetes is crucial for cloud-native applications.
Monitoring and Logging: Proficiency in tools like Prometheus, Grafana, or the Elasticsearch, Logstash, and Kibana (ELK) Stack is necessary for tracking metrics, setting up alerts, and analyzing logs.
Networking: Knowledge of networking protocols and configurations is essential for maintaining system health and performance.
Configuration Management: Skills in managing and maintaining system configurations are critical for ensuring system reliability.
Incident Response: Ability to respond quickly and effectively to incidents, including documenting and learning from them.
Security Best Practices: Understanding security protocols and best practices to protect systems from vulnerabilities.
These skills are essential for SREs to maintain high availability and performance, balancing the demands of development and operations.
Support required during core business hours of 8am to 5pm, Monday through Friday.
On-call for evenings or weekends if needed for outages, application upgrades, security patches, or other unplanned activities.

Duties:

Administration of accounts (role-based access and rights).
Manage accessibility to the application through EOP's authentication systems.
Manage the workflow templates to ensure consistent and predictable task flows.
Configure workflow management for new workflows or adjustments based on user requests while adhering to EOP template standards.
Maintain configurations and configurable fields for users and workflows.
Maintain the test environment to mimic production and conduct test and evaluation in the environment prior to deployments.
Design and maintain a secure and reliable form of backups, ensuring High Availability (HA) and resiliency.
Develop a Disaster Recovery (DR) or Incident Response (IR) plan for specific applications and services in the event of a disaster or unexpected downtime.
Maintain unique instances that support various offices.
Configure and support integrations with complementary systems.
Establish and improve system monitoring while maintaining established security protocols within development, test, and production systems.
Architect, build, and maintain on-premises and/or cloud infrastructure to support team and customer initiatives.
Maintain and improve existing infrastructure (build out autoscaling, support new services, optimize for cost efficiencies/authentication/search, etc.).
Administer production, staging, and development environments.
Manage and aggregate server logs and monitor for security and system-related incidents.
Monitor and analyze system performance, such as server load and resource usage.
Maintain and improve existing build and deployment processes using CI/CD tools.
Apply configuration management disciplines to maintain software revisions, security patches, hardening, and documentation.
Enforce best practices for security and reliability, and drive security initiatives like access control and vulnerability testing.
Maintain up-to-date documentation of designs/configurations, ensuring team members have continuity of recurring tasks.
Maintain status of operations at all times: perform after-action reporting on all outages and work with engineering teams on root cause analysis and to determine a solution. Present findings to management for prioritization and tasking.
Create and determine required metrics for dashboards and service health.
Follow up on engineering tasks for operational solutions and validate completion.
Manage the operational readiness board, present at weekly meetings, and determine if development services are ready for automation based on best practices and maintainability.
Track and ensure routine operations maintenance tasks are completed in a timely manner.
Align to the customer's strategies for configuration of workflows without compromising the integrity of the workflow tool and templates.
Build, maintain, and utilize the customer's enterprise Development, Security, and Operations (DevSecOps) pipeline.
Work with other service providers to support areas of common interest.
On-call support may be required.

Required Experience:

Senior IC

cFocus Software seeks a System Reliability Engineer to join our program supporting the Executive Office of the President. This position is remote. This position requires TS/SCI clearance.
