cFocus Software seeks a System Reliability Engineer to join our program supporting the Executive Office of the President. This position is remote. This position requires TS/SCI clearance.
Qualifications:
- 5 years of experience and a Bachelor's degree in Computer Programming, Science, Engineering, or a related technical discipline, or the equivalent combination of education, technical training, or work/military experience, including:
- 3 years of related systems programming experience
- Technical Skills: Proficiency in programming languages (such as Python, Go, or Bash) is essential for scripting and automation tasks. Knowledge of Linux/Unix systems is also crucial, as SREs often work in these environments.
- Problem-Solving: Analytical and problem-solving skills are necessary to diagnose and resolve complex system issues effectively.
- Understanding of SRE Principles: Familiarity with key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets is important for measuring and maintaining system reliability.
- Reliability and Availability: SRE practices help ensure that services are consistently available and reliable, which is critical for user satisfaction and business success.
- Scalability: SREs implement strategies that allow systems to scale efficiently as demand increases, ensuring that performance remains optimal even under heavy load.
- Cost Management: By optimizing resource usage and reducing downtime, SREs contribute to cost savings for organizations.
- Programming and Scripting: Proficiency in languages like Python, Go, or Ruby is crucial for automating tasks and managing infrastructure.
- Operating Systems: A strong understanding of Linux/Unix systems is essential for troubleshooting and managing servers.
- Cloud Computing: Familiarity with cloud platforms like AWS, Azure, or Google Cloud is vital for deploying and managing applications in distributed environments.
- Containers & Orchestration: Understanding containerization tools like Docker and managing containerized workloads with Kubernetes are crucial for cloud-native applications.
- Monitoring and Logging: Proficiency in tools like Prometheus, Grafana, or the Elasticsearch, Logstash, and Kibana (ELK) Stack is necessary for tracking metrics, setting up alerts, and analyzing logs.
- Networking: Knowledge of networking protocols and configurations is essential for maintaining system health and performance.
- Configuration Management: Skills in managing and maintaining system configurations are critical for ensuring system reliability.
- Incident Response: Ability to respond quickly and effectively to incidents, including documenting and learning from them.
- Security Best Practices: Understanding security protocols and best practices to protect systems from vulnerabilities.
- These skills are essential for SREs to maintain high availability and performance, balancing the demands of development and operations.
- Support required during core business hours of 8am to 5pm, Monday through Friday.
- On-call for evenings or weekends if needed for outages, application upgrades, security patches, or other unplanned activities.
Duties:
- Administration of accounts (role-based access and rights).
- Manage accessibility to the application through EOP's authentication systems.
- Manage the workflow templates to ensure consistent and predictable task flows.
- Configure workflow management for new workflows or adjustments based on user requests while adhering to EOP template standards.
- Maintain configurations and configurable fields for users and workflows.
- Maintain the test environment to mimic production and conduct test and evaluation in the environment prior to deployments.
- Design and maintain a secure and reliable form of backups, ensuring High Availability (HA) and resiliency.
- Develop a Disaster Recovery (DR) or Incident Response (IR) plan for specific applications and services in the event of a disaster or unexpected downtime.
- Maintain unique instances that support various offices.
- Configure and support integrations with complementary systems.
- Establish and improve system monitoring while maintaining established security protocols within development, test, and production systems.
- Architect, build, and maintain on-premises and/or cloud infrastructure to support team and customer initiatives.
- Maintain and improve existing infrastructure (build out autoscaling, support new services, optimize for cost efficiencies/authentication/search, etc.).
- Administer production, staging, and development environments.
- Manage and aggregate server logs and monitor for security and system-related incidents.
- Monitor and analyze system performance, such as server load and resource usage.
- Maintain and improve existing build and deployment processes using CI/CD tools.
- Apply configuration management disciplines to maintain software revisions, security patches, hardening, and documentation.
- Enforce best practices for security and reliability, and drive security initiatives like access control and vulnerability testing.
- Maintain up-to-date documentation of designs/configurations, ensuring team members have continuity of recurring tasks.
- Maintain status of operations at all times: perform after-action reporting on all outages and work with engineering teams to determine solutions and perform root cause analysis. Present findings to management for prioritization and tasking.
- Create and determine required metrics for dashboards and service health.
- Follow up on engineering tasks for operational solutions and validate completion.
- Manage the operational readiness board, present at weekly meetings, and determine whether development services are ready for automation based on best practices and maintainability.
- Track and ensure routine operations maintenance tasks are completed in a timely manner.
- Align to the customer's strategies for configuration of workflows without compromising the integrity of the workflow tool and templates.
- Build, maintain, and utilize the customer's enterprise Development, Security, and Operations (DevSecOps) pipeline.
- Work with other service providers to support areas of common interest.
- On-call support may be required.
Required Experience:
Senior IC