Senior Infrastructure Engineer (DevOps, Production, AWS), Ferra

25madison


Job Location:

New York City, NY - USA

Annual Salary: $145,000 - $190,000
Posted on: 17 hours ago
Vacancies: 1 Vacancy

Job Summary

Company: Ferra

Location: Remote or Hybrid

Experience: 8 years

Reports to: CTO / Founding Engineer


About Ferra

Ferra is building AI infrastructure for structural steel estimation. We process large-scale construction drawing PDFs, run computer vision and LLM pipelines, and generate structured steel graphs, takeoffs, and export-ready models.

Our system includes:

  • Multi-stage ML pipelines (CV and LLM)
  • Asynchronous job processing (SQS-driven workflows)
  • Large PDF ingestion and document graph processing
  • Vector-native parsing and algorithmic geometry systems
  • Graph storage and export services

Role Overview:

We are hiring a Senior Infrastructure Engineer to own uptime, reliability, latency, and scalability across our entire AWS environment.

You will ensure our AI/ML pipelines run reliably at scale, without cloud outages, timeouts, networking bottlenecks, or production instability slowing down our algorithm team.

You will build and maintain production-grade AWS architecture that supports:

  • Large PDF ingestion (100-500 sheets)
  • Computer vision pipelines
  • LLM inference workflows
  • Distributed job queues
  • High-volume asynchronous processing

Your mission is to enable the frontend teams to move fast without worrying about infrastructure.

What You Will Own:

Keep things running. You own uptime (99.9%), observability, incident response, and root cause analysis. When something breaks, you fix it and make sure it doesn't break the same way twice.

Own the AWS architecture. Deep AWS stack: EC2 (including GPU), ECS/Fargate, SQS, Lambda, S3, CloudFront, API Gateway, RDS/DynamoDB, plus VPC design, IAM, autoscaling, and monitoring. You'll make the architectural calls, not just maintain what's there.

Make ML pipelines reliable. The core workloads are CV, LLM inference, and long-running batch jobs. You'll build the plumbing: retry logic, idempotency, checkpointing, and parallel orchestration. Experience with event-driven or DAG-based pipelines is a plus.

Chase down performance problems. Queue bottlenecks, cold starts, LLM latency, runaway costs: you will find and fix them. You should be comfortable debugging at the TCP, TLS, ECS, and IAM level.

Help the team ship faster. CI/CD, infrastructure-as-code (Terraform/CDK/Pulumi), clean containerization, and proper staging environments. The goal: deployments are boring, and "works on my machine" stops being an excuse.

About You:

  • 8 years in infrastructure / DevOps / production engineering
  • Deep AWS expertise (not just used it; architected at scale)
  • Experience running production ML or AI systems
  • Experience with asynchronous distributed systems
  • Strong knowledge of: ECS/Fargate, EC2 (including GPU instances), SQS, S3, VPC networking, and IAM best practices
  • Strong understanding of: containerization (Docker), CI/CD pipelines, Infrastructure as Code, and observability systems
  • Experience debugging production incidents and designing fault-tolerant systems

Nice to have: Prior exposure to GPU workloads at scale, event-driven architectures, or PDF/document-heavy pipelines. Bonus if you've done this in a startup environment where the infrastructure and the product were both still being figured out.

Why Ferra:

You'll be building infrastructure for real agentic AI, not wrappers around someone else's API. The team is small and technical, which means high ownership, fast decisions, and direct impact on the core product. Competitive comp, meaningful equity, and a genuine shot at defining how AI agents operate in production.

How to apply: Apply via the Breezy application here. Applications will be accepted on a rolling basis.

Target Annual Base Salary Range: $145,000-$190,000

Final salary will be determined based on the candidate's experience, knowledge, and skills. The salary reflected does not include an annual discretionary bonus, equity, or other benefits offered by the Company, as applicable.


Required Experience:

Senior IC


