Intelance is a specialist architecture and AI consultancy working with clients in regulated, high-trust environments (healthcare, pharma, life sciences, financial services). We are assembling a lean, senior team to deliver an AI-assisted clinical report marking tool for a UK-based, UKAS-accredited organisation in human genetic testing.
We are looking for a Data Engineer (OCR & Pipelines) who can turn messy PDFs and documents into clean, reliable, auditable data flows for ML and downstream systems. This is a contract/freelance role (2-3 days/week) working closely with our AI Solution Architect, Lead ML Engineer, and Integration Engineer.
Tasks
- Design and implement the end-to-end data pipeline for the project:
  - ingest PDF/Word reports from secure storage;
  - run OCR / text extraction and layout parsing;
  - normalise structure and validate the data;
  - store outputs in a form ready for ML and integration.
- Evaluate and configure OCR / document AI services (e.g. Azure Form Recognizer or similar) and wrap them in robust, retry-safe, cost-aware scripts/services (see the retry sketch after this list).
- Define and implement data contracts and schemas between ingestion, ML, and integration components (JSON/Parquet/relational as appropriate); the schema and validation sketch after this list illustrates the idea.
- Build quality checks and validation rules (field presence, format and range checks, duplicate detection, basic anomaly checks).
- Implement logging, monitoring, and lineage so every processed document can be traced from source → OCR → structured output → model input (see the lineage sketch after this list).
- Work with the ML Engineer to ensure the pipeline exposes exactly the features and metadata needed for training, evaluation, and explainability.
- Collaborate with the Integration Engineer to deliver clean batch or streaming feeds into the client's assessment system (API, CSV exports, or SFTP drop-zone).
- Follow good security and privacy practices in all pipelines: encryption, access control, least privilege, and redaction where needed.
- Contribute to infrastructure decisions (storage layout, job orchestration, simple CI/CD for data jobs).
- Document the pipeline clearly: architecture diagrams, table/field definitions, data dictionaries, operational runbooks.
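
To give a flavour of what "retry-safe, cost-aware" means here, below is a minimal Python sketch of the kind of wrapper involved. It is illustrative only: call_with_retry and PageBudget are hypothetical names, and it is not tied to any specific vendor SDK. A production version would retry only on transient errors (throttling, timeouts) rather than on any exception.

```python
import logging
import random
import time
from typing import Any, Callable

log = logging.getLogger("ocr_wrapper")


def call_with_retry(
    fn: Callable[..., Any],
    *args: Any,
    max_attempts: int = 4,
    base_delay: float = 2.0,
    **kwargs: Any,
) -> Any:
    """Call an OCR service with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            log.warning(
                "OCR attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, max_attempts, exc, delay,
            )
            time.sleep(delay)


class PageBudget:
    """Crude cost guard: refuse to submit more pages than the daily budget.

    Paid document AI services typically bill per page, so the pipeline
    tracks consumption and fails loudly instead of silently overspending.
    """

    def __init__(self, max_pages_per_day: int) -> None:
        self.max_pages = max_pages_per_day
        self.used = 0

    def charge(self, pages: int) -> None:
        if self.used + pages > self.max_pages:
            raise RuntimeError(
                f"OCR page budget exceeded ({self.used} + {pages} > {self.max_pages})"
            )
        self.used += pages
```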
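The data contracts between pipeline stages might look like the sketch below: a typed record plus a validation function that returns errors instead of raising, so failed documents can be quarantined rather than crashing a batch. All field names here are hypothetical; the real schema would be agreed with the ML and Integration engineers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExtractedReport:
    """Illustrative contract between OCR output and ML input."""
    document_id: str
    source_uri: str
    report_date: date
    lab_code: str
    raw_text: str
    confidence: float  # mean OCR confidence, 0.0-1.0


def validate(report: ExtractedReport) -> list[str]:
    """Return a list of validation errors; an empty list means it passes."""
    errors: list[str] = []
    if not report.document_id:
        errors.append("document_id missing")
    if not report.raw_text.strip():
        errors.append("empty OCR text")
    if not (0.0 <= report.confidence <= 1.0):
        errors.append(f"confidence out of range: {report.confidence}")
    return errors
```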
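And for lineage, one pattern (again, a sketch under assumed names, not a prescribed design) is an append-only record per pipeline stage; hashing each artefact lets an auditor verify that what went into the model is exactly what came out of OCR.

```python
import hashlib
from datetime import datetime, timezone


def lineage_event(
    stage: str,
    document_id: str,
    payload: bytes,
    parent_hash: str | None = None,
) -> dict:
    """Build one lineage record so a document can be traced
    source -> OCR -> structured output -> model input."""
    return {
        "document_id": document_id,
        "stage": stage,  # e.g. "ingest", "ocr", "normalise", "model_input"
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "parent_sha256": parent_hash,  # hash of the previous stage's artefact
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# Usage: serialise each record as one line of an append-only JSONL audit log,
# e.g. json.dumps(lineage_event("ocr", "DOC-001", ocr_bytes)).
```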
Requirements
Must-have
- 3-5 years of hands-on Data Engineering experience.
- Strong Python skills, including building and packaging data processing scripts or services.
- Practical experience with OCR / document processing (e.g. Tesseract, Azure Form Recognizer, AWS Textract, Google Document AI, or equivalent).
- Solid experience building ETL/ELT pipelines on a major cloud platform (ideally Azure, but AWS/GCP is fine if you're comfortable switching).
- Good knowledge of data modelling and file formats (JSON, CSV, Parquet, relational schemas).
- Experience implementing data quality checks, logging, and monitoring for pipelines.
- Understanding of security and privacy basics: encryption at rest/in transit, access control, secure handling of potentially sensitive data.
- Comfortable working in a small, senior, remote team; able to take a loosely defined problem and design a clean, maintainable solution.
- Available for 2-3 days per week on a contract basis, working largely remotely in the UK or nearby European time zones.
Nice-to-have
- Experience in healthcare, life sciences, diagnostics, or other regulated environments.
- Familiarity with Azure Data Factory, Azure Functions, Databricks, or similar orchestration/compute tools.
- Knowledge of basic MLOps concepts (feature stores, model input/output formats).
- Experience with SFTP-based exchanges and batch integrations with legacy systems.
Benefits
- Core impact role: you own the pipeline that makes the entire AI solution possible; without you, nothing moves.
- Meaningful domain: your work supports external quality assessment in human genetic testing for labs worldwide.
- Lean, senior team: work alongside experienced architects and ML engineers, with minimal bureaucracy and direct access to decision-makers.
- Remote-first and flexible: work from anywhere compatible with UK hours, 2-3 days/week.
- Contract/freelance: competitive day rate, with potential extension into further phases and additional schemes if the pilot is successful.
- Opportunity to build reusable data pipeline components that Intelance will deploy across future AI engagements.
We review every application personally. If there's a good match, we'll invite you to a short call to walk through the project, expectations, and next steps.