Intelance is a specialist architecture and AI consultancy working with clients in regulated, high-trust environments (healthcare, pharma, life sciences, financial services). We are assembling a lean, senior team to deliver an AI-assisted clinical report marking tool for a UK-based, UKAS-accredited organisation in human genetic testing.
We are looking for a Data Engineer (OCR & Pipelines) who can turn messy PDFs and documents into clean, reliable, auditable data flows for ML and downstream systems. This is a contract/freelance role (2-3 days/week) working closely with our AI Solution Architect, Lead ML Engineer, and Integration Engineer.
Tasks
- Design and implement the end-to-end data pipeline for the project:
  - ingest PDF/Word reports from secure storage;
  - run OCR / text extraction and layout parsing;
  - normalise structure and validate the data;
  - store outputs in a form ready for ML and integration.
- Evaluate and configure OCR / document AI services (e.g. Azure Form Recognizer or similar) and wrap them in robust, retry-safe, cost-aware scripts/services (see the retry sketch after this list).
- Define and implement data contracts and schemas between ingestion, ML, and integration components (JSON/Parquet/relational as appropriate); the schema and validation sketch after this list illustrates the idea.
- Build quality checks and validation rules (field presence, format and range checks, duplicate detection, basic anomaly checks).
- Implement logging, monitoring, and lineage so every processed document can be traced from source → OCR → structured output → model input (see the lineage sketch after this list).
- Work with the ML Engineer to ensure the pipeline exposes exactly the features and metadata needed for training, evaluation, and explainability.
- Collaborate with the Integration Engineer to deliver clean batch or streaming feeds into the client's assessment system (API, CSV exports, or SFTP drop-zone).
- Follow good security and privacy practices in all pipelines: encryption, access control, least privilege, and redaction where needed.
- Contribute to infrastructure decisions (storage layout, job orchestration, simple CI/CD for data jobs).
- Document the pipeline clearly: architecture diagrams, table/field definitions, data dictionaries, operational runbooks.
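
To give a flavour of what "retry-safe, cost-aware" means here, below is a minimal Python sketch of the kind of wrapper involved. It is illustrative only: call_with_retry and PageBudget are hypothetical names, and it is not tied to any specific vendor SDK. A production version would retry only on transient errors (throttling, timeouts) rather than on any exception.

```python
import logging
import random
import time
from typing import Any, Callable

log = logging.getLogger("ocr_wrapper")


def call_with_retry(
    fn: Callable[..., Any],
    *args: Any,
    max_attempts: int = 4,
    base_delay: float = 2.0,
    **kwargs: Any,
) -> Any:
    """Call an OCR service with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            log.warning(
                "OCR attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, max_attempts, exc, delay,
            )
            time.sleep(delay)


class PageBudget:
    """Crude cost guard: refuse to submit more pages than the daily budget.

    Paid document AI services typically bill per page, so the pipeline
    tracks consumption and fails loudly instead of silently overspending.
    """

    def __init__(self, max_pages_per_day: int) -> None:
        self.max_pages = max_pages_per_day
        self.used = 0

    def charge(self, pages: int) -> None:
        if self.used + pages > self.max_pages:
            raise RuntimeError(
                f"OCR page budget exceeded ({self.used} + {pages} > {self.max_pages})"
            )
        self.used += pages
```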
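The data contracts between pipeline stages might look like the sketch below: a typed record plus a validation function that returns errors instead of raising, so failed documents can be quarantined rather than crashing a batch. All field names here are hypothetical; the real schema would be agreed with the ML and Integration engineers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExtractedReport:
    """Illustrative contract between OCR output and ML input."""
    document_id: str
    source_uri: str
    report_date: date
    lab_code: str
    raw_text: str
    confidence: float  # mean OCR confidence, 0.0-1.0


def validate(report: ExtractedReport) -> list[str]:
    """Return a list of validation errors; an empty list means it passes."""
    errors: list[str] = []
    if not report.document_id:
        errors.append("document_id missing")
    if not report.raw_text.strip():
        errors.append("empty OCR text")
    if not (0.0 <= report.confidence <= 1.0):
        errors.append(f"confidence out of range: {report.confidence}")
    return errors
```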
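And for lineage, one pattern (again, a sketch under assumed names, not a prescribed design) is an append-only record per pipeline stage; hashing each artefact lets an auditor verify that what went into the model is exactly what came out of OCR.

```python
import hashlib
from datetime import datetime, timezone


def lineage_event(
    stage: str,
    document_id: str,
    payload: bytes,
    parent_hash: str | None = None,
) -> dict:
    """Build one lineage record so a document can be traced
    source -> OCR -> structured output -> model input."""
    return {
        "document_id": document_id,
        "stage": stage,  # e.g. "ingest", "ocr", "normalise", "model_input"
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "parent_sha256": parent_hash,  # hash of the previous stage's artefact
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# Usage: serialise each record as one line of an append-only JSONL audit log,
# e.g. json.dumps(lineage_event("ocr", "DOC-001", ocr_bytes)).
```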
Requirements
Must-have
- 3-5 years of hands-on Data Engineering experience.
- Strong Python skills, including building and packaging data processing scripts or services.
- Practical experience with OCR / document processing (e.g. Tesseract, Azure Form Recognizer, AWS Textract, Google Document AI, or equivalent).
- Solid experience building ETL/ELT pipelines on a major cloud platform (ideally Azure, but AWS/GCP is fine if you're comfortable switching).
- Good knowledge of data modelling and file formats (JSON, CSV, Parquet, relational schemas).
- Experience implementing data quality checks, logging, and monitoring for pipelines.
- Understanding of security and privacy basics: encryption at rest/in transit, access control, secure handling of potentially sensitive data.
- Comfortable working in a small, senior, remote team; able to take a loosely defined problem and design a clean, maintainable solution.
- Available for 2-3 days per week on a contract basis, working largely remotely in the UK or nearby European time zones.
Nice-to-have
- Experience in healthcare, life sciences, diagnostics, or other regulated environments.
- Familiarity with Azure Data Factory, Azure Functions, Databricks, or similar orchestration/compute tools.
- Knowledge of basic MLOps concepts (feature stores, model input/output formats).
- Experience with SFTP-based exchanges and batch integrations with legacy systems.
Benefits
- Core impact role: you own the pipeline that makes the entire AI solution possible; without you, nothing moves.
- Meaningful domain: your work supports external quality assessment in human genetic testing for labs worldwide.
- Lean, senior team: work alongside experienced architects and ML engineers, with minimal bureaucracy and direct access to decision-makers.
- Remote-first and flexible: work from anywhere compatible with UK hours, 2-3 days/week.
- Contract/freelance: competitive day rate, with potential extension into further phases and additional schemes if the pilot is successful.
- Opportunity to build reusable data pipeline components that Intelance will deploy across future AI engagements.
We review every application personally. If there's a good match, we'll invite you to a short call to walk through the project, expectations, and next steps.