Build ingestion pipelines for structured and unstructured data using Python
Clean, normalize, and prepare data in formats suitable for LLM fine-tuning (e.g. JSONL, CSV)
Create high-quality, task-specific datasets for training and evaluation
Apply versioning to datasets using DVC or LakeFS for reproducibility
Generate embeddings using HuggingFace or Sentence Transformers
Manage vector indexes (FAISS, Weaviate) and optimize retrieval workflows
Tokenize and chunk long-form data for context window optimization
10 years of experience in a Data Engineering role
2 years of experience in an AI-adjacent data role
Proficiency in Python, pandas, and text processing tools
Familiarity with tokenization libraries (HuggingFace Tokenizers, SentencePiece)
Experience managing datasets and object storage (MinIO, NFS)
Understanding of LLM data constraints (context windows, formatting, prompt injection)
Full Time