Monitored production ETL pipelines in Azure Data Factory, maintaining ~99% successful daily load completion within SLA.
Refactored legacy SQL Server stored procedures into PySpark on Databricks, improving performance by ~25–30%.
Built and tested new ETL pipelines in Microsoft Fabric, ingesting legacy sources into Fabric Lakehouse and OneLake.
Built a Python utility to archive inactive ADLS Gen2 data to cold storage, cutting storage costs ~15–20%.
Implemented logging, alerting, and data quality validations to improve early failure detection.
Jr. Data Scientist (Freelance)
Aug 2023 — Dec 2023
Outlier AI · Client: Scale AI
Wrote 400+ Python & SQL solutions to train generative AI models, with a focus on correct syntax and logic.
Designed complex prompts to test model handling of database queries and edge cases.
Ran A/B comparisons of model outputs and flagged unsafe responses to reduce hallucinations.
Data Science Intern
Feb 2023 — Aug 2023
AIVariant
Led a customer segmentation project using behavioural and demographic data.
Applied K-Means & Agglomerative clustering; built a KNN classifier reaching ~85% accuracy.
RAG & Retrieval Engineering
Event-Driven Code RAG Slack Agent
Built a code-aware RAG agent that answers natural-language questions about a GitHub repo in Slack using hybrid retrieval (dense Qdrant search plus sparse TF-IDF/BM25 search, fused with Reciprocal Rank Fusion, then reranked) and returns file/line citations.
Added a GitHub webhook (HMAC-SHA256 verified) that rebuilds the vector index on every push so answers reflect current code; measured hit rate 1.00 and MRR 0.955.
Architected a cost-optimised AI support system with dynamic query routing via Kong AI Gateway: simple queries go to Llama-3.3-70b, complex ones to GPT-OSS-120b.
Added real-time sentiment-based CRM escalation, keeping cost under $0.10 per query and response time under 2 s.