Profile

Experienced Data Scientist with a demonstrated history of working in the IT industry, collecting data and drawing insights to improve business operations and solve business problems.

Personal Projects
WeldCraft, Hybrid Search, RAG
Apr 2024 – Apr 2024
  • Deployed Chroma (an open-source vector database) on EC2 behind a security layer.
  • Developed a hybrid search (keyword search + semantic search) algorithm for PDFs and web pages using LangChain.
  • Built a RAG chatbot with chat history for these PDFs and web pages using LangChain and OpenAI (gpt-3.5-turbo).
  • Deployed the hybrid search and RAG chatbot as a FastAPI endpoint behind Nginx.
  • Designed a web app for accessing these endpoints using Next.js and Tailwind CSS.
  • Built a CI/CD pipeline using GitHub Actions with EC2 as a self-hosted runner.
  • Tech: LangChain, vector database, RAG, OpenAI, LLM, GitHub Actions, EC2
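The hybrid search idea above can be sketched without LangChain as a weighted fusion of a keyword score and a semantic-similarity score. This is a minimal illustration, not the project's actual code: the function names, toy embeddings, and the 0.5 blending weight are all assumptions.

```python
import math

def keyword_score(query, doc):
    """Fraction of query terms found in the document (a crude BM25 stand-in)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Blend keyword and semantic scores; alpha=1.0 would be pure keyword search."""
    scored = []
    for text, vec in docs:
        score = alpha * keyword_score(query, text) + (1 - alpha) * cosine(query_vec, vec)
        scored.append((score, text))
    return [text for score, text in sorted(scored, reverse=True)]

# Toy corpus with pretend 2-d embeddings
docs = [
    ("welding safety checklist", [1.0, 0.0]),
    ("arc welding techniques", [0.9, 0.1]),
    ("invoice template", [0.0, 1.0]),
]
print(hybrid_rank("welding safety", [1.0, 0.0], docs)[0])  # → welding safety checklist
```

In practice the keyword side would be BM25 and the vectors would come from an embedding model, but the fusion step is essentially this weighted sum.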

    Nov 2023 – present
  • Designed and implemented a dynamic document-based Q&A platform, enabling users to upload diverse file formats (PDFs, Excel, JSON, text) for intelligent querying.
  • Integrated OpenAI and LangChain for advanced natural language processing, while securely storing documents on AWS S3.
  • Implemented Clerk for robust authentication, Pinecone for efficient document indexing, and PostgreSQL (Neon DB) for seamless chat and message history storage.
  • Additionally, incorporated Stripe for streamlined payment processes, enhancing the overall user experience.
  • Crafted a responsive and intuitive front end using Next.js, TypeScript, and Tailwind CSS, complemented by a robust Node.js backend.
  • Deployed the web app on Vercel.
  • Tech: OpenAI, RAG, LangChain, Pinecone, AWS S3, Next.js, TypeScript, Tailwind CSS
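The querying flow described above ultimately reduces to assembling a prompt from retrieved chunks and recent chat history. A minimal sketch, with a hypothetical helper name and prompt wording (not the platform's actual implementation):

```python
def build_rag_prompt(question, chunks, history, max_history=3):
    """Assemble an LLM prompt from retrieved chunks and recent chat turns."""
    context = "\n---\n".join(chunks)
    recent = history[-max_history:]            # keep only the latest turns
    turns = "\n".join(f"{role}: {text}" for role, text in recent)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Chat history:\n{turns}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 5 days."],
    [("user", "Hi"), ("assistant", "Hello!")],
)
print(prompt)
```

The history cap matters in production: chat transcripts grow without bound, while the model's context window does not.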

  • The aim was to extract questionnaires from given text and JSON inputs without using spaCy.
  • Developed a robust solution to extract questionnaires from unstructured text and JSON inputs using advanced NLP techniques.
  • Conducted thorough preprocessing of both text and JSON data to ensure optimal performance of the models.
  • Implemented two classification models: an XGBoost classifier for JSON data and an LSTM for text data.
  • Employed fine-tuning techniques on BERT for Named Entity Recognition (NER) to eliminate irrelevant information from the output of classification models.
  • Achieved enhanced accuracy and efficiency in questionnaire extraction through iterative model training and refinement.
  • Tech: XGBoost, LSTM, PyTorch, TensorFlow, Hugging Face
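The two-model setup above implies a routing step that decides which classifier handles each input. A sketch of that dispatch, with lambda stubs standing in for the trained XGBoost and LSTM models (all names here are illustrative):

```python
import json

def is_json(payload: str) -> bool:
    """Decide which model should handle the input."""
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False

def classify(payload, json_model, text_model):
    """Route JSON inputs to the XGBoost-style model, free text to the LSTM-style model."""
    return json_model(payload) if is_json(payload) else text_model(payload)

# Placeholder models standing in for the trained classifiers
json_model = lambda p: "question" if "?" in p else "other"
text_model = lambda p: "question" if p.rstrip().endswith("?") else "other"

print(classify('{"q": "What is your age?"}', json_model, text_model))  # → question
```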

  • Built a CNN to classify plant seedling images into 10 classes.
  • Performed data augmentation using Keras's ImageDataGenerator, achieving 90% accuracy.
  • Tech: CNN, data augmentation, TensorFlow, Adam, ReLU, Softmax

  • The aim was to classify accident severity, based on vehicle type, driver's age, sex, and 28 more features, into three classes: slight, serious, and fatal.
  • Found that most accidents with fatal injuries happened on weekends between 2 pm and 7 pm. Achieved a 90% weighted F1 score with a tuned ExtraTrees classifier. Implemented explainable AI with SHAP and deployed the model on the Heroku cloud.
  • Tech: oversampling (SMOTE), recall, precision, weighted F1, ExtraTrees classifier, Streamlit
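The weighted F1 quoted above is the per-class F1 averaged with weights proportional to each class's support, which is why it is the right summary for an imbalanced slight/serious/fatal split. A stdlib sketch of the computation (scikit-learn's `f1_score(average='weighted')` does the same):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * support[c] / len(y_true)   # weight by class frequency
    return total

y_true = ["slight", "slight", "serious", "fatal"]
y_pred = ["slight", "serious", "serious", "fatal"]
print(round(weighted_f1(y_true, y_pred), 3))  # → 0.75
```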

  • Implemented regressor models to reduce the energy consumption of buildings through a combination of easy-to-implement fixes and state-of-the-art strategies; buildings accounted for 37% of global energy-related and process-related CO2 emissions in 2020 across their lifecycle from construction to demolition.
  • Found that features like floor area and ENERGY STAR rating affect the EUI, while no relationship was observed between the EUI and the weather-related numerical columns. Performed explainable AI with SHAP.
  • Tech: KNN imputer, label encoder, cross-validation, CatBoost and Random Forest regressors, Optuna
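KNN imputation, used above for missing values, fills each gap with the mean of that feature over the k rows nearest on the observed features. A simplified stdlib sketch under the assumption that complete rows exist (scikit-learn's `KNNImputer` handles the general case):

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        obs = [i for i, v in enumerate(row) if v is not None]
        # Distance measured on observed features only
        neighbors = sorted(
            complete,
            key=lambda c: math.sqrt(sum((row[i] - c[i]) ** 2 for i in obs)),
        )[:k]
        new_row = list(row)
        for i, v in enumerate(row):
            if v is None:
                new_row[i] = sum(n[i] for n in neighbors) / len(neighbors)
        filled.append(new_row)
    return filled

data = [
    [1.0, 10.0],
    [1.1, 11.0],
    [5.0, 50.0],
    [1.05, None],  # nearest neighbors are the two rows near 1.0
]
print(knn_impute(data)[3])  # → [1.05, 10.5]
```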

    Skills
    Machine Learning: Classification, Regression, Clustering, Decision Trees, K-Means Clustering, Hierarchical Clustering
    Deep Learning: DNN, CNN, RNN, Transfer Learning, LSTM, LLM, GenAI, RAG
    Statistical Methods: Predictive Analysis, Hypothesis Testing and Confidence Intervals, Principal Component Analysis, LDA and Dimensionality Reduction
    Programming Languages and Tools: Python, Java, HTML and CSS
    Version Control Tools: Git
    Python Libraries: Scikit-learn, SciPy, Statsmodels, Pandas, NumPy, Seaborn, Matplotlib, Selenium, BS4, NLTK, TensorFlow, Keras, PyTorch, LangChain, OpenAI, Hugging Face, SHAP, Boto3, Flask, FastAPI, Streamlit
    Scripting Language: Unix shell
    Database Language: SQL
    Data Reporting Tools: Excel, GCP Data Studio
    Cloud Tools: AWS (SageMaker, S3, Lex, Lambda, Bedrock, RDS, EC2, API Gateway, Secrets Manager), GCP (BigQuery, Dataproc, Dataflow, Vertex AI, Cloud SQL, etc.)
    Professional Experience
    Dec 2022 – present | Remote, India
  • Bank Statement Analyser(BSA):
  • Using computer vision techniques and algorithms such as image processing, pattern recognition, and optical character recognition (OCR), developed a solution that improved the accuracy of detecting transactions and other details in bank statements by over 95%. (API link)
  • The implementation of the computer vision solution had a significant impact on the organization, reducing manual labor, improving accuracy, and saving time.
  • Deployed and managed a RESTful API on Google Cloud Platform using Docker for 20+ countries, ensuring scalability, security, and optimized performance.
  • Collaborated closely with clients to understand their unique objectives, translating their requirements into actionable steps for the implementation of our machine learning solution, leading to highly customized and effective deployments.
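A toy version of the transaction-detection step described above: parse OCR'd statement lines with a regular expression. The line format, pattern, and function name are illustrative assumptions; a production parser handles far messier layouts.

```python
import re

# Illustrative line format: date, description, amount
LINE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?\d+\.\d{2})$")

def parse_statement(ocr_text):
    """Extract (date, description, amount) tuples from OCR'd statement text."""
    txns = []
    for line in ocr_text.splitlines():
        m = LINE.match(line.strip())
        if m:
            date, desc, amount = m.groups()
            txns.append((date, desc, float(amount)))
    return txns

sample = """\
01/02/2024  ATM WITHDRAWAL   -200.00
03/02/2024  SALARY CREDIT    1500.00
page 2 of 3
"""
print(parse_statement(sample))
```

Note how non-transaction lines (page footers, headers) simply fail to match and are dropped, which is one reason regex-over-OCR is a reasonable first-pass filter.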
  • Bank Statement Extraction(BSE):
  • Built an API for extracting basic details like account holder name, A/C number, statement duration, etc.
  • Used OpenAI GPT model to get these basic details from raw OCR output.
  • Successfully deployed the API on Google Cloud Platform (GCP), ensuring scalability and reliability for seamless integration with the application's backend infrastructure.
  • Intelligent Document Retrieval and Chatbot System:
  • Document Processing:
  • Implemented efficient chunking logic to partition documents into manageable segments for streamlined processing.
  • Conducted rigorous testing of chunking efficiency, fine-tuning parameters to optimize performance.
  • Model Deployment and Integration:
  • Selected and deployed embedding and Large Language Model (LLM) models compatible with AWS Bedrock, ensuring seamless integration with the system architecture.
  • Configured and deployed ChromaDB on AWS as a client-server, establishing Lambda triggers for real-time document updates in S3 and efficient embedding addition/update operations.
  • Document Access Control and Chat History Management:
  • Designed and implemented a schema for storing chat history in a relational database (RDS), capturing and storing each conversation with precision.
  • Chatbot Pipeline Development:
  • Developed a sophisticated pipeline to process user questions, user IDs/usernames, and chat history seamlessly.
  • Integrated logic to determine document access based on user ID, delivering tailored responses accordingly.
  • Monitoring and Validation:
  • Established CloudWatch metrics for monitoring Lambda calls and outputs, ensuring system reliability and performance.
  • Conducted extensive validation and testing of the end-to-end pipeline, crafting benchmark questions and meticulously assessing retrieval and response accuracy.
  • Documented testing outcomes comprehensively, iteratively refining the system for enhanced performance.
  • API Gateway Setup:
  • Orchestrated the setup of Lambda functions and API gateway to facilitate seamless API interactions, enabling effortless integration with external systems.
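The chunking logic mentioned at the start of this system can be sketched as a sliding window with overlap, so text spanning a chunk boundary appears in two chunks and remains retrievable. A minimal character-based sketch; the sizes are illustrative, and real pipelines usually split on tokens or sentences:

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into overlapping character windows for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 250
parts = chunk_text(doc, chunk_size=100, overlap=20)
print([len(p) for p in parts])  # → [100, 100, 90]
```

Tuning `chunk_size` and `overlap` against retrieval accuracy is exactly the kind of parameter fine-tuning the testing step above refers to.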
    Oct 2021 – Dec 2022 | Pune, India
  • Developed a machine learning model using Python and scikit-learn library that predicted the likelihood of insurance claims being fraudulent based on claimant information and historical claims data.
  • Evaluated the model's performance using metrics such as precision, recall, and F1-score, and presented the results to stakeholders.
  • The project resulted in a 20% reduction in fraudulent claims, saving the insurance company millions of dollars in losses.
  • Azure cloud migration: migrated on-prem (As-Is) systems to the Azure cloud (To-Be).
    Dec 2018 – Oct 2021 | Pune, India
  • Generated alert insights using Pandas and NumPy: scraped raw data from alert-reporting tools such as WebGUI using Selenium and BS4, and analyzed it to avoid unnecessary failures.
  • Provided primary Big Data and ETL support: monitored Informatica, Ab Initio, and Big Data jobs through TWS, SOA, and Informatica.
  • Triggered automatic emails to the client to flag Visa batch start and end by reading live SVG graphs/images from a website using Python.
  • Generated daily reports such as Big Data (Hound cluster and Omega cluster) job completion from TWS data scraped with Selenium and BS4.
    Certificates
    Education
    Jul 2014 – Jul 2018