Guglielmo Cassini
Data Scientist and Machine Learning Engineer
Email: [email protected]
Phone: 3453240798
Location: Milano (MI)
Date of birth: 26 August 1998
Nationality: Italian
Driving licence: B
LinkedIn: https://www.linkedin.com/in/guglielmo-cassini-05592a189/
GitHub: https://github.com/gCass
Professional Experience
Trustfull, ML and AI Engineer

As a Machine Learning and Artificial Intelligence Engineer, my responsibilities include managing and maintaining the codebase and the company's core models. I am also responsible for integrating new functionality by implementing cutting-edge AI models.

May 2025 | Milan, Italy
Lutech, Data scientist

Consultant working with multiple clients on data science and artificial intelligence projects.

May 2024 – present | Cinisello Balsamo (MI), Italy
Target Reply, Consultant

Junior consultant working as a Data Scientist and Data Engineer on various projects for different clients.

October 2022 – May 2024 | Milan, Italy
Ticinum Aerospace, Data Scientist - Part time

Development of web crawlers using Python and related frameworks.

Development and analysis of models in the wineinformatics and geospatial data fields.

July 2021 – July 2022 | Pavia, Italy
Sata Consulting, IT Technician Internship

Training and development in PowerBuilder. Training and use of SAP HANA. Alpha testing of internal company software.

Monitoring of the company network with Spiceworks, an online network monitoring tool.

June 2017 – July 2017 | Pavia, Italy
Education
AWS Machine Learning Specialist Certification
November 2023
Master of Science in Computer Engineering, specialization in Data Science, Università degli Studi di Pavia

Graduated with 110/110 cum laude

September 2020 – September 2022 | Pavia, Italy
Bachelor's Degree in Electronic and Computer Engineering, Computer Engineering curriculum, Università degli Studi di Pavia

Graduated with 104/110

September 2017 – November 2020 | Pavia, Italy
Computer science diploma, ITIS G. Cardano

Graduated with 100/100

2012 – 2017 | Pavia, Italy
Languages
Italian

Native speaker

Spanish

Near-native speaker

English

B2

Skills
Object-oriented programming and design patterns
Statistics and machine learning

Data analysis

Data Mining

Clustering algorithms

Predictive models

Linear Regression

Logistic Regression

Ridge Regression

Classification algorithms

Decision trees

Ensemble methods

Natural Language Processing & Text Mining

Reinforcement learning and Deep reinforcement learning
Hadoop and Apache Spark

Apache Hive

Apache Spark

Hadoop MapReduce

PySpark

Soft skills

Communication

Active listening

Negotiation

Infrastructure as Code

Terraform

CDK

Cloud Computing

ML Specialist

Development tools

Apache Hive; Apache Spark; Git/GitHub; Hadoop MapReduce; Jenkins; Jetty Web Server; Jupyter Notebook; PyCharm; Eclipse IDE; Android Studio

MATLAB & Simulink

Amazon Web Services

SageMaker (Studio, Endpoints), ECS, ECR, Lambda, Step Functions, Data Pipeline, EMR, AWS Batch, DMS, RDS, Redshift, DynamoDB, S3, EC2, Kinesis (Streams, Analytics, Firehose), Glue (Data Catalog, ETL), Athena, QuickSight, Bedrock

Google Cloud Platform

BigQuery

Google Cloud Storage

Vertex AI

SQL and NoSQL

MySQL, Presto SQL, Spark SQL, MongoDB

Deep learning

Keras

TensorFlow

CNNs, RNNs, and LSTM networks

Generative Artificial Intelligence
  • LangChain framework
  • Few-shot learning
  • Chain-of-thought prompting
  • Agentic AI
Programming languages and Frameworks

Android

Bash scripting

C

Dart, Flutter

HTML, CSS, Bootstrap

Java EE 7

JavaScript, Ajax

LabVIEW

MATLAB

OpenMP

PHP

PySpark

Python (scikit-learn, pandas, NumPy, matplotlib)

SQL

R

Software Engineering and Object-Oriented Programming

Design patterns

Clean code principles

CI/CD
  • GitHub Actions (intermediate)
  • Jenkins (basic)
  • AWS CodeDeploy & CodePipeline
Containerization
  • Docker and Docker Compose
Git and GitHub
Dataiku
Projects
Trustfull Main Product

As a backend, Machine Learning, and Artificial Intelligence Engineer, I work on the company's main product: a web platform for enriching personal data (such as emails, phone numbers, first names, and last names).

My goal is to process and interpret this data to create digital scoring systems, which help clients distinguish fraudulent users from legitimate ones.

May 2025
Target Stock, Amplifon

To improve the service level of the products that shops offer their customers and to provide an automated solution, the aim of the project is to build a model able to estimate the target stock of each product for each of the client's shops.

To do so, the project has been divided into two steps (a minimal sketch of the first step follows the technology list below):

September 2024 – present
  • a first classification task, in which a model identifies whether a product will sell in a given shop during the following week
  • a deterministic algorithm to estimate the quantity of that product to stock in that shop
  • The same approach has been applied to the countries in which the client operates (IT, DE, ES, BE) and is continuously being extended to new ones.

As a Data Scientist, my principal tasks were:

  • to build a data preparation flow to transform raw data into features
  • to identify useful features, and to train different models and evaluate their performance
  • to evaluate the performance of different deterministic algorithms through ad hoc analyses
  • to enhance the existing Power BI dashboards to make the model results available to final stakeholders
  • to present results and advancements to stakeholders.

The main technologies used were:

  • Dataiku
  • Python (with data science libraries: pandas, numpy, ...)
  • SQL
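
As a rough illustration of the first step, the sketch below trains a weekly sell/no-sell classifier; the features, synthetic data, and gradient-boosting model are assumptions for the example, not the client's actual pipeline.

# Minimal sketch of the weekly sell / no-sell classification step.
# Features, data, and model choice are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features for one (product, shop, week) row.
X = np.column_stack([
    rng.poisson(3, n),       # units sold in the previous week
    rng.poisson(12, n),      # units sold in the previous 4 weeks
    rng.integers(0, 52, n),  # week of year, as a seasonality proxy
])
y = (X[:, 0] + rng.normal(0, 1, n) > 2).astype(int)  # toy target: sells next week?

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print(f"F1 on held-out rows: {f1_score(y_te, clf.predict(X_te)):.2f}")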
Member Get Member, Amplifon

In the context of a promotional campaign, the client needed an automated flow that detects whether a customer qualifies for the campaign and determines the number of discounts the customer is entitled to.

July 2024 – present

As a Data Scientist, my principal tasks were:

  • to develop a data pipeline to obtain the principal KPIs of the customers involved in the process
  • to integrate the new flow with external software developed by a third-party company
  • to connect the final output to a Power BI dashboard, developed ad hoc to present results

The main technologies used were:

  • Dataiku
  • SQL
Demand Supply Forecasting, Amplifon

To reduce the costs of purchasing the industrial components used to manufacture its products, the client required a demand and supply forecasting system.

As a Data Scientist, my principal tasks were:

May 2024 – present

  • to build a data preparation flow to transform raw data into features
  • to identify useful features, and to train and evaluate different models
  • to evaluate model performance through ad hoc analyses
  • to build automated flows so that the model is retrained each month
  • to enhance the existing Power BI dashboards to make the model results available to final stakeholders
  • to present results and advancements to stakeholders.

The same project has been applied to the countries in which the client operates (IT, DE, ES, BE) and is continuously being extended to new ones.

The main technologies used were:

  • Dataiku
  • Python (with data science libraries: pandas, numpy, ...)
  • SQL
Delta Lake migration, Cortilia

The client owns a data platform, hosted as a data lake on AWS, used for analysis and for extracting data consumed by an external advertising platform that serves targeted advertising directly to prospective new customers. To do so, the platform relies on a large number of Glue jobs, each applying different logic.

The objective of the project was to design and provide a framework that generalizes these Glue jobs into a single code base, and to migrate the data lake to a Delta Lake structure.

March 2024 – May 2024

The main technologies used were:

  • Python, PySpark
  • AWS S3, Glue, Lambda, SQS
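
As a loose illustration of the single-code-base idea, a config-driven job driver might look like the sketch below; the transform registry, paths, and names are hypothetical, and this is not the framework actually delivered.

# Minimal sketch of a config-driven, generalized Spark job writing Delta.
# Paths and transform names are hypothetical; assumes the delta-spark
# package is available on the cluster.
from pyspark.sql import DataFrame, SparkSession

def select_recent(df: DataFrame) -> DataFrame:
    # Example transform: one of many interchangeable per-job logics.
    return df.where(df["event_date"] >= "2024-01-01")

TRANSFORMS = {"select_recent": select_recent}  # registry of job logics

def run_job(spark: SparkSession, config: dict) -> None:
    df = spark.read.parquet(config["source_path"])
    df = TRANSFORMS[config["transform"]](df)
    df.write.format("delta").mode("overwrite").save(config["target_path"])

if __name__ == "__main__":
    spark = SparkSession.builder.appName("generalized-job").getOrCreate()
    run_job(spark, {
        "source_path": "s3://bucket/raw/events/",    # hypothetical
        "target_path": "s3://bucket/delta/events/",  # hypothetical
        "transform": "select_recent",
    })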
Scorecard project, Mediaset

The client owns a scorecard project, consisting of a data lake hosted on the AWS cloud platform and composed of several tables designed to measure characteristics of the customer base through a series of KPIs.

The objective of the project was to fix the existing code and extend it, adding new classes and new tables to enrich the KPIs used by the business unit to analyze the customer base.

January 2024 – March 2024

The main technologies used in the project were:

  • AWS data lake suite: EMR, S3, Lambda
  • Python, PySpark
Document Ranking, Reply Holding

In order to facilitate the selection of a certain number of proposals for an event, an artificial intelligence model was developed with the aim of predicting whether a specific proposal for the event is worthy of being chosen. The model uses various machine learning techniques capable of working with both numerical and textual data, leveraging modern natural language processing techniques, and was trained on the AWS SageMaker cloud platform.

October 2023 – January 2024

Additionally, using both open-source Large Language Models (LLMs) and the OpenAI APIs, software was developed to outline the key points of each proposal, allowing quicker evaluation by the organizing committee.

Applied technologies:

  • AWS SageMaker (Notebook, Pipelines, Experiments, Hyperparameter tuning)
  • Python, NLTK
  • XGBoost framework, skopt package
  • Hugging Face Transformers, OpenAI, Llama 2, llama.cpp, LlamaIndex
Backend reporting, AXA (Data Analyst & Data Engineer)

Development of Key Performance Indicators (KPIs) in Presto SQL on AWS Athena, feeding a Qlik frontend. Analysis of business requirements.

Managed client interaction and the definition of business requirements.

October 2023 – December 2023

Technologies used:

  • SQL (Presto SQL and Spark SQL)
  • AWS Athena
  • AWS Glue
  • AWS Data Catalog
  • Terraform
Customer loyalty automation, YNAP

The customer loyalty automation project consists of automating the following two business processes:

- automating the assignment of customer loyalty levels in an e-commerce platform, querying the data platform and applying ETL jobs that check whether customers satisfy the business logic

July 2023 – September 2023

- automating the detection of high-value new customers.

Both parts of the project were developed in Python with PySpark, running on an EMR cluster; the jobs were scheduled with Airflow. A minimal sketch of such a job follows below.
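
The sketch below shows the shape of a loyalty-level assignment job in PySpark, under assumed column names, paths, and spend thresholds; the client's actual business logic is not reproduced here.

# Minimal PySpark sketch of a loyalty-level assignment job.
# Column names, paths, and thresholds are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("loyalty-levels").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders/")  # hypothetical path

spend = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_spend"))

levels = spend.withColumn(
    "loyalty_level",
    F.when(F.col("total_spend") >= 5000, "gold")
     .when(F.col("total_spend") >= 1000, "silver")
     .otherwise("bronze"),
)

levels.write.mode("overwrite").parquet("s3://bucket/loyalty_levels/")  # hypothetical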

Geographic analysis, YNAP

Using the geo purchasing power dataset created in a previous project, the aim of the project was to analyze how the customers of the client's e-commerce platform are distributed across the United Kingdom, with a focus on England, and how different city-level KPIs, such as mean household income and the price per square metre of housing, relate to the amount spent by customers in each city.

The project lasted one week and was presented at an internal client conference of the growth strategy unit, chaired by the CGO.

July 2023

Developed in Python in Jupyter notebooks to produce the report.

Geo purchasing power, YNAP (dataset creation)

The aim of the Geo purchasing power project was to create a table with a KPI representative of the purchasing power of a given postal code in three different countries: the United Kingdom, the United States, and Italy. The KPIs are numerous and differ across the three countries. Each KPI is accompanied by a description of its granularity and by a decile value from 1 to 10 of the quantity (a minimal sketch of the decile computation follows the task list below).

For the project, the following technologies were used:

April 2023 – July 2023

- PySpark

- Hive

- Hue

- Jupyter notebooks

- Airflow

- Jira for Agile organization

Tasks:

- Data sourcing

- Data cleaning and wrangling

- Data analysis

- Product code creation

- Table creation
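
One plausible PySpark formulation of the decile step uses ntile over a per-country window, as sketched below; the column names and sample rows are assumptions for illustration.

# Minimal PySpark sketch: per-country decile (1-10) of a KPI by postal code.
# Column names and sample rows are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("geo-deciles").getOrCreate()

kpis = spark.createDataFrame(
    [("UK", "SW1A", 52000.0), ("UK", "M1", 31000.0), ("IT", "27100", 24000.0)],
    ["country", "postal_code", "mean_income"],
)

w = Window.partitionBy("country").orderBy(F.col("mean_income"))
deciles = kpis.withColumn("income_decile", F.ntile(10).over(w))
deciles.show()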

Machine Learning Engineer, AXA

The client required the creation of a single template repository, usable for every data science project, through the development of a Docker image.

This image runs on an AWS SageMaker endpoint and handles calls to different templates based on the project, abstracting away which underlying template and libraries are required.

March 2023 – April 2023

To do this, the Docker image runs a Python script that initializes a first web server thread via FastAPI.

Through an appropriate HTTP request to that web server, a data scientist can specify which libraries the project requires by uploading wheel files, so that the model relies on them for execution.

At that point, the web server instantiates a second web server thread within the endpoint. When an HTTP request is made to the inference endpoint to obtain a prediction from the desired model, the first web server thread redirects the request to the second one, which handles the interaction with the model as if it were a normal endpoint.
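
A heavily simplified sketch of this two-server proxy pattern is shown below; the routes, the inner port, and the installation flow are assumptions for illustration, not the client's implementation.

# Minimal sketch of the two-server proxy pattern described above.
# Routes, the inner port, and the install flow are illustrative assumptions.
import subprocess
import sys

import httpx
import uvicorn
from fastapi import FastAPI, Request, Response

outer = FastAPI()
INNER_URL = "http://127.0.0.1:9000"  # assumed address of the inner model server

@outer.post("/install")
def install_wheel(wheel_path: str):
    # Install a project-specific wheel; the inner server (not shown) would
    # then be started in a separate thread using the new libraries.
    subprocess.check_call([sys.executable, "-m", "pip", "install", wheel_path])
    return {"installed": wheel_path}

@outer.post("/invocations")
async def invocations(request: Request) -> Response:
    # Forward the inference request to the inner server and relay its answer.
    payload = await request.body()
    async with httpx.AsyncClient() as client:
        inner = await client.post(f"{INNER_URL}/invocations", content=payload)
    return Response(content=inner.content, media_type="application/json")

if __name__ == "__main__":
    uvicorn.run(outer, host="0.0.0.0", port=8080)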

Activities performed and technologies used:

- introduction to the client's core services

- introduction to example infrastructures in the client's AWS cloud and how they are connected

- introduction to Artifactory

- Python script development and related unit testing

- development and modification of the existing Jenkins pipeline

CI/CD implementation and overview, Stellantis

The aim of the project was to create continuous integration and continuous delivery pipelines for several existing projects, and to migrate those projects so they could benefit from the pipelines.

My tasks included:

February 2023 – 2023

  • developing Python utilities to implement checks in the continuous integration and continuous delivery pipeline
  • designing the pipeline
  • testing the CI/CD pipelines.

Main technologies used:

  • Azure DevOps
  • Azure Data Factory (basic level)
  • Python
  • GitHub
Backend reporting, AXA

The aim of the project was to develop and maintain several Qlik reports used by business analysts to monitor key performance indicators on insurance product sales and usage, enriching the client's data platform.

The main activities were:

October 2022 – March 2023

  • analysis of business requirements
  • development of new KPI tables in Presto SQL
  • porting of queries from Presto SQL to Spark SQL

The main technologies used were:

  • SQL, Presto SQL, Spark SQL
  • AWS Athena, AWS Glue
Privacy protection in IoT: a deep reinforcement learning approach, Master's degree thesis

Electronic communications are always exposed to privacy risks: in any message-based interaction, an endpoint can exploit both the data exchanged and the metadata in the message to disclose information about the sender. In the Internet of Things, a dynamic context where a huge number of devices exchange messages, the privacy risk is a crucial and timely issue. In this context, service discovery is the process of finding the services offered by IoT devices according to clients' requests. Many solutions have been proposed for it, but privacy protection is still an important aspect to investigate. The considered environment consists of a mobile or wearable device which aims to find and obtain a certain number of services, offered by different providers, within a target deadline. The device moves along a path where it can encounter the service providers. Interacting with service providers often requires the exchange of data that can be sensitive for the device owner. In addition, service providers can collude, combining the data gained from different providers, leveraging their value and raising the privacy risk for the user. The objective of this thesis is to develop a solution that improves privacy protection by applying deep reinforcement learning techniques: deep Q-learning and the actor-critic method. The performance of the agent is evaluated through metrics defined ad hoc for the problem. In addition, the objectives of the thesis include the development of the simulator used for the experiments and the creation of a dataset of IoT mobile service providers, available for future research.
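
For reference, the temporal-difference update at the core of deep Q-learning can be sketched in tabular form as below; the thesis itself uses neural approximators, and the sizes and hyperparameters here are toy values.

# Tabular sketch of the Q-learning update that deep Q-learning approximates
# with a neural network. Sizes and hyperparameters are toy values.
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99  # learning rate, discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    # One temporal-difference update of Q(s, a).
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])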

AWS Super, University project in Master's degree

Development of a cloud platform for DNA substring computation through an on-demand HPC cluster based on Docker containers. Developed on AWS.

ARXIV, University project in Master's degree

Analysis of paper category popularity in the arXiv dataset using graph analysis techniques, and development of a recommender system based on NLP and clustering techniques.

Developed using Python (pandas, numpy, scikit-learn), Apache Spark, PySpark, MapReduce, MongoDB.

QUICK, University project in Master's degree

Performance analysis of the HTTP/3 protocol, based on QUIC, compared with HTTP/1 and HTTP/2 using different metrics and under different network instability conditions.

Substring parallel, University project in Master's degree

Development of a parallelized version of a substring-matching algorithm: the longest common subsequence (LCS) algorithm.

The algorithm was developed in C, parallelized with OpenMP, and tested on Google Cloud Platform using instances with different settings, for example varying the number of processors and the amount of available memory. A sequential sketch of the underlying dynamic program is shown below.
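
For illustration, here is the sequential LCS dynamic program in Python; in the C/OpenMP version, one common way to expose parallelism is over anti-diagonals, whose cells are mutually independent, though the project's exact strategy is not detailed here.

# Sequential sketch of the LCS dynamic program (the project's version is in
# C with OpenMP). Cells on the same anti-diagonal are independent, which is
# one common way to expose parallelism.
def lcs_length(a: str, b: str) -> int:
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

print(lcs_length("AGGTAB", "GXTXAYB"))  # prints 4 ("GTAB")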

CDN LOGMININ, Bachelor's degree thesis

Starting from anonymous log entries coming from a content delivery network server of an entertainment service, a learning algorithm was developed and applied to cluster the entries belonging to the same user.

Developed in Python, with pandas, NumPy, Matplotlib, and scikit-learn.

NQUEEN, University project in Bachelor's degree

Development of multithreaded software written in Java. Given an N×N chessboard, the software computes all possible solutions of the N-queens problem. A sketch of the counting approach follows below.
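
As a rough illustration (in Python rather than the project's Java), a backtracking solver can be parallelized over the column of the queen in the first row, mirroring a natural multithreaded decomposition; this is a sketch, not the project's code.

# Backtracking N-queens counter, parallelized over the first row's column
# (illustrative sketch; the original project is multithreaded Java).
from concurrent.futures import ProcessPoolExecutor

def count_from(n, cols, diag1, diag2, row):
    # Count completions given queens already placed in rows 0..row-1.
    if row == n:
        return 1
    total = 0
    for col in range(n):
        if col in cols or (row - col) in diag1 or (row + col) in diag2:
            continue  # attacked square
        total += count_from(n, cols | {col}, diag1 | {row - col},
                            diag2 | {row + col}, row + 1)
    return total

def solve_first(args):
    n, col = args  # place the first queen at (row 0, col), then recurse
    return count_from(n, {col}, {-col}, {col}, 1)

if __name__ == "__main__":
    n = 8
    with ProcessPoolExecutor() as pool:
        print(sum(pool.map(solve_first, [(n, c) for c in range(n)])))  # 92 for n=8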

GASSIX, University project in Bachelor's degree

Development of an optimal one-step predictor of household gas consumption, given the consumption of the past six days. It was developed by evaluating different models: a polynomial model, a multilayer neural network, and a radial basis function neural network.

The model was developed in MATLAB. A minimal sketch of the lagged-feature setup is shown below.
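
To illustrate the one-step-ahead setup, the sketch below fits a linear model on six lagged values; the linear model and synthetic data are stand-ins for the MATLAB models actually compared.

# Minimal sketch of a one-step-ahead predictor on six lagged values.
# The linear model and synthetic series are illustrative stand-ins.
import numpy as np

def make_lagged(series, lags=6):
    # Build (X, y): each row of X holds the previous `lags` values of y.
    X = np.column_stack([series[i:len(series) - lags + i] for i in range(lags)])
    y = series[lags:]
    return X, y

rng = np.random.default_rng(0)
consumption = rng.random(100).cumsum()  # toy stand-in for daily gas data
X, y = make_lagged(consumption)
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
next_day = np.r_[consumption[-6:], 1.0] @ coef  # one-step-ahead prediction
print(next_day)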

RIS8, University project in Bachelor's degree

Development of a web platform to configure computers, manually or in an automated way. Developed in Java using the Jetty framework.

ArduFit & Heartz, Diploma project

Development of ArduFit, a wearable hardware device based on the Arduino Nano which measures the user's heartbeat and body temperature.

Development of Heartz, an Android smartphone application which communicates with ArduFit, reading the measurements and storing and preprocessing them for future features.