The Result
We built an API that predicts ICD-10 codes using ML models hosted on the Hugging Face platform, and a second API that automates the preparation of new data and the model retraining cycle.
The Challenge
Task: automate ICD-10 classification of pathologists' medical reports. We treated this as a text classification problem, using BERT Transformer technology together with ML classification algorithms: logistic regression, SVC (support vector classifier), XGBoost, and MLP. The models were deployed, stored, and accessed through the Hugging Face AI platform.
Our NLP engineer was responsible for data processing and for creating the training, testing, and validation datasets. Data preparation required in-depth data analysis and visualization of n-grams (Python libraries: spaCy, NLTK, Plotly, Pandas, regex, transformers).
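To illustrate the kind of n-gram analysis mentioned above, here is a minimal sketch that counts word bigrams across a few reports using only the Python standard library. The function name `ngram_counts`, the simple regex tokenizer, and the sample report texts are assumptions made for this illustration; the project itself used spaCy and NLTK for text processing and Plotly for the charts.

```python
import re
from collections import Counter


def ngram_counts(texts, n=2):
    """Count word n-grams across a list of report texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphanumeric tokens only (a deliberately simple tokenizer).
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Slide a window of size n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical report snippets, invented for illustration only.
reports = [
    "Invasive ductal carcinoma of the left breast.",
    "Ductal carcinoma in situ, left breast.",
]

bigrams = ngram_counts(reports, n=2)
print(bigrams.most_common(3))
```

Counts like these can be loaded into a Pandas DataFrame and plotted as a bar chart to see which phrases dominate the reports for each ICD-10 code.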
This project also required MLOps skills, namely working with Google AI products: creating and using VM instances, an SQL database, and buckets for information storage and retrieval.
Part of our NLP team's responsibility was also to consult and support the client in model deployment and usage. A large part of our work was visualizing the predictions of the trained models. This task required using the SHAP Python library, as well as writing and running SQL queries against the database that stores the output of API requests. The results of these analyses were visualized and used to improve the models further.
The Solution
We created an ML pipeline that includes:
- Big data processing (data visualization, analysis, cleaning, post-processing)
- BERT transformer text embedder
- Group voting of three classification algorithms (XGBoost, SVC, and MLP)
- API for model prediction
- API for automatic retraining
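The group voting step can be sketched as a simple majority (hard) vote over the labels predicted by three classifiers. This is a minimal illustration, not the project's implementation: the function names, the tie-breaking rule (the earliest model in the list wins), and the sample ICD-10 predictions are all assumptions. In practice, scikit-learn's `VotingClassifier` offers the same idea with fitted estimators.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; ties go to the earliest model in the list."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def vote_batch(per_model_predictions):
    """Combine per-model label lists (one list per model) into one label per report."""
    return [majority_vote(list(labels)) for labels in zip(*per_model_predictions)]


# Hypothetical outputs of three classifiers for three reports (illustrative only).
clf_a_preds = ["C50.9", "D05.1", "C18.7"]
clf_b_preds = ["C50.9", "C50.9", "C18.7"]
clf_c_preds = ["C50.4", "D05.1", "K63.5"]

final = vote_batch([clf_a_preds, clf_b_preds, clf_c_preds])
print(final)
```

Each report receives the code that at least two of the three models agree on. When all three disagree, this sketch falls back to the first model's answer, which is one possible policy among several.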
Technology Stack
- Google ML: VM instances, Google database, Google buckets, Google Colab, PyMySQL
- ML libraries: Transformers, scikit-learn, NumPy, XGBoost, PyTorch
- Data processing and visualization: SHAP, Pandas, Plotly, spaCy, NLTK, regex
- IDEs and tools: DBeaver, PyCharm, VS Code, DataSpell