Mediclaim Processing

Project to build and test classification models and neural networks to predict medical insurance claim acceptances.

NOTE: This project was completed over 8 weeks to fulfill the Capstone Project requirement of the PGP in Artificial Intelligence and Machine Learning at BITS Pilani – Hyderabad Campus. The following is a paraphrased version of the 92-page project report written in collaboration with three peers: Gandhi Gannamaneni, Mohammed Riaz, and Sai Gudipati.

Time Period: April 2020 – May 2020

Introduction


The objective of the project is to develop machine learning models for the healthcare industry that help insurance providers lower potential claim denials and reduce the operational cost of back-and-forth communication between the claim submitter and the provider. This, in turn, accelerates claim disbursement, saving time for both parties.

To this end, historical medical claim data was analyzed and used as the training basis for several machine learning models. The models were then compared on the test data using performance metrics such as Accuracy, Precision, Recall, F1 Score, and AUC.

Dataset Information


The problem is a binary classification task, as each claim falls into one of two categories: Accepted or Denied. The dataset contained information about 470k claims, each described by 21 features. The class imbalance was extreme, with 99.6% of the dataset being Accepted claims and the remaining 0.4% being Denied claims.

Data Preprocessing


Multiple data preprocessing activities were performed on the dataset before the models were trained. The following are a few of the data preprocessing activities:

  • Deleting the following:
    • Rows with an invalid Denial Code
    • Fields with over 60% null values
    • Irrelevant features, dropped based on input from a domain expert
  • Outlier Analysis: Boxplots were generated for a few critical features, and the data points identified as outliers were removed from the dataset. For example, a scatter plot of Claim Charge Amount against Provider Payment Amount revealed rows where the provider paid more than the claimed amount; these rows were treated as outliers and removed (see the sketch after this list).
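As an illustration, the payment-versus-charge filter can be written as a simple pandas mask. This is a minimal sketch; the file name and exact column names are assumptions based on the feature names mentioned above:

```python
import pandas as pd

# Load the claims data (file name is hypothetical).
df = pd.read_csv("claims.csv")

# A row is an outlier if the provider paid out more than was charged.
# Column names are assumed from the features named in the report.
mask = df["Provider Payment Amount"] > df["Claim Charge Amount"]
print(f"Dropping {mask.sum()} outlier rows")

df = df[~mask].reset_index(drop=True)
```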

After imputing missing values and removing features and data points as described above, the resulting dataset contained about 327k rows.

Generating Machine Learning Models


To generate the best prediction model for the project, the following tasks were performed:

  • 70-30 train-test split
  • Data scaling
  • Training multiple models
  • Hyperparameter tuning
  • Handling overfitting
  • GridSearchCV with K-Fold cross-validation to further fine-tune hyperparameters (sketched below)
  • Model evaluation on metrics such as Recall, Precision, F1 Score, and ROC-AUC
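The sketch below illustrates this workflow for one of the models, assuming a feature matrix `X` and a binary target `y` (1 = Denied); the parameter grid is illustrative, not the grid used in the project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 70-30 train-test split, stratified to preserve the heavy class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scaling and the classifier share one pipeline so the scaler is fit
# only on the training folds during cross-validation.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Illustrative search space, not the project's actual grid.
param_grid = {
    "clf__n_estimators": [100, 300],
    "clf__max_depth": [10, 20, None],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="recall",  # denials (the positive class) are what we want to catch
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```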

Models Built


The models were built in two phases. The first phase involved building classification models with scikit-learn, which plays host to many different classification algorithms; the following were chosen:

  • Decision Tree
  • Random Forest
  • SVM
  • XGBoost – gbtree
  • XGBoost – gblinear

Each of the above models was also trained on a second dataset, generated by applying SMOTE (Synthetic Minority Over-sampling Technique) to the cleaned dataset to tackle the class imbalance; the table in the Results section reports both variants.
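A minimal sketch of the oversampling step, assuming the imbalanced-learn package and the train split from the earlier sketch. SMOTE is applied to the training data only, so the test set keeps its natural class distribution:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Synthesize new minority-class (Denied) samples in the training split only.
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

print("Before:", Counter(y_train))
print("After: ", Counter(y_train_sm))
```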

The second phase used the H2O library, its AutoML framework, and a deep neural network (DNN). The following models were built (a minimal sketch of the H2O setup follows the list):

  • H2O – Gradient Boost
  • H2O – Gradient Boost (balance_classes = True)
  • H2O – Random Forest
  • H2O – Random Forest (balance_classes = True)
  • H2O – XGBoost
  • Stacked Ensemble – 1
  • Stacked Ensemble – 2
  • AutoML
  • DNN
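A minimal sketch of the H2O setup, shown for the balanced Gradient Boosting variant; the file name and the target column name (`claim_status`) are assumptions:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Load the cleaned dataset (file name is hypothetical) and mark the
# target as categorical so H2O treats the problem as classification.
claims = h2o.import_file("claims_cleaned.csv")
claims["claim_status"] = claims["claim_status"].asfactor()

train, test = claims.split_frame(ratios=[0.7], seed=42)
features = [c for c in claims.columns if c != "claim_status"]

# balance_classes=True oversamples the minority class during training,
# mirroring the "(BC=True)" variants in the results below.
gbm = H2OGradientBoostingEstimator(balance_classes=True, seed=42)
gbm.train(x=features, y="claim_status", training_frame=train)

print(gbm.model_performance(test).auc())
```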

Results


scikit-learn Models Comparison Report:

The following table lists the performance metric scores of the different models; all values are percentages.

| Model | Training Accuracy | Testing Accuracy | ROC | Recall | Precision | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Decision Tree | 99.72 | 99.36 | 89.70 | 71.13 | 76.23 | 73.59 |
| Decision Tree (SMOTE) | 99.65 | 97.42 | 93.63 | 85.21 | 30.93 | 45.38 |
| Random Forest | 99.72 | 99.40 | 95.97 | 72.89 | 77.97 | 75.34 |
| Random Forest (SMOTE) | 99.65 | 97.45 | 97.75 | 85.56 | 31.21 | 45.74 |
| SVM | 1.26 | 1.26 | 49.32 | 100.00 | 1.26 | 2.48 |
| SVM (SMOTE) | 64.31 | 52.83 | 74.70 | 77.29 | 2.03 | 3.96 |
| XGBoost – gbtree | 99.53 | 99.44 | 99.35 | 70.95 | 82.08 | 76.11 |
| XGBoost – gblinear | 98.74 | 98.74 | 85.74 | 0.00 | 0.00 | 0.00 |
| XGBoost – gbtree (SMOTE) | 96.59 | 96.52 | 99.26 | 93.84 | 25.74 | 40.39 |
| XGBoost – gblinear (SMOTE) | 86.59 | 91.73 | 95.45 | 86.09 | 11.79 | 20.75 |

There are two ways of looking at these results. First, if the objective is to reduce administrative costs, the focus is on correctly predicting denials, i.e. minimizing False Negatives (denied claims that the model predicts as accepted). Using Recall as the defining evaluation metric, the following are the best models:

  • XGBoost with gbtree (SMOTE)
  • Random Forest (SMOTE)
  • Decision Tree (SMOTE)

Second, if the provider's concerns are taken into consideration, correctly predicting accepted claims is also important, i.e. False Positives must be reduced as well. Using F1 Score, which balances Precision and Recall, as the defining evaluation metric, the following are the best models (a sketch of computing these metrics follows this list):

  • XGBoost with gbtree
  • Random Forest
  • Decision Tree
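For reference, these metrics can be computed from a fitted model's test-set predictions; a minimal sketch, reusing `search`, `X_test`, and `y_test` from the earlier GridSearchCV sketch:

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_pred = search.predict(X_test)              # hard class labels
y_prob = search.predict_proba(X_test)[:, 1]  # probability of the Denied class

print("Recall:   ", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_test, y_prob))
```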


H2O Models Comparison Report:

The following table lists the performance metric scores of the different models; all values except Logloss are percentages.

| Model | Training Accuracy | Testing Accuracy | ROC | Recall | Precision | F1 | Logloss |
| --- | --- | --- | --- | --- | --- | --- | --- |
| H2O-GB | 99.35 | 99.25 | 98.01 | 71.75 | 63.53 | 67.35 | 2.23 |
| H2O-GB (BC=True) | 93.23 | 99.15 | 98.11 | 52.58 | 80.88 | 63.73 | 2.76 |
| H2O-RF | 99.16 | 99.18 | 97.54 | 70.56 | 55.57 | 62.17 | 2.72 |
| H2O-RF (BC=True) | 92.31 | 99.12 | 98.10 | 52.84 | 73.98 | 61.65 | 2.89 |
| H2O-XG | 99.43 | 99.32 | 98.96 | 72.98 | 68.84 | 70.02 | 2.23 |
| Stacked Ensemble-1 | 99.52 | 99.32 | 98.19 | 74.32 | 68.14 | 71.09 | 2.73 |
| Stacked Ensemble-2 | 99.23 | 99.12 | 96.35 | 64.03 | 57.34 | 60.50 | 3.77 |
| AutoML | 99.37 | 99.34 | 98.16 | 70.05 | 67.07 | 68.53 | 2.33 |
| DNN | 92.25 | 99.14 | 97.83 | 68.86 | 57.16 | 62.47 | 4.08 |

Taking the same two perspectives for the H2O models, we can determine the following. For the first perspective, where administrative costs are to be reduced, the following are the best models:

  • H2O Gradient Boosting – H2O-GB (BC=True)
  • H2O RandomForestEstimator – H2O-RF (BC=True)
  • H2O XGBoostEstimator – H2O-XG

For the second perspective, where correctly predicting accepted claims is important, the following are the best models (a short AutoML sketch follows the list):

  • Stacked Ensemble-1
  • H2O XGBoostEstimator - H2O-XG
  • AutoML
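Of the models above, AutoML automates the model search. A short sketch, reusing the H2O frame, feature list, and (assumed) target column from the earlier H2O sketch:

```python
from h2o.automl import H2OAutoML

# Try a bounded number of models, including stacked ensembles,
# and rank them on a leaderboard.
aml = H2OAutoML(max_models=10, seed=42)
aml.train(x=features, y="claim_status", training_frame=train)

print(aml.leaderboard.head())
```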