The Challenge
Following a data security breach, a leading healthcare company wanted to implement a universal data classification model across the organization, enforcing least-privilege access and controls over all of its data. The lack of business metadata and inconsistent table/attribute naming standards required deep analysis to correctly categorize every attribute into the following data classification levels:
Public: Public information can be openly shared on your website and does not require any additional controls when used.
Internal: Internal information is company-wide. It includes the employee handbook, various policies, and company-wide memos. If disclosed, it has a minimal impact on the business.
Confidential: Confidential information is team-wide, and its use should be contained within the business. If disclosed, it could negatively affect the business.
Restricted: Restricted information is highly sensitive and limited to a need-to-know basis. It is protected with a Non-Disclosure Agreement (NDA) to minimize legal risk. If disclosed, there would be a significant financial or legal impact on the business.
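In code, these four levels can be captured as a simple enumeration so that every downstream classification step emits one of them. The sketch below is only an illustration; the numeric ordering by sensitivity is an assumption of this write-up, not part of the policy text.

```python
from enum import Enum

class ClassificationLevel(Enum):
    """Data classification levels, ordered here from least to most sensitive (assumed ordering)."""
    PUBLIC = 1        # openly shareable, no additional controls required
    INTERNAL = 2      # company-wide; minimal impact if disclosed
    CONFIDENTIAL = 3  # team-wide; negative business impact if disclosed
    RESTRICTED = 4    # need-to-know, NDA-protected; significant financial/legal impact if disclosed
```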
Classification Approach:
An automated machine learning (ML) based process classifies attributes in a faster and more cost-effective way. It also continuously improves the prediction process and achieves better results with every iteration.
Text classification, popularly known as text tagging or text categorization, is the most suitable ML solution. It breaks attribute text down in different ways to classify it into the appropriate protection groups.
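As an illustration of such a text classifier, the sketch below trains a TF-IDF plus logistic regression model on a handful of labeled attribute descriptions. The sample records, labels, and model choice are assumptions for demonstration only; the production model may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: attribute name/definition text -> classification level
texts = [
    "patient social security number",
    "member diagnosis code and treatment notes",
    "employee handbook revision date",
    "public press release title",
]
labels = ["Restricted", "Confidential", "Internal", "Public"]

# Character n-grams tolerate cryptic attribute names such as PT_SSN_NBR
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["member social security nbr"]))  # likely "Restricted", given the character overlap
```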
The steps to classify table names, attribute names, and their definitions into the correct protection groups are:
Data Preparation – the following pre-processing actions are performed (see the data preparation sketch after this list):
Data standardization – handle special characters and extra spaces, remove duplicates, and strip punctuation
Elimination of common words that do not add any value
Data Classification Models – executed as a waterfall (a combined sketch follows this list):
Full match – compare the text against the training texts for an exact match and classify any matches
Boyer-Moore algorithm – efficient substring search for partial matches
Tokenization in NLP – break the text into a list of tokens/words and match them
Processing engine – categorize based on class words and keywords
Refine the models based on feedback
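A minimal sketch of the data preparation step described above, assuming the inputs are raw table/attribute names and definitions; the stop-word list is only an example.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "for", "to"}  # example list only

def standardize(text: str) -> str:
    """Lowercase, strip punctuation/special characters, collapse extra spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and special characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text

def remove_stop_words(text: str) -> str:
    """Eliminate common words that add no classification value."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def prepare(records: list) -> list:
    """Standardize, strip stop words, and remove duplicates (order preserved)."""
    cleaned = [remove_stop_words(standardize(r)) for r in records]
    return list(dict.fromkeys(cleaned))  # de-duplicate

print(prepare(["Member  SSN#", "member ssn", "Date of Birth (DOB)"]))
# ['member ssn', 'date birth dob']
```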
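And a sketch of the waterfall itself, under the assumption that the training data is a mapping from cleaned text to protection group and that the class-word/keyword rules are a simple lookup. The Boyer-Moore step is shown with a Boyer-Moore-Horspool implementation, a common simplification of the full algorithm; all sample data is illustrative.

```python
def horspool_find(text: str, pattern: str) -> bool:
    """Boyer-Moore-Horspool substring search: True if pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return m == 0
    # Bad-character shift table built from every pattern character except the last
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    i = m - 1                       # index in text aligned with the last pattern character
    while i < n:
        j, k = m - 1, i
        while j >= 0 and text[k] == pattern[j]:
            j -= 1
            k -= 1
        if j < 0:
            return True             # full pattern matched
        i += shift.get(text[i], m)  # skip ahead based on the aligned text character
    return False

def classify(text: str, training: dict, keywords: dict) -> str:
    """Waterfall: full match -> Boyer-Moore substring -> token match -> keyword rules."""
    # 1. Full match: exact comparison against the training texts
    if text in training:
        return training[text]
    # 2. Boyer-Moore(-Horspool): a training text found inside the candidate text
    for known, group in training.items():
        if horspool_find(text, known):
            return group
    # 3. Tokenization: match individual tokens against the training texts
    tokens = text.split()
    for token in tokens:
        if token in training:
            return training[token]
    # 4. Processing engine: class-word / keyword rules
    for token in tokens:
        if token in keywords:
            return keywords[token]
    return "Unclassified"           # fall through for manual review and model refinement

training = {"member ssn": "Restricted", "diagnosis code": "Confidential"}
keywords = {"dob": "Confidential", "memo": "Internal"}
print(classify("member ssn", training, keywords))              # Restricted (full match)
print(classify("primary member ssn nbr", training, keywords))  # Restricted (substring match)
print(classify("patient dob", training, keywords))             # Confidential (keyword rule)
```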
To keep the model evergreen, approved classifications should be continuously fed back into the training dataset. The accuracy of the model is highly dependent on the quality and quantity of the training data provided.
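A minimal sketch of that feedback loop, assuming approved classifications arrive as (text, protection group) pairs and the training dataset lives in a CSV file; the file name and record layout are hypothetical.

```python
import csv
from pathlib import Path

TRAINING_FILE = Path("training_data.csv")  # hypothetical location of the training dataset

def append_approved(approved):
    """Append human-approved (text, protection_group) pairs to the training dataset."""
    write_header = not TRAINING_FILE.exists()
    with TRAINING_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["text", "protection_group"])
        writer.writerows(approved)

# After each review cycle, feed approved results back in and retrain the model
append_approved([("member ssn nbr", "Restricted"), ("press release title", "Public")])
```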
Process Workflow:
Benefits
Consistent data classification allowed for more effective operations while also lowering the cost of maintaining adequate data protection.
Understanding the data's sensitivity helped determine who should and shouldn't have access to it, both inside and outside the company.
By classifying data, the company was better prepared to assess the risk and impact of an incident based on the type of data involved.