  • Personal data recognition in unstructured texts using neural networks

    This paper describes the development of a hybrid system for recognizing various types of personal data in unstructured Russian-language texts. The system combines a neural network with regular expressions. Regular expressions detect structured entities such as telephone and passport numbers, while the neural network detects named entities, including persons, locations, and organizations. For training and validation, a specialized Russian-language dataset for named entity recognition was created from the labeled Nerus and WiNER datasets. The proposed neural model uses contextualized ELMo embeddings and includes bidirectional LSTM layers with a conditional random field layer (ELMo-BiLSTM-CRF). The performance of the resulting model was analyzed on the validation set, including its accuracy on individual classes. Four metrics were used in the evaluation: precision, recall, F1-score, and macro-F1. A confusion matrix was built for a more detailed analysis. The resulting hybrid model can be used to reduce the cost of storing and processing textual data and to help preserve user privacy in the event of a leak.

    Keywords: personal data, natural language processing, named entity recognition, conditional random field, neural network, recurrent neural network, regular expression
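
The regular-expression half of such a hybrid system can be illustrated with a minimal Python sketch. The abstract does not give the actual expressions, so the two patterns below (a Russian-style phone number starting with +7 or 8, and a passport written as a four-digit series followed by a six-digit number) are assumptions chosen for illustration, as are the names `PATTERNS`, `find_structured_entities`, and `mask`. The neural ELMo-BiLSTM-CRF part is not shown.

```python
import re

# Hypothetical patterns for illustration only; the paper's real expressions
# are not given in the abstract.
PATTERNS = {
    # +7 or 8, then 10 digits with optional spaces, hyphens, or parentheses.
    "PHONE": re.compile(
        r"(?<!\d)(?:\+7|8)[\s\-]?\(?\d{3}\)?[\s\-]?\d{3}[\s\-]?\d{2}[\s\-]?\d{2}(?!\d)"
    ),
    # Four-digit series (optionally split as 2+2), whitespace, six-digit number.
    "PASSPORT": re.compile(r"(?<!\d)\d{2}\s?\d{2}\s\d{6}(?!\d)"),
}


def find_structured_entities(text):
    """Return non-overlapping (start, end, label, value) spans, left to right."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    spans.sort()

    # Keep the earliest span and drop any later span that overlaps it.
    kept = []
    for span in spans:
        if not kept or span[0] >= kept[-1][1]:
            kept.append(span)
    return kept


def mask(text):
    """Replace every detected span with a placeholder such as [PHONE]."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label, _ in reversed(find_structured_entities(text)):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


if __name__ == "__main__":
    sample = "Call +7 900 000-00-00, passport 4510 123456."
    print(find_structured_entities(sample))
    print(mask(sample))
```

In a full pipeline, spans produced this way would presumably be merged with the person, location, and organization spans predicted by the neural model before masking; how the paper combines the two outputs is not stated in the abstract.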