Text information extraction from images of modified text

Abstract

Misyukov G.I.

Incoming article date: 17.06.2023

This article describes the development of a module that extracts text from images of modified text. Such images can be used to bypass existing information security software and leak sensitive information out of a company. The module is written in Python and relies on additional libraries that extend the language's basic functionality. After the extraction module was built, a second module was created that lets the user produce modified text themselves. This second module uses a special dictionary that can replace any letter with an alternative symbol, generating further modified texts in order to test the extraction module and find its weak spots. DLP systems were chosen for integrating the module into a company's information infrastructure because of their popularity and the ease of the integration method. The DLP system and the text extraction module are connected through a mail server that sends BCC copies of mail traffic, including text and images, to the module's local mail server. Additional mechanisms extract the images and process them within the module, which then sends back each image together with the text extracted from it. Several rounds of testing resulted in nearly 97% accuracy. Future work includes extending the module to multi-line text and adding new alternative symbols after their first appearance in a text, using either a CNN or the standard deviation of image pixel values together with pixel-by-pixel comparison.
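
The dictionary-based substitution described above can be illustrated with a minimal Python sketch. It is not the author's implementation: the dictionary contents, the names `ALTERNATIVES`, `modify_text` and `normalize_text`, and the `rate` and `seed` parameters are all assumptions made for illustration.

```python
import random

# Hypothetical substitution dictionary: each letter maps to visually
# similar alternative symbols. The entries are illustrative only.
ALTERNATIVES = {
    "a": ["@", "4"],
    "e": ["3"],
    "i": ["1", "!"],
    "o": ["0"],
    "s": ["$", "5"],
}

# Reverse lookup used to map alternative symbols back to plain letters.
REVERSE = {alt: letter for letter, alts in ALTERNATIVES.items() for alt in alts}


def modify_text(text, rate=1.0, seed=None):
    """Replace letters found in the dictionary with a random alternative.

    `rate` is the probability of replacing each eligible letter;
    `seed` makes the choice of alternatives reproducible.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        alts = ALTERNATIVES.get(ch.lower())
        if alts and rng.random() < rate:
            out.append(rng.choice(alts))
        else:
            out.append(ch)
    return "".join(out)


def normalize_text(text):
    """Map every known alternative symbol back to its (lowercase) letter."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sample = modify_text("password is secret", seed=1)
    print(sample)
    print(normalize_text(sample))
```

This sketch works on character strings only. In the setting of the article, the modified text first has to be recognized from an image. The reverse mapping is also lossy: it drops letter case and would rewrite digits or punctuation that legitimately belong to the text, so a real module would need context to decide when a symbol is a substitution.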

Keywords: information security, data leakage, text analysis, image analysis, modified data analysis, protection against steganography