
  • Implementation of the LSH algorithm using PL/pgSQL

    Our lives are permeated by data, with endless streams of information passing through computer systems, and modern software is impossible to imagine without database interaction. Many different DBMSs exist, chosen according to how the information will be used. The article discusses an implementation of the locality-sensitive hashing (LSH) algorithm in the PL/pgSQL language, which makes it possible to search for similar documents in a database.

    Keywords: LSH, hashing, field, string, text data, query, software, SQL
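    The article's PL/pgSQL code is not reproduced in the abstract; as a rough sketch of the LSH technique it describes, the following Python model builds MinHash signatures over character shingles and bands them so that similar documents land in a shared candidate bucket. The shingle size, hash count, and band count here are illustrative choices, not values taken from the article.

    ```python
    import hashlib
    from collections import defaultdict

    def shingles(text, k=3):
        """Overlapping k-character shingles of a document."""
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def minhash_signature(sh, num_hashes=100):
        """For each seeded hash function, keep the minimum hash value
        over all shingles; similar sets get similar signatures."""
        return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                    for s in sh)
                for seed in range(num_hashes)]

    def lsh_buckets(docs, num_hashes=100, bands=25):
        """Split each signature into bands; documents sharing any whole
        band are near-duplicate candidates."""
        rows = num_hashes // bands
        buckets = defaultdict(set)
        for doc_id, text in docs.items():
            sig = minhash_signature(shingles(text), num_hashes)
            for b in range(bands):
                key = (b, tuple(sig[b * rows:(b + 1) * rows]))
                buckets[key].add(doc_id)
        return {k: v for k, v in buckets.items() if len(v) > 1}
    ```

    With many short bands, documents whose shingle sets overlap heavily almost certainly collide in at least one band, while unrelated documents almost never do, so expensive pairwise comparison is needed only for the candidates.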

  • Optimizing the database-based deduplication process

    The present day is impossible to imagine without software, and huge flows of information pass through computing systems. Unstructured, endlessly arriving data cannot be processed as-is, so specific tasks must be identified and the information prepared for processing. One such preparatory step is deduplication. This article discusses possible optimizations of a database-based method for removing duplicates.

    Keywords: deduplication, database, field, string, text data, query, software, unstructured data
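    As a minimal sketch of one such optimization (assuming nothing about the article's actual schema or DBMS), the example below uses SQLite with a unique index on a content hash so that duplicates are rejected at insert time, rather than loaded and then removed by a separate full-table scan. The table and column names are hypothetical.

    ```python
    import hashlib
    import sqlite3

    def dedup_insert(conn, rows):
        """Bulk-insert rows, silently skipping any whose content hash is
        already present; the UNIQUE constraint does the duplicate check."""
        conn.execute("""CREATE TABLE IF NOT EXISTS docs (
            id INTEGER PRIMARY KEY,
            body TEXT NOT NULL,
            body_hash TEXT NOT NULL UNIQUE)""")
        with conn:  # one transaction for the whole batch
            conn.executemany(
                "INSERT OR IGNORE INTO docs (body, body_hash) VALUES (?, ?)",
                [(r, hashlib.sha256(r.encode()).hexdigest()) for r in rows])
        return conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
    ```

    Hashing the content keeps the unique index small even when the deduplicated values are long texts; the trade-off is the (negligible for SHA-256) chance of hash collisions.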

  • Applying DIANA hierarchical clustering to improve text classification quality

    The article presents ways to improve the accuracy of classifying normative reference information using hierarchical clustering algorithms.

    Keywords: machine learning, artificial neural network, convolutional neural network, normative reference information, hierarchical clustering, DIANA
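    The article's pipeline is not detailed in the abstract; as a small sketch of the DIANA idea it builds on, the following divisive (top-down) clustering repeatedly splits the cluster of largest diameter: the most dissimilar point seeds a splinter group, and points defect to it while their average distance to the splinter group is smaller than to the points left behind. Euclidean distance and the stopping rule are illustrative simplifications.

    ```python
    import math

    def diana(points, k):
        """Top-down clustering in the spirit of DIANA: split the widest
        cluster into a splinter group and a remainder until k clusters."""
        def avg(p, group):
            return sum(math.dist(p, q) for q in group) / len(group)

        clusters = [list(points)]
        while len(clusters) < k:
            # split the cluster with the largest diameter
            c = max((c for c in clusters if len(c) > 1),
                    key=lambda c: max(math.dist(a, b) for a in c for b in c))
            clusters.remove(c)
            # seed: the point with the highest average distance to the rest
            seed = max(c, key=lambda p: avg(p, [q for q in c if q != p]))
            old, splinter = [p for p in c if p != seed], [seed]
            while len(old) > 1:
                # gain of defecting: distance to old group minus to splinter
                gain = {p: avg(p, [q for q in old if q != p]) - avg(p, splinter)
                        for p in old}
                p = max(gain, key=gain.get)
                if gain[p] <= 0:
                    break
                old.remove(p)
                splinter.append(p)
            clusters += [old, splinter]
        return clusters
    ```

    Unlike agglomerative methods that start from singletons and merge, this divisive scheme starts from one all-inclusive cluster, which can recover large-scale structure earlier.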

  • Large data deduplication using databases

    Today, a huge amount of heterogeneous information passes through electronic computing systems. There is a critical need to analyze an endless stream of data with limited means, which in turn requires structuring the information. One step toward ordering the data is deduplication. This article discusses a database-based method of removing duplicates and analyzes test results for several database management systems under different parameter sets.

    Keywords: deduplication, database, field, row, text data, artificial neural network, sets, query, software, unstructured data
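    The abstract does not show the method itself; a plausible minimal model of database-based duplicate removal is a single grouped DELETE that keeps one representative row per duplicate group, sketched here with SQLite and a hypothetical schema.

    ```python
    import sqlite3

    def load(conn, rows):
        """Create a minimal documents table and bulk-insert raw rows,
        duplicates included (hypothetical schema for illustration)."""
        conn.execute(
            "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body TEXT)")
        with conn:
            conn.executemany("INSERT INTO docs (body) VALUES (?)",
                             [(r,) for r in rows])

    def remove_duplicates(conn):
        """Delete every row whose body already appeared under a smaller
        id, i.e. keep the first occurrence per duplicate group."""
        with conn:
            cur = conn.execute("""
                DELETE FROM docs WHERE id NOT IN
                    (SELECT MIN(id) FROM docs GROUP BY body)""")
        return cur.rowcount
    ```

    Pushing the work into one SQL statement lets the DBMS use its own grouping and indexing machinery instead of comparing rows in application code, which is the kind of trade-off the article's tests across DBMSs and parameter sets would measure.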

  • Using segment tree in PostgreSQL

    The article considers an approach to speeding up aggregate queries over a contiguous range of rows in a PostgreSQL table. A program module built on PostgreSQL Extensions is created that constructs a segment tree for a table and answers queries against it. Query speed is increased by more than 80 times for a table of 100 million records compared to existing solutions.

    Keywords: PostgreSQL, segment tree, query, aggregation, optimization, PostgreSQL Extensions, asymptotics, index, build, get, insert
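    The extension itself targets PostgreSQL internals and is not shown in the abstract; as a language-agnostic model of the build/get/insert operations named in the keywords, here is an iterative array-backed segment tree for range sums (the sum aggregate and all naming are illustrative assumptions).

    ```python
    class SegmentTree:
        """Range-sum segment tree over a fixed-length array: O(n) build,
        O(log n) point update (insert) and range query (get)."""

        def __init__(self, values):
            self.n = len(values)
            self.tree = [0] * (2 * self.n)
            self.tree[self.n:] = values            # leaves hold the values
            for i in range(self.n - 1, 0, -1):     # parents hold child sums
                self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

        def insert(self, i, value):
            """Set element i and repair all ancestors up to the root."""
            i += self.n
            self.tree[i] = value
            while i > 1:
                i //= 2
                self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

        def get(self, lo, hi):
            """Sum of values[lo:hi], visiting O(log n) nodes instead of
            scanning the whole range as a plain aggregate query would."""
            s, lo, hi = 0, lo + self.n, hi + self.n
            while lo < hi:
                if lo & 1:
                    s += self.tree[lo]
                    lo += 1
                if hi & 1:
                    hi -= 1
                    s += self.tree[hi]
                lo //= 2
                hi //= 2
            return s
    ```

    This logarithmic query cost is what makes the reported 80-fold speedup on a 100-million-row table plausible: a plain range aggregate scans every row in the range, while the tree touches only a handful of precomputed partial sums.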