APPLICATION OF CLUSTER ANALYSIS MECHANISMS FOR THE EXPLORATION OF CONTAINER TRANSPORTATIONS FUNCTIONING AT SELECTED RANGES OF THE TRANS-SIBERIAN RAILWAY

Data mining tools (DMT) are among the most powerful means of extracting previously unknown knowledge from databases of various kinds, including large ones. Data mining, also called knowledge discovery in data, significantly expands the range of practical management tasks that can be solved with computers. New knowledge is discovered with a wide range of tools, among which cluster analysis occupies an important place. The task of cluster analysis is to identify natural local concentrations of objects, each of which is described by a set of variables or characteristics. In the course of cluster analysis, the set of objects under investigation, represented by multidimensional data, is divided into groups of objects that are similar in a certain sense; these groups are called clusters. Grouping by similarity underlies much intellectual activity and is a fundamental process in science: facts and phenomena must be ordered or grouped according to their similarity, i.e. classified, before general principles can be developed that explain their behavior and interrelations. The result of cluster analysis is both the identification of the clusters themselves and the assignment of each object to one of them. The results of cluster analysis often serve as the starting point for further data mining. This further analysis seeks to establish what the revealed clustering represents and what causes it, which object is a typical "representative" of each cluster, which "representatives" should be used to solve various problems, and so on. In our case, cluster analysis makes it possible to reveal general patterns in the functioning of certain polygons of the Trans-Siberian Railway with respect to some of their attributes, and thus to manage more efficiently both the safety systems of individual polygons and their functioning as a whole.


INTRODUCTION
Data mining tools (DMT) are among the most powerful means of extracting previously unknown knowledge from databases of various kinds, including large ones.
Data mining, also called knowledge discovery in data, significantly expands the range of practical management tasks that can be solved with computers.
New knowledge is discovered with a wide range of tools, among which cluster analysis occupies an important place.
The task of cluster analysis is to identify natural local concentrations of objects, each of which is described by a set of variables or characteristics. In the course of cluster analysis, the set of objects under investigation, represented by multidimensional data, is divided into groups of objects that are similar in a certain sense; these groups are called clusters.
Grouping by similarity underlies much intellectual activity and is a fundamental process in science. Facts and phenomena must be ordered or grouped according to their similarity, i.e. classified, before general principles can be developed that explain their behavior and interrelations.
The result of cluster analysis is both the identification of the clusters themselves and the assignment of each object to one of them. The results of cluster analysis often serve as the starting point for further data mining.
This further analysis seeks to establish what the revealed clustering represents and what causes it, which object is a typical "representative" of each cluster, which "representatives" should be used to solve various problems, and so on.
In our case, cluster analysis makes it possible to reveal general patterns in the functioning of certain polygons of the Trans-Siberian Railway with respect to some of their attributes, and thus to manage more efficiently both the safety systems of individual polygons and their functioning as a whole.

FORMALIZATION OF THE CLUSTERING PROBLEM
Clustering groups objects, and an object may be anything, including an observation or an event.
The state of an object under study can be described by a vector of descriptors, i.e. a multidimensional set of attributes recorded for it:
$$X_i = (x_{i1}, x_{i2}, \dots, x_{ip}) \tag{1}$$
Then $X_i$ is the result of measuring these $p$ attributes on the i-th object. Some of the attributes can be quantitative and take any real values. The others are qualitative and allow objects to be ordered by the degree to which some quality is manifested (for example, a binary feature that reflects the presence or absence of a property).
Any multidimensional observation can be interpreted geometrically as a point in a p-dimensional space. It is natural to assume that the geometrical proximity of two or more points in this space means that these points belong to the same cluster.
To solve the clustering problem algorithmically, the concepts of similarity and heterogeneity of objects must be quantified. The objects $X_i$ and $X_j$ are then assigned to the same cluster when the distance between them is sufficiently small, and to different clusters when it is large enough. Thus, to determine the "similarity" of objects, a measure of proximity or distance between objects must be introduced.
There are different ways of calculating distances. The most commonly used is the Euclidean metric, which corresponds to the intuitive notion of distance:
$$d_E(X_i, X_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} \tag{2}$$
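As an illustration of the Euclidean metric, the following minimal sketch computes pairwise distances for a few objects. The function names and the sample points are assumptions introduced here for illustration and are not data from the study.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(objects):
    """Symmetric matrix whose entry [i][j] is the distance between objects i and j."""
    n = len(objects)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = r[j][i] = euclidean(objects[i], objects[j])
    return r

# Hypothetical objects with p = 2 attributes each (illustrative values only).
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
R = distance_matrix(points)
```

Here `R[0][1]` is 5.0 and `R[0][2]` is 10.0, so the first two points are the closer pair.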

Fig.1. Example of clustering
The Hamming distance is used as a measure of the difference between objects described by dichotomous (binary) attributes. This measure is the number of mismatches between the values of the corresponding attributes of the i-th and j-th objects under consideration:
$$d_H(X_i, X_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}| \tag{3}$$
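A minimal sketch of this mismatch count is given below. The function name and the two binary vectors are hypothetical and serve only as an example.

```python
def hamming(x, y):
    """Number of positions at which two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    return sum(1 for a, b in zip(x, y) if a != b)

# Hypothetical binary attribute vectors (1 = property present, 0 = absent).
a = (1, 0, 1, 1, 0)
b = (1, 1, 0, 1, 0)
d = hamming(a, b)  # the vectors differ in the second and third attributes
```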

CLUSTERING ALGORITHM
Let the results of measurements of n objects be represented as a data matrix $X = (x_{ik})$ of size n × p, in which the rows correspond to objects and the columns to attributes. (5)
Then the closeness between pairs of objects can be represented by a symmetric n × n distance matrix $R = (r_{ij})$, where $r_{ij} = d(X_i, X_j)$, $r_{ij} = r_{ji}$ and $r_{ii} = 0$.
The general algorithm of cluster analysis, using a sub-algorithm for constructing a minimal spanning tree, contains the following main steps:
Step 0. [Initialization] The distance (proximity) matrix R is constructed from the measurement results of n objects represented by a data matrix of size n × p.
Step 1. [Construction of the minimal spanning tree] Using the matrix R, a minimal spanning tree T is constructed. Either Kruskal's or Prim's algorithm can be used for this purpose.
Step 2. [Grouping objects into clusters] The vertices of the minimal spanning tree, i.e. the objects, are grouped into clusters.
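The steps above can be sketched as follows. This is a minimal illustration, not the author's program: it assumes Kruskal's algorithm for Step 1, and for Step 2 it assumes that k clusters are obtained by contracting the n - k lightest edges of the tree. The distance matrix and all names are hypothetical.

```python
def find(parent, v):
    """Root of the set containing v, with path halving."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v

def kruskal_mst(r):
    """Step 1: edges (weight, i, j) of a minimal spanning tree of matrix r."""
    n = len(r)
    parent = list(range(n))
    edges = sorted((r[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:  # the edge joins two different components, so no cycle
            parent[ri] = rj
            tree.append((w, i, j))
    return tree

def clusters_from_mst(n, tree, k):
    """Step 2: contract the n - k lightest tree edges, leaving k clusters."""
    parent = list(range(n))
    for w, i, j in sorted(tree)[: n - k]:
        parent[find(parent, i)] = find(parent, j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(parent, v), []).append(v)
    return sorted(groups.values())

# Step 0: hypothetical symmetric distance matrix for n = 5 objects.
R = [
    [0, 2, 9, 8, 7],
    [2, 0, 8, 9, 6],
    [9, 8, 0, 1, 5],
    [8, 9, 1, 0, 4],
    [7, 6, 5, 4, 0],
]
tree = kruskal_mst(R)
two_clusters = clusters_from_mst(len(R), tree, 2)
```

For this matrix the tree has total weight 13, and asking for two clusters separates objects 0 and 1 from objects 2, 3 and 4.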

Fig.2. Sequence of grouping objects into clusters
The order in which objects are combined into clusters can be specified using a parenthesized (bracket) notation. For the example under consideration, this bracketed entry has the following form:
The most convenient and common way of describing the results of hierarchical clustering is the dendrogram:

Fig.3. Example of a dendrogram
The dendrogram has a special tree structure consisting of layers of vertices, each of which represents one cluster. Each layer of vertices is characterized by its level of proximity. The location of an arbitrary cluster vertex relative to the layers of the dendrogram is determined by its level of proximity, which is measured by the weight of the last edge contracted in the formation of this cluster.
The formation of the dendrogram begins with a layer at zero level of proximity, in which each of the original objects is placed in a separate cluster. The lines connecting the vertices form clusters that are nested one inside another.
In general, the dendrogram reflects the nesting order of clusters, in which the number of clusters is successively reduced until a single cluster is formed that combines all the source objects.
A cut of the dendrogram, determined by a proximity threshold Δ, is used to obtain a given number of clusters. For this purpose, the proximity threshold Δ is gradually decreased from the maximum possible value towards zero. As Δ decreases, the dendrogram decomposes first into two clusters, then into three, and so on, until the required number of clusters is reached.
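The cut can be sketched as follows, under the assumption that the clusters at threshold Δ are the groups of objects still connected by spanning-tree edges of weight at most Δ. The tree edges and function names below are hypothetical inputs chosen for illustration.

```python
def cut(n, tree, delta):
    """Clusters formed by the tree edges (weight, i, j) with weight <= delta."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for w, i, j in tree:
        if w <= delta:
            parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

def cut_to_k(n, tree, k):
    """Lower delta through the merge levels until at least k clusters appear."""
    levels = sorted({w for w, _, _ in tree}, reverse=True)
    for delta in levels + [0]:
        clusters = cut(n, tree, delta)
        if len(clusters) >= k:
            return delta, clusters
    return 0, cut(n, tree, 0)

# Hypothetical minimal spanning tree on 5 objects: (weight, i, j).
tree = [(1, 2, 3), (2, 0, 1), (4, 3, 4), (6, 1, 4)]
delta, clusters = cut_to_k(5, tree, 2)
```

With these edges, lowering Δ from 6 to 4 removes the heaviest edge and yields two clusters. If several edges share the same weight, one step may produce more clusters than requested, which is why the sketch accepts "at least k".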

MECHANISMS FOR CONSTRUCTING A MINIMAL SPANNING TREE
The operation of the algorithms for constructing a minimal spanning tree T is considered using a concrete example of the distance matrix R.
Suppose we are given a symmetric distance matrix R, which can be associated with a weighted, fully connected network G with n = 5 vertices and m = 10 edges:
Then the minimal spanning tree T of the network G is the cheapest subnetwork, i.e. a subnetwork of minimal weight that covers all vertices of the network G and contains no cycles. Obviously, such a subnetwork is a tree.

Fig.4. Example of constructing a minimal spanning tree
To construct a minimal spanning tree T in a weighted, connected and complete network G with n vertices and m edges, a number of algorithms can be used, of which Kruskal's and Prim's algorithms are the best known.
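As an illustration, the sketch below applies a simple O(n²) form of Prim's algorithm to a complete network given by its distance matrix. The matrix values are hypothetical and are not the matrix used in the paper's example.

```python
import math

def prim_mst(r, start=0):
    """Edges (u, v, weight) of a minimal spanning tree of the complete
    network described by the symmetric distance matrix r."""
    n = len(r)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each vertex to the tree
    link = [-1] * n        # tree endpoint of that cheapest edge
    best[start] = 0
    tree = []
    for _ in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        v = min((u for u in range(n) if not in_tree[u]), key=lambda u: best[u])
        in_tree[v] = True
        if link[v] != -1:
            tree.append((link[v], v, r[link[v]][v]))
        # Attaching v may offer cheaper connections to the remaining vertices.
        for u in range(n):
            if not in_tree[u] and r[v][u] < best[u]:
                best[u] = r[v][u]
                link[u] = v
    return tree

# Hypothetical distance matrix for n = 5 vertices (m = 10 edges).
R = [
    [0, 2, 9, 8, 7],
    [2, 0, 8, 9, 6],
    [9, 8, 0, 1, 5],
    [8, 9, 1, 0, 4],
    [7, 6, 5, 4, 0],
]
T = prim_mst(R)
total_weight = sum(w for _, _, w in T)
```

The tree contains n - 1 = 4 edges. Kruskal's algorithm applied to the same matrix gives a tree of the same total weight, although in general the two algorithms may select different edges when weights are tied.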

CHARACTERISTICS OF THE TRANS-SIBERIAN RAILWAY POLYGONS
The Trans-Siberian Railway runs across Eurasia, connecting Moscow with the largest East Siberian and Far Eastern industrial cities of Russia. The length of the main line is 9288.2 km, which makes it the longest railway in the world. The highest point of the route is the Yablonovy Pass (1019 m above sea level).
Historically, the Trans-Siberian is only the eastern part of the main line, from Miass (Chelyabinsk region) to Vladivostok. Its length is about 7 thousand km. This part was built from 1891 to 1916.
The construction of the Trans-Siberian Railway created, by 1905, an opportunity that had not existed before: for the first time in history, trains could travel entirely by rail, without ferry crossings, from the shores of the Atlantic Ocean (from Western Europe) to the shores of the Pacific Ocean (to Vladivostok).
The Transsib connects the European part of Russia, the Urals, Siberia and the Far East, and links Russian western, northern and southern ports and railway exits to Europe, on the one hand, with Pacific ports and railway exits to Asia, on the other.
The starting point is the station Moscow-Passenger-Yaroslavl. The terminal station is Vladivostok. Throughput: 100 million tons of cargo per year.

AUTOBUSY 6/2018
For this we take the characteristics of specific sections of the Trans-Siberian Railway:
To illustrate the possibility of using the Kruskal and Prim algorithms for cluster analysis of the operation of the Trans-Siberian Railway polygons, the author wrote a program that implements this algorithm in the Delphi 10 Seattle development environment.
The results produced by the clustering program are shown in the screenshot below.

CONCLUSION
In conclusion, cluster analysis is an effective method for automating the grouping of individual polygons or stations of the Trans-Siberian Railway. It provides management with visual models that can be used to improve service or to upgrade existing railway lines.

There are other, more abstract measures of proximity. If the investigated characteristics are mixed (quantitative and qualitative), then all the values $x_{ik}$ of each quantitative characteristic $x_k$ must be normalized (4), which leads to a common Euclidean proximity measure. When developing models and methods of clustering, it is usually assumed that objects within one cluster should be close to each other and far from objects that belong to other clusters. The quality of a clustering is determined by how close the objects of one cluster are to each other and how far apart objects belonging to different clusters are.
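One simple way to examine this criterion is to compare the mean distance between objects in the same cluster with the mean distance between objects in different clusters. The sketch below is only an assumption about how such a check could be made. The function name, the distance matrix and the cluster assignment are all hypothetical.

```python
def mean_intra_inter(r, clusters):
    """Mean within-cluster and mean between-cluster pairwise distances."""
    label = {v: c for c, members in enumerate(clusters) for v in members}
    intra, inter = [], []
    n = len(r)
    for i in range(n):
        for j in range(i + 1, n):
            (intra if label[i] == label[j] else inter).append(r[i][j])

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)

# Hypothetical distance matrix and a two-cluster partition of its 5 objects.
R = [
    [0, 2, 9, 8, 7],
    [2, 0, 8, 9, 6],
    [9, 8, 0, 1, 5],
    [8, 9, 1, 0, 4],
    [7, 6, 5, 4, 0],
]
intra, inter = mean_intra_inter(R, [[0, 1], [2, 3, 4]])
```

A partition in which the within-cluster mean is clearly smaller than the between-cluster mean matches the assumption stated above. Here the two values are 3.0 and about 7.83.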