Informatics and Applications
2016, Volume 10, Issue 4, pp 89-95
GENERALIZED STATISTICAL METHOD OF TEXT ANALYSIS BASED ON CALCULATION OF PROBABILITY DISTRIBUTIONS OF STATISTICAL VALUES
- A. K. Melnikov
- A. F. Ronzhin
Abstract
Many data streams are a mixture of random and unique data. One property of unique data is that
its values are not uniformly distributed over the set of possible values. Unique data are
distinguished by a two-step procedure. In the first step, candidate selection, a goodness-of-fit
test against the uniform distribution is applied. In the second step, a resource-intensive
calculation under uncertainty checks other unique attributes of the candidates. The size
(significance level) of the test is chosen according to the amount of resources allotted to the
second step. The accuracy with which the distribution of the test statistic is calculated
determines how much second-step overhead is spent on random data and, therefore, what fraction of
unique data is lost. The paper analyzes the boundary parameter values for which the exact
distribution can be calculated at the current level of computer technology. A generalized
statistical method of text analysis, applicable to a wide spectrum of text parameters, is developed.
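The first step described above can be illustrated with a minimal sketch. It assumes that the statistic is Pearson's chi-squared for equiprobable symbols and that the exact distribution is obtained by brute-force enumeration of frequency vectors; the abstract does not state which statistic or algorithm the authors use, and the names `partitions`, `exact_chi2_distribution`, `exact_p_value`, and `is_candidate` are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from math import factorial


def partitions(n, k, max_part=None):
    """Yield non-increasing count vectors of length k that sum to n."""
    if max_part is None:
        max_part = n
    if k == 0:
        if n == 0:
            yield ()
        return
    for first in range(min(n, max_part), -1, -1):
        # k parts, none larger than `first`, cannot reach n: stop early.
        if first * k < n:
            break
        for rest in partitions(n - first, k - 1, first):
            yield (first,) + rest


def chi2_statistic(counts):
    """Pearson statistic for equiprobable categories: (k/n) * sum(c^2) - n."""
    n, k = sum(counts), len(counts)
    return Fraction(k * sum(c * c for c in counts), n) - n


def exact_chi2_distribution(n, k):
    """Return {statistic value: exact probability} under the uniform hypothesis."""
    dist = defaultdict(Fraction)
    total = Fraction(k) ** n  # number of equally likely symbol sequences
    for counts in partitions(n, k):
        # Sequences of length n that produce one ordered count vector.
        seqs = factorial(n)
        for c in counts:
            seqs //= factorial(c)
        # Distinct orderings of this count vector over the k symbols.
        arrangements = factorial(k)
        for multiplicity in Counter(counts).values():
            arrangements //= factorial(multiplicity)
        dist[chi2_statistic(counts)] += Fraction(seqs * arrangements) / total
    return dict(dist)


def exact_p_value(dist, observed):
    """Exact probability that the statistic is at least `observed`."""
    return sum(p for stat, p in dist.items() if stat >= observed)


def is_candidate(counts, alpha=Fraction(1, 20)):
    """Step 1 (assumed form): select a fragment if uniformity is rejected at size alpha."""
    dist = exact_chi2_distribution(sum(counts), len(counts))
    return exact_p_value(dist, chi2_statistic(counts)) <= alpha
```

For example, `is_candidate([6, 0, 0])` selects a six-symbol fragment over a three-letter alphabet in which only one symbol occurs, whereas `is_candidate([2, 2, 2])` does not. The number of frequency vectors to enumerate grows quickly with the sample size and the alphabet size, which is consistent with the abstract's remark that the exact distribution is computable only up to certain boundary parameter values.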
About this article
Cover Date
2016-12-30
DOI
10.14357/19922264160409
Print ISSN
1992-2264
Publisher
Institute of Informatics Problems, Russian Academy of Sciences
Key words
probability; exact distribution; limit distribution; statistics; criterion; frequency; algorithm complexity;
performance of multiprocessor computer system; analysis method
Authors
A. K. Melnikov and A. F. Ronzhin
Author Affiliations
STC CLSC "InformInvestGroup;" 125, Bld. 17 Varshavskoye Shosse, Moscow 117587, Russian Federation
S. A. Lebedev Institute of Precision Mechanics and Computer Engineering of the Russian Academy of Sciences,
51 Leninsky Prosp., Moscow 119991, Russian Federation