Similarity distance based approach for outlier detection by matrix calculation



Ou Ye, Xi’an University of Science and Technology, Xi’an, China

Zhanli Li, Xi’an University of Science and Technology, Xi’an, China


Purpose. String outliers in client information need to be detected and cleaned. Many current outlier detection algorithms consider only the semantics of the data and ignore its structure, which makes accurate detection difficult. To address this issue, the paper proposes an outlier detection method based on similarity distance.

Methodology. We formulated a similarity calculation model for string data that combines semantic and structural factors. Following outlier detection theory in data cleansing, one-dimensional string data were projected into a two-dimensional space, where string outliers were detected using a new similarity measurement mechanism.
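The abstract does not give the formulas of the similarity model, so the following is only a minimal sketch of how a semantic factor and a structural factor could be computed for a pair of strings. The function names, the cosine measure over word counts, and the "shape" encoding (letters become `A`, digits become `9`) are all illustrative assumptions, not the authors' definitions.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings (assumed measure)."""
    fa, fb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(fa[w] * fb[w] for w in fa)
    norm_a = math.sqrt(sum(v * v for v in fa.values()))
    norm_b = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shape(s: str) -> str:
    """Reduce a string to its layout: letters -> 'A', digits -> '9', other characters kept."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in s)


def structure_similarity(a: str, b: str) -> float:
    """Similarity of the two layouts; a stand-in for the paper's structure factor."""
    return SequenceMatcher(None, shape(a), shape(b)).ratio()


def similarity_pair(a: str, b: str) -> tuple:
    """Return (semantic, structure) as coordinates in a two-dimensional similarity space."""
    return (semantic_similarity(a, b), structure_similarity(a, b))
```

Returning the two factors as a pair, rather than merging them into one score, keeps them available as the two coordinates of the space in which the distance is later measured. For instance, `similarity_pair("AB-1234", "CD-5678")` shares no words but has an identical layout, so the two factors disagree strongly.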

Findings. We first obtained the word frequencies of the string data by matrix calculation, then calculated the semantic similarity and the structural similarity from those word frequencies. After mapping the string data from one-dimensional to two-dimensional space, we identified the outliers using the similarity distance.
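The pipeline summarized above (frequency matrix, two similarities, projection to two dimensions, distance-based detection) can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the position-tagged shape tokens used for structure, the mean cosine similarity taken from the Gram matrix of the frequency matrix, and the neighbour-count rule with hypothetical parameters `radius` and `min_neighbors`. The paper's actual similarity distance may be defined differently.

```python
import math


def word_tokens(s):
    return s.lower().split()


def shape_tokens(s):
    """Position-tagged layout of each word, e.g. '12 Main' -> ['0:9', '1:A'] (assumed encoding)."""
    tokens = []
    for pos, word in enumerate(s.split()):
        classes = ["A" if c.isalpha() else "9" if c.isdigit() else c for c in word]
        # Collapse runs of the same class so '1234' and '12' share one shape.
        collapsed = "".join(c for k, c in enumerate(classes) if k == 0 or c != classes[k - 1])
        tokens.append(f"{pos}:{collapsed}")
    return tokens


def frequency_matrix(token_lists):
    """Rows are strings, columns are vocabulary terms, entries are term counts."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    col = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in token_lists]
    for i, tokens in enumerate(token_lists):
        for t in tokens:
            matrix[i][col[t]] += 1
    return matrix


def mean_similarity(matrix):
    """Mean cosine similarity of each row to every other row, via the Gram matrix F * F^T."""
    n = len(matrix)
    gram = [[sum(x * y for x, y in zip(r, s)) for s in matrix] for r in matrix]
    norms = [math.sqrt(gram[i][i]) for i in range(n)]
    result = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j != i and norms[i] > 0 and norms[j] > 0:
                total += gram[i][j] / (norms[i] * norms[j])
        result.append(total / (n - 1) if n > 1 else 1.0)
    return result


def project(strings):
    """Map each string to a 2-D point (semantic similarity, structural similarity)."""
    semantic = mean_similarity(frequency_matrix([word_tokens(s) for s in strings]))
    structure = mean_similarity(frequency_matrix([shape_tokens(s) for s in strings]))
    return list(zip(semantic, structure))


def find_outliers(strings, radius=0.25, min_neighbors=1):
    """Flag strings whose 2-D point has fewer than min_neighbors other points within radius."""
    points = project(strings)
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        if neighbors < min_neighbors:
            outliers.append(strings[i])
    return outliers


records = ["12 Main Street", "45 Main Street", "78 Oak Street", "9 Main Street", "call back later"]
print(find_outliers(records))
```

In this toy address column, the free-text entry differs from the others in both vocabulary and layout, so its point lies apart from the cluster formed by the address-like records. The radius and neighbour count would need tuning on real client data.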

Originality. We studied string outlier detection in data cleansing. First, we formulated a similarity calculation model that considers both the semantic factor and the structure factor. Second, we constructed a similarity cell into which the string data are projected, and measured the similarity distance within that cell.

Practical value. The method can be used to clean outlier string data in the client information of any enterprise, which helps ensure the quality of client information and reduces data maintenance costs. Extensive simulation experiments were conducted to demonstrate the feasibility and rationality of the method. The results showed that it improves the accuracy of string outlier detection.






Registration data

ISSN (print) 2071-2227,
ISSN (online) 2223-2362.
Journal was registered by Ministry of Justice of Ukraine.
Registration number КВ No.17742-6592PR dated April 27, 2011.


D.Yavornytskyi ave.,19, pavilion 3, room 24-а, Dnipro, 49005
Tel.: +38 (056) 746 32 79.
Published in No. 2, 2016, section: Information technologies, systems analysis and administration.