ISSN: 1405-5546
Article ID: S1405-55462014000300005
DOI: 10.13053/CyS-18-3-2035
Author countries: United States of America; China
Published: September 2014
Volume 18, Issue 3, pp. 467-475
Section: Regular articles

Spotting Fake Reviews using Positive-Unlabeled Learning

Huayi Li¹, Bing Liu¹, Arjun Mukherjee², and Jidong Shao³

¹ Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA. hli47@uic.edu, liub@cs.uic.edu

² Department of Computer Science, University of Houston, Houston, TX, USA. arjun@cs.uh.edu

³ Dianping Inc., Shanghai, China. jidong.shao@dianping.com

Article received on 22/08/2014.
Accepted on 18/09/2014.

Abstract

Fake review detection has been studied for several years, but all studies reported so far are based on English reviews. This paper reports a study of detecting fake reviews in Chinese. Our review dataset comes from the Chinese review hosting site Dianping, which has built its own fake review detection system. Dianping is confident that its algorithm has very high precision, but its recall is unknown. In other words, the reviews flagged by the system are almost certainly fake, but the remaining reviews are not necessarily all genuine. This paper first reports a supervised learning study with two classes, fake and unknown. However, since the unknown set may contain many fake reviews, it is more appropriate to treat it as an unlabeled set. This calls for learning from positive and unlabeled examples (PU learning). Experimental results show that PU learning not only significantly outperforms supervised learning, but also detects a large number of potentially fake reviews hidden in the unlabeled set that Dianping's system fails to detect.

Keywords: Fake reviews, Positive-Unlabeled learning, PU-learning.
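To make the difference between the "fake vs. unknown" supervised setting and PU learning concrete, the following minimal sketch applies one well-known PU technique, the calibration of Elkan and Noto (2008), to toy data. It is not the method evaluated in this paper. The two feature buckets (`burst`, `normal`), the review counts, and the function name `fit_pu` are all hypothetical, and the constant `c` is estimated on the training data itself rather than on a held-out set, which is a simplification.

```python
from collections import Counter


def fit_pu(reviews):
    """Elkan-Noto style PU calibration on a single categorical feature.

    reviews: list of (bucket, flagged) pairs, where flagged=True means the
    review was labeled fake and flagged=False means it is unlabeled.
    """
    total = Counter(bucket for bucket, _ in reviews)
    flagged = Counter(bucket for bucket, is_flagged in reviews if is_flagged)

    # g(x) = P(flagged | x): what a supervised "fake vs. unknown" model
    # estimates when unlabeled reviews are treated as genuine.
    g = {b: flagged[b] / total[b] for b in total}

    # c = P(flagged | truly fake), estimated as the mean of g over the
    # flagged reviews (assumes flagged fakes are a random sample of fakes).
    scores = [g[b] for b, is_flagged in reviews if is_flagged]
    c = sum(scores) / len(scores)

    # Corrected estimate P(fake | x) = g(x) / c, clipped to a valid probability.
    p_fake = {b: min(1.0, g[b] / c) for b in g}
    return g, c, p_fake


# Hypothetical toy data: 8 "burst" reviews (4 flagged), 12 "normal" (1 flagged).
reviews = (
    [("burst", True)] * 4 + [("burst", False)] * 4
    + [("normal", True)] * 1 + [("normal", False)] * 11
)
g, c, p_fake = fit_pu(reviews)
for bucket in sorted(g):
    print(bucket, round(g[bucket], 3), round(p_fake[bucket], 3))
```

In this toy setting the corrected score for each bucket is at least as high as the naive supervised score, which illustrates why treating the unknown set as genuine tends to underestimate how many fake reviews remain hidden in it.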

Acknowledgments

The authors thank the spam detection team at Dianping for sharing the Chinese review dataset. This research was made possible by the help and support of engineers and scientists at Dianping, who provided valuable suggestions and indispensable effort in the evaluation.
