Recognition-free Retrieval of Old Arabic Document Images

Sari, Toufik; Kefali, Abderrahmane

Computación y Sistemas

Online version ISSN 2007-9737; print version ISSN 1405-5546

Comp. y Sist. vol.15 no.2 Ciudad de México Oct./Dec. 2011

Articles

Recognition-free Retrieval of Old Arabic Document Images

Recuperación de documentos árabes antiguos a partir de imágenes sin usar reconocimiento de caracteres

Toufik Sari and Abderrahmane Kefali

Laboratoire de Gestion Electronique de Documents (LabGED), University Badji Mokhtar, Annaba, Algeria. E-mail: sari@labged.net, kefali@labged.net

Article received on 11/15/2010.
Accepted 05/06/2011.

Abstract

Searching collections of old document images is an active problem today. In this paper, we tackle the retrieval of old Arabic document images, which form a good part of our heritage and hold inestimable scientific and cultural value. We propose an approach for indexing and searching degraded document images without recognizing the textual patterns, which avoids the high cost and difficulty of optical character recognition (OCR). Our basic idea is to recast document image retrieval from the field of document analysis into the field of information retrieval. We can thus combine symbolic notation and semic representation and exploit techniques from both fields, in particular suffix trees and approximate string matching. Each document in the collection is assigned an ASCII file of word codes, in which words are represented by their topological features, namely ascenders, descenders, etc. Instead of searching in the image, we then look for word codes in the corresponding code file. Tests performed on two types of documents, Arabic historical documents and Algerian postal envelopes, have shown good performance of the proposed approach.

Keywords: Document retrieval, Arabic handwriting recognition, approximate string matching, document analysis.
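
To make the idea of searching word codes instead of images more concrete, the following minimal Python sketch looks up a query code in a document's code file and tolerates a bounded number of coding errors. It is an illustration only, not the authors' implementation: the one-letter code alphabet (A, D, L, P), the space-separated file layout, the function names and the error threshold are all assumptions made here, and plain Levenshtein distance stands in for whichever approximate string matching method the paper actually uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two code strings (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def search_codes(code_file_text: str, query_code: str, max_errors: int = 1):
    """Return (word_index, word_code, distance) for every code within max_errors."""
    hits = []
    for index, code in enumerate(code_file_text.split()):
        distance = edit_distance(query_code, code)
        if distance <= max_errors:
            hits.append((index, code, distance))
    return hits


# Hypothetical code alphabet (an assumption, not taken from the paper):
# A = ascender, D = descender, L = loop, P = dot(s).
document_codes = "ADL PLA ADP DLLA ADL"
print(search_codes(document_codes, "ADL", max_errors=1))
```

Allowing a small number of errors matters for degraded documents, where a feature such as a dot or a loop may be missed or spuriously detected during coding. This sketch scans the code file linearly; an index such as a suffix tree would avoid rescanning the file for each query.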

Resumen

La búsqueda en imágenes de documentos antiguos es en la actualidad un tema relevante. En este artículo abordamos el problema de recuperación de documentos árabes antiguos a partir de imágenes sin usar el reconocimiento de caracteres (OCR). Dichos documentos forman una buena parte de nuestra herencia y poseen una riqueza científica y cultural invaluable. Nosotros proponemos un enfoque para indexar y buscar imágenes degradadas de documentos sin recurrir al reconocimiento de patrones textuales para así evitar el esfuerzo considerable y el alto costo que conlleva el OCR. La idea básica consiste en migrar el problema de la recuperación de estos documentos, desde el campo del análisis de documentos hacia el campo de la recuperación de información. Así, podemos combinar la notación simbólica y la representación sémica y explotar las técnicas que provienen de ambos campos de investigación, particularmente, las técnicas de árboles de sufijos y búsqueda aproximada de cadenas. A cada documento de la colección se le asigna un archivo en ASCII con códigos de palabras. Las palabras son representadas por sus características topológicas; ej. ascendientes, descendientes, etc. De esta forma, en vez de buscar en la imagen, nosotros buscamos en los códigos de palabra dentro del archivo de códigos correspondiente. Las pruebas se realizan en dos tipos de documentos: documentos históricos árabes y sobres postales argelinos. El enfoque propuesto muestra un buen rendimiento.

Palabras clave: Recuperación de documentos, reconocimiento de manuscrito árabe, búsqueda aproximada de cadenas, análisis de documento.
