The Challenges of Republishing Native Language Dictionaries Online

The ILLA and its predecessor Runasimipi.org have republished over two dozen dictionaries of native languages online and formatted those dictionaries for use in StarDict and GoldenDict (electronic dictionary applications for Windows, Linux and Mac) and in SimiDic (a mobile dictionary app for Android and iOS which we helped create with KetanoLab). Republishing a native language dictionary is a very challenging task, involving more work and tribulations than might be expected.

The first challenge is obtaining legal permission to republish a dictionary. In some cases, we haven’t been able to contact the authors of the dictionaries: the authors have died, or nobody knows how to reach them. At the ILLA, we try to respect the wishes of living authors. If we are able to contact the author of a dictionary and he or she does not want the dictionary published on our website, we do not publish it.

Most authors of native language dictionaries want to spread their works as widely as possible, including in digital formats, since digital copies can be distributed at zero cost, which allows a wider public to use their dictionaries. Most dictionaries in native languages are labors of love rather than profit, and the goal is to revive the use of a dying language.

In some cases, authors authorize us to put the entire dictionary online in any format. In that case, we publish the dictionary in PDF, DOC (MS Word) and ODT (OpenOffice/LibreOffice) formats and place it on our website (www.illa-a.org/wp/diccionarios) for public download. We promise that we will not charge anybody to use the dictionary or try to sell it for commercial gain (although we may sell CDs with the dictionary to cover the cost of burning them).

In other cases, the authors are concerned that someone may download the dictionary, republish it, and sell it for a profit. Another concern is that the dictionary may be republished without the name of the author, so the author may lose the public recognition that he or she deserves. These are serious concerns in Latin America, especially in Peru and Bolivia, where illegal republication of books is commonplace. To allay these concerns, we suggest that the dictionary be published under a Creative Commons Attribution-NonCommercial license, which prohibits republication for profit and removal of the author’s name.

Many republishers in Latin America, however, pay no more attention to a Creative Commons license than they do to a traditional copyright. Placing the entire text online for public download facilitates the illegal republication of the text, since downloading the text is much easier than obtaining a physical copy and photocopying or scanning each page in the dictionary. For example, the Quechua dictionary of the AMLQ was illegally republished in Peru after we placed the text online. Even worse, the republishers removed the name of the AMLQ.

Many authors, such as Rodolfo Cerron-Palomino, Saturnino Callo, Elio Ortiz and Elias Caurey, were concerned about the pirating of their dictionaries, so they asked us to publish their texts only for use in electronic dictionaries. These applications only allow the user to access one entry at a time, so it is much harder to copy the entire text. It is technically possible to use StarDict-Tools or SQLite3 to extract the text from StarDict or SimiDic, but most republishers don’t have the technical knowledge, nor do they want to take the time to reformat the dictionary after extracting the text from the entries, so distribution of dictionaries for StarDict, GoldenDict and SimiDic poses little risk for the authors. It allows their dictionaries to be distributed in a digital format to a wider audience without competing with the sales of their paper dictionaries. In our experience, the people who will download dictionaries and use them on their PCs and mobile devices are not the same people who will buy paper dictionaries, so publication in electronic dictionaries does not affect sales. Moreover, diffusion of dictionaries via the internet helps publicize their existence, so more people will then search for physical copies of the dictionary to buy.

Unfortunately, most dictionaries of native languages have very limited print runs, so it is almost impossible to buy them or even obtain an illegal copy. In this case, publication of these dictionaries online is the only way for the public to obtain them. For example, we know of 8 dictionaries for Bolivian Quechua (Jesus Lara, Angel Herbas, Teofilo Laime, Alfredo Quiroz, Marcelo Grondín, Donato Gómez Bacarreza, Maryknoll, and Herrero and Sanchez de Lozada), plus 5 specialized vocabularies (Arusimiñee, Maria Elena Pozo Tapia, Donato Gómez Bacarreza). Of these, only the 2 dictionaries by Jesus Lara and Donato Gómez Bacarreza can be easily found in bookstores in Bolivia, and those two do not use the current spelling system approved by the Bolivian Ministry of Education. All the others are out of print. Alfredo Quiroz might republish his dictionary in the future, but it is unlikely that any of the others will ever be republished. Very few libraries in Bolivia have copies of these dictionaries, and most will not allow them to be photocopied, so obtaining a copy is nearly impossible.

There are 2.2 million speakers of Bolivian Quechua, so there is a small commercial market for its dictionaries, although the vast majority of its speakers will never buy a dictionary, since most are unable to read Quechua despite being literate in Spanish. The situation is much worse for the rest of Bolivia’s 36 languages. Most of them have fewer than a couple thousand speakers, so there is no commercial market for printed dictionaries in these languages. In this case, the only realistic way to distribute the dictionaries is for some entity such as the Bolivian Ministry of Education to subsidize the publication of paper dictionaries, or for the dictionaries to be published online, so that the few people who want them can download them, print them at home, and photocopy them for their friends and relatives.

Given the perilous state of most native languages and the limited distribution of their dictionaries, we have decided not to always respect the letter of copyright law. We consider it important to respect the wishes of authors and their descendants, but it is vitally important that dictionaries of native languages be made available to the public. In a few cases, we have decided to violate copyright law.

For example, Arusimiñee, which is a vocabulary of educational terms for Aymara, Guarani and Quechua, was published by the Bolivian Ministry of Education in 2004, yet today it is impossible to obtain a copy, since there were only 5000 copies and its commercial sale was prohibited. Nobody in the Bolivian Ministry of Education had a digital copy of the text, since all the hard drives in the Ministry were wiped clean when the previous administration turned the computers over to the MAS party. Today, when personnel at the Bolivian Ministry of Education want to consult the Arusimiñee, they download it from our website.

Current staff in the office of Educación Intra e Intercultural (including one of the authors of Arusimiñee) in the Ministry of Education have encouraged us to include Arusimiñee in SimiDic. We submitted a written request to the Bolivian Ministry of Education asking permission to republish Arusimiñee for use in electronic dictionaries, but the legal department of the Ministry never approved the republication. Staff at Educación Intra e Intercultural told us that it would be OK to republish the dictionary as long as it was not sold, so we decided to digitalize the text and publish it online for public download. The text had to be substantially reformatted for use in StarDict, GoldenDict and SimiDic, so our republication probably violates copyright law, since we were not just republishing the text for non-commercial use, which was permitted, but also modifying it, which was not.

The republication of the Vocabulario Poliglota Incaico and the 4 children’s dictionaries published by the Peruvian Ministry of Education is even more legally questionable. The original text of the Poliglota was published in 1905 by Franciscan missionaries and has now passed into the public domain. In 1998, the text was converted into the current spelling system and republished by the Peruvian Ministry of Education, which holds the copyright for that altered text. We talked with 3 of the people who were involved in the republication of the dictionary. All of them encouraged us to republish the Poliglota online, but all of them said that they did not have the authority to authorize the republication of the text. We wrote numerous requests via email to the director of Educación Bilingüe Intercultural of the Peruvian Ministry of Education, asking permission to republish the dictionary. None of our emails were ever answered. We made two trips to Lima to meet with the director, who gave us a physical copy of the Poliglota and was very encouraging about our work, but said that he could not formally give us approval for bureaucratic reasons. Four years later, we hoped that the situation had changed with new staff at the Ministry. We submitted a written request for approval to republish the Poliglota, plus the 4 children’s dictionaries for Cuzcan Quechua, Ayacuchan Quechua, Ancash Quechua and Aymara. We also met with several staff members at the Peruvian Ministry of Education in Lima, who were eager to distribute copies of StarDict and SimiDic with their Quechua and Aymara dictionaries, including the Poliglota. We never received any reply to our written request. In the end, we decided to republish the Poliglota and the 4 children’s dictionaries on our website, since these texts were paid for by the people of Peru, who deserve to have access to them.

The republication of the dictionaries of Movima and Tsimane’ is more problematic. First of all, we aren’t even sure who legally holds the copyright to these texts, so we don’t know whom to ask. The text says “Prohibida la reproducción total o parcial y venta.” (“Total or partial reproduction and sale is prohibited.”), but several institutions are listed on the inside cover of these dictionaries, and none of them is marked with a © to identify it as the copyright holder.

According to the introduction of the Diccionario Tsimane’, the bulk of the text comes from the second edition of the Diccionario Tsimane’-Castellano y Castellano-Tsimane’, which was published by La Mission Nuevas Tribus in 1993. The text was revised and updated in 2007 by a workshop for bilingual educators at the offices of the Gran Consejo Tsimane’ in San Borja, Beni, but the workshop had the financial support of the Programa de Educación Intercultural Bilingue de Tierras Bajas of the Bolivian Ministry of Education and Culture, and the dictionary was published with funds from that same Ministry in 2007. In this case, who has the authority to authorize the republication? Since a staff member of the office of Educación Intra e Intercultural at the Bolivian Ministry of Education and Culture asked us to republish the dictionary in SimiDic and even gave us photocopies of the Movima and Tsimane’ dictionaries so we could digitalize them, we decided that that was sufficient authorization. Given the bureaucratic nightmare we have had in the past trying to get any written response to our requests to the Bolivian Ministry of Education, we decided not to waste our time submitting written requests for formal permission, since we know that we won’t get a straight answer out of the legal department at the Ministry. As for La Mission Nuevas Tribus and the Gran Consejo Tsimane’, which also have a claim to the Diccionario Tsimane’, we have never tried to contact them, so we have no idea whether they would like to see their dictionary online or not.

In the case of all these dictionaries, they were published by government ministries which are putatively dedicated to promoting these languages, so we doubt that anyone will raise any objections to what we are doing with their dictionaries, even if we don’t have formal approval for republication.

The situation is more problematic when the copyright holder is a private institution or an individual author. For example, the president of the Academia Mayor de la Lengua Quechua (AMLQ) in Cusco gave Amos Batto oral authorization to put the AMLQ’s dictionary online in 2006, but the authorization was never put in writing. Other members of the AMLQ have since expressed their disapproval of this decision and in later years even filed a lawsuit against the president of the AMLQ regarding several matters, including the republication of the dictionary. Nonetheless, the AMLQ received US$20,000 from UNESCO to create the dictionary, so one might argue that the dictionary should be made available to the whole world, since it was financed by the people of the world.

One of the most troubling cases has been the Quechua dictionary by Angel Herbas. Mr. Herbas gave Amos Batto oral permission to publish his dictionary online for public download, but the children of Mr. Herbas later questioned whether their father was mentally capable of authorizing the republication, since the elderly gentleman was often confused and his memory was failing. After his death, Mr. Herbas’ son expressed his disapproval of the republication, and the ILLA decided to respect his wishes.

The best dictionary for Bolivian Quechua is arguably the Diccionario Quechua: Estructura semántica del quechua cochabambino contemporáneo by Joaquín Herrero S.J. and Federico Sánchez de Lozada, which had a very limited publication in the 1980s. We have been unable to find a copy anywhere, although both Alfredo Quiroz and Angel Herbas have copies. UCLA has placed a PDF of the text on their website for public download. We have been unable to find anyone who knows how to contact Joaquín Herrero S.J. and Federico Sánchez de Lozada and we doubt that UCLA obtained permission from the authors to put their text online.

Nonetheless, we suspect that Joaquín Herrero S.J. and Federico Sánchez de Lozada would approve our republication of their out-of-print dictionary, since they obviously didn’t create it for commercial gain. Out of all the authors whom we have contacted asking permission to republish their dictionaries online, only two have turned us down. We would like to believe that the authors would be delighted that Quechua speakers are now able to use their dictionary on their cell phones, 3 decades after the dictionary was originally published. At any rate, we decided to republish the dictionary on our website, since UCLA was already distributing it from theirs.

Unfortunately, the PDF published by UCLA only contains images of the dictionary, not digitalized text, so the images in the PDF file had to be painstakingly converted into text. It is often necessary to digitalize old dictionaries for republication online. At the ILLA and previously at Runasimipi.org, we first had to digitalize roughly half of the dictionaries we have republished online since 2006, including the dictionaries of the AMLQ-Cusco, Teofilo Laime, Arusimiñee, González Holguín, Ludovico Bertonio, HABLE Guaraní, the Vocabulario Poliglota Incaico, Tsimane’, Movima, and Herrero and Sánchez de Lozada.

Even when authors would like us to republish their out-of-print dictionaries online, they often no longer have the original files, especially for any text published over a decade ago. For example, Xavier Albo, who coordinated the republication of the dictionary by Ludovico Bertonio, told us that Radio San Gabriel no longer had the original text file for Bertonio’s dictionary. It took volunteers at the ILLA two years to digitalize the text so it could be republished online. Each page had to be photographed, then passed through an OCR program to convert it into text. Because no OCR program is trained to recognize Aymara or any other native language, the volunteers had to go through the text and correct the errors in the digitalization line by line, which is a very painstaking process. Since the ILLA has no funds to pay for the digitalization of dictionaries, all the work is done by volunteers, who contribute their time when they can. Thus, progress has been very slow, despite the public need for these dictionaries.

At the ILLA we are committed to the use of free and open-source software, but we have found that none of the OCR programs available in Linux are adequate. Instead, we use FineReader 8.0 running in a Windows virtual machine to digitalize the texts. It takes hundreds of hours of mind-numbing work to manually correct the texts. FineReader generally does an excellent job of converting Spanish texts, especially if the images are clean and flat. We originally used a digital camera to take photos of each page, pressing the pages under a piece of glass to keep them flat. The glass often caused reflections, but the OCR results were better. The other option was to pull all the pages out of their binding, so the pages could be placed flat on a table and photographed. Eventually we bought a scanner, which has substantially improved the quality of the images, even though scanning each page is slower than photographing it. The extra time spent scanning a page is much less than the time spent manually correcting texts which come from bad images.

FineReader allows the user to define the letters of a custom language, so we can substantially improve the OCR by limiting the set of letters that FineReader will recognize. The problem is that the texts are bilingual in Spanish and the native language, so FineReader first tries to scan the native language as Spanish, then tries to scan it as the custom language which we defined. FineReader has no idea what sequence of letters makes sense for a word in the native language. For example, many native languages use “¡” (an upside-down exclamation mark), but FineReader has no idea that that character should only appear at the beginning of a phrase, so it is prone to place it in the middle of words in place of an “i”. Every time FineReader reads the word “misi” (cat) as “m¡si”, a volunteer has to tediously correct the error. FineReader typically makes thousands of such mistakes when digitalizing a dictionary. For example, it reads “ri” as “n” and “m” as “rn”, since it has no dictionary for the native language to guide it as to what is a proper word. Dictionaries often place native words in italics, which dramatically increases the number of errors, especially if the text is in a sans-serif font such as Arial, which makes it difficult to distinguish letters such as a capital I (vowel) from a lower-case l (consonant).

FineReader 8.0 also cannot be programmed to recognize many of the specialized characters used in native languages, such as ń (n, U+301) and ḿ (m, U+301) in Movima, ɨ (U+268) in Guaraní and Mosetén, ɨ̈ (U+268, U+308) in Guaraní, and p̂ (p, U+302), q̂ (q, U+302), ạ (U+1EA1), ạ́ (U+1EA1, U+301), ẹ (U+1EB9), ẹ́ (U+1EB9, U+301), ị (U+1ECB), ị́ (U+1ECB, U+301), ọ (U+1ECD), ọ́ (U+1ECD, U+301), ụ (U+1EE5) and ụ́ (U+1EE5, U+301) in Tsimane’. Each occurrence of these letters has to be painstakingly corrected in the text after OCR.
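For the scripts that search for and correct these sequences, it helps to build each character from its codepoints rather than typing it, since combining marks are nearly invisible in most editors. Here is a minimal PHP sketch using the codepoints listed above; the file name and the idea of counting occurrences are just illustrations:

<?php
// Sketch: build the special characters from their codepoints so a
// correction script can search for them reliably. PHP 7+ double-quoted
// strings accept \u{...} escapes.
$special = [
    'n_acute'     => "n\u{301}",         // ń in Movima
    'm_acute'     => "m\u{301}",         // ḿ in Movima
    'i_bar'       => "\u{268}",          // ɨ in Guaraní and Mosetén
    'i_bar_diaer' => "\u{268}\u{308}",   // ɨ̈ in Guaraní
    'a_dot'       => "\u{1EA1}",         // ạ in Tsimane'
    'a_dot_acute' => "\u{1EA1}\u{301}",  // ạ́ in Tsimane'
];

// Example: count how often each sequence survived the OCR in a UTF-8 file.
$text = file_get_contents('diccionario.txt');  // hypothetical input file
foreach ($special as $name => $seq) {
    echo $name, ': ', mb_substr_count($text, $seq), "\n";
}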

When FineReader makes a mistake with odd combinations of characters such as “{e” or “¡m”, it is easy to search for them and replace them, but other errors are much harder to find, because the error may be a common combination of letters in Spanish, but not in the native language. Since the text contains both Spanish and the native language, there are hundreds of occurrences to search through to find the errors. FineReader does not support regular-expression searches, which would help locate the errors. The only way to deal with this problem is to export the text to another editor, such as LibreOffice Writer, which does support regular-expression searches, find an error in the text, then flip back to FineReader to correct it.
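Some of this hunting can also be scripted outside of FineReader. A minimal PHP sketch that flags suspicious combinations for manual review; the two patterns and the file name are illustrative, since a real list is built up per dictionary as errors are discovered:

<?php
// Sketch: flag likely OCR errors for manual review. The patterns are
// examples of the confusions described above ("¡" between letters,
// words beginning with "rn").
$patterns = [
    '/\p{L}¡\p{L}/u',  // "¡" in the middle of a word, e.g. "m¡si" for "misi"
    '/\brn\p{L}*/u',   // a word starting with "rn", often a misread "m"
];

$lines = file('diccionario.txt');  // hypothetical plain-text export
foreach ($lines as $num => $line) {
    foreach ($patterns as $p) {
        if (preg_match($p, $line, $m)) {
            // Report the line so the error can be corrected in FineReader.
            printf("line %d: suspicious \"%s\"\n", $num + 1, $m[0]);
        }
    }
}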

After going through the text line by line in FineReader, the text needs to be exported to a word processor for manual editing and formatting. Most dictionaries have two or three columns of text per page, which means the exported text is full of short lines terminated by line breaks, which have to be manually eliminated in the text editor. In many cases, the exported text also lacks line breaks to separate the dictionary entries, so those have to be manually inserted. Many hours have to be spent editing the text after it has been exported from FineReader. If the dictionary entries need to be color coded, or certain text needs to be placed in italics or bold, that also has to be added manually, since FineReader rarely reads the formatting of a text correctly. All headers and footers in the exported text have to be manually deleted, then recreated in the text editor.

After the dictionary entries have been formatted, section headers need to be added for each letter, so the digital text can be easily navigated when read as a PDF. That way the reader can open the Bookmarks sidebar in Adobe Acrobat and click on a letter to jump to that section of the dictionary.

Next, a script has to be written to pass through the dictionary and convert each entry so it can be used in GoldenDict and StarDict. The key word(s) of each entry need to be separated from the definition. The key word(s) need to be converted into plain text, but the definition can contain formatting: Pango Markup for StarDict and HTML for GoldenDict. It is easiest to process plain text, but if the text contains formatting, it has to be exported from the word processor as HTML. Then the script has to go through the text, eliminate all the HTML code that GoldenDict can’t recognize, and convert the rest into Pango Markup for StarDict. Writing code to correctly process HTML can be tricky and often involves a great deal of painful trial and error.
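To give a concrete idea of that last step, here is a minimal sketch in PHP of the kind of tag conversion involved. It is not the actual ILLA script; the tag whitelist and the sample entry are illustrative:

<?php
// Sketch: reduce exported HTML to a small subset for GoldenDict, then
// derive Pango Markup for StarDict. Pango shares <b>, <i> and <u> with
// HTML, but colors use <span foreground="..."> instead of <font>.
function cleanHtmlForGoldenDict(string $def): string {
    // Keep only tags that display reliably (illustrative whitelist).
    return strip_tags($def, '<b><i><u><font><br>');
}

function htmlToPango(string $def): string {
    $def = preg_replace('/<font color="([^"]+)">/',
                        '<span foreground="$1">', $def);
    $def = str_replace('</font>', '</span>', $def);
    // Pango has no <br>; in a TAB file the line break is written as the
    // two-character escape "\n".
    return str_replace('<br>', '\n', $def);
}

$html = '<b>misi.</b> <font color="green">s.</font> Gato. <i>Ejemplo.</i>';
echo htmlToPango(cleanHtmlForGoldenDict($html)), "\n";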

The other option is to export the text as plain text, then use regular-expression searches to insert HTML and Pango Markup formatting into the definitions. For example, abbreviations such as “adj.” or “Gram.” might be placed in green, definition numbers might be placed in bold, and examples might be placed in blue italics. It is possible to insert formatting in this way only if the text is consistent. Authors often shorten the same word with 2 or 3 different abbreviations or do not consistently indicate elements in the same way, so it is often necessary to spend many hours correcting a text to standardize all the entries before a script can correctly insert the proper formatting.
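A minimal sketch of this kind of regex formatting, assuming the abbreviations have already been standardized; the abbreviation list and the colors are illustrative, not those of any particular dictionary:

<?php
// Sketch: insert formatting into plain-text definitions with regular
// expressions.
function formatDefinition(string $def): string {
    // Grammatical abbreviations in green (illustrative list).
    $def = preg_replace('/\b(adj|adv|s|v)\./u',
                        '<font color="green">$1.</font>', $def);
    // Definition numbers ("1.", "2.", ...) in bold.
    $def = preg_replace('/(^|\s)(\d+)\./u', '$1<b>$2.</b>', $def);
    return $def;
}

echo formatDefinition('s. 1. gato. 2. felino doméstico.'), "\n";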

Each dictionary has a different format for its entries, so the script has to be customized for each dictionary. In many cases, the authors are inconsistent in the format of the entries, so it is necessary to go through the entire dictionary and correct the text so it can be processed by a script. For example, an author might use a dot to terminate each key and separate it from the following definition, but sometimes there is no dot: a question mark, an exclamation mark, or an opening parenthesis or square bracket might terminate the key word(s) of an entry instead. In many cases the key word is separated from the definition by a space, but key words can also contain spaces, so the script can’t just search for the first space to find the end of the key word(s). We try to promise authors that we won’t alter their dictionaries, but that promise is impossible to keep, because the text often needs to be cleaned up before a script can properly separate the keys from their definitions.
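A minimal sketch of that splitting logic in PHP; it is purely illustrative, and in practice each dictionary needs its own customized variant:

<?php
// Sketch: separate the key word(s) from the definition when the key may
// be terminated by a dot, question mark, exclamation mark, or an opening
// parenthesis or square bracket.
function splitEntry(string $entry): ?array {
    // Key terminated by punctuation, e.g. "misi. s. Gato."
    if (preg_match('/^(.+?[.?!])\s+(.+)$/u', $entry, $m)) {
        return [trim($m[1]), trim($m[2])];
    }
    // Key terminated by an opening parenthesis or bracket.
    if (preg_match('/^(.+?)\s*([(\[].+)$/u', $entry, $m)) {
        return [trim($m[1]), trim($m[2])];
    }
    return null;  // no terminator found: flag the entry for manual review
}

print_r(splitEntry('misi. s. Gato doméstico.'));         // hypothetical entry
print_r(splitEntry('¿imaynalla? saludo de encuentro.'));  // hypothetical entry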

If the dictionary only contains entries in one language, then the script should also flip the entries to create entries in the other language. For example, the script convertDicMovima.php uses the content of the Movima-Castellano dictionary to also create a Castellano-Movima dictionary, so users can search for words in both Movima and Spanish.
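The flipping itself can be fairly simple if the definitions are clean. A minimal sketch of the idea; the sample entries are invented placeholders, and real definitions contain senses, examples and formatting that need much more parsing:

<?php
// Sketch: derive a Castellano->Movima index from Movima->Castellano
// entries. The two sample entries are invented for illustration.
$movima = [
    'palabra1' => 'casa',
    'palabra2' => 'hijo, hija',
];

$castellano = [];
foreach ($movima as $key => $def) {
    // Each comma-separated gloss becomes a key in the flipped dictionary.
    foreach (explode(',', $def) as $gloss) {
        $castellano[trim($gloss)][] = $key;
    }
}

foreach ($castellano as $word => $keys) {
    echo $word, "\t", implode(', ', $keys), "\n";
}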

The script creates a plain-text TAB file, which uses new lines to separate the entries of the dictionary and tabs to separate the keys from their definitions. The script then calls StarDict’s tabfile utility to create the digital files used by StarDict, GoldenDict, and many other electronic dictionaries such as Babiloo and ColorDict. The TAB file for StarDict needs to contain Pango Markup, whereas the TAB file for GoldenDict needs to contain HTML code. Later, either of those TAB files can be passed to SimiDic-Builder to create the SQLite3 dictionary file used by SimiDic.
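As a sketch of this output stage (the entry, file names and the tabfile path are illustrative; the utility’s location varies by system):

<?php
// Sketch: write a TAB file and hand it to StarDict's tabfile utility.
// One entry per line; a tab separates the key from its definition, and
// a line break inside a definition is written as the two characters "\n".
$entries = [
    'misi' => '<b>misi.</b> s. Gato.\n<i>Ejemplo de uso.</i>',  // hypothetical
];

$fh = fopen('diccionario.tab', 'w');
foreach ($entries as $key => $def) {
    fwrite($fh, $key . "\t" . $def . "\n");
}
fclose($fh);

// tabfile generates the .ifo, .idx and .dict files that StarDict reads.
// The path below is where Debian's stardict-tools package installs it;
// adjust for your system.
system('/usr/lib/stardict-tools/tabfile diccionario.tab');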

Then the dictionaries have to be tested in each electronic dictionary to verify that the entries are displayed correctly. Each electronic dictionary program has its own distinctive quirks which have to be worked around. In the case of SimiDic, we often have to change the code of SimiDic-Builder so it will import the TAB files correctly. In the case of GoldenDict and StarDict, we often spend hours of trial and error to figure out what works. For example, some formatting codes don’t work as we expect or look ugly, so we have to go back and change them. StarDict will refuse to display an entry if the definition lacks a closing tag for a formatting code; for example, an entry with an opening tag for italics but no closing tag will have a blank definition. The only way to detect these problems is to manually go through all the entries in the dictionary and check each one in StarDict.

Once all the dictionary files have been tested, we post them on the ILLA website and the SimiDic website for public download, then send out an email and post on Facebook announcing the release of the dictionary.

SimiDic is very easy to install, and importing the dictionaries into the program is simple; most people who have an Android or an iPad/iPhone/iPod are savvy enough to figure out how to download SimiDic from Google’s Play Store or Apple’s App Store. We have had over 10,000 downloads thus far, which tells us that the dictionaries are reaching the public. Sadly, most people don’t have the technical skill to follow the online instructions for downloading and installing GoldenDict or StarDict so they can use the dictionaries on their PCs. It is not that complicated to download the dictionary files, decompress them, and place them in the proper directory where GoldenDict and StarDict can read them, but people just aren’t accustomed to reading directions. They expect to click on a simple “Install” button, and anything more complicated is considered too difficult. Since we don’t have any funds to create an installer for the dictionaries, we are still looking for a volunteer to create a Windows installer and Linux packages for Ubuntu and Debian.

Although the ILLA has now digitalized 10 dictionaries and has written scripts to convert over two dozen dictionaries for use in electronic dictionaries, the process of digitalization and the creation of conversion scripts is still fraught with unexpected trials and tribulations, since each dictionary poses unique problems.

For example, the digitalization of the Quechua dictionary by Herrero and Sánchez de Lozada proved to be a nightmare, because we couldn’t export the text from FineReader. The Quechua-Spanish volume contains 904,000 words in 586 pages, with 3 columns of 6-point Arial text per page. After deleting all the page headers and page numbers in the text, the exported DOC file created by FineReader is 24.3MB. For some reason, LibreOffice 3.5 crashes when it tries to open that DOC file. WordPad and AbiWord, however, are able to open it. The problem is that each page contains a table with 3 columns of text. Inside the columns, the hard returns that separate the dictionary entries and break the definitions into new lines were all in the wrong places. Converting the tables to text and then going through all the text inserting and deleting hard returns would have been a very tedious task in a word processor.

Rather than attempting this manually, we thought that it would be easier to export the text as HTML and write a script to extract the text from the HTML tables, parse the text, and insert paragraphs <p> and line breaks <br> in the right places. Amos Batto wrote a PHP script to process the HTML, but after writing the script, we discovered that the text output by the script was in a different order from the original text. Sections of text were in different places. When FineReader scanned the pages, it did not always scan the text on each page as 3 columns; it often scanned it as multiple blocks and then created a separate cell in an HTML table for each block when the text was exported. Not all the cells were in the same order in the table as they had appeared on the original page. Attempting to go through each page and figure out where the text had been reordered proved to be an insurmountable task, so we threw away the script and decided that we would have to manually copy the text out of each table in the HTML file. After going through a third of the dictionary, manually copying text out of each cell in the tables and pasting it into a new file, we discovered that FineReader doesn’t place the text in the correct order inside the exported HTML tables, so we had to throw away a day of copying and pasting and go back to the drawing board.

Then, we tried exporting the text from FineReader as an RTF file and opening it in WordPad, which doesn’t support tables, so the text would automatically lose the tables when we saved it in WordPad. The text in the exported RTF file was in the correct order, and WordPad was able to save the text without any tables, but LibreOffice was unable to open the RTF file after it had been saved in WordPad. We then tried opening the RTF text in AbiWord, but AbiWord wasn’t designed to handle 24MB files, and it proved impossible to edit the text in AbiWord. When we tried exporting the text from AbiWord as an ODT file, LibreOffice still wasn’t able to open the file. However, we found that it was possible to save the text in AbiWord as an HTML file, and then import that HTML file into LibreOffice. Unfortunately, when AbiWord exported the text as HTML, it removed all the spaces around the <i> and <u> tags, so all the italicized and underlined words in the text were joined to other words. To edit the HTML so we could reinsert the spaces, we tried opening the text with BlueFish. Searching and replacing in a 30MB HTML file proved to be impossible inside BlueFish, since its search function first highlights every occurrence of a search. In such a large file, searches caused BlueFish to hang for several minutes. Its replace function was agonizingly slow, taking up to 15 minutes to complete one global search and replace. The same search and replace in GEdit took only a couple of seconds, but GEdit doesn’t support regular-expression searches, which we needed. After spending 2 hours reinserting spaces around the HTML tags, we imported the fixed HTML file into LibreOffice. At this point, we thought that we had the problem licked by using this sequence:

FineReader (RTF) -> WordPad (RTF) -> AbiWord (HTML) -> BlueFish (HTML) -> LibreOffice

Once we had the file open in LibreOffice, we started fixing all the misplaced hard returns and formatting the text. Then we discovered that AbiWord’s exporter hadn’t just deleted the spaces around the HTML tags; it had also randomly deleted some of the text. Trying to hunt through all the text to find all the deletions would have been impossible in a text of that size. We then tried exporting the text from AbiWord as an SXW file, an old StarOffice format which LibreOffice still supports. LibreOffice was able to import the SXW file without any problems, and there were no random deletions of text.

At this point we thought we had licked the export problem, but then we discovered that there was no space between the word ending a line in a column and the first word on the next line. The text looked fine when it was in a column in a table, but when it was taken out of the table, the two words were joined without a separating space. At first we thought it was another AbiWord export problem, but then we found out that this is simply how FineReader behaves when it exports to DOC or RTF format. Maybe this isn’t a problem when the DOC file is opened in Microsoft Word; we don’t have a copy of MS Word to test it, but we saw it in both AbiWord and WordPad.

At this point, we threw up our hands and thought that the only way to export the text would be as plain text, but that would mean losing all the formatting in the dictionary. The dictionary has thousands of words in italics, and it would take many hours to manually re-add the italics to the text. Then we tried manually copying the text out of FineReader and pasting it into LibreOffice. Copying and pasting between FineReader in a Windows virtual machine and LibreOffice in Linux was only supported as plain text. Nonetheless, we discovered that it was possible to copy and paste formatted text between FineReader and LibreOffice in the same Windows virtual machine, so we spent the next 4 hours manually cutting and pasting each page between FineReader and LibreOffice in Windows.

We saved the text in LibreOffice as HTML. Then we wrote a PHP script to go through the HTML, delete all the paragraph breaks, and re-parse the text to add paragraphs <p> to separate the dictionary entries and line breaks <br> in the proper places inside the entries. Even after running the script, the output text still wasn’t perfect. We had to spend 2 days inserting and deleting line breaks inside the entries, because the script wasn’t able to correctly guess where the Quechua examples started and where the Spanish translations for each example started. Nonetheless, using the script probably cut the formatting time in half compared to manually formatting the text.
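The heart of that script can be sketched as follows. The bold-key heuristic and the file names are illustrative, since the real script used rules specific to this dictionary:

<?php
// Sketch: delete the misplaced paragraph breaks, then re-parse the text,
// opening a new paragraph <p> wherever a line looks like the start of an
// entry and joining the other lines with <br>. The heuristic here (an
// entry starts with a bold key) is only an illustration.
$html = file_get_contents('herrero.html');           // hypothetical input
$text = preg_replace('/<\/?p[^>]*>/', "\n", $html);  // drop old paragraphs

$out = [];
foreach (preg_split('/\n+/', $text) as $line) {
    $line = trim($line);
    if ($line === '') continue;
    if (preg_match('/^<b>/', $line)) {
        $out[] = '<p>' . $line;   // a bold key starts a new entry
    } else {
        $out[] = '<br>' . $line;  // continuation line within an entry
    }
}
file_put_contents('herrero-arreglado.html', implode("\n", $out));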

In all, we guesstimate that it took roughly 40 hours of work to figure out how to export the dictionary from FineReader and then format it in LibreOffice. We still haven’t written the script to convert the text so it can be used in GoldenDict, StarDict and SimiDic, but we expect the conversion to take just as much work. In addition, we spent roughly 80 hours manually correcting the text in FineReader, and the text still has many minor errors, but we think that it is more important to make the text available to the public, even with errors, than to wait for years trying to find volunteers to check every line. Thankfully, we didn’t have to scan the pages of the text, since UCLA had already done that task for us, but 160 hours of work to digitalize and convert a single dictionary is a great deal of time and energy to ask of volunteers.

We hope that we can find some funding to pay people to manually check the text and eliminate the errors, but it is unlikely that any institution is willing to give us financial support for this work. Digitalizing and republishing old dictionaries online simply isn’t considered important by funding institutions. These institutions may be willing to fund the creation of new texts in native languages, but there simply isn’t any support for the diffusion of existing texts, especially when the legal rights to republish those texts are often questionable. Nonetheless, there is little hope for the revitalization of native languages in Latin America if their speakers don’t have access to dictionaries and other texts in their native languages. Academics and specialists are willing to spend great time and energy hunting down old copies of out-of-print dictionaries, but the general public is not. These dictionaries need to be readily available at the click of a button on people’s cell phones and PCs, accessible to anyone who wants to download them at zero cost. We encourage everyone who is passionate about preserving native languages to join us in this effort. We need volunteers willing to donate their time, and we ask people to help us pay for the costs of this effort.

Tupananchikkama,
Jikisiñkama,
Amos Batto
General Coordinator
Instituto de Lenguas y Literaturas Andinas-Amazónicas (ILLA)
