Imron's Chinese Text Analyzer Review

This is a series of posts from this thread at the Chinese Forums website. I think this is a good forum for people interested in tools and study aids to help in the study of Chinese, and I would highly recommend you join. My username is Kikosun. Here I have done a review of a new Chinese Text Analyzer, and I highly recommend you take a look at it.


Chinese Text Analyzer review:

I had the pleasure of reviewing Imron's new Chinese Text Analyzer program after receiving a free license courtesy of Chinese Forums. You can download the program here.

I'm an upper-intermediate, self-study student. I'm a heritage learner, so my spoken Chinese is much better than my reading. I thought this tool would be a great asset in helping me acquire better reading skills. I planned on using it with Pleco, so this review will discuss the integration of both applications.

So I have a desktop running Windows and a MacBook Pro running Mac OS X, and I installed the Chinese Text Analyzer on both. I used Wine to install Chinese Text Analyzer on the Mac, and there were no problems with installation. The Windows installation was much easier, as the program runs natively on Windows. It was fast, and I had no issues. It takes literally two button clicks to install on Windows. Most of this review will be based on my experience with the program on Windows.


So the first step the program recommends is to import a list of your known words, so that the text analyzer can identify known and unknown words appropriately.


I use Pleco for all of my study and flashcards, so it has a complex list of all the words that I've been tested on and know. In Pleco you can define your known words however you like. I define my "known" words as words that have a score of greater than 1000 points.


So the first thing I did was figure out how to export my list of known words from Pleco into the Text Analyzer. In Pleco, I did this by going to Organize Cards, and making a New Category called "Known". I then used the search function in Organize cards to search for score >= 1000, then I batch added all the cards that came up to the Known category.


I then used the Import/Export selection, changed Export cards to "cards in categories" instead of "all cards", and selected my Known category. I exported as a text file in UTF-8, and exported words only (no definitions), as I think Text Analyzer only requires a list of words.


I think the only down side to doing it this way is that you have to do a manual search to add new cards to your Known category and manually export the known card list into Text Analyzer each time if you want to keep your known words updated. I don't know if there is an easier way of exporting your known words out of Pleco, but this way worked for me.


Anyway, I then used the File Manager in Pleco to upload the file via wifi to my computer. (I love this feature of Pleco). The exported file is just a txt file that you can then import into the Chinese Text Analyzer.


When you first install the Chinese Text Analyzer, it has a popup that says "Welcome to Chinese Text Analyzer! Before you begin you should import lists of words that you already know. Chinese Text Analyzer can read files exported by popular flashcard programs such as Pleco and Anki, or you can import words from pre-made lists of HSK vocabulary. Later on you can manually add words while you are reading Chinese content."


In this window, you can either click "Import..." or you can import words using File --> Import...


I imported my list of known words from Pleco, and it imported, but I would have liked to see a success message or something to let me know it worked ok. Rather, it just took me back to a blank screen, and I wasn't sure if anything had happened. I was able to get confirmation by going to Word Lists --> View Known and seeing a list of words there.


I then tried opening some reading practice files. By going to File --> Open, I was able to find my txt files, and they opened very quickly. I was even able to shift & select an entire folder's worth of files and open them at once. Even when opening 10+ files, the program was very snappy. If you open multiple files at once, they open in individual tabs in the program, which is very nice. I did try to overload the program with a bunch of the longest texts I have, and it was still amazingly fast to analyze them. However, I did notice that if you have more than about 7 or so tabs open, you will be unable to reach the tabs on the right, since there is no way to access tabs that don't fit on the screen. I don't know how important this is, as people theoretically won't be reading 10 books at once, but I thought I would note this finding.


I am very impressed by how fast the program opens and analyzes the documents. Here are some well-known novels that I've tested, with their processing times in seconds (I opened all 4 at the same time using shift & select in the File --> Open box):

Journey to the West: 0.14 seconds
Red Chamber: 0.17 seconds
The Three Kingdoms: 0.11 seconds
Water Margin: 0.09 seconds

The processing time is taken from the upper left statistics window, which I will describe in more depth later. It probably does vary based on your computer specs, and I have to admit my computer is pretty decked out for photo and video processing. But I imagine the program will run pretty fast on all computers, and I think the claim of segmenting a novel in under 1 second is definitely true.


The default font that the program uses is ok, but not my favorite. You can go to Format --> Font, and there are a few other font options that you may like more. I'm not sure where the program gets its fonts from - if it is using pre-installed fonts on your computer, or if the program has a set of fonts that comes with it - but I went through the font options that I had, and there are quite a lot of font options that do not display Chinese characters correctly or at all (white boxes). Given that the sole purpose of this program is to display Chinese text, I think it would be really helpful if you curated the available fonts to only those that display Chinese text. I didn't go through all of the options, but in my cursory look I would say that 80-90% of the font options are not suitable for Chinese characters. Again, I'm not sure if this varies based on what fonts you have pre-installed on your computer or not.


I don't know if this is an option, but it might be nice if you could include a more brush script-y type font. I like the FZKaiTi font available as an add on in Pleco.


Now on to the statistics windows on the right side. The top window appears to have statistics for the entire document, including total number of words, total known words, percent known words, number of unique words. I noticed that the headings "Known" and "Percent Known" are used under the "Total" and "Unique" categories, and I recommend you make a clearer division between the "Total" section and the "Unique" section. Otherwise, it might look like the "Known" and "Percent Known" are duplicates, but they have different numbers.


The program also lists some character statistics and File statistics. I'm not really sure how important the File Statistics are, but I guess it doesn't hurt to have them there. I probably would never really look at it though in real usage.


One additional statistic that I think would be good to have is Number of Unknown words in a document. This way you could get an idea of how many words are left to learn for any particular text. I guess you could always calculate this yourself with number Unique minus number Known unique, but it shouldn't be hard to implement the Number of Unknown as well, which may be more helpful than the number of known words.
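The subtraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text is assumed to be already segmented into a list of words, and `word_stats`, `words`, and `known` are hypothetical names, not anything taken from Chinese Text Analyzer itself.

```python
from collections import Counter

def word_stats(words, known):
    """Summarise a segmented text against a set of known words."""
    counts = Counter(words)
    unique = len(counts)
    known_unique = sum(1 for w in counts if w in known)
    return {
        "total": len(words),
        "unique": unique,
        "known_unique": known_unique,
        # The statistic suggested in the review: Unique minus Known unique.
        "unknown_unique": unique - known_unique,
    }

# Hypothetical, already-segmented sample text and known-word list.
words = ["我", "喜欢", "看", "书", "我", "喜欢", "茶"]
known = {"我", "看", "书"}
stats = word_stats(words, known)
print(stats)
```

With this sample, two of the five unique words (喜欢 and 茶) are not in the known list, so those are the words left to learn.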


The bottom right window has statistics broken down for each word. For each word, it lists Frequency, % Frequency, Cumulative % Frequency, and First Occurrence. I think the Frequency and % Frequency columns are the most important, especially if you want to prioritize vocabulary studying. You can very easily sort words by frequency.


I'm honestly not sure what "Cumulative % Frequency" means, and I was not able to figure it out.

I'm not sure how helpful the "First Occurrence" column is either. I haven't determined a use for it.


I did notice that if you double click anywhere in the row for a word in the bottom right window, it will automatically take you to the first occurrence of that word and will highlight all other occurrences in pink. I think an additional feature I would like to see would be a set of left right arrows so you can go to the next occurrence of the word of interest fairly easily if you are in a long document, and see each place the word is used in context. I think you can use the Edit ---> Find feature for this as well, but it would be nicely streamlined if a set of left/right arrows popped up when you double clicked a row in the word statistics window, without having a window blocking your text. Or even better, have the left and right keyboard keys move between each instance of the word.


There are three tabs on the bottom of the window to look at All words, Known words only, or Unknown words only. I have no issues with that layout. There is also a search field, which I have not used extensively. I think it only works if you type the characters. Maybe one future feature could be allowing pinyin search as well.


Now my review of the reading experience. I imported a few documents, and quite a few words that I already knew were marked in red as unknown; perhaps I just never made Pleco cards for them. I found it very annoying to have to right click a known word and mark it as known. I think it would be nice if there was a keyboard shortcut for marking words as known, maybe hitting the spacebar or enter key or something, to make this process easier and less intrusive on the reading experience. I just don't like having to right click and select from a text list to mark words as known. It really does take away some of the flow of reading, which I think hitting a keyboard key would improve.


I'm not sure if Imron had in mind designing the Chinese Text Analyzer as just a tool to aid in picking which books/texts to read, or as a stand alone reader, or both. But I've been spending some time with it, and I find that it is very helpful in determining how appropriate a text is to your vocabulary level. This is of course assuming you update your known words list periodically, which may be kind of a hassle.


However, I'm not sure if I will spend most of my time doing dedicated reading on it. I have to admit that I miss having a pop up dictionary feature. I understand that Imron left this out purposefully to discourage bad habits. I'm sure over time I can get used to not having a pop up dictionary and studying the unknown words independently, but as of now I'm finding it hard to give up the crutch. This is especially true in cases where not knowing a few key words in a sentence completely prevents you from understanding its meaning. I do find it more challenging to read without a pop up dictionary, and there is somewhat of a mental block knowing that you don't have something convenient to fall back on.


One thought that I had for people who may choose to use the Chinese Text Analyzer as a dedicated reader is the fact that there is no Bookmark feature in the program. Especially for longer novel length books, it would be immensely helpful to have a bookmark feature so you don't have to find your place again if you stop reading and close the program. It may also be helpful to have an option to record notes in certain sections of the books or mark up places that you had difficulty reading and may want to go back and re-read after studying the vocabulary in that section.


Now I'll review the Export settings. I have tried the File --> Export --> To File settings. I have never tested the To Email, because I think it works through Microsoft Outlook, and I do not use Outlook.


When you go to File --> Export --> To File, a dialog box opens up with two tabs. The Document tab is first, and is sectioned into Document, Paragraph, and Word sections with "Pre" and "Post" under each section with a text box field. I actually do not know what these options do, as it was entirely unclear in the program. I think there should be a sentence or two of explanation here. I left all the fields blank, and it exported the entire document I had open with no changes. I do not know what the Pre and Post mean and what that tab is meant to do.


The second tab under Export is labeled Word List. This tab seemed much more intuitive. You can export All words, Known words, or Unknown words. I personally think that the default should be set to Unknown instead of All, as I think that is how most people will be using the program. I for one intend to use it to identify unknown words that need further study in Pleco, and I found that I very easily accidentally exported "All" words instead of "Unknown" words since All is currently the default. You can sort by Frequency, First Occurrence, or Word in ascending or descending order. I think the Frequency (Descending) as the default is appropriate for this one. You can select to export All rows, or the Top X number of rows (in case you want to just study the most frequently used 100 words in a novel for example). I think this is a very useful feature.


There are lots of fields available for export: Word, Simplified, Traditional, Simplified[Traditional], Pinyin (Tones), Pinyin (Numbers), English Definition, Sentence, Cloze Sentence, Frequency, % Frequency, Cumulative Frequency, First Occurrence. And you have the option of selecting as many fields as you want to export, so there is a lot of flexibility.


I'm not really sure what the difference between Word and Simplified is, since I exported both fields and they are the same in my test set. Perhaps it depends on what format the original document that the word came from uses. All of my texts were imported in Simplified.


Most of the other fields seem self explanatory. I'm not sure what dictionary is used, but each word has several of the most common definitions separated by "/". My test set seemed to import fine into Microsoft Excel as a tab-delimited file.


One very interesting part of the Chinese Text Analyzer is its ability to export Sentences where your word is found. It seems to export the sentence that contains the first occurrence of the word. I did notice that when I exported the "Sentences" and "Cloze Sentences" fields, some of the fields exported with the previous sentence's period preceding them. An example:

众人

。许多道[...]等,送到后山,指与路径。

Other than the leading period, it seems to parse the sentences well.  Not all of the 100 words I exported in my test set had a leading period, but the majority of them did. It may have something to do with the source document I used, so this may vary for other people, I don't know.


I also tested the export function with both the Sentences field and Cloze Sentences field exported. I am not sure why, but some of the rows imported oddly into Excel: the Cloze sentence was cut off and put into a second row. I don't know whether tabs in the actual text are causing the problem or not.


This example was exported with the fields: Word, Simplified, Traditional, English Definition, Sentence, Cloze Sentence. You can see that the Cloze sentence got put on the second row.


/to walk/to go/to run/to move (of vehicle)/to visit/to leave/to go away/to die (euph.)/from/through/away (in compound verbs, such as 撤走)/to change (shape, form, meaning)/

楔子  张 瘟疫  洪 妖魔



楔子  张 瘟疫  洪 [...] 妖魔






This is the sentence in context of the actual text. Note that it is not an actual sentence, there are no periods (and no leading period) but it is followed by a return.

水浒传


楔子 张天师祈禳瘟疫 洪太尉误走妖魔


     纷纷五代乱离间,一旦云开复见天!草木百年新雨露,车书万里旧江山。      寻常巷陌陈罗绮,几处楼台奏管弦。天下太平无事日,莺花无限日高眠。


I think the problem occurs when there is an Enter/return at the end of the sentence that the program picks.  I haven't investigated this extensively, but I thought I would let you know there may be a slight bug with exporting of sentence fields. This of course is probably dependent on the quality of the text you are deriving the sentences from, I understand. But I think it might be worth investigating and seeing if these small issues are repeatable and can be fixed before the big release.
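To illustrate the suspected cause, here is a minimal Python sketch of how a line break left inside a field splits one tab-delimited row into two, and how cleaning the field first avoids it. This is an assumption about the mechanism, not a look at how Chinese Text Analyzer actually writes its export; `clean_field` is a hypothetical helper, and the sample cloze text is made up from the heading line shown above.

```python
def clean_field(text):
    """Collapse tabs, line breaks, and runs of spaces into single spaces."""
    # str.split() with no arguments splits on any whitespace, including
    # newlines, tabs, and the full-width space, and drops empty pieces.
    return " ".join(text.split())

# One exported row: Word, Sentence, Cloze Sentence.
# The sentence fields still carry the return that followed the heading line.
row = [
    "走",
    "楔子 张天师祈禳瘟疫 洪太尉误走妖魔\n",
    "楔子 张天师祈禳瘟疫 洪太尉误[...]妖魔\n",
]

raw_line = "\t".join(row)
clean_line = "\t".join(clean_field(field) for field in row)

# A spreadsheet treats every physical line as a new row.
print(len(raw_line.splitlines()))
print(len(clean_line.splitlines()))
```

In the raw version, the return at the end of the Sentence field ends the row early, so the Cloze Sentence lands on a second row, which matches what showed up in Excel.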


Overall I think the program is pretty useful and seems good. These were just some comments that I had while extensively exploring the program for a full day or so. I will try some more extensive reading using the program in the next few weeks and I'll give an update if necessary.


Thanks

Kikosun


Thanks for the comprehensive review!
 

"It takes literally two button clicks to install on windows"

I put in a lot of effort to make it only take two clicks! Nice to see this appreciated.

"I think the only down side to doing it this way is that you have to do a manual search to add new cards to your Known category and manually export the known card list into Text Analyzer each time if you want to keep your known words updated."

 
Now that you have Chinese Text Analyser set up, going forward, the alternative is to export words from Chinese Text Analyser, and then import them into Pleco.  When you export, there is an option to automatically mark exported words as known (with the expectation that if you don't know them yet, you will after importing them to another program like Pleco) and this way the lists should stay relatively in sync.  Once every few weeks or months you could then do an export from Pleco just to catch any you had added outside of CTA.
 
 
"I imported my list of known words from Pleco, and it imported, but I would have liked to see a success message or something to let me know it worked ok. Rather, it just took me back to a blank screen, and I wasn't sure if anything had happened."
 
Added to my list of things to do.
 
 
"However, I did notice that if you have more than about 7 or so tabs open, you will be unable to maneuver to the tabs on the right, since there is no way to access tabs that don't fit on the screen."
 
Improving this is on my list of things to do.  In the meantime, you can use Ctrl-Tab and Ctrl-Shift-Tab to cycle between tabs (even those off-screen).
 
"I think it would be really helpful if you curated the available fonts to only those that display Chinese text."
 
Windows has a mechanism for doing this, however it can be a little too strict, and sometimes results in no fonts shown.  Fixing this up is on my list of things to do, but for the pre 1.0.0 releases I figured better to have too many fonts, than too few.  Remembering the last font used across sessions is also on my list of things to do.
 
 
"but it might be nice if you could include a more brush script-y type font."
 
Unfortunately, due to font licensing costs, including a nicer font is probably not really doable at this price point - at least not until there's a much higher volume of users.
 
 
"I'm not really sure how important the File Statistics are, but I guess it doesn't hurt to have them there."
 
Not really important, but possibly interesting for some people.
 
 
"I recommend you make a clearer division between the "Total" section and the "Unique" section"
 
Already on my list of things to do.
 
 
"One additional statistic that I think would be good to have is Number of Unknown words in a document."
 
Added to my list of things to do.
 
 
"I'm honestly not sure what "Cumulative % Frequency" means, and I was not able to figure it out."
 
It's similar to column 4 of the Jun Da frequency lists.  It's just the sum of all frequencies for that word and all words more frequent than it.
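As a rough illustration of that definition, the following Python sketch builds a small frequency table with a running cumulative percentage. It assumes the text is already segmented into a word list; `frequency_table` and the sample words are hypothetical, and how CTA itself orders words with equal counts is not stated here.

```python
from collections import Counter

def frequency_table(words):
    """Return rows of (word, count, percent, cumulative_percent), most frequent first."""
    counts = Counter(words)
    total = len(words)
    rows = []
    running = 0
    for word, count in counts.most_common():
        # Add this word's count to the counts of all more frequent words.
        running += count
        rows.append((word, count, 100 * count / total, 100 * running / total))
    return rows

# Hypothetical segmented text of ten words.
words = ["的", "的", "的", "的", "我", "我", "我", "是", "是", "书"]
table = frequency_table(words)
for word, count, pct, cum in table:
    print(f"{word}\t{count}\t{pct:.1f}\t{cum:.1f}")
```

Read this way, the cumulative column tells you what share of the running text you would cover by knowing that word and every word more frequent than it.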
 
 
"I'm not sure how helpful the "First Occurrence" column is either. I haven't determined a use for it."
 
First occurrence is actually really useful for prioritising words.  For example, you can export the top 100 unknown words by frequency, and then sort them by first occurrence.  That way you can learn the most frequent words in the order that they will appear in the text you are reading.
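A small Python sketch of that two-step ordering, under the assumption that the text is already segmented and that known words are held in a set. The function name `study_order` and the sample data are hypothetical, and the tie-breaking rule is a choice made for this example only.

```python
from collections import Counter

def study_order(words, known, top_n):
    """Pick the top_n most frequent unknown words, then order them by first appearance."""
    counts = Counter()
    first_seen = {}
    for position, word in enumerate(words):
        if word in known:
            continue
        counts[word] += 1
        # Record only the first position at which each unknown word appears.
        first_seen.setdefault(word, position)
    # Step 1: most frequent unknown words (ties go to the earlier word).
    top = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))[:top_n]
    # Step 2: re-sort that shortlist by where each word first occurs.
    return sorted(top, key=lambda w: first_seen[w])

# Hypothetical segmented text.
words = ["太尉", "我", "妖魔", "天师", "妖魔", "我", "天师", "瘟疫", "天师"]
known = {"我"}
order = study_order(words, known, top_n=2)
print(order)
```

Here 天师 is the most frequent unknown word, but 妖魔 comes first in the study list because it shows up earlier in the text.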
 
"I think an additional feature I would like to see would be a set of left right arrows so you can go to the next occurrence of the word of interest fairly easily if you are in a long document, and see each place the word is used in context."
 
Double click again, and it will take you to the next instance and so on.  If you have used the Edit->Find dialog, then 'F3', 'Ctrl-G' or 'n' will take you to the next occurrence of the word without needing to have the dialog open (no shortcuts for previous words yet, that is on my list of things to do).

"there is no Bookmark feature in the program."
 
Good idea.  Added to my list of things to do.


"It may also be helpful to have an option to record notes in certain sections of the books or mark up places that you had difficulty reading and may want to go back and re-read after studying the vocabulary in that section."
 
Added as a low-priority to do item.
 
 
"Now I'll review the Export settings. I have tried the File --> Export --> To File settings. I have never tested the To Email, because I think it works through Microsoft Outlook, and I do not use Outlook."
 
The program just opens your default email application as specified by the OS.  Actually this feature is not that useful because it only works with small amounts of data, and I may end up removing it altogether.


"I do not know what the Pre and Post mean and what that tab is meant to do."
 
A document has one or more paragraphs, a paragraph has one or more words.  Pre and Post allow you to add things before (Pre) or after (Post) each document, paragraph, and word during the export process.  So for example, you could set:
document pre: <html><head><title>CTA Export</title><meta charset="UTF-8"><style>.word:hover {color:red;}</style></head><body>
document post: </body></html>
paragraph pre: <p>
paragraph post: </p>
word pre: <span class="word">
word post: </span>
 
Then you'll get an html file split into paragraphs and words, with the color changing to red when you hover over a word.  Or you could just have them all empty, except have a single space for 'word post' and then you'd get a segmented file with words separated by spaces and so on.
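The Pre and Post idea can be mimicked with a short Python sketch. It only illustrates the wrapping logic described above and is not CTA's implementation: `export_document` and its parameter names are hypothetical, the input is assumed to be a list of paragraphs that are each a list of segmented words, and whether the real export keeps the original line breaks between paragraphs is not stated here (the sketch adds none unless you ask for them).

```python
def export_document(paragraphs, doc_pre="", doc_post="",
                    para_pre="", para_post="",
                    word_pre="", word_post=""):
    """Wrap each word, each paragraph, and the whole document in pre/post strings."""
    parts = [doc_pre]
    for paragraph in paragraphs:
        parts.append(para_pre)
        for word in paragraph:
            parts.append(word_pre + word + word_post)
        parts.append(para_post)
    parts.append(doc_post)
    return "".join(parts)

# Hypothetical document: two paragraphs, already segmented into words.
paragraphs = [["我", "喜欢", "看", "书"], ["你", "呢"]]

# HTML-style export, a shortened version of the settings listed above.
html = export_document(
    paragraphs,
    doc_pre="<body>", doc_post="</body>",
    para_pre="<p>", para_post="</p>",
    word_pre='<span class="word">', word_post="</span>",
)

# Space-separated segmentation, with a newline added after each paragraph.
spaced = export_document(paragraphs, para_post="\n", word_post=" ")

print(html)
print(spaced)
```

Leaving every setting empty reproduces the words back to back, which fits the observation in the review that blank fields exported the document with no changes.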
 
 
"I think it would be nice if there was a keyboard shortcut for marking words as known"
 
Unfortunately I can't know where your eyes are looking, and I don't want the reading process to require the user to keep going next, next, next, next with arrow keys or something.  What I will probably do is make double click toggle the known/unknown status.
 
"I'm not sure if Imron had in mind designing the Chinese Text Analyzer as just a tool to aid in picking which books/texts to read , or as a stand alone reader, or both"
 
The hint is in the name: Chinese Text Analyser, not Chinese Text Reader.  I also have plans for a separate reader program, but Chinese Text Analyser can work as a basic reader.  Actually the plan has always been to develop the reader, but the reader required a segmenting engine and so I wrote that first, and thought it was useful enough to release as a standalone program in the interim.


"I have to admit that I miss having a pop up dictionary feature."
 
Popup definitions will be in the next release (0.99.3), with looked up words automatically being marked as unknown.
 
 
"but as of now I'm finding it hard to give up the crutch"
 
It might be an indication that you need to look at easier texts.  One of the design goals of CTA was to make shortcomings in your ability obvious, rather than letting you gloss over them.  The logic being that by making such things obvious, you can know what you need to focus on to improve.
 
So if there is a point of pain, then it is a possible indication of something in your learning that you need to address.  CTA will not coddle you and is meant to give you an accurate view of your real ability.
 
 
"I personally think that the default should be set to Unknown instead of All"
 
Added to my list of things to do.  Note however that currently the program will save your last choices, so once you set the list as 'Unknown' it will remember this the next time you export.


"I'm not really sure what the difference between Word and Simplified is"
 
Word is the actual word as it appears in the text.  If your text is all in Simplified, then Word and Simplified will be the same.  If however your text was all in Traditional, then Word and Simplified would be Traditional and Simplified respectively.
 
 
"I'm not sure what dictionary is used,"
 
Currently CEDICT, but with the possibility of supporting other dictionaries later.
 
 
"some of the fields exported with the previous sentence's period preceding it."
 
This is a bug; it should not be doing this and should stop on full-stops and/or newlines.  Can you please provide me a complete paragraph of text with the problem (or send me the source text via email), and I'll look into it.
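The intended behaviour described here, stopping at full stops and newlines, can be sketched in Python as follows. This is a hedged illustration only, not CTA's code: `sentence_around`, `cloze`, and the choice of sentence-ending characters are assumptions, and the sample text combines the Water Margin heading line quoted earlier with two made-up sentences.

```python
# Characters treated as sentence boundaries (an assumption for this sketch).
SENTENCE_END = "。！？\n"

def sentence_around(text, index):
    """Return the sentence containing text[index], bounded by full stops or newlines."""
    start = index
    # Walk left until the previous character ends a sentence; that
    # character belongs to the previous sentence and is not included.
    while start > 0 and text[start - 1] not in SENTENCE_END:
        start -= 1
    end = index
    # Walk right until a sentence-ending character is found.
    while end < len(text) and text[end] not in SENTENCE_END:
        end += 1
    # Keep closing punctuation, but never a trailing newline.
    if end < len(text) and text[end] != "\n":
        end += 1
    return text[start:end].strip()

def cloze(sentence, word):
    """Blank out the first occurrence of word in the sentence."""
    return sentence.replace(word, "[...]", 1)

# Heading line from the source text, followed by two made-up sentences.
text = "楔子 张天师祈禳瘟疫 洪太尉误走妖魔\n我去学校。他走了。"

heading = sentence_around(text, text.find("走"))
later = sentence_around(text, text.rfind("走"))
print(heading)
print(later)
print(cloze(later, "走"))
```

With boundaries handled this way, the extracted sentence carries neither the previous sentence's period nor a trailing return, which are the two symptoms reported in the review.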
 
 
"You can see that the Cloze sentence got put on the second row."
 
Likewise a bug that I will look at if you send through example source text.

Thanks again for such detailed feedback. I'll try to address many of those issues in the next release. 


Thanks for responding to all my comments, Imron. 
I'm glad a lot of them have made it to your list of things to do. It seems like that list is getting really really long! 

I think I would most like to see the Bookmark feature implemented in a coming version. I'm sort of a slow reader right now, so I can only really read a few pages at a time in my short stories and such. 

 

Thanks also for explaining what the Pre and Post mean. It sounds like a pretty complicated but powerful feature. It might be nice if you had some examples of how to utilize it (like the one you gave in your post) on your website or on a help document or something. I think that this would be one of those features that gets underutilized if there isn't enough guidance on how to use it. 

 

I think your double click to mark words as known will be a good idea. I guess I'm used to keyboard shortcuts, but I see how that would be hard to implement. 

 

I'm curious what you plan to have in your future Chinese Text Reader project? Will it be significantly different from your Text Analyzer? I'm sure it's probably years off, but I was just wondering.

 

I'm emailing you separately with the complete source text that I used in my export tests. It was just a copy of Water Margin that I downloaded from somewhere. Maybe you can see if you can replicate the leading period bug, and the bug that put the cloze sentence on a second line, that I saw in some of the test cards.

 

Thanks