LaTeX word count: script and web-interface

Version 2.2 released, now with support for UTF-8, Chinese, and Japanese!

TeXcount is a Perl script for counting words in LaTeX documents. It parses any valid LaTeX document, interpreting the text as text words, headers, formulae (mathematics) and floats/begin-end groups.

To run the script, you can either download it and run it on your own computer, or you can use this web-interface.

Click for example!

View the Quick Reference Guide for an overview of options and TeXcount instructions, or the full Documentation for more details.

Examples, hints, tricks, and frequently asked questions are available in the TeXcount FAQ.

Download

Download TeXcount (version 2.2).

The documentation, included in the download package, explains how TeXcount works, including its strengths and shortcomings. It also explains the options and how you can add new macro handling rules.

The script requires Perl. This is standard, I think, under Unix/Linux. A Perl version for Windows users is ActivePerl, which requires only about 60 MB for a full installation, and much less if you drop documentation and other niceties.

What it does

The script may be run on one or more TeX/LaTeX files, although the online service only takes one file at a time. Elements counted separately are:

By using the options -v1, -v2, -v3 or -v4 (-v is equivalent to -v3), the LaTeX code is output with colour and style codes that show how the document has been interpreted, at increasing levels of detail: -v1 returns text words and marks formulae, -v2 also includes ignored text and macros, -v3 also includes comments and macro options, and -v4 also includes the internal state at the beginning of every parsed unit (word or group). The ANSI colour codes may not work under Windows: instead, output the details in HTML format using the -html option and view the result in a browser.

By default, macro options (\macro[...]) are ignored; for some macros, their parameters (\macro{...}) are also ignored or given special treatment. Some begin-end groups have been defined to be floats (not counted as text) or mathematics; others are defined to contain ordinary text to be counted.
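
As a rough illustration of these rules, consider the fragment below. The comments describe how I would expect each line to be treated under the default rules; the particular macros and environments are chosen for illustration only, and which rule applies to each is an assumption that you should verify with the -v output for your own document.

```latex
\documentclass{article}
\begin{document}

% A section title is expected to be counted as header words.
\section{Counting words}

Plain text like this is counted as text words,
and inline maths such as $a^2 + b^2 = c^2$ is counted as a formula.

% The option [p.~3] is ignored. For a macro such as \cite, the
% {...} parameter is presumably not counted as text either.
See the discussion in \cite[p.~3]{knuth}.

% Assumed to be defined as a float: the contents are not counted
% as text words, and the caption is tallied separately.
\begin{figure}
  \caption{A caption}
\end{figure}

% Assumed to be defined as mathematics (a displayed formula).
\begin{equation}
  E = mc^2
\end{equation}

\end{document}
```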

Be aware that TeXcount has some limitations. First, the LaTeX document needs to be syntactically valid, otherwise TeXcount may get very confused. Also, TeXcount relies on a set of macro handling rules that assume that macros take a given set of {}-enclosed parameters; these also allow []-enclosed options so long as these are not too long and take values that "look like option values". Macros that do not fit this format generally cannot be expected to be handled correctly. Check the detailed output to make sure TeXcount has done the right thing!
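
When a macro does not fit this format, one possible workaround is to exclude the affected region from the count using the %TC:ignore and %TC:endignore comment instructions. The sketch below assumes a hypothetical macro, \oddmacro, invented for this example; it takes a |...|-delimited argument rather than a {}-enclosed one. Check the detailed output to confirm that the region was actually skipped.

```latex
\documentclass{article}
% \oddmacro is hypothetical: its argument is delimited by |...|,
% which does not match the {}-enclosed parameter format assumed
% by the macro handling rules.
\def\oddmacro|#1|{\textbf{#1}}
\begin{document}

This sentence is counted as usual.

%TC:ignore
\oddmacro|text that would probably be misread|
%TC:endignore

Counting resumes here.

\end{document}
```

Because the instructions are ordinary LaTeX comments, they have no effect on the typeset document.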

Versions, corrections and known issues

To ensure that macros and environments you use are handled appropriately, I strongly recommend using the -v options and looking over the results.

Here are previous versions with an overview of changes made.

Version 2.2 updated Apr 30, 2009
Only minor corrections relative to 2.2.β: the support for Unicode, Chinese and Japanese seems to work well, and I have even received reports that it works well with other languages such as Greek and Hebrew.
Version 2.2.β updated Mar 17, 2009
The main change is the support for Unicode (UTF-8), Chinese and Japanese. Some problems have also been fixed: e.g. there was a problem with the %TC:endignore which could cause failure.
Version 2.1 updated Nov 02, 2008
In addition to the upgrades of 2.1α and 2.1β, there are a few minor fixes. Most importantly, the help has been improved with a presentation of the output style and colour codes added at the start of the output which explains briefly how to interpret the output. The code has also been refactored and cleaned somewhat.
Nov 08, 2008: The original Perl file had been saved in Windows format, which did not run under Linux without running dos2unix on it first. This has been fixed, and the default -sub also changed to subsection, which is what the documentation says it should be.
Version 2.1β updated Oct 30, 2008
Main change is that help on the colour codes of the verbose output is now provided: this will be added at the top, but may be suppressed using -nocode. Some minor fixes and improvements.
Version 2.1α updated Jul 09, 2008
The zip file contains the Perl script, documentation, and a quick reference manual. Some problems have been fixed: file type adding and path for included files should now be correct, and somehow the use of $$...$$ for displayed formulae had become conditionally broken. Options have been added to reduce the output from a complete summary per file to a brief summary (one line) per file or only a total summary. It is also now possible to get only the total sum, i.e. one number, which may be convenient when using it as input to other programs (e.g. Emacs). The total sum can be customized with respect to which words (text, header, caption) and formulae (inlined, displayed) are counted. The detailed output can also give the cumulative sum, and it is possible to get subcounts, e.g. per section. The rules for what TeXcount may consider a word or a macro option may also be relaxed, which may help users who, for example, have special characters or macros as parts of options, or use character-modifying macros (e.g. \"). This version has yet to be tested well, which is why it is labeled as an alpha version; it will be replaced by a β version when I expect most problems to be corrected, and by a final version when I decide it is ready to replace version 2.0. The documentation may also need some improvement and proof-reading.
Version 2.0, updated Feb 10, 2008: Some fixes
A few minor problems have been fixed. I have changed the name of the exclude rule to the more intuitive name macro. Also, the first line of the script used to give the path to the perl command, which is used for running the script under Linux/Unix. This has been changed to #!/usr/bin/env perl which should be more robust...provided you have env installed there. The documentation incorrectly claimed that begin-end groups are treated as floats by default, i.e. not counted as text: the default is in fact to count the contents as the surrounding text.
Version 2.0.beta, updated Jan 31, 2008: Major upgrade
This new version has some major improvements. First, adding new macro handling rules can now be done through comments in the tex files. There is also increased flexibility in specifying the handling rules, i.e. how to handle the different parameters. The preamble (between \documentclass and \begin{document}) is now handled properly: previously this was parsed much like the rest of the document. However, the script may still be used on LaTeX files that do not contain \documentclass and \begin{document}. The script may now also automatically count included documents, although this is turned off by default. When the script has been in use for some time and problems, if any, have been addressed, the final version 2.0 will be released here and on CTAN.
Version 1.9, version uploaded to CTAN, not released on this page: Major improvements
In addition to a number of minor fixes and improvements, this version was a large part of the step towards version 2.0: it added support for macros to be counted as words (\LaTeX is recognized as a word) and special handling of the preamble. On request, it was uploaded to CTAN while still awaiting the testing and final release of version 2.0.
Version 1.8, updated Jun 20, 2007: Minor improvements
First, I've removed the -T option from the script so it will run more easily from the command line under Windows: previously, the command perl -T needed to be used (at least in some places). I've added support for a few more macros. I've fixed the problem that options [...] were not allowed to contain special characters: if they did, they were not ignored as they normally should be. And finally, \$ was interpreted as starting inlined maths, which was wrong. If any of the fixes has caused new problems, please tell me!
Version 1.7, updated Jul 2, 2006: Bug-fix
When fixing the \{...\} bug last time, I had the foresight to include \[...\] in the fix. Bad idea as these indicate displayed maths!
Version 1.6, updated Mar 6, 2006: Locale and bug-fix
The locale on the online web service did not handle special letters like å; this is now fixed. Though it's now running on the Norwegian locale, this seems to handle not only the special Norwegian letters, but also the Swedish and German ones. Another problem I found was that \{ was interpreted as the start of a group, hence requiring a matching \}; this is also fixed.
Version 1.5, updated Jan 9, 2006: extensive upgrading
The list of macros and environments handled by the script has been extended: where it previously contained only the most used, I have now gone through a list of macros found in some documentation. Though it is most likely not complete, and there will be additional macros and environments declared in various packages, it should be a major improvement. Environments are no longer treated as floats by default: thus, an environment within the text will be treated as text, etc.

There is now an HTML mode included which will mark the interpretation using HTML code. Some improvements have been made in how mathematics within excluded (not counted) regions is marked. The HTML code may be produced only for the parsed text and the word count, or as a complete HTML document.

Version 1.3, updated Jan 2, 2006: some upgrading
It was pointed out to me that some special letters used in some languages would not be identified as letters and cause words to be split in two. I have also added a few more macros, but many users may find the need to add further macros, so I have added some more documentation in the Perl script. Another change made is that options on the form [...] after macros are now ignored not only immediately after the macro name, but also between and after tokens and {...} that have been specified not to be counted.
Version 1.2, updated Oct 10, 2005: one error corrected
This one's a bit embarrassing. The script would ignore everything between \begin{document} and \end{document}, treating the environment as a float. Anyway, it's fixed now.
Version 1.1, updated Sept 9, 2005: some errors corrected
Options containing commas were counted as text. Environments having a zero as last token failed to find the \end.
Last modified April 30, 2009.