LaTeX word count: script and web-interface
Version 2.2 has been released, now with support for UTF-8, Chinese, and Japanese!
TeXcount is a Perl script for counting words in LaTeX documents. It parses
any valid LaTeX document, interpreting the text as text words, headers, formulae
(mathematics) and floats/begin-end groups.
To run the script, you can either download it and run it on your own
computer, or you can use this web-interface.
View the Quick Reference Guide for an
overview of options and TeXcount instructions, or the full
Documentation for more details.
Examples, hints, tricks, and frequently asked questions are available in the
TeXcount FAQ.
Download
Download TeXcount
(version 2.2).
The documentation, included in the download package,
explains how TeXcount works, including its strengths
and shortcomings. It also explains the options and how you can add new macro handling
rules.
The script requires Perl. This is standard, I think, under Unix/Linux.
For Windows users, one Perl distribution is
ActivePerl, which only requires about 60 MB for a full installation, much
less if you drop documentation and other niceties.
What it does
The script may be run on one or more TeX/LaTeX files, although the online service only takes one file at a time. Elements
counted separately are:
- words in text
- words in headers (chapter, section, etc.)
- words in float captions, footnotes, etc.
- number of headers
- number of floats/figures/tables
- number of inlined formulae
- number of displayed formulae
By using the options
-v1,
-v2,
-v3 or
-v4
(plain
-v is equivalent to
-v3), the LaTeX code will be output
using colour and style codes to display
how the document has been interpreted at different levels
of detail: show text words and mark formulae; include ignored text and
macros; include comments and macro options; or include the internal state
at the beginning of every parsed unit (word or group). The ANSI colour codes may not
work under Windows: instead, output the details in HTML format using the
-html
option, and view the result in a browser.
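If you call the script from another program, the options above can be assembled as in the following minimal Python sketch. The script file name texcount.pl, the helper names, and the assumption that perl is on the search path are mine for illustration; only the -v1 to -v4 and -html options are taken from the text above.

```python
import subprocess

def texcount_command(tex_file, verbosity=3, html=False, script="texcount.pl"):
    """Build the command line for a verbose TeXcount run.

    `script` is an assumed file name; adjust it to wherever the
    downloaded Perl script actually lives.
    """
    if verbosity not in (1, 2, 3, 4):
        raise ValueError("verbosity must be 1, 2, 3 or 4")
    cmd = ["perl", script, f"-v{verbosity}"]
    if html:
        # HTML output avoids ANSI colour codes, e.g. under Windows.
        cmd.append("-html")
    cmd.append(tex_file)
    return cmd

def run_texcount(tex_file, **kwargs):
    """Run the script and return whatever it writes to standard output."""
    cmd = texcount_command(tex_file, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

print(texcount_command("paper.tex", verbosity=2, html=True))
# → ['perl', 'texcount.pl', '-v2', '-html', 'paper.tex']
```

With html=True, the text returned by run_texcount can be written to a file and opened in a browser.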
By default, macro options (\macro[...]) are ignored; for
some macros, their parameters (\macro{...}) are also ignored or given
special treatment. Some begin-end groups have been defined to be
floats (not counted as text) or mathematics, others are defined to contain
ordinary text to be counted.
Be aware that TeXcount has some limitations. First, the LaTeX document needs to
be a syntactically valid LaTeX document, otherwise TeXcount may get very confused.
Also, TeXcount relies on a set of macro handling rules that assume that macros take
a given set of {}-enclosed parameters; these also allow []-enclosed options so long as
these are not too long and take values that "look like option values". Macros that do not fit
this format generally cannot be expected to be handled correctly. Check the detailed output
to make sure TeXcount has done the right thing!
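To give a feel for what a macro handling rule amounts to, here is a toy Python sketch. It is not TeXcount's code and does not reproduce its rule set: the rule table IGNORED_PARAMS, the function name, and the regular expressions are assumptions made for illustration. Each rule says how many {}-enclosed parameters of a macro to skip; a []-enclosed option directly after a macro name is always dropped.

```python
import re

# Hypothetical rule table: macro name -> number of {...} parameters to skip.
IGNORED_PARAMS = {"label": 1, "cite": 1, "includegraphics": 1}

# An optional [...] block directly after a macro name.
OPTION = r"(?:\s*\[[^\]]*\])?"
# Toy word pattern: a run of ASCII letters, digits or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")

def toy_text_words(text):
    """Count words after applying the toy macro rules to a LaTeX fragment."""
    for name, n in IGNORED_PARAMS.items():
        params = r"(?:\s*\{[^{}]*\})" + "{" + str(n) + "}"
        pattern = r"\\" + name + r"(?![A-Za-z])" + OPTION + params
        # Drop the macro, its option and its skipped parameters.
        text = re.sub(pattern, " ", text)
    # Any other macro: drop the name and option, keep its {...} content as text.
    text = re.sub(r"\\[A-Za-z]+" + OPTION, " ", text)
    text = text.replace("{", " ").replace("}", " ")
    return len(WORD.findall(text))

print(toy_text_words(r"See \cite[p.~3]{knuth84} for \emph{more} detail"))  # → 4
print(toy_text_words(r"\includegraphics[width=5cm]{plot} Results"))        # → 1
```

A macro whose arguments do not follow this {}/[] pattern falls straight through such rules, which is the kind of case where the detailed output is worth checking.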
Versions, corrections and known issues
- Known issues:
- Though most commonly used macros should be handled, there may be some that
are not. This is particularly true for macros coming from various packages.
- Numbers in the text are counted as words. This may not always be
appropriate. Note that while numbers are counted as words, numbers enclosed
by $...$ are considered inlined formulae and not counted as words.
- If the document contains non-Latin letters, TeXcount may have difficulty recognizing
these as letters. Also, letter modifiers such as \", which may be used with some languages,
will be interpreted as macros and will cause words containing them to be split in two, giving
rise to exaggerated word counts. However, using the option -relaxed may help TeXcount
recognize some letter modifiers.
- Nested macros and begin-end groups may not behave as desired, so users should
check the detailed output.
- Complex macros may confuse TeXcount. This is particularly true of [...]-options that have
unexpected or unexpectedly long values, although the -relaxed option may help
TeXcount handle these better.
To ensure that macros and environments you use are handled
appropriately, I strongly recommend using the -v options
and looking over the results.
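Two of the issues above, numbers counted as words and words split by letter modifiers, can be reproduced with a toy counter. The Python sketch below is only an illustration under simplifying assumptions (ASCII-only word pattern, no nesting, names chosen by me); it is not how TeXcount itself is implemented.

```python
import re

# Toy word pattern: a run of ASCII letters, digits or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")
# Inline mathematics: $...$ where the opening dollar is not escaped.
INLINE_MATH = re.compile(r"(?<!\\)\$[^$]*\$")
# A control word such as \section, or a control symbol such as \" or \$.
MACRO = re.compile(r"\\(?:[A-Za-z]+|.)")

def toy_count(text):
    """Return (words, inline_formulae) for a fragment of LaTeX text."""
    formulae = len(INLINE_MATH.findall(text))
    text = INLINE_MATH.sub(" ", text)   # formulae are not counted as words
    text = MACRO.sub(" ", text)         # every macro acts as a word break
    return len(WORD.findall(text)), formulae

print(toy_count("We ran 3 trials"))    # → (4, 0): the number is a word
print(toy_count("We ran $3$ trials"))  # → (3, 1): the number is a formula
print(toy_count(r'na\"ive'))           # → (2, 0): the modifier splits the word
```

The last line shows why a document with many modified letters can end up with an inflated count.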
Here are previous versions with an overview of changes made.
- Version 2.2
updated Apr 30, 2009
- Only minor corrections relative to 2.2.β: the Unicode, Chinese and Japanese support seems to work well,
and I have even received reports that it works well with other languages such as Greek and Hebrew.
- Version 2.2.β
updated Mar 17, 2009
- The main change is the support for Unicode (UTF-8), Chinese and Japanese. Some problems have also
been fixed: e.g. there was a problem with the %TC:endignore which could cause failure.
- Version 2.1
updated Nov 02, 2008
- In addition to the upgrades of 2.1α and 2.1β, there are a few minor fixes. Most
importantly, the help has been improved with a presentation of the output style and colour codes added
at the start of the output which explains briefly how to interpret the output. The code has also been
refactored and cleaned somewhat.
- Version 2.1β
updated Oct 30, 2008
- Main change is that help on the colour codes of the verbose output is now provided: this will be
added at the top, but may be suppressed using -nocode. Some minor fixes and improvements.
- Version 2.1α
updated Jul 09, 2008
- The zip file contains the Perl script, documentation, and a quick reference manual. Some problems
have been fixed: file type adding and path for included files should now be correct, and somehow the
use of $$...$$ for displayed formulae had become conditionally broken. Options have been added to
reduce the output from a complete summary per file to a brief (one-line) summary per file or to a total summary only.
It is also now possible to get only the total sum, i.e. one number, which may be convenient when using it as
input to other programs (e.g. Emacs). The total sum can be customized with respect to which words
(text, header, caption) and formulae (inlined, displayed) are counted. The detailed output can also give
the cumulative sum, and it is possible to get subcounts, e.g. per section. The rules for what TeXcount may
consider a word or a macro option may also be relaxed, which may help users who, for example, have special
characters or macros as parts of options, or use character-modifying macros (e.g. \"). This
version has yet to be tested thoroughly, which is why it is labeled as an alpha version; it will be replaced
by a β version when I expect most problems to be corrected, and by a final version when I decide
it is ready to replace version 2.0. The documentation
may also need some improvement and proof-reading.
- Version 2.0,
updated Feb 10, 2008: Some fixes
- A few minor problems have been fixed. I have changed the name of the
exclude rule to the more intuitive name macro. Also, the first line of the
script used to give the path to the perl command, which is used for running the
script under Linux/Unix. This has been changed to
#!/usr/bin/env perl which should be more robust...provided you have env
installed there. The documentation has been claiming incorrectly that the default
treatment of begin-end groups has been to treat them as floats, i.e. not count them as text,
but this has been wrong: the default is to count the contents as the surrounding text.
- Version 2.0.beta,
updated Jan 31, 2008: Major upgrade
- This new version has some major improvements. First, adding new macro
handling rules can now be done through comments in the tex files. There is also
increased flexibility in specifying the handling rules: i.e. how to handle the different
parameters. The preamble (between \documentclass and
\begin{document}) is now handled properly: previously this was parsed much
like the rest of the document. However, the script may still be used on LaTeX files
that do not contain \documentclass and \begin{document}. The script
may now also automatically count included documents, although this is turned off by
default. When the script has been in use for some time and problems, if any,
have been addressed, the final version 2.0 will be released here and on
CTAN.
- Version 1.9,
version uploaded to CTAN, not released on this page: Major improvements
- In addition to a number of minor fixes and improvements, this version was a
large part of the step towards version 2.0: it added support for macros to be
counted as words (\LaTeX is recognized as a word) and special
handling of the preamble. On request, it was uploaded to
CTAN while still
awaiting the testing and final release of version 2.0.
- Version 1.8,
updated Jun 20, 2007: Minor improvements
- First, I've removed the -T option from the script so it will run
more easily from the command line under Windows: previously, the command
perl -T needed to be used (at least in some places). I've added support
for a few more macros. I've fixed the problem that options [...]
were not allowed to contain special characters: if they did, they were not
ignored as they normally should be. And finally, \$ was interpreted
as starting inlined maths, which was wrong. If any of the fixes has caused
new problems, please tell me!
- Version 1.7, updated Jul 2, 2006: Bug-fix
- When fixing the \{...\} bug last time, I had the foresight to
include \[...\] in the fix. Bad idea as these indicate displayed maths!
- Version 1.6, updated Mar 6, 2006: Locale and bug-fix
- The locale on the online web service did not handle special letters like
å; this is now fixed. Though it's now running on the Norwegian locale, this
seems to handle not only the special Norwegian
letters, but also the Swedish and German ones.
Another problem I found was that
\{ was interpreted as start of a group, hence requiring a matching
\}; this is also fixed.
- Version 1.5, updated Jan 9, 2006: extensive upgrading
- The list of macros and environments handled by the script has been
extended: from containing only the most used, I have gone through a list of
macros found in some documentation. Though it is most likely not complete, and
there will be additional macros and environments declared in various
packages, it should be a major improvement.
Environments are no longer treated as floats by default: thus, an environment within the text
will be treated as text, etc.
There is now an HTML mode included which will mark the interpretation using
HTML code. Some improvements have been made in how mathematics within excluded (not counted)
regions is marked. The HTML code may be produced only for the parsed text and
the word count, or as a complete HTML document.
- Version 1.3, updated Jan 2, 2006: some upgrading
- It was pointed out to me that some special letters used in some languages
would not be identified as letters and cause words to be split in two. I have
also added a few more macros, but many users may find the need to
add further macros, so I have added some more documentation in the Perl
script. Another change made is that options of the form [...] after macros
are now ignored not only immediately after the macro name, but also between
and after tokens and {...} that have been specified not to be counted.
- Version 1.2, updated Oct 10, 2005: one error corrected
- This one's a bit embarrassing. The script would ignore everything
between \begin{document} and \end{document} treating the environment
as a float. Anyway, it's fixed now.
- Version 1.1, updated Sept 9, 2005: some errors corrected
- Options containing commas were counted as text. Environments
having a zero as last token failed to find the \end.