A while back, I posted some TextWrangler scripts that allow for commenting and uncommenting blocks of code for codeless language modules such as Stata, R and LaTeX. The problem with my scripting approach was that it required two different scripts—one for commenting, one for uncommenting.
Steven Samuels has generously corrected this problem by writing one script that handles both functions — based on whether the line is already commented, running the script will either remove the existing comment or comment out the (currently uncommented) selection.
Download the new and improved script here. Visit this page for information on how to assign a keyboard shortcut to the script.
Note: The old script is still available. The only reason I can think of to use the old version is in case you want to apply multiple comments to the same line (for example, if you use a comment to create a heading in your code, then you comment out the entire section).
Many thanks to Steve for supplying the script!
Alex Yuffa has created an excellent (and beautiful) set of LaTeX tutorials: LaTeX 101, which introduces LaTeX, and LaTeX 102, which introduces advanced material through a series of exercises. The first document is based on my tutorial LaTeX: From Beginner to TeXPert, but Alex’s contains more examples and detail. If you’re new to LaTeX, these tutorials are a great place to start.
The word is more or less official that OOXML, the specification that Microsoft Office 2007 uses to write documents, has been approved as a standard by the ISO (the same body that approved ODF, the specification used by OpenOffice). The standardization process was very controversial, largely because (i) many felt that Microsoft was bucking the already approved standard (ODF) in order to compete with OpenOffice and other products, and, to a lesser extent, (ii) some felt that the format was unsatisfactory, in some cases because of its complexity. That being said, Microsoft and ECMA (another standards organization) worked closely to bring the OOXML specification up to the standards of the ISO, and Microsoft has made a pledge to work towards interoperability, adding the OOXML specifications to their open-specification promise.
What does standardization mean for you?
- For most people, it means (i) that it will be easier to share documents between applications, and (ii) that you can create OOXML documents without worrying that you will not be able to access them in the future.
- For people that prefer ODF to OOXML, it means that you will not be able to count on OpenOffice overtaking Microsoft Office on standards-compliance grounds alone (though price and performance competition, as always, will remain fair game).
- For people that write mathematical documents, it means that you can add Microsoft Word to the list of applications/formats that make it easy to create such documents and save them in a standardized, interoperable format [i.e., LaTeX (and LyX, Scientific Word, etc.) and ODF (and OpenOffice, KOffice, etc.)].
I hope I don’t sound like a Microsoft shill — as it happens, I just really like the way Office 2007 handles math; it combines LaTeX’s powerful ability to encode math with the convenience of WYSIWYG editing. And now you get that combination with the comfort of knowing that your documents are saved in an industry-standard format (at least, you will once Microsoft rolls out the next update to Office).
USBTeX is a complete, portable LaTeX environment for Windows. It includes portable versions of MikTeX, TeXMaker, Ghostscript, Ghostview, and Sumatra PDF. As the name suggests, you can install it on a USB drive and run it on any Windows machine (provided that you have a decent USB drive). Since LaTeX tends to be “special interest” software, USBTeX could be a very convenient resource.
If this interests you, you may also be interested in using an Online LaTeX Compiler.
Hopefully this post will be preaching to the choir, but lately I’ve been noticing that many students, programmers, professors, etc. write their statistical code as though there were no such thing as missing data. Unless your data have been cleaned and missing values imputed (I think the Public Use Census Data are like this), overlooking missing data is a mistake that could lead to, inter alia, biased estimates that act like measurement error.
The most common error that I see occurs in the creation of new variables. Variables are created according to rules regarding other variables (e.g., if x>5, set y=1). But what happens if x is a missing value? Do you know how your statistical software treats missing values? I’m going to go through this example using Stata, but the same logic applies to any programming environment.
I have created a random variable c that takes 3 values (1, 2 and 3). There are 3 missing observations for the variable c. I want to generate a new variable that takes the form
NewVar = 1 if c ∈ {1,2}, and NewVar = 0 if c = 3.
Consider the following two approaches to creating this variable:
Approach 1: Initialize everything to zero
. gen x=0
. replace x=1 if (c==1 | c==2)
(9 real changes made)
Approach 2: Initialize everything to missing
. gen y=.
(25 missing values generated)
. replace y=1 if (c==1 | c==2)
(9 real changes made)
. replace y=0 if c==3
(13 real changes made)
Results
The variable x, created using approach 1, has too many observations. There should be 22, but there are 25:
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           x |        25         .36    .4898979          0          1
           y |        22    .4090909    .5032363          0          1
The variable y, on the other hand, has the correct number. The difference arises because what I want to do is define NewVar to be 0 whenever c is 3. Approach one defines NewVar to be 0 whenever c is not in {1,2}. This is a big difference from an econometric/statistical perspective. If we define NewVar to be 0 when we do not know the value of c, we are essentially imputing values. These imputed values will reduce the accuracy of estimates and future calculations.
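To make the difference concrete, here is a sketch of the two approaches written out as definitions. The subscripted symbols x_i, y_i and c_i are just my shorthand for the Stata variables x, y and c, and the counts are taken from the output above. The cases environment and \text assume the amsmath package is loaded.

```latex
% Approach 1: anything that is not 1 or 2 becomes 0, including missing c
\[
x_i =
\begin{cases}
  1 & \text{if } c_i \in \{1, 2\} \\
  0 & \text{otherwise (including missing } c_i \text{)}
\end{cases}
\]

% Approach 2: 0 is assigned only when c is known to be 3
\[
y_i =
\begin{cases}
  1 & \text{if } c_i \in \{1, 2\} \\
  0 & \text{if } c_i = 3 \\
  \text{missing} & \text{if } c_i \text{ is missing}
\end{cases}
\]

% Both variables have 9 ones, but the denominators differ (25 versus 22)
\[
\bar{x} = \frac{9}{25} = 0.36,
\qquad
\bar{y} = \frac{9}{22} \approx 0.409
\]
```

The three observations with missing c are counted as zeros in x, which is what pulls its mean below the mean of y.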
A note about missing values in Stata
Stata saves missing values as very large numbers. This means that if you used the command
replace y=0 if c>2
Stata would set y=0 for observations with c=3 as well as observations with c=. (missing). In this case, you could just use the command in Approach 2 above. For continuous variables, however, this is not an option. The optimal workaround, in my opinion, is to use the following:
replace y=0 if c>2 & !missing(c)
This would set y=0 for all observations where c is not missing and the value of c is greater than two.
The American Institutes for Research offers a free application called AM Statistical Software (Windows only). The application is still in early development, but it already offers several modeling and replication procedures, including OLS, logit, probit and marginal maximum likelihood procedures. Most impressive, however, is the application’s optional Data Transfer Component, which can import and export SAS, Stata, SPSS, and many other data formats. I was able to use the application to read an SPSS file that the foreign package for R couldn’t handle. AMSS is worth a try, even if all you need to do is convert a dataset.
The only trouble I had installing and running AM Statistical Software was that the help system didn’t install. [Update: Earlier I wrote that the UI was buggy, but I now think that was due to Vista, not AMSS.]
Roger Koenker, a quantile regression crusader, has an R package that implements the procedure. It is called quantreg, and it is documented here. This package has apparently been around for quite some time, but I was only recently turned on to quantile regression, so it was under my radar.
Rattle is a graphical interface to R. It supports basic data management tasks, as well as a number of different modeling functions. I haven’t had a chance to test it out yet since I don’t have my R installation completely up to date, but it looks promising. For all of R’s strengths, one weakness is the inconsistency of the syntax from command to command and package to package (not that commercial applications don’t have the same problem).
Also see RCommander.
The STIX mathematical fonts are available as a public beta. From the website:
The mission of the Scientific and Technical Information Exchange (STIX) font creation project is the preparation of a comprehensive set of fonts that serve the scientific and engineering community in the process from manuscript creation through final publication, both in electronic and print formats. Toward this purpose, the STIX fonts will be made available, under royalty-free license, to anyone, including publishers, software developers, scientists, students, and the general public.
I’m not sure if these work on all platforms, but they are TrueType fonts.
These fonts can be used to augment the symbols available in the OpenOffice.org equation editor (see this tutorial for more).
The plm package for R lets you run a number of common panel data models, including
- The fixed effects (or within) estimator
- The random effects GLS estimator
It also allows for general GLS estimation, as well as GMM estimation, and includes a feature for heteroscedasticity consistent covariance estimation.
It’s very easy to use. It simply requires a special type of data frame that specifies which variable identifies the individual and which identifies the time period (this is done using the pdata.frame command). Once this is done, you can estimate models using the plm command and its options.
See the documentation (PDF) for more.
Scientific Word is a commercial application that allows you to create LaTeX files using a graphical interface. For those that don’t enjoy writing plain LaTeX, it’s an indispensable application. Unfortunately, its table editing capabilities are somewhat cumbersome — you can’t simply cut and paste from Excel like you can if you are creating a document in Word.
However, Scientific Word allows you to import RTF files into your LaTeX document. Theoretically, the process goes like this:
- Create your table in Excel
- Paste it into Word and save your document as an RTF file
- Import the RTF file into your Scientific Word document to access your table
Unfortunately, with newer versions of Word, this process doesn’t seem to work. I suspect that this is because the RTF specification that newer versions of Word use has changed since the RTF2LaTeX utility that Scientific Word uses under the hood was written.
However, I have discovered that if you use WordPad, the basic text editor that comes with every version of Windows, to save your RTF file, table import is much more successful (probably because WordPad uses the same RTF specification as RTF2LaTeX).
An additional tip: once a table has been imported, rather than the cell contents, you will see the cell widths that the RTF file was using:

However, if you select the table, right click on it, and select Properties > Column Width > Use Automatic Width, you will be able to see the cell contents (as well as some LaTeX alignment commands):

Quick-R, by Robert Kabacoff, is a wonderful R introduction site. It covers data management, basic and advanced statistics, and graphing in R, and it is aimed at an audience that has previous experience using other packages (such as SAS or Stata) that work differently than R.
Many of the convenient syntax shortcuts for Word 2007’s equation editor are, to my knowledge, completely undocumented. A discussion of most of the commands for different symbols and operators is available here. In addition, I have compiled a list of shortcuts that I have discovered mainly through trial and error.
Update: While the symbol shortcuts aren’t documented per se, you can access them (and add new ones) by starting a new equation (alt+=), selecting Tools under the Equation Tools/Design menu (top left), and clicking on Math AutoCorrect. Murray Sargent’s Math In Office blog has some tips on adding new symbols.
Gerhard Riener has a very comprehensive set of links regarding getting your Stata output into LaTeX. If you’ve ever written an empirical paper, you know that formatting the tables is the most mind-numbing part of the research/writing process. These tools can make that job almost automatic.
German Rodriguez has a very nice Stata tutorial online. It’s one of the few unofficial guides to Stata that I’ve seen that includes more than a few sentences about macros and programming.
OpenOffice.org, StarOffice, its commercial sister product (which is free for researchers), and NeoOffice, the current native OS X version, all come with an equation editor which uses a TeX-like syntax to create formulas. The editor is easy to use and can typeset most formulas (read the Formula command howto for advanced use). There are, however, some symbols that aren’t included in the default editor. In this tutorial, I show how to easily add these symbols, making the equation editor more robust.
Step 1. Get the fonts. MIT offers a freely available set of math fonts for Windows, Mac and Linux. Installation differs depending on your OS, but there are instructions on the download page. For Windows, there is a simple two-click installer.
Step 2. Open the equation editor. The easiest way to do this is to open a Writer document and select Insert: Object: Formula.
Step 3. Open the symbol catalog. Just go to Tools: Catalog. The Catalog has a dropdown menu for the symbol set to which you want to add your symbol. Most of the Greek characters are already there, so you’ll probably want to select Special. Click edit to launch the edit symbols dialog.
Step 4. Add the symbol.
(a) In the “Edit symbols” dialog, delete the entry under “Old symbol”
(b) In the symbol field, type the command that you want to use to refer to the symbol. For example, say you want to use the binary order symbol ≻. If you type “succ” then you will be able to access the symbol using “%succ” in the equation editor.
(c) Select the font that you want to use under the font dropdown menu. The MIT math fonts are named Math1 through Math5. You will need to browse through the catalog to find your symbol.
(d) Select the symbol that you want to add and click add.
Note: If the modify button is not grayed out, you forgot to remove the old symbol. Clicking add will replace the currently selected old symbol with the new symbol, which might not be what you want.
Now you’re done. You can use the symbol in all of your documents. It may not show up correctly if you edit your document on another computer, though, so you may want to use PDF files to share your document with others.
The symbol editor dialog on Windows
If you have read much of this blog, you know that I’m mildly obsessed with mathematical typesetting. I don’t know why; I think it’s a combination of (a) having worked in desktop publishing, (b) using math every day, (c) liking LaTeX, but (d) thinking that typing plain markup of any kind is inefficient. So when I found out that I could get a copy of Office 2007 from my school, even though I’d rather that everyone just used OpenOffice or LyX, I had to try it out. Here are my thoughts:
- Word finally respects math. Equations are no longer just pictures. They are parts of the text that can wrap within paragraphs, can have their fonts/sizes changed easily, etc. I always thought it was strange that math was neglected before, since programmers deal with a lot of math.
- There is similarity to TeX. The linear format is essentially an easier-to-read form of TeX that uses Unicode characters for symbols instead of TeX commands like “\beta.” Some things are different: for example, “(x)/(y)” produces the stacked fraction x over y, where TeX would use “\frac{x}{y}.” It’s obvious that some Microsoft programmers wanted to show TeX some love, since Vista apparently has changes specifically for MikTeX.
- However, almost all of the TeX-like linear input format is undocumented, at least from within the program. I wouldn’t have known about the format if I hadn’t seen this blog post with examples of the linear format. This document more thoroughly documents the linear input method. Unfortunately, casual users, or even not-so-casual ones (absent minded professors, say) may never find out about linear, and will be stuck pointing-and-clicking.
- One thing that really annoys me: In LaTeX, LyX, OpenOffice, etc. you type “hat” then “x” to get x̂. In Word 2007, you type “x” then “hat.” This actually makes more sense, but I’ve trained my brain to put the hat first. Also, this easy method of putting accents on characters is undocumented to my knowledge.
- You can’t use the new equation editor in PowerPoint. It makes no sense — I hate making PowerPoint (and Impress for that matter) presentations with any math at all because you have to manually align all of the characters. However, you could just make an outline in Word 2007, change the orientation to landscape and increase the font for some quick slides.
- If you save to the Word 2003 (or before) format, the equations become images — they aren’t converted to MathType/MS Equation Editor. This probably isn’t a huge deal, but it could be very annoying. Design Science, makers of MathType and the old Microsoft Equation Editor have instructions for creating a quick-access shortcut to the old editor, which is still included with Office 2007.
- I crashed it while playing with the new equation editor. The document was recovered, but for all the times that I complained about older versions of LyX crashing on me, I always had to work harder than this to get LyX to sputter out.
A word on the .docx file format: There’s a little fight going on between Microsoft and the ODF alliance (which includes Sun, IBM, and Adobe, among others) over formatting issues. Everybody wants an open format, but Microsoft won’t use ODF — they want OOXML, the format used in Office 2007. Right now, we’re juggling ODT, DOCX and the old DOC formats. Ultimately, Microsoft’s market power will probably make OOXML the standard, and OpenOffice, AbiWord, KOffice, etc. will support it as best they can. I think that having a couple of open standards will probably be a good thing, and hopefully having open specifications will make it easier for programmers to write good converters, so formatting won’t be much of an issue, and Word, Writer, KWord, AbiWord, etc. documents will be more interoperable.
The new LyX
Earlier this month, LyX version 1.5 was released. This release included a number of significant changes to LyX, both cosmetic and under the hood. The key changes include:
- An outliner
- Tabs for different documents
- A new math panel that automatically turns on when in math mode and has floating menus which you can leave docked
- Unicode support
- Source view
- QT4 under the hood (which looks much better on Windows at least)
You can see the entire release notes file here. With these changes, LyX seems (to me at least) much more stable. Even after using it to import and edit a long document created in Scientific Word, it hasn’t crashed on me. I’m finally ready to use LyX as my primary typesetting application (at least until I find some bug that breaks the deal). Since fate ate my iBook, I haven’t been able to test the OS X version, so I can’t speak to its stability.
Why use LyX?
I’ve written about LyX before, but the program is very innovative and deserves a little more publicity. LyX is a document preparation system that uses LaTeX to produce high-quality documents. It is essentially a visual frontend for LaTeX, allowing users to create LaTeX documents without coding in the LaTeX markup language. Some people don’t mind writing LaTeX documents. Others, myself included, find it distracting. With LyX, you get the power of LaTeX coupled with the visual editing power of traditional word processors. You also get documents that can be saved in the LaTeX format — an open, human readable standard — so you don’t have to worry about proprietary binary file formats eating your intellectual property. Oh, and you get it all for free.
LyX is essentially the open-source version of Scientific Word, the premier LaTeX-based scientific word processor. Unlike Scientific Word, however, LyX features:
- Access to many document classes including FoilTex, PowerDot, Beamer, etc.
- Changeable document classes — change from slides to article, for example, to make a quick handout
- Revision tools such as track changes
- Tables that can be pasted from spreadsheet applications like Excel
- Integration with open-source software such as Maxima and Octave
- Easily nestable environments (lists with lists, etc.)
- Math input using LaTeX syntax (\beta) and keyboard shortcuts (M-m g b)
Some quick LyX tips
In math mode, type “\bmatrix” to enter an amsmath bracketed matrix. Then use “M-m w i” to insert a row and “M-m c i” to insert a column (the M stands for “Meta” which is “alt” on Windows). As far as I know, this shortcut is undocumented.
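For reference, the bracketed matrix that LyX builds corresponds to the amsmath bmatrix environment in plain LaTeX. Here is a minimal sketch; the entries a through d are placeholders of my own.

```latex
\documentclass{article}
\usepackage{amsmath} % provides the bmatrix environment

\begin{document}
\[
\begin{bmatrix}
  a & b \\
  c & d
\end{bmatrix}
\]
\end{document}
```

In the source, each & separates two columns and each \\ ends a row, so adding a column or a row in LyX amounts to adding one more of those separators.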
Keyboard shortcut lists can be found here. The shortcuts in LyX are fully customizable, and pre-made customizations for specific purposes (for example, compatibility with Scientific Word) come with the download. But as long as you know LaTeX, you can get away with only knowing three shortcuts:
- Ctrl+m inserts inline mathematics
- Ctrl+M inserts displayed math (and Ctrl+Shift+Enter inserts a multiline equation)
- Alt+Space moves the focus to the environment drop down menu, from which you can type the first few letters of any environment, for example Section or Itemize, to change the current paragraph to that environment
Getting LyX
LyX can be downloaded for Linux, Windows and OS X here.
Update: 24 August 07. I have encountered a spell-checking related crasher (Windows) that makes me hesitant to use LyX for important documents. Stability is still much better, though. Hopefully in a few iterations it will get there — I’d really like to have LyX as a cross-platform, open-source graphical LaTeX frontend.
Update: 12 November 07. LyX 1.5.2 is out. So far, no crashers. Looks promising.
I have found a number of relatively complete statistics and econometrics textbooks that are freely available online. The ones that I’m posting here are more relevant for applied work, though books concerning advanced probability topics are also available.
For introduction or a brush-up, try:
- A computational approach to statistics (PDF), by Klotz, for basic statistics
- Econometrics, by Hansen
- General linear models, by Rodriguez
- Multilevel statistical models (PDF), by Goldstein
- Exploratory factor analysis, by Tucker and MacCallum
- Discrete choice methods with simulation, by Train
I recently lost my documentation folder (oops), so I had to go online and retrieve the documentation files and tutorials that I find indispensable for working. I decided I’d save myself and everyone else the trouble by posting the list here. All of the files are available in PDF format.
- All R manuals
- Scilab documentation. Currently the “old” documentation is available in PDF format
- AMS-LaTeX. The amsmath and amscls packages, plus a guide to typesetting math (including the amslatex functionality)
- ltxprimer. A very modern (uses common packages, tricks, etc.) tutorial on basic-to-intermediate LaTeX (by the Indian TeX Users Group)
- Intro to Stata 8, by Svend Juul
- Matlab documentation. Basic guides + toolboxes. Click the topic you want and scroll down to find the PDF files
- gretl manual. This actually comes automatically when you install gretl, but it’s comprehensive and worth including
One of the nicest things about LaTeX is it can be installed anywhere. Write on your home Mac and you can still work with your document on a Linux box at work. But what about those times when you’re working on a computer without LaTeX? No problem. There are several sites offering online LaTeX compilation.
And if the links above don’t work, try a search — new compilers keep cropping up on the web.
This is a sad day for Dataninja. I’m sorry to say that my iBook has had a hard drive failure and I probably will not be able to have it replaced for some time. As a result, I will not be able to continue providing updates and support for the many OS X related resources available on this site, at least until I can afford to spend some money on fixing an already-aging laptop.
I believe that I have most of the files on this site, including the scripts and source files for most of the documents, backed up, so nothing is lost and nothing will go away. I also have a nice Windows laptop that I use for school, so I will continue to post tips regarding relevant applications, most of which are cross-platform anyway (Stata, R, Octave, LaTeX, etc.).
I’ve always been proud that Dataninja was one of the few sites that treats OS X like a first-class research platform, and I hope I’m able to maintain that focus. Unfortunately, it will probably be a while until I can dedicate the appropriate resources. Until then, I hope all of the Mac users out there will keep using OS X for their research, and if you find new resources along the way, do let me know, because I don’t want the Mac part of dataninja to grow stale.
Thanks everyone,
John
LaTeX Tutorials: A Primer, by the Indian TeX Users Group, is an excellent advanced LaTeX book covering, among other topics, many of the flexibilities and shortcuts provided by the AMSLaTeX package. Reading it in conjunction with The Not-So-Short Introduction to LaTeX might just teach you everything you need to know about LaTeX.
The best books and articles set matrices and vectors in bold to distinguish them from scalars. It isn’t uncommon for informal articles, texts and notes to eschew this practice since (1) doing so in LaTeX requires repeated use of \begin{boldmath} … \end{boldmath} tags and (2) by default, LaTeX won’t make Greek letters bold. However, the bm package makes it easier to use this helpful convention.
To use it, first invoke it in the preamble:
\usepackage{bm}
Then to use bold math in a document, simply type:
\bm{ _your_math_here_}
Note that this command will make non-variables bold (e.g., brackets, equals signs, etc).
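Here is a minimal sketch of both behaviors. The regression equation is only a placeholder I picked for illustration.

```latex
\documentclass{article}
\usepackage{bm} % provides the \bm command

\begin{document}

% Bold only the vector and matrix symbols, Greek letters included
\[
\bm{y} = \bm{X}\bm{\beta} + \bm{\varepsilon}
\]

% Wrapping the whole expression also makes the equals and plus signs bold
\[
\bm{y = X\beta + \varepsilon}
\]

\end{document}
```

Wrapping individual symbols takes more typing, but it keeps the operators at their normal weight.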
The natbib package for LaTeX allows you to use citations of the form
some text here [Author, 05].
instead of
some text here [1]
It comes preinstalled with most distributions, but it can be downloaded from the above link (documentation is also available from that link).
To use the package, include in your document’s preamble
\usepackage{natbib}
Then in the beginning of your document (after \begin{document}), use
\bibliographystyle{plainnat}
(or \bibliographystyle{abbrevnat} if you prefer the abbreviated style).
To insert a citation of the form (Author, 2005), use the command
\citep{reference_key}
and to insert a citation of the form “Author (2005),” use the command
\citet{reference_key}
where reference_key is the label for the desired author in the BibTeX database.
Finally, to create the bibliography at the end of your document, type
\bibliography{name_of_your_bibtex_database}
A note about citations
By default, LaTeX only includes the elements in your BibTeX database that you reference in the main body of text. If you want to include all elements of the database, use the command
\nocite{*}
in the beginning of your document.
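Putting the pieces together, a complete document might look like the sketch below. The citation key smith2005 and the database name references are hypothetical; substitute your own key and .bib file.

```latex
\documentclass{article}
\usepackage{natbib}

\begin{document}
\bibliographystyle{plainnat}

% \citet gives a textual citation: Author (2005)
Some text here, following \citet{smith2005}.

% \citep gives a parenthetical citation: (Author, 2005)
Some more text here \citep{smith2005}.

% Optional: list every entry in the database, cited or not
\nocite{*}

% Reads references.bib (the extension is left off here)
\bibliography{references}

\end{document}
```

To resolve the citations, run latex, then bibtex, then latex twice more.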
If you’re not familiar with BibTeX
… read this introduction, which includes links to free graphical BibTeX database management software.
LyX is a very cool program — it allows you to type LaTeX documents using a graphical interface. Unlike WYSIWYG word processors, however, it retains the well-structured format of your LaTeX document. It even creates symbolic representations of LaTeX commands as you type them, so you can see what your final equations will look like (and make sure you’re not making any errors). Best of all, it’s open source.
The problem with LyX is that, except on Linux systems, it is very buggy. It hangs when exporting PDF files and crashes if you make mistakes or type too quickly. This instability deters many (including me) from using it on a day-to-day basis — it’s really not production ready[*].
However, for Mac users, there is some reprieve. If you install LyX through Fink, you can use the Linux version of LyX on your Mac. The only real problem with this method (besides the fact that X11 lacks the normal OS X coolness factor) is that you have to first start the terminal, then start X11, then launch LyX. Using the XDroplets Factory script, however, you can make a quick launcher that can start X11 (and LyX) for you with one click.
Download my version here. Just unzip it and drop it in your applications folder, dragging it to the dock if you use LyX a lot.
LyX on X11 note: For reasons that I don’t fully understand, PDF files made with LyX use bitmap fonts by default. To override this, within LyX, go to Layout > Preamble and add either
\usepackage{lmodern}
or
\usepackage{times}
to your document’s preamble (use the latter if you want Times to be the display font).
[*] Recent versions on Windows and OS X are more stable, but I’ve still managed to crash them using simple editing operations — especially ones involving the \left and \right tags.
LaTeX is great, but sometimes you want to create documents that don’t follow an article style (e.g., notes for lectures, reviews, etc.). Word allows you to do this without having to create custom styles and environments. However, typesetting math in Word, even with MathType installed and a full suite of memorized keyboard shortcuts at your disposal, can be quite time consuming.
One tip[*] is to create two simple macros:
- Symbol style: record a new macro that changes the font to “symbol,” which contains most Greek letters (little “l” is lambda, L is uppercase lambda, etc.). Assign a keyboard shortcut to this macro so you can easily switch it on.
- Normal style: record, again with a keyboard shortcut, a macro to set the font back to Times New Roman.
Using these macros, you can easily typeset the paragraph-style mathematics (e.g. for every lambda in capital omega…) without having to insert a new equation object.
[*] I originally read about this on someone’s website. I can’t remember whose it was, but props for the tip.
It has always driven me crazy that the default character for the second-degree bullet in LaTeX is a dash. I’m afraid that I’ll confuse a bullet point with a negative sign and completely misinterpret a formula. Fortunately, there’s a cure. For example, the commands:
\renewcommand{\labelitemii}{$\bullet$}
\renewcommand{\labelitemiii}{$\bullet$}
\renewcommand{\labelitemiv}{$\bullet$}
would change all of the bullet characters to standard bullets. You can also change enumerated lists. For more, read this page (which might be slightly outdated).
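As a sketch of how this looks in a full document, the example below redefines the second-level bullet and, for enumerated lists, the first-level label. The (a), (b) label style is just one choice I made for illustration.

```latex
\documentclass{article}

% Second-level itemize label: a bullet instead of the default dash
\renewcommand{\labelitemii}{$\bullet$}

% First-level enumerate label: (a), (b), ... instead of 1., 2., ...
\renewcommand{\labelenumi}{(\alph{enumi})}

\begin{document}

\begin{itemize}
  \item First level
  \begin{itemize}
    \item Second level, now marked with a bullet
  \end{itemize}
\end{itemize}

\begin{enumerate}
  \item Labeled (a)
  \item Labeled (b)
\end{enumerate}

\end{document}
```

Putting the \renewcommand lines in the preamble applies them to every list in the document.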
LaTeX is great for typing articles, books and similar documents. It’s designed to be. Although it is a flexible system, customizing it for other types of documents, even slides, can be difficult. Fortunately, there are several packages available that take care of much of the required customization. Unfortunately, there is a tradeoff between the usefulness of these packages and the time it takes to learn their commands. The following tools are relatively easy to learn and use, and are likely to meet many of the needs of academics and researchers.
AMS LaTeX packages
LaTeX’s math typesetting capabilities are renowned, but two extensions make it even more effective.
1) amsmath provides several new environments for typesetting equations, including align, which is easier and more flexible than the standard eqnarray. The AMS’s Short Math Guide for LaTeX (PDF) introduces the environments and commands provided by the package.
2) amsthm makes it easy to define styles for document elements like Lemma, Remark, Note, etc. It supports different styles for theorems and allows custom numbering. It’s also very easy to use. The package is fully documented by the American Mathematical Society.
For example, to include a “Note” element, one would simply need to use
\usepackage{amsthm}
in the preamble, then
\theoremstyle{remark}
\newtheorem*{note}{Note}
somewhere in the document to define the “note” environment. (The asterisk makes the note environment unnumbered. Omit it if numbering is desired.) Then to include a new note in the document body,
\begin{note}[Note title]
This is a note
\end{note}
where the text [Note title] is an optional title that will appear in parentheses. Supported theorem styles are
- plain
- definition
- remark
Both amsmath and amsthm are included in the amslatex package. Follow the link for thorough documentation.
The outlines package
Outlines, slides, problem sets, exams, etc. — anything with a running, nested numbering scheme — can be annoying to typeset using LaTeX, since each level of information requires a new itemize or enumerate environment. Such documents can be created more quickly (and the source can be more easily read) using the outlines package.
The package is fully documented on the CTAN link provided. In short, after using
\usepackage{outlines}
in the preamble,
\begin{outline}[enumerate]
\1 First point
\1 Second point
\2 Subpoint to the second point…
\end{outline}
would create an outline using two enumerate environments. (Alternatively, \begin{outline}[itemize] would create a bulleted outline.) A nice aspect of the outlines package is that it allows you to use the command
\0
to include a normal paragraph, which removes the need to close and reopen the outline every time the text alternates between outline items and paragraphs.
Slides
There are several options for creating slides. One is to use the advanced FoilTeX package, which has functions for creating overlaid slides, custom headers and many other effects. Alternatively, there is the simpler built-in slides document class. This class is easier to use than full-fledged slides utilities, though it is a bit more limited in that it doesn’t support sections or titles.
For example, in the preamble,
\documentclass{slides}
\usepackage{geometry}
\geometry{screen}
sets up the document. Then the only command to learn is
\begin{slide}
Of course, to have headings, one must use tricks such as
\textbf{Slide title}
which is philosophically at odds with LaTeX’s normally structured markup. But it gets the job done and doesn’t require a huge time investment to learn.
Hanging indentation
For typing lecture notes, one may not want to use the normal paragraph layout, since information becomes buried and difficult to find quickly. One solution, which mirrors the way most people take notes from a blackboard, is to use hanging indentation, in which the first line of a note/sentence/etc. is flush left, and all subsequent lines are indented from the left. The hanging package accomplishes this nicely. To use it, invoke it in the preamble (\usepackage{hanging}), then for a series of hanging-indented paragraphs, use
\begin{hangparas}{.25in}{1}
Paragraph to be hanging indented

Another paragraph to be hanging indented
\end{hangparas}
In the above command, the first argument specifies the size of the hanging indent and the second specifies the line after which the paragraph begins to hang (it’s usually 1).
Notes
- Suggestions are invited for other useful packages, but please try to keep to the “commonly useful and relatively easy to learn and use” rule.
- For help installing packages, read my tutorial.
One great feature of Crimson Editor (my favorite Windows text editor) is that it allows for the configuration of user tools that call external applications such as SAS and Stata[*]. However, configuration can be a pain, especially for casual Windows users.
The following graphic shows my setup for PDFLaTeX:
That is:
- Command: C:\texmf\MiKTeX\bin\pdflatex.exe
- Argument: "$(FileDir)\$(FileName)"
- Initial Dir: $(FileDir)
- Capture on output: checked
This setup works for most other applications. For example, you can browse to Adobe Reader and use the argument "$(FileDir)$(FileTitle).pdf", ignoring the other fields, to set an Open PDF file command. The key is that file names containing spaces are wrapped in double quotes in the argument field, and no "%1" is required.
[*] If you use Windows, consider trying Crimson Editor. It comes with Stata, SAS, R, LaTeX and other language definitions by default; it has a macro recording facility; and it has advanced editing features like auto-indent, column selection, search and replace, etc. It’s free, stable and fairly popular.
The reason that I created the Stata language module for BBEdit/TextWrangler was that all the references to a set of tools provided by Ben Hulley, including a language module and several AppleScripts, pointed to a nonexistent .mac page. Courtesy of Steven Samuels, however, I have obtained a copy of the original Stata scripts[*].
The scripts include:
- Do file, which sends your entire .do file to Stata
- Do selection, which sends the selected code to Stata (editorial: very nice)
- Insert ruler, which inserts an 80-character ruler (editorial: perfect for writing pretty code, and adaptable to other languages. Also very nice)
- Multiple line slashes, which adds “///” to the end of your line, inserts a carriage return, and indents the subsequent line
To install, unzip the folder into
~/Library/Application Support/TextWrangler/Scripts
replacing TextWrangler with BBEdit if you’re so inclined.
Note: The scripts were originally written for BBEdit, so if you only have TextWrangler installed, the first time you use each script, a dialog will appear asking you to choose an application. Choose TextWrangler. Doing so will rewrite the script so that it will work with TextWrangler.
[*] I am posting these with no permission whatsoever, but I don’t think anyone will complain. Also, for the record, I did not create these, so if you love them, please don’t credit me. And if you hate them, please don’t blame me.
[Important Update: The Outreg module is no longer maintained. The preferred alternative is Estout, which is more powerful and includes the ability to export to, inter alia, HTML and LaTeX. The documentation is extensive, but when I've worked with the module for a while, I'll do a tutorial. Hat tip to Andy Felton for filling me in.]
[Update to the update: Chris Ruebeck points me to outreg2, an updated version of the original module. It looks like Stata users have numerous ways to streamline their output-to-paper process.]
The Outreg module for Stata makes it easy to save output from statistical procedures in separate tab-delimited text files, which are easy to use with other applications. Outreg is an extremely useful add-in.
Setup
To get Outreg, visit the link above and download outreg.ado and outreg.hlp. Place the files in the “O” folder of your Stata updates directory. If you’re using OS X, that directory is located in
/Applications/Stata/ado/updates/o
If you’re using Windows, the directory is located in
C:\Program Files\Stata\ado\updates\o [*]
Now you’re ready to use the module.
Outputting a table
Outreg is used after an estimation command. For example, after the regression
regress x y
the command
outreg using my_output.txt
would create a new text file named “my_output.txt” containing a journal-style regression table (variable names and parameter values in columns, variables in rows, standard errors beneath coefficient estimates).
If you then formulated a second regression
regress x y z
the command
outreg using my_output.txt , append
would add the estimates from the second regression to the table. Alternatively, if you wanted to completely replace the text file, you could use
outreg using my_output.txt, replace
Options
Outreg has many options, all of which are documented in detail in the outreg.hlp file, which can be browsed by typing help outreg in the command line. One useful option is pval, which replaces standard errors with p-values in the tables produced. To invoke this option, use
outreg using my_output.txt , pval
Uses
The tab-delimited text format is useful because it is compatible with so many other applications. For example:
- Tab-delimited text files can be imported directly into Excel, Calc or other spreadsheet applications. From there, they can be used for further calculations, or prepared to be copied into a word processor
- By copying the contents of the text file, pasting them into your word processor and using its text to table conversion feature[**], you can quickly create styled tables
- The tab-delimited text can be easily converted to the LaTeX table format using, for example, my Tableizer AppleScript with TextWrangler
[*] These may differ, depending on your exact setup.
[**] In OpenOffice or NeoOffice: Select the text and go to Tools: Text <-> Table and select “Tabs” as the delimiter. In Word: Select the text and go to Table: Convert: Convert Text to Table, and use “separate text at tabs”
Smultron is an open-source text-editor for OS X. It’s a very lean editor, but it’s fast and it was written in Cocoa. Out of the box it has syntax highlighting for Stata and LaTeX, but not R. Basic instructions for adding a rudimentary R syntax coloring file are available here. However, I have adapted the R syntax module for TextWrangler for Smultron.
There are two files: the keyword list and the syntax definitions file, which tells Smultron which languages are available.
In order to install, you have to add these files to the application bundle. To do so:
- Right-click on Smultron and select Show in Finder
- Right-click on the Smultron application (in the ~/Applications folder) and select Show package contents
- Copy the above files to the directory ./Contents/Resources within the application bundle
- Restart Smultron
Update: John MacFarlane has written pandoc, which offers a more direct route from Markdown to LaTeX:
pandoc is an implementation of markdown (and much more) in Haskell. It can convert markdown-formatted text to HTML, LaTeX, rich text format, reStructuredText, or an S5 HTML slide show. It can also convert HTML, LaTeX, and reStructuredText to markdown.
~
Markdown is a syntax language and converter by John Gruber that turns formatted plain text into XHTML. Unfortunately, there is not a modified version of Markdown capable of creating LaTeX directly[*]. The current workaround involves translating text to XHTML, then XHTML to LaTeX. But this requires installing Fink to get a newer version of xsltproc. Rather, I recommend the following (for TextWrangler users[**]):
- Download MultiMarkdown and put it in ~/Library/Application Support/TextWrangler/Unix Support/Unix Filters/
- Download this XHTML to LaTeX XSLT stylesheet from the W3C and put it in ~/Library/Application Support/TextWrangler/
- Download this AppleScript and put it in ~/Library/Application Support/TextWrangler/Scripts
The process to write basic LaTeX documents is then:
- Read the documentation for Markdown and MultiMarkdown
- Create your document using Markdown syntax
- Run Markdown on your document to convert it to XHTML
- Use the AppleScript to invoke xsltproc and convert your XHTML document to LaTeX
- Run LaTeX on your document
This is completely experimental right now.
MultiMarkdown has support for tables, footnotes, citations and just about everything else.
[*] If anyone’s looking for a programming project, I for one know that I would pay money for a prepackaged Markdown to LaTeX app. Bonus if you package a small LaTeX distribution with your app, so I can go straight from plain text to pdf.
[**] If you have a scriptable editor, you can surely modify the script to work with your weapon of choice.
A commonly asked question with an easy answer: how do I use endnotes rather than footnotes in a LaTeX document?
In your document’s preamble, invoke the endnotes package:
\usepackage{endnotes}
The endnotes package is included automatically in most LaTeX distributions. To insert a new endnote in your document, use the command:
\endnote{Text here}
And at the point in your document where you want the endnotes to appear, type:
\theendnotes
which will insert a list of your endnotes and an automatic heading.
Tableizer is an AppleScript for TextWrangler (and BBEdit) that can convert plain text to tables. It works by converting data in which columns are delimited by tabs and rows are delimited by carriage returns into LaTeX tabular style (with columns delimited by ampersands and rows delimited by “\\”) [*]. For example, some data in the following format:
a b
c d
where there is a tab between consecutive characters on each row, can be highlighted and processed with Tableizer to produce
a & b \\
c & d \\
which is the LaTeX tabular format.
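The conversion itself is simple enough to sketch outside of AppleScript. The Python function below is only an illustration of the idea, not the Tableizer source, and the name tableize is made up for this example: split each line on tabs, join the cells with ampersands, and end each row with “\\”.

```python
def tableize(text):
    """Convert tab-delimited text into LaTeX tabular rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = [cell.strip() for cell in line.split("\t")]
        # Columns are joined with ampersands; each row ends with \\
        rows.append(" & ".join(cells) + r" \\")
    return "\n".join(rows)


if __name__ == "__main__":
    print(tableize("a\tb\nc\td"))
```

As with the script itself, a cell that contains a tab would be split into two columns.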
Suggested uses
Conversion from spreadsheets. When you copy a range of cells from within spreadsheet applications such as Excel or Calc and paste them into any plain text environment, the data are pasted as several rows of data with tabs between columns. Tables pasted in this format are ready for conversion to the LaTeX tabular style using Tableizer.
Conversion from Stata. Tables produced by Stata can be copied as tab/carriage-return delimited data by highlighting the entire table, right clicking, and selecting Copy as table (from within the Stata console, not log files). Data in this format are also ready to be converted to LaTeX tabular via Tableizer.
[*] Obviously if your tabular data contains tabs, the script will not work correctly.
Update — Tableizer v2
I have updated the Tableizer script to automatically handle characters that are sensitive in LaTeX, such as “%”, “|” and “>”, which are frequently found in statistical output (for example, regression output from Stata, R or SAS). The newer version is available here. You still need to watch out for other special characters, and if you prefer to leave the output unchanged, the original version will remain available (link above).
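To give a rough idea of what handling these characters involves, here is a small Python sketch. The replacement table is my own assumption about reasonable LaTeX substitutions, not a copy of what Tableizer v2 does, and the names LATEX_ESCAPES and escape_cell are invented for the example.

```python
# Hypothetical replacement table; the real script may treat these differently.
LATEX_ESCAPES = {
    "%": r"\%",   # an unescaped % starts a comment in LaTeX
    "|": r"$|$",  # | and > can print as the wrong glyph in text mode,
    ">": r"$>$",  # so they are wrapped in math mode here
}


def escape_cell(cell):
    """Replace each sensitive character in a table cell with a LaTeX-safe form."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in cell)


if __name__ == "__main__":
    print(escape_cell("95% Conf."))
```

Other special characters (such as & or _) are deliberately left alone here, so they still need to be checked by hand.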
The default table environment in LaTeX does not work well if the data to be displayed in your table spans more than one page: the text is cut off and may flow outside of the normal page margins. The solution is to use the longtable package, which allows tables to take up multiple pages and lets header rows repeat at the top of each page.
To use longtable, first invoke the package in the preamble:
\usepackage{longtable}
then use the following syntax to insert a longtable into your tex file:
\begin{longtable}{cc}
\caption{Title of your table} \label{label-name} \\
Header of first column & Header of second column \\
\endhead
Table cell 1, 1 & Table cell 1, 2 \\
Table cell 2, 1 & Table cell 2, 2 \\
\end{longtable}
The key difference between longtable and table is that table is a float environment that houses the tabular environment, which actually contains all of the data. In longtable, a second environment is not necessary (nor is the center environment — longtables are already centered). Most of the options and syntax are the same as those for table and tabular.
Note the \endhead command, which tells LaTeX to repeat the rows preceding it at the top of every page of the table.
There are a few more advanced options. For a complete reference, see the longtable manual (pdf). For more examples, see this page.
TextWrangler has support for the automatic commenting and uncommenting of blocks of code. However, it appears that this support does not work for codeless language modules, such as those for Stata and R that I have posted about on this site. As a workaround, I have made two scripts (StataComment and StataUncomment) that will insert or remove the characters “//” from each line of selected text. The scripts can easily be edited to use comment characters for other languages (such as “#” for R). New scripts can also be recorded using the recording function in the scripts menu (and the Prefix/Suffix command under the Text menu).
Keyboard shortcuts. These scripts are far more useful if you assign keyboard shortcuts to them. To do so, select Window > Palettes > Scripts, select the appropriate script and click Set Key (I’m using cmd+m and cmd+shift+m).
Update. These scripts have been updated to GenericComment and GenericUncomment. The new scripts support:
- Stata
- R
- LaTeX
and new languages can be added by directly editing the scripts. I experimented with using only one script to handle both commenting and uncommenting, but there may be cases where you want to comment already-commented lines, and vice versa.
The situation: you have two datasets with a common variable, and you want to incorporate both into one large dataset containing all of the variables. This is called merging data, and it’s easy to do in any standard statistical package. In these examples, I assume that there is only one variable between any datasets to be merged that has the same name in each file.
Stata
Before you can merge data in Stata, you must do two things:
- Read each dataset into Stata and sort it by the merging variable (ex: insheet using file.csv , comma; sort id)
- Save the dataset as a dta file (ex: save file.dta)
With the data properly formatted, you can merge two or more datasets by the same variable using the merge command:
use file1.dta
merge id using file2.dta
where id is the common variable between datasets file1.dta and file2.dta. If desired, you can merge arbitrarily many datasets using merge as long as they all contain and are sorted by one common variable.
Tip: After performing a merge in Stata, a new variable named _merge is created. This variable can be tabulated to analyze the results of the merge. For more information, type help merge into the Stata command line.
R
In R, the process is similar. To begin, read each datafile in as a dataframe (ex: file1<-read.csv(file="file1.csv",header=TRUE)). Then type:
new.dataframe<-merge(file1, file2, by="id")
One difference between R’s merging function and Stata’s is that, in R, you can only merge two dataframes at a time. So, to merge a third dataset, working from the example above, you would use:
new.dataframe<-merge(new.dataframe, file3, by="id")
SAS
The process is essentially the same in SAS: first create two SAS datasets (named file1 and file2 for this example) and sort them by id (ex: proc sort data=file1; by id; run;). Then, to create a third, merged dataset:
data newdataset;
merge file1 file2;
by id;
run;
Like Stata, SAS lets you merge arbitrarily many files at once, as long as they are properly sorted and imported into SAS.
For more
The examples above show how to merge two (or more) datasets in such a way that each value of the variable id, no matter whether that value is missing on any of the input datasets, will be included in the final combined file. (In R, this requires adding all=TRUE to the merge() call; by default, merge() keeps only the values of id that appear in both dataframes.) In some cases, this may not be optimal. You may want to only use values of id that are present in one particular dataset (for example, trying to match up national population data with another dataset specific to the northeast, you would want to exclude any states not in the northeast). Stata, R and SAS all have the capability to do this kind of selective merging, but describing all of the options for each package would take up too much space for this short tutorial.
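The difference between keeping every id and keeping only some of them is easier to see in a few lines of code. The following Python sketch is a toy illustration and not how any of the three packages implements merging; the function name merge_by_id, the keep argument and the sample values are all made up. Each dataset is a dictionary keyed by id, and None stands in for a missing value.

```python
def merge_by_id(left, right, keep="all"):
    """Merge two {id: {variable: value}} tables on id.

    keep="all"  : every id found in either table
    keep="left" : only ids present in `left`
    keep="both" : only ids present in both tables
    """
    if keep == "all":
        ids = sorted(set(left) | set(right))
    elif keep == "left":
        ids = sorted(left)
    elif keep == "both":
        ids = sorted(set(left) & set(right))
    else:
        raise ValueError("keep must be 'all', 'left' or 'both'")

    # Collect every variable name that appears in either table
    variables = sorted({v for rec in left.values() for v in rec} |
                       {v for rec in right.values() for v in rec})

    merged = {}
    for i in ids:
        row = dict.fromkeys(variables)  # None marks a missing value
        row.update(left.get(i, {}))
        row.update(right.get(i, {}))
        merged[i] = row
    return merged


if __name__ == "__main__":
    # Made-up values, for illustration only
    national = {"NY": {"pop": 1}, "TX": {"pop": 2}}
    northeast = {"NY": {"region": "NE"}, "VT": {"region": "NE"}}
    print(merge_by_id(national, northeast))               # NY, TX and VT
    print(merge_by_id(northeast, national, keep="left"))  # NY and VT only
```

With keep="all", TX survives with a missing region; with keep="left" and the northeast table on the left, it is dropped, which is the selective behavior described above.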
To learn about advanced features, try the following:
After much delay, I have finally created a LaTeX tutorial. Although there are many introductions to LaTeX, they all suffer from one of two deficiencies: they are either too long or too short. Too-short introductions don’t give potential users enough information, and too-long introductions scare them off. Mine is designed to be an accessible and fast read, but contain enough information to get people making polished LaTeX documents very quickly.
This tutorial is 20 pages long and can easily be read over the course of 10-20 minutes. It covers installation and setup on Windows and Mac OS X, a topic which is often overlooked by beginner’s guides, but can scare off new users since installation is different than standard applications on those systems. It gives a brief introduction to the history and purpose of LaTeX, then provides enough information about
- Sectioning
- Environments
- Text styles
- Packages
- Tables and figures
- Footnotes, cross references and basic bibliographies
- Mathematical notation
that, after reading it, most should be ready to use LaTeX for production, using search engines to look for specific advice not covered in the tutorial.
Download it here (PDF).
Updated September 2008
LyX is a what you see is what you mean word processor that acts as a LaTeX frontend. Essentially, you first select your LaTeX environment from a menu (standard, section, subsection, enumerate, quote, etc.), then type. You see an approximation of what your printed or PDF’d document will look like on-screen, but it’s ultimately processed with your LaTeX engine. It also has a nice feature whereby equations are input using the normal LaTeX syntax, but rendered in real time.
Some other features of LyX include:
- Menu-based tweaking of document settings, such as bullet style, margin size, columns, etc. These options, which aren’t changed a lot, are the hardest parts of the LaTeX language to remember.
- Built in classes. LyX knows the environments available for popular document classes, such as FoilTeX. So as long as you have FoilTeX installed, you can use it without having to learn its syntax and options in addition to the normal LaTeX language.
- Export to PDF, LaTeX, HTML, DVI and PS, depending on the options installed on your system.
- Ability to insert raw LaTeX code — for example, if you have a table created using R’s xtable package, you can just insert the entire thing as raw code.
LyX is completely open source. It’s been around for a while, but it was fairly buggy. With the latest release, it seems 100% usable and stable.
In order to use LyX, you need, at minimum, a LaTeX installation (Mac, Win), and for full functionality, there are some other requirements, like Ghostscript and other utilities that I believe are included in MacTeX and ProTeXt. LyX is available for Windows and Mac. It’s fairly easy to use right off the bat, but I suggest PDFing the User’s Guide and giving it a read to get the most out of LyX.
Update: I added a list of some common keyboard shortcuts for LyX on OS X in the Reference cards section.
In SAS, it's very easy to automatically recode or generate variables using 2 or more arrays of the same length. For example, if you have 5 quantities (good1 good2 … good5) and 5 prices (price1 … price5) you could figure out the total cost of purchasing goods 1 through 5:
array goods good1-good5;
array prices price1-price5;
array cost cost1-cost5;
do over goods;
cost = prices * goods;
end;
In the code above, SAS knows to multiply prices[i] by goods[i] for each i in 1 to the length of the goods array. This is a very convenient feature.
I'm a newcomer to Stata, but I've been having a difficult time implementing this same functionality. Although Stata's foreach functionality is very convenient, it doesn't seem to provide an explicit indexing capability. I have no idea how to directly reference the ith element of a varlist and interface that element with the ith element of another varlist, as I did in the SAS code above. I have however come up with a proxy. To implement the same goods*prices example above, try:
local goods good1 good2 good3 good4 good5
local prices price1 price2 price3 price4 price5
local cost cost1 cost2 cost3 cost4 cost5
foreach i of numlist 1/5 {
local g: word `i' of `goods'
local p: word `i' of `prices'
local c: word `i' of `cost'
gen `c' = `g' * `p'
}
Of course, with the variable names that I've made up, you could simply write
foreach i of numlist 1/5 {
gen cost`i' = good`i' * price`i'
}
because all of the variables end in the number. But say, for example, that instead of (good1 … good5) the goods were named (bread milk eggs butter jam). In that case, you couldn't simply put the number suffix on the end of the variable, so the word `i' of `goods' trick would come in handy.
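For readers who think more easily in a general-purpose language, here is the same positional pairing written as a short Python sketch. It only mimics the logic of picking the ith word from each list and prints the gen commands that the loop would run; the price and cost variable names (p_bread, c_bread, and so on) are hypothetical.

```python
def word(i, macro):
    """Return the i-th (1-based) word of a space-separated list,
    in the spirit of Stata's `word i of` macro function."""
    return macro.split()[i - 1]


def gen_commands(goods, prices, cost):
    """Build one gen command per position across three parallel word lists."""
    n = len(goods.split())
    if not (len(prices.split()) == len(cost.split()) == n):
        raise ValueError("all three lists must have the same number of words")
    return [
        f"gen {word(i, cost)} = {word(i, goods)} * {word(i, prices)}"
        for i in range(1, n + 1)
    ]


if __name__ == "__main__":
    goods = "bread milk eggs butter jam"
    prices = "p_bread p_milk p_eggs p_butter p_jam"  # hypothetical names
    cost = "c_bread c_milk c_eggs c_butter c_jam"    # hypothetical names
    for command in gen_commands(goods, prices, cost):
        print(command)
```

The length check is the one thing the Stata loop above does not do for you: if one list is shorter than the others, the word extraction quietly returns an empty string.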
There is a more complete discussion of this and other Stata programming topics in "A little bit of Stata programming goes a long way," (PDF) by Christopher Baum.
* By all means, if there is a simple way to do so, please contact me at dataninja at gmail.
Lancaster University’s Centre For Applied Statistics makes notes from its short courses on statistics and software available for download. The notes are available as PDF files and are accompanied by sample datasets.
This AppleScript allows you to process a Stata .do file from within TextWrangler.
To install, copy the .scpt file to
~/Library/Application Support/TextWrangler/Scripts/
Recommended: Assign a keyboard shortcut to the script (or any script) by (1) selecting Window > Palettes > Scripts (2) selecting the SendToStata Script and (3) choosing “Set Key.”
Scenario: You get a dataset that has currency values listed in the form ($1,045.65), instead of the preferred (1045.65). If you’re using Stata, making the transformation is easy. Say your variable is named price. Then
destring price , ignore(",, $") replace
will make the conversion.
The code above simply takes advantage of the powerful destring statement. One of its options is ignore("chars"), which tells it to ignore certain characters when converting from string to numeric. An example of a destring statement using the ignore() option is
destring x , ignore("%, #") replace
This would strip all of the %’s and #’s from the variable x and turn it into a numeric variable.
However, in the price example above, one of the characters we want to ignore is the comma — but the comma is also used to delimit the list of characters that the ignore option takes as arguments. So we use the somewhat awkward statement
…ignore(",, $")
to eliminate both commas and dollar signs.
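As a rough picture of what the ignore() option is doing, here is a small Python sketch. It is not Stata’s implementation; it assumes that every character listed in the ignore string is simply dropped before the conversion, and the function name destring is borrowed only for readability.

```python
def destring(values, ignore=""):
    """Drop every character in `ignore` from each string, then convert to a number.

    Empty strings become None, standing in for a missing value.
    """
    table = str.maketrans("", "", ignore)  # maps each ignored character to "nothing"
    cleaned = (v.translate(table) for v in values)
    return [float(s) if s else None for s in cleaned]


if __name__ == "__main__":
    print(destring(["$1,045.65", "$20", ""], ignore=",$ "))
```

Anything left over that is not numeric (a stray letter, say) raises an error here rather than being converted silently.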
I have put together a Stata language module for TextWrangler. It has syntax coloring support for many Stata keywords (functions, options, etc) as well as block and inline comments. To install,
- Download the module here
- Copy the file to “username/Library/Application Support/TextWrangler/Language Modules”
Using BBEdit
This language module also works with BBEdit (as of version 8.5). However, to use the module, you must create the directory
~/Library/Application Support/BBEdit/Language Modules
since it is not created by default after installation. Then simply copy the Stata.plist file into that directory.
Notes
This module would not have been possible if not for the Crimson Editor syntax files, available here, by Cai Yong. Almost the entire keyword string list is based on his work.
I was unable to incorporate both “*” and “//” for line comments. Since I use the asterisk style more often, I encoded it as the line comment character instead of double slash. This can easily be changed. If you know (or figure out) how to use both styles simultaneously, please contact me via dataninja at gmail.
More generally, I’m sure that this could be done more elegantly. As always, this file is open source. Modify at will, but if you make the module a lot better, please let me know, so I can use your version…
Update
Thanks to Ronán Conroy for adding keywords for the SSC packages and for fixing the single-quote open string key.
Update
The language module has been updated to work with BBEdit version 8.5. Either download the updated module (above) or replace the following line in the Stata.plist file manually:
<key>BBLMLanguageCode</key>
<string></string>
with
<key>BBLMLanguageCode</key>
<string>STTA</string>
Leaving this line blank was a mistake on my part. Hat tip to Steven Samuels for first discovering, then fixing, the error.
Via this page, I just learned that there is a TextWrangler Language Module for R. The file itself is available here.
To install, copy the .plist file to the following directory
Username/Library/Application Support/TextWrangler/Language Modules
I also recommend going into TextWrangler’s preferences and changing the color for comments. The default is a gray color that doesn’t provide a lot of contrast compared to normal black screen text.
Also, check out the via page from SciViews.com. It has links to R syntax highlighting files for many editors on many platforms.
The easiest way (that I’ve found) to quickly accomplish repetitive tasks in Stata is to use the foreach statement. Foreach is essentially a loop command. It allows you to define a list of variables, then have Stata perform an action (or actions) for every variable in the list.
The general syntax for foreach is:
foreach x of varlist var1 var2 var3 {
thing to do
}
Alternatively, you could say “foreach x in var1 var2 var3 {…”
Examples:
1. Recode several variables at once
foreach x of varlist var1-var3 {
recode `x' (-9=.)
}
This example:
- Acts on all of the variables between and including var1 and var3 in the dataset (not alphanumerically, but in the order that they were added, so if the dataset goes var1 var2 age var3, then all four variables get recoded)
- Recodes all instances of "-9" to missing
2. Quickly run several regressions
foreach x of varlist var1-var3 {
regress `x' age height income education
}
This example runs one regression with each of the variables between (and including) var1 and var3 as the dependent variable, regressed on age, height, income and education.
As the examples above illustrate, foreach in Stata is like a combination of arrays and macros in SAS, because it allows you to automate both modifications to the dataset and analytical output. Using the foreach statement can save a lot of time and effort when writing even quick Stata programs.
The call to perform factor analysis on a set of variables in R is:
fact1<- factanal(x,factors,scores=c("regression"),rotation="varimax")
where “x” is a dataframe containing the appropriate variables, and “factors” is the number of factors to be extracted.
scores="…" and rotation="…" are optional, and varimax is the default rotation.
The factanal function doesn’t seem to handle missing observations well, so it’s easier to create a new dataframe based on your original, with the missing values omitted:
x1<-na.omit(x)
To view the summary of the factor analysis, simply type the name of the object under which the analysis was saved.
fact1
To view the scores, use
fact1$scores
To view a scree plot, install the “psy” package and load it
install.packages("psy")
library(psy)
Then type
scree.plot(fact1$correlation)
The above commands and options are only the very basics. For more information, view the documentation for the factanal function. For more information about factor analysis in general, here is a nice nontechnical introduction.
NeoOffice/OpenOffice Calc and Microsoft Excel have the ability to run basic linear regressions. The usefulness of this estimation function is limited in that:
- Missing values must be removed before any estimation can be done
- All of the variables must be located next to each other in the spreadsheet, with the dependent variable on the left
However, the feature can still be helpful: almost every computer has some spreadsheet application (usually Excel) and data are often stored in .xls files, so quickly running a regression in Excel or Calc may save time.
Issuing the command
The syntax for running an OLS regression is
=linest(dependent variable range; independent variable range; has intercept?; include stats?)
where
- "dependent variable range" is the array of cells containing the dependent variable
- "independent variable range" is the array of cells containing the independent variables (if there are more than one, list the first cell of the first variable through the last cell of the last variable)
- "has intercept?" is "1" if an intercept term should be included, "0" otherwise
- "include stats?" is "1" if regression statistics should be included, "0" otherwise. Selecting "0" leads the procedure to return only the coefficient estimates
For example, in a spreadsheet with 10 records and 4 variables (y, x1, x2, and x3) placed in columns A-D, respectively, where the data start on row two, the regression
y = a + b1*x1 + b2*x2 + b3*x3
could be estimated using the command:
=linest(a2:a11;b2:d11;1;1)
Note 1: The above syntax is for NeoOffice/OpenOffice Calc. To use in Excel, replace ";" with ",".
Note 2: The above formula is an array formula, meaning that it creates output that takes up more than one cell. After the command is typed, hit command+shift+enter (Mac) or control+shift+enter (Win). If you are using Excel, you must first select a 5x(n+1) matrix of cells, where n is the number of explanatory variables (5 rows, 4 columns, in the above example), then type the formula, then hit cmd+shift+enter. OpenOffice is a little smarter, so you can just type the formula in one cell.
Understanding the output
Using the above regression example, Calc/Excel will output the following array:
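| b3 | b2 | b1 | a |
| se_b3 | se_b2 | se_b1 | se_a |
| r^2 | se_y | | |
| f | df | | |
| sum_i(hat.y_i-bar.y)^2 | sum_i(y_i-hat.y_i)^2 | | |
Where
- se_b1 is the standard error of the estimated coefficient on x1
- se_y is the standard error of the regression, f is the F statistic and df is the residual degrees of freedom
- hat.y is the predicted value of y using the regression
- bar.y is the average of y across all observations
- the last row holds the regression sum of squares (left) and the residual sum of squares (right)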
So the output, from left to right, goes in the reverse order of the data. And from top to bottom, starts with coefficient estimates, then goes to standard errors, with other statistics below. It’s a bit complicated, but it’s worth knowing how to use and interpret.
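As a cross-check on the layout described above, the same kind of regression can be run in R and rearranged into LINEST's 5x4 block. This is only a sketch using base R: the data are simulated and the object names (d, fit, linest_like) are my own, not anything produced by Calc or Excel.

```r
# Simulated data standing in for the 10-row spreadsheet (values are made up)
set.seed(1)
d <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x3 + rnorm(10)

fit <- lm(y ~ x1 + x2 + x3, data = d)
s   <- summary(fit)

# LINEST lists coefficients right to left: b3, b2, b1, a
coefs <- rev(coef(fit))
ses   <- rev(s$coefficients[, "Std. Error"])

ss_resid <- sum(residuals(fit)^2)               # sum_i(y_i - hat.y_i)^2
ss_reg   <- sum((fitted(fit) - mean(d$y))^2)    # sum_i(hat.y_i - bar.y)^2

linest_like <- rbind(
  coefs,
  ses,
  c(s$r.squared, s$sigma, NA, NA),
  c(s$fstatistic[["value"]], df.residual(fit), NA, NA),
  c(ss_reg, ss_resid, NA, NA)
)
dimnames(linest_like) <- list(
  c("coef", "se", "r2 / se_y", "F / df", "ss_reg / ss_resid"),
  c("x3", "x2", "x1", "intercept")
)
print(round(linest_like, 4))
```

Reading the printed matrix next to the spreadsheet output is a quick way to confirm which cell is which.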
Use this template when creating do files for Stata programming to ensure that you remember to properly set your directory, log your output, turn off more, close any open logs, etc.
set more off
clear
capture log close
cd /path/to/file
log using FILENAME.log , replace
* * * * *
// program goes here
* * * * *
capture log close
Note: Most text editors have some ability to save and open templates. To use this as stationery in TextWrangler, copy and paste the above into a blank text file, select Save, check the “Save as stationery” box, and save the file in Home/Library/Application Support/TextWrangler/Stationery.
Turning plain-text output into well-formatted tables can be a repetitive task, especially when many tests or models are being incorporated into a paper.
For R users, there are several methods that can make this task easier (though not much less repetitive), regardless of what typesetting system you use.
LaTeX tables
The xtable package produces LaTeX-formatted tables. Using xtable, specific kinds of R objects, such as linear model summaries, can be turned into "xtables", which can in turn be output to either LaTeX or HTML.
To use:
- Install the xtable package: install.packages("xtable")
- Load the xtable package: library(xtable)
- Create an xtable for the object that you want to export: newobject<-xtable(object)
- Export your xtabled object to LaTeX: print.xtable(newobject, type="latex", file="filename.tex")
Options
- To export to HTML instead of LaTeX, use type="html" and use the .html extension in the filename
- To have R print the markup in the console (instead of writing it to a file), omit the file="filename" option
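Put together, the steps above might look like the following sketch. It assumes the xtable package is already installed; the data, the model and the file names are made up for illustration, and the files are written to a temporary directory rather than your working directory.

```r
library(xtable)  # assumes the xtable package is already installed

# Made-up data and model, used only to have something to tabulate
set.seed(2)
d   <- data.frame(x = rnorm(20))
d$y <- 1 + 2 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)

tab <- xtable(fit, caption = "OLS estimates", digits = 3)

# With no file= argument the markup is printed to the console
print(tab, type = "latex")

# Write the same table to files in a temporary directory
tex_file  <- file.path(tempdir(), "ols.tex")
html_file <- file.path(tempdir(), "ols.html")
print(tab, type = "latex", file = tex_file)
print(tab, type = "html",  file = html_file)
```

Calling print() on an xtable object dispatches to print.xtable(), so the two spellings are interchangeable here.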
Other word processors
If, instead of LaTeX, you are using Word, OpenOffice, NeoOffice, etc., there is a way to produce more easily formatted tables. Just follow these steps:
- Create an xtable, as above, and print the output to an HTML file: print.xtable(newobject, type="html", file="filename.html")
- Open the generated HTML file in your browser (may I recommend Firefox)
- Copy the table contents and paste them into your word processor
- Convert the text to a table. In OpenOffice or NeoOffice: select the text, go to Tools: Text <-> Table and select "Tabs" as the delimiter. In Word: select the text, go to Table: Convert: Convert Text to Table, and choose to separate text at tabs
This technique is somewhat more convoluted than creating pure LaTeX output, but it is probably quicker than entering the output by hand.
Note: An alternative to xtable is the R2HTML package, which works similarly, but does not require xtable objects to be created in order to generate HTML output.
From the Stata website,
Stata is a complete, integrated statistical package that provides everything you need for data analysis, data management, and graphics. Stata 9 adds many new features such as linear mixed models, balanced repeated replications, and multinomial probit ….
These resources will help you learn Stata and use it to its fullest capabilities:
- Introduction to Stata 8 (PDF) by Svend Juul
- UCLA’s Stata resources
For more advanced use, you may find the following useful:
- Stata Journal — Tools for creating LaTeX files in the style of the Stata Journal
- SendToStata — My AppleScript for sending .do files to Stata from within TextWrangler (Copy the file to Home/Library/Application Support/TextWrangler/Scripts)
Loops in R are similar to arrays in SAS: they allow you to work with multiple variables at the same time (among other things). For example, you can loop over the columns of a dataframe to recode several variables in one step:
library(car)
yesnovars<-data.frame(x1,x2,x3)
for(i in 1:length(yesnovars)) {
yesnovars[[i]]=recode(yesnovars[[i]],"c('Y')=1;c('N')=0")
}
would
1) Make a dataframe called “yesnovars” containing the three variables x1, x2 and x3
2) Recode all three variables to be 1 if the original value was “Y” or 0 if the original value was “N”
Note: The car package is required to use the recode() function.
Basic loop theory
- The basic structure for loop commands is: for(i in 1:n){stuff to do}, where n is the number of times the loop will execute
- listname[[1]] refers to the first element in the list “listname.”
- In a for loop, listname[[i]] refers to the variable corresponding to the ith iteration of the for loop.
- The code “for(i in 1:length(yesnovars))” tells the loop to execute only once for each variable in the list.
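If the car package isn't available, the same loop can be sketched with base R only. The three vectors below are made-up data, and ifelse() is swapped in for recode(), so this is an alternative and not the method used above.

```r
# Made-up yes/no responses
x1 <- c("Y", "N", "Y")
x2 <- c("N", "N", "Y")
x3 <- c("Y", "Y", NA)
yesnovars <- data.frame(x1, x2, x3, stringsAsFactors = FALSE)

# seq_along() is a safer form of 1:length(): it yields nothing for an empty list
for (i in seq_along(yesnovars)) {
  v <- yesnovars[[i]]
  # "Y" -> 1, "N" -> 0, anything else (including NA) -> NA
  yesnovars[[i]] <- ifelse(v == "Y", 1, ifelse(v == "N", 0, NA))
}
print(yesnovars)
```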
Update: Mass recoding
For me, one of the most important uses of loops/arrays/etc. is recoding a lot of variables at once. R wasn’t really designed for extensive data management, and I’ve read that many people use another application, like SAS or even Fortran, to manipulate data, then use R for analysis. Still, it’s possible to use R for these tasks, assuming that the dataset is reasonably sized. For example, to add 10 to the variables x and y in a dataframe named a that contains several variables, use the code:
for(i in 1:length(a[, c("x", "y")])){a[,c("x", "y")][[i]]<-a[ ,c("x", "y")][[i]]+10}
This is a little long, but it’s easy to understand if you look at it by parts:
- a[,c("x","y")] means the x and y variables in the dataframe a. The syntax a[r,c] selects rows r and columns c from a matrix or dataframe, so a[,c("x","y")] takes all of the rows (observations) and columns x and y
- After that, the for loop works the same as the ones described above, once you realize that a[,c("x","y")][[i]] stands for the ith variable in a[,c("x","y")]
- The advantage to doing it this way is that you don’t have to (a) make a new dataframe then (b) merge the new variables back — everything is done in one step.
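A shorter way to write the same one-step update is with lapply(). This is a sketch on a made-up dataframe; the name vars is my own.

```r
# Made-up dataframe with three variables; only x and y are changed
a <- data.frame(x = 1:3, y = 4:6, w = 7:9)

vars <- c("x", "y")
# lapply() applies the function to each selected column;
# assigning back into a[vars] replaces those columns in place
a[vars] <- lapply(a[vars], function(v) v + 10)
print(a)
```

Because the column names are stored once in vars, there is less chance of a typo making the left and right sides of the assignment refer to different variables.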
Stata is a very popular commercial statistical and econometric package. Like SPSS and S-Plus, learning Stata is made easier by the fact that the program offers a dual command-line and menu-driven interface, so users can initially use menu commands to learn the programming syntax.
I have assembled a quick reference card for Stata that covers:
- Basic commands
- Importing data
- Recoding and manipulating data (basics)
- Basic statistics and tables
- Basic regression analysis and diagnostics
Testing and correcting for heteroscedasticity is arguably more convenient in R than it is in SPSS or SAS, probably because those packages weren't designed especially for econometric analysis, and R users have the car package at their disposal.
Testing
The easiest way to test for heteroscedasticity is to use the non-constant variance test function that is included in the car (Companion for Applied Regression) package. To do so, use:
library(car)
ncvTest(ModelName)  # called ncv.test() in older versions of car
This test is essentially a version of the Breusch-Pagan test. Alternatively, you could manually implement a White's test.
Correction
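The car package includes the "hccm(ModelName)" command, which returns the heteroscedasticity-consistent covariance matrix. The squared robust standard errors are on its diagonal, so sqrt(diag(hccm(ModelName))) returns the standard errors themselves. Note that hccm() defaults to the HC3 variant; use hccm(ModelName, type="hc0") for White's original estimator.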
Alternatively, Ott Toomet has written an R function that returns regression results using White's standard errors.
Here is the actual function:
# STEP 1: define the function
summaryw <- function(model) {
  s <- summary(model)
  X <- model.matrix(model)
  u2 <- residuals(model)^2
  XDX <- 0
  ## here one needs essentially to calculate X'DX. But due to the fact that D
  ## is huge (NxN), it is better to do it with a cycle.
  for (i in 1:nrow(X)) {
    XDX <- XDX + u2[i] * X[i, ] %*% t(X[i, ])
  }
  XX1 <- solve(t(X) %*% X)
  varcovar <- XX1 %*% XDX %*% XX1
  stdh <- sqrt(diag(varcovar))
  t <- model$coefficients / stdh
  p <- 2 * pnorm(-abs(t))
  results <- cbind(model$coefficients, stdh, t, p)
  dimnames(results) <- dimnames(s$coefficients)
  results
}
# STEP 2: get the results
summaryw(YourModel)
So, to use this function, (1) paste and submit the function code, which will define a new function, summaryw(), and (2) run summaryw() on your model. This is probably easier if you are running R in batch mode.
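The loop inside summaryw() builds X'DX one observation at a time. As a sketch of what that step does, the following compares the loop with a vectorized version using crossprod(). The data are simulated, and the vectorized form is my own rewrite, not part of Ott Toomet's function.

```r
# Made-up heteroscedastic data: the error spread grows with x
set.seed(3)
n   <- 50
x   <- runif(n, 1, 5)
y   <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

X  <- model.matrix(fit)
u2 <- residuals(fit)^2

# Loop form, as in summaryw()
XDX_loop <- 0
for (i in 1:nrow(X)) {
  XDX_loop <- XDX_loop + u2[i] * X[i, ] %*% t(X[i, ])
}

# Vectorized form: u2 * X scales row i of X by u2[i],
# so crossprod(X, u2 * X) equals X'DX without building the NxN matrix D
XDX_vec <- crossprod(X, u2 * X)

XX1   <- solve(crossprod(X))
white <- sqrt(diag(XX1 %*% XDX_vec %*% XX1))
print(white)
```

If car is installed, these standard errors should agree with sqrt(diag(hccm(fit, type = "hc0"))).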
Adding summaryw() to your environment
If you use the summaryw() function frequently, you may not want to manually enter it each time you want to call it. The function can be added to your default workspace by
1) Opening R normally
2) Copying and pasting the function into the console and submitting it
3) Typing save.image()
Now each time you start R, the function summaryw() will be an object in your workspace, and you can call it for any model that you run.
Note: The function will be available in any new workspace that you save and reopen. However, if you open workspaces (.RData files) that you saved before you added the function to your default workspace, you will still need to add the function to them.
Other applications
For the purpose of completeness, this document shows how to correct for heteroscedasticity in SAS, SPSS, Stata and Limdep.
SPSS is a popular (commercial) statistics package that offers a convenient graphical interface. Although it is a very common application, it does not offer the same range of statistical tests and functions as competing applications such as SAS, Stata and S-Plus. On the other hand, it is generally less expensive and easier to use (for casual users at least) than those applications.
Linear regression analysis is fairly straightforward in SPSS. Through dialog boxes, dependent and independent variables can be added to a model, residuals and predicted values can be added to the dataset, plots and charts can be exported, and summary statistics such as goodness-of-fit information and the Durbin-Watson statistic can be viewed.
These options cover most of the basics of regression analysis. However, one feature that is missing is heteroscedasticity diagnostics and correction tools (besides WLS).
The missing tools
This tutorial (zipped PDF) by Gwilym Price details a number of heteroscedasticity detection and correction techniques for SPSS.
Andrew Hayes has written canned macros for both SPSS and SAS for creating White standard errors. This PDF contains a summary of the technique, instructions on using the macros and the macros themselves.
In addition, Raynald Levesque maintains a website with many additional SPSS tutorials.
Calc2LaTeX is a macro for OpenOffice Calc. Capable of converting between Calc spreadsheet cells and equivalent LaTeX markup, it is essentially a version of Excel2LaTeX for those of us that prefer open-source software. Calc2LaTeX works on Windows, Linux and Mac OS X.
Calc2LaTeX running on NeoOffice (OS X)
OS X users: The download page offers versions for Linux and Windows. OS X users should choose the Linux version. The macro works for both NeoOffice, the OS X native version of OpenOffice, and the official X11 version of OpenOffice for Mac.
Installation instructions are available at the above download link.
To use the macro, either
1) Select “Tools: Macro: Macros: Calc2Latex: Main” or
2) Customize Calc’s menus to add a toolbar button that links to the macro (using the path above)
For more information on using and customizing OpenOffice/NeoOffice, including application and macros documentation, see the OpenOffice Documentation Project’s How-Tos page.
This add-in for Excel enables you to quickly create tables in Excel and convert them to the LaTeX tabular format.
To install, just unzip the file and drop the Excel2LaTeX.xla file in your add ins folder (under Program Files or Applications, depending on your OS). Then from within Excel, select Tools > Add-Ins and check the Excel2LaTeX box.
To use, select the range of cells that you want to convert to a LaTeX table and either select Format > Convert Table to LaTeX or click the icon on the Excel2LaTeX toolbar that is created upon installation. Copy the text to the clipboard and paste it into your LaTeX frontend.
Note: For whatever reason, when using Excel on Mac OS X, the end line “\end{tabular}” contains characters that confuse LaTeX. The best solution to this problem is to delete that line and re-type it manually.
Update [15 Mar 08]: This macro is no longer available from the author’s website, but it can still be downloaded from Softpedia. I have updated the link.
SubEthaEdit is a “powerful and lean” text editor for Mac OS X — enough features to be suitable for programming and coding tasks, but not so many features that it becomes complicated. SubEthaEdit is also known for its collaboration features.
One nice feature of the editor is that an R/S-Plus Editing Mode is available, which includes syntax highlighting — one feature that has not been implemented for TextWrangler.
SubEthaEdit is free for non-commercial use.
About R Commander
R Commander is a graphical user interface for R. The interface includes functions for
- importing and managing data
- conducting simple statistical tests
- linear and logistic models and diagnostics
- graphs
- probability distributions
as well as other functions.
R Commander on OS X under X11
Availability
R Commander works automatically with Windows and some Linux distributions.
R Commander works on Mac OS X, with two requirements:
1) Apple’s X11 is installed
2) The Tcl/Tk package (optional on the R dmg) is installed
Installing and running
Under Windows and (I think) Linux, simply type
install.packages("Rcmdr")
and select the appropriate mirror. R may say that additional packages need to be installed; allow them.
To run R Commander, type
library(Rcmdr)
and the GUI will appear.
Under Mac OS X, you must select Misc > Run X11 Server from the R GUI, then follow the above steps.
R packages are add-ons that increase the range of utility, analytical and other functions that R can handle.
Installation and use
The easiest way to install packages is to use the command
install.packages("PackageName")
This presupposes that you are connected to the internet, and requires that you select a CRAN mirror. It is also possible to download packages to your computer and install them locally — consult other R documentation for instructions.
To use a package that you have installed, tell R
library(PackageName)
Note: Double quotes around the package name are required when installing the package, but not when loading it for use in R.
Finding packages
The best place to find information about R packages is on CRAN. Although it is not necessary to visit CRAN in order to actually download the package, the website has a list of packages, usually containing a short description and a PDF file with package documentation.
Some useful packages:
- car (Companion to Applied Regression): One of the nicest features of this package is the recode() function, which makes it much easier to recode data.
- lmtest (Testing Linear Regression Models): This package contains many tests and functions for regression diagnostics.
- foreign: This package reads data saved in SAS, SPSS, Stata, Minitab and other formats into R.
- xtable: Makes LaTeX tables of linear models (and output from other statistical functions)
Several utilities exist for converting LaTeX documents to other formats, including RTF (rich text format, similar to Word’s DOC format) and HTML. There are many reasons why you might want to convert a LaTeX document to another format: collaboration with non-LaTeX users, decreasing download time when sharing online documents, etc.
Converting to RTF
The standard utility for converting LaTeX documents to RTF is latex2rtf. This utility does an outstanding job of making RTF files from TeX files, including indexes, references, headings, lists, tables and even equations.
- Windows users can install latex2rtf by following the link above. The Windows version consists of the application and a graphical frontend that eliminates the need to use the command line.
- Mac OS X users can install latex2rtf by using i-Installer, the same application that is used to install gwTeX (it is bundled with the MacTeX collection as well). To do so:
- Select “Known Packages i-Directory” from the i-Package menu.
- Select “LaTeX to RTF” from the i-Packages list on the right, then click on “Open i-Package”.
- In the i-Package window that appears, select “Install”.
Note: The Mac version does not include a graphical frontend. To use latex2rtf, open the terminal (Applications/Utilities/Terminal.app) and type “cd directory_of_tex_file; latex2rtf filename.tex”, or use one of the conversion AppleScripts described below.
Converting to HTML
There are more options for LaTeX to HTML conversion. HeVeA is probably the most straightforward cross-platform tool. Windows and Mac versions can be found on the page linked above. HeVeA attempts to use HTML to emulate mathematical notation, so its presentation is not as perfect as LaTeX’s, but it still creates high-quality translations. Installation (for either operating system) is a bit tricky, so I have created a separate tutorial on installing HeVeA.
Once HeVeA is installed, it is run from the command line. For example, the command
hevea testfile.tex
would produce testfile.html, a webpage version of the original LaTeX document.
Note: Windows users can download TeX Converter — a graphical frontend to HeVeA and several other TeX conversion utilities (not discussed here).
AppleScripts
Since neither HeVeA nor latex2rtf comes with a graphical frontend for the Mac, I have written two AppleScripts that allow users to convert LaTeX files to HTML or RTF without having to use the terminal.
- Convert_TS is a TeXShop macro. To install:
- From TeXShop, select “Open Macro Editor” from the “Macros” menu.
- In the Macro Editor, click the “New Item Button” and copy the contents of this text file into the provided box. Name the macro and click “Save.”
- Now selecting the Convert_TS macro from the Macros menu will supply a dialog that allows point-and-click conversion.
- Convert_TW is a TextWrangler AppleScript. To install,
- Download this script file.
- Copy the file to “Home Directory/Library/Application Support/TextWrangler/Scripts”.
- Now selecting “Convert_TW” from TextWrangler’s Script menu will supply a dialog that allows point-and-click conversion.
The Pew Research Center makes some of its data available online after a lag of about six months from survey completion.
Most of the data are saved in SPSS’s .sav format. Topics include public opinion on political and social issues.
The SendToR for TextWrangler AppleScript simply sends R routines to R from within TextWrangler. Aspects of the R program such as setting the working directory or creating a sink file are left to the user. If you want a Stata-style log, I suggest using the command:
sink("filename.txt",split=TRUE)
which will save the output to filename.txt but also display it in the R console.
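A minimal sketch of the full cycle, writing to a temporary directory so that nothing in your working directory is touched (the path and the summary() call are just for illustration):

```r
log_file <- file.path(tempdir(), "filename.txt")  # temporary location, for illustration

sink(log_file, split = TRUE)   # start logging; split = TRUE keeps the console output
print(summary(c(1, 2, 3, 4, 100)))
sink()                         # stop logging and close the file

# The file now holds the same text that appeared in the console
logged <- readLines(log_file)
print(logged)
```

Forgetting the closing sink() is the usual reason a console appears to go silent, so it is worth putting at the end of every routine.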
An older version of SendToR for TextWrangler, in which the sink file is automatically created and opened in TextWrangler after the routine is processed, is still available (I changed it because it never really worked correctly).
TeXShell is a simple front-end for Unix-style LaTeX engines on OS X. It essentially runs the LaTeX variant of your choice (LaTeX, PDFLaTeX, etc.) in the background without requiring any command-line work, but routes the output to a native window, so you can check for errors, etc. It is a free download from the maker of CMacTeX, a shareware LaTeX distribution for OS X.
This script will send a LaTeX file to TeXShell from within TextWrangler.
One nice feature of SAS is the ability to run programs in batch mode (without launching the SAS System). R has this facility to an extent via the source() command, but still requires that the application be running.
I have written a Batch R AppleScript that allows you to run an R routine without having to open R and type the source commands. Just drag and drop any R source file onto the DropR icon and (1) your routine will be executed and (2) a text file with the output from your routine will be written in the same directory.
Download DropR here.
Sorry, this only works on Mac OS X.
I took part of the code for this from the R for Mac OS X FAQ.
Some survey data are coded as open-ended responses. Rather than being asked to select from a range of preselected answers (very likely, somewhat likely, …), respondents are asked to provide unprompted answers, which are recorded verbatim.
The task usually falls on an analyst, student or research assistant to later assign broader categories to these open-ended responses. This spreadsheet-based technique can make the job a bit easier:
=IF(ISNUMBER(SEARCH("keyword";A1))=1;1;0)
This code returns a 1 if “keyword” is found in the cell A1 and a 0 if it is not. It can be applied for a number of keywords to determine, for example, whether a survey comment pertains to a particular subject.
The individual functions are:
- SEARCH(), which returns the position of the character where the string “keyword” begins in cell A1 (it returns an error if the string isn’t there). The search function is not case sensitive
- ISNUMBER(), which returns TRUE (equal to 1) if its argument is numeric and FALSE (equal to 0) otherwise; this absorbs the error that SEARCH() returns when the keyword is absent
This code can be changed and extended at will to mine your data. Obviously this is no substitute for reading textual data, but it could save a lot of time if your dataset is large.
Note. The code above is optimized for OpenOffice. To use it in Excel, use commas instead of semicolons.
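If the responses are already in R, the same flag can be sketched with grepl(). The comments and the keyword below are made up, and this is a rough equivalent of the spreadsheet formula, not a translation of it.

```r
# Made-up open-ended responses
comments <- c("Parking is too expensive",
              "More PARKING near campus",
              "The library closes too early",
              NA)

keyword <- "parking"

# Lower-casing both sides makes the match case-insensitive, like SEARCH();
# fixed = TRUE treats the keyword as literal text rather than a regular expression
flag <- as.integer(grepl(tolower(keyword), tolower(comments), fixed = TRUE))
print(flag)
```

Missing responses are flagged as 0 here, just as an empty cell would be in the spreadsheet version.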
I just tried out two new (to this site, that is) LaTeX frontends that both work very well.
Texmaker is a cross-platform frontend (OS X, Windows and Linux) that integrates seamlessly with both gwTeX and MikTeX. It offers document setup wizards, a simple table editor, document structure navigation, BibTeX management tools and many graphical shortcuts.
LaTeX Editor is a Windows-only application. It has fewer graphical shortcuts and wizards, but it offers extensive code completion which speeds up typing considerably. It also has a built-in previewer.
Survey data come in all shapes and sizes — some less conditioned for data analysis than others. One common data entry problem is multiple-valued records, in which more than one value is supplied for a variable. For example:
| Var1 | Var2 |
| 1, | 5, |
| 2,3 | 4, |
| 2, | 1,3 |
From a data processing standpoint, the above data seems useless. A better way to enter the data would be to have two columns for each variable: one for the first answer and one for the second.
This macro imports an Excel spreadsheet into SAS and lets the user define a number of subcolumns for a variable that contains multi-valued records, either numeric or string. It adds new variables for the subcolumns and, if the variable is numeric, a final column containing the average of all of the values reported for the variable. It exports the original data, plus the new variables, as a new Excel file.
- No, Excel isn’t the greatest way to store data. But it is one of the most common. This can easily be adapted to work with any plain text data format.
- Run the macro once for each variable that contains multi-valued records.
- Note: Every record (in the relevant variable) must be delimited with a comma, even if it isn’t multi-valued. See the example table above.
Download the macro here.
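For readers without SAS, the splitting step can be sketched in R. This is not the macro itself: the column is made up to match the example table above, and names such as n_sub and Var1_mean are my own.

```r
# Made-up column in the comma-delimited form shown in the table above
var1 <- c("1,", "2,3", "2,")

# Split each record at the commas; a trailing comma leaves no extra piece
parts <- strsplit(var1, ",", fixed = TRUE)

n_sub <- 2  # number of subcolumns, chosen by the user
# Take the first n_sub values of each record (padding with NA), one row per record
sub <- do.call(rbind, lapply(parts, function(p) as.numeric(p)[seq_len(n_sub)]))
colnames(sub) <- paste0("Var1_", seq_len(n_sub))

# Original column, the new subcolumns, and the average of the reported values
out <- data.frame(Var1 = var1, sub,
                  Var1_mean = rowMeans(sub, na.rm = TRUE))
print(out)
```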
I have put together a LaTeX quick reference card (PDF) to cover the basics: document structure, tables, figures, bibliographies, text formatting, lists, quotes and some math. Useful if you don’t use LaTeX that often.
What is BibTeX?
BibTeX is a file format and a program designed to work with LaTeX. The file format stores bibliographical information like author name, journal title, date, etc. The program incorporates entries stored in a BibTeX file (.bib) into LaTeX documents.
BibTeX databases
A BibTeX database is essentially a plain text file containing bibliography entries. A BibTeX file might look like this:
@article{Gettys90,
author = {Jim Gettys and Phil Karlton and Scott McGregor},
title = {The {X} Window System, Version 11},
journal = {Software Practice and Experience},
volume = {20},
number = {S2},
year = {1990},
abstract = {A technical overview of the X11 functionality. This is an update of the X10 TOG paper by Scheifler \& Gettys.}
}
What this means:
- In the first line, @article lets BibTeX know that the bibliographical entry is an article. Other entry types include book, phdthesis and unpublished (and others).
- Gettys90 is the identifier for that entry – it is a type of shorthand that will allow the users to refer to the entry quickly when writing a LaTeX document.
- Notice that multiple authors are separated by the word and.
BibTeX database management software
There are applications that provide a graphical interface for managing BibTeX databases. These applications are very useful, because the number of different entry types and fields (article/book/manual/etc. or title/journal/issue/note/etc.) is quite large.
- For Mac OS X users, there is BibDesk – bibdesk.sourceforge.net.
- For Windows users, there is BibEdit – http://www.iui.se/staff/jonasb/bibedit/.
These applications will allow you to input and edit bibliographical entries and save them as BibTeX databases that can be used in LaTeX documents.
Using BibTeX in a LaTeX document
In order to use your BibTeX database in a LaTeX document, you must essentially do three things:
1) Set the bibliography style. The standard is plain:
\bibliographystyle{plain}
Put this command after the \begin{document} command in your LaTeX document. Other styles include
- unsrt – The same as plain except entries are numbered based on when they are cited, not alphabetically by author.
- alpha – Similar to plain except instead of having numerical identifiers (e.g. [1]), labels are created based on the year of publication and the name of the author(s).
- abbrv – Names and journal titles are abbreviated.
2) Make citations. When you come to a passage in your text that you’d like to cite, insert the LaTeX command
\cite{ident}
where ident is the identifier you chose when either typing the database file by hand or using a graphical manager.
3) Tell LaTeX to make the bibliography. This happens at the end of the LaTeX document. Just type
\bibliography{bibfile}
where bibfile is the file bibfile.bib – your BibTeX database.
Running BibTeX
The hardest part is running BibTeX on your LaTeX file. This assumes that you have BibTeX installed already (most distributions come with it by default). The steps are:
- Run LaTeX on your .tex file – this will generate the .aux file that BibTeX needs to find the citations.
- Run BibTeX on the document – BibTeX reads the .aux file created in the previous step (from the command line: bibtex filename, without the extension). This can usually be done from your LaTeX frontend.
- Run LaTeX on your .tex file – this will create the bibliography section in the document, but will not insert the correct numbering
- Run LaTeX on your .tex file one more time – this step finishes everything; the references section will be created and all of the citations will be properly numbered.
Note: TeXShop for OS X includes an AppleScript that will execute these four steps automatically.
Note: This tutorial is available as a downloadable PDF file.
References
- The BibTeX Format. Dana Jacobsen. 1996. http://www.ecst.csuchico.edu/~jacobsd/bib/formats/bibtex.html.
- LaTeX Tricks: BibTeX. Peter Newbury. 1995. http://www.iam.ubc.ca/~newbury/tex/bibtex.html.
Introduction
R, S, SAS, Stata, SPSS, Eviews and other statistical and econometric packages all provide a way for users to interact with the program in “batch mode” — running prewritten routines in which sequences of commands are written and submitted to the program, producing all analytical output in one step.
There are several compelling reasons to take steps to ensure that the code that comprises these routines is well-organized:
- They are often worked on by multiple people
- They are often very complicated/long
- They are often reused on other projects.
The following tips will help you make sure that your code is in a usable condition for easy editing and use, collaboration, and integration in future projects.
Data coding
Although data coding doesn’t technically count as a programming step, several data coding practices can make sure that your data management and analysis is accurate (and as easy as possible).
- Code variables consistently. If you have a series of Yes/No variables, code them all using “yes” and “no”, “0” and “1”, or “y” and “n”, whichever works best for you. The key is to code them all consistently, so you don’t find yourself wondering which coding scheme you used for different variables. This can easily lead to mistakes.
- Watch cases. Some applications do not differentiate between upper and lower cases in string variables. Some do. A good practice is to use all lower-case for discrete string variables.
- Missing value codes. Use consistent missing value codes. Common codes include “na”, “-9” and “-999”. It is a good idea to use a separate code for missing values instead of leaving them blank. A separate code can clear up confusion regarding whether a data point is actually missing.
- NA/Don’t Know/Refused codes. For survey data, “refused” or “don’t know” are different than “missing” — the latter implies that no answer was given at all. It is a good idea to code these separately.
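In R, one way to follow the last two tips is to keep the reason for non-response in its own variable and then turn the special codes into NA before analysis. The codes below (-9, -8, -7) and the variable names are hypothetical; use whatever scheme your data follow.

```r
# Made-up survey item: -9 = missing, -8 = refused, -7 = don't know (hypothetical codes)
q1 <- c(3, -9, 5, -8, 2, -7, 4)

# Keep the reason for non-response in its own variable before blanking the codes
q1_status <- ifelse(q1 == -9, "missing",
             ifelse(q1 == -8, "refused",
             ifelse(q1 == -7, "dont_know", "answered")))

# Replace every special code with NA so it cannot leak into means or regressions
q1_clean <- q1
q1_clean[q1 %in% c(-9, -8, -7)] <- NA

print(table(q1_status))
print(mean(q1_clean, na.rm = TRUE))
```

Leaving a code like -9 in a numeric variable is an easy way to end up with a badly biased mean, which is why the conversion step matters.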
Variable naming
Many of the same rules of thumb that apply to data coding also apply to naming conventions within your code. It is important to have consistent naming conventions because doing so prevents confusion and reduces the likelihood of errors. Other tips include:
- Use a consistent naming type. Common types include:
Camelback: VariableName
Underscore: variable_name (Note: current versions of R allow the underscore in names, but older versions did not, which is why many R users write variable.name with a “.” instead)
- Watch out for case sensitivity. Some applications are case sensitive when it comes to variable names, and some aren’t. Using lower-case names across the board might be a good idea if you use multiple systems.
- Use easy-to-understand mnemonics. Some systems limit the length of variable names, which might require using mnemonics such as geoloc instead of geographical_location. Using these mnemonics is fine – it even saves time – but they should be easy to remember and consistent.
- Label your variables. If the system that you’re using provides such a capability, it is a good idea to add variable labels that connect variable mnemonics to a short description of the variable in the analytical output produced by the system. This reduces the burden of having to remember the names of many variable codes.
- Label your values. If you can, add labels to the values used. For example, with a coding scheme that uses “1” for less, “2” for the same and “3” for more, adding value labels reduces the chance that you’ll mix up what the numbers stand for.
- Keep a codebook. Especially if you are unable to use variable and value labels in your data analysis package, keep a codebook that records the variable names and the meanings of the values.
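In R, value labels are usually attached by turning the variable into a factor. The following is a sketch with made-up responses and a made-up three-point scale; the small codebook dataframe at the end is just one possible way to record the mapping.

```r
# Made-up responses on a hypothetical scale: 1 = less, 2 = the same, 3 = more
resp <- c(1, 3, 2, 2, 1, 3, 3)

# factor() attaches a label to each code; levels= fixes the order
resp_f <- factor(resp, levels = c(1, 2, 3),
                 labels = c("less", "the same", "more"))

print(table(resp_f))

# A small codebook mapping codes to labels, kept alongside the data
codebook <- data.frame(variable = "resp",
                       value    = c(1, 2, 3),
                       label    = levels(resp_f))
print(codebook)
```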
Comment your code
Every system that allows coded input also has a special character that indicates a comment. For example:
# this is an R comment
* this is a SAS comment;
Use comments to
1) Provide structure to your documents – setting out section headers that explain what a large part of the program does. Example:
# # # # # input the data # # # # #
2) Add explanations of important pieces of code:
x=y # set x equal to y
Comments make it easy for you to find parts of code, and for you and others to quickly understand your code in the future.
Note: This tutorial is available as a downloadable PDF file.
There are many how-to’s and books on using R, but most are aimed at experienced users who have already familiarized themselves with the basics of R. Getting started can be difficult for those with little experience using statistical software, those used to graphical software, or for those who simply don’t have access to the correct documentation.
This tutorial (PDF) shows beginners how to load data into R, recode it if necessary, produce basic descriptive statistics, correlations and regressions, and save their work.
It only scratches the surface of what R can do. For more advanced use, see
SAS is one of the standard applications for statistical analysis (alongside S-Plus and Stata). It is an enormous and expensive application that consists of the base system (SAS/Base) and optional packages (SAS/Stat, SAS/ETS, …). A limited version of SAS, Learning Edition, is available at a reduced price.
I have produced a brief tutorial (PDF) on its use in survey research.
Note: SAS is a complicated application with thousands of functions about which countless volumes of documentation have been written. This tutorial barely scratches the surface; it is designed to get a new user up and running by importing some data and running some basic tables and regressions. For more comprehensive information about SAS, see Resources to help you learn and use SAS, from UCLA’s Academic Technology Services.
The default mode of interaction with R is the console’s command line. This can be a convenient way of using R — it’s interactive, so you can make real time changes to your data and models. However, there are times when it might be more convenient to write a routine beforehand, run it in R, then look at the output separately. This is how many other statistical applications work.
To run R routines in this manner:
1) Tell R where to direct the output
sink("FileName.txt")
- You can save the output as any file type; just open it in your text editor
- To save time, you can use the command setwd("Path/To/Your/Files/") so that you don’t have to type the full path to your sink and source files
2) Write your source file in any text editor and save it as a .R file. The source file will usually include
- Loading data: An R workspace image or a .csv file, for example
- Recoding and manipulating data: The car package includes the recode function that makes this step substantially easier
- Statistical tests, descriptive statistics and other analytics
- Note: use #’s to comment your code
3) Tell R to run the source file
source("SourceFile.R", echo=TRUE)
- Unless you use the echo=TRUE option, only output that your code prints explicitly is written, so your sink file may be blank
- Optionally, you can include the setwd and sink commands from part (1) in the beginning of your source file. Then just use the above command every time you want to run your R routine
There are several advantages to running predefined R routines instead of using the program interactively. You get a text file that represents all of your output; you can easily reuse code; and you have a built-in record of everything that you did, so you can easily find bugs.
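The three steps above can be sketched end to end as follows. This is only an illustration: it uses a temporary directory and a tiny made-up source file so that it runs anywhere, and it adds a final sink() call with no arguments, which stops the redirection so that later output returns to the console.

```r
# Work in a temporary directory so the sketch leaves nothing behind.
work_dir    <- tempdir()
source_file <- file.path(work_dir, "SourceFile.R")
output_file <- file.path(work_dir, "FileName.txt")

# Step 2: write a small source file (normally you would do this in a text editor).
writeLines(c(
  "# average a small made-up vector",
  "x <- c(2, 4, 6, 8)",
  "mean(x)"
), source_file)

# Step 1: direct console output to the sink file.
sink(output_file)

# Step 3: run the source file, echoing commands and their results.
source(source_file, echo = TRUE)

# Stop redirecting output.
sink()

# Read the captured output back in to check it.
captured <- readLines(output_file)
print(captured)
```

In everyday use you would replace the temporary paths with your own file names, as in the commands above.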
About packages
LaTeX packages are sets of commands that, when invoked in the preamble of a LaTeX document, extend LaTeX’s functionality or force it to style documents according to special guidelines. The geometry package, for example, allows you to change the margins of a LaTeX document. A good place to find packages is CTAN (the Comprehensive TeX Archive Network).
Installing (most) packages
1) Download the files and uncompress them if they are zipped.
2) Run LaTeX on the .ins file, either by opening the .ins file in your frontend or via the command line (ex: latex packagename.ins). This creates various files, such as .cls, .sty, .cfg and .drv files.
2a) Optional: Run LaTeX on the .dtx file to create the manual.
3) Move the files to somewhere in your LaTeX tree where they can be found by the program. This is the hardest part, because it varies depending on the LaTeX distribution.
- For gwTeX on OS X, make the folder “Library/texmf” in your Home directory and put the files there (they can be organized into folders or just dropped into the directory).
- For MiKTeX on Windows, put the files in “C:\texmf\tex\latex”, again putting them in a subdirectory if desired.
Note: Not all of the files are necessary; usually LaTeX is just looking for the .sty or .cls files. However, this may vary, so it’s a good idea to move all of the files into the appropriate folder.
4) (This step may also be optional, depending on your distribution) Tell LaTeX that you have added the files.
- This is not necessary if you are using gwTeX.
- If you are using MiKTeX, this can be done using the MiKTeX Options application, which is available through the Start Menu: Open the application and click the Refresh Now button under the General tab.
Other installation methods
1) The method described above applies to almost all distributions. However, some flavors of LaTeX offer simpler ways to install packages. For example, MiKTeX users can open the MiKTeX Package Manager and select an online repository (such as CTAN) to automatically install/uninstall some packages. i-Installer, the graphical application that installs gwTeX, also offers automatic installation of a limited number of packages. It is still a good idea to know how to do it the long way, since not all packages will be available via these easy package managers.
2) Some packages will come unzipped in the form of a .sty or .cls file, usually with some accompanying documentation. These packages do not need to be processed with LaTeX before they are moved to the appropriate texmf subdirectory.
Using packages
Once a package is installed, it can be invoked in the preamble of a LaTeX document. For example, to use the geometry package to set the margins of a document to one inch, type:
\documentclass{article}
\usepackage[margin=1in]{geometry}
\begin{document}…
The general syntax is “\usepackage[options]{packagename}”. It is important to read the documentation provided with each package, because every package has its own options and syntax, which can be complicated compared to the core LaTeX markup language.
MacTeX is the ultimate LaTeX toolkit for Mac OS X. Like ProTeXt, it is a bundle of
- A TeX distribution (gwTeX + XeTeX)
- A frontend (TeXShop)
- Useful packages (like Ghostscript)
- An application manager (i-Installer, for updating your distribution)
- Other useful tools (like Excalibur for spell checking and BibDesk for bibliography creation and management)
It comes with a Mac OS X installer, so setup is very easy. To add new packages, make a folder in your home library called texmf and put new files there (this ensures that your packages don't get changed if you use i-Installer to update TeX).
MacTeX is completely free and open source.
ProTeXt is a new LaTeX distribution for Windows. It basically includes MiKTeX, TeXnicCenter, a trial of WinEdt (a commercial LaTeX frontend) and easy-to-follow setup instructions. This is probably the easiest way to get started with LaTeX on Windows.
GNU Octave is a high-level language, primarily intended for numerical computations. It provides a convenient command line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with Matlab. It may also be used as a batch-oriented language.
Octave runs natively on Linux and Windows. Octave can be installed on OS X through Fink, but there is no native version.
Michael Creel has made an open source graduate econometrics textbook available online. It is open source because the document is editable and may be incorporated into other works and changed, as long as the resulting document is also made available.
The National Institute of Standards and Technology has a Handbook of Engineering Statistics that includes examples. The book can be viewed online, and individual chapters can be downloaded in PDF format.
These add-ins augment Excel’s functionality – Bootstrap, Dummy Dependent Variable, Monte Carlo Simulation, P Value Calculator, Gauss Newton DDV, Histogram OLS, Regression.
No, Excel isn’t the ideal tool for data analysis, but it is installed on almost every computer, so these might come in handy.
The Journal of Applied Econometrics makes all nonconfidential data used in its articles available online.
People who do a lot of programming tend to grow attached to their text editors. Features like advanced search and replace, auto indent, column selection, syntax highlighting and macros make repetitive editing much faster and easier. Below are some editors that I have found useful:
OS X
- TextWrangler
- Smultron (open source)
Windows
There are, of course, countless editors, and different programmers will have different needs depending on their tastes and programming languages. Hardier programmers may wish to try Emacs or Vi, both of which are available under Linux, OS X and Windows.
My original list of free software for researchers, for those who bookmarked this page previously. I hope to someday write tutorials or references for many of these applications.
Census data from multiple years are freely available from the University of Minnesota’s Population Center in the form of IPUMS.
Christopher Foote and Christopher Goetz of the Boston Fed have posted data and programs from their paper “Testing Economic Hypotheses with State-Level Data: A Comment on Donohue and Levitt” online.
This paper is a response to the controversial hypothesis, popularized in the book Freakonomics, that the legalization of abortion contributed to the drop in crime rates in the 90s.
Gretl, the GNU Regression, Econometrics and Time-Series Library, is a graphical statistics and econometrics package that runs on Linux, Windows and OS X (though I’ve had limited success getting it to work on my Mac under X11).
Gretl was developed by Allin Cottrell of Wake Forest University. It is an open-source package that does cross-section and time-series econometrics, as well as basic statistics.
Gretl uses a session-based format that allows users to store and retrieve models, tests, data and notes. It supports plain text and LaTeX output of equations and tables. Gretl also has its own scripting language, so routines can be written in advance and run as a batch instead of conducting each test interactively.
Documentation can be found on the Gretl website.
Update: I did get Gretl to work on OS X Panther. Here are the steps:
1) Make sure Apple’s X11 is installed on your system (/Applications/X11.app)
2) Download and mount the Gretl for OS X disk image
3) Move Gretl_Folder somewhere on your system — make sure you move it to where you’ll want to run the application from, because moving it after the first launch creates problems (example: /Applications/Gretl_Folder)
4) Installation is complete. To use the application, open X11.app and navigate to the bin folder within the Gretl_Folder and type “./gretl.sh” – for example:
cd /Applications/Gretl_Folder/bin
./gretl.sh
I would write an AppleScript to do this, but it looks like X11 isn’t scriptable. And since Gretl for OS X uses a shell script and not a binary file, it can’t be launched from the Terminal.
But it does work, and it works well. Point-and-click graphics, regressions, diagnostics and corrections and more!
Gretl on OS X using X11
Data analysis is of little value if the analyst can’t convey the results. Statistical, econometric and mathematical writing is more demanding of word processing and typesetting systems than most business or academic prose. Mathematical formulae, graphs, figures, numerous tables and other scientific information can be difficult to include when using standard software.
For professional-quality documents, there are basically two systems: WYSIWYG (what you see is what you get) systems and LaTeX-based systems.
LaTeX
LaTeX (pronounced “lay-tech”), originally written by Leslie Lamport, is a macro package for Donald Knuth’s TeX typesetting system. LaTeX is a markup language. Users type their text, interspersed with document structure commands, and process it with the LaTeX system to produce a printable or screen-readable file (.ps, .dvi, .pdf, etc.).
LaTeX removes most of the manual formatting work from typesetting. LaTeX documents are plain text documents that contain special commands. Users specify whether text is a title, a table, a list, etc., and LaTeX processes the commands to create formatted, publication-ready documents. LaTeX also makes typesetting mathematical formulas very easy by providing a syntax to represent most symbols and operators.
LaTeX is open-source and cross platform (OS X, Linux and Windows). LaTeX consists of a TeX engine and an optional frontend that makes it easier to send LaTeX files to the engine. On any platform, there are multiple LaTeX distributions that may differ slightly in terms of how they format documents.
- Mac OS X users can download the gwTeX distribution, which uses i-Installer, a graphical installer. TeXShop and iTeXMac are OS X native frontends that integrate well with gwTeX.
- Windows users can download the MiKTeX distribution, the default engine for TeXnicCenter.
- Linux users can install teTeX through their package management system.
For a great tutorial on using LaTeX, download and read The Not So Short Introduction to LaTeX (PDF).
For more information, including links to alternative LaTeX distributions, visit the LaTeX project’s website.
WYSIWYG Systems
Most users are probably more familiar with WYSIWYG systems. The most common of these is Microsoft Word. Although word processors are not as smart as LaTeX at automatically formatting documents, most offer some advanced features, like automatic section, table and figure numbering. They also make it easier to create non-standard documents (lecture notes, presentations, etc.). In LaTeX, such documents require special packages or classes, which are harder to create than LaTeX documents themselves.
Typesetting mathematical formulas is somewhat more indirect in WYSIWYG word processing systems. Word comes with an optional equation editor, as does WordPerfect, though the interface is slightly cumbersome, especially for documents with many equations. Several commercial third-party applications, such as MathType and MathMagic, make mathematical typesetting somewhat easier.
Alternatively, OpenOffice.org, an open-source office suite, uses its own LaTeX-like syntax to allow quick equation editing in a familiar word processing environment. OpenOffice is also cross-platform, though OS X users should try NeoOffice, an OS X native version of OpenOffice, because the direct Mac version relies on X11. OpenOffice also includes spreadsheet, presentation and drawing applications, and uses open-source file formats, so documents are not locked in any proprietary format.
For more information on mathematical typesetting in OpenOffice/NeoOffice see:
I have put together a quick reference (PDF) of common R commands for data management and basic statistical tests.
R is an open-source statistics package. It is highly extensible — the core package comes with many statistical routines, R’s functionality can easily be extended by installing optional packages, and the R language can be used to write new packages. From the R Project website:
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
R is widely used and produces reliable results. It is also cross platform: it runs natively on Mac OS X, Linux and Windows.
R can be downloaded from the website for The R Project for Statistical Computing. To get started using R, visit:
The basics of R are easy to learn for users with some experience using syntax-based statistical software. Nevertheless, R can be extended in many ways, so it is a good idea to consult multiple references when learning or researching functions.




