# STATA-SDTL Converter

A tool for parsing Stata data transformation scripts and converting them to SDTL. Written in Clojure with Ring and [Instaparse](https://github.com/Engelberg/instaparse).

#### Stata commands supported so far:

Mutating commands

- generate
- replace
- label
- egen
- use
- save
- format
- drop
- keep
- rename
- recode
- reshape
- label variable/values
- collapse 

The following commands are understood, but cannot be expressed in SDTL yet:

- order 
- merge
- append

Control flow statements

- local
- foreach ... of varlist ...
- foreach ... of numlist ...
- foreach ... of local ...
- foreach ... in ... (supports numbers, names and string literals)
- forvalues
- while

## Usage

Type `lein run` to start the server. It listens on port 40000.

To run it as a command-line tool, type `lein uberjar` to build the application, then copy the file `stata2sdtl-standalone.jar` from the `target` folder to `/usr/local/lib/`.
Then copy the file `stata2sdtl` from the `scripts` folder to some folder on your path, e.g. `/usr/local/bin/`.

On the command line, `stata2sdtl` takes one or two arguments. The first (required) is the name of a text file containing a whitespace-separated list of variable names. The second (optional)
is a Stata program file. If it is omitted, the tool reads the Stata program from stdin.

## How does it work?

This converter is built around the [Instaparse](https://github.com/Engelberg/instaparse) parser generator. It's not simply
a parser, but actually an interpreter which relies on two distinct parsers with corresponding grammars. The reason for this will become
clear as each step is described.

The first step is to read a variable list file and convert it to a map of datafile names to lists of variable names. This is necessary as Stata commands
may refer to variables in shorthand notation containing intervals of variables and wildcards.
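
To illustrate why the variable list is needed, here is a minimal Python sketch of shorthand expansion. It is not the converter's Clojure code: the function name `expand_varlist` and the sample variable names are hypothetical, and it only assumes that a range is written `first-last` and that `*` and `?` act as wildcards.

```python
from fnmatch import fnmatchcase


def expand_varlist(spec, variables):
    """Expand a Stata-style varlist against the ordered variables of a dataset."""
    result = []
    for token in spec.split():
        if "-" in token:
            # A range such as "age-income" covers every variable between the
            # two endpoints, in dataset order.
            first, last = token.split("-", 1)
            start, end = variables.index(first), variables.index(last)
            matches = variables[start:end + 1]
        else:
            # Wildcards: "*" matches any run of characters, "?" exactly one.
            # A plain name simply matches itself.
            matches = [v for v in variables if fnmatchcase(v, token)]
        for name in matches:
            if name not in result:  # keep each variable only once
                result.append(name)
    return result


# Hypothetical dataset variables, in file order.
variables = ["id", "age", "sex", "income", "q1", "q2", "q3"]
print(expand_varlist("age-income q*", variables))
# ['age', 'sex', 'income', 'q1', 'q2', 'q3']
```

Neither the range nor the wildcard can be resolved from the script alone, which is why the variable names and their order have to be supplied up front.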

The next step is to convert the entire Stata script to a list of discrete statements. Since Stata is a REPL type language, things like loop declarations
must be treated as statements in their own right. This step would be straightforward if Stata used a single statement delimiter, but this is not the case. A Stata
script may switch between newlines and `;` delimiters. Therefore we need a parser (declared in `stata2sdtl.breakdown-parser`) to tokenize the script into statements. This
parser cannot go any further, because a statement may still contain unexpanded macros and is therefore not yet parsable.
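
The delimiter problem can be sketched in a few lines of Python. This is a hand-written scanner for illustration only, not the Instaparse grammar in `stata2sdtl.breakdown-parser`. The function name `split_statements` is hypothetical, and the sketch assumes the script switches delimiters with `#delimit ;` and `#delimit cr`. It ignores comments, `///` line continuations and semicolons inside string literals.

```python
def split_statements(script):
    """Split a Stata script into statements, honouring #delimit switches."""
    statements = []
    semicolon_mode = False
    buffer = ""  # text of a statement not yet terminated in ";" mode
    for line in script.splitlines():
        stripped = line.strip()
        if stripped.startswith("#delimit"):
            # "#delimit ;" selects semicolons, anything else (e.g. "cr")
            # selects newlines. Unterminated text is dropped in this sketch.
            semicolon_mode = stripped[len("#delimit"):].strip().startswith(";")
            buffer = ""
            continue
        if not semicolon_mode:
            # Newline mode: every non-empty line is one statement.
            if stripped:
                statements.append(stripped)
            continue
        # Semicolon mode: a newline is just whitespace, so keep collecting
        # text and emit a statement each time a ";" is seen.
        buffer += " " + stripped
        while ";" in buffer:
            statement, buffer = buffer.split(";", 1)
            if statement.strip():
                statements.append(statement.strip())
    return statements


script = """generate x = 1
#delimit ;
replace x = 2
    if y > 0 ;
drop z ; keep x y ;
#delimit cr
rename x w
"""
print(split_statements(script))
# ['generate x = 1', 'replace x = 2 if y > 0', 'drop z', 'keep x y', 'rename x w']
```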

In the next step, statements from the previous step are processed one by one. Each statement may contain Stata macros which must be expanded in order to
form syntactically valid statements which are then parsed by the main parser. A macro must have been declared earlier in the script, otherwise the script will be
effectively unreadable. If a statement is of a kind that alters a dataset, an SDTL command is added to the output. Control flow statements are
interpreted as control flow rather than emitted as SDTL commands.
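
As a rough illustration of macro expansion, the Python sketch below substitutes local macro references before a statement would be handed to the main parser. It is an assumption-laden stand-in for the converter's own logic: the names `expand_macros` and `MACRO_REF` are hypothetical, only local macros are handled, and a reference is assumed to be written as a backtick, the macro name, and a closing apostrophe.

```python
import re

# A local macro reference: backtick, name, apostrophe.
MACRO_REF = re.compile(r"`(\w+)'")


def expand_macros(statement, macros):
    """Replace each macro reference with the macro's current contents."""

    def lookup(match):
        name = match.group(1)
        if name not in macros:
            # Mirrors the rule that a macro must be declared before use.
            raise KeyError(f"macro {name!r} used before it was declared")
        return macros[name]

    # Repeat until nothing changes, so that nested references such as
    # `x`i'' are resolved from the inside out.
    previous = None
    while previous != statement:
        previous, statement = statement, MACRO_REF.sub(lookup, statement)
    return statement


macros = {"i": "2", "vars": "age income"}
print(expand_macros("generate total`i' = q`i' * 2", macros))
# generate total2 = q2 * 2
```

Only after this substitution does the statement become ordinary Stata syntax that a grammar can recognise.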

Because of the macro complication, loops must be converted to a flat list of statements, since macros inside the loop body must be expanded according
to their contents for each iteration. A loop with 3 statements over 10 iterations will thus be expanded to 30 sequential statements. Nested loops may result
in a large number of statements.
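
The flattening can be sketched in Python as follows. This is a simplified illustration, not the converter's implementation: `unroll_forvalues` is a hypothetical helper, the loop body is assumed to be already split into statements, and only a `forvalues`-style integer range is covered.

```python
def unroll_forvalues(macro, start, stop, body):
    """Flatten a forvalues-style loop into a sequential list of statements."""
    flat = []
    for value in range(start, stop + 1):  # Stata ranges include the end value
        for statement in body:
            # Substitute the loop macro with the value for this iteration.
            flat.append(statement.replace(f"`{macro}'", str(value)))
    return flat


# A hypothetical loop body with 3 statements, run for i = 1..10.
body = [
    "generate sq`i' = q`i' * q`i'",
    "label variable sq`i' \"Squared q`i'\"",
    "drop q`i'",
]
flat = unroll_forvalues("i", 1, 10, body)
print(len(flat))
# 30
print(flat[0])
# generate sq1 = q1 * q1
```

Each flattened statement is free of the loop macro and can be evaluated like any top-level statement.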

When the last statement has been evaluated, the application terminates and outputs SDTL.

*to be continued*

## License

Copyright © 2016 NSD