Getting Started
Open a terminal. Move to the directory you want to contain the Fieldspring directory, then clone the repository:
git clone https://github.com/utcompling/fieldspring.git
Set the environment variable FIELDSPRING_DIR to point to Fieldspring's directory, and add $FIELDSPRING_DIR/bin to your PATH.
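For example, in bash (the directory shown is just an illustration; you would typically add these lines to your shell startup file):
export FIELDSPRING_DIR=/path/to/fieldspring
export PATH=$PATH:$FIELDSPRING_DIR/bin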
Compile Fieldspring like this:
fieldspring build compile
Move to this directory:
cd $FIELDSPRING_DIR/data/models
Then run:
./getOpenNLPModels.sh
This should download the files en-ner-location.bin, en-token.bin, and en-sent.bin.
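You can quickly confirm the models are in place by listing that directory (the expected filenames are the three given above):
ls $FIELDSPRING_DIR/data/models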
Run the script called download-geonames.sh (in $FIELDSPRING_DIR/bin). This will put the correct version of the GeoNames gazetteer (a file called allCountries.zip) into $FIELDSPRING_DIR/data/gazetteers. It is important that you use this method to get GeoNames, as even slightly different versions will cause results to change.
Once you've obtained the correct allCountries.zip, import the gazetteer for use with Fieldspring by running this:
fieldspring --memory 8g import-gazetteer -i $FIELDSPRING_DIR/data/gazetteers/allCountries.zip \
-o geonames-1dpc.ser.gz -dkm
You should have a directory (we'll call it /path/to/trconllf/xml/) containing the TR-CoNLL corpus in XML format, with the subdirectories dev/ and test/ for each split. Ideally you should have the fixed version (trconllf) rather than the original (trconll); the fixed version corrects various errors in the latitude and longitude coordinates. To import the test portion to be used with Fieldspring, run this, making use of the gazetteer serialized in the previous step:
fieldspring --memory 8g import-corpus -i /path/to/trconllf/xml/test/ -cf tr -gt \
-sg /path/to/geonames-1dpc.ser.gz -sco trftest-gt-g1dpc.ser.gz
You should see output that includes this:
Number of documents: 315
Number of word tokens: 67572
Number of word types: 11241
Number of toponym tokens: 1903
Number of toponym types: 440
Average ambiguity (locations per toponym): 13.68891224382554
Maximum ambiguity (locations per toponym): 857
Serializing corpus to trftest-gt-g1dpc.ser.gz ...done.
This will give you the version of the corpus with gold toponym identifications. Run this to get the version with NER identified toponyms:
fieldspring --memory 8g import-corpus -i /path/to/trconllf/xml/test/ -cf tr \
-sg /path/to/geonames-1dpc.ser.gz -sco trftest-ner-g1dpc.ser.gz
Throughout this guide, it is important that you use the same filenames as those shown (e.g. trftest-gt-g1dpc.ser.gz) so that the scripts that run the experiments work properly.
Download and unpack the original Perseus 19th Century American corpus found here: http://www.perseus.tufts.edu/hopper/opensource/downloads/texts/hopper-texts-AmericanHistory.tar.gz
Download the Dyer KML file containing the location annotations for this dataset here: http://dsl.richmond.edu/emancipation/data-download/
Download the file containing Getty TGN annotations here: http://vocab.getty.edu/dataset/tgn/explicit.zip
Unzip this file and delete everything except for TGNOut_Coordinates.nt.
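A sketch of the download and unpacking steps using wget (the Dyer KML must be fetched manually from the page linked above; later steps refer to it as reviseddyer20120320.kml):
# Perseus 19th Century American corpus
wget http://www.perseus.tufts.edu/hopper/opensource/downloads/texts/hopper-texts-AmericanHistory.tar.gz
tar xzf hopper-texts-AmericanHistory.tar.gz
# Getty TGN annotations
wget http://vocab.getty.edu/dataset/tgn/explicit.zip
unzip explicit.zip
# now delete everything extracted from explicit.zip except TGNOut_Coordinates.nt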
Run the following script (located in $FIELDSPRING_DIR/bin):
prepare-cwar.sh -c /path/to/original/cwar/xml/ -k /path/to/reviseddyer20120320.kml \
-t /path/to/TGNOut_Coordinates.nt -g $FIELDSPRING_DIR/geonames-1dpc.ser.gz -o /path/to/cwar/xml
This takes the CWar corpus in its original XML format, combines it with the Getty TGN annotations, the Dyer KML annotations, and the gazetteer, and generates a new XML corpus that follows the same conventions as the TR-CoNLL corpus.
Once you have the CWar corpus in the correct format in a directory (we'll call it /path/to/cwar/xml/) with subdirectories dev/ and test/ for each split, import the test portion as follows:
fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test -cf tr -gt \
-sg $FIELDSPRING_DIR/geonames-1dpc.ser.gz -sco /path/to/cwartest-gt-g1dpc-20spd.ser.gz -spd 20
where /path/to/cwartest-gt-g1dpc-20spd.ser.gz is where the serialized CWar corpus will be written.
(To decode the arguments: -i = input corpus; -cf tr = corpus format TR-CoNLL; -gt = use gold toponyms (if left out, the OpenNLP named entity recognizer will be run to find toponyms); -sg = serialized gazetteer (input); -sco = serialized corpus output; -spd = sentences per document, i.e. the documents of the corpus, which are whole books, will be split up into smaller documents of a fixed number of sentences.)
This will give you the version of the corpus with gold toponym identifications. Run this to get the version with NER identified toponyms:
fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test -cf tr \
-sg $FIELDSPRING_DIR/geonames-1dpc.ser.gz -sco /path/to/cwartest-ner-g1dpc-20spd.ser.gz -spd 20
(This is identical except that it omits -gt and writes to a -ner output filename.)
Note: There is a script download-wiki-data.sh to download some WISTR models and Wikipedia log files. Currently these are based on an older version of Wikipedia (from February 2012?) and the files are incorrectly named; this should be fixed.
The following steps require the use of TextGrounder. Download and set it up, which will require you to set the TEXTGROUNDER_DIR variable and put $TEXTGROUNDER_DIR/bin on your PATH.
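For example, in bash (the checkout location is just an illustration):
export TEXTGROUNDER_DIR=/path/to/textgrounder
export PATH=$PATH:$TEXTGROUNDER_DIR/bin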
Download a version of Wikipedia, something like this:
download-preprocess-wiki enwiki-20131104
This will take a long time, perhaps up to 24 hours. It will create a subdirectory under the current directory with the name of the Wikipedia tag you specify, e.g. enwiki-20131104 for the English Wikipedia dump of November 4, 2013. The most important file inside this directory is enwiki-20131104-permuted-training.data.txt. You can compress this file using bzip2 or gzip if needed.
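For example, compressing with bzip2 produces the .bz2 filename used in the commands later in this guide:
bzip2 enwiki-20131104-permuted-training.data.txt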
After this you will still need to generate the Wikipedia raw-text file. Change to the directory containing the preprocessed Wikipedia (e.g. enwiki-20131104) and run e.g.
preprocess-dump enwiki-20131104 coord-words
This might take an hour or two. It will create a file called e.g. enwiki-20131104-permuted-text-only-coord-documents.txt. You can compress this file if needed. This file needs to be further processed by Fieldspring, e.g.
fieldspring run opennlp.fieldspring.tr.app.FilterGeotaggedWiki \
-c /path/to/enwiki-20131104-permuted-training.data.txt.bz2 \
-w /path/to/enwiki-20131104-permuted-text-only-coord-documents.txt.bz2 \
> enwiki-20131104-permuted-text-training.txt
This might take two or three hours.
To run TRIPDL and TRAWL you will also need to create an appropriate TextGrounder corpus and run TextGrounder to get a log file. The TextGrounder corpus to be created is a combination of the Wikipedia training set and dev/test files taken from the CWAR or TR corpora converted into TextGrounder format. Conversion of the latter is as follows:
fieldspring run opennlp.fieldspring.tr.app.ConvertCorpusToUnigramCounts \
-sci /path/to/cwartest-gt-g1dpc-20spd.ser.gz \
> cwartest-gt-g1dpc-20spd-test.data.txt
This runs fairly quickly.
You then need to create the corpus, as follows:
- Create a directory, e.g. enwiki-20131104-cwartest-gt-g1dpc-20spd.
- Put in it the files enwiki-20131104-permuted-training.data.txt.bz2 and enwiki-20131104-permuted-training.schema.txt from the preprocessed Wikipedia corpus.
- Put in it the file cwartest-gt-g1dpc-20spd-test.data.txt, created above (optionally compressed).
- Put in it a corresponding schema file cwartest-gt-g1dpc-20spd-test.schema.txt. To create this file, copy the file enwiki-20131104-permuted-training.schema.txt and change the word training to test in the line that reads split training, so that it reads split test (note: there should be a TAB character between the words, not spaces). A command-line sketch of these steps appears after this list.
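A sketch of the four steps above, assuming the preprocessed Wikipedia corpus lives in a sibling directory named enwiki-20131104 and that the schema file contains a line of the form split<TAB>training (both are assumptions; adjust paths to your own layout):
mkdir enwiki-20131104-cwartest-gt-g1dpc-20spd
cd enwiki-20131104-cwartest-gt-g1dpc-20spd
# Wikipedia training data and schema (source paths are assumptions)
cp ../enwiki-20131104/enwiki-20131104-permuted-training.data.txt.bz2 .
cp ../enwiki-20131104/enwiki-20131104-permuted-training.schema.txt .
# test-split data file created above
cp ../cwartest-gt-g1dpc-20spd-test.data.txt .
# copy the schema, changing "training" to "test" on the "split" line;
# $'\t' gives a literal TAB in bash, and the exact schema layout is an assumption
sed $'s/^split\ttraining/split\ttest/' enwiki-20131104-permuted-training.schema.txt \
  > cwartest-gt-g1dpc-20spd-test.schema.txt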
For the dev set, the file created above would be named cwardev-gt-g1dpc-20spd-dev.data.txt. The schema file would be named similarly but ending in .schema.txt, and would have split dev in it (with a TAB character between the words).
To run TextGrounder to create a log file, run it as follows:
tg-geolocate /path/to/enwiki-20131104-cwartest-g1dpc-20spd \
--print-results --print-results-as-list --print-knn-results \
--num-top-cells 100 --dpc 1 --eval-set test \
> enwiki-20131104-cwartest-g1dpc-20spd-100-nbayes-dirichlet.log 2>&1
bzip2 enwiki-20131104-cwartest-g1dpc-20spd-100-nbayes-dirichlet.log
(This section can be skipped if one simply uses the classifiers already included in the download above.)
For the WISTR training instances relevant to the test split of TR-CoNLL, run the following:
fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRFeatureExtractor \
-w /path/to/enwiki-20131104-permuted-text-training.txt \
-c /path/to/enwiki-20131104-permuted-training.data.txt.bz2 -i /path/to/trconllf/xml/test/ \
-g /path/to/geonames-1dpc.ser.gz -s $FIELDSPRING_DIR/src/main/resources/data/eng/stopwords.txt \
-d /path/to/wistr-models-enwiki-20131104-trftest-gt/
where /path/to/wistr-models-enwiki-20131104-trftest-gt/ is the directory to which the training instances will be written.
To train the models given the training instances, run this:
fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRMaxentModelTrainer /path/to/wistr-models-enwiki-20131104-trftest-gt/
To run the experiments, run the following script:
runexps.sh trf test gt /path/to/trconllf/xml
The script takes four arguments:
- Which corpus to evaluate on (either trf or cwar, although you can substitute your own corpus here, e.g. lgl, if you have the appropriate files with the correct names)
- Which split to evaluate on (either dev or test)
- Which toponym identification method to use (either gt for gold toponyms or ner for toponyms detected by a named entity recognizer)
- The path to the directory containing the prepared corpus in XML format, which contains dev/ and test/ subdirectories.
This requires that the following files be present in the current directory, with the following names:
- The appropriate serialized corpus file, e.g. cwartest-gt-g1dpc-20spd.ser.gz.
- The appropriate TextGrounder log file, e.g. enwiki-20131104-cwartest-g1dpc-20spd-100-nbayes-dirichlet.log.bz2.
- The appropriate WISTR classifier model directory, e.g. wistr-models-enwiki-20131104-cwartest-gt.
- For running LISTR, the appropriate LISTR classifier model directory, e.g. listr-models-enwiki-20131104-cwartest-gt.
- For running WISTR+LISTR, the appropriate WISTR+LISTR classifier model directory, e.g. wistrlistr-models-enwiki-20131104-cwartest-gt.
If all of these files are in place, this should output something like this:
\oracle & 104.57995807879772 & 19.828158539411007 & 1.0
\rand & 3914.634055985425 & 1412.4048552451488 & 0.3348197696023783
\population & 216.14728454090616 & 23.103466226857382 & 0.8099219620958752
\spider & 2689.7013998421176 & 982.4361524584441 & 0.49182460052025273
\tripdl & 1494.1413395381906 & 29.258599245838536 & 0.6198439241917503
\wistr & 279.05523246633146 & 22.579446357728344 & 0.8232998885172799
\wistr+\spider & 430.17546527897343 & 23.103466226857382 & 0.8182831661092531
\trawl & 235.41656899283578 & 22.579446357728344 & 0.81438127090301
\trawl+\spider & 297.1435368979808 & 23.103466226857382 & 0.806577480490524
The columns shown are mean error in kilometers, median error in kilometers, and accuracy. If "ner" is used as the toponym identification method, the three columns that will be output are precision, recall, and F-score. The format is meant to be pasted into a LaTeX file with minimal additional markup (e.g. "\\" at the end of each line if no other columns will be in the results table you are building).
You can also run individual tests by specifying them as further arguments. For example, to run just spider and tripdl, use
runexps.sh trf test gt /path/to/trconllf/xml spider tripdl
NOTE: In order to run trawl+spider you need to also run trawl directly before it, because the act of running TRAWL writes out a weights file (e.g. probToWMD.21647.dat) that is then read in by SPIDER in order to implement trawl+spider. Similar considerations do not currently apply to wistr+spider because this step runs the equivalent of WISTR on its own to create the weights file.
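For example, something like the following should run TRAWL immediately followed by TRAWL+SPIDER in a single invocation (whether trawl+spider is the exact name accepted on the command line is an assumption; check runexps.sh):
runexps.sh trf test gt /path/to/trconllf/xml trawl trawl+spider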
runexps.sh also takes optional arguments that can be specified before any of the other arguments (an example invocation follows this list):
- --cwar-suffix SUFFIX: Specify the suffix to use in place of 20spd for the CWAR corpus.
- --wikitag WIKITAG: Specify the version of Wikipedia in place of enwiki-20131104.
- --wiki-log-suffix SUFFIX: Specify the suffix to use for the TextGrounder log files in place of nbayes-dirichlet. This indicates which document geolocation method was used when generating the log file.
- --memory MEMORY: Specify the amount of memory to use when running Java, e.g. 8g for 8 GB.
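For example (the values shown simply restate the defaults to illustrate the syntax):
runexps.sh --memory 8g --wikitag enwiki-20131104 --cwar-suffix 20spd cwar test gt /path/to/cwar/xml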
Note that the commands used to execute Fieldspring, and the actual output from running Fieldspring, are saved into a temporary file in the same directory from which runexps.sh is run, with a name like temp-results.21763.txt. (The number is the process ID of runexps.sh.) The name of this file is output at the beginning when running runexps.sh. If something goes wrong, look in this file.
To work with your own data (either a single .txt file or a directory of them), run the following:
fieldspring --memory 2g import-corpus -i /path/to/your/data -cf p -sg geonames-1dpc.ser.gz -sco corpus-name.ser.gz
This will take some time as it will use a named entity recognizer to identify toponyms. You also may need to adjust the amount of memory if your corpus is large.
You can now run various toponym resolvers on the serialized corpus (corpus-name.ser.gz) and output the result in a few different ways.
To run the population baseline, execute this:
fieldspring --memory 2g resolve -sci corpus-name.ser.gz -r population -o corpus-name-pop-resolved.xml \
-ok corpus-name-pop-resolved.kml -sco corpus-name-pop-resolved.ser.gz
This will resolve your corpus with the population baseline and output the result in TR-CoNLL format (XML) to wherever the -o flag points, a Google Earth readable file (KML) to wherever the -ok flag points, and a serialized resolved corpus to wherever the -sco flag points.
You can visualize the resolved serialized file with Fieldspring's own visualizer by executing this:
fieldspring --memory 8g viz corpus-name-pop-resolved.ser.gz
To run other resolvers, change what the -r flag is set to. Looking through runexps.sh in a text editor will give you an idea of the resolvers currently supported; many of them require additional data supplied via additional flags.
Here are some of the values you can pass to -r to run various resolvers (in many cases shortenings like 'pop' for population also work):
random
population
bmd (BasicMinDistance)
wmd (WeightedMinDistance, aka SPIDER)
maxent (Maximum Entropy, aka WISTR)
prob (Probabilistic, aka TRAWL)
constructiontpp (ConLAC)
acotpp (TRACO)
Finding where these are invoked in runexps.sh will give you an idea of what you'll need to run them yourself.
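As one illustration, here is a sketch of running the BasicMinDistance resolver on the corpus imported above (the output filenames are just illustrative, and some of the other resolvers need additional flags not shown here):
fieldspring --memory 2g resolve -sci corpus-name.ser.gz -r bmd -o corpus-name-bmd-resolved.xml \
  -sco corpus-name-bmd-resolved.ser.gz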
Adding the -oracle flag to any resolve command (I recommend random for speed) will use the oracle resolver.