IR - TREC
1. Introduction
I forget what "TREC" stood for; probably "Trav's" something or other. The name was a pun on a song I had written a few years earlier ("Trek").
Regardless of the title, TREC formed a big part of my research.
2. Source Code
The TREC logic consisted of a set of Perl scripts. Most of the descriptions here are copied from the headers of the source files.
2.1 Perl Scripts
- cmp_ranks.pl
- Compares ranks, generates MSE and P vs R output files.
- combine_graphs.pl
- Combines multiple graphs onto one page (TeX).
- compmap.pl
- Compares our mappings (coll_map.txt) to Charlie Viles' mappings.
- evaluate.pl
- Reads rank comparisons for baseline_vs_estimate (one file) and baseline_versus_random (contains avg and std dev). It then calculates the statistical significance of the estimate's scores against the baseline.
- get_fandw.pl
- Generates global (gloss) vocab matrix from source F and W files.
- graph.pl
- Generates MSE points for plotting.
- matrix.pl
- Generates a tab-delimited merit matrix (queries versus collections) for the given merit file(s).
- mits.pl
- Provides an easy and fast interface for processing TREC data. This handles map files, merits, ranks, comparisons, summary matrices, evaluations versus random ranks, etc.
- new_dir.pl
- Creates a new test directory, creating all the needed subdirectories and making "stubs" for all the needed files. Also has options for creating links to files in an existing data directory.
- postproc.pl
- Performs post-processing logic.
- problems.pl
- Lists problematic query_ids and coll_ids (obtained from a diff on A.noord and B.noord).
- reformat.pl
- Converts ranks into a diff-able format, and then gets n_star.
- testmap.pl
- Verifies counts of doc_ids in the mapping file, the source files, and a pre-generated frequency list. Run this after running buildmap.pl to make sure buildmap.pl worked correctly.
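
As a rough illustration of the kind of check evaluate.pl describes, the sketch below places an estimate's score against the average and standard deviation of random ranks using a z-score. It is written in Python rather than the original Perl. The function names, the choice of a z-score test, and the sample numbers are all assumptions; the source headers do not say which significance test was actually used.

```python
import math

def z_score(estimate_score, random_avg, random_std):
    """Return how many standard deviations the estimate lies from the random mean."""
    if random_std <= 0:
        raise ValueError("random_std must be positive")
    return (estimate_score - random_avg) / random_std

def lower_tail_p_value(z):
    """Probability that a standard normal variable is <= z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical per-query numbers (not from the original data): an error
# score for the estimate, plus the avg and std dev over random ranks.
z = z_score(estimate_score=2.0, random_avg=5.0, random_std=1.5)
p = lower_tail_p_value(z)
print(f"z = {z:.2f}, p = {p:.4f}")
```

If the score is an error measure such as MSE, a strongly negative z (small lower-tail probability) would suggest the estimate tracks the baseline better than random guessing does.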
2.2 Perl Modules
These are the Perl modules used by the scripts. They are all in the subs subdirectory.
- files.pm
- Routines for manipulating files and directories.
- gentable.pm
- General-purpose table generator routines (for two-dimensional associative arrays).
- graphs.pm
- Routines for creating graphs out of MSEs, Recalls, and Precisions.
- handler.pm
- Handles MITS' menu choices (is responsible for interfacing with all the called routines).
- ir_subs.pm
- Miscellaneous commonly used subroutines for IR Perl scripts.
- maps.pm
- Routines related to doc_id -> coll_id maps.
- masks.pm
- Routines related to masks.
- merits.pm
- Routines related to merits.
- merit_est.pm
- Calculates merit for estimates (query_id, coll_id), reads from F/W.txt, masks out certain collections from certain queries.
- merit_ideal.pm
- Gets ideal document similarities from smart files and generates merits (collection similarities) per specified thresholds.
- merit_opt.pm
- Reads in a qrels.all.txt file and generates a corresponding "optimal" merits file.
- random.pm
- Reads in a random weights file and then generates several random ranks from it (in memory only). It then compares the random guesses to a baseline, calculates the average and standard deviation (per query) of the scores, and writes them to file.
- ranks.pm
- Rank-related routines.
- terms.pm
- Routines for manipulating terms and term files (e.g., query ids, coll ids, etc.).
- weights.pm
- Handles writing and reading of weights files, as well as the generation of the actual weights.
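
The random-rank comparison described for random.pm can be sketched roughly as follows, in Python rather than the original Perl. This is a minimal sketch under stated assumptions: it shuffles the baseline ranking instead of reading a random weights file, it scores each guess with a mean squared error over rank positions, and the collection names and function names are hypothetical.

```python
import random
import statistics

def mse(rank_a, rank_b):
    """Mean squared difference between the positions two rankings give each collection."""
    pos_b = {coll: i for i, coll in enumerate(rank_b)}
    return sum((i - pos_b[coll]) ** 2 for i, coll in enumerate(rank_a)) / len(rank_a)

def random_rank_stats(baseline, n_trials, rng):
    """Shuffle the baseline n_trials times; return (average, std dev) of the MSE scores."""
    scores = []
    for _ in range(n_trials):
        guess = list(baseline)
        rng.shuffle(guess)  # assumption: a uniform shuffle stands in for random weights
        scores.append(mse(guess, baseline))
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical baseline ranking of collections for a single query.
baseline = ["coll_a", "coll_b", "coll_c", "coll_d"]
avg, std = random_rank_stats(baseline, n_trials=100, rng=random.Random(0))
print(f"avg = {avg:.3f}, std dev = {std:.3f}")
```

Running this once per query would give the per-query average and standard deviation that a later significance check could consume.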
2.3 Shell Scripts
- go
- This shell script makes it more convenient to call some of the graphing-related Perl scripts.
- graph_inq
- Another shell script that calls graphing-related Perl scripts.
3. Data Files
The output files, which were hosted in a shared directory, are no longer available; the symlinks I was using are dead. Here's all that's left of the data files:
- callan.txt
- This looks like a list of publications, in human-readable format.
- mits.cfg
- MITS configuration file.
- ratios.txt
- Ratios relating to queries and other stuff. I don't remember.
- test.zip
- Zip of test data.