cwbify_standard_corpus.pl

a guest
Dec 16th, 2015
#!/usr/bin/perl -w

use strict;

use lib "/usr/local/share/perl/5.10.1/";
use CWB::Encoder;

# For comments, usage, etc. do:
# cwbify_standard_corpus.pl -h

use Getopt::Std;

my $usage;
{
$usage = <<"_USAGE_";

This script uses Stefan Evert's CWB::Encoder module to CWB-index
corpora in the following simple CWB-compatible format:

<corpus>
<text id="...">
<s>
The DET the
dogs N dog
...
</s>
...
</text>
...
</corpus>

where the positional attributes are always arranged in the order: word
pos lemma (tab delimited).

Of course, the script requires CWB and the CWB::Encoder module to be
properly installed.

Usage:

cwbify_standard_corpus.pl -l langcode -d dir -c cwbname -n descname corpus
cwbify_standard_corpus.pl -h | more

-h: prints this information and quits

-l langcode: an ISO-style language code, such as en, de, it, etc.

-d dir: a directory where the indexed data will be stored -- the
   directory must not exist already; it will be created by the
   script; for the corpus to be available to others, you should
   specify an absolute path

-c cwbname: the CWB name for the corpus, short and upper-case

-n descname: a descriptive name made of a few words, in double quotes

corpus: the input corpus

Copyright 2005, Marco Baroni

This program is free software. You may copy or redistribute it under
the same terms as Perl itself.

_USAGE_
}
{
   my $blah = 1;
# this useless block is here because the here-document above confuses
# emacs
}

my %opts;
getopts('hl:d:c:n:', \%opts);

if ($opts{h}) {
   print $usage;
   exit;
}

# all four options and the input corpus are mandatory
my $langcode   = $opts{l} or die "specify language code!\n";
my $datadir    = $opts{d} or die "specify data directory!\n";
my $corpusname = $opts{c} or die "specify CWB corpus name!\n";
my $longname   = $opts{n} or die "specify descriptive name!\n";

my $ifile = shift or die "specify input corpus!\n";

my $corpus = CWB::Encoder->new($corpusname);
$corpus->dir($datadir);        # directory for corpus data files
$corpus->overwrite(0);         # may NOT overwrite existing files / directories
                               # this was changed from 1 to 0 after a few
                               # users overwrote their home directories...

$corpus->longname($longname);
$corpus->language($langcode);

# declare positional attributes (no default!), in input column order
$corpus->p_attributes("word");
$corpus->p_attributes("pos");
$corpus->p_attributes("lem");

# declare structural attributes; <text> carries the listed XML attributes
$corpus->s_attributes("corpus");
$corpus->s_attributes("s");
$corpus->s_attributes("text:0+id+type+title+language+target_language+author+author_sex+translator+translator_sex+publisher+year+words");

$corpus->memory(400);          # use up to 400 MB of RAM (default: 75)
$corpus->validate(0);          # disable validation for faster indexing
$corpus->verbose(1);           # print some progress information
$corpus->debug(1);             # enable debugging output
$corpus->encode($ifile);       # encoding, indexing, and compression
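
The script hands the input file straight to the encoder, so a corpus whose token lines are not tab-delimited word/pos/lemma triples may only show up as a problem after indexing. Below is a minimal sketch of a pre-flight check one could run first. It is not part of the original script: check_vertical_format is a hypothetical helper name, and the rules it applies (blank lines are skipped, any line of the form <tag ...> or </tag> counts as structural markup, every other line must have exactly three non-empty tab-separated fields) are assumptions drawn from the usage text above, not from the CWB documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): read a corpus in the
# one-token-per-line format from a filehandle and return a list of messages
# for lines that are neither tags nor word/pos/lemma triples.
sub check_vertical_format {
    my ($fh) = @_;
    my @problems;
    my $line_no = 0;
    while (my $line = <$fh>) {
        $line_no++;
        chomp $line;
        next if $line eq '';                          # assumption: blank lines are harmless
        next if $line =~ m{^</?[A-Za-z_][^<>]*>$};    # structural tag such as <s> or </text>
        my @fields = split /\t/, $line, -1;           # -1 keeps trailing empty fields
        if (@fields != 3 || grep { $_ eq '' } @fields) {
            push @problems,
                "line $line_no: expected 3 non-empty tab-separated fields, found "
                . scalar(@fields);
        }
    }
    return @problems;
}

# Small in-memory sample; the second token line uses spaces instead of tabs.
my $sample = join "\n",
    '<corpus>', '<text id="t1">', '<s>',
    "The\tDET\tthe",
    'dogs N dog',
    '</s>', '</text>', '</corpus>', '';

open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
print "$_\n" for check_vertical_format($fh);
close $fh;
```

With the sample above, the check should report the fifth line, because its fields are separated by spaces rather than tabs. To check a real corpus, open the corpus file instead of the in-memory string and stop before calling encode if any problems are returned.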