Tools
=====

``nlppln`` contains the following tools:

anonymize.cwl
+++++++++++++

Replace named entities in a directory of text files. Can be used as part of a data anonymization workflow.

apachetika.cwl
++++++++++++++

Convert Word documents to text using Apache Tika.

archive2dir.cwl
+++++++++++++++

Extract an archive and recursively put all files in the output directory. Uses Patool for extracting archives.

basic-text-statistics.cwl
+++++++++++++++++++++++++

Output a csv file with basic text statistics (number of tokens, number of sentences).

check-utf8.cwl
++++++++++++++

Convert text files to utf-8 encoding. Uses BeautifulSoup's Unicode, Dammit module to guess the file encoding if it isn't utf-8.

clear-xml-elements.cwl
++++++++++++++++++++++

Empty (i.e., remove all content from) the specified XML elements in an XML file.

copy-and-rename.cwl
+++++++++++++++++++

Copy a file and optionally rename it. File renaming options are: ``copy`` (don't rename), ``spaces`` (remove spaces, default), and ``random`` (generate a random file name; the file extension is copied too).

If the renaming option is ``spaces``, this tool must be run with the ``--relax-path-checks`` option, because it accepts file names with spaces, which CWL normally does not accept.

create-chunked-list.cwl
+++++++++++++++++++++++

No documentation

delete-empty-files.cwl
++++++++++++++++++++++

No documentation

filter-nes.cwl
++++++++++++++

Control which named entities will be removed. See `replace-ner.cwl`_.

flatten-dirs.cwl
++++++++++++++++

Given a list of directories, return a directory that contains all the files in the input directories. By default, the name of the output directory is ``flattened``. You can specify a different name using the ``--dir_name`` option.

freqs.cwl
+++++++++

Return a sorted list of word frequencies in the corpus. The corpus should consist of files containing space-separated tokens.

frog-dir.cwl
++++++++++++

Process a directory of text files with Frog.


frog-single-text.cwl
++++++++++++++++++++

Process a single text file with Frog.

frog-to-saf.cwl
+++++++++++++++

Convert Frog csv output to saf.

gather-dirs.cwl
+++++++++++++++

Given a list of directories, return a directory that contains all the files in the input directories. By default, the name of the output directory is ``gathered``. You can specify a different name using the ``--dir_name`` option.

ixa-pipe-tok.cwl
++++++++++++++++

Tokenize a text using ixa-pipe-tok.

liwc-tokenized.cwl
++++++++++++++++++

Apply LIWC to a directory of tokenized text files. The text files have to contain space-separated tokens.

lowercase.cwl
+++++++++++++

Lowercase a text.

ls.cwl
++++++

List files in a directory. This command can be used to convert a ``Directory`` into a list of files. This list can be filtered on file name by specifying ``--endswith``.

ls_chunk.cwl
++++++++++++

No documentation

merge-csv.cwl
+++++++++++++

Merge csv files (with the same header) into a single csv file.

mkdir.cwl
+++++++++

Create a directory.

normalize-whitespace-punctuation.cwl
++++++++++++++++++++++++++++++++++++

Normalize whitespace and punctuation. Replace multiple consecutive occurrences of whitespace characters and punctuation with a single occurrence.

prettify-xml.cwl
++++++++++++++++

Pretty-print an XML file. Uses BeautifulSoup pretty printing.

remove-newlines.cwl
+++++++++++++++++++

Remove newlines from a text.

remove-xml-elements.cwl
+++++++++++++++++++++++

Remove the specified XML elements from an XML file.

replace-ner.cwl
+++++++++++++++

Replace named entities in saf files. Named entities can be replaced with their type or deleted.

saf-to-freqs.cwl
++++++++++++++++

Return a csv file with a ranked list of (word, pos) pairs. The list can consist of (word, pos) pairs or (lemma, pos) pairs.

saf-to-txt.cwl
++++++++++++++

Convert saf to space-separated tokens.

save-dir-to-subdir.cwl
++++++++++++++++++++++

Save a directory to a subdirectory. Puts ``inner_dir`` into the ``outer_dir``.

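
As a rough illustration of what "puts ``inner_dir`` into the ``outer_dir``" means for ``save-dir-to-subdir.cwl``, the sketch below does the equivalent on a plain file system. It is not the tool's implementation: the function name ``save_dir_to_subdir`` is hypothetical, the real tool works on CWL ``Directory`` objects, and copying (rather than moving) is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def save_dir_to_subdir(inner_dir, outer_dir):
    """Copy inner_dir, with its contents, into outer_dir and return the new path.

    Hypothetical helper: copying instead of moving is an assumption.
    """
    inner_dir, outer_dir = Path(inner_dir), Path(outer_dir)
    outer_dir.mkdir(parents=True, exist_ok=True)
    target = outer_dir / inner_dir.name
    # copytree raises FileExistsError if the target directory already exists.
    shutil.copytree(inner_dir, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        inner = Path(tmp) / "inner"
        inner.mkdir()
        (inner / "doc.txt").write_text("some text")
        result = save_dir_to_subdir(inner, Path(tmp) / "outer")
        # The copied directory keeps its name and sits inside "outer".
        print(result.relative_to(tmp))
```
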

save-files-to-dir.cwl
+++++++++++++++++++++

Save a list of files to a directory. If ``dir_name`` is not specified, it is set to the part of the first input file's ``nameroot`` that precedes the rightmost ``-`` (e.g., ``input-file-1-0000.txt`` becomes ``input-file-1``). If the file name does not contain a ``-``, the ``nameroot`` is used (e.g., ``input.txt`` becomes ``input``).

save-ner-data.cwl
+++++++++++++++++

Create a csv file with statistics about named entities. By editing the csv file, you can control which named entities are replaced or removed using `replace-ner.cwl`_.

tar.cwl
+++++++

Extract zipped tar archives.

textDNA-generate.cwl
++++++++++++++++++++

Generate data to visualize using TextDNA.

xml-to-text.cwl
+++++++++++++++

Extract text from an XML element and save it to a file.

zip-dir-flat.cwl
++++++++++++++++

Compress a directory into a zip archive. All structure is removed from the input directory, so if you unzip the archive, you get a flat list of files.
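
The flattening behaviour of ``zip-dir-flat.cwl`` can be sketched with Python's standard ``zipfile`` module. This is a minimal sketch under stated assumptions, not the tool's actual implementation: the function name ``zip_dir_flat`` is hypothetical, and raising an error when two files share a name is an assumption, since the description above does not say how name clashes are handled.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_dir_flat(in_dir, zip_path):
    """Write every file under in_dir to zip_path without directory structure.

    Hypothetical helper. zip_path should lie outside in_dir.
    """
    in_dir = Path(in_dir)
    seen = set()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(in_dir.rglob("*")):
            if not path.is_file():
                continue
            if path.name in seen:
                # Assumption: refuse to store two files under the same flat name.
                raise ValueError(f"duplicate file name: {path.name}")
            seen.add(path.name)
            # arcname drops the parent directories, which flattens the archive.
            archive.write(path, arcname=path.name)
    return zip_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "corpus"
        (src / "a" / "b").mkdir(parents=True)
        (src / "a" / "one.txt").write_text("one")
        (src / "a" / "b" / "two.txt").write_text("two")
        out = zip_dir_flat(src, Path(tmp) / "corpus.zip")
        with zipfile.ZipFile(out) as archive:
            print(sorted(archive.namelist()))  # ['one.txt', 'two.txt']
```

Unzipping the resulting archive yields ``one.txt`` and ``two.txt`` side by side, regardless of how deeply they were nested in the input directory.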