Tools
=====
``nlppln`` contains the following tools:
anonymize.cwl
+++++++++++++
Replace named entities in a directory of text files.
Can be used as part of a data anonymization workflow.
apachetika.cwl
++++++++++++++
Convert Word documents to text using `Apache Tika `_.
archive2dir.cwl
+++++++++++++++
Extract archive and recursively put all files in the output directory.
Uses `Patool `_ for extracting archives.
basic-text-statistics.cwl
+++++++++++++++++++++++++
Output a csv file with basic text statistics (#tokens, #sentences).
check-utf8.cwl
++++++++++++++
Convert text files to utf-8 encoding.
Uses `BeautifulSoup `_'s
Unicode, Dammit module to guess the file encoding if it isn't utf-8.
clear-xml-elements.cwl
++++++++++++++++++++++
Empty (i.e. remove all content from) specified XML elements in the XML file.
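To illustrate what "emptying" an element means, here is a minimal sketch using the standard library's ``xml.etree.ElementTree``. It is not the tool's actual implementation: the function name ``clear_elements`` is hypothetical, and keeping the element's attributes is an assumption.

```python
import xml.etree.ElementTree as ET


def clear_elements(xml_text, tag):
    """Return xml_text with the content of every <tag> element removed.

    Hypothetical helper; attributes are kept (an assumption).
    """
    root = ET.fromstring(xml_text)
    # Materialize the iterator first, because the tree is modified in the loop.
    for elem in list(root.iter(tag)):
        for child in list(elem):
            elem.remove(child)
        elem.text = None
    return ET.tostring(root, encoding="unicode")
```

The emptied element stays in the document, which is the difference with removing the element altogether.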
copy-and-rename.cwl
+++++++++++++++++++
Copy a file and optionally rename it.
File renaming options are: ``copy`` (don't rename), ``spaces`` (remove
spaces, default), and ``random`` (generate a random file name; the file
extension is kept). If the renaming option is ``spaces``, this tool must be
run with the ``--relax-path-checks`` option, because it accepts file names
with spaces, which CWL normally does not accept.
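The three renaming options can be sketched as follows. This is an illustration only, not the tool's code: the function name ``target_name`` is hypothetical, and using ``uuid4`` for the random name is an assumption.

```python
import os
import uuid


def target_name(file_name, rename="spaces"):
    """Return the output file name for one of the three renaming options."""
    if rename == "copy":
        return file_name
    if rename == "spaces":
        return file_name.replace(" ", "")
    if rename == "random":
        # Assumption: any random-name generator would do; the extension is kept.
        _, ext = os.path.splitext(file_name)
        return uuid.uuid4().hex + ext
    raise ValueError("unknown renaming option: {}".format(rename))
```

For example, ``target_name("my file.txt")`` gives ``myfile.txt``, while ``target_name("my file.txt", "random")`` gives a random name ending in ``.txt``.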
create-chunked-list.cwl
+++++++++++++++++++++++
No documentation
delete-empty-files.cwl
++++++++++++++++++++++
No documentation
filter-nes.cwl
++++++++++++++
Control which named entities will be removed.
See `replace-ner.cwl`_.
flatten-dirs.cwl
++++++++++++++++
Given a list of directories, return a directory that contains all the files in the input directories.
By default the name of the output directory is ``flattened``. You can specify a different name using the ``--dir_name`` option.
freqs.cwl
+++++++++
Return a sorted list of word frequencies in the corpus.
The corpus should consist of files containing space-separated tokens.
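A minimal sketch of counting word frequencies over space-separated tokens, using ``collections.Counter``. The function name ``word_frequencies`` is hypothetical, and sorting by descending frequency is an assumption about what "sorted" means here.

```python
from collections import Counter


def word_frequencies(texts):
    """Count tokens over an iterable of already tokenized texts."""
    counts = Counter()
    for text in texts:
        # split() without arguments splits on any run of whitespace.
        counts.update(text.split())
    # most_common() orders the (word, count) pairs by descending count.
    return counts.most_common()
```

Calling ``word_frequencies(["a b a", "b a c"])`` yields ``a`` with count 3, ``b`` with count 2, and ``c`` with count 1, in that order.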
frog-dir.cwl
++++++++++++
Run `Frog `_ on a directory of text files.
frog-single-text.cwl
++++++++++++++++++++
Run `Frog `_ on a single text file.
frog-to-saf.cwl
+++++++++++++++
Convert `frog `_ csv output to
`saf `_.
gather-dirs.cwl
+++++++++++++++
Given a list of directories, return a directory that contains all the files in the input directories.
By default the name of the output directory is ``gathered``. You can specify a different name using the ``--dir_name`` option.
ixa-pipe-tok.cwl
++++++++++++++++
Tokenize a text using `ixa-pipe-tok `_.
liwc-tokenized.cwl
++++++++++++++++++
Apply `LIWC `_ to a directory of tokenized text files.
The text files have to contain space-separated tokens.
lowercase.cwl
+++++++++++++
Lowercase a text.
ls.cwl
++++++
List files in a directory.
This command can be used to convert a ``Directory`` into a list of files. This list can be filtered on file name by specifying ``--endswith``.
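The effect of listing a directory and filtering on file name can be sketched in plain Python. The function name ``list_files`` is hypothetical, and skipping subdirectories and sorting the result are assumptions, not documented behavior of the tool.

```python
import os


def list_files(directory, endswith=None):
    """Return the files in directory, optionally filtered on a name suffix."""
    paths = [os.path.join(directory, name) for name in sorted(os.listdir(directory))]
    # Assumption: only regular files are listed, subdirectories are skipped.
    files = [path for path in paths if os.path.isfile(path)]
    if endswith is not None:
        files = [path for path in files if path.endswith(endswith)]
    return files
```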
ls_chunk.cwl
++++++++++++
No documentation
merge-csv.cwl
+++++++++++++
Merge csv files (with the same header) into a single csv file.
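Merging csv files that share a header amounts to writing the header once and appending the data rows of every file. A minimal sketch, working on strings rather than files to stay self-contained; the function name ``merge_csv`` is hypothetical, and raising an error on differing headers is an assumption.

```python
import csv
import io


def merge_csv(csv_texts):
    """Merge csv documents (given as strings) that share the same header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in csv_texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty inputs
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            # Assumption: differing headers are treated as an error.
            raise ValueError("csv headers differ")
        writer.writerows(rows[1:])
    return out.getvalue()
```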
mkdir.cwl
+++++++++
Create a directory.
normalize-whitespace-punctuation.cwl
++++++++++++++++++++++++++++++++++++
Normalize whitespace and punctuation.
Replace multiple subsequent occurrences of whitespace characters and
punctuation with a single occurrence.
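One way to express this normalization is with two regular expressions. This is a sketch under assumptions, not the tool's implementation: the function name ``normalize`` is hypothetical, "punctuation" is taken to be ``string.punctuation``, only runs of the same punctuation character are collapsed, and any run of whitespace becomes a single space.

```python
import re
import string

# A punctuation character followed by one or more copies of itself.
_PUNCT_RUN = re.compile("([" + re.escape(string.punctuation) + r"])\1+")
# One or more whitespace characters (spaces, tabs, newlines).
_SPACE_RUN = re.compile(r"\s+")


def normalize(text):
    """Collapse repeated punctuation and repeated whitespace."""
    text = _PUNCT_RUN.sub(r"\1", text)
    # Assumption: a whitespace run is replaced by a single space.
    return _SPACE_RUN.sub(" ", text)
```

With these assumptions, a mixed run such as ``?!`` is left untouched, because the two characters differ.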
prettify-xml.cwl
++++++++++++++++
Pretty-print an XML file.
Uses `BeautifulSoup pretty printing `_.
remove-newlines.cwl
+++++++++++++++++++
Remove newlines from a text.
remove-xml-elements.cwl
+++++++++++++++++++++++
Remove specified XML elements from an XML file.
replace-ner.cwl
+++++++++++++++
Replace named entities in `saf `_ files.
Named entities can be replaced with their type or deleted.
saf-to-freqs.cwl
++++++++++++++++
Return a csv file with a ranked list of (word, pos) pairs.
The list can consist of (word, pos) pairs or (lemma, pos) pairs.
saf-to-txt.cwl
++++++++++++++
Convert `saf `_ to space-separated tokens.
save-dir-to-subdir.cwl
++++++++++++++++++++++
Save a directory to a subdirectory.
Puts ``inner_dir`` into the ``outer_dir``.
save-files-to-dir.cwl
+++++++++++++++++++++
Save a list of files to a directory.
If ``dir_name`` is not specified, it is set to the part of the ``nameroot`` of the first input file that precedes the rightmost ``-``
(e.g., ``input-file-1-0000.txt`` becomes ``input-file-1``). If the file name does not contain a ``-``, the whole ``nameroot`` is used
(e.g., ``input.txt`` becomes ``input``).
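The default naming rule can be sketched in a few lines of Python. The function name ``default_dir_name`` is hypothetical, and ``os.path.splitext`` is used here as a stand-in for CWL's ``nameroot`` (the base name without its final extension).

```python
import os


def default_dir_name(first_file):
    """Derive the default directory name from the first input file."""
    nameroot = os.path.splitext(os.path.basename(first_file))[0]
    # rpartition returns ('', '', nameroot) when nameroot contains no '-'.
    head, sep, _ = nameroot.rpartition("-")
    return head if sep else nameroot
```

Both examples from the description hold: ``input-file-1-0000.txt`` gives ``input-file-1`` and ``input.txt`` gives ``input``.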
save-ner-data.cwl
+++++++++++++++++
Create a csv file with statistics about named entities.
By editing the csv file, you can control which named entities are replaced or
removed using `replace-ner.cwl`_.
tar.cwl
+++++++
Extract zipped tar archives.
textDNA-generate.cwl
++++++++++++++++++++
Generate data to visualize using `TextDNA `_.
xml-to-text.cwl
+++++++++++++++
Extract text from an XML element and save it to a file.
zip-dir-flat.cwl
++++++++++++++++
Compress a directory into a zip archive.
The structure of the input directory is not preserved in the archive. So, if you unzip the archive,
you get a flat list of files.
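Flattening while zipping can be sketched with the standard library's ``zipfile`` module by storing every file under its base name. The function name ``zip_dir_flat`` is hypothetical and this is not the tool's actual code. The sketch assumes that file names are unique across subdirectories and that the archive is written outside the input directory.

```python
import os
import zipfile


def zip_dir_flat(in_dir, zip_path):
    """Zip all files below in_dir, dropping the directory structure."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(in_dir):
            for name in sorted(files):
                # arcname=name stores the file without its parent directories.
                archive.write(os.path.join(root, name), arcname=name)
```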