Tools

nlppln contains the following tools:

anonymize.cwl

Replace named entities in a directory of text files.

Can be used as part of a data anonymization workflow.

apachetika.cwl

Convert Word documents to text using Apache Tika.

archive2dir.cwl

Extract archive and recursively put all files in the output directory.

Uses Patool for extracting archives.

basic-text-statistics.cwl

Output a csv file with basic text statistics (#tokens, #sentences).

check-utf8.cwl

Convert text files to utf-8 encoding.

Uses BeautifulSoup’s Unicode, Dammit module to guess the file encoding if it isn’t utf-8.

clear-xml-elements.cwl

Empty (i.e. remove all content from) specified XML elements in the XML file.

copy-and-rename.cwl

Copy a file and optionally rename it.

File renaming options are: copy (don’t rename), spaces (remove spaces; this is the default), and random (generate a random file name; the file extension is copied too). If the renaming option is spaces, this tool must be run with the --relax-path-checks option, because it accepts file names with spaces, which CWL normally does not accept.
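The three renaming options can be illustrated with a minimal Python sketch. This is not the tool's actual implementation; the function name renamed and its rename parameter are hypothetical, and the use of a UUID for the random name is an assumption.

```python
import os
import uuid


def renamed(file_name, rename="spaces"):
    """Return the output file name for a renaming option (hypothetical helper)."""
    if rename == "copy":
        # Keep the name unchanged.
        return file_name
    if rename == "spaces":
        # Remove all spaces from the name.
        return file_name.replace(" ", "")
    if rename == "random":
        # Assumption: a random hex string is used; the extension is kept.
        _, ext = os.path.splitext(file_name)
        return uuid.uuid4().hex + ext
    raise ValueError("unknown renaming option: {}".format(rename))
```

For example, renamed("my file.txt") gives "myfile.txt", while renamed("my file.txt", "copy") leaves the name as it is.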

create-chunked-list.cwl

No documentation

delete-empty-files.cwl

No documentation

filter-nes.cwl

Control which named entities will be removed.

See replace-ner.cwl.

flatten-dirs.cwl

Given a list of directories, return a directory that contains all the files in the input directories.

By default the name of the output directory is flattened. You can specify a different name using the --dir_name option.
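A rough Python sketch of the idea follows. It is not the tool's implementation: the function name flatten_dirs and the out_root parameter are hypothetical, and it assumes that only the files directly inside each input directory are copied and that later files overwrite earlier ones with the same name.

```python
import os
import shutil


def flatten_dirs(in_dirs, out_root, dir_name="flattened"):
    """Copy the files found directly in each input directory into one directory."""
    out_dir = os.path.join(out_root, dir_name)
    os.makedirs(out_dir, exist_ok=True)
    for in_dir in in_dirs:
        for name in sorted(os.listdir(in_dir)):
            path = os.path.join(in_dir, name)
            # Assumption: subdirectories are ignored, only regular files are copied.
            if os.path.isfile(path):
                shutil.copy(path, os.path.join(out_dir, name))
    return out_dir
```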

freqs.cwl

Return a sorted list of word frequencies in the corpus.

The corpus should consist of files containing space-separated tokens.
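As an illustration of what a sorted frequency list over space-separated tokens looks like, here is a minimal Python sketch. The function name word_frequencies is hypothetical, and the sketch takes the file contents as strings rather than reading a corpus directory; it is not the tool's actual code.

```python
from collections import Counter


def word_frequencies(texts):
    """Count tokens in whitespace-tokenized texts, most frequent first."""
    counts = Counter()
    for text in texts:
        # str.split() splits on any run of whitespace, which covers
        # space-separated tokens.
        counts.update(text.split())
    return counts.most_common()
```

Calling word_frequencies(["a b a", "b a c"]) returns [("a", 3), ("b", 2), ("c", 1)].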

frog-dir.cwl

Frog a directory of text files.

frog-single-text.cwl

Frog a single text file.

frog-to-saf.cwl

Convert frog csv output to saf.

gather-dirs.cwl

Given a list of directories, return a directory that contains all the files in the input directories.

By default the name of the output directory is gathered. You can specify a different name using the –dir_name option.

ixa-pipe-tok.cwl

Tokenize a text using ixa-pipe-tok.

liwc-tokenized.cwl

Apply LIWC to a directory of tokenized text files.

The text files must contain space-separated tokens.

lowercase.cwl

Lowercase a text.

ls.cwl

List files in a directory.

This command can be used to convert a Directory into a list of files. This list can be filtered on file name by specifying --endswith.
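The effect of listing a directory and filtering on a file name suffix can be sketched in Python as follows. The function name list_files is hypothetical; only the endswith parameter mirrors the option named above, and sorting the result is an assumption.

```python
import os


def list_files(directory, endswith=None):
    """Return the files in a directory, optionally filtered by name suffix."""
    paths = [os.path.join(directory, name) for name in sorted(os.listdir(directory))]
    # Keep regular files only; subdirectories are skipped.
    files = [path for path in paths if os.path.isfile(path)]
    if endswith is not None:
        files = [path for path in files if path.endswith(endswith)]
    return files
```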

ls_chunk.cwl

No documentation

merge-csv.cwl

Merge csv files (with the same header) into a single csv file.
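A minimal sketch of such a merge, using only Python's standard csv module, is shown below. The function name merge_csv is hypothetical, and raising an error on a mismatching header is an assumption about how differing headers might be handled.

```python
import csv


def merge_csv(in_paths, out_path):
    """Concatenate csv files that share a header into a single csv file."""
    header = None
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in in_paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # Write the header once, taken from the first file.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError("header mismatch in {}".format(path))
                # Copy the remaining data rows.
                writer.writerows(reader)
```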

mkdir.cwl

Create a directory.

normalize-whitespace-punctuation.cwl

Normalize whitespace and punctuation.

Replace multiple subsequent occurrences of whitespace characters and punctuation with a single occurrence.
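One way to read this is that a run of the same whitespace or punctuation character is collapsed to a single character. The Python sketch below implements that reading with a regular expression; it is an assumption, not the tool's actual code, and the function name normalize is hypothetical.

```python
import re
import string

# A whitespace or ASCII punctuation character followed by one or more
# repetitions of that same character.
_REPEATED = re.compile(r"(\s|[{}])\1+".format(re.escape(string.punctuation)))


def normalize(text):
    """Collapse runs of an identical whitespace or punctuation character."""
    return _REPEATED.sub(r"\1", text)
```

Under this reading, normalize("Hello!!!   world...") gives "Hello! world.", while a mixed sequence such as "?!" is left unchanged.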

prettify-xml.cwl

Pretty print an XML file.

Uses BeautifulSoup pretty printing.

remove-newlines.cwl

Remove newlines from a text.

remove-xml-elements.cwl

Remove specified XML elements from an XML file.

replace-ner.cwl

Replace named entities in saf files.

Named entities can be replaced with their type or deleted.

saf-to-freqs.cwl

Return a csv file with a ranked list of (word, pos) pairs.

The list can consist of (word, pos) pairs or (lemma, pos) pairs.

saf-to-txt.cwl

Convert saf to space-separated tokens.

save-dir-to-subdir.cwl

Save a directory to a subdirectory.

Puts inner_dir into the outer_dir.

save-files-to-dir.cwl

Save a list of files to a directory.

If the dir_name is not specified, it is set to the string before the rightmost - of the nameroot of the first input file (e.g., input-file-1-0000.txt becomes input-file-1). If the file name does not contain a -, the nameroot is used (e.g. input.txt becomes input).
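The default naming rule can be written down as a short Python sketch. The function name default_dir_name is hypothetical, and os.path.splitext is used here as a stand-in for CWL's nameroot (the base name without its final extension).

```python
import os


def default_dir_name(first_file):
    """Derive the default directory name from the first input file."""
    # nameroot: the base name without its final extension.
    nameroot = os.path.splitext(os.path.basename(first_file))[0]
    # Keep everything before the rightmost "-"; if there is no "-",
    # rsplit returns the nameroot unchanged.
    return nameroot.rsplit("-", 1)[0]
```

With this sketch, default_dir_name("input-file-1-0000.txt") returns "input-file-1" and default_dir_name("input.txt") returns "input", matching the two examples above.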

save-ner-data.cwl

Create a csv file with statistics about named entities.

By editing the csv file, you can control which named entities are replaced or removed using replace-ner.cwl.

tar.cwl

Extract zipped tar archives.

textDNA-generate.cwl

Generate data to visualize using TextDNA.

xml-to-text.cwl

Extract text from an XML element and save it to a file.

zip-dir-flat.cwl

Compress a directory into a zip archive.

All structure is removed from the input directory. So, if you unzip the archive, you get a flat list of files.
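The flattening behaviour can be sketched with Python's standard zipfile module. This is an illustration under assumptions, not the tool's implementation: the function name zip_dir_flat is hypothetical, and files with the same name in different subdirectories would end up as duplicate entries in the archive.

```python
import os
import zipfile


def zip_dir_flat(in_dir, zip_path):
    """Zip all files below in_dir, storing them without their directory paths."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(in_dir):
            for name in files:
                # arcname=name drops the directory part, so the archive is flat.
                archive.write(os.path.join(root, name), arcname=name)
```

The zip_path should point outside in_dir, so that the archive being written is not picked up as one of its own inputs.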