Tools¶
nlppln contains the following tools:
anonymize.cwl¶
Replace named entities in a directory of text files.
Can be used as part of a data anonymization workflow.
apachetika.cwl¶
Convert Word documents to text using Apache Tika.
archive2dir.cwl¶
Extract an archive and recursively put all files in the output directory.
Uses Patool for extracting archives.
basic-text-statistics.cwl¶
Output a csv file with basic text statistics (#tokens, #sentences).
check-utf8.cwl¶
Convert text files to utf-8 encoding.
Uses BeautifulSoup’s Unicode, Dammit module to guess the file encoding if it isn’t utf-8.
clear-xml-elements.cwl¶
Empty (i.e. remove all content from) specified XML elements in the XML file.
copy-and-rename.cwl¶
Copy a file and optionally rename it.
File renaming options are: copy (don’t rename), spaces (remove spaces, default), and random (generate a random file name; the file extension is copied too). If the renaming option is spaces, this tool must be run with the --relax-path-checks option, because it accepts file names with spaces, which CWL normally does not accept.
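The three renaming options can be illustrated with a minimal Python sketch. This is not the tool's actual code: the function name new_name is hypothetical, and it assumes that "remove spaces" means dropping them outright and that the random name is a UUID-based string.

```python
import os
import uuid


def new_name(file_name, rename="spaces"):
    """Return the output file name for a given renaming option (illustrative only)."""
    if rename == "copy":
        # Keep the original name unchanged.
        return file_name
    if rename == "spaces":
        # Assumption: spaces are simply dropped from the name.
        return file_name.replace(" ", "")
    if rename == "random":
        # Random base name; the original extension is carried over.
        _, ext = os.path.splitext(file_name)
        return uuid.uuid4().hex + ext
    raise ValueError("unknown rename option: {}".format(rename))
```

With these assumptions, new_name("my report.txt") gives "myreport.txt", while the random option returns a different name ending in .txt on every call.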
create-chunked-list.cwl¶
No documentation
delete-empty-files.cwl¶
No documentation
flatten-dirs.cwl¶
Given a list of directories, return a directory that contains all the files in the input directories.
By default the name of the output directory is flattened. You can specify a different name using the --dir_name option.
freqs.cwl¶
Return a sorted list of word frequencies in the corpus.
The corpus should consist of files containing space-separated tokens.
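As a rough illustration of what such a frequency list involves, the sketch below counts space-separated tokens with Python's collections.Counter. The function name word_frequencies and the sort order (descending frequency, ties broken alphabetically) are assumptions, not taken from the tool itself.

```python
from collections import Counter


def word_frequencies(texts):
    """Count tokens over an iterable of strings holding space-separated tokens."""
    counts = Counter()
    for text in texts:
        # str.split() without arguments splits on any run of whitespace.
        counts.update(text.split())
    # Assumed ordering: most frequent first, ties in alphabetical order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

In practice, each string would be the content of one corpus file.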
gather-dirs.cwl¶
Given a list of directories, return a directory that contains all the files in the input directories.
By default the name of the output directory is gathered. You can specify a different name using the --dir_name option.
ixa-pipe-tok.cwl¶
Tokenize a text using ixa-pipe-tok.
liwc-tokenized.cwl¶
Apply LIWC to a directory of tokenized text files.
The text files have to contain space-separated tokens.
lowercase.cwl¶
Lowercase a text.
ls.cwl¶
List files in a directory.
This command can be used to convert a Directory into a list of files. This list can be filtered on file name by specifying --endswith.
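The filtering behaviour can be approximated in plain Python as follows. This is only a sketch: the function name list_files is hypothetical, and it assumes that subdirectories are skipped and that results are returned in sorted order.

```python
import os


def list_files(in_dir, endswith=None):
    """List the files in in_dir, optionally keeping only names with a given suffix."""
    result = []
    for name in sorted(os.listdir(in_dir)):
        path = os.path.join(in_dir, name)
        if not os.path.isfile(path):
            # Assumption: only regular files are listed, not subdirectories.
            continue
        if endswith is None or name.endswith(endswith):
            result.append(path)
    return result
```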
ls_chunk.cwl¶
No documentation
merge-csv.cwl¶
Merge csv files (with the same header) into a single csv file.
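A minimal sketch of merging csv files that share a header, using only Python's csv module, could look like this. The function name merge_csv and the decision to raise an error on a mismatching header are assumptions; the actual tool may behave differently.

```python
import csv


def merge_csv(in_paths, out_path):
    """Concatenate csv files with identical headers into one file."""
    header = None
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in in_paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    # Empty input file: nothing to merge.
                    continue
                if header is None:
                    # Write the header once, taken from the first non-empty file.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError("header mismatch in {}".format(path))
                # Copy the remaining (data) rows.
                writer.writerows(reader)
```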
mkdir.cwl¶
Create a directory.
normalize-whitespace-punctuation.cwl¶
Normalize whitespace and punctuation.
Replace multiple subsequent occurrences of whitespace characters and punctuation with a single occurrence.
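One way to picture this normalization is with two regular expressions, as in the sketch below. It assumes that any run of whitespace collapses to a single space and that only runs of the same punctuation character are collapsed; the function name normalize is hypothetical and the tool's exact rules may differ.

```python
import re
import string

# A run of the same ASCII punctuation character, e.g. "..." or "!!".
_PUNCT_RUN = re.compile(r"([{}])\1+".format(re.escape(string.punctuation)))
# Any run of whitespace characters (spaces, tabs, newlines).
_WS_RUN = re.compile(r"\s+")


def normalize(text):
    """Collapse repeated whitespace and repeated punctuation."""
    # Assumption: whitespace runs become one space.
    text = _WS_RUN.sub(" ", text)
    # Keep a single occurrence of each repeated punctuation character.
    return _PUNCT_RUN.sub(r"\1", text)
```

Under these assumptions, normalize("Wait...   what??") returns "Wait. what?".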
remove-newlines.cwl¶
Remove newlines from a text.
remove-xml-elements.cwl¶
Remove specified XML elements from an XML file.
replace-ner.cwl¶
Replace named entities in saf files.
Named entities can be replaced with their type or deleted.
saf-to-freqs.cwl¶
Return a csv file with a ranked list of (word, pos) pairs.
The list can consist of (word, pos) pairs or (lemma, pos) pairs.
save-files-to-dir.cwl¶
Save a list of files to a directory.
If the dir_name is not specified, it is set to the string before the rightmost - of the nameroot of the first input file (e.g., input-file-1-0000.txt becomes input-file-1). If the file name does not contain a -, the nameroot is used (e.g., input.txt becomes input).
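The default naming rule can be expressed in a few lines of Python. This is a sketch of the rule as described, not the tool's own code; the function name default_dir_name is hypothetical, and the nameroot is approximated with os.path.splitext.

```python
import os


def default_dir_name(first_file):
    """Derive the default directory name from the first input file."""
    # nameroot: the base name without its (last) extension.
    nameroot = os.path.splitext(os.path.basename(first_file))[0]
    # rpartition splits on the rightmost "-"; sep is empty if there is none.
    head, sep, _ = nameroot.rpartition("-")
    return head if sep else nameroot
```

Applied to the examples above, default_dir_name("input-file-1-0000.txt") returns "input-file-1" and default_dir_name("input.txt") returns "input".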
save-ner-data.cwl¶
Create a csv file with statistics about named entities.
By editing the csv file, you can control which named entities are replaced or removed using replace-ner.cwl.
tar.cwl¶
Extract zipped tar archives.
xml-to-text.cwl¶
Extract text from an XML element and save it to a file.
zip-dir-flat.cwl¶
Compress a directory into a zip archive.
All structure is removed from the input directory. So, if you unzip the archive, you get a flat list of files.
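The flattening effect can be reproduced with Python's zipfile module, as sketched below. The function name zip_dir_flat is hypothetical. Note that in this sketch files with the same name in different subdirectories would end up as duplicate entries in the archive; how the actual tool handles such collisions is not documented here.

```python
import os
import zipfile


def zip_dir_flat(in_dir, zip_path):
    """Zip every file under in_dir, discarding the directory structure."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(in_dir):
            for name in files:
                # arcname=name stores only the file name, not its path.
                archive.write(os.path.join(root, name), arcname=name)
```

The archive should be written outside in_dir so that it is not picked up while walking the directory.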