ISCC - Text Processing
Text handling functions.
text_meta_extract(fp)
Extract metadata from text document file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fp | | Filepath to text document file. | required |
Returns:
Type | Description |
---|---|
| | Metadata mapped to IsccMeta schema |
text_meta_embed(fp, meta)
Embed metadata into a copy of the text document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fp | | Filepath to source text document file | required |
meta | | Metadata to embed into text document | required |
Returns:
Type | Description |
---|---|
| | Filepath to the new file with updated metadata (None if no embedding supported) |
text_extract(fp)
Extract plaintext from a text document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fp | | Filepath to text document file. | required |
Returns:
Type | Description |
---|---|
| | Extracted plaintext |
text_features(text)
Create granular simprints for text (minhashes over n-grams from CDC chunks, i.e. chunks produced by content-defined chunking). Text should be normalized and cleaned before extracting text features.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | | Normalized and cleaned plaintext. | required |
Returns:
Type | Description |
---|---|
| | Dictionary with 'sizes', 'features', 'offsets', and 'contents'. |
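The description above (minhashes over n-grams, computed per chunk) can be illustrated with a small sketch. This is not the library's implementation: the helper names `ngrams`, `minhash`, and `describe_chunks`, the n-gram width, the number of hash seeds, and the reading of 'offsets' as character positions are all assumptions made for illustration. The real function's feature encoding is not stated here.

```python
import hashlib
from typing import Dict, List


def ngrams(text: str, n: int = 5) -> List[str]:
    """Return overlapping character n-grams; a short text yields itself as one gram."""
    if len(text) <= n:
        return [text]
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def minhash(grams: List[str], num_seeds: int = 8) -> List[int]:
    """Toy MinHash: for each seed, keep the smallest 32-bit hash over all n-grams."""
    signature = []
    for seed in range(num_seeds):
        salt = seed.to_bytes(4, "big")  # a different salt per seed acts as a different hash function
        lowest = min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=4, salt=salt).digest(),
                "big",
            )
            for g in grams
        )
        signature.append(lowest)
    return signature


def describe_chunks(chunks: List[str]) -> Dict[str, list]:
    """Build a dictionary shaped like the documented return value (assumed key meanings)."""
    sizes, offsets, features, contents = [], [], [], []
    position = 0
    for chunk in chunks:
        offsets.append(position)  # assumed: character offset of the chunk in the full text
        sizes.append(len(chunk))  # assumed: chunk length in characters
        features.append(minhash(ngrams(chunk)))
        contents.append(chunk)
        position += len(chunk)
    return {"sizes": sizes, "features": features, "offsets": offsets, "contents": contents}


result = describe_chunks(["the quick brown fox ", "jumps over the lazy dog"])
print(result["sizes"], result["offsets"])
```

Because each signature keeps only the minimum hash per seed, two chunks that share most of their n-grams tend to agree in most signature positions, which is what makes such features usable for similarity matching.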
text_chunks(text, avg_size=idk.sdk_opts.text_avg_chunk_size)
Generate variable-sized text chunks (without a leading BOM).
Yields text chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | | Normalized plaintext | required |
avg_size | | Targeted average size of text chunks in characters. | text_avg_chunk_size |
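To show what "variable sized" chunking with a targeted average size can look like, here is a minimal sketch of content-defined chunking over characters. It is an illustration only and not the algorithm used by `text_chunks`: the function name `cdc_text_chunks`, the toy gear-style hash, the minimum and maximum size limits, and the default of 64 are assumptions (the real default comes from `idk.sdk_opts.text_avg_chunk_size`).

```python
from typing import Iterator


def cdc_text_chunks(text: str, avg_size: int = 64) -> Iterator[str]:
    """Yield variable-sized chunks whose boundaries depend on content, not position."""
    # Drop a leading byte order mark, mirroring the "without a leading BOM" note
    if text.startswith("\ufeff"):
        text = text[1:]
    min_size = max(1, avg_size // 4)  # hypothetical lower bound on chunk length
    max_size = avg_size * 4  # hypothetical upper bound on chunk length
    start = 0
    rolling = 0
    for i, char in enumerate(text):
        # Toy gear-style hash: shift out old characters, mix in the new one (32-bit)
        rolling = ((rolling << 1) + ord(char) * 2654435761) & 0xFFFFFFFF
        length = i - start + 1
        if length < min_size:
            continue
        # Cut when the hash matches a pattern that occurs about once in avg_size
        # positions, or when the chunk has reached the maximum length
        if rolling % avg_size == 0 or length >= max_size:
            yield text[start : i + 1]
            start = i + 1
            rolling = 0
    # Emit whatever is left after the last boundary
    if start < len(text):
        yield text[start:]


sample = "\ufeff" + "lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 20
for chunk in cdc_text_chunks(sample, avg_size=32):
    print(len(chunk), repr(chunk[:20]))
```

Since boundaries are chosen from the characters themselves, an insertion early in the text typically shifts only nearby boundaries, and later chunks can still line up with those of the unmodified text.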
text_name_from_uri(uri)
Extract the filename part of a URI, without the file extension, to be used as a fallback title for an asset when no title information can be acquired.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
uri | | URL or file path | required |
Returns:
Type | Description |
---|---|
| | Derived name (might be an empty string) |
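The behaviour described above can be approximated with the standard library. The following is a sketch under assumptions, not the implementation of `text_name_from_uri`: the helper name `name_from_uri` is hypothetical, and the real function may apply further cleanup to the derived name.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def name_from_uri(uri: str) -> str:
    """Return the last path segment of a URL or file path without its extension."""
    # For URLs this drops scheme, host, and query; plain file paths end up in .path as-is
    path = unquote(urlparse(uri).path)
    # Normalize Windows separators so the last segment is found either way
    path = path.replace("\\", "/")
    return PurePosixPath(path).stem


print(name_from_uri("https://example.com/docs/annual-report.pdf?download=1"))  # annual-report
print(name_from_uri("/tmp/notes.txt"))  # notes
print(repr(name_from_uri("https://example.com/")))  # ''
```

The last call shows why the documented return value "might be an empty string": a URI without a filename segment leaves nothing to derive a title from, so callers need their own fallback.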
text_thumbnail(fp)
Create a thumbnail for a text document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fp | | Filepath to text document. | required |
Returns:
Type | Description |
---|---|
| | Thumbnail image as PIL Image object |
text_sanitize(text)
Sanitize text from untrusted sources (e.g. metadata extracted from assets)