ISCC - Text Processing
Text handling functions.
text_meta_extract(fp)
Extract metadata from a text document file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fp | `str` | Filepath to text document file. | required |
Returns:
Type | Description |
---|---|
`dict` | Metadata mapped to IsccMeta schema |
text_meta_embed(fp, meta)
Embed metadata into a copy of the text document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fp | `str` | Filepath to source text document file | required |
meta | `IsccMeta` | Metadata to embed into text document | required |
Returns:
Type | Description |
---|---|
`str\|None` | Filepath to the new file with updated metadata (None if embedding is not supported) |
text_extract(fp)
Extract plaintext from a text document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fp | `str` | Filepath to text document file. | required |
Returns:
Type | Description |
---|---|
`str` | Extracted plaintext |
text_features(text)
Create a granular fingerprint for text (minhashes over n-grams from CDC chunks). The text should be normalized before its features are extracted.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | `str` | Normalized plaintext. | required |
Returns:
Type | Description |
---|---|
`dict` | Dictionary with 'sizes' and 'features'. |
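The idea of "minhashes over n-grams" can be sketched with the standard library alone. The snippet below is an illustration, not the SDK's implementation: the helper names (`char_ngrams`, `minhash`, `similarity`), the n-gram width, the number of hash functions, and the use of keyed BLAKE2b are all assumptions chosen for readability.

```python
import hashlib
from typing import List, Set


def char_ngrams(text: str, width: int = 5) -> Set[str]:
    """Return the set of overlapping character n-grams of `text`."""
    if len(text) <= width:
        return {text} if text else set()
    return {text[i : i + width] for i in range(len(text) - width + 1)}


def minhash(grams: Set[str], num_hashes: int = 16) -> List[int]:
    """Toy minhash: for each keyed hash function keep the smallest value."""
    signature = []
    for salt in range(num_hashes):
        # Each salt acts as the key of an independent hash function (assumption).
        key = salt.to_bytes(4, "big")
        values = (
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, key=key).digest(),
                "big",
            )
            for g in grams
        )
        # An empty n-gram set yields 0 at every position.
        signature.append(min(values, default=0))
    return signature


def similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two texts that share most of their n-grams tend to agree in most signature positions, which is what makes such signatures useful as compact, comparable features. In a chunk-based scheme, one signature would be computed per chunk and stored alongside the chunk size.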
text_chunks(text, avg_size = idk.sdk_opts.text_avg_chunk_size)
Generate variable-sized text chunks (without a leading BOM).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | `str` | Normalized plaintext. | required |
avg_size | `int` | Targeted average size of text chunks in bytes. | `idk.sdk_opts.text_avg_chunk_size` |
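To make "variable-sized chunks" more concrete, here is a minimal content-defined chunking sketch. It is not the algorithm the SDK uses: the function name `chunk_text`, the gear-style rolling hash, the multiplier constant, and the minimum/maximum size limits are assumptions for illustration only. Cuts are placed on character boundaries so that no multi-byte UTF-8 sequence is split.

```python
from typing import Iterator


def chunk_text(text: str, avg_size: int = 64) -> Iterator[str]:
    """Illustrative content-defined chunking; NOT the SDK's algorithm."""
    # Drop a leading byte order mark so it never ends up in the first chunk.
    if text.startswith("\ufeff"):
        text = text[1:]
    # A cut is allowed when the low bits of the rolling hash are all zero.
    # `avg_size` is assumed to be a power of two so that `mask` is all ones.
    mask = avg_size - 1
    min_size = avg_size // 4  # hypothetical lower bound in bytes
    max_size = avg_size * 4  # hypothetical upper bound in bytes
    start, size, rolling = 0, 0, 0
    for pos, char in enumerate(text):
        for byte in char.encode("utf-8"):
            # Gear-style update: each shift pushes older bytes out of the
            # 32-bit state, so the hash depends on recent content only.
            rolling = ((rolling << 1) + byte * 2654435761) & 0xFFFFFFFF
            size += 1
        at_boundary = size >= min_size and (rolling & mask) == 0
        if at_boundary or size >= max_size:
            yield text[start : pos + 1]
            start, size, rolling = pos + 1, 0, 0
    # Emit whatever remains after the last cut point.
    if start < len(text):
        yield text[start:]
```

Because cut points depend on the content rather than on fixed offsets, an edit near the start of a document typically shifts only nearby chunk boundaries, and later chunks remain unchanged. Joining the chunks always reproduces the input text minus any leading BOM.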
text_name_from_uri(uri)
Extract the filename part of a URI, without the file extension, to be used as a fallback title for an asset if no title information can be acquired.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
uri | `str` | URL or file path | required |
Returns:
Type | Description |
---|---|
`str` | Derived name (might be an empty string) |
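The behaviour described above can be approximated with the standard library. The helper below, `name_from_uri`, is a hypothetical stand-in and not the SDK function itself; in particular, decoding percent-escapes and turning `-` and `_` into spaces are assumptions about what makes a reasonable fallback title.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def name_from_uri(uri: str) -> str:
    """Hypothetical helper: derive a title-like name from a URL or file path."""
    # Normalize Windows separators so paths and URLs can be handled alike.
    path = urlparse(uri.replace("\\", "/")).path
    # Take the last path segment, decode percent-escapes, drop the extension.
    stem = PurePosixPath(unquote(path)).stem
    # Treat common separators as spaces and collapse repeated whitespace
    # (an assumption, not confirmed SDK behaviour).
    return " ".join(stem.replace("-", " ").replace("_", " ").split())
```

For a URL such as `https://example.com/docs/annual-report_2023.pdf?x=1` this sketch returns `annual report 2023`, and for a URL with no filename part it returns an empty string, matching the note that the derived name might be empty.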
text_thumbnail(fp)
Create a thumbnail for a text document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fp | `str` | Filepath to text document. | required |
Returns:
Type | Description |
---|---|
`Image.Image\|None` | Thumbnail image as PIL Image object |