ISCC - Text Processing

Text handling functions.

text_meta_extract(fp)

Extract metadata from a text document file.

Parameters:

Name Type Description Default
fp str

Filepath to text document file.

required

Returns:

Type Description
dict

Metadata mapped to IsccMeta schema

text_meta_embed(fp, meta)

Embed metadata into a copy of the text document.

Parameters:

Name Type Description Default
fp str

Filepath to source text document file

required
meta IsccMeta

Metadata to embed into text document

required

Returns:

Type Description
str|None

Filepath to the new file with updated metadata (None if embedding is not supported)

text_extract(fp)

Extract plaintext from a text document.

Parameters:

Name Type Description Default
fp str

Filepath to text document file.

required

Returns:

Type Description
str

Extracted plaintext

text_features(text)

Create a granular fingerprint for text (minhashes over n-grams from CDC chunks). The text should be normalized before features are extracted.

Parameters:

Name Type Description Default
text str

Normalized plaintext.

required

Returns:

Type Description
dict

Dictionary with 'sizes' and 'features'.
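
The idea of minhashes over n-grams, computed per chunk, can be sketched with the standard library alone. This is an illustration of the technique, not the ISCC implementation: the names `ngrams`, `minhash`, and `toy_text_features` are hypothetical, and the character trigrams, the eight salted BLAKE2b hash functions, the integer signature format, and sizes counted in characters are all assumptions. The chunks are passed in already split.

```python
import hashlib


def ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams; a text no longer than n is its own single gram."""
    if len(text) <= n:
        return {text}
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def hash64(gram: str, seed: int) -> int:
    """64-bit hash of a gram; the seed selects one of several hash functions."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash(grams: set[str], num_hashes: int = 8) -> list[int]:
    """Toy MinHash signature: the minimum hash value per hash function."""
    return [min(hash64(g, seed) for g in grams) for seed in range(num_hashes)]


def toy_text_features(chunks: list[str]) -> dict:
    """Hypothetical stand-in that mirrors the 'sizes' / 'features' layout."""
    return {
        "sizes": [len(chunk) for chunk in chunks],  # assumption: sizes in characters
        "features": [minhash(ngrams(chunk)) for chunk in chunks],
    }
```

Identical chunks produce identical signatures, and chunks that share many n-grams tend to agree in many signature positions. That property is what makes such features usable for granular similarity matching.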

text_chunks(text, avg_size = idk.sdk_opts.text_avg_chunk_size)

Generate variable-sized text chunks (without a leading BOM).

Parameters:

Name Type Description Default
text

Normalized plaintext.

required
avg_size

Targeted average size of text chunks in bytes.

idk.sdk_opts.text_avg_chunk_size
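
Variable-sized chunking is usually content-defined: a boundary is placed wherever a hash of the most recent characters meets a condition, so boundaries follow the content rather than fixed offsets. The generator below is a minimal sketch of that idea and not the algorithm the SDK uses. `toy_text_chunks`, the CRC32 window hash, the `window` parameter, and sizes measured in characters rather than bytes are assumptions made for illustration.

```python
import zlib
from typing import Iterator


def toy_text_chunks(text: str, avg_size: int = 64, window: int = 8) -> Iterator[str]:
    """Yield variable-sized chunks whose boundaries depend on local content."""
    start = 0
    for end in range(window, len(text) + 1):
        # Hash the last `window` characters. Roughly one position in
        # `avg_size` matches, so chunks average about avg_size characters.
        fingerprint = zlib.crc32(text[end - window : end].encode("utf-8"))
        if end - start >= window and fingerprint % avg_size == 0:
            yield text[start:end]
            start = end
    # Emit whatever remains after the last boundary.
    if start < len(text):
        yield text[start:]
```

Joining the yielded chunks reproduces the input exactly. Because each boundary decision depends only on the preceding `window` characters, an edit early in the text tends to leave later boundaries where they were.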

text_name_from_uri(uri)

Extract the filename part of a URI, without the file extension, to be used as a fallback title for an asset when no title information can be obtained.

Parameters:

Name Type Description Default
uri str

URL or file path

required

Returns:

Type Description
str

Derived name (might be an empty string)
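
The behavior described above can be approximated with the standard library. The function below is a hypothetical stand-in named `name_from_uri`, not the SDK's implementation. It assumes that percent-encoded characters should be decoded and that only the last extension is removed.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def name_from_uri(uri: str) -> str:
    """Return the last path segment of a URL or file path without its extension."""
    # urlparse keeps a plain file path in .path, so URLs and paths both work.
    # Query strings and fragments are separated out and ignored.
    path = unquote(urlparse(uri).path)
    return PurePosixPath(path).stem
```

With this sketch, `name_from_uri("https://example.com/docs/My%20Report.pdf?download=1")` returns `"My Report"`, and a URI with no filename part, such as `"https://example.com/"`, returns an empty string.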

text_thumbnail(fp)

Create a thumbnail for a text document.

Parameters:

Name Type Description Default
fp str

Filepath to text document.

required

Returns:

Type Description
Image.Image|None

Thumbnail image as PIL Image object