ISCC - Text Processing

Text handling functions.

text_meta_extract(fp)

Extract metadata from a text document file.

Parameters:

| Name | Description | Default |
|------|-------------|---------|
| fp | Filepath to text document file. | required |

Returns:

Metadata mapped to the IsccMeta schema.

text_meta_embed(fp, meta)

Embed metadata into a copy of the text document.

Parameters:

| Name | Description | Default |
|------|-------------|---------|
| fp | Filepath to source text document file. | required |
| meta | Metadata to embed into the text document. | required |

Returns:

Filepath to the new file with updated metadata (None if embedding is not supported).

text_extract(fp)

Extract plaintext from a text document.

Parameters:

| Name | Description | Default |
|------|-------------|---------|
| fp | Filepath to text document file. | required |

Returns:

Extracted plaintext.

text_features(text)

Create granular simprints for text (minhashes over n-grams from content-defined chunks). The text should be normalized and cleaned before features are extracted.

Parameters:

| Name | Description | Default |
|------|-------------|---------|
| text | Normalized and cleaned plaintext. | required |

Returns:

Dictionary with the keys 'sizes', 'features', 'offsets', and 'contents'.
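To make the idea of "minhashes over n-grams from chunks" concrete, here is a minimal, self-contained sketch. It is not the SDK's implementation: the helper names (`ngrams`, `minhash`, `sketch_features`), the n-gram width, the number of hash permutations, the use of `hashlib.blake2b`, and the fixed-size chunking are all assumptions chosen for illustration. Only the four dictionary keys come from the documentation above.

```python
import hashlib


def ngrams(text, n=5):
    """Return overlapping character n-grams (the whole text if shorter than n)."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def minhash(items, num_perm=16):
    """Return a minhash signature: the minimum salted hash for each permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt simulates a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def sketch_features(text, chunk_size=32):
    """Build a dictionary shaped like the documented return value."""
    result = {"sizes": [], "features": [], "offsets": [], "contents": []}
    # Assumption: fixed-size chunks stand in for the content-defined chunks of the SDK.
    for offset in range(0, len(text), chunk_size):
        chunk = text[offset:offset + chunk_size]
        result["sizes"].append(len(chunk))
        result["features"].append(minhash(ngrams(chunk)))
        result["offsets"].append(offset)
        result["contents"].append(chunk)
    return result
```

Because each signature entry is a minimum over a set of hashes, two chunks that share most of their n-grams tend to agree in many signature positions, which is what makes the features usable for similarity matching.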

text_chunks(text, avg_size=idk.sdk_opts.text_avg_chunk_size)

Generate variable-sized text chunks (without a leading BOM).

Yields:

Text chunks.

Parameters:

| Name | Description | Default |
|------|-------------|---------|
| text | Normalized plaintext. | required |
| avg_size | Targeted average size of text chunks in characters. | idk.sdk_opts.text_avg_chunk_size |
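The following sketch illustrates how a generator can yield variable-sized chunks whose boundaries depend on the content rather than on fixed offsets. It is a simplified stand-in, not the chunking algorithm used by the SDK: the function name `chunk_text`, the `window` parameter, and the CRC32-based cut condition are assumptions made for this example.

```python
import zlib


def chunk_text(text, avg_size=64, window=4):
    """Yield variable-sized chunks, cutting where a hash of the trailing window hits zero."""
    # Drop a leading byte order mark so it never ends up in the first chunk.
    if text.startswith("\ufeff"):
        text = text[1:]
    start = 0
    for i in range(len(text)):
        # Only consider a cut once the current chunk holds a full window.
        if i + 1 - start < window:
            continue
        trailing = text[i - window + 1:i + 1]
        # Roughly one in avg_size window hashes satisfies the cut condition.
        if zlib.crc32(trailing.encode("utf-8")) % avg_size == 0:
            yield text[start:i + 1]
            start = i + 1
    # Emit whatever remains after the last cut point.
    if start < len(text):
        yield text[start:]
```

Joining the yielded chunks reproduces the input (minus any leading BOM). Since cut points are decided by local content, inserting text near the start of a document tends to shift only nearby boundaries instead of every boundary after it.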

text_name_from_uri(uri)

Extract the filename part of a URI, without the file extension, to be used as a fallback title for an asset when no title information can be acquired.

Parameters:

| Name | Description | Default |
|------|-------------|---------|
| uri | URL or file path. | required |

Returns:

Derived name (might be an empty string).
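As an illustration of the behavior described above, the sketch below derives a name from a URL or POSIX-style file path using only the standard library. The function name `name_from_uri` is hypothetical, and the SDK's own function may apply further processing that is not documented here.

```python
from posixpath import basename, splitext
from urllib.parse import urlparse


def name_from_uri(uri):
    """Return the last path segment of a URL or file path without its extension."""
    # urlparse separates the path from scheme, host, query string, and fragment.
    path = urlparse(str(uri)).path
    return splitext(basename(path))[0]
```

With this sketch, `name_from_uri("https://example.com/docs/annual-report.pdf?x=1")` gives `"annual-report"`, and a URL without a path segment, such as `"https://example.com/"`, gives an empty string, matching the note that the derived name might be empty.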

text_thumbnail(fp)

Create a thumbnail for a text document.

Parameters:

| Name | Description | Default |
|------|-------------|---------|
| fp | Filepath to text document. | required |

Returns:

Thumbnail image as a PIL Image object.

text_sanitize(text)

Sanitize text from untrusted sources (e.g. metadata extracted from assets).