Transformers

As the name suggests, a Transformer is a kind of object that transforms other objects. In pliers, every Transformer takes a single Stim as its input, though it can return different outputs. The Transformer API in pliers is modeled loosely on the widely used scikit-learn API; as such, what defines a Transformer, from a user’s perspective, is that one can always pass a Stim instance to the Transformer’s .transform() method and expect to get another object as a result.

In practice, most users should never have any reason to directly instantiate the base Transformer class. We will almost invariably work with one of three different Transformer sub-classes: Extractor, Converter, and Filter. These classes are distinguished by the type of output that their respective .transform() methods produce:

Transformer class    Input    Output
Extractor            AStim    ExtractorResult
Converter            AStim    BStim
Filter               AStim    AStim

Here, AStim and BStim are different Stim subclasses. So an Extractor always returns an ExtractorResult, no matter what type of Stim it receives as input. A Converter and a Filter are distinguished by the fact that a Converter always returns a Stim of a different class than its input, while a Filter always returns a Stim of the same type as its input. This simple hierarchy turns out to be extremely powerful, as it enables us to operate in a natural, graph-like way over Stims, by filtering and converting them as needed before applying one or more Extractors to obtain extracted feature values.
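The type contract in the table can be sketched with a few stand-in classes. This is a toy model for illustration only, not the pliers implementation: every class name below (ToyStim, ToyAudioStim, ToyTextStim, ToyResult, and the three transformer classes) is hypothetical.

```python
# Toy model of the Transformer type contract; none of these classes are part of pliers.
class ToyStim:
    def __init__(self, data):
        self.data = data


class ToyAudioStim(ToyStim):
    pass


class ToyTextStim(ToyStim):
    pass


class ToyResult:
    """Stand-in for ExtractorResult: holds feature values, not a Stim."""

    def __init__(self, features):
        self.features = features


class ToyLengthExtractor:
    """Extractor: Stim in, result object out."""

    def transform(self, stim):
        return ToyResult({"length": len(stim.data)})


class ToyTranscriber:
    """Converter: Stim in, Stim of a different class out."""

    def transform(self, stim):
        # Pretend the audio data is already a transcript.
        return ToyTextStim(str(stim.data))


class ToyLowerCaser:
    """Filter: Stim in, Stim of the same class out."""

    def transform(self, stim):
        return type(stim)(stim.data.lower())


audio = ToyAudioStim("Hello World")
text = ToyTranscriber().transform(audio)          # ToyAudioStim -> ToyTextStim
lowered = ToyLowerCaser().transform(text)         # ToyTextStim -> ToyTextStim
result = ToyLengthExtractor().transform(lowered)  # ToyTextStim -> ToyResult
```

The last three lines follow the convert, filter, extract pattern described above: each step hands the next one the kind of object it expects.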

Let’s examine each of these Transformer types more carefully.

Extractors

Extractors are the most important kind of Transformer in pliers, and in many cases, users will never have to touch any other kind of Transformer directly. Every Extractor implements a transform() method that takes a Stim object as its first argument, and returns an object of class ExtractorResult (see below). For example:

# Google Cloud Vision API face detection
from pliers.extractors import GoogleVisionAPIFaceExtractor

ext = GoogleVisionAPIFaceExtractor()
result = ext.transform('my_image.jpg')

List of Extractor classes

At present, pliers implements several dozen Extractor classes that span a wide variety of input modalities and types of extracted features. These include:

Audio feature extractors

AudiosetLabelExtractor([hop_size, top_n, …])

Extract probability of 521 audio event classes based on AudioSet corpus using a YAMNet architecture.

BeatTrackExtractor([feature, hop_length])

Dynamic programming beat tracker (beat_track) from audio using the Librosa library.

ChromaCENSExtractor([n_chroma])

Extracts a chroma variant “Chroma Energy Normalized” (CENS) chromagram from audio (via Librosa).

ChromaCQTExtractor([n_chroma])

Extracts a constant-Q chromagram from audio using the Librosa library.

ChromaSTFTExtractor([n_chroma])

Extracts a chromagram from an audio’s waveform using the Librosa library.

HarmonicExtractor([feature, hop_length])

Extracts the harmonic elements from an audio time-series using the Librosa library.

MeanAmplitudeExtractor([name])

Mean amplitude extractor for blocks of audio with transcription.

MelspectrogramExtractor([n_mels])

Extracts mel-scaled spectrogram from audio using the Librosa library.

MFCCExtractor([n_mfcc])

Extracts Mel Frequency Cepstral Coefficients from audio using the Librosa library.

OnsetDetectExtractor([feature, hop_length])

Detects the basic onset (onset_detect) from audio using the Librosa library.

OnsetStrengthMultiExtractor([feature, …])

Computes the spectral flux onset strength envelope across multiple channels (onset_strength_multi) from audio using the Librosa library.

PercussiveExtractor([feature, hop_length])

Extracts the percussive elements from an audio time-series using the Librosa library.

PolyFeaturesExtractor([order])

Extracts the coefficients of fitting an nth-order polynomial to the columns of an audio’s spectrogram (via Librosa).

RMSExtractor([feature, hop_length])

Extracts root mean square (RMS) from audio using the Librosa library.

SpectralCentroidExtractor([feature, hop_length])

Extracts the spectral centroids from audio using the Librosa library.

SpectralBandwidthExtractor([feature, hop_length])

Extracts the p’th-order spectral bandwidth from audio using the Librosa library.

SpectralContrastExtractor([n_bands])

Extracts the spectral contrast from audio using the Librosa library.

SpectralFlatnessExtractor([feature, hop_length])

Computes the spectral flatness from audio using the Librosa library.

SpectralRolloffExtractor([feature, hop_length])

Extracts the roll-off frequency from audio using the Librosa library.

STFTAudioExtractor([frame_size, hop_size, …])

Short-time Fourier Transform extractor.

TempogramExtractor([win_length])

Extracts a tempogram from audio using the Librosa library.

TempoExtractor([feature, hop_length])

Detects the tempo (tempo) from audio using the Librosa library.

TonnetzExtractor([feature, hop_length])

Extracts the tonal centroids (tonnetz) from audio using the Librosa library.

ZeroCrossingRateExtractor([feature, hop_length])

Extracts the zero-crossing rate of audio using the Librosa library.

Image feature extractors

BrightnessExtractor([name])

Gets the average luminosity of the pixels in the image

ClarifaiAPIImageExtractor([api_key, model, …])

Uses the Clarifai API to extract tags of images.

ClarifaiAPIVideoExtractor([api_key, model, …])

Uses the Clarifai API to extract tags from videos.

FaceRecognitionFaceEncodingsExtractor(…)

Uses the face_recognition package to extract a 128-dimensional encoding for every face detected in an image.

FaceRecognitionFaceLandmarksExtractor(…)

Uses the face_recognition package to extract the locations of named features of faces in the image.

FaceRecognitionFaceLocationsExtractor(…)

Uses the face_recognition package to extract bounding boxes for all faces in an image.

GoogleVisionAPIFaceExtractor([…])

Identifies faces in images using the Google Cloud Vision API.

GoogleVisionAPILabelExtractor([…])

Labels objects in images using the Google Cloud Vision API.

GoogleVisionAPIPropertyExtractor([…])

Extracts image properties using the Google Cloud Vision API.

GoogleVisionAPISafeSearchExtractor([…])

Extracts safe search detection using the Google Cloud Vision API.

GoogleVisionAPIWebEntitiesExtractor([…])

Extracts web entities using the Google Cloud Vision API.

MicrosoftAPIFaceExtractor([face_id, …])

Extracts face features (location, emotion, accessories, etc.).

MicrosoftAPIFaceEmotionExtractor([face_id, …])

Extracts facial emotions from images using the Microsoft API

MicrosoftVisionAPIExtractor([features, …])

Base MicrosoftVisionAPIExtractor class.

MicrosoftVisionAPITagExtractor([…])

Extracts image tags using the Microsoft API

MicrosoftVisionAPICategoryExtractor([…])

Extracts image categories using the Microsoft API

MicrosoftVisionAPIImageTypeExtractor([…])

Extracts image types (clipart, etc.) using the Microsoft API

MicrosoftVisionAPIColorExtractor([…])

Extracts image color attributes using the Microsoft API

MicrosoftVisionAPIAdultExtractor([…])

Extracts the presence of adult content using the Microsoft API

SaliencyExtractor([name])

Determines the saliency of the image using Itti & Koch (1998) algorithm implemented in pySaliencyMap

SharpnessExtractor([name])

Gets the degree of blur/sharpness of the image

TFHubExtractor(url_or_path[, features, …])

A generic class for TensorFlow Hub extractors. Parameters:

url_or_path (str): URL or path to a TFHub model. You can browse models at https://tfhub.dev/.

features (optional): list of labels (for classification) or other feature names. The number of items must match the number of features in the output. For example, if a classification model with 1000 output classes is passed (e.g., EfficientNet B6, see https://tfhub.dev/tensorflow/efficientnet/b6/classification/1), this must be a list containing 1000 items. If a text encoder outputting a 768-dimensional encoding is passed (e.g., base BERT), this must be a list containing 768 items. Each dimension in the model output will be returned as a separate feature in the ExtractorResult. Alternatively, the model output can be packed into a single feature (i.e., a vector) by passing a single-element list (e.g., ['encoding']) or a string. For example, if features=['encoding'] is passed for a TFHub model that outputs a 768-dimensional encoding, the extractor will return only one feature named 'encoding', which contains the encoding vector as a 1-d array wrapped in a list. If no value is passed, the extractor will automatically compute the number of features in the model output and return an equal number of features in pliers, labeling each feature with a generic prefix plus its positional index in the model output (feature_0, feature_1, …, feature_n).

transform_out (optional): function to transform the model output for compatibility with the extractor result.

transform_inp (optional): function to transform Stim.data for compatibility with the model input format.

kwargs (dict): arguments to the hub.KerasLayer call.

VibranceExtractor([name])

Gets the variance of color channels of the image

Text feature extractors

BertExtractor([pretrained_model, tokenizer, …])

Returns encodings from the last hidden layer of BERT or similar models (ALBERT, DistilBERT, RoBERTa, CamemBERT). Excludes special tokens. Base class for other Bert extractors. Parameters:

pretrained_model (str): a string specifying which transformer model to use. Can be any pretrained BERT or BERT-derived (ALBERT, DistilBERT, RoBERTa, CamemBERT, etc.) model listed at https://huggingface.co/transformers/pretrained_models.html, or a path to a custom model.

tokenizer (str): type of tokenization used in the tokenization step. If different from the model, out-of-vocabulary tokens may be treated as unknown tokens.

model_class (str): specifies the model type. Must be one of 'AutoModel' (encoding extractor) or 'AutoModelWithLMHead' (language model). These are generic model classes, which use the value of pretrained_model to infer the model-specific transformers class (e.g., BertModel or BertForMaskedLM for BERT, RobertaModel or RobertaForMaskedLM for RoBERTa). Fixed by each subclass.

framework (str): name of the deep learning framework to use. Must be 'pt' (PyTorch) or 'tf' (TensorFlow). Defaults to 'pt'.

return_input (bool): if True, the extractor returns the encoded token and encoded word as features.

model_kwargs (dict): named arguments for the transformer model. See https://huggingface.co/transformers/main_classes/model.html

tokenizer_kwargs (dict): named arguments for the tokenizer. See https://huggingface.co/transformers/main_classes/tokenizer.html

BertSequenceEncodingExtractor([…])

Extract contextualized sequence encodings using pretrained BERT

BertLMExtractor([pretrained_model, …])

Returns masked-word predictions from BERT or a similar model.

BertSentimentExtractor([pretrained_model, …])

Extracts sentiment for sequences using BERT or a similar model.

ComplexTextExtractor([name])

Base ComplexTextStim Extractor class; all subclasses can only be applied to ComplexTextStim instance.

DictionaryExtractor(dictionary[, variables, …])

A generic dictionary-based extractor that supports extraction of arbitrary features contained in a lookup table.

LengthExtractor([name])

Extracts the length of the text in characters.

NumUniqueWordsExtractor([tokenizer])

Extracts the number of unique words used in the text.

PartOfSpeechExtractor([batch_size])

Tags parts of speech in text with nltk.

PredefinedDictionaryExtractor(variables[, …])

A generic Extractor that maps words onto values via one or more pre-defined dictionaries accessed via the web.

SpaCyExtractor([extractor_type, features, model])

A generic class for Spacy Text extractors

TextVectorizerExtractor([vectorizer])

Uses a scikit-learn Vectorizer to extract bag-of-features from text.

VADERSentimentExtractor()

Uses nltk’s VADER lexicon to extract (0.0-1.0) values for the positive, neutral, and negative sentiment of a TextStim.

WordCounterExtractor([case_sensitive, log_scale])

Extracts number of times each unique word has occurred within text

WordEmbeddingExtractor(embedding_file[, …])

An extractor that uses a word embedding file to look up embedding vectors for text.

Video feature extractors

FarnebackOpticalFlowExtractor([pyr_scale, …])

Extracts total amount of dense optical flow between every pair of video frames.

Deep learning model extractors

TensorFlowKerasApplicationExtractor([…])

Labels objects in images using a pretrained Inception V3 architecture implemented in TensorFlow / Keras.

TFHubImageExtractor(url_or_path[, features, …])

TFHub extractor class for image models. Parameters:

url_or_path (str): URL or path to a TFHub model.

features (optional): list of labels (for classification) or other feature names. If not specified, returns numbered features (feature_0, feature_1, …, feature_n).

rescale_rgb (bool): whether to rescale values to the 0-1 range.

reshape_input (tuple): if the input needs to be reshaped, specifies the target shape (height, width, n_channels). Details on whether the model only accepts a fixed size are usually provided on the TFHub model page.

kwargs (dict): arguments to the hub.KerasLayer call.

TFHubTextExtractor(url_or_path[, features, …])

TFHub extractor class for text models. Parameters:

url_or_path (str): URL or path to a TFHub model. You can browse models at https://tfhub.dev/.

features (optional): list of labels or other feature names. The number of items must match the number of features in the model output. For example, if a text encoder outputting a 768-dimensional encoding is passed (e.g., base BERT), this must be a list containing 768 items. Each dimension in the model output will be returned as a separate feature in the ExtractorResult. Alternatively, the model output can be packed into a single feature (i.e., a vector) by passing a single-element list (e.g., ['encoding']) or a string. If no value is passed, the extractor will automatically compute the number of features in the model output and return an equal number of features in pliers, labeling each feature with a generic prefix plus its positional index in the model output (feature_0, feature_1, …, feature_n).

output_key (str): key to the desired embedding in the output dictionary (see documentation at https://www.tensorflow.org/hub/common_saved_model_apis/text). Set to None if the output is not a dictionary.

preprocessor_url_or_path (str): if the model requires preprocessing through another TFHub model, specifies the URL or path to the preprocessing module. Information on required preprocessing and appropriate models is generally available on the TFHub model webpage.

preprocessor_kwargs (dict): dictionary of named arguments for the preprocessor model hub.KerasLayer call.

Miscellaneous extractors

MetricExtractor([functions, var_names, …])

Extracts summary metrics from SeriesStim using numpy, scipy or custom functions.

Note that, in practice, the number of features one can extract using the above classes is extremely large, because many of these Extractors return open-ended feature sets that are determined by the contents of the input Stim and/or the specified initialization arguments. For example, most of the image-labeling Extractors that rely on deep learning-based services (e.g., GoogleVisionAPILabelExtractor and ClarifaiAPIImageExtractor) will return feature information for any of the top N objects detected in the image. And the PredefinedDictionaryExtractor provides a standardized interface to a large number of online word lookup dictionaries (e.g., word norms for written frequency, age-of-acquisition, emotionality ratings, etc.).

Working with Extractor results

Extractors differ from other Transformers in an important way: they return feature data rather than Stim objects. Pliers imposes a standardized representation on these results; in particular, calling transform() on any Extractor returns an object of the aptly named ExtractorResult class. This object contains a variety of useful internal references and logged data, and it can also be easily converted to a pandas DataFrame. There’s much more to say about feature extraction results in pliers, but to keep things focused, we cover it in a separate Results section rather than here.

Converters

Converters, as their name suggests, convert Stim classes from one type to another. For example, the IBMSpeechAPIConverter, which is a subclass of AudioToTextConverter, takes an AudioStim as input, queries IBM’s Watson speech-to-text API, and returns a transcription of the audio as a ComplexTextStim object. Most Converter classes have sensible names that clearly indicate what they do, but to prevent any ambiguity (and support type-checking), every concrete Converter class must define _input_type and _output_type properties that indicate what Stim classes they take and return as input and output, respectively.
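To make the role of the declared types concrete, here is a minimal sketch of how such declarations can support type-checking. Only the attribute names _input_type and _output_type come from the text above; the classes and the checking logic are hypothetical stand-ins, and the checks pliers actually performs may differ.

```python
# Hypothetical stand-ins; only the attribute names _input_type and
# _output_type are taken from the surrounding text.
class ToyStim:
    def __init__(self, data):
        self.data = data


class ToyAudioStim(ToyStim):
    pass


class ToyTextStim(ToyStim):
    pass


class ToyConverter:
    _input_type = ToyStim
    _output_type = ToyStim

    def transform(self, stim):
        # Reject inputs that do not match the declared input type.
        if not isinstance(stim, self._input_type):
            raise TypeError(
                f"{type(self).__name__} expects {self._input_type.__name__}, "
                f"got {type(stim).__name__}"
            )
        out = self._convert(stim)
        # Check the result against the declared output type as well.
        if not isinstance(out, self._output_type):
            raise TypeError("converter returned an unexpected type")
        return out

    def _convert(self, stim):
        raise NotImplementedError


class ToyAudioToTextConverter(ToyConverter):
    _input_type = ToyAudioStim
    _output_type = ToyTextStim

    def _convert(self, stim):
        # A real converter would call a speech-to-text service here.
        return ToyTextStim("transcript of " + stim.data)


conv = ToyAudioToTextConverter()
text = conv.transform(ToyAudioStim("clip.wav"))

# Passing the wrong Stim type is rejected before any conversion runs.
try:
    conv.transform(ToyTextStim("already text"))
    rejected = False
except TypeError:
    rejected = True
```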

Implicit Stim conversion

Although Converters play a critical role in pliers, they usually don’t need to be invoked explicitly by users, as pliers can usually figure out what conversions must be performed and carry them out implicitly. For example, suppose we want to run the STFTAudioExtractor—which computes the short-time Fourier transform on an audio clip and returns its power spectrum—on the audio track of a movie clip. We don’t need to explicitly convert the VideoStim to an AudioStim, because pliers is clever enough to determine that it can get the appropriate input for the STFTAudioExtractor by executing the VideoToAudioConverter. In practice, then, the following two snippets produce identical results:

from pliers.extractors import STFTAudioExtractor
from pliers.stimuli import VideoStim
video = VideoStim('my_movie.mp4')

# Option A: explicit conversion
from pliers.converters import VideoToAudioConverter
conv = VideoToAudioConverter()
audio = conv.transform(video)
ext = STFTAudioExtractor(freq_bins=10)
result = ext.transform(audio)

# Option B: implicit conversion
ext = STFTAudioExtractor(freq_bins=10)
result = ext.transform(video)

Because pliers contains a number of “multistep” Converter classes, which chain together multiple standard Converters, implicit Stim conversion will typically work not only for a single conversion, but also for a whole series of them. For example, if you feed a video file to a LengthExtractor (which just counts the number of characters in each TextStim’s text), pliers will use the built-in VideoToTextConverter class to transform your VideoStim into a TextStim, and everything should work smoothly in most cases.

I say “most” cases, because there are two important gotchas to be aware of when relying on implicit conversion. First, sometimes there’s an inherent ambiguity about what trajectory a given stimulus should take through converter space; in such cases, the default conversions pliers performs may not line up with your expectations. For example, a VideoStim can be converted to a TextStim either by (a) extracting the audio track from the video and then transcribing it into text via a speech recognition service, or (b) extracting the video frames from the video and then attempting to detect any text labels within each image. Because pliers has no way of knowing which of these you’re trying to accomplish, it will default to the first. The upshot is that if you think there’s any chance of ambiguity in the conversion process, it’s probably a good idea to explicitly chain the Converter steps (you can do this very easily using the Graph interface discussed separately). The explicit approach also gives you more control: you can initialize a particular Converter with non-default arguments, and/or specify exactly which of several candidate Converter classes to use (e.g., pliers defaults to performing speech-to-text conversion via the IBM Watson API, but also supports the Wit.ai and Google Cloud Speech APIs).
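The ambiguity described above can be pictured as a graph in which Stim types are nodes and Converters are edges, with more than one path between two nodes. The sketch below enumerates those paths over a small registry. The converter names appear elsewhere in this document, but the registry layout and the all_chains helper are assumptions made for illustration; this is not how pliers itself resolves conversions.

```python
# Hypothetical registry: (input type, output type) -> converter name.
CONVERTERS = {
    ("VideoStim", "AudioStim"): "VideoToAudioConverter",
    ("AudioStim", "TextStim"): "IBMSpeechAPIConverter",
    ("VideoStim", "ImageStim"): "VideoFrameIterator",
    ("ImageStim", "TextStim"): "TesseractConverter",
}


def all_chains(source, target, visited=()):
    """Return every converter chain from source to target.

    A Stim type is never revisited within a single chain.
    """
    if source == target:
        return [[]]
    chains = []
    for (src, dst), name in CONVERTERS.items():
        if src == source and dst not in visited:
            for rest in all_chains(dst, target, visited + (source,)):
                chains.append([name] + rest)
    return chains


chains = all_chains("VideoStim", "TextStim")
for chain in chains:
    print(" -> ".join(chain))
```

With two candidate chains and nothing else to go on, some tie-breaking rule is needed. Chaining the Converters explicitly removes the guesswork.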

Package-wide conversion defaults

Alternatively, you can set the default Converter(s) to use for any implicit Stim conversion at a package-wide level, via the config.default_converters attribute. By default, this is something like:

default_converters = {
    'AudioStim->TextStim': ('IBMSpeechAPIConverter', 'WitTranscriptionConverter'),
    'ImageStim->TextStim': ('GoogleVisionAPITextConverter', 'TesseractConverter')
}

Here, each entry in the default_converters dictionary lists the Converter(s) to use, in order of preference. For example, the above indicates that any conversion between ImageStim and TextStim should first try to use the GoogleVisionAPITextConverter, and then, if that fails (e.g., because the user has no Google Cloud Vision API key set up), fall back on the TesseractConverter. If all selections specified in the config fail, pliers will still try to use any matching Converters it finds, but you’ll lose the ability to control the order of selection.
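The preference-order behavior amounts to trying each listed Converter in turn and moving on when one cannot be set up. Here is a minimal sketch, assuming hypothetical factory functions in place of real Converter classes. It stops with an error once the list is exhausted, so it leaves out the final step in which pliers looks for any other matching Converter.

```python
# Hypothetical factories standing in for Converter classes.
def make_cloud_converter():
    # Simulates a converter whose API key is not configured.
    raise RuntimeError("no API key configured")


def make_local_converter():
    return "local-converter"


# Ordered by preference, mirroring one entry of default_converters.
preferences = {
    "ImageStim->TextStim": (make_cloud_converter, make_local_converter),
}


def pick_converter(key):
    """Return the first converter in the preference list that initializes."""
    errors = []
    for factory in preferences.get(key, ()):
        try:
            return factory()
        except Exception as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{factory.__name__}: {exc}")
    raise LookupError(f"no usable converter for {key}: {errors}")


chosen = pick_converter("ImageStim->TextStim")
```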

The second gotcha with implicit conversion is that many Converters call API-based services, so you should make sure that any API keys you might need are properly set up as environment variables in your local environment; you won’t be able to pass those keys to the Converter as initialization arguments. For example, by default, pliers uses the IBM Watson API for speech-to-text conversion (i.e., when converting an AudioStim to a ComplexTextStim). But since you won’t necessarily know this ahead of time, you won’t be able to initialize the Converter with the correct credentials, i.e., by calling IBMSpeechAPIConverter(username='my_username', password='my_password'). Instead, the Converter will get initialized without any arguments (IBMSpeechAPIConverter()), which means the initialization logic will immediately proceed to look for IBM_USERNAME and IBM_PASSWORD variables in the environment, and will raise an exception if either of these variables is missing. So make sure as many API keys as possible are appropriately set in the environment. You can read more about this in the API keys section.
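A quick way to catch missing credentials before running a pipeline is to check the environment up front. The sketch below uses only the Python standard library. The variable names IBM_USERNAME and IBM_PASSWORD come from the text above, while the missing_credentials helper is a hypothetical convenience function, not part of pliers.

```python
import os

# Variable names taken from the text above.
REQUIRED_VARS = ("IBM_USERNAME", "IBM_PASSWORD")


def missing_credentials(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Set these before relying on implicit conversion:", ", ".join(missing))
```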

List of Converter classes

Pliers currently implements the following Converter classes:

ComplexTextIterator([name])

Iterates elements in a ComplexTextStim as TextStims.

ExtractorResultToSeriesConverter([name])

Converts an ExtractorResult instance to a list of SeriesStims.

IBMSpeechAPIConverter([username, password, …])

Uses the IBM Watson Speech to Text API to run speech-to-text transcription on an audio file.

GoogleSpeechAPIConverter([language_code, …])

Uses the Google Speech API to do speech-to-text transcription.

GoogleVisionAPITextConverter([…])

Detects text within images using the Google Cloud Vision API.

MicrosoftAPITextConverter([language, …])

Detects text within images using the Microsoft Vision API.

RevAISpeechAPIConverter([access_token, …])

Uses the Rev AI speech-to-text API to transcribe an audio file.

TesseractConverter([name])

Uses the Tesseract library to extract text from images.

VideoFrameCollectionIterator([name])

Iterates frames in a DerivedVideoStim as ImageStims.

VideoFrameIterator([name])

Iterates frames in a VideoStim as ImageStims.

VideoToAudioConverter([name])

Convert a VideoStim to an AudioStim by extracting the audio track using moviepy.

VideoToComplexTextConverter([steps])

Converts a VideoStim directly to a ComplexTextStim.

VideoToTextConverter([steps])

Converts a VideoStim directly to a TextStim.

WitTranscriptionConverter([api_key, rate_limit])

Speech-to-text transcription via the Wit.ai API.

Filters

A Filter is a kind of Transformer that returns an object of the same Stim class as its input (e.g., passing an ImageStim to the .transform() method and getting back another, modified, ImageStim). Filters can be used for tasks like image or audio filtering, text tokenization or sanitization, and many other things.

List of Filter classes

Pliers currently implements the following Filter classes:

AudioTrimmingFilter([start, end, frames, …])

FrameSamplingFilter([every, hertz, top_n])

Samples frames from video stimuli, to improve efficiency.

ImageCroppingFilter([box])

Crops an image.

LowerCasingFilter([name])

Lower cases the text in a TextStim.

PillowImageFilter([image_filter])

Uses the ImageFilter module from PIL to run a pre-defined image enhancement filter on an ImageStim.

PunctuationRemovalFilter([name])

Removes punctuation from a TextStim.

TemporalTrimmingFilter([start, end, frames, …])

Temporally trims the contents of the audio stimulus using the provided start and end points.

TokenizingFilter([tokenizer])

Tokenizes a TextStim into several word TextStims.

TokenRemovalFilter([tokens, language])

Removes tokens (e.g., stopwords, common words, punctuation) from a TextStim.

VideoTrimmingFilter([start, end, frames, …])

WordStemmingFilter([stemmer, tokenize, …])

Nltk-based word stemming and lemmatization Filter.

AudioResamplingFilter([target_sr, resample_type])

Librosa-based audio resampling Filter.

Iterable-aware transformations

A useful feature of the Transformer API is that it’s inherently iterable-aware: every pliers Transformer (including all Extractors, Converters, and Filters) can be passed an iterable (specifically, a list, tuple, or generator) of Stim objects rather than just a single Stim. The transformation will then be applied independently to each Stim.
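The dispatch this implies can be sketched in a few lines. The class below is a hypothetical stand-in that operates on plain strings rather than Stim objects, and returning a list for every iterable input is a choice made for this sketch, not a statement about what pliers returns.

```python
import types


class ToyUpperCaser:
    """Hypothetical transformer that accepts one item or an iterable of items."""

    def transform(self, stims):
        # Lists, tuples, and generators are transformed item by item.
        if isinstance(stims, (list, tuple, types.GeneratorType)):
            return [self._transform_one(s) for s in stims]
        return self._transform_one(stims)

    def _transform_one(self, stim):
        return stim.upper()


t = ToyUpperCaser()
single = t.transform("a")
from_list = t.transform(["a", "b"])
from_gen = t.transform(s for s in ("c", "d"))
```

Because each item is handled independently, the per-item logic stays the same whether the caller passes one stimulus or many.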