Bridging the gap between computers and human emotion.

Dev Patel
Artificial Intelligence in Plain English
18 min read · May 19, 2021


In a time when we are so interconnected, it is baffling that we can be just as alone. I remember feeling alone, and I’m sure we all have at some point in our lives. Everyone in the world has felt incompetent, damaged, or misguided.

But in that, we have also found new ways to adapt and grow, from platforms that speak openly about mental health to new outlets for getting help.

It’s that human connection that makes us complete, but we still have a long way to go before we live in a world where we can fairly and effectively treat an individual’s mental health issues.

The barriers to mental health access are more pronounced today than ever. Teens and adults alike are finding it harder to access care, and people are becoming increasingly isolated.

Globally, more than 70% of people with mental illness receive no treatment whatsoever, and a World Health Organization study found that between 30 and 80 percent of people with mental health issues don’t seek treatment.

Despite all of the therapists and counsellors that are available, something seems to be preventing people from getting not just care, but the right care.

Personalization comes hand in hand with empathy and compassion, and while we don’t have enough therapists in the world, we do have the advent of technology. But it’s crazy for me to think that despite all of our progress in social networks and emerging tech, the only service someone has access to for help is a chatbot.

A simple pre-determined response doesn’t solve issues, and in cases where personalized care is inaccessible to most, we need something new that can see beneath the surface. It’s hard enough to find a therapist, a friend, or even someone you can talk to, let alone someone who truly understands your struggles.

So how can we not only increase access to care, but dramatically improve our approach to helping the people we care about?

That’s what I set off to do with Bluesky. The goal was to create a pipeline that can utilize much more relevant data on a patient to better help them. I wanted to tackle the area of personalized mental health support and care using something unconventional: deep learning.

No, this isn’t a chatbot. Instead, it’s a set of tools that collect far more detailed, relevant data on a patient, which can dramatically improve the potential for personalized care in this space.

Traditionally, most applications you’ve probably heard of involve chatbots or some pre-defined response mechanism that’s triggered by keywords like “I’m sad” or “I had a bad day”.

But this is not how human emotions work. There are nuances in the way someone talks, in what they say, and in their facial expressions, and automating this is a challenge. Solving the communication and understanding gap is hard on its own; doing it accurately is still miles away.

We need more data and smart people working on this problem, and Bluesky is a step forward in that direction.

Bluesky uses multimodal deep learning to extract key data points on a patient’s mood, attitude, and emotions in a given moment. This can be used for several applications, primarily in identifying relationships between certain questions, key topics, and mentions with changes in sentiment.

If you want to understand what’s bothering you, just speak your mind and get a potential answer to that question. The system is designed to analyze cues similar to those a therapist or trained professional might use, but also to draw better connections between a person’s facial features, speech, statements, and more.

The idea is to take a patient and identify their mood and emotion using two primary indicators: facial features and audio tone. This is done using time-series data to provide context, after which these sentiment predictions are examined against what the person is saying.

For example, a patient records a video of themselves saying:

“Today was an okay day. Nothing good happened, but a few bad things did happen. I lost that promotion I was looking to get after 5 years with my company. I guess we move forward.”

Now, there are a lot of potential areas of contradiction or interest someone might want to ask more about. Questions like “why was the promotion so important to you?” or “do you feel happy moving forward?” can offer more insight into whether the person really is feeling optimistic or is covering up their emotions.

To do this, Bluesky can analyze a person’s facial expressions and make sentiment predictions, after which it can connect the timestamp of a prediction with a specific word or phrase the patient said. The same goes for the speech sentiment model, which analyzes patterns in volume and tone to make predictions of whether someone is happy or sad, regardless of the context.

These two models come together to help zero in on key phrases that might have more to them.

Bluesky achieves this using the following DL algorithms:

1. Audio Samples Sentiment Analysis

2. Facial Features Sentiment Analysis

3. Automatic Speech to Text Recognition

4. Part of Speech Tagging

5. NLU, Subject Capture, and Named Entity Recognition

There’s a lot of machine learning behind this, so let’s break it down a little bit more from the sentiment models to the actual phrase capture system.

Voice:

Our voice can tell us things beyond what we’re saying. Most of us don’t consciously pay attention to the way we say things, because that’s largely determined by what we’re feeling at the time.

This is critical not just to the way we communicate, but to the way we identify tension and mood. With that in mind, we’re jumping into the world of audio processing. The key requirement for audio sentiment analysis to succeed at its task is that it needs to interpret mood the way a human would.

But let’s start from the ground up. What even is audio data? Well, sound is analog in nature, meaning that we hear it thanks to vibrations and variations in air pressure, not by digital means.

Your typical .wav file is nothing but a sound wave that has been sampled at very small intervals.

The more samples there are per second, the higher the quality of the data. This is known as the sampling rate (or sampling frequency), measured in samples per second, i.e. Hz, and usually quoted in kHz. Play a recording back at a higher sampling rate than it was recorded at and the audio sounds faster; play it back at a lower rate and it slows down.
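
To make this concrete, here’s a minimal sketch of loading a recording and inspecting its sampling rate with the librosa library; the file name is just a placeholder for any patient recording.

```python
# A minimal sketch: loading a .wav file and inspecting its sampling rate
# with librosa. The file path is a placeholder, not part of Bluesky itself.
import librosa

# sr=None keeps the file's native sampling rate instead of resampling
y, sr = librosa.load("patient_recording.wav", sr=None)

print(f"Sampling rate: {sr} Hz ({sr / 1000:.1f} kHz)")
print(f"Number of samples: {len(y)}")
print(f"Duration: {len(y) / sr:.2f} seconds")
```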

This is a quick diagram to get you familiar with what type of data we’re working with.

Audio has more elements than just its frequency. It also has amplitude which, if you’ve taken any high school math class, is just the distance from the rest position (the baseline) to the crest (the peak).

Now, most sound that we hear is a mix of different frequencies at different volumes, which combine to produce consonants, words, and literally everything else you hear. We can visualize these frequencies in the form of a waveform.

To extract more useful features from this, we can apply the Fourier transform, which breaks a waveform down into its individual frequencies and their respective amplitudes.

This is known as a spectrum, but to turn this data into something valuable, we need the time axis that is lost in a typical spectrum. To recover it, we can slice our waveform into short windows and apply the Fourier transform to each segment. This produces a spectrogram, as seen below.

Now, the representation you see here is a vanilla spectrogram, where the intensity of the colour signals the presence of a given frequency.
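
As a rough sketch of how that windowed Fourier transform looks in code, librosa’s short-time Fourier transform (STFT) does exactly this; the window and hop sizes below are common defaults, not Bluesky’s actual settings.

```python
# Sketch: waveform -> dB-scaled spectrogram via the short-time Fourier transform.
import numpy as np
import librosa

y, sr = librosa.load("patient_recording.wav", sr=None)

# Complex STFT: rows are frequency bins, columns are time frames
stft = librosa.stft(y, n_fft=2048, hop_length=512)

# Magnitude in decibels, which is what spectrogram plots usually show
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
print(spectrogram_db.shape)  # (frequency bins, time frames)
```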

This is useful in many cases, but surprisingly ineffective in real-life applications, and the reason comes down to how we actually hear.

But it’s just audio, isn’t it? If I play this on a computer, I can totally hear someone speak. Well yes, but a raw spectrogram isn’t quite how we actually hear audio.

Remember that the frequency axis of a spectrogram is linear, because frequency is just a count of the number of oscillations per second.

We don’t perceive pitch linearly, though. Human hearing is roughly logarithmic in frequency.

Each octave doubles the frequency, yet we hear the steps between octaves as equal. So a linear change in the keys of a piano, or in the pitch of someone’s voice, corresponds to an exponential change in frequency.

Thankfully, we can use something called the mel scale, which is essentially a logarithmic scale for interpreting pitch the way humans do.

On the x-axis you have frequency, and on the y-axis you have pitch measured in mels. This lets us model a linear change in pitch (in mels) against a roughly logarithmic change in frequency.

A mel spectrogram is really just a regular spectrogram with its frequency axis mapped onto the mel scale, converting the values into this perceptual representation.

So let’s go back to our example.

We have a dataset of audio samples for which we want to predict mood and tone. To do this, we convert each audio file into a spectrogram and then into its mel representation.

To turn the mel spectrogram into something a model can actually use, we can extract a compact set of coefficients that summarize how its spectral content changes.

These are called MFCCs, or mel-frequency cepstral coefficients, and they are derived directly from the mel cepstrum. We can extract the actual values in the form of an array.

What you’re looking at now is the MFCC representation of the original spectrogram in this example.
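
Here’s a hedged sketch of both of these steps with librosa; the number of mel bands and MFCC coefficients are illustrative choices, not the exact values Bluesky uses.

```python
# Sketch: mel spectrogram and MFCC extraction with librosa.
# n_mels and n_mfcc are illustrative, not Bluesky's exact settings.
import numpy as np
import librosa

y, sr = librosa.load("patient_recording.wav", sr=None)

# Mel spectrogram: a spectrogram with its frequency axis mapped to the mel scale
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# MFCCs: a compact array of coefficients, one column per time frame
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print(mel_db.shape, mfccs.shape)  # (128, frames), (40, frames)
```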

Now, what’s important here is the idea of time.

We need to be able to correlate the sentiment data with a specific trigger or cue, and so our data is fundamentally a time series.

The best model for this, one that can classify the mood and tone not just of a single datapoint but across a whole series of samples, is an LSTM, or long short-term memory network.

This is valuable not only for classification, but also because it fits hand in hand with the sequential nature of the data being expressed by the audio sample.

An LSTM is essentially a model that takes a series of inputs and feeds its state back into the next prediction, so that earlier inputs influence future predictions.

This is similar to how time-series data works, where the stock-price of AAPL at 9AM can be used to predict the price at 10AM based on previous trends.

The same goes for emotion and audio: we ultimately need to identify how emotional change occurs over time and connect those predictions with what the patient said.

I won’t go into LSTMs too much but the de-facto blog post you should read is Christopher Olah’s Blog.

TL;DR: an LSTM is a better way of capturing information from nodes far back in time, rather than just the most recent information, using something called a cell state. The cell state carries the information worth keeping forward through the sequence, and the model trains its gates to decide what to add and what to forget.

Using this model, we can train on audio data labelled with 20 different emotions across male and female voices.
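
As a minimal sketch of what such a model could look like (in PyTorch, with illustrative layer sizes; only the 20-way output follows the description above):

```python
# Sketch (PyTorch): an LSTM that classifies emotion from a sequence of MFCC
# frames. Sizes are illustrative; the 20 emotion classes match the text above.
import torch
import torch.nn as nn

class AudioEmotionLSTM(nn.Module):
    def __init__(self, n_mfcc=40, hidden_size=128, num_layers=2, n_emotions=20):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden_size, num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_emotions)

    def forward(self, x):
        # x: (batch, time_frames, n_mfcc) -- MFCCs transposed to time-major
        out, _ = self.lstm(x)
        # Use the last time step's hidden output to summarize the clip
        return self.classifier(out[:, -1, :])

model = AudioEmotionLSTM()
dummy = torch.randn(4, 300, 40)   # 4 clips, 300 frames, 40 MFCCs each
logits = model(dummy)             # (4, 20) emotion scores
print(logits.shape)
```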

Face:

After a brief but simplified dive into signal and audio processing, we can now move on to facial features. Although they’ve become a staple of most computer vision applications, extracting important facial features is very difficult, but not impossible.

The pipeline required to provide a viable prediction of a facial expression can often be computationally expensive and inaccurate.

To solve this, Bluesky uses two different approaches that verify each other’s results: a pre-trained facial landmark detection algorithm and a traditional image classification approach.

The goal of both of these algorithms is to extract meaningful information from the image of a face, but the first step is always going to be isolating the face in the image.

This can be done using the famous Haar cascades, which are collections of simple feature detectors, pre-trained as convolution-like kernels that pass over an image and extract key facial identifiers.

Some of these features capture edges, others lines, or other potential indicators. Using additional data, the model filters out the weak classifiers and keeps the accurate ones to create a cascade of classifiers.

Using OpenCV’s pre-trained detectors, identifying the face and its subcomponents like the eyes and mouth is trivial.
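
For reference, a minimal face-detection sketch with OpenCV’s bundled Haar cascade looks something like this (the image path is a placeholder):

```python
# Sketch: face detection with OpenCV's pre-trained frontal-face Haar cascade.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("patient_frame.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Returns an (x, y, w, h) box for each detected face
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
print(f"Found {len(faces)} face(s)")
```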

Facial landmarks are a great way to extract important features from a face. By identifying markers on the forehead, eyes, and lips, along with their positions relative to other key landmarks, you can use the coordinates of these points to not only reduce complexity but also yield visible, useful data about a face.

They use a similar approach to face detection, but act as small classifiers that measure the distances between key points, such as the lips and forehead, to approximate and validate other predictions.

The dlib library can easily be used here, specifically its shape_predictor_68_face_landmarks model and other prebuilt models for face identification.

The preprocessing works by first isolating the faces in an image, after which a pre-trained model built from an ensemble of regression trees identifies the landmarks. A constrained local model can help increase landmark-detection accuracy dramatically.

It uses priors, or probability distributions over the distances between pairs of input pixels, to approximate points. The regression trees learn from the dataset how to better approximate the locations of these landmarks, which can then be used in real time.

The landmarks you see below were created using the 68 point iBUG 300-W dataset, the same dataset that the popular dlib library uses.
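
A minimal sketch of pulling those 68 landmarks with dlib (the pre-trained .dat model file is downloaded separately, and the image path is a placeholder):

```python
# Sketch: extracting the 68 facial landmarks with dlib's pre-trained
# shape predictor trained on the iBUG 300-W dataset.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("patient_frame.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):
    shape = predictor(gray, rect)
    # 68 (x, y) coordinates covering the jaw, brows, eyes, nose, and lips
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    print(landmarks[:5])
```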

These landmarks can then be extracted and turned into coordinates, which work well with convolutional neural networks, but also with a simple feedforward network that reduces complexity and generalizes better.

Feed Forward Model
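
As a rough sketch of that feedforward route, the 68 (x, y) landmarks can be flattened into a 136-value vector and passed through a small fully connected network; the layer sizes and the seven emotion classes below are assumptions for illustration, not Bluesky’s exact configuration.

```python
# Sketch (PyTorch): a small feedforward classifier over flattened landmark
# coordinates. Layer sizes and the 7 emotion classes are illustrative.
import torch
import torch.nn as nn

class LandmarkSentimentNet(nn.Module):
    def __init__(self, n_landmarks=68, n_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 2, 128),  # 68 (x, y) pairs -> 136 inputs
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, coords):
        # coords: (batch, 136) flattened, normalized landmark coordinates
        return self.net(coords)

model = LandmarkSentimentNet()
print(model(torch.randn(1, 136)).shape)  # (1, 7)
```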

However, to preserve a few more important distinctions such as gender and age, and to expose the model to noise in the form of other facial features, hair, and so on, we can use a convolutional neural network.

If you haven’t already, you can read an article on one of my previous projects to get a better grasp of how a CNN works.

For this use case, the pipeline still uses the Haar cascade for a better prediction, but preserves the image in pixel format. We can resize, normalize, and transform the image data into pixel values, after which a CNN can extract key features and make a viable prediction.

The architecture for this model is quite deep, mostly because it needs to generalize and make clear distinctions from only small changes in pixel values or orientation.
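
A simplified sketch of such a CNN (a much shallower, illustrative version, assuming 48x48 grayscale face crops like those in common facial-expression datasets):

```python
# Sketch (PyTorch): a small CNN over grayscale face crops. Sizes and the
# 7 emotion classes are illustrative, not Bluesky's actual architecture.
import torch
import torch.nn as nn

class FaceEmotionCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        # x: (batch, 1, 48, 48) resized, normalized face crops
        return self.classifier(self.features(x))

model = FaceEmotionCNN()
print(model(torch.randn(1, 1, 48, 48)).shape)  # (1, 7)
```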

To make the predictions accurate in the context of time, we can plug the automatic speech recognition model in after the predictions to refine the outputs.

This feature is not strictly needed from a pluggability standpoint, but it can be used to predict future emotions and behaviour over a timescale.

The facial sentiment model can now, in real time with OpenCV, read facial emotions, collect an array of the detected faces and landmarks, and produce a viable sentiment classification.

With what are effectively three models that can predict sentiment and changes in mood and emotion at any given time, now comes the part where we need to draw conclusions and connections across our data.

Conversation:

From a practicality and usability standpoint, this sentiment data wouldn’t make much sense if it couldn’t be tied to a much more usable data format.

This is where creating tools and functions for the NLP component comes into the mix.

The goal of the conversation models is essentially to extract the exact content a patient has mentioned and, using the sentiment analysis algorithms, draw conclusions and point out directions for further context.

This could mean leaning into family problems at home or exploring the idea of self-worth, but this all needs to be backed with context. Making this pluggable and usable is the primary challenge in this part of the algorithm.

First, we need an algorithm that can convert speech data into text by training on large corpora. Then, we need to break the text down further and extract meaningful information about what was said.

Let’s say that someone’s mood dipped when they said:

“I went home and saw that I lost my promotion. The email was blunt, and it made me feel hopeless.”

There are so many things that could be associated with the change in mood. To a computer, it could be the act of going home (home = noun) or the feeling of hopelessness (adjective), but no connections can be drawn between going home and the cause of the hopelessness, let alone any meaningful ones.

This requires POS, or part-of-speech, tagging to give the computer background on what each word means in relation to the others, which can be done by labelling each word as a noun, preposition, verb, and so on.

Once we have that, we can then embed the text to draw connections between certain keywords and identify the most important parts of the sentence.

This can be trained in an abstract manner specific to the patient, but for now we just need to parse and simplify the statement above. This is called key-phrase extraction.

Let’s start with Automatic Speech Recognition. These models work by looking at the individual sounds or frequencies in a speech recording and matching them as closely as possible to a word or character.

This is the same tech that Siri uses on your microphone input, but the part we actually need is the speech-to-text algorithm.

For end-to-end speech recognition, the audio is first converted into a mel spectrogram and then into time-bound vectors.

These vectors store the audio data, and an RNN then converts the input sequence into the corresponding text labels. The network is trained with bidirectional layers, which are critical because each prediction can draw on context from both directions of the sequence.

I ended up adapting a model from Assemblyai.com and the Deep Speech 2 paper, which is famously used as the basis for many ASR applications, but the small dataset made it difficult to train well.

Here’s my attempted architecture, where I ended up using batch normalization over layer normalization.

The convolution is used to extract key feature maps from the mel spectrogram, rather than relying on traditional coefficients. The ResNet-style blocks help feed earlier representations forward into later layers, which makes the model more efficient while allowing new predictions to be more accurate.

These convolution outputs are then fed into a middle layer that connects them to a gated recurrent unit, or rather a series of GRUs, which you can think of as a variation of the LSTM without its cell state.

Instead, it uses similar gates, passing its values through a reset gate and an update gate, which serve a similar function to the cell state.

These two networks are combined into a sequential classifier, as seen above. To overcome this model’s limitations, pre-trained models proved useful, giving access not only to a large corpus of training data but also to fewer errors in special cases (keywords, organizations, etc.).
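
Here’s a heavily simplified sketch of that kind of acoustic model in PyTorch: convolutions over the mel spectrogram feeding bidirectional GRUs, with per-time-step character scores that would be trained with CTC loss. The layer sizes and character set are illustrative, and the residual connections described above are omitted for brevity.

```python
# Sketch (PyTorch): a simplified Deep Speech 2-style acoustic model.
# Illustrative only; residual blocks from the description are left out.
import torch
import torch.nn as nn

class SpeechToTextModel(nn.Module):
    def __init__(self, n_mels=128, hidden=256, n_chars=29):  # 26 letters + space, apostrophe, CTC blank
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=(2, 1), padding=1),
            nn.BatchNorm2d(32),  # batch norm, as in the architecture above
            nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * (n_mels // 2), hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden * 2, n_chars)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, time) mel spectrogram
        x = self.conv(spec)                               # (batch, 32, n_mels/2, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)    # one feature vector per time step
        out, _ = self.rnn(x)
        return self.fc(out)                               # (batch, time, n_chars), for nn.CTCLoss

model = SpeechToTextModel()
print(model(torch.randn(2, 1, 128, 200)).shape)  # (2, 200, 29)
```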

Using this, we can take audio and turn it into readable text. However, just mapping a sentiment to a given word is not enough; applying context to a word makes it far more helpful.

By categorizing words by their part of speech, the definition of a word loses importance but its use prevails. This can further help our NLP model identify what information is actually important without taking everything at face value.

This model can be used both ways for optimizing ASR and assisting in named-entity recognition.

The PoS system works by tagging words in the form of embeddings, which consist of windows of the previous word, the current word, and the next word.

Take the sentence:

“I was walking in Toronto”. The embedding format of this sentence would be (<pad>, I, was), (I, was, walking), (was, walking, in), (walking, in, Toronto), (in, Toronto, .).

Using this representation, the model can learn more about the relation of words with each other using a probability distribution across the sentence.
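
A tiny sketch of building those context windows (here the sentence has no trailing period, so the last window is padded instead):

```python
# Sketch: building (previous, current, next) context windows like the ones
# above, padding the edges of the sentence.
def context_windows(tokens, pad="<pad>"):
    padded = [pad] + tokens + [pad]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

print(context_windows(["I", "was", "walking", "in", "Toronto"]))
# [('<pad>', 'I', 'was'), ('I', 'was', 'walking'), ('was', 'walking', 'in'),
#  ('walking', 'in', 'Toronto'), ('in', 'Toronto', '<pad>')]
```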

After some preprocessing involving smoothing certain examples, the PoS tagger can be built using an RNN.

We won’t necessarily need the long-term memory of an LSTM as it might be a little overkill.

PoS tagging runs into similar problems to ASR, so utilizing a pre-trained model also works fine here. This can be done with the popular NLTK library.
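
For example, a minimal sketch with NLTK’s pre-trained tagger (exact tags may vary):

```python
# Sketch: part-of-speech tagging with NLTK's pre-trained perceptron tagger.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I went home and saw that I lost my promotion.")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('went', 'VBD'), ('home', 'NN'), ...] (tags may vary)
```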

The PoS step yields something like contextualized embeddings. These can now be used for key-phrase extraction, which is the process of capturing the most important topics in a given piece of text.

This is a challenging problem that researchers continue to work on perfecting, because these models have a low tolerance for missing background or external knowledge.

Building these knowledge branches or lexical patterns is critical for context building and improving accuracy, and with enough mental health patient data, this is possible.

To represent these knowledge branches in the form of relationships and their importance, we can use linear-chain conditional random fields, or CRFs, which I won’t get into in depth here.

Using a BiLSTM-CRF architecture adapted from existing work, the model can map a tag and word to a broader bucket.
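
As a rough sketch of that architecture, a BiLSTM can produce per-token emission scores and a linear-chain CRF can decode the best tag sequence over them; this assumes the third-party pytorch-crf package, and the vocabulary, tag set, and sizes are all illustrative.

```python
# Sketch (PyTorch): a minimal BiLSTM-CRF tagger, e.g. for B/I/O key-phrase
# tags. Assumes the third-party `pytorch-crf` package; sizes are illustrative.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden=128, n_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(hidden * 2, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def loss(self, tokens, tags, mask):
        e = self.emissions(self.lstm(self.embed(tokens))[0])
        return -self.crf(e, tags, mask=mask)   # negative log-likelihood

    def predict(self, tokens, mask):
        e = self.emissions(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(e, mask=mask)   # best tag sequence per sentence

model = BiLSTMCRF()
tokens = torch.randint(0, 10000, (1, 6))        # one 6-token sentence
mask = torch.ones(1, 6, dtype=torch.bool)
print(model.predict(tokens, mask))              # e.g. [[0, 0, 1, 2, 2, 0]]
```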

These buckets are formed after analyzing large quantities of data, where similar coefficients describe each word’s relation to other words. Here’s an example of this map:

Again, the problem still comes down to not having enough relevant data. But with some pre-trained models and libraries, preprocessing and identifying key phrases in a statement can be done.

Purpose

Bluesky was built to explore the notion of what machine learning promised: something that isn’t human but can act and understand humans.

Here, it does that while working closely with experts in the field to make access to care more equitable and higher quality.

This is a context problem above all else, but the tools and pipelines described here have great potential for connecting sentiment to context and truly creating a personalized treatment plan for each patient.

It’s now more important than ever that we help each other, and that doesn’t come from just helping a lot of people; it means understanding everyone’s struggles.

Doing that for something as sensitive as mental health requires a giant leap forward in our capabilities, and machine learning can play a critical role.

My hope with this project was to learn more about not just building projects, but building important projects that we all need, and I hope this can be an indication of what ingenuity and a little bit of experimentation can lead to.

More things coming soon.

Thanks for taking the time to read this and I hope you got something out of it. If you want to get more technical or simply reach out to me, you can find me on LinkedIn, Email, or GitHub. You can also subscribe to my newsletter here.


