Chronicles of an ML Engineer in Lambda Labs

Rhia George
8 min read · Aug 26, 2021

As a student of Lambda School, Lambda Labs is what all of us have been working towards ever since we enrolled. Labs has meant a lot of late nights, research, and both failed and successful attempts, but now that I am at the end of it, I can say it has been enriching and totally worth it!

What was it that I did in Labs you ask?

I worked on a cross-functional team of Data Scientists, Front-end Developers, Back-end Developers, UX/UI Designers, and Technical Project Leads and Managers. This was a perfect representation of an actual work scenario within the safety of an educational environment: an internship with a twist!

The Project

I was honored to have been placed on a project being developed for Human Rights First, an organization that advocates for citizens' rights and holds human rights abusers accountable. My project was Blue Witness, which began with the intention of holding police accountable for their use of force.

The Objective

The purpose of this project was to gather information on police use-of-force incidents from social media and rank each incident as the Department of Justice would. The app lets users visualize all reported incidents on a map. The collected data would be useful to journalists and organizations, as well as the wider public looking to change the system, and would bring about awareness and accountability backed by data.

Blue Witness Landing Page

Initial Days

When our team inherited the project, we were given a roadmap of the new features and enhancements we were expected to develop. We inherited a codebase that seemed almost perfect. Everything seemed to be working as intended, and I felt there was nothing more I could add to it. Given that I had chosen to be the ML Engineer on the project and that the Natural Language Processing model in use was at 81% accuracy, I felt there was not much work to be done. However, after taking a closer look at the codebase and having a conversation with the ML Engineer on the previous team, I soon stood corrected!

Introducing FrankenBert!

FrankenBert is the NLP model we use in Blue Witness. It is a pre-trained BERT model that we retrained for our purposes: to classify tweets scraped from Twitter as incident reports of police brutality. BERT is a Natural Language Processing (NLP) model that achieves state-of-the-art accuracy on many NLP tasks, such as general language understanding evaluation, which was the primary reason for choosing it. FrankenBert classifies each incident into a particular rank based on the intensity of police brutality reported. These ranks were predefined as follows:

  • Rank -1: Not Usable for Training
  • Rank 0: No Police Presence
  • Rank 1: Non-violent Police Presence
  • Rank 2: Open Handed (Arm Holds & Pushing)
  • Rank 3: Blunt Force Trauma (Batons & Shields)
  • Rank 4: Chemical & Electric Weapons (Tasers & Pepper Spray)
  • Rank 5: Lethal Force (Guns & Explosives)
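As a rough illustration of how such a rank is read off a classifier: the fine-tuned model emits one score (logit) per rank, and the predicted rank is simply the highest-scoring one. The scores below are made up; in Blue Witness they would come from FrankenBert's output layer.

```python
# Hypothetical sketch: mapping classifier scores to the Blue Witness ranks.
RANKS = {
    0: "No Police Presence",
    1: "Non-violent Police Presence",
    2: "Open Handed (Arm Holds & Pushing)",
    3: "Blunt Force Trauma (Batons & Shields)",
    4: "Chemical & Electric Weapons (Tasers & Pepper Spray)",
    5: "Lethal Force (Guns & Explosives)",
}

def predict_rank(logits):
    """Return the rank whose score is highest (argmax over the logits)."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.1, 0.3, 0.2, 0.1, 2.4, 0.5]  # made-up model output for one tweet
rank = predict_rank(logits)
print(rank, "-", RANKS[rank])
```

(Rank -1, "Not Usable for Training", is a labeling-only category and would not appear in the classifier head.)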

The Challenge

We had our first meeting with the stakeholder to understand which features and enhancements he would like us to add or improve in order to deliver a product he would be satisfied with. This led to a conversation about how FrankenBert was not performing well at classifying Rank 2 and Rank 3 incidents, and that this was something that needed to be looked into.

Problem Solving

In order to understand whether the previous team had encountered defects or shortcomings while building the model, I reached out to the ML Engineer on that team. Our conversation, along with my research on the data, led me to understand that we did not have enough data in our training set for Rank 2 and Rank 3 incidents. To be more precise, we had over 1,200 reports each for Ranks 0, 1, 4, and 5, but only 111 reports for Rank 2 and even fewer for Rank 3. It was now very clear why the classification model was struggling with those two classes in particular.

Data distribution across the different ranks before we began our work
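The imbalance shows up by simply counting labels per rank. A minimal sketch, using a made-up miniature label list (the real counts were 1,200+ each for Ranks 0, 1, 4, and 5 versus roughly 111 for Rank 2 and fewer for Rank 3):

```python
from collections import Counter

# Made-up miniature of the training set's rank labels, exaggerating the
# same imbalance the real data had for Ranks 2 and 3.
labels = [0, 0, 0, 1, 1, 1, 4, 4, 4, 5, 5, 5, 2, 3]

counts = Counter(labels)
for rank in sorted(counts):
    print(f"Rank {rank}: {counts[rank]} reports")
```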

Now that we had identified the real issue, we needed a plan that would solve the lack of data for us. The Data Science team, along with our DS Head, went through a brainstorming session, and we decided that we needed to synthesize tweets to tackle the issue. As a team, we saw two ways of going about this:

1. The team would take time every day to write tweets that would then be added to our training data via the BW Labeler app that our DS Head had developed.

2. As one of the ML Engineers on the team, I could build a model that would generate synthesized tweets and these could then be added to the training data set.

We decided to go with the second suggestion as that would not take away time from the team which could be used for other improvements that the project required.

With this decision made, I began chalking out a plan for the best way to accomplish the task at hand. I needed to build a model that could generate text based on the training data provided to it. Text generation being a subfield of Natural Language Processing, I knew I needed to research which type of NLP model would yield the best results. I decided on a Long Short-Term Memory (LSTM) model, an artificial recurrent neural network that would read the training data and sequentially attempt to predict the next few words.
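Training an actual LSTM needs a deep-learning framework, but the sequential idea it relies on here (predict the most likely next word from what came before, then feed that prediction back in) can be sketched framework-free, with a simple next-word frequency table standing in for the learned distribution. All corpus text below is made up:

```python
from collections import Counter, defaultdict

def build_table(corpus):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table

def generate(table, start, max_words=6):
    """Greedily emit the most frequent next word, one step at a time."""
    words = [start]
    for _ in range(max_words):
        followers = table.get(words[-1])
        if not followers:
            break  # no known continuation, stop early
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# Made-up miniature corpus in place of the real Rank 2/3 reports.
corpus = [
    "police strike protester with baton",
    "police push protester to ground",
    "police strike protester with shield",
]
table = build_table(corpus)
print(generate(table, "police"))
```

A real LSTM replaces the frequency table with a learned recurrent state, which is what lets it model context longer than a single preceding word.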

I needed to train this LSTM model on data relevant to Rank 2 and Rank 3 incident reports. To do this, I filtered those ranks out of the training set used by the BERT model (the model that makes the rank classifications). Having pulled the relevant data, I trained the LSTM model and generated tweets that initially seemed pretty decent. On further inspection, I noticed that some synthesized tweets were unusable because of the sequential text-generation method the LSTM model used. Cleaning them up and making them workable would involve a lot of manual work, which defeated the whole purpose of the exercise.
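Pulling the Rank 2 and Rank 3 rows out of the classifier's training set looks roughly like this; the rows are made up but mirror a (text, rank) shape:

```python
# Hypothetical rows in the (text, rank) shape of the BERT training set.
training_data = [
    ("officers stand by calmly at the march", 1),
    ("officer shoves a protester to the ground", 2),
    ("police strike a man with batons", 3),
    ("tear gas fired into the crowd", 4),
    ("officer grabs and pushes a bystander", 2),
]

# Keep only the under-represented classes to train the text generator on.
generator_corpus = [text for text, rank in training_data if rank in (2, 3)]
print(generator_corpus)
```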

This led another Data Scientist on the team to explore an alternative: using a pre-trained GPT-2 model, he generated some exceptional synthesized tweets, a few of which follow:

“ Near 3rd and Salmon, a protester is held face-down on the ground by an officer. The protester is punched and struck by another officer with a baton. The protester is then arrested. ”

“ Footage taken at West and Rector in Manhattan shows multiple officers arresting a protester for yelling out profanities. Police respond by striking the protester with batons and kneeling on him. At one point, the protester states he cannot breathe. Police do not acknowledge this. ”

We managed to generate a whole bunch of synthesized tweets that could be used to retrain FrankenBert.

Data distribution of ranks post text generation

FrankenBert for the win!

It was now time to test FrankenBert and see if it was as capable as I thought it to be. The notebook for this can be found here.

  1. Data Wrangling: We combined the synthesized data with the original training and test data inherited from the previous team. As with any dataset, there was some cleaning to be done: the data generated by the GPT-2 model had quite a few duplicates that needed to be dropped.
  2. Splitting Data: This combined data was then split randomly into a training set (80% of the total data) and a testing set (20% of the total data).
  3. Model Training: FrankenBert was now ready to be retrained on this data. The training took a little over 2 hours! All good things come to those who wait.
  4. Model Testing: For some sanity testing, we fed the trained FrankenBert with some synthetic tweets.
FrankenBert Predictions

5. Model Metrics: As with all models, we needed some metrics to judge the performance of FrankenBert. The results of these metrics were phenomenal. We managed to increase the accuracy score from 81% to 96.8%. The notebook for just the metrics can be found here. Some significant metrics are shown below.

  • Confusion Matrix: This shows the number of correct and incorrect predictions FrankenBert makes for each rank. The counts of correct predictions improved significantly after retraining.
  • Classification Report: This gives a more detailed view of how the model performed per class. The data here is represented as percentages.
Before retraining on synthesized data
After retraining on synthesized data
Accuracy Table

The above image says it all. FrankenBert — we did it!

6. Inferences: We were able to improve the performance of FrankenBert from 81% to 96.8%. This resulted in greater confidence that our model is classifying incident reports correctly.
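Steps 1, 2, and 5 above can be sketched framework-free. All rows and predictions below are made up; the real pipeline used the BERT tooling in the linked notebooks:

```python
import random

def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts reports of true rank i predicted as rank j."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# 1. Data wrangling: combine original and synthesized reports, then drop
#    the exact duplicates the GPT-2 generator produced.
reports = [
    ("officer shoves protester", 2),
    ("officer shoves protester", 2),   # duplicate from the generator
    ("police strike man with batons", 3),
    ("tear gas fired into crowd", 4),
    ("peaceful march downtown", 1),
]
deduped = list(dict.fromkeys(reports))  # drops exact duplicates, keeps order

# 2. Splitting: shuffle, then take 80% for training, 20% for testing.
random.seed(42)  # reproducible shuffle
random.shuffle(deduped)
cut = int(len(deduped) * 0.8)
train, test = deduped[:cut], deduped[cut:]

# 5. Metrics: confusion matrix and accuracy from hypothetical predictions.
y_true = [2, 3, 2, 4, 3, 2]
y_pred = [2, 3, 3, 4, 3, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=6)
accuracy = sum(cm[i][i] for i in range(6)) / len(y_true)
print(len(train), len(test), round(accuracy, 3))
```

Accuracy here is just the diagonal of the confusion matrix over the total count, which is the same 0-to-1 score reported as 81% and 96.8% above.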

Next Steps

Although the model correctly classifies incident reports, the confidence levels for these classifications can certainly be improved in some specific circumstances.

For instance, at the moment an incident report that states "Cops use rubber bullets to disperse a crowd of protestors" is classified correctly as a Rank 4 incident, but the confidence level for that classification is only about 50%. This is something the next team to work on Blue Witness could build on.
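That kind of low confidence typically comes from two ranks scoring almost equally. A sketch with made-up logits, using a standard softmax to turn scores into probabilities (Rank 4 wins, but Rank 3 is close, so the winner's probability sits near 50%):

```python
import math

def softmax(logits):
    """Convert raw classifier scores into probabilities summing to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the rubber-bullet report: Rank 4 narrowly beats Rank 3.
logits = [-1.0, -0.5, 0.0, 1.6, 2.0, -0.5]
probs = softmax(logits)
pred = max(range(len(probs)), key=lambda i: probs[i])
print(pred, round(probs[pred], 2))
```

More (real or synthesized) examples near the Rank 3/Rank 4 boundary would be one way to push that winning probability higher.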

My Learnings:

The data used to train an NLP model plays a very significant role. The quality, relevance, and availability of the data directly affect whether the model meets its goals. Incomplete and inaccurate datasets will result in poor model performance.

My learning curve in this one month has been exponential. With exposure to different libraries, including new and upcoming ones, I have had moments of "What!? We can do that??" and also moments of "Oh no! I messed up!!". Each of those moments brought feedback, both positive and negative.

In conclusion, the hours and the energy put into this project are completely worthwhile and I am extremely proud of what we have achieved here. Being able to work on multiple different models and with such a collaborative and efficient team has me excited for what lies ahead!
