Creating Our Own Classifier for Evaluation
Welcome to the evaluation part of this tutorial. In this part we will construct a logistic regression classifier using sklearn and train it to recognize whether a word is English, French, or German. We will then present it with a sequence of unseen words and ask it to make predictions. We will do this once with our own custom GloVe character embeddings and once with the sklearn CountVectorizer embeddings, and compare the performance of the two embedding styles to see which one comes out on top.
Necessary packages: In order to follow along, you will need Python 3 with Jupyter Notebook and the following Python libraries: sklearn and numpy. If you do not have these libraries, you can install them with:
pip install scikit-learn
pip install numpy
Downloading the notebook
If you haven’t already downloaded this repository and wish to follow along, you will need to do so now. I provide a Jupyter notebook that walks you through every step of the way. So go ahead, clone the repository, and start Jupyter. It is important that you clone or download the entire repository, as it contains the data the notebook works with.
git clone https://github.com/remo-help/character-embedding-with-glove
jupyter notebook
Once you are in the Jupyter notebook browser, open the “Glove_Char_Classifier.ipynb” notebook and follow along. Alternatively, you can copy all the code I paste here and run it live in your Python interpreter.
Imports
We will start off by importing the necessary packages:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
import numpy as np
This will be enough to implement our classifier.
Reading in the embeddings
Unfortunately, sklearn by itself does not read GloVe vector files, so we need to write a little function that loads the embeddings for us. This function takes a path to a file with GloVe-style embeddings and returns a dictionary whose keys are the embedded words or characters and whose values are the corresponding vectors. This way every character has a unique GloVe vector that we can easily access:
def glove(path):
    embeddings_dict = {}
    f = open(path, 'r', encoding='utf8') #reading in the input data
    vector_file = f.read()
    f.close()
    vector_file = vector_file.split("\n")
    for line in vector_file:
        if line: #skipping empty lines
            line = line.split()
            token = line[0] #the embedded character
            vector = np.array(line[1:], dtype='float64') #the remaining entries form the vector
            embeddings_dict[token] = vector
    return embeddings_dict
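As a quick sanity check, you can load the embeddings and inspect a single character vector. This is just a hypothetical peek, assuming the character "a" occurs in the vector file (the path is the one we will use later in this tutorial):
vectors = glove('data/dickens_vectors.txt') #the vector file we will use below
print(len(vectors)) #how many characters are embedded
print(vectors['a']) #the GloVe vector for the character "a", assuming it is in the vocabulary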
Defining the Classifier
We will define our classifier as a class. We will give this class a few attributes and functions:
train()
This function will train our classifier on the GloVe vectors we provide. It takes the vector dictionary created by the glove() function as its first argument and the path to the training data as its second argument.
train_count()
This function will train our classifier on the embeddings provided by the CountVectorizer of the sklearn library. This is a count-based embedding technique. It takes the path to the training data as an argument.
predict_labels()
This function will take in test data and make predictions on it. The output is a tuple of two sequences of encoded labels (integers): one containing the predicted labels and the other containing the gold labels. We can inverse_transform those labels with the LabelEncoder if we want to see the strings.
predict_labels_count()
This is the same as above, except for the CountVectorizer embeddings.
class Classifier:
    def __init__(self):
        """
        Initializes the classifier.
        """
        self.label_encoder = LabelEncoder()
        self.vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 10)) #the "char" analyzer builds character n-grams (lengths 1 to 10) directly from the raw words
        self.model = LogisticRegression(solver="lbfgs", multi_class="multinomial", max_iter=5000, verbose=1)
    def train(self, vectors, train_data_path):
        """
        Trains the classifier on the GloVe embeddings.
        """
        f = open(train_data_path, 'r', encoding="utf8") #reading in the input data
        input_string = f.read()
        f.close()
        input_list = input_string.split("\n") #storing each datum as a string in a list
        input_list = input_list[0:-2] #dropping the trailing newline artifacts
        feature_strings = []
        label_strings = []
        for datum in input_list:
            temp = datum.split("\t") #separating the word from its label
            feature_strings.append(temp[0])
            label_strings.append(temp[1])
        del input_list #deleting the initial list to save memory
        vector_list = [] #here we will store our feature vectors
        for feature in feature_strings: #selecting a feature
            feature = feature[0:-1].split(" ") #getting rid of the trailing space and splitting on spaces
            temp_list = [] #here we will collect the character vectors of the word
            for char in feature: #iterating over the characters of the word
                if char in vectors.keys(): #making sure we are not running into unknown chars
                    vector = vectors[char] #getting the vector associated with the character
                    temp_list.append(vector)
            if len(temp_list) == 0:
                print("cannot find:", feature)
            base = temp_list[0] #selecting the first character vector
            for i in range(1, len(temp_list)):
                base = np.add(base, temp_list[i]) #adding all other character vectors element-wise
            array = base / len(temp_list) #dividing by the total number of characters
            vector_list.append(array) #putting the new, averaged array into our list
        x_train = vector_list #the feature strings are now averaged GloVe vectors based on our pretrained embedding
        y_train = self.label_encoder.fit_transform(label_strings) #fitting and transforming the labels to integers
        if len(x_train) != len(y_train): #making sure we have as many feature vectors as we have labels
            print("features:", len(x_train), "\n", "labels:", len(y_train))
        self.model.fit(x_train, y_train)
        print("The classifier has finished training")
    def train_count(self, train_data_path):
        """
        Trains the classifier on vectors created by the CountVectorizer, as opposed to pretrained embeddings.
        """
        f = open(train_data_path, 'r', encoding="utf8") #reading in the input data
        input_string = f.read()
        f.close()
        input_list = input_string.split("\n") #storing each datum as a string in a list
        input_list = input_list[0:-2] #dropping the trailing newline artifacts
        feature_strings = []
        label_strings = []
        for datum in input_list:
            temp = datum.split("\t") #separating the word from its label
            feature_strings.append(temp[0])
            label_strings.append(temp[1])
        del input_list #deleting the initial list
        x_train = self.vectorizer.fit_transform(feature_strings) #transforming the feature strings into vectors with the CountVectorizer
        y_train = self.label_encoder.fit_transform(label_strings) #fitting and transforming the labels to integers
        self.model.fit(x_train, y_train) #training the model
        print("The classifier has finished training (CountVectorizer)")
    def predict_labels(self, vectors, test_data_path):
        """
        Takes a test file where tokens and labels are separated with \t.
        Returns the sequence of predicted labels and the sequence of gold labels in INTEGER FORM.
        """
        f = open(test_data_path, 'r', encoding="utf8") #reading in the input data
        input_string = f.read()
        f.close()
        input_list = input_string.split("\n") #storing each datum as a string in a list
        input_list = input_list[0:-2] #dropping the trailing newline artifacts
        feature_strings = []
        label_strings = []
        for datum in input_list:
            temp = datum.split("\t") #separating the word from its label
            feature_strings.append(temp[0])
            label_strings.append(temp[1])
        del input_list #deleting the initial list to save memory
        vector_list = [] #here we will store our feature vectors
        for feature in feature_strings: #selecting a feature
            feature = feature[0:-1].split(" ") #getting rid of the trailing space and splitting on spaces
            temp_list = [] #here we will collect the character vectors of the word
            for char in feature: #iterating over the characters of the word
                if char in vectors.keys(): #making sure we are not running into unknown chars
                    vector = vectors[char] #getting the vector associated with the character
                    temp_list.append(vector)
            if len(temp_list) == 0:
                print("cannot find:", feature)
            base = temp_list[0] #selecting the first character vector
            for i in range(1, len(temp_list)):
                base = np.add(base, temp_list[i]) #adding all other character vectors element-wise
            array = base / len(temp_list) #dividing by the total number of characters
            vector_list.append(array) #putting the new, averaged array into our list
        x_test = vector_list
        predictions = self.model.predict(x_test) #making the predictions
        gold_labels = self.label_encoder.transform(label_strings)
        return predictions, gold_labels
    def predict_labels_count(self, test_data_path):
        """
        Takes a test file where tokens and labels are separated with \t.
        Returns the sequence of predicted labels and the sequence of gold labels in INTEGER FORM.
        This is the CountVectorizer-based implementation.
        """
        f = open(test_data_path, 'r', encoding="utf8") #reading in the test data
        input_string = f.read()
        f.close()
        input_list = input_string.split("\n") #storing each datum as a string in a list
        input_list = input_list[0:-2] #dropping the trailing newline artifacts
        feature_strings = []
        label_strings = []
        for datum in input_list:
            temp = datum.split("\t") #separating the word from its label
            feature_strings.append(temp[0])
            label_strings.append(temp[1])
        del input_list #deleting the initial list to save memory
        x_test = self.vectorizer.transform(feature_strings) #transforming the feature strings into vectors with the CountVectorizer
        predictions = self.model.predict(x_test) #making the predictions
        gold_labels = self.label_encoder.transform(label_strings)
        return predictions, gold_labels
Training and testing the Classifier
Now that we have constructed our classifier, it is time to train and test it once with each embedding type. First we will train and test it with our own custom GloVe embeddings. We will calculate an accuracy score and an F1 score and save them for later:
classifier = Classifier()
vec= glove('data/dickens_vectors.txt')
classifier.train(vec,'data/train_tokens.txt')
predictions = classifier.predict_labels(vec,"data/test_tokens.txt")
glove_f1= f1_score(predictions[1], predictions[0], average='weighted')
glove_accuracy=accuracy_score(predictions[1],predictions[0])
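Remember that predict_labels() returns integer-encoded labels. If you want to see the actual language names, you can inverse_transform a few of them with the classifier's LabelEncoder; here is a quick illustration (the slice size is arbitrary):
print(classifier.label_encoder.inverse_transform(predictions[0][:5])) #the first few predicted labels as strings
print(classifier.label_encoder.inverse_transform(predictions[1][:5])) #the corresponding gold labels as strings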
After that, it is time to train and test with the CountVectorizer embeddings. For this we create a fresh Classifier and train it with train_count():
classifier = Classifier()
classifier.train_count('data/count_train_tokens.txt')
predictions = classifier.predict_labels_count("data/count_test_tokens.txt")
count_accuracy=accuracy_score(predictions[1],predictions[0])
count_f1=f1_score(predictions[1], predictions[0], average='weighted')
If you are particularly observant, you probably noticed that I am using different data files for the CountVectorizer. This is because the vectorizer already operates on characters, so there is no need to separate the characters with spaces. The files contain exactly the same data as their GloVe equivalents. If you are unconvinced, open the repository and take a look.
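If you would rather check from Python, here is a minimal sketch (assuming you run it from the repository root, where the data folder lives) that prints the first few lines of both training files side by side:
f_glove = open('data/train_tokens.txt', 'r', encoding='utf8') #the GloVe-style training file with space-separated characters
f_count = open('data/count_train_tokens.txt', 'r', encoding='utf8') #the CountVectorizer training file
for i in range(3): #comparing the first three lines of each file
    print(repr(f_glove.readline()), repr(f_count.readline()))
f_glove.close()
f_count.close()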
Compare
Now we can finally compare the two:
print("Here are the results of the GloVe embeddings:\n","F1_score: ",glove_f1,"\n", "accuracy: ", glove_accuracy,"\n")
print("Here are the results of the CountVectorizer embeddings:\n","F1_score: ",count_f1,"\n", "accuracy: ", count_accuracy)
Here are the results of the GloVe embeddings:
- F1_score: 0.6131187454146257
- accuracy: 0.5547794117647059
Here are the results of the CountVectorizer embeddings:
- F1_score: 0.7720255415700806
- accuracy: 0.7511859582542695
Results
As you can see, the CountVectorizer implementation performs much better. However, even the CountVectorizer implementation does not yield amazing results. This is likely because it is quite a difficult task: German, French, and English share many cognates (words that have similar roots). The data also contains a variety of names, which are a particular problem. For example, “Irene” can be a name in French, English, or German. Considering these difficulties, both performances are still pretty good.
Why are the GloVe embeddings performing so much worse? That is a good question. It seems global co-occurrence is not very useful in the character space. This might be because the vocabulary is so small and characters generally have a high probability of co-occurring with each other. It may be the case that GloVe only really becomes useful with a large vocabulary, where we can properly leverage global co-occurrences. In a small vocabulary space, the count-based methods may have the edge. These results are unexpected, but very interesting. Does this mean we should never do GloVe-style character embedding? I would not say that definitively. If we performed a similar task with more diverse languages and a larger character inventory, we might end up with results that favor GloVe.