Comparing Neural Language Models for Classification, in Spanish


Natural Language Processing techniques are being adopted at a rapidly growing pace, and there is an even bigger potential still to be harnessed in the coming months and years. As Sebastian Ruder has said, we are living NLP’s ImageNet moment. New and better language models seem to come out of deep learning research teams every week.

As Machine Learning & Artificial Intelligence practitioners, our job is to build real-world applications that solve business needs, weighing both the latest advances and practical implementation issues. We may not adopt the latest, biggest, state-of-the-art model as soon as it’s out on Github or Tensorflow Hub; often we will use an easier-to-implement, faster and/or lighter model until a more complex one makes sense. That said, it’s helpful to have a method for evaluating models quickly and easily on our specific business tasks. Some of the results can also serve as general guidelines on model performance, in the hope that they generalize to other tasks.

I’ve been working on such a method and I’d like to share what I found when comparing several models based on the Neural-Network Language Model that Google makes available on Tensorflow Hub, focusing on Spanish. There are fewer resources for Spanish than for other languages, so I hope this helps. The most important results:

  • NNLM models trained on Spanish perform better than those trained on English.
  • Normalized models perform better than non-normalized models.
  • In general, the 50- and 128-dimension models performed similarly, although hyperparameter optimization may improve both, especially the latter.

Please look at this Github repository for details and full results. I hope to be able to add more results with other models.

MeLi Data Challenge 2019 – Multiclass Classification in Keras

The MercadoLibre Data Challenge 2019 was a great Kaggle-style competition with an awesome prize: tickets to the Khipu Latin American conference on Artificial Intelligence, with accommodation and air tickets included.

I gave it a try, implementing several ideas I already had in mind. First I tried a standard BERT embeddings classifier, which proved hard to implement in Keras given all the changes going on around Tensorflow 2.0. Next I tried a Multi Layer Perceptron (MLP) fed with fixed, precalculated BERT sentence embeddings. Neither approach got very far, but both may be interesting starting points for the future.

Finally, a simpler model gave me my best result: a Balanced Accuracy Score of 0.87, which put me in 45th place on the leaderboard. The model uses a Tensorflow Hub Keras layer fine-tuned from the Neural Probabilistic Language Model described here, followed by an MLP, and is trained with Adam and Sparse Categorical Crossentropy loss. Check the Github repository for details and code.
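
Balanced accuracy, as commonly defined (for example by scikit-learn's balanced_accuracy_score), is the unweighted mean of per-class recall, so rare categories weigh as much as frequent ones. A minimal R sketch with made-up labels; `truth`, `pred` and the helper `balanced_accuracy` are illustrative and are not taken from the challenge or its data:

```r
# Balanced accuracy = mean over classes of
# (correct predictions in the class / true members of the class).
balanced_accuracy <- function(truth, pred) {
  classes <- unique(truth)
  recalls <- vapply(classes, function(k) mean(pred[truth == k] == k), numeric(1))
  mean(recalls)
}

# Made-up labels (hypothetical): class "a" is frequent, class "b" is rare.
truth <- c("a", "a", "a", "a", "b", "b")
pred  <- c("a", "a", "a", "a", "a", "b")

plain_acc <- mean(pred == truth)             # 5 of 6 predictions are correct
bal_acc   <- balanced_accuracy(truth, pred)  # recall is 1 for "a" and 0.5 for "b"
```

Here the plain accuracy looks good because the frequent class dominates, while the balanced score is pulled down by the rare class that is misclassified half of the time.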

Some interesting findings:

  • The language model I used (nnlm-es-128dim-with-normalization) is Spanish-based but performed almost equally well on Portuguese.
  • I got slightly better results by training the same model separately on each language and merging the results, but a single model trained on the whole dataset performed almost as well.

The kind guys at MercadoLibre ran a kickstart workshop and gave away a model that uses ULMFiT and also reaches 0.87. I find it very interesting that my model reaches a very similar result with a different approach, and that’s why I’m pretty happy with it despite being very far from the top places!

Dataset of First Names and Surnames of People in Mexico

To generate a representative dataset of people's names in Mexico, this project uses an idea taken from datamx, which relies on an open database from the Secretaría de Educación Pública (SEP) with the names of 1,256,438 federalized workers.

The data cleaning and processing is in analizar_nombres_sep.R. It does the following:

  • Removes duplicates, using the CURP as the key
  • Gets the gender from character 11 of the CURP
  • Gets the year of birth from characters 5 and 6 of the CURP
  • Calculates the age of each record as of 2012, the year of the last update according to the SEP page
  • Calculates the frequencies of first and second surnames, removes those with a frequency below 5 and removes some that are garbage or null. Builds a single data frame and saves it
  • Splits the dataset into men and women, calculates the name frequencies in each case, removes names with a frequency below 5, calculates the mean age for each name and saves the data frame
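
The steps above can be sketched in base R. This is not the code in analizar_nombres_sep.R: the column names (`curp`, `nombre`, `genero`, `anio_nac`, `edad`) and the CURPs below are made up for illustration, and the sketch assumes every birth year falls in the 1900s.

```r
# Hypothetical input: made-up CURPs, with one duplicated record.
sep <- data.frame(
  curp   = c("GALJ700315HDFRPN01", "LOPM850720MJCPRR09", "GALJ700315HDFRPN01"),
  nombre = c("JUAN", "MARIA", "JUAN"),
  stringsAsFactors = FALSE
)

# 1. Remove duplicates, using the CURP as the key.
sep <- sep[!duplicated(sep$curp), ]

# 2. Character 11 of the CURP encodes sex: H (hombre) or M (mujer).
sep$genero <- substr(sep$curp, 11, 11)

# 3. Characters 5 and 6 hold the two-digit birth year (assumed to be 19xx here).
sep$anio_nac <- 1900L + as.integer(substr(sep$curp, 5, 6))

# 4. Age as of 2012, the year the source data was last updated.
sep$edad <- 2012L - sep$anio_nac

# 5. Name frequencies; the description above drops names with frequency < 5.
min_frec <- 1L  # would be 5 on real data; 1 keeps this toy example non-empty
frec <- as.data.frame(table(nombre = sep$nombre), stringsAsFactors = FALSE)
frec <- frec[frec$Freq >= min_frec, ]
```

On real data the 19xx assumption would need checking, since a two-digit year alone cannot distinguish 1905 from 2005.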

The format is similar to that of the frequent first names and surnames in Spain, which inspired this project. Note that the sample is large but two orders of magnitude smaller than a complete one, and that it is biased because it contains only SEP workers.

Project on Github | Open Data Repository

Men

Columns: nombre (name), frec (frequency), edad_media (mean age), prob (probability).

    nombre        frec  edad_media  prob
1   JOSE LUIS     7028  45.13       0.0181661682257485
2   MIGUEL ANGEL  5137  41.78       0.0132782592737151
3   FRANCISCO     4853  46.73       0.0125441682412575
4   JUAN          4655  47.27       0.0120323723806004
5   JESUS         4198  44.66       0.0108511061769625
6   ALEJANDRO     4042  41.72       0.0104478730746266
7   ANTONIO       3961  46.33       0.0102385020407214
8   JORGE         3847  45.3        0.00994383169670667
9   PEDRO         3830  46.09       0.00989988962786237
10  CARLOS        3765  45.34       0.00973187583522241

Women

    nombre            frec  edad_media  prob
1   MARIA GUADALUPE   7105  42.81       0.0122739553749732
2   LETICIA           5848  43.66       0.0101024758666915
3   PATRICIA          5422  42.41       0.00936655679705909
4   GUADALUPE         5348  43.38       0.00923872109012763
5   MARIA DEL CARMEN  4881  44.04       0.00843197412881693
6   VERONICA          4772  38.18       0.008243675587526
7   MARGARITA         4674  45.41       0.00807437965131947
8   ELIZABETH         4661  38.18       0.00805192202712881
9   SILVIA            4223  45.43       0.00729527284285882
10  ROSA MARIA        4107  46.97       0.00709488173469599

Surnames

Columns: apellido (surname), frec_pri (frequency as first surname), frec_seg (frequency as second surname).

      apellido   frec_pri  frec_seg
3362  HERNANDEZ  44095     44333
3061  GARCIA     33010     33351
4278  MARTINEZ   31080     31087
3995  LOPEZ      30288     30188
3185  GONZALEZ   25356     25362
5960  RODRIGUEZ  22642     22490
5438  PEREZ      22470     22353
6178  SANCHEZ    21800     21782
5769  RAMIREZ    18806     18632
2924  FLORES     14160     13907
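
One possible use of these tables is drawing synthetic names in proportion to their observed frequency. A minimal sketch using only the first three rows of the men's and surnames tables; the helper `random_full_name` is hypothetical, the three-row data frames stand in for the full files, and the two surnames are assumed to be independent of each other and of the given name:

```r
# First three rows of the tables above (the real files have many more rows).
hombres <- data.frame(nombre = c("JOSE LUIS", "MIGUEL ANGEL", "FRANCISCO"),
                      frec   = c(7028, 5137, 4853),
                      stringsAsFactors = FALSE)
apellidos <- data.frame(apellido = c("HERNANDEZ", "GARCIA", "MARTINEZ"),
                        frec_pri = c(44095, 33010, 31080),
                        frec_seg = c(44333, 33351, 31087),
                        stringsAsFactors = FALSE)

# Hypothetical helper: draw a given name and two surnames, each weighted by
# its frequency. sample() accepts unnormalized weights, so raw counts work.
random_full_name <- function(nombres, apellidos) {
  nombre  <- sample(nombres$nombre, 1, prob = nombres$frec)
  primero <- sample(apellidos$apellido, 1, prob = apellidos$frec_pri)
  segundo <- sample(apellidos$apellido, 1, prob = apellidos$frec_seg)
  paste(nombre, primero, segundo)
}

set.seed(1234)
full_name <- random_full_name(hombres, apellidos)
```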

When transforming vectors to their Principal Components, are their relations preserved?



Question: When transforming vectors to their Principal Components, are their relations preserved?

We have a set of vectors. After performing Principal Component Analysis (PCA), we use the “rotated” vectors for our analysis. Can we be confident that the original relations (cosine similarity between vectors) are preserved in the new vector space?

knitr::opts_chunk$set(echo = TRUE)
set.seed(1234)
# Cosine similarity between two numeric vectors
cos.sim <- function(A, B) {
  sum(A * B) / sqrt(sum(A^2) * sum(B^2))
}

1 – Generate a test set y=x+err

# "noisy" y = x
x = runif(n = 1000, min = -100, max = 100)  # x has a uniform distribution
y = x + rnorm(n = 1000, mean = 0, sd = 20)  # y is x plus normally distributed noise

d <- data.frame(x=x,y=y)
# view the first points
head(d)
##           x            y
## 1 -77.25932 -57.56371837
## 2  24.45988  -0.03487656
## 3  21.85495  36.04947094
## 4  24.67589  22.49148848
## 5  72.18308 107.83523462
## 6  28.06212  23.19322747
plot(d)

2 – Perform PCA

pca <- prcomp(d, center = FALSE, scale. = FALSE)  # no centering or scaling: a pure rotation
summary(pca)
## Importance of components:
##                           PC1      PC2
## Standard deviation     83.040 13.30478
## Proportion of Variance  0.975  0.02503
## Cumulative Proportion   0.975  1.00000
pca$rotation
##        PC1       PC2
## x 0.691987 -0.721910
## y 0.721910  0.691987
dt <- pca$x #d transformed
head(dt)
##            PC1        PC2
## [1,] -95.01826  15.940928
## [2,]  16.90074 -17.681966
## [3,]  41.14781   9.168461
## [4,]  33.31222  -2.249953
## [5,] 127.79708  22.510896
## [6,]  36.16204  -4.208914
plot(dt)
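
With `center = FALSE` and `scale. = FALSE`, the transformed data is just the original data multiplied by the rotation matrix, and that matrix is orthogonal. A short self-contained check; it regenerates data like the test set above under its own illustrative names (`x0`, `d0`, `pca0`):

```r
set.seed(1234)
x0 <- runif(n = 1000, min = -100, max = 100)
d0 <- data.frame(x = x0, y = x0 + rnorm(n = 1000, mean = 0, sd = 20))

pca0 <- prcomp(d0, center = FALSE, scale. = FALSE)
R <- pca0$rotation

# R is orthogonal: t(R) %*% R is the 2 x 2 identity (up to rounding).
orthogonal <- isTRUE(all.equal(crossprod(R), diag(2), check.attributes = FALSE))

# The scores are the original rows multiplied by R.
same_scores <- isTRUE(all.equal(as.matrix(d0) %*% R, pca0$x,
                                check.attributes = FALSE))
```

Both flags should come out `TRUE`, which is what makes the comparison in the next step meaningful.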

3 – For a given vector, calculate its relation (cosine similarity) to all the others in both spaces

a <- unlist(d[2, ])  # choose a point in the original space (as a numeric vector)
s <- apply(d, 1, function(x) cos.sim(a, x))  # cosine similarity to every vector
at <- dt[2, ]  # the same point in the new space
st <- apply(dt, 1, function(x) cos.sim(at, x))  # cosine similarity in the new space

The cosine similarities are the same

head(data.frame(s=s,st=st))
##            s         st
## 1 -0.8010403 -0.8010403
## 2  1.0000000  1.0000000
## 3  0.5171996  0.5171996
## 4  0.7381007  0.7381007
## 5  0.5550765  0.5550765
## 6  0.7698978  0.7698978

Check that the closest points are the same

head(data.frame(s=order(s,decreasing = T),st=order(st,decreasing = T)))
##     s  st
## 1   2   2
## 2 898 898
## 3 387 387
## 4  47  47
## 5 571 571
## 6 299 299

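Answer: YES, as long as the data is not centered or scaled and all components are kept. The transformation is then a pure rotation, which leaves the angles between vectors unchanged.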
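
The reason can be written in one line. Assuming, as in the `prcomp` call above, no centering or scaling and all components kept, each vector $a$ is mapped to $a' = R^\top a$, where $R$ is the square rotation matrix and is orthogonal ($R R^\top = I$):

```latex
\[
\cos(a', b')
  = \frac{(R^\top a)^\top (R^\top b)}{\lVert R^\top a \rVert \, \lVert R^\top b \rVert}
  = \frac{a^\top R R^\top b}{\sqrt{a^\top R R^\top a} \, \sqrt{b^\top R R^\top b}}
  = \frac{a^\top b}{\lVert a \rVert \, \lVert b \rVert}
  = \cos(a, b)
\]
```

If some components are dropped, $R$ is no longer square, $R R^\top \neq I$, and the equality only holds approximately.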

4 – Bonus: magnitudes are also the same

magd <- sqrt(rowSums(d*d))
magdt <- sqrt(rowSums(dt*dt))

head(data.frame(magd=magd,magdt=magdt))
##        magd     magdt
## 1  96.34617  96.34617
## 2  24.45991  24.45991
## 3  42.15689  42.15689
## 4  33.38812  33.38812
## 5 129.76453 129.76453
## 6  36.40616  36.40616
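
Both results depend on `center = FALSE`. With the default `center = TRUE`, `prcomp` subtracts the column means before rotating, which moves the origin, so norms and cosine similarities (both measured from the origin) generally change, while pairwise Euclidean distances are still preserved. A sketch on a small made-up data set that is deliberately far from the origin; `m`, `norms`, `rot_only` and `centered` are illustrative names:

```r
set.seed(1234)
# Made-up data around (50, 80), so centering shifts it noticeably.
m <- cbind(x = rnorm(100, mean = 50, sd = 5),
           y = rnorm(100, mean = 80, sd = 5))

norms <- function(z) sqrt(rowSums(z * z))

rot_only <- prcomp(m, center = FALSE)$x  # pure rotation
centered <- prcomp(m, center = TRUE)$x   # translation followed by rotation

norms_kept_rot      <- isTRUE(all.equal(norms(m), norms(rot_only)))
norms_kept_centered <- isTRUE(all.equal(norms(m), norms(centered)))

# Pairwise Euclidean distances survive the centered transformation too.
dist_kept_centered <- isTRUE(all.equal(as.vector(dist(m)),
                                       as.vector(dist(centered))))
```

So if the analysis relies on cosine similarity, centering is a modelling decision and not a harmless default.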

5 – Bonus: you can retrieve the distributions

hist(as.data.frame(dt)$PC1) #this is close to uniform

hist(as.data.frame(dt)$PC2) #this is close to normal

PC2 looks closer to the normal noise than the original y does. For comparison, the original x and y:

hist(d$x)

hist(d$y)
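
The histograms can be backed by a number. Because the example is synthetic, the noise term can be kept in its own variable and compared with the components. The sketch below regenerates similar data under new, illustrative names (`x1`, `err1`, `y1`) and measures how strongly each component tracks its source; the sign of a principal component is arbitrary, hence `abs()`:

```r
set.seed(1234)
x1   <- runif(n = 1000, min = -100, max = 100)  # uniform signal
err1 <- rnorm(n = 1000, mean = 0, sd = 20)      # normal noise
y1   <- x1 + err1

pc <- prcomp(data.frame(x = x1, y = y1), center = FALSE, scale. = FALSE)$x

cor_pc1_x   <- abs(cor(pc[, "PC1"], x1))    # PC1 vs. the uniform signal
cor_pc2_err <- abs(cor(pc[, "PC2"], err1))  # PC2 vs. the normal noise
cor_y_err   <- abs(cor(y1, err1))           # original y vs. the noise
```

PC2 should correlate with the noise far more strongly than y does, because y mixes the noise with the much wider uniform signal.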