Comparing Neural Language Models for Classification, in Spanish

March 31, 2020 (updated April 3, 2020), eduardofv

We are seeing a huge increase in the use of Natural Language Processing techniques, but there is still even greater potential that will be harnessed in the upcoming months and years. As Sebastian Ruder has said, we are living NLP’s ImageNet moment. It seems that research teams in the deep learning community release new and better language models every week.

As Machine Learning and Artificial Intelligence practitioners, our job is to create real-world applications that solve business needs, taking into account both the latest advances and practical implementation issues. We may not use the latest, biggest, state-of-the-art model as soon as it’s out on GitHub or TensorFlow Hub, but rather an easier-to-implement, faster and/or lighter model until a more complex one makes sense. That said, it’s helpful to have a method that lets us evaluate models faster and more easily for our specific business tasks. Some results can also serve as general guidelines on model performance, in the hope that they generalize to other tasks.

I’ve been working on such a method, and I’d like to share what I’ve found from comparing several models based on the Neural-Network Language Model made available by Google at TensorFlow Hub, focusing on Spanish. There are fewer resources for Spanish than for other languages, so I hope this helps. The most important results I found:

NNLM models trained in Spanish perform better than those trained in English.
Normalized models perform better than non-normalized models.
In general, the 50- and 128-dimension models performed similarly, although hyperparameter optimization may improve both, especially the latter.

Please look at this GitHub repository for details and full results. I hope to add more results with other models.

MeLi 2019 – Colab Notebook with best model

October 8, 2019, eduardofv

Check the model in Google Colab and play with it. It should be pretty self-explanatory, but ask me on Twitter if there is anything I can help with.

MeLi Data Challenge 2019 – Multiclass Classification in Keras

October 4, 2019, eduardofv

The MercadoLibre Data Challenge 2019 was a great Kaggle-style competition with an awesome prize: tickets (plus accommodation and airfare) to the Khipu Latin American conference on Artificial Intelligence.

I gave it a try, implementing several ideas I already had in my head. First I tried a standard BERT embeddings classifier, which proved harder to implement in Keras given all the changes going on with TensorFlow 2.0. Next I tried a Multi-Layer Perceptron (MLP) fed with fixed, precalculated BERT sentence embeddings. These two approaches didn’t get very far but may be interesting starting points to try in the future.

Finally, a simpler model gave me the best results I could get, with a Balanced Accuracy Score of 0.87, which put me in 45th place on the leaderboard. The model uses a TensorFlow Hub Keras layer fine-tuned from the Neural Probabilistic Language Model described here, followed by an MLP, trained with Adam and Sparse Categorical Crossentropy loss. Check the GitHub repository for details and code.
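For reference, balanced accuracy is commonly defined as the mean of the per-class recall, so every class counts equally regardless of how many examples it has. The formula below assumes that definition (the one used by scikit-learn's balanced_accuracy_score); the post itself does not spell it out. Here K is the number of classes, TP_k is the number of items of class k predicted correctly, and FN_k is the number of items of class k predicted as some other class.

```latex
\text{balanced accuracy} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FN_k}
```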

Some interesting findings:

The language model that I used was Spanish-based but performed almost equally well in Portuguese (nnlm-es-128dim-with-normalization).
I got slightly better results by training the same model separately on each language and merging the results, but a single model trained on the whole dataset performed almost as well.

The kind folks at MercadoLibre ran a kickstart workshop and gave away a model that uses ULMFiT and also reaches 0.87. I think it’s very interesting that my model, using a different approach, reaches a very similar result, and that’s why I’m pretty happy with it despite being very far from the top places!

Dataset of First Names and Surnames of People in Mexico

October 20, 2017 (updated October 25, 2017), eduardofv

To generate a representative dataset of the names of people in Mexico, I used an idea taken from datamx, which uses an open database from the Secretaría de Educación Pública (SEP) with the names of 1,256,438 federalized workers.

The data cleaning and processing is in analizar_nombres_sep.R. It does the following:

Removes duplicates, using the CURP as the key
Gets the gender from character 11 of the CURP
Gets the year of birth from characters 5 and 6 of the CURP
Calculates the age of each record as of 2012, which is the year of the last update according to the SEP page
Calculates the frequencies of first and second surnames, removes those with a frequency below 5, and removes some that are garbage or null. Creates a single data frame and saves it
Splits the dataset into men and women, calculates the frequencies of the given names in each case, removes those with a frequency below 5, calculates the mean age for each name, and saves the data frame

The format is similar to that of the frequent names and surnames in Spain, which inspired this project. Note that this is a large sample, but it is two orders of magnitude smaller than a complete one would be, and it is biased because it includes only SEP workers.

Project on GitHub | Open Data Repository
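The CURP steps above can be sketched in a few lines of R. This is only an illustration, not the code from analizar_nombres_sep.R: the sample CURPs are made up, the column names are hypothetical, and the birth year assumes every worker was born in the 1900s (reasonable for adults employed in 2012, but an assumption nonetheless).

```r
# Hypothetical sample CURPs, invented for illustration only
curps <- c("GOMC670512HDFRRL09", "LOPA850923MJCPRN05", "GOMC670512HDFRRL09")

people <- data.frame(curp = curps, stringsAsFactors = FALSE)

# Remove duplicates using the CURP as the key
people <- people[!duplicated(people$curp), , drop = FALSE]

# Character 11 holds the sex: "H" (hombre) or "M" (mujer)
people$gender <- substr(people$curp, 11, 11)

# Characters 5 and 6 hold the two-digit year of birth.
# Assumption: all birth years are in the 1900s.
people$birth_year <- 1900L + as.integer(substr(people$curp, 5, 6))

# Age as of 2012, the year the source data was last updated
people$age <- 2012L - people$birth_year

print(people)
```

The frequency filtering described in the last two steps would then operate on tables of names and surnames built from this cleaned data frame.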

Men

	nombre	frec	edad_media	prob
1	JOSE LUIS	7028	45.13	0.0181661682257485
2	MIGUEL ANGEL	5137	41.78	0.0132782592737151
3	FRANCISCO	4853	46.73	0.0125441682412575
4	JUAN	4655	47.27	0.0120323723806004
5	JESUS	4198	44.66	0.0108511061769625
6	ALEJANDRO	4042	41.72	0.0104478730746266
7	ANTONIO	3961	46.33	0.0102385020407214
8	JORGE	3847	45.3	0.00994383169670667
9	PEDRO	3830	46.09	0.00989988962786237
10	CARLOS	3765	45.34	0.00973187583522241

Women

	nombre	frec	edad_media	prob
1	MARIA GUADALUPE	7105	42.81	0.0122739553749732
2	LETICIA	5848	43.66	0.0101024758666915
3	PATRICIA	5422	42.41	0.00936655679705909
4	GUADALUPE	5348	43.38	0.00923872109012763
5	MARIA DEL CARMEN	4881	44.04	0.00843197412881693
6	VERONICA	4772	38.18	0.008243675587526
7	MARGARITA	4674	45.41	0.00807437965131947
8	ELIZABETH	4661	38.18	0.00805192202712881
9	SILVIA	4223	45.43	0.00729527284285882
10	ROSA MARIA	4107	46.97	0.00709488173469599

Surnames

	apellido	frec_pri	frec_seg
3362	HERNANDEZ	44095	44333
3061	GARCIA	33010	33351
4278	MARTINEZ	31080	31087
3995	LOPEZ	30288	30188
3185	GONZALEZ	25356	25362
5960	RODRIGUEZ	22642	22490
5438	PEREZ	22470	22353
6178	SANCHEZ	21800	21782
5769	RAMIREZ	18806	18632
2924	FLORES	14160	13907

When transforming vectors to their Principal Components, are their relations preserved?

September 16, 2017 (updated October 20, 2017), eduardofv

pca test

Question: When transforming vectors to their Principal Components, are their relations preserved?

We have a set of vectors. After performing Principal Component Analysis (PCA), we use the “rotated” vectors for further analysis. Can we be confident that the original relations (cosine similarity between vectors) are preserved in the new vector space?

knitr::opts_chunk$set(echo = TRUE)
set.seed(1234)
#Cosine Similarity
cos.sim <- function(A,B) 
{
  return( sum(A*B)/sqrt(sum(A^2)*sum(B^2)) )
}

1 – Generate a test set y=x+err

# noisy version of y = x
x = runif(n = 1000,min=-100,max=100) # x has a uniform distribution
y = x + rnorm(n=1000,mean=0,sd=20) # y is x plus normal noise (sd = 20)

d <- data.frame(x=x,y=y)
# view the first points
head(d)

##           x            y
## 1 -77.25932 -57.56371837
## 2  24.45988  -0.03487656
## 3  21.85495  36.04947094
## 4  24.67589  22.49148848
## 5  72.18308 107.83523462
## 6  28.06212  23.19322747

plot(d)

2 – Perform PCA

pca <- prcomp(d,center = FALSE,scale. = FALSE) # no centering or scaling, so the data is only rotated
summary(pca)

## Importance of components:
##                           PC1      PC2
## Standard deviation     83.040 13.30478
## Proportion of Variance  0.975  0.02503
## Cumulative Proportion   0.975  1.00000

pca$rotation

##        PC1       PC2
## x 0.691987 -0.721910
## y 0.721910  0.691987

dt <- pca$x #d transformed
head(dt)

##            PC1        PC2
## [1,] -95.01826  15.940928
## [2,]  16.90074 -17.681966
## [3,]  41.14781   9.168461
## [4,]  33.31222  -2.249953
## [5,] 127.79708  22.510896
## [6,]  36.16204  -4.208914

plot(dt)

3 – For a given vector, calculate its relation (cosine similarity) to all the others in both spaces

a <- d[2,] # choose a point in the original space
s <- apply(d,1,function(x) cos.sim(a,x)) # cosine similarity to every vector (including itself)
at <- dt[2,] # get the same point in the new space
st <- apply(dt,1,function(x) cos.sim(at,x)) # cosine similarities in the new space

The angles are the same

head(data.frame(s=s,st=st))

##            s         st
## 1 -0.8010403 -0.8010403
## 2  1.0000000  1.0000000
## 3  0.5171996  0.5171996
## 4  0.7381007  0.7381007
## 5  0.5550765  0.5550765
## 6  0.7698978  0.7698978

Check that the closest points are the same

head(data.frame(s=order(s,decreasing = T),st=order(st,decreasing = T)))

##     s  st
## 1   2   2
## 2 898 898
## 3 387 387
## 4  47  47
## 5 571 571
## 6 299 299

Answer: YES
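A short derivation suggests why this is expected rather than a coincidence. It assumes the setup used here: no centering, no scaling, and all components kept, so each transformed vector is the original multiplied by a square orthogonal rotation matrix W (the notation W, a', b' is introduced only for this sketch), with W^T W = W W^T = I.

```latex
a' = W^\top a, \qquad b' = W^\top b

\cos(a', b')
  = \frac{(W^\top a)^\top (W^\top b)}{\lVert W^\top a \rVert \, \lVert W^\top b \rVert}
  = \frac{a^\top W W^\top b}{\sqrt{a^\top W W^\top a} \, \sqrt{b^\top W W^\top b}}
  = \frac{a^\top b}{\lVert a \rVert \, \lVert b \rVert}
  = \cos(a, b)
```

If some components were dropped, W would no longer be square and W W^T would not be the identity, so the equality would hold only approximately.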

4 – Bonus: magnitudes are also the same

magd <- sqrt(rowSums(d*d))
magdt <- sqrt(rowSums(dt*dt))

head(data.frame(magd=magd,magdt=magdt))

##        magd     magdt
## 1  96.34617  96.34617
## 2  24.45991  24.45991
## 3  42.15689  42.15689
## 4  33.38812  33.38812
## 5 129.76453 129.76453
## 6  36.40616  36.40616
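The results above depend on running PCA without centering. The sketch below, which regenerates the same kind of data so that it is self-contained, checks two things: that the rotation matrix is orthonormal, and how the cosine similarities behave when prcomp is asked to center the data first. The variable names are illustrative and not part of the original analysis.

```r
set.seed(1234)
x <- runif(1000, min = -100, max = 100)
y <- x + rnorm(1000, mean = 0, sd = 20)
d <- data.frame(x = x, y = y)
m <- as.matrix(d)

cos_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Uncentered PCA: the scores are the rows of m times an orthonormal matrix
pca_raw <- prcomp(d, center = FALSE, scale. = FALSE)
rot <- unname(pca_raw$rotation)
is_orthonormal <- isTRUE(all.equal(crossprod(rot), diag(2)))

# Centered PCA: the data is shifted to its mean before being rotated
pca_ctr <- prcomp(d, center = TRUE, scale. = FALSE)

# Cosine similarity of point 2 to every point, in each space
s_orig <- apply(m, 1, function(v) cos_sim(m[2, ], v))
s_raw  <- apply(pca_raw$x, 1, function(v) cos_sim(pca_raw$x[2, ], v))
s_ctr  <- apply(pca_ctr$x, 1, function(v) cos_sim(pca_ctr$x[2, ], v))

max_diff_raw <- max(abs(s_orig - s_raw)) # numerically zero
max_diff_ctr <- max(abs(s_orig - s_ctr)) # not zero: centering moves the origin

print(c(is_orthonormal = is_orthonormal))
print(c(raw = max_diff_raw, centered = max_diff_ctr))
```

Centering translates every vector by the column means, and a translation changes angles measured from the origin, so cosine similarities to the original vectors are only preserved in the uncentered case.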

5 – Bonus: you can retrieve the distributions

hist(as.data.frame(dt)$PC1) #this is close to uniform

hist(as.data.frame(dt)$PC2) #this is close to normal

The components look closer to the generating distributions than the original x and y do:

hist(d$x)

hist(d$y)
