MeLi Data Challenge 2019 – Multiclass Classification in Keras

The MercadoLibre Data Challenge 2019 was a great Kaggle-style competition with an awesome prize: tickets (plus accommodation and airfare) to Khipu, the Latin American conference on Artificial Intelligence.

I gave it a try, implementing several ideas I already had in mind. First I tried a standard BERT embeddings classifier, which proved harder to implement in Keras than expected given all the recent changes around TensorFlow 2.0. Next I tried a Multi Layer Perceptron (MLP) fed with fixed, precalculated BERT sentence embeddings. These two approaches didn’t get very far but may be interesting starting points to try in the future.

Finally, a simpler model gave me the best results I could get, with a Balanced Accuracy Score of 0.87, which put me in 45th place on the leaderboard. The model uses a TensorFlow Hub Keras layer fine-tuned from the Neural Probabilistic Language Model described here, followed by an MLP, trained with Adam and Sparse Categorical Crossentropy loss. Check the GitHub repository for details and code.
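For readers unfamiliar with the metric, here is a minimal sketch of a balanced accuracy score, assuming the challenge used the usual definition (the mean of per-class recall, which is what scikit-learn's balanced_accuracy_score computes). The function name and the toy labels are mine, not from the competition code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition of the challenge metric)."""
    correct = defaultdict(int)  # correctly predicted examples per true class
    total = defaultdict(int)    # number of examples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: a classifier that always answers "a" on an
# imbalanced set gets a high plain accuracy but a low balanced accuracy.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

Because every class weighs the same regardless of its size, a model cannot score well just by getting the most frequent categories right.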

Some interesting findings:

  • The language model that I used (nnlm-es-128dim-with-normalization) was Spanish-based but performed almost equally well on Portuguese
  • I got slightly better results by training the same model separately on each language and merging the results, but a single model trained on the whole dataset performed almost as well
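One way to read "training separately and merging the results" is to route each example to the model for its language and put the predictions back in the original order. The sketch below assumes exactly that, and assumes each row carries a language tag. The function, the dictionary of models and the stub predictors are hypothetical stand-ins, not the code in the repository.

```python
def predict_by_language(titles, languages, models):
    """Route each title to its language-specific model and merge the outputs.

    models: dict mapping a language tag to a callable that takes a list of
    titles and returns a list of labels (hypothetical interface).
    """
    predictions = [None] * len(titles)
    for lang, model in models.items():
        # Positions of the rows written in this language
        idx = [i for i, row_lang in enumerate(languages) if row_lang == lang]
        if not idx:
            continue
        labels = model([titles[i] for i in idx])
        # Write each label back to the row it came from
        for i, label in zip(idx, labels):
            predictions[i] = label
    return predictions


# Stub "models" standing in for the two trained classifiers
stub_models = {
    "spanish": lambda batch: ["ES:" + t for t in batch],
    "portuguese": lambda batch: ["PT:" + t for t in batch],
}
merged = predict_by_language(
    ["zapatillas", "tenis", "celular"],
    ["spanish", "portuguese", "spanish"],
    stub_models,
)
print(merged)
```

Rows whose language has no model stay as None, which makes missing coverage easy to spot.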

The kind folks at MercadoLibre ran a kickstart workshop and gave away a model that uses ULMFiT and also reaches 0.87. I think it’s very interesting that my model, using a different approach, reaches a very similar result, and that’s why I’m pretty happy with it despite being very far from the top places!

Dataset of First Names and Surnames of People in Mexico

To build a representative dataset of the names of people in Mexico, this project uses an idea taken from datamx, which relies on an open database from the Secretaría de Educación Pública (SEP) with the names of 1,256,438 federalized workers.

The data cleaning and processing is in analizar_nombres_sep.R. It does the following:

  • Removes duplicates using the CURP as the key
  • Gets the gender from character 11 of the CURP
  • Gets the birth year from characters 5 and 6 of the CURP
  • Computes the age of each record as of 2012, which is the update year according to the SEP page
  • Computes the frequencies of first and second surnames, drops those with a frequency below 5, and drops some that are junk or null. Builds a single data frame and saves it
  • Splits the dataset into men and women, computes the first-name frequencies for each group, drops names with a frequency below 5, computes the mean age for each name, and saves the data frame
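The original processing is done in R. The steps above can be sketched in Python as follows, as an illustration rather than a port of analizar_nombres_sep.R. The function names, the (curp, first_name) record layout and the made-up CURP are hypothetical, and the sketch assumes every birth year is 19YY, since the CURP stores only two digits for the year.

```python
from collections import defaultdict

REFERENCE_YEAR = 2012  # update year of the SEP data, per the text
MIN_FREQ = 5           # names with a lower frequency are dropped


def parse_curp(curp):
    """Return (gender, birth_year) read from fixed CURP positions.

    Assumption: birth years are 19YY (the CURP stores only two year digits).
    """
    gender = curp[10]                   # character 11 (1-based): 'H' or 'M'
    birth_year = 1900 + int(curp[4:6])  # characters 5 and 6 (1-based)
    return gender, birth_year


def name_frequencies(records, gender, min_freq=MIN_FREQ):
    """Summarize first names for one gender.

    records: iterable of (curp, first_name) pairs (hypothetical layout).
    Returns [(name, frequency, mean_age), ...] sorted by descending frequency.
    """
    seen_curps = set()
    ages_by_name = defaultdict(list)
    for curp, name in records:
        if curp in seen_curps:          # drop duplicates, CURP is the key
            continue
        seen_curps.add(curp)
        record_gender, birth_year = parse_curp(curp)
        if record_gender == gender:
            ages_by_name[name].append(REFERENCE_YEAR - birth_year)
    rows = [
        (name, len(ages), round(sum(ages) / len(ages), 2))
        for name, ages in ages_by_name.items()
        if len(ages) >= min_freq
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up CURP, used only to show which positions are read
print(parse_curp("AAAA800101HDFBBB09"))
```

Keeping the first occurrence of a duplicated CURP is also an assumption; the text only says duplicates are removed.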

The format is similar to that of the dataset of frequent first names and surnames in Spain, which inspired this project. Note that the sample is large but two orders of magnitude smaller than a complete one would be, and that it is biased because it only covers SEP workers.

Project on GitHub | Open Data Repository

Men

    nombre        frec  edad_media  prob
1   JOSE LUIS     7028  45.13       0.0181661682257485
2   MIGUEL ANGEL  5137  41.78       0.0132782592737151
3   FRANCISCO     4853  46.73       0.0125441682412575
4   JUAN          4655  47.27       0.0120323723806004
5   JESUS         4198  44.66       0.0108511061769625
6   ALEJANDRO     4042  41.72       0.0104478730746266
7   ANTONIO       3961  46.33       0.0102385020407214
8   JORGE         3847  45.3        0.00994383169670667
9   PEDRO         3830  46.09       0.00989988962786237
10  CARLOS        3765  45.34       0.00973187583522241

Women

    nombre            frec  edad_media  prob
1   MARIA GUADALUPE   7105  42.81       0.0122739553749732
2   LETICIA           5848  43.66       0.0101024758666915
3   PATRICIA          5422  42.41       0.00936655679705909
4   GUADALUPE         5348  43.38       0.00923872109012763
5   MARIA DEL CARMEN  4881  44.04       0.00843197412881693
6   VERONICA          4772  38.18       0.008243675587526
7   MARGARITA         4674  45.41       0.00807437965131947
8   ELIZABETH         4661  38.18       0.00805192202712881
9   SILVIA            4223  45.43       0.00729527284285882
10  ROSA MARIA        4107  46.97       0.00709488173469599

Surnames

      apellido   frec_pri  frec_seg
3362  HERNANDEZ  44095     44333
3061  GARCIA     33010     33351
4278  MARTINEZ   31080     31087
3995  LOPEZ      30288     30188
3185  GONZALEZ   25356     25362
5960  RODRIGUEZ  22642     22490
5438  PEREZ      22470     22353
6178  SANCHEZ    21800     21782
5769  RAMIREZ    18806     18632
2924  FLORES     14160     13907

When transforming vectors to their Principal Components, are their relations preserved?

Question: When transforming vectors to their Principal Components, are their relations preserved?

We have a set of vectors. After performing Principal Component Analysis (PCA), we use the “rotated” vectors for further analysis. Can we be confident that the original relations (the cosine similarity between vectors) are preserved in the new vector space?

knitr::opts_chunk$set(echo = TRUE)
set.seed(1234)
# Cosine similarity between two numeric vectors
cos.sim <- function(A, B) {
  sum(A * B) / sqrt(sum(A^2) * sum(B^2))
}

1 – Generate a test set y=x+err

# "noisy" y = x
x <- runif(n = 1000, min = -100, max = 100) # x is uniform on [-100, 100]
y <- x + rnorm(n = 1000, mean = 0, sd = 20) # y is x plus Gaussian noise (sd = 20)

d <- data.frame(x = x, y = y)
# show the first points
head(d)
##           x            y
## 1 -77.25932 -57.56371837
## 2  24.45988  -0.03487656
## 3  21.85495  36.04947094
## 4  24.67589  22.49148848
## 5  72.18308 107.83523462
## 6  28.06212  23.19322747
plot(d)

2 – Perform PCA

pca <- prcomp(d, center = FALSE, scale. = FALSE)
summary(pca)
## Importance of components:
##                           PC1      PC2
## Standard deviation     83.040 13.30478
## Proportion of Variance  0.975  0.02503
## Cumulative Proportion   0.975  1.00000
pca$rotation
##        PC1       PC2
## x 0.691987 -0.721910
## y 0.721910  0.691987
dt <- pca$x #d transformed
head(dt)
##            PC1        PC2
## [1,] -95.01826  15.940928
## [2,]  16.90074 -17.681966
## [3,]  41.14781   9.168461
## [4,]  33.31222  -2.249953
## [5,] 127.79708  22.510896
## [6,]  36.16204  -4.208914
plot(dt)
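
Because prcomp was called with center = FALSE and scale. = FALSE, pca$x is simply d multiplied by pca$rotation, and that matrix has orthogonal columns of unit length. Such a transformation preserves dot products and norms, so cosine similarities should not change. With centering enabled (the prcomp default), the points would be shifted before rotating, and cosine similarities measured from the origin would in general change. The sketch below is a quick cross-check in Python, not part of the original R analysis: it uses only the rounded values printed above, so it agrees with head(dt) only up to rounding.

```python
import math


def cos_sim(a, b):
    """Cosine similarity, same formula as cos.sim in the R code."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


# pca$rotation as printed above (rounded); columns are PC1 and PC2
ROTATION = [[0.691987, -0.721910],
            [0.721910, 0.691987]]


def rotate(v, rotation=ROTATION):
    """Project a row vector onto the principal axes (v %*% rotation in R)."""
    return [sum(v[i] * rotation[i][j] for i in range(len(v)))
            for j in range(len(rotation[0]))]


# First three rows of d, copied from head(d)
rows = [[-77.25932, -57.56371837],
        [24.45988, -0.03487656],
        [21.85495, 36.04947094]]
rotated = [rotate(r) for r in rows]

# Compare every pair of rows before and after the rotation
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        before = cos_sim(rows[i], rows[j])
        after = cos_sim(rotated[i], rotated[j])
        print(f"rows {i + 1},{j + 1}: before={before:.6f} after={after:.6f}")
```

The rotated rows can also be compared against head(dt) as a sanity check that the hand-copied matrix is the one prcomp applied.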