This is a small framework that makes it easy to evaluate Language Models with the STS Benchmark as well as other task-specific evaluation datasets. With it you can compare different models, or versions of the same model improved by fine-tuning. The framework currently uses STSBenchmark, the Spanish portion of STS2017, and an example of a custom evaluation dataset.
The framework wraps models from different sources and runs the selected evaluation with them, producing a standardized JSON output.
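The framework's own wrapper API is not reproduced here; the following is a minimal sketch of the kind of evaluation it standardizes, assuming sentence-transformers, scipy, and scikit-learn are available. The function name evaluate_sts and the JSON fields are illustrative, not the framework's actual interface.

```python
# Illustrative sketch only: embed sentence pairs, score their similarity,
# and emit a standardized JSON result (hypothetical field names).
import json
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import paired_cosine_distances

def evaluate_sts(model_name, sentences1, sentences2, gold_scores):
    model = SentenceTransformer(model_name)
    emb1 = model.encode(sentences1, convert_to_numpy=True)
    emb2 = model.encode(sentences2, convert_to_numpy=True)
    # Cosine similarity between each sentence pair
    cosine_scores = 1 - paired_cosine_distances(emb1, emb2)
    # Spearman correlation against the gold similarity labels
    spearman, _ = spearmanr(gold_scores, cosine_scores)
    return {"model": model_name, "dataset": "STSBenchmark", "spearman": float(spearman)}

if __name__ == "__main__":
    result = evaluate_sts(
        "sentence-transformers/all-MiniLM-L6-v2",
        ["A man is playing a guitar.", "A dog runs in the park."],
        ["Someone plays a guitar.", "A cat sleeps on the sofa."],
        [4.8, 0.5],
    )
    print(json.dumps(result, indent=2))
```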
Models can be sourced from:
Main Goal: Extension to other evaluation datasets
The main goal of this framework is to help evaluate Language Models on other context-specific tasks, as sketched below.
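As a rough sketch of what such an extension might look like (the TSV layout and the column names sentence1, sentence2, and score below are assumptions, not the framework's documented format), a custom STS-style dataset only needs sentence pairs plus gold similarity scores:

```python
# Hypothetical loader for a context-specific STS-style dataset stored as TSV.
import csv

def load_custom_sts(path):
    sentences1, sentences2, gold = [], [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            sentences1.append(row["sentence1"])
            sentences2.append(row["sentence2"])
            gold.append(float(row["score"]))
    return sentences1, sentences2, gold

# Reuse the evaluation sketch above on the custom dataset:
# s1, s2, gold = load_custom_sts("my_domain_sts.tsv")
# result = evaluate_sts("sentence-transformers/all-MiniLM-L6-v2", s1, s2, gold)
```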
Evaluation Results on current datasets
Check this notebook for the current results of evaluating several LMs on the standard datasets and on the context-specific example. These results closely match the ones published on PapersWithCode and SBERT Pretrained Models.