sts_eval: Easy Evaluation of Semantic Textual Similarity for Neural Language Models

Check on Github

This is a small framework that makes it easy to evaluate Language Models on the STS Benchmark as well as on other task-specific evaluation datasets. With it you can compare different models, or versions of the same model improved by fine-tuning. The framework currently uses STSBenchmark, the Spanish portion of STS2017, and an example of a custom evaluation dataset.

The framework wraps models from different sources and runs the selected evaluation with them, producing a standardized JSON output.
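For illustration, here is a minimal sketch of the kind of evaluation loop the framework automates. This is not the framework's actual API: the model name, the toy sentence pairs, and the JSON report layout are all assumptions; it only relies on the public sentence-transformers, scipy, and numpy packages.

```python
# Minimal sketch of an STS-style evaluation loop (not the framework's real API).
# Assumes: pip install sentence-transformers scipy numpy
import json

import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# Tiny illustrative sample; a real run would load STSBenchmark pairs with gold scores (0-5).
pairs = [
    ("A man is playing a guitar.", "A person plays a guitar.", 4.8),
    ("A man is playing a guitar.", "A woman is cutting vegetables.", 0.4),
    ("Two dogs run across a field.", "Dogs are running outside.", 4.2),
]

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

sents1, sents2, gold = zip(*pairs)
emb1 = model.encode(list(sents1))
emb2 = model.encode(list(sents2))

# Cosine similarity between the two embeddings of each pair.
cos = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)

# Spearman rank correlation against the gold scores,
# the metric usually reported for STSBenchmark.
spearman, _ = spearmanr(cos, gold)

# Standardized JSON report, one per model/dataset combination (layout assumed).
report = {"model": model_name, "dataset": "toy-sts-sample", "spearman": float(spearman)}
print(json.dumps(report, indent=2))
```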

Models can be sourced from:

Main Goal: Extension to other evaluation datasets

The main goal of this framework is to help evaluate Language Models on other context-specific tasks.
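As an illustration of what such a context-specific dataset could look like, the sketch below reads sentence pairs and similarity scores from a TSV file. The file name, column order, and the 0-5 score range are assumptions patterned on STSBenchmark, not a format the framework prescribes; once loaded, the data can be scored with the same loop sketched above.

```python
# Hypothetical task-specific dataset in an STS-like TSV layout:
#   score <TAB> sentence1 <TAB> sentence2
# File name and the 0-5 score convention are assumptions mirroring STSBenchmark.
import csv

def load_custom_sts(path="my_domain_sts.tsv"):
    """Return parallel lists of sentences and gold similarity scores."""
    sents1, sents2, scores = [], [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            score, s1, s2 = row[0], row[1], row[2]
            scores.append(float(score))  # gold similarity, e.g. on a 0-5 scale
            sents1.append(s1)
            sents2.append(s2)
    return sents1, sents2, scores
```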

Evaluation Results on current datasets

Check this notebook for the current results of evaluating several LMs on the standard datasets and on the context-specific example. These results closely resemble the ones published on PapersWithCode and in SBERT Pretrained Models.
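To compare several models side by side, one option is to collect the JSON reports produced by each run and tabulate the correlations. The sketch below assumes the hypothetical report layout used in the first snippet and a results/ directory of one JSON file per run; neither is guaranteed by the framework.

```python
# Collect per-model JSON reports (layout assumed from the sketch above) into one table.
import glob
import json

rows = []
for path in glob.glob("results/*.json"):
    with open(path, encoding="utf-8") as f:
        r = json.load(f)
    rows.append((r["model"], r["dataset"], r["spearman"]))

# Print models with the highest Spearman correlation first.
for model, dataset, spearman in sorted(rows, key=lambda x: -x[2]):
    print(f"{model:60s} {dataset:20s} {spearman:.3f}")
```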

STSBenchmark

STS-es: Spanish-to-Spanish Semantic Textual Similarity
