Fake news, natural language processing and ensemble models.

BA in Applied Mathematics.

Graduated 2022, special mention.


In this project, several classification models were built using machine learning and deep learning architectures. The models classify Spanish-language newspaper articles and social media posts whose subjects vary but mainly concern Latin American national politics and the current coronavirus crisis. Their purpose is to detect whether a text constitutes fake, true or misleading news. Because the models take text as input, Natural Language Processing was studied as well.

Example of the bidirectional attention mechanism used in BERT. Taken from Tonantzin's dissertation.


In this work we tested the native capabilities of transformer-based models on classification tasks. Additionally, we tested the feature representation of the transformer model (BERT) by coupling its encoding layer with ensemble models based on decision trees, since ensemble models such as Random Forests and Gradient Boosting Machines are state of the art on tabular data. Because this is naturally an unbalanced classification task, further questions arise about bias towards specific news outlets and about how well our predictive models generalize. A publication will surely follow.
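The coupling of a BERT encoder with a tree ensemble can be sketched roughly as follows. This is a minimal illustration and not the code used in the project. It assumes the Hugging Face transformers library, PyTorch and scikit-learn. The checkpoint name, the use of the [CLS] vector as the text representation, the helper names and the hyperparameters are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def embed_texts(texts, model_name="bert-base-multilingual-cased", max_length=128):
    """Return one fixed-size vector per text, taken from a frozen BERT encoder.

    Assumption: the [CLS] token's last hidden state is used as the feature
    vector; the project may have pooled the encoder output differently.
    """
    # Imported lazily so the classifier helper below works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():  # the encoder is only used as a feature extractor
        output = model(**batch)
    # Shape: (number of texts, hidden size); position 0 is the [CLS] token.
    return output.last_hidden_state[:, 0, :].cpu().numpy()


def fit_forest_on_embeddings(features, labels, seed=0):
    """Fit a random forest on precomputed embeddings, treated as tabular data.

    class_weight="balanced" reweights classes inversely to their frequency,
    one simple way to account for an unbalanced label distribution.
    """
    forest = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=seed
    )
    forest.fit(np.asarray(features), np.asarray(labels))
    return forest
```

A typical use would be `forest = fit_forest_on_embeddings(embed_texts(train_texts), train_labels)` followed by `forest.predict(embed_texts(new_texts))`, where `train_texts` and `train_labels` are hypothetical names for the labelled corpus. The first call to `embed_texts` downloads the pretrained weights. A boosted tree model could be swapped in for the random forest without changing the embedding step.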

Machine learning, natural language processing, transformers, ensemble models.