Text Classification with Reuters

Huda
4 min read · Sep 4, 2022
Photo by Markus Winkler on Pexels

In the machine learning world, almost 75% of problems revolve around classification, which makes it one of the most popular machine learning use cases. Classification can be divided into two major types: binary classification and multi-class classification. In binary classification, we predict one of two classes, e.g. Yes/No, True/False, or Spam/Not Spam. In multi-class classification, we predict the output from a larger set of options, e.g. recognizing faces or classifying different fruits, cars, or plants. Reuters-21578 is a well-known dataset for this kind of problem: its news articles are labeled with many different topic categories.

Reuters-21578

The Reuters-21578 dataset is a collection of news articles that appeared on the Reuters financial newswire service in 1987. The data is distributed as 22 data files, an SGML DTD file describing the data file format, and six files describing the categories used to index the data. It is a very popular dataset for text classification. Today we will implement ML classification models to identify the news articles of type “earn” and distinguish them from all other articles, which turns the task into a binary one: “earn” vs. everything else.
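To make the “earn” vs. everything else setup concrete, here is a minimal sketch of how documents in the SGML layout could be turned into binary labels. It uses only Python's standard library. The sample text is hand-written for illustration and is not taken from the dataset, and the function name parse_documents and the is_earn field are hypothetical names chosen for this sketch. Real files will likely need extra cleaning (character encoding, entity references) that is not handled here.

```python
import re

# Tiny hand-written sample in the Reuters-21578 SGML layout.
# These are NOT real articles; they only mimic the tag structure.
SAMPLE_SGML = """
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>earn</D></TOPICS>
<TEXT><TITLE>ACME CORP 1ST QTR NET</TITLE>
<BODY>Shr 12 cts vs 10 cts. Net 1.2 mln vs 1.0 mln.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="2">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<TEXT><TITLE>WHEAT EXPORTS RISE</TITLE>
<BODY>Wheat shipments rose sharply last week.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TEST" NEWID="3">
<TOPICS></TOPICS>
<TEXT><TITLE>MARKET SUMMARY</TITLE>
<BODY>Trading was quiet.</BODY></TEXT>
</REUTERS>
"""

# One pattern per tag we care about; DOTALL lets "." span line breaks.
DOC_RE = re.compile(r"<REUTERS\b[^>]*>(.*?)</REUTERS>", re.DOTALL)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
TOPIC_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_documents(sgml):
    """Return one dict per <REUTERS> element with topics, body, and a 0/1 label."""
    docs = []
    for match in DOC_RE.finditer(sgml):
        chunk = match.group(1)
        topics_match = TOPICS_RE.search(chunk)
        topics = TOPIC_RE.findall(topics_match.group(1)) if topics_match else []
        body_match = BODY_RE.search(chunk)
        body = body_match.group(1).strip() if body_match else ""
        docs.append({
            "topics": topics,
            "body": body,
            # Binary target: 1 if "earn" is among the topics, else 0.
            "is_earn": int("earn" in topics),
        })
    return docs


docs = parse_documents(SAMPLE_SGML)
labels = [d["is_earn"] for d in docs]
print(labels)  # [1, 0, 0]
```

A document can carry several topics or none at all, so checking whether “earn” appears in the topic list (rather than comparing against a single label) keeps the binary target well defined for every article.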

Data Formatting

The Reuters-21578 collection is distributed in 22 files…

Huda

Data Scientist with recent experience in data acquisition and data modeling, statistical analysis, machine learning, deep learning and NLP