NLP has certainly been a domain that has fascinated me for a long time, as language is one of the most complex concepts humans have come up with. It is only through language that we are able to convey our emotions to others, through prose, poems, or even ranting to a friend.

One problem that we often encounter while building NLP systems is the lack of recent data, due to copyright and other laws. We often resort to the famous Gutenberg or Brown corpora, but their text does not really represent the modern language that we speak.

About HuggingFace🤗

HuggingFace🤗 is one of the most robust AI communities out there, with a wide range of solutions from models to datasets, built on top of open-source principles. Much of the recent leap in NLP has to be attributed to Hugging Face, who created the open-source Transformers library, which puts state-of-the-art transformer models within everyone's reach.

Along with open-source implementations of various transformer models like BERT, RoBERTa, etc., Hugging Face also provides a large catalog of over a thousand public datasets, which lets us focus on model training rather than data collection. To make things even easier for programmers, Hugging Face offers a library called datasets, from which one can directly access and use any of these datasets.

Datasets Library

The datasets library can be installed in any Python environment with pip using the command below.

pip install datasets

Once the installation is complete, we can verify that it was done right and check the version using the Python code below.

import datasets
print(datasets.__version__)

Let us now look at the various datasets present in the library. Using the code below, you can list all the datasets that are available to us; at the time of writing this post, the number stood at 1510! There is a constant effort to add more datasets to this already humongous library.

from datasets import list_datasets

# Use a name that does not shadow the datasets module imported earlier
datasets_list = list_datasets()
print("Total number of datasets in the Datasets Library is", len(datasets_list))
print(datasets_list)

Let us now load a dataset using the Datasets Library. In this example we are going to load the SQuAD dataset and list out its details. Any other dataset can of course be loaded instead of SQuAD, but it must be noted that for datasets with multiple subsets we also have to specify the configuration we want.

from datasets import list_datasets, load_dataset

# Look up SQuAD's metadata in the detailed dataset listing
datasets_list = list_datasets()
squad = list_datasets(with_details=True)[datasets_list.index('squad')]
print(squad.__dict__)

# Download (and cache) the SQuAD dataset itself
squad_dataset = load_dataset('squad')
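
Once loaded, squad_dataset behaves like a dictionary of splits that we can inspect directly. The snippet below is a minimal sketch; the glue/mrpc pair is used purely as an example of a dataset that requires a configuration name.

# Inspect the available splits and one training example
print(squad_dataset)
print(squad_dataset['train'][0])

# For multi-configuration datasets, pass the configuration name
# as the second argument, e.g. the MRPC subset of GLUE
mrpc = load_dataset('glue', 'mrpc')
print(mrpc['train'].features)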

Advantages of Datasets Library

So why should one use the Datasets Library when the datasets it provides are public and can be downloaded from the internet? Besides the ease with which we can access the datasets, Hugging Face claims the following four advantages:

  • It frees users from RAM limitations, as all the datasets are efficiently memory-mapped.
  • Smart caching avoids long waits for repeated data processing.
  • It offers a lightweight, fast, and transparent Pythonic API.
  • It interoperates with common libraries like NumPy, pandas, and TensorFlow (see the sketch after this list).
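
As a quick illustration of that last point, here is a minimal sketch that switches the SQuAD training split between pandas and NumPy output formats using set_format; it assumes the SQuAD download from the previous section is already cached.

from datasets import load_dataset

# Reuses the copy cached by the earlier load_dataset('squad') call
train = load_dataset('squad')['train']

# Ask the dataset to return pandas DataFrames when indexed
train.set_format(type='pandas')
print(train[:5])

# Or NumPy arrays, restricted to a couple of columns
train.set_format(type='numpy', columns=['question', 'context'])
print(type(train[:5]['question']))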

These are among the reasons the Datasets Library may soon become the standard for NLP corpora and even for other datasets in the ML domain. Its ease of use and standardized format help engineers focus on building new and fascinating models rather than agonizing over gathering data and making it accessible across various formats.