Wednesday, 3 August 2022

Populate a Corpus from a List of URLs

GATE provides support for loading numerous different document formats, as well as a number of ways populating corpora. Until recently, however, we've not offered any way of populating a corpus from a simple list of URLs. Worse, even though it's now quite easy to do this in GATE it's unlikely you would come across the option by accident.

The support for this is actually hidden away inside the "Format CSV" plugin (you'll need to use version 8.7 or above) and in GATE Developer is exposed through the "Populate from CSV file..." option in the context menu of a corpus.
In this screenshot I've configured the populator ready to build a corpus from a simple text file with one URL per line. The important settings are:
  • Column Separator is set to "\t". This means we are using a tab character as the column separator. We do this simply as you can't have a tab in a URL whereas you could have a URL containing a comma and we don't want our URLs split in half.
  • Document Content is in column 0. We always count columns (or almost anything) starting from 0, so this just ensures we use the URL as the document content.
  • Create one document per row is selected. The important option isn't available if we don't first select this as it makes no sense to try and load multiple URLs into the same GATE document.
  • Cell contains document URL is selected. This is the new feature which makes this trick possible. Essentially it looks at the contents of a cell and if it can be interpreted as a URL then it creates a document from the contents of the URL, otherwise it uses the cell content as normal to build the document.
Once configured it's simply a case of selecting your text file, one URL per line, and hitting the OK button. Be aware that there is currently no rate limiting so be careful if you are listing a lot of URLs from a single domain etc. You may also want to combine this with the cookie trick from the previous post to ensure you get the correct content from each of the URLs.

Of course while this post has been about how to populate a corpus from a simple list of URLs you can use more complex CSV or TSV files which happen to contain URLs in one column. In that case the details from the other columns will be added as document features.

No comments:

Post a Comment