Wednesday 3 August 2022

Populate a Corpus from a List of URLs

GATE provides support for loading numerous different document formats, as well as a number of ways populating corpora. Until recently, however, we've not offered any way of populating a corpus from a simple list of URLs. Worse, even though it's now quite easy to do this in GATE it's unlikely you would come across the option by accident.

The support for this is actually hidden away inside the "Format CSV" plugin (you'll need to use version 8.7 or above) and in GATE Developer is exposed through the "Populate from CSV file..." option in the context menu of a corpus.
In this screenshot I've configured the populator ready to build a corpus from a simple text file with one URL per line. The important settings are:
  • Column Separator is set to "\t". This means we are using a tab character as the column separator. We do this simply as you can't have a tab in a URL whereas you could have a URL containing a comma and we don't want our URLs split in half.
  • Document Content is in column 0. We always count columns (or almost anything) starting from 0, so this just ensures we use the URL as the document content.
  • Create one document per row is selected. The important option isn't available if we don't first select this as it makes no sense to try and load multiple URLs into the same GATE document.
  • Cell contains document URL is selected. This is the new feature which makes this trick possible. Essentially it looks at the contents of a cell and if it can be interpreted as a URL then it creates a document from the contents of the URL, otherwise it uses the cell content as normal to build the document.
Once configured it's simply a case of selecting your text file, one URL per line, and hitting the OK button. Be aware that there is currently no rate limiting so be careful if you are listing a lot of URLs from a single domain etc. You may also want to combine this with the cookie trick from the previous post to ensure you get the correct content from each of the URLs.

Of course while this post has been about how to populate a corpus from a simple list of URLs you can use more complex CSV or TSV files which happen to contain URLs in one column. In that case the details from the other columns will be added as document features.

Wednesday 2 March 2022

GATE and the Cookie Jar

One of the useful features of GATE is that documents can be loaded directly form the web as well as from local files. This is specifically useful for pages which update frequently which you might want to process repeatedly. While using this feature recently we came across some pages that refused to load correctly. The page loaded fine in a web browser but returned a 403 unauthorised response when accessed via GATE.

After a bit of debugging it turned out that this issue was related to cookies. The specific URL we were trying to load went through a number of redirects before ending up at the final page. The problem was that the first redirect set a cookie, and that needed to be present for the further redirects to work. By default Java, and hence GATE, doesn't maintain cookies across requests, as each connection is handled independently.

If you are using GATE in an embedded context, then it is trivial to add support for cookies using the default Java cookie handler. This is a JVM level setting so once configured in your own code, all requests made by GATE to load documents will also gain support for handling cookies. The entire solution is the following single line of code:

java.net.CookieHandler.setDefault(new java.net.CookieManager());


The problem we faced though, was that we wanted to be able to load documents that required cookies from within GATE Developer and that required a little more thought. Whilst we could have just added the code to GATE there are a number of reasons not to (details of which are outside the scope of this blog post) and I wanted to make it easier for all existing GATE users to be able to use cookies without needing to upgrade. The answer is the rather versatile Groovy plugin.

If you load the Groovy plugin into GATE Developer you can then access the Groovy Console from within the tools menu. Simply pasting that single line of code into the console and executing it is enough to add the cookie support within that instance of GATE. It's slightly annoying that it won't persist across multiple instances of GATE, but as it's such a simple trick hopefully it's easy enough to apply when needed.

Monday 28 February 2022

How green is your recipe? Using GATE to calculate the environmental impact of recipes

 


The calculation of environmental impacts from recipes remains a barrier to effective uptake of sustainable diets.  In a recent project funded by Alpro, led by Dr Christian Reynolds from the Centre for Food Policy at City University London, we explored digitised recipe texts from websites in English, Dutch and German. We study recipes rather than individual ingredients because this is how people typically think about environmental impact and diet.

Recipes are hard to process because they use different weights and measures, and sometimes quite vague or obscure terms (e.g. "a pinch of salt", "a handful of lettuce"). Together with our project partner Text Mining Solutions, we used GATE to develop customised tools to automatically extract ingredients, quantities and units from 220,168 indexed recipes, and to match these to a food environmental database of 4500 ingredients (using the classification system FoodEx2). This database provided Land Use, GHG emissions, Eutrophying Emissions, Stress-Weighted Water Use, and Freshwater Withdrawals for each ingredient.

Nutrition information was sourced from the USDA FoodData Central (McKillop et al., 2021) and McCance and Widdowson's Composition of Foods Integrated Database (Public Health England, 2015). Environmental and Nutrition information was matched to two classification systems (FoodEx2, containing 4,500 ingredients, and USDA Nutrient Database, containing 2,484 ingredients). This allowed us to calculate these impacts at the mean, 5% and 95% confidence level per recipe and per portion, enabling us to explore the environmental impacts of vegan, vegetarian and non-vegetarian (omnivore) recipes if we were to cook these recipes using contemporary ingredients.

To validate the tool, we manually calculated the impacts of 50 recipes from 4 websites: BBC Good Food, Albert Heijn/Allerhande, AllRecipes.com and Kochbar, and compared these with the results from our tool. 

We created a website where you can enter a recipe and get back the calculation for the recipe and per portion (with confidence intervals). The image below shows a sample screenshot.





We presented some of our findings as a poster at the Livestock, Environment and People (LEAP) conference in December 2021. You can find more examples of our analysis and results there.

It's interesting to see how the recipes from the different countries, as well as recipes with different protein sources, lead to different median CO2 footprints. Below we see a chart showing the median GHGE per portion in recipes from different protein sources (e.g. those containing beef, those containing tofu) in omnivore, vegetarian, and vegan recipes. Unsurprisingly, the dishes containing meat have higher GHGE values on the whole, though we do find variations within individual recipes. We were particularly excited to find a recipe for chocolate cake that "beat" a salad in terms of low GHGE!

When we compared the different datasets (depicting recipes from different European countries) in terms of median GHGE per protein source, we found that Kochbar (German) recipes typically fared the worst, followed by the BBC Good Food recipes (British), and Albert Heijn (Dutch) faring much better.

The work is now continuing with the development of a dashboard enabling additional visualisations and further analysis to be produced.