Big Data

Using Data where you can find it…

In many developing countries, there is a general shortage of good data to use. But sometimes ‘big data’ can come to the rescue. Or even just ‘medium-to-large data’ …

Simple Example – In demand skills for job seekers

In this example, we wanted to know the top skills that employers were seeking. The Employment and Skills Survey data were not great, there were no private employment agencies we could run focus groups with, and the Public Employment Service had only a very small number of vacancies registered. None of these could give us an idea of the skills employers were looking for.

But there was an email list where job vacancies were posted. We recorded all the data for three months in a spreadsheet, separating the job requirements into different columns (education level, education sector, languages required, etc.). We also had one column where we dumped the job description, unedited. We then took this raw text from 150 job adverts and created a tag cloud.

It's imperfect, but it gives job-seekers an idea of what employers are looking for in terms of skills (at least for the subset of jobs that are advertised online).

Additionally, it was possible to do all of this using just a spreadsheet and TagCrowd, a free web-based service with a Creative Commons licence (I bought the creator, Daniel Steinbock, a coffee though :).
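For anyone who would rather script the word-counting step, here is a minimal sketch in Python. It is not what we used (that was a spreadsheet and TagCrowd); it only illustrates the underlying idea of counting how often each word appears in the raw job descriptions. The adverts, the stop-word list and the function name top_terms are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for",
              "with", "is", "are", "be", "must", "have"}

def top_terms(descriptions, n=10):
    """Return the n most frequent words across all job descriptions."""
    counts = Counter()
    for text in descriptions:
        # Lower-case the text and keep runs of letters only.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Made-up adverts, standing in for the unedited job-description column.
adverts = [
    "Accountant with good English and Excel skills",
    "Driver needed, English an advantage",
    "Project officer: English, Excel and report writing",
]

for word, count in top_terms(adverts, 5):
    print(word, count)
```

A tag cloud is essentially this table drawn with font size proportional to the count, so the output can also be pasted back into a spreadsheet and charted there.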


More complex example – comparing price and quality of accommodation across Laos

We had some administrative data on the number of hotels and guest-houses in Laos, but it was incomplete, and we wanted a better idea of the cost and quality of accommodation.

Many hotels and guest-houses had started advertising online (close to 50% of the number we had from the administrative data on registrations). We made a web scraper to go through each entry for the whole country and gather price, quality and other data. We also got the latitude and longitude of each establishment, and thus were able to plot them on a map and compare price and quality across provinces.
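Once the scraped entries are in a table, the comparison across provinces is a simple group-and-summarise step. The sketch below assumes one record per establishment; the field names (province, price_usd, rating, lat, lon) and all the values are invented for illustration and are not our actual data or code.

```python
from collections import defaultdict
from statistics import median

# Made-up records; field names and values are assumptions for illustration.
listings = [
    {"name": "Hotel A", "province": "Vientiane", "price_usd": 40.0, "rating": 7.5, "lat": 17.97, "lon": 102.63},
    {"name": "Hotel B", "province": "Vientiane", "price_usd": 60.0, "rating": 8.0, "lat": 17.96, "lon": 102.61},
    {"name": "Guest-house C", "province": "Vientiane", "price_usd": 25.0, "rating": 6.5, "lat": 17.98, "lon": 102.62},
    {"name": "Hotel D", "province": "Luang Prabang", "price_usd": 55.0, "rating": 8.0, "lat": 19.89, "lon": 102.13},
    {"name": "Guest-house E", "province": "Luang Prabang", "price_usd": 35.0, "rating": 7.0, "lat": 19.88, "lon": 102.14},
]

def summarise_by_province(rows):
    """Group establishments by province and summarise price and rating."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["province"]].append(row)

    summary = {}
    for province, items in grouped.items():
        summary[province] = {
            "count": len(items),
            # Medians are less sensitive than means to one very expensive hotel.
            "median_price": median(r["price_usd"] for r in items),
            "median_rating": median(r["rating"] for r in items),
        }
    return summary

summary = summarise_by_province(listings)
for province, stats in summary.items():
    print(province, stats)
```

The latitude and longitude fields are not used in the summary itself; they are what allows each establishment to be placed on a map.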

The quality data are based on users’ ratings across eight categories (value for money, breakfast, cleanliness, etc.), and this was something we could find nowhere else – customers’ views of the product. Unfortunately, most of the ratings bunch around 6 to 8 out of 10. This is typical of survey data, where users rarely use the full scale to rate their experience. But the ratings are still useful for identifying outliers – where something is outstandingly good or bad.
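One simple way to pick out such outliers is to flag any establishment whose rating sits far from the average, measured in standard deviations. The sketch below is only one possible approach, not the method we used; the 1.5 standard deviation cut-off is an arbitrary assumption, and the names and ratings are made up.

```python
from statistics import mean, pstdev

# Made-up overall ratings out of 10; most bunch around 7, as described above.
ratings = {
    "Hotel A": 7.0,
    "Hotel B": 7.2,
    "Guest-house C": 6.8,
    "Hotel D": 7.5,
    "Guest-house E": 6.9,
    "Hotel F": 7.1,
    "Hotel G": 9.6,
    "Guest-house H": 3.9,
}

def flag_outliers(scores, threshold=1.5):
    """Return the names whose score is more than `threshold` standard
    deviations above or below the mean (threshold is an assumed cut-off)."""
    values = list(scores.values())
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # Every score is identical, so nothing stands out.
        return []
    return [name for name, score in scores.items()
            if abs(score - centre) / spread > threshold]

outliers = flag_outliers(ratings)
print(outliers)
```

With ratings this tightly bunched, even a modest cut-off separates the outstandingly good and bad establishments from the rest.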

The price data were also of great interest to both the government (how closely do these prices correlate with reported earnings and tax paid?) and the private sector (how do I compare to my competitors on price?).

Once again, the data were all freely available, but presenting them in a more accessible format made them more useful.