By Ian Mulvany, Head of Product Innovation, SAGE Publishing @IanMulvany
Photos © Katie Metzler 2016
A DataDive is a focused hacking weekend that works with a selected non-profit to help them improve their understanding of their own data. Significant effort goes into the preparation of the data before the event to make it as productive as possible.
This event worked with the non-profits WeFarm and OneWorld. WeFarm operates a network that farmers in the developing world can access via text message to ask questions about farming. The questions go to about 40 other people in the network who can offer answers. Their data for this event consisted of the entire messaging history of the service. OneWorld have a number of initiatives and the data for this hackathon was about reporting from election monitoring stations. Bringing transparency and accountability to the election process is vitally important, but it is an activity that can also risk sparking civil unrest if the data is misinterpreted or abused.
Family reasons meant that I was not able to attend on the Saturday, but I was keen to see how the event was going, so I turned up at 10am on Sunday hoping to find a team and a project to help out with. There were already plenty of people there, continuing the work that they had started the day before.
One team working on the WeFarm data had built an LDA topic model of the users, and used this to categorize the messages. The hypothesis was that if message content could be classified, it would help WeFarm route questions to the users most likely to be able to answer them. They had the classification model, and I took on the task of manually verifying whether the 10 topics that were generated made sense. I wrote a couple of short scripts to split the entire message corpus into topic-based batches, and then to extract 50 random messages from each batch. I put these messages into a spreadsheet and started to manually tag whether each one seemed to fit what we thought its topic might be about (based on looking at a sample vector of words that characterized the topic). The real question was whether LDA could give any insight at all, or whether all the messages would be too similar. Remember, this was a topic model built in just a few hours the day before, with few refinements. It turned out that some of the clusters were not very discriminating, but at least four of them were highly discriminating and allowed us to identify crop-based questions, disease-based questions, and people who had answered “yes”, as well as draw out text that was not in English. That showed a lot of promise for this approach. I was also very happy to have been able to help out in such a short period of time.
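For anyone wanting to try a similar spot check, the batching and sampling step might look roughly like the sketch below. This is not the script written at the event. It assumes each message has already been assigned a topic label by the model, and the function names (`batch_by_topic`, `sample_per_topic`, `to_review_rows`) are made up for illustration.

```python
import random
from collections import defaultdict


def batch_by_topic(messages):
    """Group (topic_id, text) pairs into a dict of topic_id -> list of texts."""
    batches = defaultdict(list)
    for topic_id, text in messages:
        batches[topic_id].append(text)
    return dict(batches)


def sample_per_topic(batches, n=50, seed=0):
    """Draw up to n messages at random (without replacement) from each batch."""
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    return {
        topic: rng.sample(texts, min(n, len(texts)))
        for topic, texts in batches.items()
    }


def to_review_rows(samples):
    """Flatten the samples into spreadsheet rows with an empty column for manual tags."""
    return [
        (topic, text, "")  # third column is filled in by the human reviewer
        for topic in sorted(samples)
        for text in samples[topic]
    ]
```

The rows returned by `to_review_rows` can be written out with Python's `csv` module and opened in a spreadsheet for tagging. Sampling the same number of messages from every topic keeps the manual effort fixed per topic, however uneven the topic sizes are.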
About sixty data scientists participated, with an even split between the two projects. All of the work was presented at the end of the Sunday session. I couldn’t stay to see all of the results for the OneWorld project, but I saw all of the results for the WeFarm project. These included the following:
- Good work identifying usage patterns that lead to some users abandoning the service (mostly stemming from them not understanding it well). By identifying these patterns, WeFarm might be able to intervene to improve the experience for these users.
- A big cleanup of the data, allowing fine-grained visualization of users by demographics and location.
- There was one null result in which messages were analyzed to see if any specific message characteristic would lead to an increased likelihood of getting an answer; no patterns were found.
- Some work on creating a network of users (you might be able to use it to identify key participants).
- Creation of a topic model and an engine for classifying both messages and users, which could be highly valuable for routing messages to the members with the highest chance of answering a question.
Most of the work was packaged in a way that would be of use to the non-profits, as it came with documentation.
I have been to a lot of hackathons, and I’ve organized a fair few too. Even though I was only at this event for a few hours on Sunday, I can say hands down that it was the best-organized hackathon that I’ve ever been to.
The key factor in its success was the groundwork that the DataKind UK “data ambassadors” did in the weeks leading up to the hackathon. They worked closely with the non-profits to prepare datasets that were ready to be worked on from the start, and they helped define questions that were important to the non-profits. I saw a lot of discipline from the teams and an amazing ability not to get sidetracked! I thought the results were phenomenal, and the reaction from the non-profits’ representatives was extremely positive.
DataKind UK is always looking for more volunteers, so if you have data science skills and you want to use them for social good, get in touch with them! And if you want to hear more about what SAGE is doing in the area of big data and computational social science, sign up to our monthly Big Data and Social Research Newsletter. You can view the first one here. If you have any further questions, please do get in touch via email: firstname.lastname@example.org