In this elections series, we'll explore various aspects of the 2016 Philippine National Elections, from fraud detection to the differences in how our country votes. In this first instalment, we learn about election fingerprints and how they may be used to detect fraud in the form of ballot stuffing or vote padding.

Election data geekery

For the first time, the data geeks have finally gotten some love. Highly detailed election results, broken down all the way to the precinct level, have been published online by the Commission on Elections (COMELEC) as well as poll watchers and the media.

There are many things I imagine we could do with this data, but one of the most popular uses is to assess the risk of election irregularities. For the first few parts of this series, we'll try to do exactly that, carefully and scientifically.

Going back to the methodology highlighted in a 2014 post, part 1 of this series will focus on detecting one kind of election irregularity: vote padding, defined as the adding of fraudulent votes into the count to increase a candidate's probability of a win, or, conversely, the shaving of legitimate votes from the count to decrease a candidate's probability of a win.

Statistical detection of vote padding

Vote padding, sometimes called ballot stuffing, is a form of electoral fraud that involves adding fake votes or shaving legitimate votes to favor a particular candidate. This is not detectable in the final aggregated election results. However, if vote padding occurs only in a subset of jurisdictions, it can change the distribution of voter turnout and vote share in a way that allows detection from granular election data.

This method was demonstrated in this PNAS paper¹, whose authors showed that Russian and Ugandan elections, known to be marred by electoral fraud, contained "election fingerprints" that were smeared towards the top left:

Let's think about this: what happens when fake votes are added to the count?

Increase in voter turnout - because more votes are counted than were actually cast, the percentage of registered voters recorded as having voted rises in the affected cities/municipalities.
Increase in candidate vote share - the favored candidate will see an increase in the percentage of votes won.

When you have a significant proportion of areas that have this high turnout, high vote share combination, there is an increased risk that electoral irregularities have occurred.
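To make this concrete, here is a minimal sketch of how such a fingerprint could be computed. It assumes you have, for each city/municipality, the number of registered voters, the number of ballots cast, and the votes for one candidate. The function name `election_fingerprint`, the 10-by-10 grid, the 90% corner threshold, and the toy numbers are all illustrative assumptions of mine and are not taken from the published analysis.

```python
import numpy as np

def election_fingerprint(registered, ballots_cast, candidate_votes, bins=10):
    """Count areas in each (turnout, vote share) cell of a grid on [0, 1] x [0, 1].

    Rows index turnout bins and columns index vote-share bins.
    Areas with turnout or share outside [0, 1] are ignored by the histogram.
    """
    registered = np.asarray(registered, dtype=float)
    ballots_cast = np.asarray(ballots_cast, dtype=float)
    candidate_votes = np.asarray(candidate_votes, dtype=float)

    turnout = ballots_cast / registered          # share of registered voters who voted
    vote_share = candidate_votes / ballots_cast  # candidate's share of ballots cast

    counts, _, _ = np.histogram2d(
        turnout, vote_share, bins=bins, range=[[0, 1], [0, 1]]
    )
    return counts

# Toy data (hypothetical): three ordinary areas and two with very high
# turnout combined with a very high vote share for the candidate.
registered = [1000, 1200, 800, 1500, 900]
ballots_cast = [750, 900, 648, 1470, 891]
candidate_votes = [315, 405, 272, 1400, 860]

fp = election_fingerprint(registered, ballots_cast, candidate_votes)

# Fraction of areas in the cell where both turnout and vote share exceed 90%.
corner_fraction = fp[9:, 9:].sum() / fp.sum()
print(corner_fraction)
```

A large share of areas in that high-turnout, high-vote-share cell is the "smearing" pattern described above. With real data you would plot `fp` as a heat map rather than read off a single number.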

If we replicate this analysis for our elections, we find that there isn't anything that jumps out immediately. You can explore the plots in the following section:

For the presidential race, nothing seems to be out of order, as most of the fingerprints are concentrated around a central mass and with minimal "smearing." For the vice presidential race, you can see a bit of bimodality in terms of the winning percentage for MARCOS, BONGBONG, but the voter turnout is not high enough to cause "smearing." This is a symptom of a polarizing candidate: some areas voted heavily for the candidate, and some did not at all. For the senatorial race, nothing is out of order.

What if the fraud was not as widespread, and it is not immediately detectable by a simple visual inspection? Perhaps, constructing a single index of vote padding risk can allow us to tease out the subtle differences.

Creating a vote padding risk score

The authors of the PNAS paper¹ have devised a simple logarithmic transformation for the vote counts. For elections with minimal irregularity, the distribution of this transformed variable is expected to be approximately normal (i.e. bell-shaped). Details of this transformation are outlined in the paper. As expected, logarithmic vote counts from the Russian and Ugandan elections show highly negative skewness and highly positive kurtosis, inconsistent with a normal distribution, which has skewness and excess kurtosis of 0.
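As a rough illustration, the sketch below computes skewness and excess kurtosis for a transformed vote variable. I am assuming the transformation takes the form log((N - W) / W), where N is the number of valid votes and W the candidate's votes in an area; please check the paper for the exact definition. The function names and the simulated data are hypothetical and exist only for this example.

```python
import numpy as np

def log_vote_rate(valid_votes, candidate_votes):
    """Assumed transformation: log((N - W) / W) for each area."""
    n = np.asarray(valid_votes, dtype=float)
    w = np.asarray(candidate_votes, dtype=float)
    # The log is undefined at 0% or 100% vote share, so drop those areas.
    keep = (w > 0) & (w < n)
    return np.log((n[keep] - w[keep]) / w[keep])

def skewness_and_excess_kurtosis(x):
    """Moment-based skewness and excess kurtosis (both 0 for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()  # standardize using the population standard deviation
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0

# Simulated "clean" election (hypothetical): 2,000 areas with 1,000 valid
# votes each, with vote shares chosen so the transformed variable is normal.
rng = np.random.default_rng(0)
valid = np.full(2000, 1000.0)
clean_share = 1.0 / (1.0 + np.exp(rng.normal(0.0, 0.3, size=2000)))
clean_votes = np.rint(clean_share * valid)

# Toy "padded" election: the candidate's share is pushed to 95% in 100 areas.
padded_votes = clean_votes.copy()
padded_votes[:100] = 950.0

clean_skew, clean_kurt = skewness_and_excess_kurtosis(log_vote_rate(valid, clean_votes))
padded_skew, padded_kurt = skewness_and_excess_kurtosis(log_vote_rate(valid, padded_votes))

print("clean: ", clean_skew, clean_kurt)
print("padded:", padded_skew, padded_kurt)
```

In this toy setup the clean election yields values near 0 for both measures, while the padded one shows clearly negative skewness and positive excess kurtosis. Real padding would also inflate the number of valid votes, which this simplified simulation ignores.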

So what does it mean in this case? When we compute the skewness and kurtosis of the logarithmic vote counts, the further they are from 0 (negative skewness and positive kurtosis), the higher the risk of vote padding. Computing these values for all national-level candidates, we can construct the following chart:

How to read this chart: The closer the values are to the top left corner, the higher the risk of vote padding.

Apart from a few party list and senatorial candidates that have understandably strong vote shares in one particular group of cities/municipalities but fall extremely flat in others (BALIGOD, LEVITO, ALONA, KGB, ANG KASANGGA), there seem to be no particular candidates that stand out.

What does this mean?

Let me be clear: This does not mean that there was no electoral fraud. It simply means that the risk of fraud through this particular form, vote padding or ballot stuffing, appears to be low. Remember, data cannot serve as definitive proof; it can only guide investigation and quantify risk. I highly encourage you to go through these important caveats.

Interactive: View the underlying data

If you're interested in finding out more (and potentially sniffing out vote padding for yourself), I highly encourage you to play around with this small Shiny widget. If it does not respond, it might mean that there is too much load on the server. Hover over the points to see more information about how each city/municipality voted.

Important caveats

I've used careful language in presenting this analysis, and that's mainly to avoid misinterpretation; these elections have been very heated, both between candidates and among the general public. I have to make certain things clear:

Statistics can neither prove nor disprove fraud. At most, they can assess the risk of fraud and guide investigation.
The results of an analysis should be taken in the context of its scope, limitations, and assumptions. Sometimes, these are more important than the findings themselves.
The fact that this particular analysis shows (or does not show) signs of electoral irregularity does not mean that fraud was (or was not) committed. Each analysis is designed to detect a particular kind of fraud only.

Data notes

The data was scraped from the COMELEC's public election results page, as of May 25, 2016. At that time, 96.69% of election returns had been transmitted, and 99.93% of city/municipality certificates of canvass had been received. For a full list of cities and municipalities that have no results, see here.
Data, code, and computations are available on GitHub.

¹ Klimek, Yegorov, Hanel, Thurner (2012). Statistical detection of systematic election irregularities. Proceedings of the National Academy of Sciences of the United States of America 109(41).

Troy is a data nerd who wants you to give numbers a chance; he is passionate about data-driven decision making in businesses and government. City Operations at Uber and data science blogger at tjpalanca.com. The opinions expressed herein are those of the author only and not of his employer nor of GMA News Online.