Data Cleaning: Who Spends More? First-Time Tourists or Returners?

Deviating from my usual focus on education and some health analyses on the side, I wanted to find an interesting-to-me dataset in the government open data portal that I might dive into. In the past, I have been somewhat curious about tourism in the country and whether that market is ripe for innovation. My last grand idea was a one-stop shop for all tourists prior to and during their stay in Tanzania. Turns out Triposo pretty much has that covered. So for now, I tried a more humble approach of asking questions.

Based on the 2014 Tourism Survey, can we determine who spends more while in Tanzania: first-time tourists or returners?

As always, please find my GitHub repository on this here. Go here for the Jupyter notebook.

Now, this question in itself is not difficult (in fact, it can be answered in one line of Python). Who spends more? It turns out that, on average, returners spend 2.5 times more money than first-timers while in Tanzania. This post is more about how I approached the cleaning process for this messy dataset before arriving at that conclusion.

Here are the different ways in which this dataset needed to be cleaned before the question could be answered:

  1. Nonsense (Inconsistent) Responses to Unclear Questions?
    • When I first took a cursory look at some of the answers in the data, I noticed that some did not align or make sense given the answers in other columns. A person who said they travelled with a group did not list the age ranges of the other members in the group. So, did they travel alone? Another person who said they travelled alone listed people in their group. So what gives? Even with the questionnaire in hand, there is no way to make sense of this data with any assumptions, so I dropped most of those columns.
  2. Clean Responses, But No Variation
    • Some columns had clean responses, but every value was identical, producing zero variation. They added no real value to the dataset, so I dropped them. (For clarity, these were things like all 0s or all NaNs, so they couldn't even be used to generate new columns.)
  3. Did Someone Accidentally Find and Replace all 8s with "Others (please specify)"?
    • This was the wildest experience of my data cleaning life to date. I could understand if all standalone integer values of 8 had been turned into a specific string while values like 808 were spared. What actually happened, though, is that no 8s were spared. As per my linked tweet example, something like 808s became "Others (please specify)0Others (please specify)s". Nothing was spared. It was pretty impressive. I wrote functions to correct these errors using some list comprehensions, turning everything back into numeric values instead of strings.
  4. Lambda Currencies
    • The survey asks people to report how much they spent and allows them to use the currency of their choice. Some marked the currency, others did not. Kudos to the survey design, but the actual dataset should use a single currency for calculations. If I were a true Ujamaaist, I would probably advocate using TZS, but realistically USD is a typical convention and the numbers are nicer for it, as opposed to being in the millions. I wanted to use classic switch/case statements here: change the value of this new column to its USD equivalent using this formula if the currency is X, or that formula for currency Y. Alas, Python still doesn't support switch/case. Suggestions on StackOverflow were to use a dictionary with the currency as the key and a lambda function as the value. A tricky land to survey in the sleep-deprived state I was in, but I managed to brute-force through it and was so happy. Functional programming is fun, they said.
  5. Useful Helper Columns and Dummies
    • As hinted in the previous bullet, I found that I needed to make some helper columns using (but not deleting) the original features. Norming to USD and counting the total number of travelers from the number of females and number of males are two such examples. The latter turns out to have its own 'problems' that will need to be fixed: sometimes the total is 0 or otherwise inconsistent with other answers. At some point you just have to trust some inconsistencies over others, I guess? It could also be by design that some survey participants did not identify their gender as either of the options. Finally, a note on dummies. Columns with binary answers were immediately dummied (this question actually uses one of my dummied features!). My convention for dummying is that affirmatives are 1s and negatives are 0s. There are still some categorical variables that I have not elected to dummy yet - I will do so in the next analyses!
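Step 2 above is nearly a one-liner with pandas. Here's a minimal sketch of the idea, with invented column names standing in for the real survey fields:

```python
import pandas as pd

# Toy stand-in for the survey data; the column names are made up.
df = pd.DataFrame({
    "night_mainland": [3, 10, 5],
    "unused_flag": [0, 0, 0],              # identical everywhere -> zero variation
    "first_trip_tz": ["Yes", "No", "Yes"],
})

# Drop any column with at most one distinct non-null value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
df = df.drop(columns=constant_cols)
```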
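The find-and-replace disaster in step 3 can be reversed by mapping the stray string back to the digit before converting to numeric. This is only a sketch of the approach, not the exact functions from the notebook:

```python
import pandas as pd

# A corrupted numeric column: every digit 8 became the literal string
# "Others (please specify)", so 808 -> "Others (please specify)0Others (please specify)".
raw = pd.Series([
    "Others (please specify)0Others (please specify)",
    "120",
    "Others (please specify)",
])

def restore_eights(value):
    """Swap the stray label back to the digit 8, then coerce to a number."""
    if isinstance(value, str):
        value = value.replace("Others (please specify)", "8")
    return pd.to_numeric(value, errors="coerce")

fixed = raw.map(restore_eights)  # back to 808, 120, 8
```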
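The dictionary-of-lambdas workaround from step 4 looks roughly like this; the exchange rates below are illustrative placeholders, not the 2014 rates used in the actual cleaning, and the USD fallback for unmarked currencies is one possible assumption:

```python
# Dispatch table: currency code -> conversion to USD (rates are placeholders).
to_usd = {
    "USD": lambda amt: amt,
    "TZS": lambda amt: amt / 1650.0,
    "EUR": lambda amt: amt * 1.10,
}

def normalize_spend(amount, currency):
    """Convert a reported amount to USD; treat an unmarked currency as USD."""
    return to_usd.get(currency, to_usd["USD"])(amount)
```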
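And the helper columns and dummy convention from step 5 can be sketched like so, again with hypothetical column names:

```python
import pandas as pd

# Invented names standing in for the survey's gender counts and a yes/no question.
df = pd.DataFrame({
    "total_female": [1, 0, 2],
    "total_male": [1, 1, 0],
    "first_trip_tz": ["Yes", "No", "Yes"],
})

# Helper column built from (but not replacing) the original features.
df["total_people"] = df["total_female"] + df["total_male"]

# Dummy convention: affirmatives are 1s, negatives are 0s.
df["first_trip_tz"] = df["first_trip_tz"].map({"Yes": 1, "No": 0})
```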
And that's about all of the steps I took! I did not really address NaNs or some of the messy string free-responses (my preference was to drop them, but others may find them useful) and this was already quite a labor of love. I am happy that I knew how I wanted to approach most of my cleaning and have learned how I might approach future cleaning processes as well (re: switch/case for instance). As always, I have posted my conservatively cleaned final dataset on GitHub in hopes of saving someone else some time!