I recently took a look at a Kaggle competition about automatically labeling toxic comments on Wikipedia. I thought it would be a good opportunity to showcase how well the tidytext and keras R interfaces work together. This data includes a lot of really troubling language, including racial slurs, misogynistic language, and homophobic content, so I decided not to include most of the exploratory data analysis because this is a family blog.
I have a pretty strange background for a data scientist. In my career I’ve sold electric razors, worked on credit derivatives during the 2008 financial crash, written market reports on orthopaedic biomaterials, and practiced law. I started programming in R during law school, partly as a way to learn more about data visualization and partly to help analyze youth criminal justice data. Over time I came to enjoy programming more than law and decided to make the switch to data work about three years ago.
My last job was as a data scientist at Upworthy, which is a 100% remote company. Prior to starting the position I was worried about whether I could be happy and productive on a remote team. I wondered how project planning would work, whether it would be terribly lonely, and how communication would function when things got hectic. What I discovered is that the company was one of the more efficient and friendly places that I’ve worked, and I think the changes that they have made to accommodate remote work deserve much of the credit.
Over the past couple of months, I’ve been rebuilding the Shambhala Meditation Timer using React Native and Redux. The idea behind the Shambhala app was to create a kind of modular framework for building meditation timers, so that people can create complex timers out of simple components. The three building blocks for a timer are time intervals, gong sounds, and recorded audio contemplations, and the user can stack these building blocks to create whatever kind of meditation session they want.
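To make the building-block idea concrete, here is a minimal sketch in TypeScript. It is not the app's actual code: the type name `TimerBlock`, its fields, the sound and title strings, and the helper `totalSeconds` are all hypothetical, and I am assuming that gongs add no time to a session.

```typescript
// Hypothetical model of the three building blocks described above.
// None of these names come from the real app.
type TimerBlock =
  | { kind: "interval"; seconds: number }
  | { kind: "gong"; sound: string }
  | { kind: "contemplation"; title: string; seconds: number };

// A session is just an ordered stack of blocks.
const session: TimerBlock[] = [
  { kind: "gong", sound: "opening-bell" },
  { kind: "interval", seconds: 300 },
  { kind: "contemplation", title: "Loving-kindness", seconds: 600 },
  { kind: "gong", sound: "closing-bell" },
];

// Sum the timed blocks; gongs are assumed to take no time.
function totalSeconds(blocks: TimerBlock[]): number {
  return blocks.reduce(
    (sum, block) => (block.kind === "gong" ? sum : sum + block.seconds),
    0
  );
}
```

Because each block carries a `kind` tag, new block types could be added later without changing how a session is stored, which is roughly the appeal of composing timers from simple parts.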
Tidy Data

Tidy data has become the dominant way of thinking about problems in R. The idea behind tidy data is to develop an ecosystem of R packages that all work around a similar kind of data structure. That way you can easily compose many different tools together to accomplish very complex tasks in an iterative, easy-to-understand fashion. There are lots of excellent presentations about why this is a great approach, but the one I would recommend if you are new to this area is Hadley Wickham’s keynote from the 2017 RStudio conference.
Like most people, I first learned to work with numbers through an Excel spreadsheet. After graduating with an undergraduate philosophy degree, I somehow convinced a medical device marketing firm to give me a job writing Excel reports on the orthopedic biomaterials market. When I first started, I remember not knowing how to do anything, but after a few months I became fairly proficient with the tool and was able to build all sorts of useful models.