<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>The Data Sleuth</title>
    <description>An exploratory blog created for practicing techniques in data analysis, machine learning, predictive modeling, and programming languages</description>
    <link>https://thedatasleuth.github.io/</link>
    <atom:link href="https://thedatasleuth.github.io/sitemap.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Sun, 13 Oct 2019 20:29:40 +0000</pubDate>
    <lastBuildDate>Sun, 13 Oct 2019 20:29:40 +0000</lastBuildDate>
    <generator>Jekyll v3.8.5</generator>
    
      <item>
        <title>Weather Observation Station 18</title>
        <description>&lt;h3 id=&quot;prompt&quot;&gt;Prompt&lt;/h3&gt;

&lt;p&gt;Consider &lt;em&gt;P&lt;sub&gt;1&lt;/sub&gt;(a,b)&lt;/em&gt; and &lt;em&gt;P&lt;sub&gt;2&lt;/sub&gt;(c,d)&lt;/em&gt; to be two points on a 2D plane.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;a&lt;/em&gt; happens to equal the minimum value in Northern Latitude (LAT_N in STATION).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;b&lt;/em&gt; happens to equal the minimum value in Western Longitude (LONG_W in STATION).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;c&lt;/em&gt; happens to equal the maximum value in Northern Latitude (LAT_N in STATION).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;d&lt;/em&gt; happens to equal the maximum value in Western Longitude (LONG_W in STATION).&lt;/p&gt;

&lt;p&gt;Query the &lt;a href=&quot;https://xlinux.nist.gov/dads/HTML/manhattanDistance.html&quot;&gt;Manhattan Distance&lt;/a&gt; between points &lt;em&gt;P&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;P&lt;sub&gt;2&lt;/sub&gt;&lt;/em&gt; and round it to a scale of 4 decimal places.&lt;/p&gt;
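For intuition, the Manhattan distance between P1(a, b) and P2(c, d) is |a - c| + |b - d|. Here is a quick Python sketch of the calculation with made-up coordinates (the real inputs come from the STATION table):

```python
# Manhattan distance between P1(a, b) and P2(c, d): |a - c| + |b - d|
# The coordinate values below are illustrative only.
a, b = 38.0689, 70.1374    # hypothetical MIN(LAT_N), MIN(LONG_W)
c, d = 137.2345, 145.5712  # hypothetical MAX(LAT_N), MAX(LONG_W)

# Sum the absolute differences along each axis, rounded to 4 decimal places
distance = round(abs(a - c) + abs(b - d), 4)
print(distance)
```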

&lt;p&gt;The STATION table is described as follows:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Field&lt;/th&gt;
      &lt;th&gt;Type&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;ID&lt;/td&gt;
      &lt;td&gt;NUMBER&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CITY&lt;/td&gt;
      &lt;td&gt;VARCHAR2(21)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;STATE&lt;/td&gt;
      &lt;td&gt;VARCHAR2(2)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;LAT_N&lt;/td&gt;
      &lt;td&gt;NUMBER&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;LONG_W&lt;/td&gt;
      &lt;td&gt;NUMBER&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br /&gt;
where LAT_N is the northern latitude and LONG_W is the western longitude.&lt;/p&gt;

&lt;h3 id=&quot;answer&quot;&gt;Answer&lt;/h3&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ROUND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;ABS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;MIN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LAT_N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LAT_N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ABS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;MIN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LONG_W&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LONG_W&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STATION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Fri, 14 Sep 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/tutorial/2018/09/14/Hacker-Rank-Weather-Observation-Station-18.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/tutorial/2018/09/14/Hacker-Rank-Weather-Observation-Station-18.html</guid>
        
        <category>SQL</category>
        
        
        <category>Tutorial</category>
        
      </item>
    
      <item>
        <title>Voter Registration by New York Congressional Districts for 2018 Midterm Elections</title>
        <description>&lt;p&gt;&lt;em&gt;Analysis of Voter Registration for the 27 New York Congressional Districts in anticipation of the 2018 Midterm Elections with interactive Dash charts.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;In anticipation of the 2018 mid-term elections, Enigma Public is hosting a data science competition.  Using their new &lt;a href=&quot;https://pypi.org/project/enigma-sdk/&quot;&gt;SDK package&lt;/a&gt;, which facilitates API calls against their public database, contestants are encouraged to explore topics that have some explanatory power over the midterm elections.  The competition ends on Friday, September 21, 2018.&lt;/p&gt;

&lt;p&gt;Under their &lt;a href=&quot;https://public.enigma.com/browse/tag/elections/34&quot;&gt;Elections&lt;/a&gt; collection, there are six umbrella datasets: Congressional Candidate Disbursements, Federal Election Commission, New York State Board of Elections, PAC Summary, Presidential Campaign Disbursements, and Tafra.  While there’s definitely something to dig into in each of these datasets, I chose to explore the New York State Board of Elections dataset filled with 5 years of voter registration statistics by Congressional District.&lt;/p&gt;

&lt;p&gt;I initially started off this analysis plotting in Matplotlib but halfway through decided to challenge myself and build out my plots using Dash.  I wanted to do this for several reasons: First, I’ve used &lt;a href=&quot;https://plot.ly/#/&quot;&gt;Plotly&lt;/a&gt; in a &lt;a href=&quot;https://thedatasleuth.github.io/predictive-modeling/2018/05/21/Predicting-Titanic-Survivability-Odds-Using-Machine-Learning.html&quot;&gt;previous project&lt;/a&gt; and was really impressed with the interactive hover-over feature and the crispness of the colors.  With &lt;a href=&quot;https://plot.ly/products/dash/&quot;&gt;Plotly-Dash&lt;/a&gt;, I was able to take the user interaction to the next level by including drop-down menus to facilitate more focused analysis.  Second, I felt like building out interactive plots like the ones below encourages engagement with the data as opposed to just reading someone’s findings.  I can’t think of a better topic than election season to get people engaged with data and thinking about their right to vote.  Also, the kid in me really likes pushing the buttons.&lt;/p&gt;

&lt;p&gt;Have a click through the charts below and feel free to share your own findings in the comments section.&lt;/p&gt;

&lt;h3 id=&quot;voter-registration-across-political-parties-from-2014-2018&quot;&gt;Voter Registration across Political Parties from 2014-2018&lt;/h3&gt;

&lt;p&gt;The chart below displays voter registration numbers for all 27 districts of New York which can be filtered by year using the drop-down menu.  Click on any combination of the boxes in the legend to compare voters across political parties.  Here are a few questions to get you started on your analysis:  Did the Democrats lose any districts to the Republicans?  If so, where and when?  What’s the greatest party affiliation after the Democrats and Republicans?  Does that political affiliation represent opportunity or apathy?  Where does the Independent Party have the most registered voters?  What other trends can you see?&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;iframe src=&quot;https://ny-districts-by-year.herokuapp.com/&quot; width=&quot;620px&quot; height=&quot;520px&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;percentage-change-of-voter-registration&quot;&gt;Percentage Change of Voter Registration&lt;/h3&gt;

&lt;p&gt;Sometimes it’s not enough to look at the raw numbers themselves; you also need to calculate how they have changed from year to year.  Using the &lt;code class=&quot;highlighter-rouge&quot;&gt;.pct_change()&lt;/code&gt; method in Pandas, I was able to identify the percentage change of voter registration across the political parties.  This analysis produced some interesting results, especially as it relates to the fringe parties of New York, namely, the Women’s Equality Party (WEP) and the Reform Party (REF).&lt;/p&gt;
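As a sketch of that calculation (the party columns and counts below are made up for illustration, not Enigma’s actual schema or values), .pct_change() returns the fractional change from each row to the next:

```python
import pandas as pd

# Hypothetical registration counts for one district, indexed by year
reg = pd.DataFrame(
    {"DEM": [180000, 185400, 190962], "REP": [90000, 88200, 86436]},
    index=[2016, 2017, 2018],
)

# .pct_change() computes the fractional change from the prior row,
# so each value is (this year - last year) / last year
pct = reg.pct_change()
print(pct.round(3))
```

Multiplying by 100 gives the percentage change plotted in the chart below; the first row is NaN because there is no prior year to compare against.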

&lt;p&gt;Founded in 2014, the Women’s Equality Party (WEP) was created by Andrew Cuomo, the current governor of New York.  The party reached full political party status in 2014 during the gubernatorial elections after receiving the requisite 50,000 votes.  While Cuomo is no longer in control of the party on paper, the party remains deeply supportive of the incumbent governor, supporting his most recent re-election bid in 2018.&lt;/p&gt;

&lt;p&gt;The Reform Party (REF), another fringe political party founded in 2014, is a ballot-access qualified party created by &lt;a href=&quot;https://www.nyreformparty.com/endorsements&quot;&gt;Curtis Sliwa&lt;/a&gt;.  Sliwa initially founded another group called the Guardian Angels in 1979 and is a regular contributor to several talk radio and television shows, including Fox 5.&lt;/p&gt;

&lt;p&gt;Click through the chart below to see how each district has responded to the introduction of these new parties.  Do you see any patterns emerging across districts that are represented by a Democrat as opposed to a Republican?&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;iframe src=&quot;https://ny-congressional-districts.herokuapp.com/&quot; width=&quot;620px&quot; height=&quot;520px&quot;&gt;&lt;/iframe&gt;

&lt;!-- District 1, located in Eastern Long Island, District 2, located on Southern Long Island, and Districts 23 and 27, both located in Western New York, all of which are represented by a Republican congressman, show a significant percentage change in voter registration for the WEP party in 2016.  Similarly, for Democrat-represented districts like District 3 located in Northern Long Island, District 7, located in Queens, District 18, located in Orange, Putnam, Southern Dutchess, and Northeastern Westchester, District 20, located in Albany and Schenectady, all saw a huge spike in WEP registration in 2016.    

For the REF party, there was some increase in Republican-represented districts like District 19 in the Hudson Valley and District 24 in Cayuga, Onondaga, Wayne counties in 2016 and District 2, located in Southern Long Island and District 23 located in Western New York in 2017.  The REF party increased its voter registration percentage significantly two years in a row in District 22 located in Central New York.

For Democrat-represented districts, the REF party also increased its voter share most significantly in District 18 in Orange, Putnam, Southern Dutchess, and Northeastern Westchester counties in 2016, and in District 3, Northern Long Island and District 4, Central and Southern Nassau County, and to a lesser extent District 5, Queens, and District 10, an amalgamation of the Upper West Side, the Financial District, Greenwich Village, and Borough Park in Brooklyn, District 16, Northern Bronx and Southern Westchester and District 20 Albany and Schenectady in 2017. --&gt;

&lt;h3 id=&quot;active-v-inactive-voters-in-2018&quot;&gt;Active v. Inactive Voters in 2018&lt;/h3&gt;

&lt;p&gt;Counting the number of registered voters, however, does not equate to how many people will actually vote.  In the 2016 Presidential election, for example, &lt;a href=&quot;https://www.pbs.org/newshour/politics/voter-turnout-2016-elections&quot;&gt;58%&lt;/a&gt; of &lt;a href=&quot;https://en.wikipedia.org/wiki/Voter_turnout&quot;&gt;eligible voters&lt;/a&gt; cast a ballot.  So how can you tell if someone will vote or not?  One way to identify potential voter turnout is to take a look at the active and &lt;a href=&quot;http://vote.nyc.ny.us/html/voters/faq.shtml&quot;&gt;inactive voter registration&lt;/a&gt;&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.  Voters who are active are more likely to show up to the polls than are inactive voters.&lt;/p&gt;

&lt;p&gt;Click through the chart below to see the breakdown of active and inactive voters across all districts for each party in 2018.  Is there a party that is consistently active across all districts?  Which party is the most inactive?  Who’s more active between the Democrats and Republicans?  Thinking back to the fringe parties percentage change chart, what do you think it means that a newly founded fringe party has inactive voters?&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;iframe src=&quot;https://active-inactive-ny-voters2.herokuapp.com/&quot; width=&quot;620px&quot; height=&quot;520px&quot;&gt;&lt;/iframe&gt;

&lt;h3 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h3&gt;

&lt;p&gt;The 2018 midterm elections are significant for a number of different reasons.  If Republicans retain control of both the Senate and the House of Representatives, they will be able to continue pushing a more conservative agenda that may include trying to repeal Obamacare again, cutting budgets for social services like welfare, Medicare, and Social Security, and restricting access to abortion.  Moreover, because many gubernatorial and state legislature elections determine how state districts are drawn, whoever is elected will be able to help shape congressional district boundaries after the upcoming 2020 Census, and those boundaries will be in effect for the following ten years.  Here’s an &lt;a href=&quot;https://www.washingtonpost.com/graphics/2018/politics/governors-redistricting/?utm_term=.4dd62658f59e&quot;&gt;interactive map&lt;/a&gt; created by the Washington Post detailing who determines how congressional districts are drawn for each state.&lt;/p&gt;

&lt;p&gt;Many of the 2018 Midterm elections will be held on &lt;strong&gt;Tuesday, November 6, 2018&lt;/strong&gt;.  All 435 seats in the United States House of Representatives and 35 of the 100 seats in the United States Senate will be contested.&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;  Vote early and vote often.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;A voter in inactive status who does not vote in two consecutive Federal Elections is, in the fifth year, removed from the registration list. The voter must re-register in order to vote. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/United_States_elections,_2018&quot;&gt;https://en.wikipedia.org/wiki/United_States_elections,_2018&lt;/a&gt; &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Wed, 12 Sep 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/python/2018/09/12/2018-Midterm-Election-New-York-Congressional-Districts.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/python/2018/09/12/2018-Midterm-Election-New-York-Congressional-Districts.html</guid>
        
        <category>Python</category>
        
        <category>Plotly-Dash</category>
        
        
        
      </item>
    
      <item>
        <title>5 Not So Obvious Ways to Get a Job in Data Science</title>
        <description>&lt;p&gt;&lt;em&gt;The search for a job can be exhausting, frustrating, and perplexing, even in a field as hot as data science is right now.  To get some actual traction with employers, don’t do what everyone else does.  Here are five ways to look for a data-oriented job that can actually connect you to a hiring manager.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;We’ve all been there before.  You see a job on an aggregator site like Monster, Glassdoor, or Indeed that matches your interests and skillset.  You fill out basic demographic information about yourself and upload your resume.  Maybe you attach a general cover letter, maybe you don’t.  Who even reads those things anymore anyway, you wonder.  You click the “Apply” button at the bottom of the screen and feel temporarily productive about the state of your job search and the acceleration of your career only to never hear back from the company - not even to acknowledge they’ve received your application.  You do this about a hundred more times, expecting different results.&lt;/p&gt;

&lt;p&gt;While unemployment is at &lt;a href=&quot;https://www.nytimes.com/2018/05/04/business/economy/jobs-report.html&quot;&gt;the lowest since the beginning of the century&lt;/a&gt; as of this writing, it remains difficult to land a good job, even in an industry as hyped as Data Science.  Hundreds, if not thousands, of candidates apply for a single job posting, many of whom have advanced degrees and years of professional and &lt;em&gt;relevant&lt;/em&gt; job experience.&lt;/p&gt;

&lt;p&gt;I used to think landing a job was just a numbers game, and if I only applied to enough postings, statistically speaking, I would eventually get an offer.  During this shotgun approach, I applied to over 20 entry-level data science jobs in a single day.  I got exactly zero invitations for a phone screen, let alone an actual in-person interview.  It was then that I decided to switch up my job search strategy, because &lt;em&gt;feeling&lt;/em&gt; productive is not always the same thing as actually generating solid leads.  Here are a few of the strategies that I’ve used in my own job search that have connected me to real human beings who’ve had a genuine interest in speaking to me about a job that actually exists.&lt;/p&gt;

&lt;h3 id=&quot;slack-channels&quot;&gt;Slack Channels&lt;/h3&gt;

&lt;p&gt;Slack is a new platform launched in 2014 to help facilitate communication among common-interest user groups.  For example, companies use Slack in lieu of email to track work projects and provide updates to relevant team members.  I’ve used Slack during my &lt;a href=&quot;https://thedatasleuth.github.io/general-assembly/2018/07/17/General-Assembly-Data-Science-Review.html&quot;&gt;12-week Data Science course at General Assembly&lt;/a&gt; to follow along in lectures and ask questions of my instructor and classmates.  My apartment building even has a Slack channel where residents receive building updates and invitations for get-togethers.  Slack is a user-friendly platform for people to connect - which means it can also be a way for job seekers to connect with hiring managers.&lt;/p&gt;

&lt;p&gt;I’ve joined four Slack channels (apart from GA and my apartment building) that are focused exclusively on Data Science topics: two are focused on supporting women in Data Science, one is focused on the tech industry in New York, and the fourth is a private channel for &lt;a href=&quot;https://www.datacamp.com/home&quot;&gt;Data Camp&lt;/a&gt; subscribers.  Here are the Slack channels I’ve joined to stay connected to the Data Science community, and more specifically, to identify job opportunities:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://wimlds.slack.com/&quot;&gt;wimlds&lt;/a&gt; (Women in Machine Learning and Data Science)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://rladies-community.slack.com/&quot;&gt;R-Ladies Community&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://nyctech.slack.com/&quot;&gt;NYCTech&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://datacamp-usa.slack.com/&quot;&gt;Data Camp USA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within each of these channels, there is a sub-channel dedicated to #jobs.  Yes, employers actually post real jobs in these channels and usually include the email handles of hiring managers.  It is not only ok but encouraged to email the poster directly to further inquire about opportunities that interest you.  Congrats, you just connected to a real live human for a real live job.&lt;/p&gt;

&lt;p&gt;My list is by no means an exhaustive one, and &lt;a href=&quot;https://towardsdatascience.com/15-data-science-slack-communities-to-join-8fac301bd6ce&quot;&gt;neither is this one&lt;/a&gt;.  These are just some of the channels that are out there related to data.  I would encourage you to do a little bit of research yourself to find a few Slack channels that would be good fits for you.  When trying to identify Slack channels to join, consider your stack, your location, and your demographic.  Connect with people who are similar to you and/or have similar interests.  Get the emotional buy-in from someone who actually wants to read your cover letter/email instead of hoping to be discovered.&lt;/p&gt;

&lt;h3 id=&quot;start-a-blog&quot;&gt;Start a Blog&lt;/h3&gt;

&lt;p&gt;Every job hunter has a resume and a LinkedIn profile.  For prospective employers, it can be difficult to actually gauge a candidate’s technical ability based on self-reported bullet points, and &lt;a href=&quot;https://www.theladders.com/career-advice/men-puff-up-success-linkedin-women-undersell&quot;&gt;women in particular underreport their abilities&lt;/a&gt;.  To really give employers a better understanding of your skills, start detailing your analyses in a blog.  This serves the dual purpose of not only demonstrating your coding ability but also proving that you can actually present your findings in an organized way.  Data Science is a multidisciplinary field in which practitioners can be expected to report their findings to a range of technical and non-technical stakeholders.  Having the skills to communicate efficiently and methodically is just as important as slicing through a dataset with Pythonic prowess.&lt;/p&gt;

&lt;p&gt;Moreover, having a blog demonstrates a deeper level of interest in data than the average applicant.  I’ve even used my blog to supplement my skills pitch during interviews by inviting prospective employers to read through some of my posts to gain a better understanding of my skills and interests.  Multiple employers have actually followed up to tell me they read one of my posts, allowing me to stay connected to them long after I left their office.&lt;/p&gt;

&lt;h3 id=&quot;lunch-clubs&quot;&gt;Lunch Clubs&lt;/h3&gt;

&lt;p&gt;When I think of networking, I imagine a room full of people standing around awkwardly working the crowd to figure out who might be able to help them advance their agendas, whether it’s to find a job, pitch an investor, or meet a potential co-founder.  But without any structure, it can feel like a giant waste of time.  When I meet up with someone in a professional setting, I like for both people to know why they are there and what the goal of the meeting is.&lt;/p&gt;

&lt;p&gt;Lunch clubs like &lt;a href=&quot;https://lunchclub.ai&quot;&gt;Lunchclub.ai&lt;/a&gt; and &lt;a href=&quot;https://letslunch.com/&quot;&gt;LetsLunch&lt;/a&gt; cut through the ambiguity by pairing people up with similar networking intentions.  As a member, you state your role during the initial sign-up period, and upon acceptance to the club, you receive invites to meet up with people who can actually help you.  Job seekers can pair up with prospective employers, start-ups can match with investors, and people who just want to connect with other professionals in their field can meet up one-on-one, too.  And even if it’s a total bust, at least you got to sit down and eat.&lt;/p&gt;

&lt;h3 id=&quot;hackathons&quot;&gt;Hackathons&lt;/h3&gt;

&lt;p&gt;I recently went to a SexTech Hackathon in Brooklyn which focused on leveraging technology to improve the sex lives of 8 billion people.  While I didn’t have specific plans to break into the SexTech industry, it was a really great way to meet people from a variety of different backgrounds.  Teams were self-selected, and I ended up on a team made up of a “humanitarian” (soft skills), a “hipster” (design), a “hustler” (business savvy), and another “hacker” (coding) besides me.  In just a day and a half, my team built a rudimentary chatbot decision tree in JavaScript (which I had never used before that weekend) designed to improve intimacy between couples.  To me and the other hacker, it mattered not what our inputs were, whether they were NSFW-related or about flowers - the code and the logic were the same.  And while I did go just to improve my coding skills and to do a fun, crazy thing in Brooklyn for a weekend, I actually ended up connecting with the other hacker on my team, who later referred me for a gig with her part-time employer to tutor Python to kids aged 8-10.  So really, you never know who you’ll meet and how they might help you.  Keep an open mind and sign up for any hackathon that looks like it could be fun.  If nothing else, you’ll have quite the story for Monday.&lt;/p&gt;

&lt;h3 id=&quot;cold-email-companies-you-admire&quot;&gt;Cold email companies you admire&lt;/h3&gt;

&lt;p&gt;Companies post jobs on aggregator sites because they have an immediate need to fill.  Looking at it in the negative, however, just because a company &lt;em&gt;hasn’t&lt;/em&gt; posted a job description that squarely aligns with your skillset doesn’t mean there isn’t a space (or funding) for you.&lt;/p&gt;

&lt;p&gt;In a confusing age of both anonymity and oversharing, it can be difficult for employers (or anyone really) to get a read on someone without meeting them.  Interviews will always be preferable, but second to meeting someone in person is writing them a sincere email.  For companies that interest you, research the management team and identify someone you could write to.  Keep it short and direct - no more than 4-5 sentences - to explain who you are, what about the company interests you, and what specific value you could add.  For illustrative purposes, here’s a sample email intro:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Hi Ms. Employer,&lt;/p&gt;

  &lt;p&gt;My name is Jess Chace, and I am writing to inquire about a position with Data Science Co.  DS Co. has been on my radar for several years now, and I’ve watched your growth and increasing number of tech solutions with great interest, especially (include specific example here).  Having recently updated my own tech stack, I now feel competitive enough to join your team to help build out (whatever your stack can build here).  Do you have time to meet up for an exploratory interview this week or next?&lt;/p&gt;

  &lt;p&gt;Looking forward to hearing from you,&lt;/p&gt;

  &lt;p&gt;Jess&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, I’m not suggesting you try this technique on someone like Jeff Bezos or Andrew Ng.  This strategy is geared more towards companies ranging from 10 to 1,000 employees.  Worst case scenario, you don’t hear back from them.  Best case scenario, you’re invited in for an exploratory interview, bypassing the phone screen and getting straight to that face-to-face.  More likely, however, is that you’ll get a thank you for your interest or maybe even an invitation to check back in 3-6 months.  Take them up on that offer.  Check back in 3-6 months and update them on what you’ve been up to since you last spoke, whether it’s learning a new language or completing a new project.  Think like a salesperson when it comes to advocating for your career:  great job opportunities are like leads that need to be cultivated through months of check-ins and follow-ups.  Play the long game.&lt;/p&gt;

&lt;h3 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h3&gt;

&lt;p&gt;No one is going to place a 150k salaried job in your lap because they recognized your technical aptitude and great cultural fit from your LinkedIn profile.  Even if you have a degree from a fancy university with multiple years of professional experience, you are not owed a great job.  You have to fight for it.  When it’s a numbers game like it is with the job search, and especially in a competitive field like Data Science, you have to think of ways to stand out from the crowd, because in reality, there are many candidates with similar qualifications.&lt;/p&gt;

&lt;p&gt;The only way I’ve been able to get around that is to get the emotional buy-in of someone on the inside who can then advocate for you, whether it’s writing an email to them, connecting over a private Slack channel, or meeting up with them individually.  Otherwise, you’re just another bag of words.&lt;/p&gt;

&lt;p&gt;What about you?  Have you had any success with these or other non-traditional job search techniques?  Feel free to share them in the comments below!&lt;/p&gt;
</description>
        <pubDate>Mon, 10 Sep 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/opinion/2018/09/10/5-Not-So-Obvious-Ways-to-Get-a-Job-in-Data-Science.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/opinion/2018/09/10/5-Not-So-Obvious-Ways-to-Get-a-Job-in-Data-Science.html</guid>
        
        <category>Job Search</category>
        
        
        <category>Opinion</category>
        
      </item>
    
      <item>
        <title>Hacker Rank</title>
        <description>&lt;p&gt;&lt;em&gt;Answers to Hacker Rank challenges in the SQL module.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Hacker Rank is another great resource for improving programming and analysis skills, offering modules like “Problem Solving”, “Language Proficiency”, and “Specialized Skills”.  Each module contains topics; within the Language Proficiency module, for example, there are seven languages to practice, including Python, Java, and Ruby.  Programmers can work through a series of challenges to improve their proficiencies, earning points that eventually accumulate into badges.  Bronze, silver, and gold badges represent a programmer’s level of proficiency within a given topic.  You may wonder why someone would care about achieving a badge that isn’t seemingly worth anything, but this is actually another great way to showcase coding ability to prospective employers.  Demonstrating that you’ve achieved a gold badge in SQL on Hacker Rank is a quantifiable data point that corroborates the bullet points on a resume.&lt;/p&gt;

&lt;p&gt;Within each challenge, there is a discussion section where coders can ask questions and post answers.  I’ve found this particularly helpful when I was stuck on a small aspect of the challenge or to identify alternative methods of approaching the problem.&lt;/p&gt;

&lt;p&gt;Below is a series of posts containing my answers to challenges rated at medium difficulty for the SQL module.&lt;/p&gt;
</description>
        <pubDate>Wed, 22 Aug 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/sql/2018/08/22/Hacker-Rank-Answers.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/sql/2018/08/22/Hacker-Rank-Answers.html</guid>
        
        <category>SQL</category>
        
        
        
      </item>
    
      <item>
        <title>SQL Zoo</title>
        <description>&lt;p&gt;&lt;em&gt;A complete list of my answers to the SQL Zoo Tutorials.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Lately I’ve been focusing on my Python projects so I thought I’d switch it up and do some SQL practice using &lt;a href=&quot;https://sqlzoo.net/&quot;&gt;SQL Zoo’s&lt;/a&gt; interactive tutorials.  These tutorials are filled with prompts to query databases using SQL.  What’s great about this site is that it shows the results of the query right next to the prompt line, allowing the user to verify that the query pulled the desired results.  SQL Zoo confirms answers with a :)&lt;/p&gt;

&lt;p&gt;Below is a list of posts that contain prompts and answers to each of the chapters in SQL Zoo.&lt;/p&gt;
</description>
        <pubDate>Tue, 21 Aug 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/sql/2018/08/21/SQL-Zoology.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/sql/2018/08/21/SQL-Zoology.html</guid>
        
        <category>SQL</category>
        
        
      </item>
    
      <item>
        <title>SUM and COUNT Tutorial</title>
        <description>&lt;h3 id=&quot;total-world-population&quot;&gt;Total world population&lt;/h3&gt;

&lt;p&gt;Show the total population of the world.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;list-of-continents&quot;&gt;List of continents&lt;/h3&gt;

&lt;p&gt;List all the continents - just once each.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;DISTINCT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;gdp-of-africa&quot;&gt;GDP of Africa&lt;/h3&gt;

&lt;p&gt;Give the total GDP of Africa&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Africa'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;count-the-big-countries&quot;&gt;Count the big countries&lt;/h3&gt;

&lt;p&gt;How many countries have an area of at least 1000000?&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;COUNT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;baltic-states-population&quot;&gt;Baltic states population&lt;/h3&gt;

&lt;p&gt;What is the total population of (‘Estonia’, ‘Latvia’, ‘Lithuania’)?&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'Estonia'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Latvia'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Lithuania'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;counting-the-countries-of-each-continent&quot;&gt;Counting the countries of each continent&lt;/h3&gt;

&lt;p&gt;For each continent show the continent and number of countries.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;COUNT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;counting-big-countries-in-each-continent&quot;&gt;Counting big countries in each continent&lt;/h3&gt;

&lt;p&gt;For each continent show the continent and number of countries with populations of at least 10 million.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;COUNT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10000000&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;counting-big-continents&quot;&gt;Counting big continents&lt;/h3&gt;

&lt;p&gt;List the continents that have a total population of at least 100 million.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;HAVING&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;SUM&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100000000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
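The last two exercises draw the key distinction in this tutorial: WHERE filters individual rows before grouping, while HAVING filters the aggregated groups afterwards. A minimal runnable sketch in Python with sqlite3; the table contents below are invented for illustration, not SQL Zoo’s real data:

```python
import sqlite3

# Tiny in-memory stand-in for SQL Zoo's `world` table.
# Names and populations are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("A", "Asia", 90_000_000),
        ("B", "Asia", 30_000_000),
        ("C", "Oceania", 5_000_000),
    ],
)

# HAVING filters the groups produced by GROUP BY; a WHERE clause here
# would instead reject individual countries before their populations
# were ever summed.
rows = conn.execute(
    """
    SELECT continent FROM world
    GROUP BY continent
    HAVING SUM(population) >= 100000000
    """
).fetchall()
print(rows)  # [('Asia',)] -- only Asia's total (120M) clears the 100M bar
```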

&lt;hr /&gt;
</description>
        <pubDate>Sat, 18 Aug 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/2018/08/18/SUM-COUNT.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/2018/08/18/SUM-COUNT.html</guid>
        
        <category>SQL</category>
        
        
      </item>
    
      <item>
        <title>SELECT within SELECT Tutorial</title>
        <description>&lt;h3 id=&quot;bigger-than-russia&quot;&gt;Bigger than Russia&lt;/h3&gt;
&lt;p&gt;List each country name where the population is larger than that of ‘Russia’.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'Russia'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;richer-than-uk&quot;&gt;Richer than UK&lt;/h3&gt;
&lt;p&gt;Show the countries in Europe with a per capita GDP greater than ‘United Kingdom’.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'EUROPE'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'United Kingdom'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;neighbors-of-argentina-and-australia&quot;&gt;Neighbors of Argentina and Australia&lt;/h3&gt;

&lt;p&gt;List the name and continent of countries in the continents containing either Argentina or Australia. Order by name of the country.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'South America'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Oceania'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ORDER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;between-canada-and-poland&quot;&gt;Between Canada and Poland&lt;/h3&gt;

&lt;p&gt;Which country has a population that is more than Canada but less than Poland? Show the name and the population.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Canada'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Poland'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;percentages-of-germany&quot;&gt;Percentages of Germany&lt;/h3&gt;

&lt;p&gt;Germany (population 80 million) has the largest population of the countries in Europe. Austria (population 8.5 million) has 11% of the population of Germany.&lt;/p&gt;

&lt;p&gt;Show the name and the population of each country in Europe. Show the population as a percentage of the population of Germany.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CONCAT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ROUND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Germany'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'%'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Europe'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
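The query above nests a scalar subquery inside the SELECT list. A minimal sketch of the same pattern in Python with sqlite3: SQL Zoo runs MySQL, and SQLite has no CONCAT, so the || operator stands in, and multiplying by 100.0 forces the floating-point division that MySQL performs by default. The population figures are rounded for illustration:

```python
import sqlite3

# Scalar subquery in the SELECT list: Germany's population is looked up
# once per output row.  Figures are rounded for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Germany", "Europe", 80_000_000),
        ("Austria", "Europe", 8_500_000),
        ("Brazil", "South America", 209_000_000),
    ],
)

# 100.0 (not 100) keeps the division floating-point in SQLite;
# || replaces MySQL's CONCAT for appending the percent sign.
rows = conn.execute(
    """
    SELECT name,
           ROUND(population * 100.0 /
                 (SELECT population FROM world WHERE name = 'Germany')) || '%'
    FROM world
    WHERE continent = 'Europe'
    """
).fetchall()
print(rows)
```

Austria comes out at 11%, matching the figure quoted in the prompt.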
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;bigger-than-every-country-in-europe&quot;&gt;Bigger than every country in Europe&lt;/h3&gt;

&lt;p&gt;Which countries have a GDP greater than every country in Europe? [Give the name only.] (Some countries may have NULL gdp values)&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
                    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Europe'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Europe'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;largest-in-each-continent&quot;&gt;Largest in each continent&lt;/h3&gt;

&lt;p&gt;Find the largest country (by area) in each continent, show the continent, the name and the area:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;
          &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
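This is a correlated subquery: the inner SELECT runs once per outer row, restricted to that row’s continent, and the `area > 0` guard keeps NULL areas from defeating ALL. A rough Python/sqlite3 sketch of the same idea; SQLite does not support the `>= ALL` quantifier, so the equivalent correlated MAX form is used, and the areas below are approximate figures for illustration:

```python
import sqlite3

# Correlated-subquery sketch: SQLite lacks the ALL quantifier, so the
# equivalent "equal to the per-continent MAX" form stands in.
# Areas are approximate, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, area INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Brazil", "South America", 8_500_000),
        ("Chile", "South America", 750_000),
        ("Australia", "Oceania", 7_700_000),
    ],
)

# The inner SELECT re-runs for each outer row x, comparing only
# against rows y in the same continent.
rows = conn.execute(
    """
    SELECT continent, name, area FROM world x
    WHERE area = (SELECT MAX(area) FROM world y
                  WHERE y.continent = x.continent)
    ORDER BY continent
    """
).fetchall()
print(rows)
```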
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;first-country-of-each-continent-alphabetically&quot;&gt;First country of each continent (alphabetically)&lt;/h3&gt;

&lt;p&gt;List each continent and the name of the country that comes first alphabetically.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;difficult-questions-that-utilize-techniques-not-covered-in-prior-sections&quot;&gt;Difficult Questions That Utilize Techniques Not Covered In Prior Sections&lt;/h3&gt;

&lt;p&gt;Find the continents where all countries have a population &amp;lt;= 25000000. Then find the names of the countries associated with these continents. Show name, continent and population.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;MAX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;25000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Some countries have populations more than three times that of any of their neighbors (in the same continent). Give the countries and continents.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;ALL&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;hr /&gt;
</description>
        <pubDate>Fri, 17 Aug 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/2018/08/17/Nested-SELECT.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/2018/08/17/Nested-SELECT.html</guid>
        
        <category>SQL</category>
        
        
      </item>
    
      <item>
        <title>SELECT FROM world Tutorial</title>
        <description>&lt;h3 id=&quot;introduction&quot;&gt;Introduction&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://sqlzoo.net/wiki/Read_the_notes_about_this_table.&quot;&gt;Read the notes about this table.&lt;/a&gt; Observe the result of running this SQL command to show the name, continent and population of all countries.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;large-countries&quot;&gt;Large Countries&lt;/h3&gt;

&lt;p&gt;How to use WHERE to filter records. Show the name for the countries that have a population of at least 200 million (200 million is 200000000; there are eight zeros).&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;per-capita-gdp&quot;&gt;Per capita GDP&lt;/h3&gt;

&lt;p&gt;Give the name and the per capita GDP for those countries with a population of at least 200 million.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;south-america-in-millions&quot;&gt;South America In millions&lt;/h3&gt;

&lt;p&gt;Show the name and population in millions for the countries of the continent ‘South America’. Divide the population by 1000000 to get population in millions.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000000&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'South America'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;france-germany-italy&quot;&gt;France, Germany, Italy&lt;/h3&gt;

&lt;p&gt;Show the name and population for France, Germany, and Italy.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'France'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Germany'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Italy'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;united&quot;&gt;United&lt;/h3&gt;

&lt;p&gt;Show the countries which have a name that includes the word ‘United’.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'%United%'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;two-ways-to-be-big&quot;&gt;Two ways to be big&lt;/h3&gt;

&lt;p&gt;Two ways to be big: A country is big if it has an area of more than 3 million sq km or it has a population of more than 250 million.&lt;/p&gt;

&lt;p&gt;Show the countries that are big by area or big by population. Show name, population and area.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3000000&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;OR&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;250000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;one-or-the-other-but-not-both&quot;&gt;One or the other (but not both)&lt;/h3&gt;

&lt;p&gt;Exclusive OR (XOR). Show the countries that are big by area or big by population but not both. Show name, population and area.&lt;/p&gt;

&lt;p&gt;Australia has a big area but a small population, so it should be included.
Indonesia has a big population but a small area, so it should be included.
China has a big population and a big area, so it should be excluded.
The United Kingdom has a small population and a small area, so it should be excluded.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;area&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3000000&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;XOR&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;population&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;250000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
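&lt;p&gt;One caveat worth noting: XOR is a MySQL extension rather than standard SQL, so engines such as PostgreSQL will reject the query above.  A portable equivalent (assuming the same world table) spells the exclusive-or out with AND/OR/NOT:&lt;/p&gt;

```sql
-- Portable XOR: big by area or big by population, but not both
SELECT name, population, area
FROM world
WHERE (area > 3000000 AND NOT population > 250000000)
   OR (population > 250000000 AND NOT area > 3000000);
```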
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;rounding&quot;&gt;Rounding&lt;/h3&gt;

&lt;p&gt;Show the name, the population in millions, and the GDP in billions for the countries of the continent ‘South America’.  Use the ROUND function to show both values to two decimal places.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ROUND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ROUND&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'South America'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;trillion-dollar-economies&quot;&gt;Trillion dollar economies&lt;/h3&gt;

&lt;p&gt;Show the name and per-capita GDP for those countries with a GDP of at least one trillion (1000000000000; that is, 12 zeros), with the per-capita GDP rounded to the nearest $1000.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ROUND&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;population&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000000000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;name-and-capital-have-the-same-length&quot;&gt;Name and capital have the same length&lt;/h3&gt;

&lt;p&gt;Greece has capital Athens.&lt;/p&gt;

&lt;p&gt;Each of the strings ‘Greece’ and ‘Athens’ has 6 characters.&lt;/p&gt;

&lt;p&gt;Show the name and capital where the name and the capital have the same number of characters.&lt;/p&gt;

&lt;p&gt;You can use the LENGTH function to find the number of characters in a string.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;capital&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LENGTH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LENGTH&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capital&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;matching-name-and-capital&quot;&gt;Matching name and capital&lt;/h3&gt;

&lt;p&gt;The capital of Sweden is Stockholm. Both words start with the letter ‘S’.&lt;/p&gt;

&lt;p&gt;Show the name and the capital where the first letters of each match. Don’t include countries where the name and the capital are the same word.
You can use the function LEFT to isolate the first character.
You can use &amp;lt;&amp;gt; as the NOT EQUALS operator.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;capital&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LEFT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LEFT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capital&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;capital&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;all-the-vowels&quot;&gt;All the vowels&lt;/h3&gt;

&lt;p&gt;Equatorial Guinea and Dominican Republic have all of the vowels (a e i o u) in the name. They don’t count because they have more than one word in the name.&lt;/p&gt;

&lt;p&gt;Find the country that has all the vowels and no spaces in its name.&lt;/p&gt;

&lt;p&gt;You can use the phrase name NOT LIKE ‘%a%’ to exclude characters from your results.
A query using NOT LIKE ‘%a%’ would miss countries like Bahamas and Belarus because they contain at least one ‘a’.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;world&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'%A%'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'%E%'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'%I%'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'%O%'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'%U%'&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;AND&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;NOT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;LIKE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'% %'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;hr /&gt;
</description>
        <pubDate>Sat, 11 Aug 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/2018/08/11/SELECT-Basics.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/2018/08/11/SELECT-Basics.html</guid>
        
        <category>SQL</category>
        
        
      </item>
    
      <item>
        <title>Review of General Assembly's Data Science Immersive Course</title>
        <description>&lt;p&gt;&lt;em&gt;An honest, post-graduation review of the course’s layout, syllabus, and instructors.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;General Assembly bills itself as a leader in education.  Founded in 2011, GA claims to have “transformed tens of thousands of careers through pioneering, experiential education in today’s most in-demand skills”&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; such as data science, web development, and user-experience design.  GA offers a wide range of classes and workshops, from 2-hour info sessions to 3-month, full-time immersive courses.&lt;/p&gt;

&lt;p&gt;My first introduction to General Assembly was in early 2017 when I enrolled in their &lt;a href=&quot;https://generalassemb.ly/education/excel-bootcamp/new-york-city&quot;&gt;Excel&lt;/a&gt; and &lt;a href=&quot;https://generalassemb.ly/education/sql-bootcamp-series/new-york-city&quot;&gt;SQL&lt;/a&gt; bootcamps.  These workshops were either one (Excel) or two-day (SQL) accelerated overviews of the respective tools.&lt;/p&gt;

&lt;p&gt;After six hours in the Excel bootcamp, I actually was able to walk away with a much better understanding of conditional statements, formulas, and pivot tables and was able to implement the techniques I learned in the class to my job immediately.  The SQL class also helped, but I wasn’t able to practice writing queries as much after the class.&lt;/p&gt;

&lt;p&gt;But as the months went by, I had a lingering feeling that if I wanted to advance my career in any kind of meaningful way, I needed harder-to-obtain skills than knowing how to write some formulas or manipulate some pivot tables.  In the year prior to enrolling in GA’s Data Science Immersive course, I kept returning to the curriculum, in which GA promised to teach its students the fundamentals of machine learning in 12 weeks.  I was frustrated with how stagnant and limited my skillset was at my current job and concerned about my ability to advance beyond a lower-mid level role.  I wanted to take on more interesting (and analytical) projects but was routinely overlooked in favor of someone with more technical skills.&lt;/p&gt;

&lt;p&gt;For about a year, I kept returning to that curriculum, nervous about how I could possibly swing taking three (plus) months off of work while living in NYC, and more nervous about what would happen if I didn’t.&lt;/p&gt;

&lt;h2 id=&quot;orientation-day&quot;&gt;Orientation Day&lt;/h2&gt;

&lt;p&gt;What GA (and any other bootcamp for that matter) proposes to accomplish is somewhat of an impossible task: take students from varying backgrounds, skillsets, and aptitudes and teach a wide breadth of new and complicated material in what ends up being a sprint to the finish line every single week.  My cohort was no different.  At one end of the academic spectrum, a student decided to matriculate through the bootcamp in lieu of going to college.  On the other, there was a student who had triple-majored in math, physics, and philosophy at Yale.  There were students who had had professional experience as data analysts and were looking to advance their careers in their established fields, and there were those looking to transition from completely unrelated industries.  The majority of the class had a Bachelor’s degree, but a healthy number also had Master’s degrees.  One student had a Ph.D. in Statistics.  About half of the class had majored in math or a hard science.  This is all to say, while it was a diverse cohort, many of my fellow students were quantitatively inclined.&lt;/p&gt;

&lt;h2 id=&quot;syllabus&quot;&gt;Syllabus&lt;/h2&gt;

&lt;p&gt;The Data Science course is taught simultaneously across the U.S., divided into two sections, East Coast and West Coast, using a “connected classroom” format.  Over a video call, a global instructor would teach the lecture material to all five East Coast locations (New York, Boston, DC, Atlanta, and Austin) and then turn around and teach the same material to the West Coast cohorts.&lt;/p&gt;

&lt;p&gt;Surprisingly, it was fairly easy to engage with the lecturer via a dedicated Slack channel.  During lecture, students would pose questions on Slack, and either the global instructor would answer them on the spot or, more often, lecture would continue while the local instructors, who were in the classroom with us, would answer them via Slack.  If you think this sounds like there was a lot happening all of the time during lecture, you’re right.  There were multiple topic-related side discussions going on at once during lecture that you could choose to follow.  (They were almost always related to a spin-off question posed during lecture.)  This was my first time using Slack, but I soon learned to love it because it allowed me to go back and review a written record of the lecture, and specifically, answers to questions that I may not have fully understood the first time around.&lt;/p&gt;

&lt;p&gt;I included a diagram of the schedule (seen below) from GA’s website because it more or less describes how class was run every day.  (&lt;strong&gt;NB:&lt;/strong&gt; While the course material is the same across the different locations, the overall experience of the course varies slightly based on the local instructor, the makeup of the class, and class-time restrictions mandated at the state level.  Because I am based in New York and took the course in New York, this post only describes my experience in that location.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/schedule.png&quot; alt=&quot;schedule.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I would caveat that while the post-lecture reviews are &lt;em&gt;technically&lt;/em&gt; optional, they really aren’t in practice.  This is time that the local instructor uses to go over questions related to the lecture material or work with students individually to answer questions related to labs or projects.  Trust me, you won’t want to skip out on this.  In New York, there wasn’t a “Community Meetup” that I can recall.  We basically just ended lecture around 4 PM, took a ten minute break, and then had our afternoon lab/review time right up until the end of the day at 6 PM when our local instructor congratulated us for surviving another day.&lt;/p&gt;

&lt;p&gt;While I’m sure the material changes slightly with each cohort, this is a summary of the topics we covered each week:&lt;/p&gt;

&lt;p&gt;Week 1: Intro to Python&lt;/p&gt;

&lt;p&gt;Week 2: “Pandas Appreciation Week”, SQL&lt;/p&gt;

&lt;p&gt;Week 3: Linear Regression&lt;/p&gt;

&lt;p&gt;Week 4: Classification&lt;/p&gt;

&lt;p&gt;Week 5: Web-scraping, APIs, Natural Language Processing&lt;/p&gt;

&lt;p&gt;Week 6: Advanced Modeling&lt;/p&gt;

&lt;p&gt;Week 7: Neural Networks&lt;/p&gt;

&lt;p&gt;Week 8: Unsupervised Learning&lt;/p&gt;

&lt;p&gt;Week 9: Bayesian Statistics&lt;/p&gt;

&lt;p&gt;Week 10: Spatio-temporal Modeling&lt;/p&gt;

&lt;p&gt;Week 11: Big Data (AWS, Scala, Spark)&lt;/p&gt;

&lt;p&gt;Week 12: Capstone Project Workshops&lt;/p&gt;

&lt;p&gt;Overall, I was happy with the breadth of topics we covered.  As with any bootcamp, depth is always going to be a challenge given how much material there is to cover within a certain amount of time.  That being said, GA provided many resources to further explore any given topic, and post-course, I have found myself reviewing materials and lesson plans.&lt;/p&gt;

&lt;p&gt;There is, however, one change I would make to the syllabus: Week 9.  We spent a whole week on high-concept Bayesian statistics, and given that I &lt;a href=&quot;https://thedatasleuth.github.io/about/&quot;&gt;majored in Ancient Greek&lt;/a&gt;, this was an uphill battle.  While I do believe that a fundamental grounding in statistical analysis is important to Data Science, I believe our time would have been better spent learning R - another highly sought-after programming skill geared more toward statistical analysis.  When time permits, that will be my next challenge.&lt;/p&gt;

&lt;h2 id=&quot;meet-the-instructors&quot;&gt;Meet the Instructors&lt;/h2&gt;

&lt;p&gt;There were four global instructors, all with impressive credentials in statistics, machine learning, and development.  Three out of the four lecturers were excellent.  They were able to explain difficult concepts in an organized and coherent way, often with supplemental powerpoint materials that enhanced my ability to grasp new concepts, especially ones that were more math and theory-based.  There were also many in-class “code-alongs” during which we practiced actual execution and wrote code in tandem with the instructor to reinforce programming skills.  This technique was particularly useful when we were first learning how to use the neural network libraries, &lt;a href=&quot;https://thedatasleuth.github.io/neural-network/keras/kaggle/2018/06/07/Intro-to-Keras.html&quot;&gt;Keras&lt;/a&gt; and TensorFlow.&lt;/p&gt;

&lt;p&gt;We also had a local instructor and a TA in class with us.  I cannot say enough good things about my local instructor.  He had the impossible task of raising 20 baby Data Scientists and did it with the patience of a demi-god.  He explained and reviewed difficult concepts multiple times until things began to stick.  He debugged coding errors like a ninja.  He supported my ideas for projects and my capstone and he helped me achieve my overall goals for the course.  He was approachable, likable, and compassionate and one of the main reasons I was able to make it through 12 weeks of Data Science.&lt;/p&gt;

&lt;h2 id=&quot;graduation-day&quot;&gt;Graduation Day&lt;/h2&gt;

&lt;p&gt;We also had an Outcomes meeting every Thursday afternoon, during which we discussed topics related to our pending job searches.&lt;/p&gt;

&lt;p&gt;In addition to our weekly homework assignments and bi-weekly projects, we were also required to hand in weekly “Outcomes” assignments, including Brand Statements, updated resumes, LinkedIn profiles, and written responses to questions like, “Tell me about yourself.”&lt;/p&gt;

&lt;h2 id=&quot;a-little-commencement-speech&quot;&gt;A little Commencement Speech&lt;/h2&gt;

&lt;p&gt;Matriculating through this bootcamp was one of the harder things I’ve ever done.  Unlike college, where I had at most three one-hour lectures a day, in the bootcamp we consistently had about four hours of lecture followed by about two hours of review, in addition to another two hours or so of workshop time.  In 12 weeks, we had to complete 20 out of 25 (80%) labs in addition to four projects &lt;em&gt;and&lt;/em&gt; a capstone, which averages out to about two assignments due per week.  It was a grueling pace, made more difficult by the fact that these were topics I had never been exposed to before.&lt;/p&gt;

&lt;p&gt;There are also personal and financial considerations.  I don’t think I saw my friends more than once during the entire three months I was in the course because I spent my nights and weekends working on labs and/or projects.  I hired a dog walker to help out with my big girl and ended up enrolling in MealPal because I just didn’t have time to cook.  By Week 8, I was exhausted, struggling to keep up with assignments, and staring down the barrel of a whole other month.  In Week 9, my 10-year-old computer broke.  (Fortunately, I had my work backed up.)&lt;/p&gt;

&lt;p&gt;With all of that being said, though, if you are thinking of enrolling in a bootcamp, I would recommend it.  While I am still learning and improving my Data Science skills every day, I would absolutely say that the bootcamp gave me a foundation in programming that I previously thought unattainable.  Before GA, I had never coded in my life.  In 12 weeks, I was able to build predictive models, scrape the web, apply natural language processing, analyze complex datasets, and probably most importantly, gain confidence that I &lt;em&gt;could&lt;/em&gt; learn programming techniques.  My unsolicited advice to anyone considering one of these programs is:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Get help&lt;/strong&gt; with the household (kids, dog, cleaning, meal prep) if you can swing it.  Every task taken off your plate during this time will help relieve pressure.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Prepare financially&lt;/strong&gt; to cover living expenses for the three months while in bootcamp &lt;em&gt;and&lt;/em&gt; the three months after while job searching.  (My experience thus far has been that jobs worth getting move slowly.)  General Assembly (and maybe other bootcamps) offers loans for tuition as well as living expenses, if, like me, you don’t have about 25k cash lying around while living in NYC.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;And finally, &lt;strong&gt;find your grit&lt;/strong&gt;.  Unless you are a data ninja before enrolling in the course (in which case I would wonder why you felt the need for such a course to begin with), it’s going to be hard.  You will feel pressed for time, constantly.  There will be concepts that you will not get right away.  You will get stuck in your code.  You may even fail a quiz, or five.  This is a high-stakes environment given the cost of the class, the pace of the syllabus, the difficulty of the concepts, and the pressure to secure employment after.  Knowing in your gut why you are doing this and what it will mean for you after - higher income, greater job stability, maybe even a better work/life balance - will carry you through.  That, and a healthy supply of snacks.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If I can answer any more questions specifically about my own experience as a now alum of GA’s Data Science Immersive course, feel free to reach out to me via my &lt;a href=&quot;https://thedatasleuth.github.io/contact/&quot;&gt;contact form&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here’s a picture of my cohort on our (very long) last day of class after 20 capstone presentations!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/dsi-pic.jpg&quot; alt=&quot;dsi-pic.jpg&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://generalassemb.ly/why-ga-is-worth-it&quot;&gt;https://generalassemb.ly/why-ga-is-worth-it&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Tue, 17 Jul 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/general-assembly/2018/07/17/General-Assembly-Data-Science-Review.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/general-assembly/2018/07/17/General-Assembly-Data-Science-Review.html</guid>
        
        <category>General Assembly</category>
        
        <category>Coding Bootcamp</category>
        
        
        <category>General-Assembly</category>
        
      </item>
    
      <item>
        <title>Can SEC Filings Predict Enforcement Actions?</title>
        <description>&lt;p&gt;&lt;em&gt;Complex financial schemes usually involve a lot of documents, but what if machine learning could help financial regulators identify which corporations to investigate?&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The FBI and the American Association of Certified Fraud Examiners estimate that white collar crime costs 300 to 600 billion dollars a year.  That’s the equivalent of Warren Buffett’s net worth 3.5 times over, or buying out the debt of New York City’s MTA 9 times over, or purchasing the New York Knicks 90 times over.  It’s an incredible amount of money and also a very wide range, suggesting that enforcement agencies don’t actually know the extent of the damage caused by white collar crime.&lt;/p&gt;

&lt;p&gt;And that damage is not just limited to the crime itself - the ripple effects of financial crime are even wider and more opaque, and they undoubtedly affect every single one of us.  Companies that are defrauded will raise prices on consumer goods to recoup losses.  Scammed insurance companies will demand higher premiums on their policies.  Hedge fund schemes that manipulate equity markets can decrease stock prices.  And if the government gets taken for a ride, we all know what happens then - higher taxes.&lt;/p&gt;

&lt;p&gt;Identifying these bad actors is difficult.  Complex financial schemes usually involve a lot of documents, some of which demand law or accounting degrees to truly understand.  Moreover, as with many government agencies, there just aren’t enough people to review all of those complex documents.  And the select few who are tasked with reviewing these complex cases are doing it with outdated and insufficient technology.&lt;/p&gt;

&lt;p&gt;Corporations that are listed on any U.S. stock exchange are required to submit financial reports to the Securities and Exchange Commission (SEC) to inform potential investors of the company’s financial health.  There are &lt;a href=&quot;https://en.wikipedia.org/wiki/SEC_filing&quot;&gt;hundreds&lt;/a&gt; of filing types that a corporation can submit, and these filings include information like &lt;a href=&quot;https://www.investopedia.com/terms/1/10q.asp&quot;&gt;quarterly&lt;/a&gt; and &lt;a href=&quot;https://www.investopedia.com/terms/1/10-k.asp&quot;&gt;annual&lt;/a&gt; financial statements, &lt;a href=&quot;https://www.investopedia.com/terms/1/8-k.asp&quot;&gt;major events&lt;/a&gt;, &lt;a href=&quot;https://www.sec.gov/info/edgar/forms/edgform.pdf&quot;&gt;changes to the institution’s organizational structure&lt;/a&gt;, and &lt;a href=&quot;https://www.investopedia.com/terms/s/sec-form-f-1.asp&quot;&gt;foreign investments&lt;/a&gt;, to name a few.  The SEC makes these filings publicly available on their website.&lt;/p&gt;

&lt;p&gt;To that end, I wanted to use data science techniques like web scraping, exploratory data analysis, and machine learning to see whether I could identify ‘bad actor’ corporations by the types of reports they file in concert with other demographic information to help financial regulators like the SEC focus their investigations of white collar crime.&lt;/p&gt;

&lt;h2 id=&quot;identifying-the-target&quot;&gt;Identifying the Target&lt;/h2&gt;

&lt;p&gt;My first task was to identify my targets so I could train my model on what suspicious cases might look like.  I researched SEC indictment reports and found that the SEC releases annual reports of its enforcement cases.  &lt;a href=&quot;https://www.sec.gov/files/2017-03/secstats2016.pdf&quot;&gt;For example&lt;/a&gt;, these reports contain a list of indictments with information about the defendant (either an individual or a corporation) as well as date information and related charges.  I located reports for the years 2006-2016.  Given that these reports were in PDF format, my next task was to find a way to extract the information I needed into a format that I could eventually feed into my model.  I first tried using a package that reads in material from PDFs, but because extracting text is an inherently unstable process, the formula I wrote to extract text from the first report did not extract text in the same way from the next.  Because I only had a couple of weeks to complete this project and needed to extract text from eleven reports just to identify my targets, I quickly abandoned this method.  I later found a free website that more accurately (and quickly) converted my reports to Excel files.  In data science as in many disciplines, sometimes the simplest tool is the best.&lt;/p&gt;

&lt;p&gt;The next big hurdle to overcome was the fact that names of individuals and corporations are not easily fed into searches.  In a future iteration of this project, I would try to use some kind of fuzzy matching algorithm, like the &lt;a href=&quot;https://github.com/seatgeek/fuzzywuzzy&quot;&gt;fuzzywuzzy&lt;/a&gt; package, to feed the names of the defendants into a search box.&lt;/p&gt;
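
&lt;p&gt;As a minimal sketch of the idea - using Python’s built-in difflib as a stand-in for fuzzywuzzy, with made-up company names - a similarity score could be computed like this:&lt;/p&gt;

```python
from difflib import SequenceMatcher

def fuzzy_ratio(a, b):
    # 0-100 similarity score, similar in spirit to fuzzywuzzy's fuzz.ratio
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

# Hypothetical defendant name vs. an EDGAR company name
score = fuzzy_ratio("Acme Holdings, Inc.", "ACME HOLDINGS INC")

# Pick the closest EDGAR candidate for a scraped defendant name
candidates = ["Acme Holdings, Inc.", "Beta Corp"]
best = max(candidates, key=lambda name: fuzzy_ratio(name, "ACME HOLDINGS INC"))
```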

&lt;p&gt;For now, I manually entered the names of the cases into the EDGAR search field to identify their corresponding CIK ID numbers.  Because I didn’t want to manually search 5,000 entries, I pared my initial list of defendants down to corporations.  Many of the names of the corporate defendants did not turn up a CIK number.  (For some cases, this makes sense given that these corporations were accused of financial crime and may never have filed the required reports.)  From the 1,600 or so corporate cases I had, I was able to identify 175 corporations with CIK ID numbers.  I finally had my target set.&lt;/p&gt;

&lt;h2 id=&quot;putting-the-case-together&quot;&gt;Putting the Case Together&lt;/h2&gt;

&lt;p&gt;But what about my Not Indicted class?  If I was going to make predictions, I needed to build out my dataset with values associated with corporations that have not had enforcement actions taken against them, given that that is the overwhelming norm.  Initially, I thought about repeating the same process with a random list of companies from the S&amp;amp;P500, but after more research, I found a &lt;a href=&quot;http://api.corpwatch.org/&quot;&gt;similar project&lt;/a&gt; that had already done some of the work I wanted to do.  Given my time constraints, I decided to download their datasets, which also identify corporations by the Central Index Key (CIK).  These datasets contained the types of data points I was looking for, namely, locations and years active.  I suspected that the 175 target corporations I had previously identified were somewhere in the 65,000 rows of corporations, so I subset my target dataset out of the much larger dataset.  When I compared the shapes of the prior set to the new set, sure enough, 175 rows had been dropped.  I created a new column within my new dataset called ‘Indicted’ and set it to ‘0’.  Then I added the corresponding ‘Indicted’ column to my target dataset, set it to ‘1’ (positive for having been indicted), and concatenated the two datasets along the 0 axis.&lt;/p&gt;
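
&lt;p&gt;The subset-label-concatenate step can be sketched with pandas roughly as follows; the tiny corps and targets frames here are hypothetical stand-ins for the CorpWatch download and my indicted-target list.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical stand-ins: the full CorpWatch-style table and the indicted targets
corps = pd.DataFrame({"cik": [1001, 1002, 1003, 1004], "years_active": [12, 1, 7, 2]})
targets = pd.DataFrame({"cik": [1002, 1004], "years_active": [1, 2]})

# Subset the targets out of the larger set, then label each class
not_indicted = corps[~corps["cik"].isin(targets["cik"])].copy()
not_indicted["Indicted"] = 0
indicted = targets.copy()
indicted["Indicted"] = 1

# Stack the two labeled classes back together along the 0 axis
full = pd.concat([not_indicted, indicted], axis=0, ignore_index=True)
```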

&lt;p&gt;With the CIK ID numbers for both my Indicted and Not Indicted classes, I was able to scrape the SEC website for their filings using a nested for loop with the requests and Beautiful Soup libraries.  Having saved all of the filings in a list for each of my 65,000 rows, I then transformed those individual filings into their own unique dummy columns and filled each column with the value counts of the filing.  With this final transformation, I had a working dataset that I could analyze and subsequently feed into a model.&lt;/p&gt;

&lt;h2 id=&quot;witness-selection&quot;&gt;Witness Selection&lt;/h2&gt;

&lt;p&gt;I knew that I was likely going to use some ‘blackbox’ models that would limit my ability to interpret what weights were given to which features in predicting the Indicted class.  So, I decided to also use a tool that does allow for interpretability: &lt;a href=&quot;http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html&quot;&gt;SelectKBest&lt;/a&gt;, which identified the features most predictive of my Indicted class.  One of the top ten features identified was the number of years a corporation was active.  As visualized in the plot below, half of the Indicted class was active for less than a year, whereas the Not Indicted class shows more of a normal distribution.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/Years_Active.png&quot; alt=&quot;Years_Active.png&quot; /&gt;&lt;/p&gt;
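
&lt;p&gt;A minimal sketch of SelectKBest, run here on a synthetic feature matrix rather than my real filings data:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the filings-count feature matrix and Indicted labels
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Score every feature against the label and keep the ten strongest
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
top_features = selector.get_support(indices=True)  # column indices of the winners
```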

&lt;p&gt;In addition to the number of years a corporation has been active, SelectKBest also identified several filing types that were highly predictive of the minority class, including the ‘2-A’, a report of sales and uses of proceeds,&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; the ‘ADV-H-T’, an application for temporary hardship,&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; the ‘DEFR14C’, a filing that discloses pertinent information that is required to be disclosed to shareholders but does not require a shareholder vote for approval,&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; and the ‘SC 14F1/A’, a filing disclosing a majority change in a corporation’s directors.&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The Indicted class filed these types of filings slightly more frequently than did the Not Indicted class, as illustrated in the plot below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/filings-updated.png&quot; alt=&quot;filings-updated.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In analyzing the purpose of these filings, I posited that some corporations might be “hiding” pertinent information that in fact should be voted on by shareholders, or, in the case of the ‘SC 14F1/A’, registering the corporation with a ‘straw owner’ to give an air of legitimacy before the “real” directors take over to pursue potentially criminal activities.&lt;/p&gt;

&lt;h2 id=&quot;balancing-the-scales&quot;&gt;Balancing the Scales&lt;/h2&gt;

&lt;p&gt;My baseline accuracy for this project was 99.7%.  This means that if you guessed that a corporation was not indicted (0 in the target column), you would be right 99.7% of the time.  With my majority and minority classes so imbalanced, it is difficult for the machine to pick up on the (very subtle) signals that the minority class is sending out.  To counteract this imbalance, I used several different class balancing techniques to amplify the minority class’s signals.&lt;/p&gt;

&lt;p&gt;First, I used a resampling technique called downsampling from scikit-learn’s utils package.  This technique allowed me to reduce the majority population to “quiet” the signal the majority class sends out.  I chose to downsample my majority class from 65,324 rows to 10,000 and assumed that 10,000 samples of the majority class would have enough variety to represent the full 65,324 set.&lt;/p&gt;
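
&lt;p&gt;The downsampling step can be sketched with scikit-learn’s resample utility; the array below is a stand-in for my 65,324 majority-class rows.&lt;/p&gt;

```python
import numpy as np
from sklearn.utils import resample

# Stand-in for the 65,324 majority-class rows
majority = np.arange(65324).reshape(-1, 1)

# Draw 10,000 rows without replacement to quiet the majority class
majority_down = resample(majority, replace=False, n_samples=10000, random_state=42)
```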

&lt;p&gt;With my majority class muffled a little bit, I then used another sampling tool called SMOTEENN to try to amplify the signals of my minority class (the Indicted class).  Imblearn’s &lt;a href=&quot;http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.combine.SMOTEENN.html&quot;&gt;SMOTEENN&lt;/a&gt; is a slight variation of &lt;a href=&quot;http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html&quot;&gt;SMOTE&lt;/a&gt; (Synthetic Minority Oversampling TEchnique), which randomly selects data points in the minority class and creates synthetic points that mimic the characteristics of the minority class based on a k-nearest-neighbors distance calculation.  SMOTEENN differs slightly from SMOTE by additionally using Edited Nearest Neighbors (ENN), which in essence “cleans” the data set of points that are not strong representatives of their class.  Put another way, it reduces noise within the data set for both classes.  I verified that the classes were properly balanced by checking their shapes; in fact, both had 10,892 rows of data.&lt;/p&gt;

&lt;p&gt;I then performed two train-test-splits on the dataset - the first set would be used to initially train my models on my X_train and y_train.  Within this first set, I would conduct another train-test-split on my X_train and y_train, which would be split into my cross-validation set (X_train_val, y_train_val, X_test_val, y_test_val).  Cross-validating my models on a dataset that the model hasn’t been trained on guards against overfitting my models to any one particular dataset.  If and when I was satisfied with the results of my models and with how my parameters were tuned, I would then predict on the X_test from the initial train-test-split - another dataset that the models had not yet seen.&lt;/p&gt;
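
&lt;p&gt;On toy data, the double train-test-split looks roughly like this, using the same variable names as above:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(400).reshape(200, 2)
y = np.tile([0, 1], 100)

# First split: hold out a final test set the models never see during tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Second split: carve a cross-validation set out of the training data
X_train_val, X_test_val, y_train_val, y_test_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)
```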

&lt;h2 id=&quot;reaching-a-verdict&quot;&gt;Reaching a Verdict&lt;/h2&gt;

&lt;p&gt;When running my models, I decided to score on precision because I was most concerned with reducing false positives while also trying to increase the number of true positives as much as possible.  The equation for precision, seen below, calculates the positive predictive value but penalizes the score by how many positive values were predicted incorrectly.  In my project, a false positive would be a corporation that was predicted to be indicted but should not have been.&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}&lt;/script&gt;

&lt;p&gt;In a slightly different metric, recall measures the true positive rate by penalizing the score with false negatives (seen below).  In this project, a false negative would be a corporation that should have been indicted but wasn’t.&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}&lt;/script&gt;

&lt;p&gt;In sum, both precision and recall are measures of truthiness but with slightly different perspectives.  Because it is a much more serious thing to indict a corporation that is innocent (both in terms of potential monetary penalties when the corporation turns around and sues the government as well as lost faith in the justice system) than to not indict a potentially guilty corporation, I decided that precision was the most important measure of success.&lt;/p&gt;

&lt;p&gt;To visualize this decision, I plotted two confusion matrices for the models I ran, representing the cross-validation run (left) and the test run (right).  Despite the name, a confusion matrix is quite a simple tool to interpret: it is a matrix of four values - true positives (values that were predicted positive and supposed to be positive), false positives (values that were predicted positive but supposed to be negative), true negatives (values that were predicted negative and supposed to be negative), and false negatives (values that were predicted negative but supposed to be positive).  In the cross-validation run, both the Decision Tree and XGBoost Classifiers kept the false positives to 0 while predicting 14 true positives.  They both left 19 false negatives on the table (corporations that should have been indicted but weren’t).  The Random Forest and Balanced Bagging Classifiers both predicted more true positives (18 and 21, respectively), but they also predicted more false positives (5 and 17, respectively).  In my mind, the benefit of indicting 4 to 7 more corporations does not outweigh the cost of inappropriately accusing 5 to 17 corporations (and the attendant damage caused to a company embroiled in litigation).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/confusion-matrix.png&quot; alt=&quot;confusion-matrix.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To focus in on the Indicted class even more, I also plotted the classification reports of each of the models.  Similar to a confusion matrix, a classification report is another tool for evaluating model performance beyond accuracy.  Among the outputs of a classification report are precision, recall, and an f1-score (a weighted average of the precision and recall metrics).  Because the Decision Tree and XGBoost Classifiers both reduced false positives to 0, they have perfect precision scores of 1.0.  However, because they both left corporations “on the table”, their recall scores are not as strong.  Because the goal of this project was to flag as many corporations as possible for further investigation without accusing a corporation that should not be investigated, I considered these metrics a success.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/classification-report.png&quot; alt=&quot;classification-report&quot; /&gt;&lt;/p&gt;
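
&lt;p&gt;Both tools come straight from scikit-learn’s metrics module.  Here is a toy sketch on invented labels that mirrors the Decision Tree and XGBoost outcome - zero false positives, but a couple of false negatives left on the table:&lt;/p&gt;

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels: every predicted indictment is correct,
# but two truly indicted corporations are missed
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)  # rows are true classes, columns are predictions
report = classification_report(y_true, y_pred)  # precision, recall, f1 per class
```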

&lt;h2 id=&quot;areas-of-further-research&quot;&gt;Areas of Further Research&lt;/h2&gt;

&lt;p&gt;While this was a toy project based on a handful of inputs related to corporate demographics and filing types, I think it demonstrates the point that machine learning techniques can help financial regulatory agencies identify potential targets for further investigation.  If I were to continue researching this topic, I would be interested in pulling in more data to feed the model, such as the names of the directors as well as any associated negative news.  I would also be interested in “reading” the filings themselves using natural language processing, and specifically, a Term Frequency - Inverse Document Frequency (TF-IDF) analysis to identify any rare terminology that might signal unusual behavior.&lt;/p&gt;
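
&lt;p&gt;As a rough sketch of what that TF-IDF step could look like, with entirely made-up filing snippets:&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up filing snippets purely for illustration
docs = [
    "quarterly revenue increased on strong sales",
    "quarterly revenue decreased on weak sales",
    "undisclosed related party transfer of proceeds",
]

# Terms that appear in few documents get the highest weights
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_  # maps each term to its column index
```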

&lt;p&gt;To see all of the code associated with this project, head on over to the related &lt;a href=&quot;https://github.com/jesster413/SEC-Filings&quot;&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.investopedia.com/terms/s/sec-form-2-a.asp&quot;&gt;https://www.investopedia.com/terms/s/sec-form-2-a.asp&lt;/a&gt; &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.iard.com/support_hardship&quot;&gt;https://www.iard.com/support_hardship&lt;/a&gt; &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.investopedia.com/terms/s/sec-form-defr14c.asp&quot;&gt;https://www.investopedia.com/terms/s/sec-form-defr14c.asp&lt;/a&gt; &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.sec.gov/info/edgar/forms/edgform.pdf&quot;&gt;https://www.sec.gov/info/edgar/forms/edgform.pdf&lt;/a&gt; &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Fri, 06 Jul 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/predictive-modeling/web-scraping/2018/07/06/Can-SEC-Filings-Predict-Enforcement-Actions.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/predictive-modeling/web-scraping/2018/07/06/Can-SEC-Filings-Predict-Enforcement-Actions.html</guid>
        
        <category>Balanced Classes</category>
        
        <category>Web Scraping</category>
        
        <category>XGBoost</category>
        
        <category>Random Forest</category>
        
        <category>Python</category>
        
        <category>Beautiful Soup</category>
        
        
        <category>Predictive-Modeling</category>
        
        <category>Web-scraping</category>
        
      </item>
    
      <item>
        <title>Predicting West Nile Virus in Chicago Using Advanced Modeling Techniques</title>
        <description>&lt;p&gt;&lt;em&gt;In June 2015, the City of Chicago released data to Kaggle and asked competitors to predict which areas of the city would be more prone to West Nile Virus outbreaks.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;West Nile Virus (WNV) was first discovered in Uganda in 1937 and persists through the present day in many disparate parts of the world, spreading primarily through female mosquitoes and affecting several types of bird species, as well as other mammals like horses and humans.  Infected victims can either be asymptomatic (75% of cases) or display symptoms like fever, rash, vomiting (20% of cases), encephalitis/meningitis (1% of cases), and in rare instances death (1 out of 1500).&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;  While the first case of WNV in the US was reported in Queens, NYC in 1999, Chicago has been particularly plagued by the disease since it was first reported in the Windy City in 2001.  As of 2017, Chicago has reported 90 cases of WNV, eight of which resulted in death.&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Since 2004, Chicago has maintained a surveillance system of mosquito traps, still in effect today, designed to monitor which areas of the city test positive for WNV.  Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus.  The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations.&lt;/p&gt;

&lt;p&gt;In June 2015, the City of Chicago &lt;a href=&quot;https://www.kaggle.com/c/predict-west-nile-virus&quot;&gt;released their data&lt;/a&gt; to Kaggle in the form of a competition, asking competitors to analyze their datasets and predict which areas of the city would be more prone to WNV outbreaks, thereby allowing the City and the Chicago Department of Public Health (CDPH) to allocate resources more efficiently.  The prize: $40,000.&lt;/p&gt;

&lt;p&gt;While I was a little late to the competition, this dataset remains one of the more popular sets to analyze, given its depth, complexity, and unique inherent challenges.  Below, I identify the challenges that came up for me when analyzing this dataset as well as the techniques I used to resolve them.&lt;/p&gt;

&lt;h2 id=&quot;going-in-for-a-check-up&quot;&gt;Going in for a check-up&lt;/h2&gt;

&lt;p&gt;As with many projects, the raw data was split across several different files, including a train.csv, a test.csv, a weather.csv, and a spray.csv.  The spray.csv was a collection of locations and timestamps where pesticides were sprayed - I initially set this aside given that it was not immediately obvious to me how I could incorporate it into my model.&lt;/p&gt;

&lt;p&gt;There was also a file filled with location points that represented the city of Chicago.  I used this file to identify the traps where WNV was present.  These traps are represented by red ‘x’ marks in the map below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/wnv-map.png&quot; alt=&quot;wnv-map.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For the weather.csv, I decided to drop the ‘Water1’, ‘Depart’, and ‘Depth’ columns given the volume of missing values.  I also decided to drop the CodeSum category.  Some columns also contained strange values like ‘  T’ (likely trace amounts), which I decided to replace with 0 instead.  Finally, I noticed that weather data was collected by two weather stations and seemed largely duplicative, so I decided to slice on station 1.&lt;/p&gt;
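
&lt;p&gt;The cleaning steps above might be sketched in pandas like this; the tiny frame is a stand-in for the real weather.csv, though the column names come from that file.&lt;/p&gt;

```python
import pandas as pd

# Stand-in slice of weather.csv (real column names, invented values)
weather = pd.DataFrame({
    "Station": [1, 2, 1, 2],
    "Tavg": [74, 75, 68, 69],
    "PrecipTotal": ["0.12", "  T", "0.00", "  T"],
    "Water1": [None, None, None, None],
})

# Drop the mostly-missing column, zero out trace markers, keep only station 1
weather = weather.drop(columns=["Water1"])
weather["PrecipTotal"] = weather["PrecipTotal"].replace("  T", "0").astype(float)
station1 = weather[weather["Station"] == 1]
```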

&lt;p&gt;Initially, my train (and test) sets had 10 columns and 10,506 rows.  While there were no null values in the train set, there were values labeled ‘M’, which effectively were nulls.  I decided to fill the ‘M’ values with NaNs that I could drop later.  I then grouped my train dataframe on ‘Species’, ‘Date’, and ‘Trap’ by the mean to mirror how the test set was organized.  I then dropped the ‘Block’, ‘Latitude’, ‘Longitude’, and ‘AddressAccuracy’ features because I later engineered columns that I thought would be more helpful (and more predictive) of WNV.&lt;/p&gt;

&lt;h2 id=&quot;were-going-to-have-to-operate&quot;&gt;We’re going to have to operate&lt;/h2&gt;

&lt;p&gt;I decided to engineer a column called ‘LatLong’ which was a tuple of the ‘Latitude’ and ‘Longitude’ columns zipped together.  I did this because I wanted to calculate the distance of any given point in my train data from ‘hot spots’ I identified through aggregating my WNV column.  The package I used to calculate distances was &lt;a href=&quot;https://pypi.org/project/vincenty/&quot;&gt;vincenty&lt;/a&gt; from geopy.distance, which takes two arguments, both of which must be tuples.&lt;/p&gt;

&lt;p&gt;For my first arg in the vincenty function, I created two ‘hot spot’ locations of coordinates by identifying which coordinates had the highest mean value of WNV. I did this by grouping the tupled coordinates in the ‘LatLong’ column, slicing on the ‘WNVPresent’ column and sorting the ‘WNVPresent’ values in descending order.  I named the two variables for the top two locations where WNV was present centroid1 (41.974689, -87.890615) and centroid2 (41.673408, -87.599862).&lt;/p&gt;
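
&lt;p&gt;That group-and-sort step can be sketched on toy trap data like so:&lt;/p&gt;

```python
import pandas as pd

# Toy trap observations; LatLong is a (lat, lon) tuple per row
df = pd.DataFrame({
    "LatLong": [(41.974689, -87.890615), (41.673408, -87.599862), (41.974689, -87.890615)],
    "WnvPresent": [1, 0, 1],
})

# Mean WNV rate per location, highest first; the top index is the hot spot
hot_spots = df.groupby("LatLong")["WnvPresent"].mean().sort_values(ascending=False)
centroid1 = hot_spots.index[0]
```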

&lt;p&gt;With my centroids identified, I was able to write a for-loop that calculated the distances in miles for each of the tupled location points from the two centroids and saved the results in a list.  I applied these lists to my dataframe in new features called ‘Distances1’ and ‘Distances2’.  I also created another column called ‘Close_to_Centroid2’, which was a lambda function of my ‘Distances2’ feature that binarized whether a point was within five miles of the centroid or not.&lt;/p&gt;
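
&lt;p&gt;I used geopy’s vincenty for the real distances; as a self-contained sketch, here is a haversine great-circle calculation (a close approximation to Vincenty at these scales) with the same tuple interface, plus the five-mile binarization.  The two location points are invented.&lt;/p&gt;

```python
import math

def haversine_miles(p1, p2):
    # Great-circle distance in miles between two (lat, lon) tuples
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 3958.8 * 2 * math.asin(math.sqrt(a))

centroid1 = (41.974689, -87.890615)
centroid2 = (41.673408, -87.599862)

# Toy location points standing in for the trap coordinates
points = [(41.95, -87.65), (41.70, -87.60)]
distances1 = [haversine_miles(p, centroid1) for p in points]

# 1 if within five miles of centroid2, else 0
close_to_centroid2 = [int(not (haversine_miles(p, centroid2) > 5.0)) for p in points]
```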

&lt;p&gt;Given that the ‘Species’ and ‘Trap’ features were categorical data, I label encoded them into numerical values for modeling purposes.  I also applied the datetime function within two lambda functions that I mapped to the ‘Date’ feature.  This transformed the single ‘Date’ column into two - the first date column represented the day of the year (out of 365), and the second column represented the year.  This way, these date features could be fed into the model as numerical values instead of objects.&lt;/p&gt;

&lt;p&gt;Finally, I merged the weather set onto my train set in preparation of modeling.&lt;/p&gt;

&lt;h2 id=&quot;were-going-to-need-to-run-some-tests&quot;&gt;We’re going to need to run some tests&lt;/h2&gt;

&lt;p&gt;One of the most difficult aspects of this project was the fact that the baseline accuracy for the dataset was 95%.  This meant that if you were to guess that WNV was not present, you would be correct 95 out of 100 times.  While this may not seem like a big deal, it is very costly to the City to A. spray expensive pesticides where WNV is not present and run the risk of unnecessarily exposing residents to the toxins (a false positive) and B. not spray pesticides where WNV is present and run the risk of residents becoming ill with the virus and the City potentially being sued for negligence (a false negative).&lt;/p&gt;

&lt;p&gt;Keeping this in mind, I installed the &lt;a href=&quot;http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html&quot;&gt;imblearn&lt;/a&gt; package, which contains a class balancer, SMOTEENN.  Imblearn’s &lt;a href=&quot;http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.combine.SMOTEENN.html&quot;&gt;SMOTEENN&lt;/a&gt; is a slight variation of &lt;a href=&quot;http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html&quot;&gt;SMOTE&lt;/a&gt; (Synthetic Minority Oversampling TEchnique), which randomly selects data points in the minority class and creates synthetic points that mimic the characteristics of the minority class based on a k-nearest-neighbors distance calculation.  SMOTEENN differs slightly from SMOTE by additionally using Edited Nearest Neighbors (ENN), which in essence “cleans” the data set of points that are not strong representatives of their class.  This allows the model to learn more deeply about what characteristics are inherent to the target class.&lt;/p&gt;

&lt;p&gt;After balancing the classes, I ran three models: a Random Forest Classifier, &lt;a href=&quot;https://github.com/dmlc/xgboost&quot;&gt;XGBoost&lt;/a&gt;, and a Balanced Bagging Classifier.  I chose these models for their powerful ability to extract information from features and, in the case of XGBoost, to learn from mistakes made in the Decision Tree learning process and incorporate that information gain back into the training process.&lt;/p&gt;

&lt;h2 id=&quot;whats-the-prognosis&quot;&gt;What’s the prognosis?&lt;/h2&gt;

&lt;p&gt;ROC-AUC is short for Receiver Operating Characteristic - Area Under the Curve.  It is a metric created by plotting the true positive rate against the false positive rate.&lt;/p&gt;

&lt;p&gt;The true positive rate is also known as recall, where true positives are divided by the sum of true positives and false negatives.  The true positive rate is a metric that evaluates the proportion of positive data points that were correctly predicted as positive.  A higher true positive rate means better accuracy with respect to positive data points.&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;The false positive rate is called the fall-out and is calculated by the number of false positives divided by the sum of false positives and true negatives.  This can be interpreted as the proportion of negative data points that were incorrectly predicted as positive.  A higher false positive rate means more negative data points are classified incorrectly as positive.&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
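
&lt;p&gt;scikit-learn computes both the curve and the area under it directly; here is a small sketch on invented trap probabilities:&lt;/p&gt;

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Invented probabilities for eight traps, three of which truly had WNV
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.10, 0.30, 0.20, 0.80, 0.60, 0.40, 0.35, 0.05]

auc = roc_auc_score(y_true, y_score)               # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points along the curve
```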

&lt;p&gt;I plotted the ROC-AUC curves of my three models below.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Random Forest&lt;/th&gt;
      &lt;th&gt;XGBoost&lt;/th&gt;
      &lt;th&gt;Balanced Bagging Classifier&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;img src=&quot;/static/img/rf-rocauc.png&quot; alt=&quot;rf-rocauc.png&quot; /&gt;&lt;/td&gt;
      &lt;td&gt;&lt;img src=&quot;/static/img/xgb-rocauc.png&quot; alt=&quot;xgb-rocauc.png&quot; /&gt;&lt;/td&gt;
      &lt;td&gt;&lt;img src=&quot;/static/img/bbc-rocauc.png&quot; alt=&quot;bbc-rocauc.png&quot; /&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Because the baseline accuracy score was so high already, I wanted to take a closer look at how many positive data points I correctly (and incorrectly) predicted, so I ran a confusion matrix on all three models.  I was mostly concerned with reducing the number of false positives as much as I could while increasing the true positives.  For example, while the XGBoost model produced 73 true positives, it also predicted 247 false positives.  In comparison, the Balanced Bagging Classifier model predicted only 56 positives correctly, but it also incorrectly predicted only 202 positives.  The decision of which model’s trade-off to choose depends on the goals of the project.  On the one hand, it might be better to spray pesticides on more areas that were correctly predicted to have WNV.  On the other hand, it also might be more expensive to the City.  For illustrative purposes, I plotted the true positives and false positives of all three models below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/wnv-confusion.png&quot; alt=&quot;wnv-confusion.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Having tuned my models, I submitted my predictions to Kaggle.  While my best score was only 0.61 (ROC-AUC), this project was well worth my time.  Not only did it allow me to continue learning more about tuning my models and scoring on different metrics, but it also was my first introduction to imbalanced classes.  In the real world (and especially the industries I’m interested in working in, like financial crime), many datasets are severely imbalanced because the minority class is, by its nature, a rare event.  In order to create a model that can more accurately predict the minority class, it is up to the data scientist to make the model more sensitive to the signals sent out by the minority class.  Knowing how to balance the majority and minority classes allows for better models.&lt;/p&gt;

&lt;p&gt;If you’re interested in seeing the code associated with this project, head on over to the related &lt;a href=&quot;https://github.com/thedatasleuth/West-Nile-Virus&quot;&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;http://www.northwestmvcd.org/Northwestmvcd/West_Nile.html&quot;&gt;http://www.northwestmvcd.org/Northwestmvcd/West_Nile.html&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;http://chicago.cbslocal.com/2018/05/30/west-nile-virus-reported/&quot;&gt;http://chicago.cbslocal.com/2018/05/30/west-nile-virus-reported/&lt;/a&gt; &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it&quot;&gt;https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it&lt;/a&gt; &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it&quot;&gt;https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it&lt;/a&gt; &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Fri, 15 Jun 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/predictive-modeling/2018/06/15/Predicting-West-Nile-Virus-Using-Machine-Learning.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/predictive-modeling/2018/06/15/Predicting-West-Nile-Virus-Using-Machine-Learning.html</guid>
        
        <category>Balanced Classes</category>
        
        <category>Kaggle</category>
        
        <category>XGBoost</category>
        
        <category>Random Forest</category>
        
        <category>Python</category>
        
        
        <category>Predictive-Modeling</category>
        
      </item>
    
      <item>
        <title>The Writing's on the Wall, Part 1 (Introduction to Neural Networks Using Keras)</title>
        <description>&lt;p&gt;&lt;em&gt;Visions of the future are never completely true, but with the right key, some can be more truthful than others.  Here’s an introduction to the neural network library, Keras, which I used to predict handwritten numbers.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;When dreaming of the day her husband, Odysseus, would return from war, Penelope describes a vision she had that portends his return.  Having waited twenty years already, she is wary of the dream’s validity, saying:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“Stranger, dreams verily are baffling and unclear of meaning, and in no wise do they find fulfilment in all things for men. For two are the gates of shadowy dreams, and one is fashioned of horn and one of ivory. Those dreams that pass through the gate of sawn ivory deceive men, bringing words that find no fulfilment. But those that come forth through the gate of polished horn bring true issues to pass, when any mortal sees them.” Homer, Odyssey 19. 560–569 ff (Murray translation)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Penelope describes dreams that pass through the ivory gates as deceitful whereas those that pass through gates of horn can be trusted.  As A. T. Murray notes, the Greek word for horn, κέρας, recalls a similar sounding word κραίνω meaning “fulfill” whereas ἐλέφας, the Greek word for ivory, similarly recalls the word ἐλεφαίρομαι meaning “deceive.”  Finally, &lt;a href=&quot;https://thedatasleuth.github.io/about/&quot;&gt;my degree in Ancient Greek&lt;/a&gt; paid off.&lt;/p&gt;

&lt;p&gt;François Chollet, a Google engineer, drew inspiration from this imagery, first recorded in the Odyssey nearly three thousand years ago, when naming his popular neural network library, Keras.  Designed explicitly for ease of use and speed, Keras is a deep neural network library that uses TensorFlow on the backend.&lt;/p&gt;

&lt;p&gt;For my first neural network project, I applied Keras to the MNIST (“Modified National Institute of Standards and Technology”) handwriting database &lt;a href=&quot;https://www.kaggle.com/c/digit-recognizer/data&quot;&gt;made available by Kaggle&lt;/a&gt; to predict numbers based on their handwritten equivalent.  Below is a walkthrough of the steps I took to make predictions.&lt;/p&gt;

&lt;h2 id=&quot;prepping-the-canvas&quot;&gt;Prepping the Canvas&lt;/h2&gt;

&lt;p&gt;To start, I read the training set into pandas, identifying 42,000 rows and 785 columns of data, inclusive of the target column, ‘label.’  With the exception of the target column, all of the rest of the columns were pixel features of the images.&lt;/p&gt;

&lt;p&gt;Because Keras requires a numpy array (as opposed to a pandas dataframe), I transformed the X set into a numpy matrix and then normalized the dataset by dividing each value by the maximum pixel value (255).  I also transformed the target (y) values into a one-hot encoded matrix, since this dataset is a multi-class classification problem.&lt;/p&gt;
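
&lt;p&gt;To make those two steps concrete, here is a minimal pure-Python sketch of the same transformations (the values and function names are illustrative, not the project’s actual code, which used numpy and Keras utilities):&lt;/p&gt;

```python
# Illustrative stand-ins for the two preprocessing steps: scale pixel values
# into [0, 1] and one-hot encode the integer labels for a 10-class problem.

def normalize(pixels, max_value=255.0):
    """Divide every pixel by the maximum pixel value (255)."""
    return [p / max_value for p in pixels]

def one_hot(label, num_classes=10):
    """Turn an integer label into a one-hot vector."""
    return [1 if i == label else 0 for i in range(num_classes)]

print(normalize([0, 51, 255]))  # [0.0, 0.2, 1.0]
print(one_hot(3))               # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```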

&lt;p&gt;Before feeding the data into the neural network model, I train-test-split my data so I could validate the model’s performance on held-out samples.&lt;/p&gt;

&lt;h2 id=&quot;adding-nuance&quot;&gt;Adding Nuance&lt;/h2&gt;

&lt;p&gt;I initialized the model as Sequential() and added layers to it.  Within a neural network, there are different layers that can be added to improve model performance.  First, I added a dense layer which connected all of the neurons.  Then, I added a “hidden” dropout layer.  A hidden layer does not mean that it is not present or isn’t working - it just means that it is not the initial input layer or the final output layer, as seen in the image below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/neuralnet.png&quot; alt=&quot;neuralnet.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A neural network can have any number of hidden layers with any number of neurons.  However, within each layer, each neuron will transform the data in different (and difficult to quantify) ways based on the assigned and change in weights within that neuron.&lt;/p&gt;

&lt;p&gt;I decided to set my dropout layer to 0.5, which means that 50% of the neurons would be randomly “turned off” from training.  I did this to prevent the model from being overfit to the training set and to allow for some flexibility when faced with new data.&lt;/p&gt;
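
&lt;p&gt;Conceptually, a dropout layer just applies a random binary mask during training.  Here is a small, illustrative pure-Python sketch of that idea (Keras handles this internally; the function below is only a stand-in):&lt;/p&gt;

```python
import random

def dropout(activations, rate=0.5, training=True, seed=None):
    """Randomly zero out a fraction `rate` of activations during training.
    Survivors are scaled by 1/(1 - rate) ("inverted dropout") so the
    expected output is unchanged when dropout is switched off at test time."""
    if not training:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    mask = rng.choices([0.0, 1.0], weights=[rate, keep], k=len(activations))
    return [a * m / keep for a, m in zip(activations, mask)]

acts = [0.2, 0.9, 0.5, 0.7]
print(dropout(acts, rate=0.5, seed=0))  # a random mask; survivors are doubled
```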

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Sequential&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Dense&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_train&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input_shape&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;784&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;activation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'relu'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Dense&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;activation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'softmax'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the hidden layer, I used the ‘relu’ (Rectified Linear Unit) activation, which passes positive outputs through and zeroes out negative ones.  For the output layer, I used ‘softmax’, which normalizes the outputs onto a scale of 0 to 1 so that each one can be read as the probability of an input falling within one of the target classes.&lt;/p&gt;
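
&lt;p&gt;Both activations are simple enough to sketch directly.  The following pure-Python stand-ins (not the Keras implementations) show what each one does to a vector of outputs:&lt;/p&gt;

```python
import math

def relu(x):
    """Rectified Linear Unit: pass positives through, zero out negatives."""
    return max(0.0, x)

def softmax(logits):
    """Exponentiate and normalize so the outputs sum to 1 (class probabilities)."""
    shifted = [z - max(logits) for z in logits]  # subtract the max for stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

print([relu(x) for x in [-1.0, 0.5, 2.0]])  # [0.0, 0.5, 2.0]
probs = softmax([2.0, 1.0, 0.1])
print(probs)  # three probabilities that sum to 1.0
```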

&lt;h2 id=&quot;knowing-your-audience&quot;&gt;Knowing Your Audience&lt;/h2&gt;

&lt;p&gt;The next step to modeling a neural network is to compile the model.  I chose the ‘adam’ optimizer given its ease of use, computational efficiency, and ability to handle large sets of data.  While there is &lt;a href=&quot;https://arxiv.org/pdf/1412.6980.pdf&quot;&gt;much more complex math&lt;/a&gt; behind how ‘adam’ is derived, it essentially works similarly to stochastic gradient descent, in which the loss function is computed during each epoch and, as a result, neurons are re-weighted until the loss is sufficiently minimized.  ‘Adam’ differs from stochastic gradient descent by incorporating a dynamic, adaptive learning rate determined by the magnitude of change to the gradients.&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Because this was a multi-class classification problem, I set the loss metric to ‘categorical_crossentropy’ as suggested by the &lt;a href=&quot;https://keras.io/losses/#categorical_crossentropy&quot;&gt;Keras documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;failing-better&quot;&gt;Failing Better&lt;/h2&gt;

&lt;p&gt;When training the model, a user can specify how many epochs are used to train the model.  An epoch is one “roundtrip” in which information is passed forward and backward through the network.&lt;/p&gt;

&lt;p&gt;During forward propagation, data is passed from the initial input layer, through the hidden layers, and finally to the output layer during which weights, biases, and activation functions are applied to the neurons.&lt;/p&gt;

&lt;p&gt;During back propagation, the data is passed back from the output layer towards the input layer; the loss produced by the current weights is measured and, using gradient descent, the weights are adjusted in order to make more accurate predictions over a user-defined number of iterations.  The number of epochs to use during training can be identified as the point where either the accuracy or the loss of the predictions stabilizes at an acceptable value.&lt;/p&gt;
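
&lt;p&gt;The idea of watching the loss stabilize to pick an epoch count can be sketched with a toy example.  Below, gradient descent fits a single weight and stops early once the epoch-to-epoch change in loss becomes negligible (everything here is illustrative; the real model trained on 784 pixel features):&lt;/p&gt;

```python
import math

# Toy illustration of choosing an epoch count: fit a single weight w to data
# generated by y = 3x with gradient descent, and stop training once the
# epoch-to-epoch change in loss is negligible (the loss has "stabilized").
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w, lr, history = 0.0, 0.05, []
for epoch in range(50):
    # forward pass: predictions and mean squared error
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    history.append(loss)
    # backward pass: gradient of the loss with respect to w, then the update
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w = w - lr * grad
    if epoch and math.isclose(history[-2], loss, abs_tol=1e-6):
        break  # loss has stabilized; extra epochs add almost nothing

print(round(w, 3), len(history))  # w is ~3.0 long before the 50-epoch cap
```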

&lt;p&gt;To illustrate this point, I plotted a summary chart of the four runs I conducted, tweaking the number of epochs and the dropout percentage with each new run.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/keras-summary.png&quot; alt=&quot;keras-summary.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The top left plot shows the loss from both the training and the cross-validation (test) sets during a run in which I trained 50 epochs without a dropout layer.  Notice how the training loss drops dramatically within the first ten epochs but improves only slightly as training continues to the 50th epoch.  The loss from the test set, however, increases with the number of epochs and with wider variance - a clear sign of overfitting.&lt;/p&gt;

&lt;p&gt;The top right plot shows a similar training run but with an added dropout layer of 50%.  Both the training and test curves appear more stable and the test loss is reduced.  In my third run, illustrated by the bottom left plot, I increased the dropout percentage to 75% while still running 50 epochs.  Again, the test loss is reduced.&lt;/p&gt;

&lt;p&gt;Finally, noticing that in the first three runs there was a shift in performance between the tenth and twentieth epoch, I set my training epochs to 15 for the fourth run while keeping the dropout rate at 75%.  In this run, the test loss hewed closely to the training loss.  Now I had a model that was trained on the dataset while remaining flexible enough to predict on unseen data.  To further illustrate this point, I put the training and test loss values into a table for easier comparison.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Run&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Epochs&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Dropout&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Train Loss&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Test Loss&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;50&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0%&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0.002&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0.178&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;50&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;50%&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0.019&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0.140&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;3&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;50&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;75%&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0.073&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0.117&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;15&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;75%&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0.104&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0.093&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;art-is-a-lie-that-makes-us-realize-truth&quot;&gt;Art is a Lie that makes us Realize Truth&lt;/h2&gt;

&lt;p&gt;As machine learning continues to be integrated into our daily lives, we should be as discerning as Penelope was when determining which visions of the future are true and which are false.&lt;/p&gt;

&lt;p&gt;To see all of the code associated with this post, check out the associated &lt;a href=&quot;https://github.com/thedatasleuth/Digit-Recognizer-Keras&quot;&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/&quot;&gt;https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Thu, 07 Jun 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/neural-network/2018/06/07/Introduction-to-Neural-Networks-Using-Keras.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/neural-network/2018/06/07/Introduction-to-Neural-Networks-Using-Keras.html</guid>
        
        <category>Neural Network</category>
        
        <category>Keras</category>
        
        <category>Python</category>
        
        
        <category>Neural-Network</category>
        
      </item>
    
      <item>
        <title>Salary Predictions Using Only 3 Features and XGBoost</title>
        <description>&lt;p&gt;&lt;em&gt;Sometimes you can’t get everything you want.  Here’s an afternoon hackathon project that challenged me to make the most of what I had.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;We’ve all heard it before - you can get something fast and cheap, but it probably won’t be very good; you could get something good and cheap, but it will take a long time; or you can get something good and fast, but it’s going to cost you.  In that vein, I participated in an afternoon hacking challenge that prompted us to predict whether a person’s salary was above $50,000.&lt;/p&gt;

&lt;p&gt;But there was a catch: each of the three teams was handicapped by a constraint.  ‘Team Samples’ was limited in the number of samples they were allowed to train on but could engineer an unlimited number of features and choose their own models.  ‘Team Features’ was bound to the number of features they could use (engineered or pre-existing) but could train on the full set of data and choose their own model.  ‘Team Algorithm’ was not bound by the number of features or samples but was constrained by the model they were allowed to use.  None of the three teams knew which team they were on until after the class had voted on the parameters of each constraint.  While these constraints were not dealbreakers, they were &lt;em&gt;highly unfavorable&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I was randomly assigned to ‘Team Features’ and constrained to using only three features from the dataset.  In three and a half hours, I worked with my two teammates to build out the best model we could given our project &lt;em&gt;and&lt;/em&gt; time constraints.  Below is a summary of what we did that afternoon.&lt;/p&gt;

&lt;h2 id=&quot;ready-go&quot;&gt;Ready? Go!&lt;/h2&gt;

&lt;p&gt;The dataset had 30,298 rows across 14 columns related to demographic information of workers, such as age, sex, education, marital-status, and occupation, to name a few.  In a separate target set, we were given their corresponding salaries.  Thankfully, there were no null values, which I would have likely dropped given the time constraint of the project.&lt;/p&gt;

&lt;p&gt;There were eight categorical columns and six numerical columns.  I decided to LabelEncode all of the categorical data in order to get the data into a model as quickly as possible.&lt;/p&gt;
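
&lt;p&gt;For readers unfamiliar with LabelEncoder, the idea is simply to map each category to an integer.  A hedged pure-Python sketch (scikit-learn’s version assigns codes in the same sorted order; the values below are made up):&lt;/p&gt;

```python
# Minimal sketch of LabelEncoder: map each category to an integer code.

def label_encode(column):
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[cat] for cat in column], mapping

occupations = ["sales", "tech", "sales", "admin"]
codes, mapping = label_encode(occupations)
print(codes)    # [1, 2, 1, 0]
print(mapping)  # {'admin': 0, 'sales': 1, 'tech': 2}
```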

&lt;p&gt;With my data in numbers I could model with, I applied SelectKBest to the data and the salaries as my target.  Given that I was constrained by the number of features I could feed into a model, I set my k value to three.  SelectKBest identified that the three most predictive features of a salary were ‘education-num’, ‘relationship’, and ‘capital-gain’.  To be honest, this surprised me, as I was expecting to see ‘sex’ as a major predictor.  From there, I scaled and train-test-split my data before feeding it into a model.  Given that this project was a competition, I wanted to challenge myself to use a new model that I hadn’t yet tried.&lt;/p&gt;
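
&lt;p&gt;SelectKBest works by scoring every feature against the target and keeping the k highest-scoring ones.  Here is an illustrative pure-Python version using absolute Pearson correlation as the score (scikit-learn defaults to an ANOVA F-test, but the ranking logic is the same; the feature values below are made up):&lt;/p&gt;

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def select_k_best(features, target, k):
    """Score each feature against the target and keep the k best."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

features = {
    "education-num": [9, 13, 10, 16, 7],
    "hours-per-week": [40, 50, 38, 60, 20],
    "shoe-size": [9, 8, 10, 9, 11],
}
over_50k = [0, 1, 0, 1, 0]
print(select_k_best(features, over_50k, k=2))
```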

&lt;h2 id=&quot;power-up&quot;&gt;Power Up!&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/mariokart.gif&quot; alt=&quot;mariokart.gif&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/dmlc/xgboost&quot;&gt;XGBoost&lt;/a&gt;, short for eXtreme Gradient Boosting, is a powerful algorithm used in many Kaggle competitions and is known for its performance as well as its computational speed.  XGBoost is an ensemble model built on a Decision Tree methodology: it fits new models to the errors of prior models and adds them together sequentially to make a final prediction.  To calculate those errors, XGBoost employs the gradient descent algorithm, which identifies parameter values that minimize the loss.&lt;/p&gt;

&lt;p&gt;More specifically, gradient descent is an iterative method that starts from an initial parameter value dictated by the user and calculates the loss of the function for that value.  It then repeatedly steps in the direction of the negative gradient - downhill along the loss surface - until the loss is sufficiently minimized, meaning that the difference in the loss function between two iterations is sufficiently small.  The size of each step is determined by the learning rate.  Move too fast, however, and the algorithm might just skip over the best possible value; move too slowly, and the model might never converge in a reasonable time.  If the model diverges instead of converging, the learning rate (alpha) will need to be decreased.&lt;/p&gt;

&lt;p&gt;To understand this process visually, imagine yourself on a sled at the top of the peak in the graph below.  Gradient descent will sled a little way down the slope, calculate the loss function, and then sled down some more until it reaches a valley that satisfies the loss threshold, set as one of the parameters within the algorithm.  XGBoost also guards against overfitting by using L1 and L2 regularization penalties.&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/gradient-descent.jpg&quot; alt=&quot;gradient-descent.jpg&quot; /&gt;&lt;/p&gt;
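
&lt;p&gt;The sledding analogy can be sketched in a few lines.  In this toy example (a single parameter and the loss (w - 4)&lt;sup&gt;2&lt;/sup&gt;, not XGBoost itself), a moderate learning rate settles into the valley while an overly large one overshoots it on every step:&lt;/p&gt;

```python
# Toy gradient descent on the loss f(w) = (w - 4)**2, a single "valley",
# showing how the learning rate controls each downhill step.
def descend(lr, start=0.0, steps=25):
    w = start
    for _ in range(steps):
        grad = 2 * (w - 4)   # slope of the loss at the current position
        w = w - lr * grad    # step downhill, scaled by the learning rate
        if abs(w) > 1e6:     # diverged: the steps keep overshooting
            return None
    return w

print(descend(lr=0.1))   # converges near the minimum at w = 4
print(descend(lr=1.5))   # diverges, so descend returns None
```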

&lt;h2 id=&quot;and-the-winner-is&quot;&gt;And the Winner Is…&lt;/h2&gt;

&lt;p&gt;When modeling my data using XGBoost, I also GridSearched across some parameters, namely ‘max_depth’, ‘n_estimators’, ‘learning_rate’, and ‘booster’, to tune the model’s hyperparameters.  The fitting took about four and a half minutes across 13,500 possible fits and returned the following best parameters:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;‘booster’: ‘gbtree’,&lt;/li&gt;
  &lt;li&gt;‘learning_rate’: 0.7054802310718645,&lt;/li&gt;
  &lt;li&gt;‘max_depth’: 5,&lt;/li&gt;
  &lt;li&gt;‘n_estimators’: 10&lt;/li&gt;
&lt;/ul&gt;
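
&lt;p&gt;Mechanically, a grid search just tries every combination of the supplied parameter values and keeps the best scorer.  The sketch below shows that loop in pure Python with an illustrative scoring stand-in (the real run used scikit-learn’s GridSearchCV around an XGBoost classifier with cross-validation):&lt;/p&gt;

```python
# Pure-Python sketch of what a grid search does: try every combination of
# hyperparameters, score each one, and keep the best.
from itertools import product

param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [10, 100],
    "learning_rate": [0.1, 0.7],
}

def cv_score(params):
    """Stand-in scoring function; a real search would cross-validate a model."""
    return 1.0 - abs(params["max_depth"] - 5) * 0.01 - abs(params["learning_rate"] - 0.7)

names = list(param_grid)
best_params, best_score = None, float("-inf")
for combo in product(*(param_grid[n] for n in names)):
    params = dict(zip(names, combo))
    score = cv_score(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)  # the combination with the highest stand-in score
```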

&lt;p&gt;The model scored a 0.847 for the training set and a 0.851 for the test set within the initial train-test-split.&lt;/p&gt;

&lt;p&gt;Having fit the model and verified that it wasn’t overfit, we then applied our fitted XGBoost model to the hold-out test set and waited for our final score - a 0.841!  Not bad for a model with only three features!&lt;/p&gt;

&lt;p&gt;This was a really fun afternoon project and a great introduction to what Data Science might be like in a professional setting - I suspect that as a Data Scientist, I will have to make (and defend) all sorts of decisions when faced with client deadlines, budget constraints, and incomplete or limited datasets.  Plus, it challenged me to learn a new and powerful model that I later applied to larger-scale projects.&lt;/p&gt;

&lt;p&gt;To view the underlying code for this post, check out the corresponding &lt;a href=&quot;https://github.com/thedatasleuth/Fast-Good-Cheap&quot;&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/peach.gif&quot; alt=&quot;peach.gif&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;
&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/&quot;&gt;https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Tue, 05 Jun 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/predictive-modeling/2018/06/05/Predicting-Salary-Using-Machine-Learning.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/predictive-modeling/2018/06/05/Predicting-Salary-Using-Machine-Learning.html</guid>
        
        <category>XGBoost</category>
        
        <category>Hackathon</category>
        
        <category>Python</category>
        
        
        <category>Predictive-Modeling</category>
        
      </item>
    
      <item>
        <title>How to Be Popular on Reddit</title>
        <description>&lt;p&gt;&lt;em&gt;With an endless supply of things to read on the internet, it seems impossible to write a post that anyone else but your mom will read.  But with a few (hundred) lines of code, even a platform as wild as Reddit can be neatly distilled into a handful of targeted insights.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Reddit claims to be the front page of the internet, and that’s because they are. With an average of 542 million monthly visitors, of which 234 million are unique, Reddit is the third most visited site in the U.S. and ranked sixth in the world. Reddit is subdivided into subreddits, which are themed discussion boards created and populated by Reddit users with links, text, videos and images. These subreddits span an endless array of interests including world news, sports, economics, movies, music, fitness, and more. Reddit members discuss proposed topics in the comments section, and the most popular comments are “up-voted” to the top of the discussion board.  In 2015, Reddit users submitted nearly 75 million posts and followed up with nearly three quarters of a billion comments.  With so many users, submissions, comments, and up-votes, it can seem impossible to craft a post that will ever see the light of day.&lt;/p&gt;

&lt;p&gt;For this project, I was tasked with analyzing a subset of Reddit’s “Hot Posts” section to identify what, if any, features of a post determine its popularity.  Using data science techniques like exploratory data analysis, predictive modeling, and natural language processing allowed me to efficiently search through a sample of 5,000 posts faster than I ever could have if I had just read through the posts manually.&lt;/p&gt;

&lt;h2 id=&quot;lets-give-em-something-to-talk-about&quot;&gt;Let’s give ‘em something to talk about&lt;/h2&gt;

&lt;p&gt;This was my first project wherein I had to gather my own data (as opposed to downloading a dataset from Kaggle, for example).  In order to build out my dataset, I scraped 5,000 posts from the “Hot Posts” section of Reddit’s website using the ‘requests’ and ‘Beautiful Soup’ libraries.  Later on in the project, I used Reddit’s popular API wrapper, PRAW, to grab some comments from my top 50 ‘hot’ posts.  After a few hours (with three-second delays in between each set of 25 posts pulled so as not to be blacklisted by Reddit’s servers), my scrape was complete.&lt;/p&gt;

&lt;p&gt;Any Reddit post carries upwards of 85 identifying features, and while I did pull all 85 of those features for each of my 5,000 posts, I decided to limit the features I built into my dataframe to those I thought would be most predictive of popularity.  I pared the 85 features down to 14 and stored them in variables: ‘author’, ‘title’, ‘subreddit’, ‘subreddit_subscribers’, ‘score’, ‘ups’, ‘downs’, ‘num_comments’, ‘num_crossposts’, ‘selftext’, ‘pinned’, ‘stickied’, ‘wls’, and ‘created_utc’.  If a post had actual original text, it was stored in the ‘selftext’ feature.  However, when I began to explore my dataset in more depth, I learned that many Reddit posts don’t actually have original content but rather are links to a previous post.&lt;/p&gt;

&lt;p&gt;With my dataframe finally built out, I was able to start exploring the posts I pulled.  My un-engineered dataset began with 5,000 rows and 17 columns.  When I grouped by ‘subreddit’, I identified 1,950 unique subreddits within my dataset.  While it would be too much to name every single subreddit in this blog post, a few of my favorites were: ‘todayilearned’, ‘mildlyinteresting’, and ‘Showerthoughts.’&lt;/p&gt;

&lt;p&gt;In exploring the null values of my ‘Text’ column, which corresponded to the ‘selftext’ feature of my scrape from Reddit, I found that only 416 of the posts I pulled had actual text in them.  The rest were empty and presumably re-posts.  Given this, I decided to fill the missing values with 0 in order to keep track of which rows had actual text.&lt;/p&gt;

&lt;p&gt;With the exception of ‘wls’ which had 684 missing entries, the rest of my dataset was filled with values.  I decided to fill the missing ‘wls’ values with 6 since that was the mode for the column.&lt;/p&gt;

&lt;p&gt;The score stats directly mirrored the ‘ups’ stats with a maximum score of 133,613, a mean score of 1,590 and a minimum score of 25.  There was nothing to be gleaned from the ‘downs’ stats given that all of the data were 0s.&lt;/p&gt;

&lt;p&gt;I also created two boolean columns ‘IsPinned’ and ‘IsStickied’ which I mapped to the ‘pinned’ and ‘stickied’ variables.  I then dropped the original ‘pinned’ and ‘stickied’ object columns for modeling purposes.&lt;/p&gt;

&lt;p&gt;The comments column, however, was the column I was most interested in, since it made sense that a post’s popularity might be strongly correlated with its volume of comments.  For the posts that I pulled, the max number of comments was 20,246; the minimum, 0; the mean, 76; and the median, 16.  While I did pull these posts from the ‘Hot Posts’ section of Reddit to begin with, I was tasked with identifying and mapping posts whose comments were above a certain threshold.  I used a lambda function to create a new column called ‘Hot’, which flagged posts with 16 comments (the 50th percentile) or more, and another column called ‘Super Hot’, which flagged posts with 44 comments (the 75th percentile) or more.&lt;/p&gt;
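
&lt;p&gt;The thresholding step looks roughly like this in pure Python (the comment counts below are made up, so the percentile cutoffs differ from the real dataset’s 16 and 44):&lt;/p&gt;

```python
import statistics

# Toy comment counts (made up); the real dataset's median was 16 and its
# 75th percentile was 44.
num_comments = [0, 3, 16, 22, 44, 87, 120, 9, 51, 16]

median = statistics.median(num_comments)         # 50th percentile
q3 = statistics.quantiles(num_comments, n=4)[2]  # 75th percentile

hot = [int(c >= median) for c in num_comments]   # the 'Hot' flag
super_hot = [int(c >= q3) for c in num_comments] # the 'Super Hot' flag
print(median, q3)
print(hot, super_hot)
```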

&lt;h2 id=&quot;popular-youre-going-to-be-pop-u-lar&quot;&gt;Popular, you’re going to be pop-u-lar&lt;/h2&gt;

&lt;p&gt;I then created a bag of words from all of the words in the ‘Title’ feature using a for-loop running across the rows of that column.  In total, there were 10,445 unique words, not including common ‘stop’ words such as ‘the’, ‘a’, &amp;amp; ‘and.’  I was able to exclude these common ‘stop’ words using the ‘stop_words’ argument in my CountVectorizer, specifying ‘english’ as my language.&lt;/p&gt;
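
&lt;p&gt;Under the hood, this amounts to counting words while skipping a stop-word list.  A minimal sketch with toy titles and a tiny stop-word set (CountVectorizer’s built-in ‘english’ list is much longer):&lt;/p&gt;

```python
from collections import Counter

# A tiny stop-word sample; scikit-learn's 'english' list contains hundreds.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is"}

titles = [
    "the front page of the internet",
    "why the internet loves cats",
    "cats and dogs of the internet",
]

bag = Counter()
for title in titles:
    bag.update(w for w in title.lower().split() if w not in STOP_WORDS)

print(bag.most_common(3))  # 'internet' (3) and 'cats' (2) lead the counts
```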

&lt;p&gt;After performing a train-test-split, I then vectorized the top 20 most common ‘Title’ words into a sparse dataframe, which I then concatenated onto my X_train and X_test matrices with the hope that more information would improve my model performance.  After standardizing my dataset with Standard Scaler, I was ready to run my models.&lt;/p&gt;

&lt;p&gt;In total, I gridsearched over six different models, including a LogisticRegression, an SGDClassifier, a KNeighborsClassifier, a BernoulliNB, a DecisionTreeClassifier, and a RandomForestClassifier.  While I remain skeptical of the perfect test scores for the Decision Tree and Random Forest Classifiers, the accuracy scores of the other four models seem to make sense.  I summarized the results of the models into a summary bar plot, with the bold lines at the top of each bar representing the standard deviation of the cross-validation runs.  The red horizontal line represents the baseline accuracy score of 0.42779.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/reddit-models.png&quot; alt=&quot;reddit-models.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Having fit and scored my models, I chose the Random Forest Classifier to run feature importances on, to understand which of my 32 features were most predictive of ‘hot’ posts.  Unsurprisingly, ‘num_comments’ was the most predictive feature, with an importance of 0.761.  The next most predictive columns did not have nearly the same predictive power as ‘num_comments’, with ‘score’, ‘ups’, and ‘subreddit_subscribers’ next in line with importances of 0.0863, 0.0725, and 0.0247, respectively.&lt;/p&gt;

&lt;h2 id=&quot;gossip-folks&quot;&gt;Gossip Folks&lt;/h2&gt;

&lt;p&gt;With modeling done, I wanted to continue exploring the text in my posts to see what themes, if any, rose to the top.  Because I had already run a CountVectorizer on the titles of my posts, I wanted to try another natural language processing technique to help me better understand my data.  So, I subset my data to ‘hot’ posts only (posts with 16 or more comments) and ran a for-loop across those rows to create another bag of words - this time filled with words linked to ‘hot’ posts.  I did the same for my ‘super hot’ posts.  With these two bags of words, I then generated word clouds to visually inspect any themes.  The images below show the results of this analysis.  Given that one of the most common words in the ‘super hot’ bag of words was ‘need’, I suspect that more than a few people on Reddit are giving some (unsolicited) advice.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;‘Hot’ Posts&lt;/th&gt;
      &lt;th&gt;‘Super Hot’ Posts&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;img src=&quot;/static/img/hot-posts.png&quot; alt=&quot;hot-posts.png&quot; /&gt;&lt;/td&gt;
      &lt;td&gt;&lt;img src=&quot;/static/img/super-hot.png&quot; alt=&quot;super-hot.png&quot; /&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
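
&lt;p&gt;The ‘hot’ bag-of-words step can be sketched without the word-cloud rendering itself - a plain Counter recovers the same word frequencies that the wordcloud package draws (toy posts shown):&lt;/p&gt;

```python
# Sketch: subset to 'hot' posts (16 or more comments) and count title words.
from collections import Counter

import pandas as pd

posts = pd.DataFrame({
    "title": ["need advice now", "cat pictures", "need help need advice"],
    "num_comments": [40, 3, 22],
})

hot = posts[posts["num_comments"] >= 16]
bag = Counter(word for title in hot["title"] for word in title.lower().split())
print(bag.most_common(3))
```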

&lt;p&gt;Having analyzed the text of the posts, I then wanted to explore Reddit’s data further by pulling the comments associated with my 50 most-commented-on posts.  First, I sorted the posts on the ‘num_comments’ feature with ‘ascending’ set to False, then sliced off the first fifty rows.&lt;/p&gt;
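
&lt;p&gt;In pandas terms, that slice might look like this (a toy frame, taking the top three rather than fifty):&lt;/p&gt;

```python
# Sketch: grab the most-commented posts by sorting descending and slicing.
import pandas as pd

posts = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "num_comments": [5, 100, 42, 7],
})

top = posts.sort_values("num_comments", ascending=False).head(3)
# Equivalent one-liner: posts.nlargest(3, "num_comments")
print(top)
```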

&lt;p&gt;With my top fifty most-commented-on posts in hand, I was ready to pull their associated comments.  Because this was my first time using PRAW, I had to sign up for a developer’s account, which gave me a user-id, an API key, and a secret key to authenticate with PRAW.&lt;/p&gt;

&lt;p&gt;I wanted to pull the associated top comments so I could run a sentiment analysis on the comments’ text and see which posts received positive comments and which received negative ones.  To parse out sentiment, I installed the &lt;a href=&quot;https://github.com/cjhutto/vaderSentiment&quot;&gt;VADER Sentiment Analyzer&lt;/a&gt; package, which also ships with nltk.&lt;/p&gt;

&lt;p&gt;To calculate the ‘sentiment intensity’ of each comment, I ran a for-loop across each of my fifty rows and stored the scores in a list.  VADER produces four scores when performing its analysis: negative, neutral, positive, and compound.  The negative, neutral, and positive scores are each bound between 0 and 1, whereas the compound score is bound between -1 and 1 to give an effective overall score.  To illustrate the range of scores my comments received, I plotted them in Plotly.  The x-axis represents the negative score, the y-axis the positive score, and the color shading represents the overall compound score - compound scores that were more positive are hued toward blue, and overall more negative ones toward red.  This explains why one of the data points is blue despite having a negative score greater than 0.&lt;/p&gt;

&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~jesster413/875/?share_key=EhaoM19zpKFqUcZ1pguqhr&quot; target=&quot;_blank&quot; title=&quot;plot from API (15)&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~jesster413/875.png?share_key=EhaoM19zpKFqUcZ1pguqhr&quot; alt=&quot;plot from API (15)&quot; style=&quot;max-width: 100%;width: 600px;&quot; width=&quot;600&quot; onerror=&quot;this.onerror=null;this.src='https://plot.ly/404.png';&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;jesster413:875&quot; sharekey-plotly=&quot;EhaoM19zpKFqUcZ1pguqhr&quot; src=&quot;https://plot.ly/embed.js&quot; async=&quot;&quot;&gt;&lt;/script&gt;
&lt;/div&gt;
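
&lt;p&gt;The bounding of the compound score comes from VADER’s normalization, which squashes the raw summed valence into (-1, 1); alpha = 15 is the constant used in VADER’s source:&lt;/p&gt;

```python
# Sketch of how VADER bounds its compound score: the raw summed word
# valences are normalized into the open interval (-1, 1).
import math

def normalize(score, alpha=15):
    return score / math.sqrt(score * score + alpha)

print(normalize(4.0))   # a strongly positive raw sum
print(normalize(-4.0))  # a strongly negative raw sum
```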

&lt;p&gt;While I initially thought that having only positive comments was ideal for posts, after reading through some of the comments and their corresponding scores, I realized that anything but a very negative score is &lt;em&gt;usually&lt;/em&gt; fine.  I then joined my compound_scores dataframe with my comments_scores dataframe to see more easily which scores lined up with which comments, and sliced on individual subreddits to identify which categories would be the least inflammatory and best received for someone interested in pitching an article to Reddit.&lt;/p&gt;

&lt;p&gt;Finally, I wanted to try to create my own machine-generated sentences using a Markov chain generator.  In another post, I walk through in more depth what Markov chains are and how they work, but for this project, I installed the &lt;a href=&quot;https://pypi.org/project/markovify/&quot;&gt;markovify&lt;/a&gt; package from PyPI and applied it to my ‘hot’ bag of words.  Using a very simple for-loop and limiting my range to one sentence, I used the ‘beginning’ argument of the ‘make_sentence_with_start’ function to create one-liner sentences starting with words I specified.  The following is the output of that code.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Women need to give me pity cash.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;I will be replaced, and unfortunately, i don’t think Ameer will be stickying this for four days.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;I can’t remember if it was pursuing me.  I glanced up at her, and five days after my hospitalization it started to happen.  I started working here almost two months ago.  We had talked on the couch, took off my sunglasses, and checked my phone.  Still no service.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Why would someone do this?  There was just going to be literally upside down.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Again, these sentences were created from the corpus of ‘hot’ words I created from text in the posts I pulled from Reddit.&lt;/p&gt;
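
&lt;p&gt;For readers curious how such sentences arise, here is a toy bigram Markov chain - an illustration of the idea behind markovify, not its actual API:&lt;/p&gt;

```python
# Minimal bigram Markov chain text generator: each word is followed by a
# random choice among the words that followed it in the corpus.
import random
from collections import defaultdict

corpus = "women need advice . women need cash . i need advice ."

chain = defaultdict(list)
words = corpus.split()
for a, b in zip(words, words[1:]):
    chain[a].append(b)

def make_sentence(start, max_words=10):
    random.seed(0)  # deterministic for the example
    out = [start]
    while out[-1] != "." and len(out) != max_words:
        out.append(random.choice(chain[out[-1]]))
    return " ".join(out)

print(make_sentence("women"))
```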

&lt;p&gt;If you’d like to see all of the code associated with this project, head on over to the corresponding &lt;a href=&quot;https://github.com/jesster413/Reddit-NLP&quot;&gt;GitHub Repo.&lt;/a&gt;&lt;/p&gt;
</description>
        <pubDate>Tue, 05 Jun 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/web-scraping/natural-language-processing/2018/06/05/Predicting-Reddit-Hot-Posts-Using-Machine-Learning-and-Natural-Language-Processing.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/web-scraping/natural-language-processing/2018/06/05/Predicting-Reddit-Hot-Posts-Using-Machine-Learning-and-Natural-Language-Processing.html</guid>
        
        <category>Web Scraping</category>
        
        <category>Natural Language Processing</category>
        
        <category>Python</category>
        
        <category>Beautiful Soup</category>
        
        
        <category>Web-scraping</category>
        
        <category>Natural-Language-Processing</category>
        
      </item>
    
      <item>
        <title>How to Predict Housing Prices with Linear Regression</title>
        <description>&lt;p&gt;&lt;em&gt;When buying a new home, everyone wants the most bang for the buck.  To that end, I analyzed homes in Ames, Iowa to identify what features of a house contribute the most to its sale price.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Buying a house is one of life’s most significant milestones, and according to the most recent census report on residential vacancies, around 64% of homes are occupant-owned.&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;  But within the homeownership population, what is considered desirable in a home varies widely from buyer to buyer.  Some might want more bedrooms to accommodate a larger family.  Others might want more garage space to fit their growing car collection.  And still others might want a wrap-around porch for those hot summer nights.  Regardless of individual preferences, however, it seems safe to assume that everyone wants the most bang for their buck.  To that end, I participated in a private Kaggle challenge during which I analyzed a &lt;a href=&quot;https://www.kaggle.com/c/dsi-us-4-project-2-regression-challenge&quot;&gt;dataset&lt;/a&gt; of homes located in Ames, Iowa to identify what features of a house contribute the most to its sale price.&lt;/p&gt;

&lt;h2 id=&quot;building-the-foundation&quot;&gt;Building the Foundation&lt;/h2&gt;

&lt;p&gt;To start, I wanted to get a sense of what the dataset looked like, both externally and internally.  The initial shape of the data consisted of 2051 rows and 234 columns, with both numerical (numbers) and categorical (words) data.  I knew I would need to convert the categorical data into a form that I could later feed into a model, but first, I wanted to see if I had any null values to contend with.  Sure enough, both my numerical and categorical columns had null values, as seen in the plot below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/null-values.png&quot; alt=&quot;null-values.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Because ‘Pool QC’, ‘Misc Feature’, ‘Alley’, and ‘Fence’ were each missing over 80% of their data, I decided to drop those columns entirely: the few values they did contain would not be enough to inform the model one way or the other, nor would they be a statistically sound basis from which to impute new data.&lt;/p&gt;

&lt;p&gt;For the rest of the columns, I decided to impute the missing data from the data I had using the &lt;a href=&quot;https://github.com/iskandr/fancyimpute/tree/master/fancyimpute&quot;&gt;fancyimpute&lt;/a&gt; package, which imputes values that mimic the characteristics of similar observations using a k-nearest-neighbors distance calculation.  This is similar to a technique for upsampling the minority class that I’ve used in other projects.&lt;/p&gt;

&lt;p&gt;I first imputed the numerical columns, given that the data was already in the correct data type.  I used k=3 nearest neighbors for the calculation and saved the imputed (and existing) data in a dataframe called ‘imputed_numerical’, which I planned to tack back onto the initial train dataframe later.&lt;/p&gt;
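
&lt;p&gt;A sketch of that numerical imputation step; I show scikit-learn’s KNNImputer here, which implements the same k-nearest-neighbors idea as fancyimpute (column names and values are illustrative):&lt;/p&gt;

```python
# Sketch: k-nearest-neighbors imputation of numerical columns.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

numerical = pd.DataFrame({
    "Gr Liv Area": [1500, 1600, np.nan, 1550],
    "Year Built": [1990, 1995, 1992, np.nan],
})

imputer = KNNImputer(n_neighbors=3)
imputed_numerical = pd.DataFrame(
    imputer.fit_transform(numerical), columns=numerical.columns
)
print(imputed_numerical)
```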

&lt;p&gt;The categorical columns, however, were not as easy to impute.  Because fancyimpute only accepts numbers, I effectively “LabelEncoded” my values while skipping over the null values (so as not to encode them and nullify the imputation exercise) with a small mapping helper.&lt;/p&gt;
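
&lt;p&gt;A sketch of that encoding step, assuming pandas (the column and values are illustrative):&lt;/p&gt;

```python
# Sketch: integer-encode a string column while skipping nulls, so the NaNs
# remain in place for the imputer to fill.
import numpy as np
import pandas as pd

train = pd.DataFrame({"Bsmt Qual": ["Gd", "TA", np.nan, "Ex", "TA"]})

codes = {value: i for i, value in enumerate(train["Bsmt Qual"].dropna().unique())}
train["Bsmt Qual"] = train["Bsmt Qual"].map(codes)  # NaN stays NaN
print(train)
```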

&lt;p&gt;With my string values encoded as numbers, I imputed the remaining null values with k=9 nearest neighbors.  Having imputed all of the missing values, I then dropped the columns in the train set with missing values and concatenated both imputed_numerical and imputed_categorical onto the train dataframe, effectively replacing the original columns.  I dummified the categorical columns that were not missing data, assigning a new column to each unique value, in order to maintain as much interpretability as possible.  With the data clean, it was ready to be fed into some models.&lt;/p&gt;


&lt;h2 id=&quot;choosing-the-framework&quot;&gt;Choosing the Framework&lt;/h2&gt;

&lt;p&gt;Because I was most interested in which features contribute the most to a home’s price, I applied SelectKBest to the dataset.  Unsurprisingly, the top ten features most predictive of a home’s sale price were: ‘Overall Qual’, ‘Year Built’, ‘1st Flr SF’, ‘Gr Liv Area’, ‘Garage Yr Blt’, ‘Total Bsmt SF’, ‘Garage Area’, ‘Garage Cars’, ‘Bsmt Qual’, and ‘Exter Qual_TA.’  Looking at these features, it makes sense that the overall quality of a home is most predictive of its value, as are the square footage of the first floor, the basement, and the garage area.  Visually, we can see that these features are correlated to the sale price by plotting them on a heatmap.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/ames-heatmap.png&quot; alt=&quot;ames-heatmap.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The bottom row on the left hand side of the plot follows the correlations of the SelectKBest features to the ‘SalePrice’ target.  As we can see, Overall Quality is most positively correlated with Sale Price at a 0.8 correlation, followed by ‘Gr Liv Area’ with a score of 0.7, and so on.  This means that when the values of these features increase, so too does the sale price of a home.  Notably, ‘Bsmt Qual’ and ‘Exter Qual_TA’ were highly negatively correlated to ‘SalePrice’ meaning that when their values increase, the sale price decreases.  The ‘BsmtQual’ feature measured the heights of basements, so based on this analysis, homebuyers don’t particularly value basements with high ceilings.  The ‘Exter Qual_TA’ feature tracked homes with an exterior quality rated ‘TA’ - short for Typical/Average.  Based on this finding, it appears that the sale price of a home decreases when its exterior quality is only rated ‘Average’ as opposed to ‘Excellent’ or ‘Good.’&lt;/p&gt;

&lt;p&gt;No matter how complicated a data science project is, I always find it useful to take a step back sometimes and ask myself whether the results make sense.  Machine learning isn’t some magical black box (although sometimes it feels that way) but rather, it’s a tool that extends the human analytical capabilities to crunch numbers faster and across far larger datasets than anyone has time for manually.&lt;/p&gt;

&lt;p&gt;Having checked the top ten features, I then used SelectKBest to identify the top 100 most predictive features of a home’s sale price to feed into the model.  I then polynomialized those features to capture the information gained from the interactions between features.  I limited the features I polynomialized to 100 out of concern for computation issues.&lt;/p&gt;
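
&lt;p&gt;Sketching those two steps together with scikit-learn (toy data; k and the feature counts are scaled down from the project’s 100, and interaction-only expansion is assumed, which is what yields 5,050 features from 100):&lt;/p&gt;

```python
# Sketch: keep the k best features, then add pairwise interaction terms.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=100, n_features=20, random_state=1)

X_best = SelectKBest(f_regression, k=10).fit_transform(X, y)
X_poly = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(X_best)
print(X_poly.shape)  # 10 features expand to 10 + C(10, 2) = 55 columns
```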

&lt;p&gt;For illustrative purposes, I plotted the SelectKBest top ten features that I then polynomialized and zipped them to the coefficient values of the linear regression model without regularization.  In the models that I actually ran, I included the top 100 polynomialized features but, because of computation (and space) issues, was not able to plot them in the same manner.  As you can see, the coefficients varied widely and needed to be regularized.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/ames-lrcoefs.png&quot; alt=&quot;ames-lrcoefs.png&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;crunching-the-numbers&quot;&gt;Crunching the Numbers&lt;/h2&gt;

&lt;p&gt;Without regularizing my coefficients, my Linear Regression model scored a -3.154 - a negative R2, worse than simply predicting the mean - which suggests that the model was badly overfit.  To counter this issue, I used several regularization techniques, namely Ridge, Lasso, and Elastic Net, which “penalize” oversized coefficients so that minimizing the loss function during training yields more sensible coefficient values.&lt;/p&gt;

&lt;p&gt;Another reason to regularize is that while features are correlated to the target to varying degrees, they can also be correlated to each other.  Multicollinearity that goes unaddressed can make a model’s predictions unstable, swinging on small changes in the sample that should not affect predictions.  Moreover, variables that are correlated with each other cannot be independently interpreted as to how they contribute to the target, since a change in one variable tends to accompany a change in another.&lt;/p&gt;

&lt;p&gt;The Ridge regularization technique handles multicollinearity by penalizing the coefficients, effectively shrinking them down.  Lasso also addresses multicollinearity, but in a different way: instead of shrinking the coefficients, it “zeroes out” those that are not statistically significant in predicting the target value.  In effect, Lasso acts as a feature selector by zeroing out redundant or unimportant variables.  The Lasso technique was particularly useful for this dataset given that polynomializing the top 100 features identified by SelectKBest produced 5,050 features all together.  Lasso pared this number down to 75 features, illustrated in the plot below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/lasso-coefs.png&quot; alt=&quot;lasso-coefs.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With the exception of dropping the Overall Qual Gr Liv Area polynomialized feature which had a coefficient value of 25000, the remaining 74 features ranged from just above 0 to 5000 (as opposed to -600,000 to over 700,000 in the first linear regression plot of the top ten polynomialized features.)  This model is particularly useful when interpretability (as in the case of buying a house) is important.&lt;/p&gt;

&lt;p&gt;I also ran an Elastic Net regression, which is a combination of both the Ridge and Lasso penalties.&lt;/p&gt;
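
&lt;p&gt;A side-by-side sketch of the three regularizers on synthetic data makes Lasso’s feature-selection behavior visible (the alphas here are illustrative, not tuned values from the project):&lt;/p&gt;

```python
# Sketch: fit the three regularized models and count how many coefficients
# Lasso zeroes out versus Ridge (which only shrinks them).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10, random_state=2)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=5.0, max_iter=5000).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=5000).fit(X, y)

ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print(ridge_zeros, lasso_zeros)  # Lasso zeroes many; Ridge typically none
```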

&lt;h2 id=&quot;evaluating-the-offer&quot;&gt;Evaluating the Offer&lt;/h2&gt;

&lt;p&gt;I ran 10 cross-validation runs across the three regularized regression models as seen in the plot below.  It looks like all three models fared similarly, with the exception of Lasso’s ninth cross-validation run, where it fell down hard - perhaps because of the variation within the sample set it modeled.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/ames-cv.png&quot; alt=&quot;ames-cv.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Put another way, I plotted the mean cross-validation scores of the three models in the bar plot below.  Because of Lasso’s ninth cross-validation run, its overall mean score was lower than that of the Ridge and Elastic Net models; however, it might still help a buyer choose a home given how interpretable its coefficients are.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/ames-evaluation.png&quot; alt=&quot;ames-evaluation.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Having conducted this analysis, it appears that the overall quality of a house, the amount of space in the living area, the square footage of the basement, whether there is a fireplace, and the year the home was built and/or remodeled all contribute greatly to the sale price of a house.  In a future iteration of this project, I would be interested in analyzing which neighborhoods are considered most desirable based on their predictive value for the home’s sale price.  Already, from the Lasso coefficients, we can see that the No Ridge and Stone Brook neighborhoods have large coefficients, suggesting statistical significance in their ability to predict a home’s sale price.&lt;/p&gt;

&lt;p&gt;To view the code associated with this project, head on over to the corresponding &lt;a href=&quot;https://github.com/thedatasleuth/Ames-Iowa-Housing-Prices/tree/master&quot;&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.census.gov/housing/hvs/files/currenthvspress.pdf&quot;&gt;https://www.census.gov/housing/hvs/files/currenthvspress.pdf&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Wed, 23 May 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/predictive-modeling/2018/05/23/Predicting-Ames-Iowa-Housing-Prices-Using-Machine-Learning.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/predictive-modeling/2018/05/23/Predicting-Ames-Iowa-Housing-Prices-Using-Machine-Learning.html</guid>
        
        <category>Linear Regression</category>
        
        <category>Python</category>
        
        
        <category>Predictive-Modeling</category>
        
      </item>
    
      <item>
        <title>Would I Have Survived the Titanic?</title>
        <description>&lt;p&gt;&lt;em&gt;While the infamous shipwreck happened over 100 years ago, its cultural significance is sunk deep into our collective memory as one of the most tragic manifestations of hubris, classism, and dumb luck.  To that end, I analyzed the data provided on Kaggle’s website to determine more specifically how features such as age, gender, class, and wealth predetermined a passenger’s fate on April 15, 1911 aboard the RMS Titanic.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;While the infamous shipwreck happened over 100 years ago, its cultural significance is sunk deep into our collective memory as one of the most tragic manifestations of hubris, classism, and dumb luck.  To that end, I analyzed the data provided on Kaggle’s website to determine more specifically how features such as age, gender, class, and wealth predetermined a passenger’s fate on April 15, 1912 aboard the RMS Titanic.&lt;/p&gt;

&lt;h2 id=&quot;where-to-miss&quot;&gt;“Where to, Miss?”&lt;/h2&gt;

&lt;iframe src=&quot;https://giphy.com/embed/ghvWn8S0jiI0M&quot; width=&quot;480&quot; height=&quot;203&quot; frameborder=&quot;0&quot; class=&quot;giphy-embed&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;a href=&quot;https://giphy.com/gifs/movie-titanic-ghvWn8S0jiI0M&quot;&gt;via GIPHY&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this project, I wanted to predict a passenger’s fate using machine learning across several different types of models.  I was also interested in identifying which features had the greatest impact on a person’s chances of survival.&lt;/p&gt;

&lt;p&gt;This dataset is neatly packaged in a .csv file that can be downloaded from Kaggle’s &lt;a href=&quot;https://www.kaggle.com/c/titanic/data&quot;&gt;Titanic competition&lt;/a&gt; page.  In some of &lt;a href=&quot;https://thedatasleuth.github.io/category/Web-scraping&quot;&gt;my other projects&lt;/a&gt; where the data is not so readily accessible, I have used research and web-scraping techniques to assemble my datasets.&lt;/p&gt;

&lt;h2 id=&quot;remember-they-love-money-so-pretend-like-you-own-a-gold-mine-and-youre-in-the-club&quot;&gt;“Remember, they love money, so pretend like you own a gold mine and you’re in the club”&lt;/h2&gt;

&lt;p&gt;Whenever I begin with a new dataset, I like to get an understanding of its shape, stats, and potential pitfalls (null values, outliers, categorical data).  I use descriptors like df.info(), df.shape, and df.describe(), as well as a mask I wrote to slice on the columns that specifically have null values.&lt;/p&gt;
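
&lt;p&gt;That null-value mask amounts to something like this (a toy frame in place of the Titanic data):&lt;/p&gt;

```python
# Sketch: list only the columns that actually contain null values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22, np.nan, 30],
    "Cabin": [np.nan, np.nan, "C85"],
    "Fare": [7.25, 71.28, 8.05],
})

null_counts = df.isnull().sum()
null_columns = null_counts[null_counts > 0]  # mask: keep columns with NaNs
print(null_columns)
```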

&lt;p&gt;In sum, this dataset has 891 rows across 12 columns, five of which are categorical columns: ‘Name’, ‘Sex’, ‘Ticket’, ‘Cabin’, ‘Embarked’, and seven of which are numerical columns: ‘PassengerId’, ‘Survived’, ‘Pclass’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’.&lt;/p&gt;

&lt;p&gt;Given that the ‘Cabin’ category was missing almost 80% of its data, I decided not to impute the missing values and instead relabeled them as “UI” (Unidentified).  I then stripped just the first letter from each cabin to make a dummy column called Cabin_category, which organized the cabins into this neat array: ‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘T’, ‘U’.  The ‘Age’ feature, however, was only missing about 20% of its data, so it made sense to impute the missing values using MICE from the fancyimpute package.  Finally, I dropped the two rows missing ‘Embarked’ values entirely.&lt;/p&gt;
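
&lt;p&gt;A sketch of the Cabin_category step in pandas (toy values):&lt;/p&gt;

```python
# Sketch: fill missing cabins with a placeholder, then keep just the deck
# letter as a new Cabin_category feature.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Cabin": ["C85", np.nan, "B28", "E46", np.nan]})

df["Cabin"] = df["Cabin"].fillna("UI")   # UI = Unidentified
df["Cabin_category"] = df["Cabin"].str[0]
print(sorted(df["Cabin_category"].unique()))
```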

&lt;p&gt;For illustrative purposes, I plotted some graphs with Plotly to help visualize the dataset.  In the ‘Age’ plot, we can see that the majority of people who died were in the 20-39 bins, whereas the majority of people who lived were in the 0-9 bin.  It makes sense that children would have been prioritized both in terms of what was considered the moral thing to do as well as how much space they took up on the life rafts.&lt;/p&gt;

&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~jesster413/787/?share_key=dGWKE4WznFehAsCcV0SKD0&quot; target=&quot;_blank&quot; title=&quot;plot from API (19)&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~jesster413/787.png?share_key=dGWKE4WznFehAsCcV0SKD0&quot; alt=&quot;plot from API (19)&quot; style=&quot;max-width: 100%;width: 600px;&quot; width=&quot;600&quot; onerror=&quot;this.onerror=null;this.src='https://plot.ly/404.png';&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;jesster413:787&quot; sharekey-plotly=&quot;dGWKE4WznFehAsCcV0SKD0&quot; src=&quot;https://plot.ly/embed.js&quot; async=&quot;&quot;&gt;&lt;/script&gt;
&lt;/div&gt;

&lt;p&gt;We can also see in the plot below that analyzes the gender breakdown of the people who died that the majority of those who perished were men - over four times as many as those who survived.  On the other hand, the women who survived outnumbered those who perished by nearly 3:1.&lt;/p&gt;

&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~jesster413/791/?share_key=HIZYGI52HOgQjqEKizOPPF&quot; target=&quot;_blank&quot; title=&quot;plot from API (21)&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~jesster413/791.png?share_key=HIZYGI52HOgQjqEKizOPPF&quot; alt=&quot;plot from API (21)&quot; style=&quot;max-width: 100%;width: 600px;&quot; width=&quot;600&quot; onerror=&quot;this.onerror=null;this.src='https://plot.ly/404.png';&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;jesster413:791&quot; sharekey-plotly=&quot;HIZYGI52HOgQjqEKizOPPF&quot; src=&quot;https://plot.ly/embed.js&quot; async=&quot;&quot;&gt;&lt;/script&gt;
&lt;/div&gt;

&lt;p&gt;I was also interested in analyzing whether a person’s wealth played a role in their chances of survival, so I plotted a graph that analyzed the price of a passenger’s ticket as well as their class and cabin.  The size of the bubble corresponds to the price of the ticket (the bigger the bubble, the more expensive the ticket.)  From this plot, it looks like the passengers who were located in Passenger Class 1, Cabin B and held an expensive ticket were more likely to survive.&lt;/p&gt;

&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~jesster413/554/?share_key=uPkCoBfkZF2cpBiDxKE2rj&quot; target=&quot;_blank&quot; title=&quot;plot from API (12)&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~jesster413/554.png?share_key=uPkCoBfkZF2cpBiDxKE2rj&quot; alt=&quot;plot from API (12)&quot; style=&quot;max-width: 100%;width: 600px;&quot; width=&quot;600&quot; onerror=&quot;this.onerror=null;this.src='https://plot.ly/404.png';&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;jesster413:554&quot; sharekey-plotly=&quot;uPkCoBfkZF2cpBiDxKE2rj&quot; src=&quot;https://plot.ly/embed.js&quot; async=&quot;&quot;&gt;&lt;/script&gt;
&lt;/div&gt;

&lt;h2 id=&quot;you-dont-know-what-hand-youre-gonna-get-dealt-next&quot;&gt;“You don’t know what hand you’re gonna get dealt next”&lt;/h2&gt;

&lt;p&gt;Here’s where I engineered my features, whether that meant dropping columns entirely or creating new ones.  For this dataset, I decided to drop the PassengerId, Name, and Ticket columns, since their values do not contribute to predicting whether someone survived.  Whether someone was male or female, however, did, so I encoded that column: 0 for male and 1 for female.  I also dummified the Embarked and Cabin_category columns to determine whether those features played a role in survivability.  Finally, I added columns like ‘IsReverend’, which encoded whether someone was a Reverend, and ‘FamilyCount’, which combined the values of ‘SibSp’ and ‘Parch’ to give further detail to the model.&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;
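
&lt;p&gt;The engineered features can be sketched like this (toy rows; the title check is a simplified stand-in for however the ‘IsReverend’ flag was actually derived):&lt;/p&gt;

```python
# Sketch: encode sex, flag reverends by title, and combine siblings/spouses
# with parents/children into a single FamilyCount feature.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Brown, Rev. Paul", "Lee, Mrs. Ann"],
    "Sex": ["male", "male", "female"],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 2],
})

df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["IsReverend"] = df["Name"].str.contains("Rev.", regex=False).astype(int)
df["FamilyCount"] = df["SibSp"] + df["Parch"]
print(df[["Sex", "IsReverend", "FamilyCount"]])
```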

&lt;iframe src=&quot;https://giphy.com/embed/Y4aRyFaavT9ss&quot; width=&quot;460&quot; height=&quot;480&quot; frameborder=&quot;0&quot; class=&quot;giphy-embed&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;a href=&quot;https://giphy.com/gifs/Y4aRyFaavT9ss&quot;&gt;via GIPHY&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;it-is-a-mathematical-certainty&quot;&gt;“It is a mathematical certainty.”&lt;/h2&gt;

&lt;p&gt;Because this is a binary prediction, I used Classifier models to predict whether a passenger survived or not, specifically, Logistic Regression, an SGD Classifier, a kNN Classifier, a Bernoulli Naive Bayes Classifier, a Random Forest, and XGBoost.&lt;/p&gt;

&lt;p&gt;To more easily compare my accuracy scores, I generated a table to catalogue my training and test scores across all of my models.  As expected, my training scores were higher than my test scores, given that I fit my models to my training data.  The models all fared pretty well; however, the kNN and Random Forest models appear overfit, given the discrepancy between their training and test scores.  Logistic Regression, the SGD Classifier, and XGBoost all performed well on both the training and test sets.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Model&lt;/th&gt;
      &lt;th&gt;Penalty&lt;/th&gt;
      &lt;th&gt;Alpha&lt;/th&gt;
      &lt;th&gt;Train Score&lt;/th&gt;
      &lt;th&gt;Test Score&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;LogReg&lt;/td&gt;
      &lt;td&gt;Lasso&lt;/td&gt;
      &lt;td&gt;0.077&lt;/td&gt;
      &lt;td&gt;0.844&lt;/td&gt;
      &lt;td&gt;0.816&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;SGD&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;0.830&lt;/td&gt;
      &lt;td&gt;0.816&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;kNN&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;1.000&lt;/td&gt;
      &lt;td&gt;0.794&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;BNB&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;0.757&lt;/td&gt;
      &lt;td&gt;0.757&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;RF&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;0.992&lt;/td&gt;
      &lt;td&gt;0.785&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;XGBoost&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;0.830&lt;/td&gt;
      &lt;td&gt;0.821&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
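
&lt;p&gt;A comparison table like the one above can be produced with a short loop over fitted estimators.  The sketch below uses scikit-learn classifiers on a synthetic dataset; the actual Titanic features and the tuned hyperparameters from the post are not reproduced here.&lt;/p&gt;

```python
# Sketch: comparing train vs. test scores across several classifiers.
# Synthetic data stands in for the Titanic feature matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=800, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "SGD": SGDClassifier(random_state=42),
    "kNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For classifiers, .score() returns mean accuracy; a large gap
    # between the train and test numbers suggests overfitting.
    scores[name] = (model.score(X_train, y_train),
                    model.score(X_test, y_test))

for name, (train, test) in scores.items():
    print(f"{name}: train={train:.3f}, test={test:.3f}")
```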

&lt;h2 id=&quot;a-womans-heart-is-a-deep-ocean-of-secrets&quot;&gt;“A woman’s heart is a deep ocean of secrets.”&lt;/h2&gt;

&lt;p&gt;Based on the analysis performed in this post, men between the ages of 20 and 39 were the most likely to die aboard the RMS Titanic; women, children, and those with higher-priced tickets fared better.  To illustrate this point further, I applied the &lt;a href=&quot;http://lime-ml.readthedocs.io/en/latest/lime.html&quot;&gt;LIME&lt;/a&gt; package to the XGBoost model’s predictions to get an estimated likelihood of a passenger’s survival based on the ‘Age’, ‘Sex’, and ‘Pclass’ features.&lt;/p&gt;

&lt;p&gt;For example, a 28-year-old male passenger in third class had an 81% likelihood of dying.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/titanic-lime-dead.png&quot; alt=&quot;titanic-lime-dead.png&quot; /&gt;
&lt;br /&gt;
In contrast, a three-year-old female in first class had a 75% chance of survival.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/titanic-lime-survived.png&quot; alt=&quot;titanic-lime-survived.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To see all of the code associated with this project, check out the corresponding &lt;a href=&quot;https://github.com/thedatasleuth/Titanic-Survival-Predictions&quot;&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Mon, 21 May 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/predictive-modeling/2018/05/21/Predicting-Titanic-Survivability-Odds-Using-Machine-Learning.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/predictive-modeling/2018/05/21/Predicting-Titanic-Survivability-Odds-Using-Machine-Learning.html</guid>
        
        <category>Logistic Regression</category>
        
        <category>Random Forest</category>
        
        <category>XGBoost</category>
        
        <category>Python</category>
        
        
        <category>Predictive-Modeling</category>
        
      </item>
    
      <item>
        <title>Why I Decided to Learn Data Science</title>
        <description>&lt;p&gt;&lt;em&gt;On April 23, 2018, I enrolled in General Assembly’s Data Science course, a full-time immersion program designed to teach programming languages, data analysis techniques, and machine learning skills in 12 weeks.&lt;/em&gt;&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;On April 23, 2018, I enrolled in General Assembly’s Data Science course, a full-time immersion program designed to teach programming languages, data analysis techniques, and machine learning skills in 12 weeks.  On the first day, I was writing code in Python, parsing out a dictionary full of Pokemon characteristics, and wondering if I had made a huge mistake.&lt;/p&gt;

&lt;p&gt;I started my career as a Paralegal in the Violent and Organized Crime Unit of the U.S. Attorney’s Office for the Southern District of New York.  I knew from the first day I stepped inside that dilapidated office that I had found my people and my purpose.  For the next three and a half years, I worked on a wide range of cases involving gang violence, drug trafficking, human trafficking, and securities fraud, helping Assistant U.S. Attorneys (AUSAs) prepare for trial.  I later left that office to join the New Jersey Attorney General’s Office, where I continued my investigative work as a Detective in the Public Corruption Unit.&lt;/p&gt;

&lt;p&gt;The decision to enroll in a data science bootcamp took shape the year prior, when I was working at a consulting firm tasked with the five-year, court-appointed monitorship of HSBC.  Given my investigations background, I frequently found myself on teams responsible for investigating transaction-monitoring alerts, but I was routinely passed over for more analytics-based projects in favor of someone with more advanced technical skills.  I knew what I wanted the data to bear out; I just didn’t know how to get there.&lt;/p&gt;

&lt;p&gt;Frustrated with the limitations of my skillset, I initially took one-off Excel courses to boost my understanding of pivot tables and functions, but the results weren’t satisfying.  While Excel is a powerful tool and useful for some kinds of analysis, it cannot handle datasets larger than about a million rows, and it cannot surface the kinds of insights machine learning packages can.  I realized that if I wanted to keep moving forward in my career as an investigator, to interrogate large datasets and uncover new insights, I needed to invest in my technical skills.&lt;/p&gt;

&lt;p&gt;And that’s the path that led me to writing for-loops over a dictionary full of Pokemon names, gym locations, and combat techniques.  Over the next 12 weeks, I would learn even more about data mining, data enrichment, natural language processing, data visualization, and many different machine learning packages.&lt;/p&gt;

&lt;p&gt;This blog, The Data Sleuth, was inspired by my experiences in both investigative work and data science.  It is a collection of work that I began during my course and have continued to refine in the weeks and months since graduation.  Feel free to read through the blog posts, which are usually shorter topical pieces, or have a look at my portfolio in the ‘Projects’ section, where I go into more depth on the methodology behind each project.  If you have any questions about the topics I write about or the code associated with my projects, or just want to chat about my experience in the coding bootcamp, don’t hesitate to reach out via email by clicking on the envelope icon on the left.  I look forward to hearing from you!&lt;/p&gt;

&lt;p&gt;Welcome to The Data Sleuth.&lt;/p&gt;
</description>
        <pubDate>Mon, 23 Apr 2018 00:00:00 +0000</pubDate>
        <link>https://thedatasleuth.github.io/general-assembly/2018/04/23/Why-Should-I-Learn-Data-Science.html</link>
        <guid isPermaLink="true">https://thedatasleuth.github.io/general-assembly/2018/04/23/Why-Should-I-Learn-Data-Science.html</guid>
        
        <category>General Assembly</category>
        
        <category>Bootcamp</category>
        
        
        <category>General-Assembly</category>
        
      </item>
    
  </channel>
</rss>
