Data Wrangling

Data Science: Jordan Boyd-Graber
University of Maryland
January 14, 2018

Data Science: Jordan Boyd-Graber | UMD Data Wrangling | 1 / 14


Download Data


Big Picture

• Data are messy (this isn’t so messy!)
• The first step to doing anything cool is using data
• Need to use common sense and brute force often
• You’ll see more in first real homework


First Steps: Get Data

• From FEC
• Odd formatting
• Today: pure Python (easier with Pandas); this will show the level of Python you’ll need


Look at file . . .

• Periods instead of commas (and vice versa)
• Odd New York parties
• Semicolon delimiters
• Includes totals
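The swapped periods and commas can be handled with two small helpers. This is only a sketch: the names `parse_count` and `parse_percent` are hypothetical (not from the lab), and it assumes counts look like `1.234.567` and percentages look like `48,04%`.

```python
def parse_count(text):
    """Turn a count such as '1.234.567' into the integer 1234567."""
    # Assumption: periods appear only as thousands separators.
    return int(text.strip().replace(".", ""))


def parse_percent(text):
    """Turn a percentage such as '48,04%' into the float 48.04."""
    # Assumption: a comma is the decimal mark; a trailing '%' is optional.
    cleaned = text.strip().rstrip("%").replace(",", ".")
    return float(cleaned)


print(parse_count("1.234.567"))
print(parse_percent("48,04%"))
```

Keeping the cleanup in one place means a surprise in the file shows up as a single `ValueError` rather than a wrong sum.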


Read in Data

    from csv import DictReader

    # The file is semicolon-delimited, so override the default comma.
    with open("2012pres.csv", "r") as infile:
        votes = list(DictReader(infile, delimiter=";"))
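To see what `DictReader` returns without the real file, here is a small sketch that reads from an in-memory string. The column names come from the slides; the rows and numbers are made up for illustration.

```python
import io
from csv import DictReader

# Made-up sample in the same shape as the file: semicolon-delimited,
# periods as thousands separators. The values are illustrative only.
sample = io.StringIO(
    "STATE;LAST NAME;GENERAL RESULTS\n"
    "Examplestate;Alpha;1.200\n"
    "Examplestate;Beta;800\n"
)

rows = list(DictReader(sample, delimiter=";"))

# Each row is a dictionary keyed by the header line; values are strings.
print(rows[0]["LAST NAME"], rows[0]["GENERAL RESULTS"])
```

Every value comes back as a string, which is why the later slides convert before doing arithmetic.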


How many votes were cast?

Total votes 129085410

    # Strip the period thousands separators; skip rows with an empty cell.
    total_votes = sum(int(x["TOTAL VOTES #"].replace(".", ""))
                      for x in votes if x["TOTAL VOTES #"])
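The same sum can be written as an explicit loop, which may be easier to follow. This sketch uses made-up rows; the empty cell is an assumption about how rows without a total appear in the file.

```python
# Made-up rows; a blank "TOTAL VOTES #" cell stands in for rows that
# carry no total (an assumption about the file layout).
votes = [
    {"STATE": "A", "TOTAL VOTES #": "1.500"},
    {"STATE": "A", "TOTAL VOTES #": ""},
    {"STATE": "B", "TOTAL VOTES #": "2.250.000"},
]

total_votes = 0
for row in votes:
    cell = row["TOTAL VOTES #"]
    if not cell:
        # Empty string: nothing to add for this row.
        continue
    # Drop the thousands separators before converting to an integer.
    total_votes += int(cell.replace(".", ""))

print("Total votes %i" % total_votes)
```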


What state had the largest numerical margin between first and second place?

Largest numerical margin 3014327 in California

    # winner, second, and argmax are helper functions not shown on the slide.
    margins = {}
    for ss in set(x["STATE"] for x in votes):
        margins[ss] = winner(votes, ss)[1] - second(votes, ss)[1]
    num_margin = argmax(margins)
    print("Largest numerical margin %i in %s" %
          (max(margins.values()), num_margin))
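The slide calls `winner`, `second`, and `argmax` without showing them. Below is one way they might look, as a sketch only: it assumes each result is a `(name, count, percent)` tuple (consistent with the `[1]` and `[2]` indexing on the slides), that percentages are computed from the candidates listed for the state, and that `ranked` and `parseint` are hypothetical helpers. The rows are made up.

```python
def parseint(text):
    """Hypothetical helper: '1.234' -> 1234 (periods as thousands separators)."""
    return int(text.replace(".", ""))


def ranked(votes, state):
    """Candidates in one state as (name, count, percent), largest count first."""
    rows = [(x["LAST NAME"], parseint(x["GENERAL RESULTS"]))
            for x in votes
            if x["STATE"] == state and x["LAST NAME"]]
    total = sum(count for _, count in rows)
    results = [(name, count, 100.0 * count / total) for name, count in rows]
    return sorted(results, key=lambda r: r[1], reverse=True)


def winner(votes, state):
    return ranked(votes, state)[0]


def second(votes, state):
    return ranked(votes, state)[1]


def argmax(scores):
    """Key of the dictionary entry with the largest value."""
    return max(scores, key=scores.get)


# Made-up rows for illustration.
votes = [
    {"STATE": "A", "LAST NAME": "Alpha", "GENERAL RESULTS": "6.000"},
    {"STATE": "A", "LAST NAME": "Beta", "GENERAL RESULTS": "4.000"},
    {"STATE": "B", "LAST NAME": "Alpha", "GENERAL RESULTS": "300"},
    {"STATE": "B", "LAST NAME": "Beta", "GENERAL RESULTS": "700"},
]

margins = {}
for ss in set(x["STATE"] for x in votes):
    margins[ss] = winner(votes, ss)[1] - second(votes, ss)[1]
print("Largest numerical margin %i in %s" %
      (max(margins.values()), argmax(margins)))
```

A non-numeric cell would raise `ValueError` in `parseint`; the real lab code may guard against that differently.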


What state had the largest percentage margin between first and second place?

Largest percentage margin 48.04 in Utah

    # Same helpers as before; index 2 holds the percentage.
    margins = {}
    for ss in set(x["STATE"] for x in votes
                  if x["STATE"] != "District of Columbia"):
        margins[ss] = winner(votes, ss)[2] - \
            second(votes, ss)[2]
    num_margin = argmax(margins)
    print("Largest percentage margin %.2f in %s" %
          (max(margins.values()), num_margin))


What state had the largest numerical third party vote (and for whom)?

Johnson had largest third party vote in California with 143221

    # parseint and kMAJOR are defined elsewhere (not shown on the slide).
    all_third_vote = {}
    top_third_vote = {}
    for ss in set(x["STATE"] for x in votes):
        try:
            all_third_vote[ss] = \
                dict((x["LAST NAME"],
                      parseint(x["GENERAL RESULTS"]))
                     for x in votes
                     if x["STATE"] == ss
                     and x["LAST NAME"] not in kMAJOR
                     and x["LAST NAME"])
        except ValueError:
            all_third_vote[ss] = {}

        if all_third_vote[ss]:
            top_third_vote[ss] = max(all_third_vote[ss].values())
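The slide stops once `top_third_vote` is filled in. A sketch of the remaining step, picking the state and then the candidate, might look like this. The contents of `kMAJOR`, the `parseint` and `argmax` helpers, and all the rows are assumptions made for illustration.

```python
kMAJOR = {"Obama", "Romney"}  # assumed contents; the lab's definition is not shown


def parseint(text):
    """Hypothetical helper: '1.234' -> 1234."""
    return int(text.replace(".", ""))


def argmax(scores):
    """Key of the dictionary entry with the largest value."""
    return max(scores, key=scores.get)


# Made-up rows; the last one mimics a totals row with no candidate name.
votes = [
    {"STATE": "A", "LAST NAME": "Obama", "GENERAL RESULTS": "5.000"},
    {"STATE": "A", "LAST NAME": "Romney", "GENERAL RESULTS": "4.000"},
    {"STATE": "A", "LAST NAME": "Johnson", "GENERAL RESULTS": "300"},
    {"STATE": "A", "LAST NAME": "Stein", "GENERAL RESULTS": "100"},
    {"STATE": "B", "LAST NAME": "Obama", "GENERAL RESULTS": "900"},
    {"STATE": "B", "LAST NAME": "Stein", "GENERAL RESULTS": "50"},
    {"STATE": "B", "LAST NAME": "", "GENERAL RESULTS": "950"},
]

all_third_vote = {}
top_third_vote = {}
for ss in set(x["STATE"] for x in votes):
    all_third_vote[ss] = dict(
        (x["LAST NAME"], parseint(x["GENERAL RESULTS"]))
        for x in votes
        if x["STATE"] == ss
        and x["LAST NAME"]
        and x["LAST NAME"] not in kMAJOR)
    if all_third_vote[ss]:
        top_third_vote[ss] = max(all_third_vote[ss].values())

# First find the state with the biggest third-party count,
# then the candidate who earned it.
best_state = argmax(top_third_vote)
best_candidate = argmax(all_third_vote[best_state])
print("%s had largest third party vote in %s with %i" %
      (best_candidate, best_state, top_third_vote[best_state]))
```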


What state had the largest percentage third party vote (and for whom)?

Johnson had largest third party percent in New Mexico with 3.55

    # parseint and kMAJOR are defined elsewhere (not shown on the slide).
    all_third_vote = {}
    top_third_vote = {}
    for ss in set(x["STATE"] for x in votes):
        try:
            all_third_vote[ss] = \
                dict((x["LAST NAME"],
                      parseint(x["GENERAL RESULTS"]))
                     for x in votes
                     if x["STATE"] == ss
                     and x["LAST NAME"] not in kMAJOR
                     and x["LAST NAME"])
        except ValueError:
            all_third_vote[ss] = {}

        if all_third_vote[ss]:
            top_third_vote[ss] = max(all_third_vote[ss].values())
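The code above reads raw counts from `GENERAL RESULTS`. One way to turn counts into percentages is sketched below, under the assumption that a state's total is the sum over its named candidates; the real file may carry its own percent column, in which case that would be the better source. `kMAJOR`, `parseint`, `state_total`, and the rows are all hypothetical.

```python
kMAJOR = {"Obama", "Romney"}  # assumed contents; the lab's definition is not shown


def parseint(text):
    """Hypothetical helper: '1.234' -> 1234."""
    return int(text.replace(".", ""))


def state_total(votes, state):
    """Sum of votes over all named candidates in one state."""
    return sum(parseint(x["GENERAL RESULTS"])
               for x in votes
               if x["STATE"] == state and x["LAST NAME"])


# Made-up rows: state A has the bigger count, state B the bigger share.
votes = [
    {"STATE": "A", "LAST NAME": "Obama", "GENERAL RESULTS": "5.000"},
    {"STATE": "A", "LAST NAME": "Romney", "GENERAL RESULTS": "4.000"},
    {"STATE": "A", "LAST NAME": "Johnson", "GENERAL RESULTS": "1.000"},
    {"STATE": "B", "LAST NAME": "Obama", "GENERAL RESULTS": "400"},
    {"STATE": "B", "LAST NAME": "Romney", "GENERAL RESULTS": "400"},
    {"STATE": "B", "LAST NAME": "Johnson", "GENERAL RESULTS": "200"},
]

# Map (candidate, state) -> share of that state's vote, in percent.
third_pct = {}
for x in votes:
    if x["LAST NAME"] and x["LAST NAME"] not in kMAJOR:
        share = 100.0 * parseint(x["GENERAL RESULTS"]) / state_total(votes, x["STATE"])
        third_pct[(x["LAST NAME"], x["STATE"])] = share

best = max(third_pct, key=third_pct.get)
print("%s had largest third party percent in %s with %.2f" %
      (best[0], best[1], third_pct[best]))
```

With these made-up rows the largest count and the largest percentage fall in different states, which is the same pattern as the California and New Mexico answers on the slides.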


Summary

• Data are messy
• Easier with formatted data (e.g., csv)
• Need basic data structures
• Check whether answers are reasonable
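The last point can be partly automated. The sketch below is a hypothetical `looks_reasonable` function (not part of the lab) that flags totals larger than the population and percentage margins outside 0 to 100. The population figure is a rough number for the 2012 United States, used only as an upper bound.

```python
def looks_reasonable(total_votes, pct_margins, population):
    """Return a list of problems found; an empty list means nothing obvious."""
    problems = []
    # More votes than people usually means totals rows were double counted.
    if not 0 < total_votes <= population:
        problems.append("total votes %i outside 0..%i" % (total_votes, population))
    # A percentage margin must lie between 0 and 100.
    for state, margin in pct_margins.items():
        if not 0 <= margin <= 100:
            problems.append("margin for %s is %r" % (state, margin))
    return problems


# Values from the slides, checked against a rough population figure.
print(looks_reasonable(129085410, {"Utah": 48.04}, 314000000))

# Deliberately broken inputs to show what a failure looks like.
print(looks_reasonable(10 ** 10, {"X": 140.0}, 314000000))
```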


Next Time . . .

• Lecture: make sure to do reading
• Probability foundations (if you found today boring . . . )
• Math needed for the course (quiz likely)
