
Let’s begin by profiling the two US state columns. We want to assess whether the values contained in those columns are syntactically valid. In the FEC’s data dictionary, column 5 represents the state of the office for which the candidate is running, and column 14 represents the state of the candidate’s mailing address.

Focusing on column 5, we can begin by collecting a list of all the unique values in the column. This is a common first step when performing syntactic profiling. This operation produces a list of 57 unique values. We know that there are 50 US states that have voting representatives in the US Congress, and 5 US territories and the District of Columbia with nonvoting representatives in the US Congress. Additionally, if we look at the data dictionary, we can see that column 5 can contain a nonstate value, “US,” in records that represent candidates who are running for president. So, at first glance, it seems reasonable that there would be 57 possible syntactically valid locations in column 5.

We can dig a little deeper and examine each value in column 5 individually to see if it matches one of the known 50 state abbreviations, 6 territory abbreviations, or “US.” We performed this check by using a lookup to a reference dataset of all 57 valid values. In fields that matched one of the 57 valid values, we inserted a “1,” and in fields that did not match one of the 57 valid values, we inserted a “0.” Ultimately, all the values in this column are syntactically valid; in our Boolean indicator column, 100 percent of the records contained “1.”

Because column 14 also contains state abbreviations, we can perform a similar set of profiling checks on this column. Again, a count of all the distinct values in this column reveals that there are 57 possible values.
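The column 5 check described above (collect the unique values, look each one up in a reference set, and record a Boolean indicator) might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the sample values are made up rather than taken from the FEC file, the function name `syntactic_indicator` is hypothetical, and the reference set is built from the standard two-letter postal abbreviations.

```python
# Reference set: 50 state abbreviations, 5 territories plus DC, and "US".
STATES = set("""
AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO
MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY
""".split())
TERRITORIES_AND_DC = {"AS", "DC", "GU", "MP", "PR", "VI"}
OFFICE_STATE_VALUES = STATES | TERRITORIES_AND_DC | {"US"}  # 57 values


def syntactic_indicator(values, valid_values):
    """Return 1 for each value found in valid_values, otherwise 0."""
    return [1 if value in valid_values else 0 for value in values]


# Made-up sample standing in for column 5; not real FEC records.
sample_office_states = ["US", "CA", "TX", "PR", "CA", "DC"]

# Step 1: list the unique values in the column.
unique_values = sorted(set(sample_office_states))

# Step 2: look up each value and build the Boolean indicator column.
indicator = syntactic_indicator(sample_office_states, OFFICE_STATE_VALUES)

# Step 3: summarize the share of records that passed the check.
share_valid = sum(indicator) / len(indicator)

print(unique_values)
print(indicator)
print(f"{share_valid:.0%} of sample records are syntactically valid")
```

The same function can be reused for other columns by passing a different reference set, so an unexpected or empty value simply yields a 0 in the indicator column.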
However, because this column represents a mailing address, there are only 56 possible valid values: 50 US state abbreviations, “DC” for the District of Columbia, and 5 US territory abbreviations. At first glance, we can assume that at least some of the records contain syntactically invalid entries in column 14. In addition to 57 distinct values, this column also contains missing values. We can consider missing values syntactically invalid because the data dictionary does not indicate that missing values should appear in this column.

We’ll use the same procedure that we applied when profiling column 5 to see which of the individual values in column 14 are syntactically valid. Performing a lookup to a reference table and generating a Boolean indicator column shows that there is a single record that contains an erroneous state: “ZZ.”

You can perform a similar set of syntactic checks on the other columns in the Candidate Master File. We recommend generating a series of Boolean indicator columns to show whether the values in each record are permissible given the constraints defined in the FEC’s data dictionary.

Set-Based Profiling in the Candidate Master File

Let’s profile the distribution of values in column 4 of the Candidate Master File. According to the data dictionary, this column represents the year of the election for which each candidate registered. Since this dataset can include candidates for any election with active campaign committees, we would expect to see the years distributed so that there are relatively few records for elections prior to 2016, a large number of records for the 2016 election year, and possibly a small number of records that represent future elections (perhaps 2018 or 2020).

After you’ve generated a summary view that counts the number of records that occur in each year, you should see a very wide range of values in column 4.
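A summary view of this kind could be produced in many tools; the following is a minimal Python sketch of one way to count records per election year. It assumes the file is pipe-delimited with no header row, and it runs on a handful of made-up records rather than real FEC data. The helper name `count_values` is hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up, pipe-delimited records for illustration only; not real FEC data.
# Only the first five fields are shown; field 4 holds the election year.
SAMPLE = """\
ID1|CANDIDATE A|DEM|2016|US
ID2|CANDIDATE B|REP|2016|US
ID3|CANDIDATE C|IND|1990|CA
ID4|CANDIDATE D|DEM|2018|TX
ID5|CANDIDATE E|REP|2064|NY
ID6|CANDIDATE F|REP|2016|FL
"""


def count_values(lines, column_number, delimiter="|"):
    """Count how often each value occurs in a 1-based column."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(row[column_number - 1] for row in reader if row)


year_counts = count_values(io.StringIO(SAMPLE), column_number=4)

# Print the summary view in year order, then the range of years seen.
for year in sorted(year_counts):
    print(year, year_counts[year])

print("earliest:", min(year_counts), "latest:", max(year_counts))
```

To run the same count on a downloaded file, an open file handle could be passed in place of the `io.StringIO` object, provided the delimiter assumption holds for that file.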
The earliest recorded date is 1990; the date farthest in the future is 2064.

At this point, we would recommend stepping back to determine the utility of records in this column. If you remember our discussion in Chapter 2, assessing the utility of your data involves generating custom metadata, or metadata specific to your use case. That means that we should assess the distribution of the values in column 4 in the context of our specific project to see how many of these records are relevant to our analysis. The goal of this project is to see if there are any trends in the campaign contributions received by each of the two major candidates in the 2016 presidential election, Hillary Clinton and Donald Trump. Since we’re interested in only the 2016 presidential election, records that represent candidates registered for the elections in 1990 or 2064 are ultimately irrelevant to our task. We can insert additional metadata into our dataset at this stage, perhaps flagging records that contain a value other than “2016” in column 4 as invalid.

The data for this R inquiry:
https://www.fec.gov/files/bulk-downloads/2020/cn20.zip
https://www.fec.gov/campaign-finance-data/candidate-master-file-description/
https://www.fec.gov/files/bulk-downloads/data_dictionaries/cn_header_file.csv
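The flagging step suggested for column 4 can be illustrated with a short Python sketch. It is not the authors' implementation: the records are made up, the function name `flag_relevance` is hypothetical, and the flag is stored as an extra trailing field purely for convenience.

```python
TARGET_YEAR = "2016"  # the election year this project cares about


def flag_relevance(records, year_column=4, target_year=TARGET_YEAR):
    """Return new records with a trailing 1/0 field marking relevance.

    year_column is 1-based, matching the column numbering used in the text.
    """
    flagged = []
    for record in records:
        is_relevant = 1 if record[year_column - 1] == target_year else 0
        # Copy the record so the original data is left untouched.
        flagged.append(list(record) + [is_relevant])
    return flagged


# Made-up records for illustration only; not real FEC data.
records = [
    ["ID1", "CANDIDATE A", "DEM", "2016"],
    ["ID2", "CANDIDATE B", "REP", "1990"],
    ["ID3", "CANDIDATE C", "IND", "2064"],
]

flagged = flag_relevance(records)
relevant = [record for record in flagged if record[-1] == 1]

print(len(relevant), "of", len(flagged), "records are relevant")
```

Adding a flag instead of deleting rows keeps the original records available, so the same dataset can still serve a project with a different target year.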
