Shared posts

01 Jan 17:46

Robust Inference with Variational Bayes. (arXiv:1512.02578v1 [stat.ME])

by Ryan Giordano, Tamara Broderick, Michael Jordan

In Bayesian analysis, the posterior follows from the data and a choice of a prior and a likelihood. One hopes that the posterior is robust to reasonable variation in the choice of prior and likelihood, since this choice is made by the modeler and is necessarily somewhat subjective. Despite the fundamental importance of the problem and a considerable body of literature, the tools of robust Bayes are not commonly used in practice. This is in large part due to the difficulty of calculating robustness measures from MCMC draws. Although methods for computing robustness measures from MCMC draws exist, they lack generality and often require additional coding or computation.

In contrast to MCMC, variational Bayes (VB) techniques are readily amenable to robustness analysis. The derivative of a posterior expectation with respect to a prior or data perturbation is a measure of local robustness to the prior or likelihood. Because VB casts posterior inference as an optimization problem, its methodology is built on the ability to calculate derivatives of posterior quantities with respect to model parameters, even in very complex models. In the present work, we develop local prior robustness measures for mean-field variational Bayes(MFVB), a VB technique which imposes a particular factorization assumption on the variational posterior approximation. We start by outlining existing local prior measures of robustness. Next, we use these results to derive closed-form measures of the sensitivity of mean-field variational posterior approximation to prior specification. We demonstrate our method on a meta-analysis of randomized controlled interventions in access to microcredit in developing countries.

Donate to arXiv

10 Mar 22:36

CRAN download statistics of any packages #rstats

by Daniel

(This article was first published on Strenge Jacke! » R, and kindly contributed to R-bloggers)

Hadley Wickham announced at Twitter that RStudio now provides CRAN package download logs. I was wondering about the download numbers of my package and wrote some code to extract that information from the logs…

The first code snippet is taken from the log website itself:

# Here's an easy way to get all the URLs in R
start <- as.Date('2013-11-28')
today <- as.Date('2015-03-04')

all_days <- seq(start, today, by = 'day')

year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')

Then I downloaded all files into a folder:

for (i in 1:length(urls)) {
  download.file(urls[i], sprintf("~/Desktop/rstats/temp%i.csv.gz", i))
}

Unzipping did not work with unzip, so I just “opened” all files with the OS X unarchiver, which was quite convenient.

Than I read all csv-files and extracted the information for my package, sjPlot, from each csv-file and merged everything into one data frame:

sjPlot.df <- data.frame()
library(dplyr)
pb <- txtProgressBar(min=0, max=length(urls), style=3)

for (i in 1:length(urls)) {
  df.csv <- read.csv(sprintf("~/Desktop/rstats/temp%i.csv", i))
  pack <- tolower(as.character(df.csv$package))
  my.package <- which(pack == "sjplot")
  if (length(my.package) > 0 ) {
    dummy.df <- df.csv %>% dplyr::slice(my.package) %>% dplyr::select(date, package, version, country)
    sjPlot.df <- dplyr::bind_rows(sjPlot.df, dummy.df)
  }
  setTxtProgressBar(pb, i)
}
close(pb)
sjPlot.df$date.short <- strftime(sjPlot.df$date, format="%Y-%m")

Finally, the download-stats as plot:

library(sjPlot)
library(ggplot2)

mydf <- sjPlot.df %>% dplyr::count(date.short)

sjp.setTheme(theme = "539", axis.angle.x = 90)
ggplot(mydf, aes(x = date.short, y = n)) +
  geom_bar(stat = "identity", width = .5, alpha = .5, fill = "#3399cc") +
  scale_y_continuous(expand = c(0, 0), breaks = seq(250, 1500, 250)) +
  labs(x = sprintf("Monthly CRAN-downloads of sjPlot package since first release until 4th March (total download: %i)", sum(mydf$n)), y = NULL)

sjPlot-downloads

By the way, there’s already a shiny app for this…


Tagged: data visualization, ggplot, R, rstats, sjPlot

To leave a comment for the author, please follow the link and comment on his blog: Strenge Jacke! » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
03 Jul 10:41

Last chance to register for the R in Insurance conference

by Markus Gesmann

(This article was first published on mages' blog, and kindly contributed to R-bloggers)

The registration for the 2nd R in Insurance conference at Cass Business School London will close this Friday, 4 July.

The programme includes talks from international practitioners and leading academics, see below. For more details and registration visit: http://www.rininsurance.com.

Still unsure? Review some impressions and presentations from last year's conference.

On behalf of the committee and sponsors, Mango Solutions, Cybaea, RStudio and PwC, we look forward to seeing you in London on 14 July!

To leave a comment for the author, please follow the link and comment on his blog: mages' blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
26 Nov 07:06

Ranked Choice Voting

by Andy

(This article was first published on Citizen-Statistician » R Project, and kindly contributed to R-bloggers)

The city of Minneapolis recently elected a new mayor. This is not newsworthy in and of itself, however the method they used was—ranked choice voting. Ranked choice voting is a method of voting allowing voters to rank multiple candidates in order of preference. In the Minneapolis mayoral election, voters ranked up to three candidates.

The interesting part of this whole thing was that it took over two days for the election officials to declare a winner. It turns out that the official procedure for calculating the winner of the ranked-choice vote involved cutting and pasting spreadsheets in Excel.

The technology coordinator at E-Democracy, Bill Bushey, posted the challenge of writing a program to calculate the winner of a ranked-choice election to the Twin Cities Javascript and Python meetup groups. Winston Chang also posted it to the Twin Cities R Meetup group. While not a super difficult problem, it is complicated enough that it can make for a nice project—especially for new R programmers. (In fact, our student R group is doing this.)

The algorithm, described by Bill Bushey, is

  1. Create a data structure that represents a ballot with voters’ 1st, 2nd, and 3rd choices
  2. Count up the number of 1st choice votes for each candidate. If a candidate has 50% + 1 votes, declare that candidate the winner.
  3. Else, select the candidate with the lowest number of 1st choice votes, remove that candidate completely from the data structure, make the 2nd choice of any voter who voted for the removed candidate the new 1st choice (and the old 3rd choice the new 2nd choice).
  4. Goto 2

As an example consider the following sample data:

Voter  Choice1  Choice2  Choice3
    1    James     Fred    Frank
    2    Frank     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

In this data, James has the most 1st choice votes (4) but it is not enough to win the election (a candidate needs 6 votes = 50% of 10 votes cast + 1 to win). So at this point we determine the least voted for candidate…Frank, and delete him from the entire structure:

Voter  Choice1  Choice2  Choice3
    1    James     Fred    Frank
    2    Frank     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Arnie
   10    David

Then, the 2nd choice of any voter who voted for Frank now become the new “1st” choice. This is only Voter #2 in the sample data. Thus Fred would become Voter #2′s 1st choice and James would become Voter #2′s 2nd choice:

Voter  Choice1  Choice2  Choice3
    1    James     Fred
    2     Fred    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              Fred
    7    Laura
    8    James
    9    David    Laura
   10    David

James still has the most 1st choice votes, but not enough to win (he still needs 6 votes!). Fred has the fewest 1st choice votes, so he is eliminated, and his voter’s 2nd and 3rd choices are moved up:

Voter  Choice1  Choice2  Choice3
    1    James
    2    James
    3    James    James    James
    4    Laura 
    5    David 
    6    James              
    7    Laura
    8    James
    9    David    Laura
   10    David

James now has five 1st choice votes, but still not enough to win. Laura has the fewest 1st choice votes, so she is eliminated, and her voter’s 2nd and 3rd choices are moved up:

Voter  Choice1  Choice2  Choice3
    1    James
    2    James
    3    James    James    James
    4     
    5    David 
    6    James              
    7    
    8    James
    9    David    Laura
   10    David

James retains his lead with five first place votes…but now he is declared the winner. Since Voter #4 and #7 do not have a 2nd or 3rd choice vote, they no longer count in the number of voters. Thus to win, a candidate only needs 5 votes = 50% of the 8 1st choice votes + 1.

The actual data from Minneapolis includes over 80,000 votes for 36 different candidates. There are also ballot issues such as undervoting and overvoting. This occurs when voters give multiple candidates the same ranking (overvoting) or do not select a candidate (undervoting).

The animated GIF below shows the results after each round of elimination for the Minneapolis mayoral election.

2013 Mayoral Race

The Minneapolis mayoral data is available on GitHub as a CSV file (along with some other smaller sample files to hone your programming algorithm). There is also a frequently asked questions webpage available from the City of Minneapolis regarding ranked choice voting.

In addition you can also listen to the Minnesota Public Radio broadcast in which they discussed the problems with the vote counting. The folks at the  R Users Group Meeting were featured and Winston brought the house down when commuting on the R program that computed the winner within a few seconds said, “it took me about an hour and a half to get something usable, but I was watching TV at the time.”.

To leave a comment for the author, please follow the link and comment on his blog: Citizen-Statistician » R Project.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
20 Nov 16:18

Rock'n'roll стар

Так получилось, что многие мои знакомые -  55-56 годов рождения (это середина прошлого века, если что). В середине девяностых (тоже прошлого века) они мне казались вполне ещё не старыми и бодрыми сорокалетними дядьками (и тётьками) - именно так, через мягкий знак. Я в свои нынешние (почти) сорок конечно, гораздо моложе их, тогдашних сорокалетних.

19 ноября 1955-го Карл Пёркинс записал свою "Blue Suede Shoes" в Sun Studios в Мемфисе. Ракинрол тоже стар и даже, как полагает некоторые, мёртв. А дядьки и тётки в основном живы и почти счастливы в своих вторых - третьих - четвёртых браках. Можно было бы проводить параллели между этапами развития ракинрола и матримониальных отношений знакомых мне особ, но на улице темно, бесснежно и сонно. Зима - медвежье время.