High-resolution experimental and computational profiling of tissue-specific known and novel miRNAs in Arabidopsis
Genome Research (2012)
Breakfield NW, Corcoran DL, Petricka JJ, Shen J, Sae-Seaw J, Rubio-Somoza I, Weigel D, Ohler U, Benfey PN (2012) High-resolution experimental and computational profiling of tissue-specific known and novel miRNAs in Arabidopsis. Genome Res, 22(1):163-76.
You Driving? Talk to you Later
ACM Mobisys (2011)
Best Conference Poster: Chu HL, Raman V, Shen J, Roy Choudhury R, Kansal A, Bahl V (June 2011) You Driving? Talk to you Later, poster at ACM Mobisys.
Market Power & Reciprocity Among Vertically Integrated Cable Providers
Duke Journal of Economics (2011)
Shen J, Marx L (2011) Market Power & Reciprocity Among Vertically Integrated Cable Providers. Duke Journal of Economics, XXVIII.
Boston Globe
Techniques: SQL, AWS, feature engineering, and Random Forest (supervised machine learning).
The typical online life of a Boston Globe user starts with anonymous visits and progresses from casual browsing to, ultimately, a subscription. The Boston Globe would like to understand the patterns and idiosyncrasies of its subscribers and use that knowledge to increase subscriptions.
This semester, we partnered with the Boston Globe to build a subscriber detection model. Through extensive data exploration, feature engineering, and tuning of a statistical machine learning classifier, we increased the subscriber detection rate from 1 in 3,200 to 1 in 300, roughly a 10-fold increase.
More coming soon!
Bayesketball
Techniques: Random Forest for variable selection, Bayesian logistic regression, Metropolis-Hastings, slice sampling, and Monte Carlo simulations.
Featuring 351 first division men’s teams, college basketball is a major source of excitement in American sports. The annual NCAA Tournament regularly attracts more advertising spending than the Super Bowl, and tens of millions of Americans pore over box scores and statistics to win a share of the $12 billion in associated gambling transactions. In that tradition—making predictions, not indulging our vices—we are interested in applying Bayesian analysis of basketball team characteristics to predicting game outcomes. We will use past years’ data when building our model, and evaluate its performance using actual game results from the 2014-15 season and tournament.
Predicting Social News Reach
Techniques: unsupervised machine learning, PCA, supervised machine learning, regressions, sentiment analysis, and topic modeling.
In this data science project, we used Twitter, Facebook, and Bitly data to understand the social media presence of 25 news organizations (CNN, NY Times, etc.).
Sabermetrics
Techniques: regression and Monte Carlo simulation.
We will be exploring a baseball data set and hope to create a little bit of sabermetrics magic.
A look into Singular Value Decomposition (SVD)
Techniques: SVD, PCA, and KNN.
We explore how SVD is used in machine learning models.
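As a small illustration of how SVD connects to the PCA technique listed above, here is a minimal sketch that computes principal components from the SVD of a centered data matrix. It is not the project's code: the function name pca_via_svd, the random data, and the choice of two components are assumptions made for the example, and it assumes NumPy is available.

```python
import numpy as np


def pca_via_svd(X, k):
    """Project the rows of X onto the top-k principal components using SVD.

    Hypothetical helper for illustration only.
    """
    # PCA works on mean-centered columns.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                 # principal directions, one per row
    scores = Xc @ components.T          # coordinates of each sample
    # Squared singular values give the variance along each direction.
    explained_variance = s[:k] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance


# Synthetic data (an assumption; any samples-by-features matrix works).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
scores, components, explained_variance = pca_via_svd(X, 2)
```

The explained variances obtained this way match the largest eigenvalues of the sample covariance matrix, which is why SVD is a common way to compute PCA without forming the covariance matrix explicitly.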
Statistical Analysis of Startup Funding Using Crunchbase Data
Techniques: Multivariate linear regression and random effects model.
In recent years, startups such as Facebook and WhatsApp have made big headlines with impressive Initial Public Offerings (IPOs) and acquisitions. In 2014 alone, combined startup funding in San Francisco, New York, and Boston reached a staggering $28.2 billion. That same year saw 7,168 funding rounds in the United States. Clustered in several large cities, the startup scene features complex interactions between product and place.
We seek to understand which product markets are best-funded, and which regions attract the most capital investment. Specifically, we explore the effects of market type and startup region on startup funding. We also explore the effects of population and mean income on startup funding by region, theorizing that certain demographic characteristics define startup clusters.
The funding relationships we describe are of interest to startup founders, venture capitalists, and policymakers. Our findings can help these stakeholders understand where funding is available, and which markets to invest resources in.
Prediction of Dental Visits: Kaggle Competition
Techniques: Generalized Linear Models (GLMs), Generalized Additive Models (GAMs), CARTs, and random forest.
Graphs, Iterators, and More!
Techniques: functional programming, proxy pattern, functors, iterators, and more.
In CS 207, “Systems Development for Computational Science,” a software engineering design class, we have been working on a semester-long project to develop a robust Graph class.
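To give a flavor of the proxy pattern and iterators mentioned above, here is a minimal sketch of a graph whose nodes are lightweight proxies into storage owned by the graph. It is written in Python for brevity and is not the course's Graph class: every name here (Graph, Node, add_node, add_edge, neighbors) is a hypothetical choice made for this example.

```python
class Graph:
    """Undirected graph that owns all node data (illustrative sketch)."""

    def __init__(self):
        self._values = []   # node payloads, indexed by node id
        self._adj = []      # one set of neighbor ids per node

    def add_node(self, value=None):
        """Store a new node and return a proxy that refers to it."""
        self._values.append(value)
        self._adj.append(set())
        return Node(self, len(self._values) - 1)

    def add_edge(self, a, b):
        """Connect two distinct nodes; adding the same edge twice is a no-op."""
        if a.index == b.index:
            raise ValueError("self-loops are not supported in this sketch")
        self._adj[a.index].add(b.index)
        self._adj[b.index].add(a.index)

    def nodes(self):
        """Iterate lazily over proxies for every node."""
        return (Node(self, i) for i in range(len(self._values)))

    def __len__(self):
        return len(self._values)


class Node:
    """Proxy object: holds only a graph reference and an index."""

    __slots__ = ("_graph", "index")

    def __init__(self, graph, index):
        self._graph = graph
        self.index = index

    @property
    def value(self):
        return self._graph._values[self.index]

    @value.setter
    def value(self, new_value):
        self._graph._values[self.index] = new_value

    def neighbors(self):
        """Iterate over adjacent nodes in index order."""
        return (Node(self._graph, j) for j in sorted(self._graph._adj[self.index]))

    def __eq__(self, other):
        return (isinstance(other, Node)
                and self._graph is other._graph
                and self.index == other.index)

    def __hash__(self):
        return hash((id(self._graph), self.index))


# Usage: build a small path a - b - c.
g = Graph()
a, b, c = g.add_node("a"), g.add_node("b"), g.add_node("c")
g.add_edge(a, b)
g.add_edge(b, c)
neighbor_values = [n.value for n in b.neighbors()]
```

Because a proxy stores no data of its own, updating a value through one proxy is visible through every other proxy for the same node, and proxies stay cheap to copy and compare.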