PUBLICATIONS

High-resolution experimental and computational profiling of tissue-specific known and novel miRNAs in Arabidopsis
Genome Research (2012)

Breakfield NW, Corcoran DL, Petricka JJ, Shen J, Sae-Seaw J, Rubio-Somoza I, Weigel D, Ohler U, Benfey PN (2012) High-resolution experimental and computational profiling of tissue-specific known and novel miRNAs in Arabidopsis. Genome Res, 22(1):163-76.

  • Developed PIPmiR, a novel statistical classification tool that applies machine learning algorithms to unstructured miRNA data, and used it to predict novel plant miRNAs in the Arabidopsis genome.
  • Prepared data for analysis, extracted features, trained the classifier with predictive algorithms such as naïve Bayes and logistic regression, and ran cross-validation. The classifier predicted a set of novel miRNAs that were validated in the lab, and the work was published in the peer-reviewed journal Genome Research.
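
The train-and-cross-validate workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not the PIPmiR implementation: the two numeric features, the synthetic data, the choice of a Gaussian naive Bayes variant, and all function names are assumptions made for the example.

```python
import math
import random
from statistics import mean, pvariance

def fit_gaussian_nb(X, y):
    """Estimate, per class, the prior, feature means, and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = (
            len(rows) / len(X),
            [mean(c) for c in cols],
            [pvariance(c) + 1e-9 for c in cols],  # small floor avoids zero variance
        )
    return model

def predict(model, x):
    """Return the class with the highest log posterior score for sample x."""
    def log_score(prior, mus, variances):
        s = math.log(prior)
        for v, mu, var in zip(x, mus, variances):
            s += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return s
    return max(model, key=lambda label: log_score(*model[label]))

def cross_validate(X, y, k=5, seed=0):
    """k-fold cross-validation: train on k-1 folds, score on the held-out fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_gaussian_nb([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies

# Synthetic stand-in data (hypothetical): two well-separated classes.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

accuracies = cross_validate(X, y, k=5)
print("per-fold accuracy:", accuracies)
```

Scoring only on held-out folds is what makes the accuracy estimate honest: each sample is predicted by a model that never saw it during training.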

You Driving? Talk to you Later
ACM MobiSys (2011)

Best Conference Poster: Chu HL, Raman V, Shen J, Roy Choudhury R, Kansal A, Bahl V (June 2011) You Driving? Talk to you Later. Poster at ACM MobiSys.

  • Implemented a Driver Detection System (DDS) for Android and iPhone devices that differentiates between driver and passenger, in order to eliminate cell phone distraction for the driver.
  • Used sensor inputs from the accelerometer, gyroscope, and microphone as classification features, trained the classifier with a Support Vector Machine (LIBSVM) in Java and MATLAB, and ran DDS on real-world data collected from multiple users.
  • Presented findings at Association for Computing Machinery (ACM) MobiSys 2011.

Market Power & Reciprocity Among Vertically Integrated Cable Providers
Duke Journal of Economics (2011)

Shen J, Marx L (2011) Market Power & Reciprocity Among Vertically Integrated Cable Providers. Duke Journal of Economics, XXVIII.

  • Engaged in research to find evidence of collusion among vertically integrated multichannel video programming distributors (MVPDs) in the cable industry.
  • Ran multivariate OLS and probit regressions in Stata on 2007 and 2010 industry data comprising over 8,000 cable television headend records to identify carriage tendencies among MVPDs.
  • Results indicate that reciprocity exists among MVPDs and that MVPDs have a strong tendency to carry their own affiliated networks.

PROJECTS

Boston Globe
Techniques: SQL, AWS, feature engineering, and Random Forest (supervised machine learning).
The typical cyber-life of a Boston Globe user starts with anonymous visits and progresses from casual browsing to, ultimately, a subscription. The Boston Globe would like to understand the patterns and idiosyncrasies of its subscribers and use that knowledge to increase subscriptions.

This semester, we partnered with the Boston Globe to build a subscriber detection model. Through extensive data exploration, feature engineering, and tuning of a statistical machine learning classifier, we increased the subscriber detection rate from 1/3,200 to 1/300, roughly a 10-fold increase.
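
The size of the improvement follows directly from the two rates quoted above. A quick check in Python (the variable names are illustrative, and only the two quoted rates are used):

```python
from fractions import Fraction

# Detection rates as quoted in the text: 1 in 3,200 before, 1 in 300 after.
baseline_rate = Fraction(1, 3200)
model_rate = Fraction(1, 300)

# The lift is the ratio of the new rate to the baseline rate.
lift = model_rate / baseline_rate
print(f"lift = {float(lift):.1f}x")  # 3200 / 300 = 32/3, about 10.7
```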

More coming soon!

Bayesketball
Techniques: Random Forest for variable selection, Bayesian logistic regression, Metropolis-Hastings, slice sampling, and Monte Carlo simulations.
Featuring 351 first division men’s teams, college basketball is a major source of excitement in American sports. The annual NCAA Tournament regularly attracts more advertising spending than the Super Bowl, and tens of millions of Americans pore over box scores and statistics to win a share of the $12 billion in associated gambling transactions. In that tradition—making predictions, not indulging our vices—we are interested in applying Bayesian analysis of basketball team characteristics to predicting game outcomes. We will use past years’ data when building our model, and evaluate its performance using actual game results from the 2014-15 season and tournament.
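
As a rough illustration of the sampling machinery listed under Techniques, the sketch below fits a one-feature Bayesian logistic regression with a random-walk Metropolis-Hastings sampler. It is not the project's model: the single feature (a hypothetical rating difference between two teams), the made-up outcomes, the Normal(0, 5²) priors, and the step size are all assumptions chosen to keep the example small.

```python
import math
import random

def log_posterior(beta, xs, ys, prior_sd=5.0):
    """Log posterior (up to a constant) for logistic regression.

    beta = (intercept, slope), each with an independent Normal(0, prior_sd^2) prior.
    """
    b0, b1 = beta
    log_prior = -(b0 ** 2 + b1 ** 2) / (2.0 * prior_sd ** 2)
    log_lik = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        # log p(y | z) = y*z - log(1 + exp(z)), written to avoid overflow
        log_lik += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return log_prior + log_lik

def metropolis_hastings(xs, ys, n_iter=5000, step=0.3, seed=0):
    """Random-walk Metropolis-Hastings over (intercept, slope)."""
    rng = random.Random(seed)
    beta = (0.0, 0.0)
    current = log_posterior(beta, xs, ys)
    samples = []
    for _ in range(n_iter):
        proposal = (beta[0] + rng.gauss(0.0, step), beta[1] + rng.gauss(0.0, step))
        candidate = log_posterior(proposal, xs, ys)
        # Symmetric proposal, so accept with probability min(1, posterior ratio).
        # 1 - random() lies in (0, 1], which keeps log() well defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
        samples.append(beta)
    return samples

# Hypothetical data: x = rating difference, y = 1 if the first team won.
xs = [-3.0, -2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 3.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

samples = metropolis_hastings(xs, ys)
kept = samples[1000:]  # discard burn-in draws
slope_mean = sum(b1 for _, b1 in kept) / len(kept)
print(f"posterior mean slope: {slope_mean:.2f}")
```

Game outcomes could then be simulated by drawing coefficients from `kept` and sampling wins from the implied probabilities, which is one way the Monte Carlo step might be approached.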

Predicting Social News Reach
Techniques: unsupervised machine learning, PCA, supervised machine learning, regressions, sentiment analysis, and topic modeling.
For a data science project, we used Twitter, Facebook, and Bitly data to understand the social media presence of 25 news organizations (CNN, NY Times, etc.).

Sabermetrics
Techniques: regression and Monte Carlo.
We will be exploring a baseball data set and hope to create a little bit of sabermetrics magic.

A look into Singular Value Decomposition (SVD)
Techniques: SVD, PCA, and KNN.
We explore how SVD is used in machine learning models.

Statistical Analysis of Startup Funding Using Crunchbase Data
Techniques: Multivariate linear regression and random effects model.
In recent years, startups such as Facebook and WhatsApp have made big headlines with impressive Initial Public Offerings (IPOs) and acquisitions. In 2014 alone, combined startup funding in San Francisco, New York, and Boston reached a staggering $28.2 billion. That same year saw 7,168 funding rounds in the United States. Clustered in several large cities, the startup scene features complex interactions between product and place.

We seek to understand which product markets are best-funded, and which regions attract the most capital investment. Specifically, we explore the effects of market type and startup region on startup funding. We also explore the effects of population and mean income on startup funding by region, theorizing that certain demographic characteristics define startup clusters.

The funding relationships we describe are of interest to startup founders, venture capitalists, and policymakers. Our findings can help these stakeholders understand where funding is available, and which markets to invest resources in.

Prediction of Dental Visits: Kaggle Competition
Techniques: Generalized Linear Models (GLMs), Generalized Additive Models (GAMs), CARTs, and random forest.

Description coming soon

Graphs, Iterators, and More!
Techniques: functional programming, proxy pattern, functors, iterators, and more.
In CS 207, “Systems Development for Computational Science,” a software engineering design class, we have been working on a semester-long project to develop a robust Graph class.
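
To give a flavor of the iterator-based design, here is a small Python sketch of a graph that exposes its nodes, edges, and neighbors through iterators. The course project's language and interface are not described here, so the class layout and every method name below are hypothetical.

```python
from typing import Dict, Iterator, Set, Tuple

class Graph:
    """Undirected graph stored as adjacency sets (illustrative sketch only)."""

    def __init__(self) -> None:
        self._adj: Dict[int, Set[int]] = {}

    def add_node(self, n: int) -> None:
        self._adj.setdefault(n, set())

    def add_edge(self, a: int, b: int) -> None:
        self.add_node(a)
        self.add_node(b)
        self._adj[a].add(b)
        self._adj[b].add(a)

    def __iter__(self) -> Iterator[int]:
        # Node iterator: visits nodes in ascending order.
        return iter(sorted(self._adj))

    def edges(self) -> Iterator[Tuple[int, int]]:
        # Edge iterator: yields each undirected edge once, as (a, b) with a <= b.
        for a in sorted(self._adj):
            for b in sorted(self._adj[a]):
                if a <= b:
                    yield (a, b)

    def neighbors(self, n: int) -> Iterator[int]:
        # Incident iterator: visits the neighbors of node n in ascending order.
        return iter(sorted(self._adj[n]))

g = Graph()
g.add_edge(1, 2)
g.add_edge(2, 3)
print(list(g))               # nodes
print(list(g.edges()))       # edges
print(list(g.neighbors(2)))  # neighbors of node 2
```

Hiding the adjacency structure behind iterators lets client code traverse the graph without depending on how it is stored.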