I bring statistics and machine learning together with critical perspectives from social science to consider when, how, and why data and modeling succeed in their aims—and when, how, and why they can fail. I am passionate about improving practice towards more responsible, robust, effective, and just uses of data and modeling, as well as engaging in outreach to help practitioners in policy, government, law, journalism, social science, business, civil society, and elsewhere understand and adopt machine learning and data science.
In addition to doing empirical work modeling social media and mobile phone sensor data, I work on how to understand statistics, machine learning, and data science from critical and constructivist perspectives, on ethical and policy implications of predictive modeling, and on understanding and communicating foundational problems in statistical models of social networks. I also seek out opportunities to apply my skills and understandings to data projects supporting social services and local governments.
I am the data science postdoctoral fellow at the Berkman Klein Center for Internet & Society at Harvard University. I contribute technical expertise to the Ethics and Governance of Artificial Intelligence Initiative, and provide statistical and network modeling for data in the media space.
Download my current resume or CV [last updated 22 February 2020].
I have begun blogging. See my first post: Can algorithms themselves be biased? Medium, Berkman Klein Center Collection. April 24, 2019. [Medium link] [Mobile-friendly PDF]
Momin M. Malik. (2020). A hierarchy of limitations in machine learning. In submission. [arXiv:2002.05193]
“All models are wrong, but some are useful,” wrote George E. P. Box (1979). Machine learning has focused on the usefulness of probability models for prediction in social systems, but is only now coming to grips with the ways in which these models are wrong—and the consequences of those shortcomings. This paper attempts a comprehensive, structured overview of the specific conceptual, procedural, and statistical limitations of models in machine learning when applied to society. Machine learning modelers themselves can use the described hierarchy to identify possible failure points and think through how to address them, and consumers of machine learning models can know what to question when confronted with the decision about if, where, and how to apply machine learning. The limitations go from commitments inherent in quantification itself, through to showing how unmodeled dependencies can lead to cross-validation being overly optimistic as a way of assessing model performance.
Kar-Hai Chu, Jason Colditz, Momin M. Malik, Tabitha Yates, and Brian Primack. (2019). Identifying key target audiences for public health campaigns: Leveraging machine learning in the case of hookah tobacco smoking. Journal of Medical Internet Research, 21 (7), e12443. doi:10.2196/12443. [JMIR link]
Background: Hookah tobacco smoking (HTS) is a particularly important issue for public health professionals to address owing to its prevalence and deleterious health effects. Social media sites can be a valuable tool for public health officials to conduct informational health campaigns. Current social media platforms provide researchers with opportunities to better identify and target specific audiences and even individuals. However, we are not aware of systematic research attempting to identify audiences with mixed or ambivalent views toward HTS.
Objective: The objective of this study was to (1) confirm previous research showing positively skewed HTS sentiment on Twitter using a larger dataset by leveraging machine learning techniques and (2) systematically identify individuals who exhibit mixed opinions about HTS via the Twitter platform and therefore represent key audiences for intervention.
Methods: We prospectively collected tweets related to HTS from January to June 2016. We double-coded a subset of approximately 5000 randomly sampled tweets for sentiment toward HTS and used these data to train a machine learning classifier to assess the remaining approximately 556,000 HTS-related Twitter posts. Natural language processing software was used to extract linguistic features (ie, language-based covariates). The data were processed by machine learning tools and algorithms using R. Finally, we used the results to identify individuals who, because they had consistently posted both positive and negative content, might be ambivalent toward HTS and represent an ideal audience for intervention.
Results: There were 561,960 HTS-related tweets: 373,911 were classified as positive and 183,139 were classified as negative. A set of 12,861 users met a priori criteria indicating that they posted both positive and negative tweets about HTS.
Conclusions: Sentiment analysis can allow researchers to identify audience segments on social media that demonstrate ambiguity toward key public health issues, such as HTS, and therefore represent ideal populations for intervention. Using large social media datasets can help public health officials to preemptively identify specific audience segments that would be most receptive to targeted campaigns.
Momin M. Malik. (2018, August). Bias and beyond in digital trace data. PhD dissertation, Carnegie Mellon University School of Computer Science. [SCS Technical Report Collection] [Defense slides]
Abstract: Large-scale digital trace data from sources such as social media platforms, emails, purchase records, browsing behavior, and sensors in mobile phones are increasingly used for business decision-making, scientific research, and even public policy. However, these data do not give an unbiased picture of underlying phenomena. In this thesis, I demonstrate some of the ways in which large-scale digital trace data, despite its richness, has biases in who is represented, what sorts of actions are represented, and what sorts of behaviors are captured. I present three critiques, demonstrating respectively that geotagged tweets exhibit heavy geographic and demographic biases, that social media platforms’ attempts to guide user behavior are successful and have implications for the behavior we think we observe, and that sensors built into mobile phones like Bluetooth and WiFi measure proximity and co-location but not necessarily interaction as has been claimed.
In response to these biases, I suggest shifting the scope of research done with digital trace data away from attempts at large-sample statistical generalizability and towards studies that situate knowledge in the contexts in which the data were collected. Specifically, I present two studies demonstrating alternatives to complement each of the critiques. In the first, I work with public health researchers to use Twitter as a means of public outreach and intervention. In the second, I design a study using mobile phone sensors in which I use sensor data and survey data to respectively measure proximity and sociometric choice, and model the relationship between the two.
Committee: Jürgen Pfeffer (co-chair), Institute for Software Research; Anind K. Dey (co-chair), Human-Computer Interaction Institute; Cosma Rohilla Shalizi, Department of Statistics & Data Science; and David Lazer, Northeastern University.
Jürgen Pfeffer and Momin M. Malik. (2017). Simulating the dynamics of socio-economic systems. In Betina Hollstein, Wenzel Matiaske, & Kai-Uwe Schnapp (Eds.), Networked governance: New research perspectives, pages 143–161. Cham, Switzerland: Springer. doi:10.1007/978-3-319-50386-8_9. [Springer Link (paywall)] [Authors’ copy (contains minor corrections)] [Full-sized vector image of my recreation of the World3 diagram] [BibTeX]
Excerpt: To the two traditional modes of doing science, in vivo (observation) and in vitro (experimentation), has been added “in silico”: computer simulation. It has become routine in the natural sciences, as well as in systems planning and business process management (Baines et al. 2004; Laguna and Marklund 2013; Paul et al. 1999) to recreate the dynamics of physical systems in computer code. The code is then executed to give outputs that describe how a system evolves from given inputs. Simulation models of simple physical processes, like boiling water or materials rupturing, give precise outputs that reliably match the outcomes of the actual physical system. However, as Winsberg (2010, p. 71) argues, scientists who rely on simulations do so because they “assume as background knowledge that we already know a great deal about how to build good models of the very features of the target system that we are interested in learning about.” This is not the case with social simulation. It is often done precisely to try and discover the important features of the target system when those features are unknown or uncertain. Social simulation is a kind of computer-aided thought experiment (Di Paolo et al. 2000) and as such, it is most appropriate to use as a “method of theory development” (Gilbert and Troitzsch 2005). Unlike in the natural sciences, uncertainty and the impossibility of verification are the rule rather than the exception, and so it is rare to find attempts to use social simulation for prediction and forecasting (Feder 2002).
Momin M. Malik and Jürgen Pfeffer. (2016). Identifying platform effects in social media data. In Proceedings of the Tenth International AAAI Conference on Web and Social Media (ICWSM-16), pages 241–249. May 18–20, 2016, Cologne, Germany. [Updated version, Chapter 2 from dissertation] [ICWSM link] [ICWSM slides] [IC2S2 slides] [Sunbelt slides] [BibTeX]
Abstract: Even when external researchers have access to social media data, they are not privy to decisions that went into platform design, including the measurement and testing that goes into deploying new platform features, such as recommender systems, that seek to shape user behavior towards desirable ends. Finding ways to identify platform effects is thus important both for generalizing findings, as well as understanding the nature of platform usage. One approach is to find temporal data covering the introduction of a new feature; observing differences in behavior before and after allows us to estimate the effect of the change. We investigate platform effects using two such datasets, the Netflix Prize dataset and the Facebook New Orleans data, in which we observe seeming discontinuities in user behavior but that we know or suspect are the result of a change in platform design. For the Netflix Prize, we estimate user ratings changing by an average of about 3% after the change, and in Facebook New Orleans, we find that the introduction of the ‘People You May Know’ feature locally nearly doubled the average number of edges added daily, and increased by 63% the average proportion of triangles created by each new edge. Our work empirically verifies several previously expressed theoretical concerns, and gives insight into the magnitude and variety of platform effects.
Momin M. Malik and Jürgen Pfeffer. (2016). A macroscopic analysis of news in Twitter. Digital Journalism, 4 (8), 955–979. doi:10.1080/21670811.2015.1133249. [Taylor & Francis link] [Preprint] [BibTeX]
Abstract: Previous literature has considered the relevance of Twitter to journalism, for example as a tool for reporters to collect information and for organizations to disseminate news to the public. We consider the reciprocal perspective, carrying out a survey of news media-related content within Twitter. Using a random sample of 1.8 billion tweets over four months in 2014, we look at the distribution of activity across news media and the relative dominance of certain news organizations in terms of relative share of content, the Twitter behavior of news media, the hashtags used in news content versus Twitter as a whole, and the proportion of Twitter activity that is news media-related. We find a small but consistent proportion of Twitter is news media-related (0.8 percent by volume); that news media-related tweets focus on a different set of hashtags than Twitter as a whole, with some hashtags such as those of countries of conflict (Arab Spring countries, Ukraine) reaching over 15 percent of tweets being news media-related; and we find that news organizations’ accounts, across all major organizations, largely use Twitter as a professionalized, one-way communication medium to promote their own reporting. Using Latent Dirichlet Allocation topic modeling, we also examine how the proportion of news content varies across topics within 100,000 #Egypt tweets, finding that the relative proportion of news media-related tweets varies vastly across different subtopics. Over-time analysis reveals that news media were among the earliest adopters of certain #Egypt subtopics, providing a necessary (although not sufficient) condition for influence.
Hemank Lamba, Momin M. Malik, and Jürgen Pfeffer. (2015). A tempest in a teacup? Analyzing firestorms on Twitter. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 (ASONAM 2015), pages 17–24. August 25–28, 2015, Paris, France. doi:10.1145/2808797.2808828. Best student paper award. [ACM link] [BibTeX]
Abstract: ‘Firestorms,’ sudden bursts of negative attention in cases of controversy and outrage, are seemingly widespread on Twitter and are an increasing source of fascination and anxiety in the corporate, governmental, and public spheres. Using media mentions, we collect 80 candidate events from January 2011 to September 2014 that we would term ‘firestorms.’ Using data from the Twitter decahose (or gardenhose), a 10% random sample of all tweets, we describe the size and longevity of these firestorms. We take two firestorm exemplars, #myNYPD and #CancelColbert, as case studies to describe more fully. Then, taking the 20 firestorms with the most tweets, we look at the change in mention networks of participants over the course of the firestorm as one method of testing for possible impacts of firestorms. We find that the mention networks before and after the firestorms are more similar to each other than to those of the firestorms, suggesting that firestorms neither emerge from existing networks, nor do they result in lasting changes to social structure. To verify this, we randomly sample users and generate mention networks for baseline comparison, and find that the firestorms are not associated with a greater than random amount of change in mention networks.
Momin M. Malik, Hemank Lamba, Constantine Nakos, and Jürgen Pfeffer. (2015). Population bias in geotagged tweets. In Papers from the 2015 ICWSM Workshop on Standards and Practices in Large-Scale Social Media Research (ICWSM-15 SPSM), pages 18–27. May 26, 2015, Oxford, UK. [Updated version, Chapter 1 from dissertation] [AAAI link] [Slides] [BibTeX]
Abstract: Geotagged tweets are an exciting and increasingly popular data source, but like all social media data, they potentially have biases in who is represented. Motivated by this, we investigate the question, ‘are users of geotagged tweets randomly distributed over the US population’? We link approximately 144 million geotagged tweets within the US, representing 2.6m unique users, to high-resolution Census population data and carry out a statistical test by which we answer this question strongly in the negative. We utilize spatial models and integrate further Census data to investigate the factors associated with this nonrandom distribution. We find that, controlling for other factors, population has no effect on the number of geotag users, and instead it is predicted by a number of factors including higher median income, being in an urban area, being further east or on a coast, having more young people, and having high Asian, Black or Hispanic/Latino populations.
Io Flament, Cristina Lozano, and Momin M. Malik. (2017). Data-driven planning for sustainable tourism in Tuscany. Cascais, Portugal: Data Science for Social Good Europe. [Report website] [Report]
Ethics, and the limits of data and modeling (invited talk). 13th Annual Analytics Day. April 24, 2020, Kennesaw State University, Kennesaw, Georgia.
Critical technical practice revisited: Towards ‘analytic actors’ in data science (invited talk). STS Circle. March 5, 2020, Program on Science, Technology & Society, Harvard Kennedy School, Cambridge, Massachusetts. [Slides]
Revisiting ‘all models are wrong’: Addressing limitations in big data, machine learning, and computational social science (invited talk). Wednesdays@NICO Seminar Speaker Series, Northwestern Institute on Complex Systems. February 5, 2020, Northwestern University, Evanston, Illinois. [Slides]
How STS can improve data science (invited talk). Science, Technology and Society Lunch Seminar. January 23, 2020, Tufts University, Medford, Massachusetts. [Slides]
A hierarchy of limitations in machine learning (invited talk). Microsoft Research New England. December 3, 2019, Microsoft Research New England, Cambridge, Massachusetts. [Slides]
Statistics and machine learning: Foundations, limitations, and ethics (invited talk). Colby College Department of Mathematics and Statistics, Colloquium Fall 2019. October 7, 2019, Colby College, Waterville, Maine. [Slides]
A critical introduction to machine learning. 2019 ACM Richard Tapia Celebration of Diversity in Computing Conference. September 19, 2019, Marriott Marquis San Diego Marina, San Diego, California. [Slides]
Everything you ever wanted to know about network statistics but were afraid to ask. XXXIX Sunbelt Social Networks Conference of the International Network for Social Network Analysis. June 18, 2019, UQAM, Montreal, Quebec. [Slides] [R script]
Three open problems for historians of AI. Towards a History of Artificial Intelligence. May 24, 2019, Columbia University, New York, New York. [Slides] [Video]
Interpretability is a red herring: Grappling with ‘prediction policy problems.’ 17th Annual Information Ethics Roundtable: Justice and Fairness in Data Use and Machine Learning. April 5, 2019, Northeastern University, Boston, Massachusetts. [Slides and draft] [Draft only]
What can AI do with copyrighted data? (invited talk). Bracing for Impact – The Artificial Intelligence Challenge: A Roadmap for AI Governance in Canada. Part II: Data, Policy & Innovation. IP Osgoode, Osgoode Hall Law School, York University. March 21, 2019, Toronto Reference Library, Toronto, Canada.
The ethical implications of technical limitations (invited talk). Fairness, Accountability & Transparency/Asia. January 12, 2019, Hong Kong.
Machine learning for social scientists. Fairness, Accountability & Transparency/Asia. January 11, 2019, Hong Kong. [Slides]
“AI” is a lie: Getting to the real issues. AGTech Forum, Berkman Klein Center for Internet & Society at Harvard University. December 13, 2018, Cambridge, Massachusetts. [Slides]
Theorizing sensors for social network research. Computational Social Science Institute, UMass Amherst. December 7, 2018, Amherst, Massachusetts. [Slides]
What everyone needs to know about ‘prediction’ in machine learning. Leverhulme Centre for the Future of Intelligence, University of Cambridge. December 3, 2018, Cambridge, UK. [Slides]
Anxiety, crisis, and a computational future for journalism. Philip Merrill College of Journalism / College of Information Studies, University of Maryland. November 27, 2018, College Park, Maryland.
Networks, yeah! The representation of relations. Data & Donuts, DigitalHKS, Harvard Kennedy School, Harvard University. November 2, 2018, Cambridge, Massachusetts.
Demystifying AI: Terms of disservice. AI Working Group, Berkman Klein Center for Internet & Society. October 28, 2018, Cambridge, Massachusetts.
Surprising aspects of “prediction” in data science. 0213eight, Harvard Alumni Association. October 13, 2018, Cambridge, Massachusetts.
From the forest to the swamp: Modeling vs. implementation in data science. Techtopia @ Harvard University. October 2, 2018, Cambridge, Massachusetts.
Thesis defense: Bias and beyond in digital trace data. Institute for Software Research, School of Computer Science, Carnegie Mellon University. August 9, 2018, Pittsburgh, Pennsylvania. [Slides]
Friendship and proximity in a fraternity cohort with mobile phone sensors. XXXVIII Sunbelt Conference of the International Network for Social Network Analysis. Modeling network dynamics (ses15.05). July 1, 2018, Utrecht, Netherlands. [Slides]
A critical introduction to statistics and machine learning. Cascais Data Science for Social Good Europe Fellowship, Nova School of Business and Economics, Universidade NOVA de Lisboa. August 15, 2017, Cascais/Lisbon, Portugal. [Part I Slides] [Part II Slides]
A social scientist’s guide to network statistics. Guest lecture in 70/73-449: Social, Economic and Information Networks, Fall 2016 (Instructor: Dr. Katharine Anderson). Undergraduate Economics, Tepper School of Business, Carnegie Mellon University. November 10, 2016, Pittsburgh, Pennsylvania. [Slides]
Platform effects in social media networks. 2nd Annual International Conference on Computational Social Science. Social Networks 1. June 24, 2016, Evanston, Illinois. [Slides]
Identifying platform effects in social media data. Tenth International AAAI Conference on Web and Social Media (ICWSM-16). Session I: Biases and Inequalities. May 18, 2016, Cologne, Germany. [Slides]
Social media data and computational models of mobility: A review for demography. 2016 ICWSM Workshop on Social Media and Demographic Research (ICWSM-16 SMDR). May 17, 2016, Cologne, Germany. [Slides]
Platform effects in social media networks. XXXVI Sunbelt Conference of the International Network for Social Network Analysis. Social Media Networks: Challenges and Solutions (Sunday AM2). April 10, 2016, Newport Beach, California. [Slides]
A social scientist’s guide to network statistics (presented to statisticians). stat-network seminar, Department of Statistics, Carnegie Mellon University. March 25, 2016, Pittsburgh, Pennsylvania. [Slides not public; see these slides for the same content.]
Ethical and policy issues in predictive modeling. Guest lecture in 08-200/08-630/19-211: Ethics and Policy Issues in Computing, Spring 2016 (Instructor: Professor James Herbsleb). Institute for Software Research, School of Computer Science, Carnegie Mellon University. March 1, 2016, Pittsburgh, Pennsylvania. [Slides]
Population bias in geotagged tweets. 2015 ICWSM Workshop on Standards and Practices in Social Media Research (ICWSM-15 SPSM). May 26, 2015, Oxford, UK. [Slides]
Inferring social networks from sensor data. XXXIV Sunbelt Conference of the International Network for Social Network Analysis. Network Data Collection (Saturday AM2). February 22, 2014, St Pete Beach, Florida. [Slides]
I have worked on projects outside of my main focus, contributing data analysis and/or theory.
Gabriel Ferreira, Momin Malik, Christian Kästner, Jürgen Pfeffer, and Sven Apel. (2016). Do #ifdefs influence the occurrence of vulnerabilities? An empirical study of the Linux Kernel. In Proceedings of the 20th International Systems and Software Product Line Conference (SPLC ’16), pages 65–73. September 19–23, 2016, Beijing, China. doi:10.1145/2934466.2934467. Nominated for Best Paper Award. [ACM link] [arXiv preprint] [BibTeX]
Kathleen M. Carley, Momin Malik, Peter M. Landwehr, Jürgen Pfeffer, and Michael Kowalchuck. (2016). Crowd sourcing disaster management: The complex nature of Twitter usage in Padang Indonesia. Safety Science, 90, 48–61. doi:10.1016/j.ssci.2016.04.002. [ScienceDirect link (paywall)]
Urs Gasser, Momin Malik, Sandra Cortesi, and Meredith Beaton. (2013, November 14). Mapping approaches to news literacy curriculum development: A navigation aid. Berkman Center Research Publication No. 2013-25. [SSRN link]
Momin Malik, Sandra Cortesi, and Urs Gasser. (2013, October 18). The challenges of defining ‘news literacy’. Berkman Center Research Publication No. 2013-20. [SSRN link]
Momin M. Malik. (2013, June 24). The role of incumbency in field emergence: The case of Internet studies. Poster presented at the Science of Team Science (SciTS) Conference 2013, Northwestern University, Evanston, IL, June 24–27, 2013. [PDF]
(Note that this is a poster version of my MSc thesis, adapted for the topic of SciTS. Also, I have since realized the error of a non-statistical approach to significance claims.)
Momin M. Malik. (2012, October). Networks of collaboration and field emergence in ‘Internet Studies’. Thesis submitted in partial fulfillment of the degree of MSc in Social Science of the Internet at the Oxford Internet Institute at the University of Oxford. Oxford Internet Institute, University of Oxford, Oxford, UK. [PDF]
Urs Gasser, Sandra Cortesi, Momin Malik, and Ashley Lee. (2012, February 16). Youth and digital media: From credibility to information quality. Berkman Center Research Publication No. 2012-1. [SSRN link]
Urs Gasser, Sandra Cortesi, Momin Malik, and Ashley Lee. (2010, August 30). Information quality, youth, and media: A research update. Youth Media Reporter. [Online]
Momin M. Malik. (2009, September). Survey of state initiatives for conservation of coastal habitats from sea-level rise. Rhode Island Coastal Resources Management Council. [PDF]
Momin M. Malik. (2008, December 8). Rediscovering Ramanujan. Thesis submitted in partial fulfillment for an honors degree in History and Science. The Department of the History of Science, Harvard University, Cambridge, MA. [PDF]
I may also be found in the acknowledgements of the following works:
Dariusz Jemielniak. (2019). Socjologia Internetu (in Polish). Warszawa: Wydawnictwo Naukowe Scholar. [Publisher website] [Sample content and reference list from author]
Keiki Hinami, Michael J. Ray, Kruti Doshi, Maria Torres, Steven Aks, John J. Shannon, and William E. Trick. (2019). Prescribing associated with high-risk opioid exposures among non-cancer chronic users of opioid analgesics: A social network analysis. Journal of General Internal Medicine. doi:10.1007/s11606-019-05114-3 [Springer link (paywall)] [PubMed record (abstract only)]
Viktor Mayer-Schönberger and Kenneth Cukier. (2013). Big Data: A revolution that will transform how we live, work, and think. Boston and New York: Eamon Dolan/Houghton Mifflin Harcourt. [Book website]
Mary Madden, Amanda Lenhart, Sandra Cortesi, Urs Gasser, Maeve Duggan, Aaron Smith, and Meredith Beaton. (2013, May 21). Teens, social media, and privacy. Pew Internet & American Life Project. [Report website]
I am Sponsorship Chair for the 14th International Conference on Web and Social Media (ICWSM-2020), Atlanta, Georgia, June 8–11, 2020.
I am an Editorial Board member for the 2019 special issue on “Critical Data and Algorithms Studies” in Frontiers in Big Data, Data Mining and Management (Frontiers Media S.A.).
I was co-organizer of the Workshop on Critical Data Science at the 13th International Conference on Web and Social Media (ICWSM-2019), Munich, Germany, June 11, 2019.
I was posters co-chair for the 11th International ACM Web Science Conference 2019 (WebSci ’19), Boston, Massachusetts, June 30–July 3, 2019.
I have done peer review for EPJ Data Science (Springer), 2018–present; Digital Journalism (Taylor & Francis), 2016–present; and the Journal of Medical Internet Research (JMIR Publications), 2017–2018.
I am a PC member for: the International AAAI Conference on Web and Social Media (ICWSM), 2018–present; the Web and Society track of the World Wide Web conference (TheWebConf, formerly WWW), 2018–present; and the International Conference on Computational Social Science (IC2S2), 2017–present.
I may be reached at gmail (my first name dot my last name).
This website is my primary online presence, but I maintain profiles elsewhere as well. [Twitter] [LinkedIn] [GitHub] [Google Scholar] [ORCID] [arXiv] [Academia.edu] [SSRN] [Frontiers] [ACM Author Page] [dblp] [ResearcherID/Publons]