
add documentation folder and initial report

pull/16/head
James Fairbanks 2 years ago
parent
commit
df7c8820fd
.gitignore            |   2
FluModel.ipynb        |   8
LICENSE               |   2
README.md             |   4
doc/covar_fig1.jpg    | BIN
doc/flu_pipeline.pdf  | BIN
doc/main.tex          | 384
doc/refs.bib          | 215
doc/schema.dot        |  65
doc/schema.pdf        | BIN

2
.gitignore

@@ -2,3 +2,5 @@
 *.jl.*.cov
 *.jl.mem
 deps/deps.jl
+.DS_Store
+.pynb_checkpoints

8
FluModel.ipynb

@@ -100,14 +100,18 @@
 " # population = stripunits(sum(sol.u[end]))\n",
 " df = Semantics.generate_synthetic_data(population, 0,100)\n",
 " f = @formula(vaccines_produced ~ flu_patients)\n",
-" model = lm(f, df[2:length(df.year), [:year, :flu_patients, :vaccines_produced]])\n",
+" model = lm(f,\n",
+" df[2:length(df.year),\n",
+" [:year, :flu_patients, :vaccines_produced]])\n",
 " println(\"GLM Model:\")\n",
 " println(model)\n",
 "\n",
 " year_to_predict = 1\n",
 " num_flu_patients_from_sim = finalI\n",
 " vaccines_produced = missing\n",
-" targetDF = DataFrame(year=year_to_predict, flu_patients=num_flu_patients_from_sim, vaccines_produced=missing)\n",
+" targetDF = DataFrame(year=year_to_predict,\n",
+" flu_patients=num_flu_patients_from_sim, \n",
+" vaccines_produced=missing)\n",
 " @show targetDF\n",
 "\n",
 "\n",

2
LICENSE

@@ -1,6 +1,6 @@
 MIT License
-Copyright (c) 2018 James
+Copyright (c) 2018 James Fairbanks <james.fairbanks@gtri.gatech.edu>
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

4
README.md

@@ -14,6 +14,10 @@ Then you can load it with `using Semantics`
 See the tests for example usage.
+## Documentation
+There is a doc folder which contains the documentation, including reports sent to our sponsor, DARPA.
+## Concepts
+This package enables representation of complex and diverse model structure in the type system of Julia. This will allow generic programming and API development for these complex models.

BIN
doc/covar_fig1.jpg


Width: 1319  |  Height: 503  |  Size: 147 KiB

BIN
doc/flu_pipeline.pdf

384
doc/main.tex

@@ -0,0 +1,384 @@
\documentclass{article}
\usepackage[utf8]{inputenc}
%\usepackage{hyperref}
\usepackage{fullpage}
\usepackage{float}
\usepackage{booktabs}
\usepackage[pdf]{graphviz}
\newcommand{\schemaorg}[1]{\url{https://schema.org/#1}}
\newcommand{\metaschemaorg}[1]{\url{https://meta.schema.org/#1}}
\newcommand{\jlpkg}[1]{#1.jl}
\usepackage{enumerate}
\usepackage[colorlinks]{hyperref}
\hypersetup{
colorlinks = true, %Colours links instead of ugly boxes
urlcolor = blue, %Colour for external hyperlinks
linkcolor = blue, %Colour of internal links
citecolor = blue %Colour of citations
}
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
%\input{jupyter.tex}
\usepackage{biblatex}
\addbibresource{refs.bib}
\title{Automatic Scientific Knowledge Extraction: Architecture, Approaches, and Techniques}
\author{Christine Herlihy and Scott Appling and Erica Briscoe and James Fairbanks}
\date{Dec 1, 2018}
\begin{document}
\maketitle{}
\begin{abstract}
This deliverable provides a high-level overview of the open-source epidemiological modeling software packages that we have reviewed, and outlines our intended approach for extracting information from scientific papers that cite one or more of these packages. Our extraction efforts are directed toward the construction of a knowledge graph that we can traverse to reason about how to best map a set of known, unitful inputs to a set of unknown, unitful outputs via parameter modification, hyperparameter modification, and/or sequential chaining of models present in the knowledge graph. Our overarching purpose in this work is to reduce the cost of conducting incremental scientific research, and facilitate communication and knowledge integration across different research domains.
% We discuss the state of the art and future developments in automating scientific knowledge extraction for advanced modeling.
% Our goals are to lower the cost of incremental science and to increase communication across domains.
% Data structures that represent the set of feasible modifications to models.
\end{abstract}
\section{Introduction}
% section re: lit review/discovery of existing open-source packages; some meta analysis of how these packages rank along dimension of usefulness versus actual adoption by researchers (given time in existence)
The ASKE program aims to extract knowledge from the body of scientific work. Our view is that the best way to prove that you have extracted knowledge is to show that you can build new models out of the components of old models. The purpose of these new models may be to improve the fidelity of the original model with respect to the phenomenon of interest or to probe the mechanistic relationships between phenomena. Another use case for adapting models to new contexts is to use a simulation to provide data that cannot be obtained through experimentation or observation.
Our initial scientific modeling domain is the epidemiological study of disease
spread, commonly called compartmental or SIR models.
These models are compelling because the literature demonstrates the use of
a repetitive model structure with many variations. The math represented therein spans both discrete and continuous equations, and the algorithms that solve these models are diverse. Additionally, this general model may apply to various national defense
related phenomena, such as viruses on computer networks~\cite{cohen_efficient_2003}
or misinformation in online media~\cite{budak_limiting_2011}.
The long-term goal for our project is to reduce the labor cost of integrating models between scientists so that researchers can more efficiently build on the research of others. Such an effort is usefully informed by prior work and practices within the areas of software engineering and open source software development. Having the source code for a library or package is essential to building on it, but perhaps even more important are the affordances provided by open source licensing models and (social) software distribution systems, which can significantly reduce the effort required to download others' code and streamline execution, from hours to minutes. This low barrier to entry is responsible for the proliferation of open source software that we see today.
By extracting knowledge from scientific software and representing that knowledge, including model semantics, in knowledge graphs, along with leveraging type systems to conduct program analysis, we aim to increase the interoperability and development of scientific models at large scale.
\setcounter{secnumdepth}{2}
\setcounter{tocdepth}{2}
\newpage
\tableofcontents
\section{Scientific Domain and Relevant Papers}
% Epidemiology, including information diffusion. Here we describe the papers that we found with code and documentation available, libraries of interest.
% identifying areas/models/research questions that could potentially be linked, as such models will be prime target for metamodeling
We have focused our initial knowledge artifact gathering efforts on the scientific domain of epidemiology broadly defined, so as to render the diffusion of both disease and information in scope. Given that our ultimate goal is to automate the extraction of calls to epidemiological modeling libraries and functions, as well as the unitful parameters contained therein, we have conducted a preliminary literature review for the purpose of: (1) identifying a subset of papers published in this domain that leverage open-source epidemiological modeling libraries, and/or agent-based simulation packages, and make their code available to other researchers; and (2) identifying causally dependent research questions that could benefit from, and/or be addressed by the modification and/or chaining of individual models, as these questions can serve as foundational test cases for the meta-models we develop.
\subsection{Papers and Libraries}
We began the literature review and corpus construction process by identifying a representative set of open-source software (OSS) frameworks for epidemiological modeling, and/or agent-based simulation, including: NDLib, EMOD, Pathogen, NetLogo, EpiModels, and FRED. These frameworks were selected for initial consideration based on: (1) the scientific domains and/or research questions they are intended to support (specifically, disease transmission and information diffusion); (2) the programming language(s) in which they are implemented (Julia, Python, R, C++); and (3) the extent to which they have been used in peer-reviewed publications that include links to their source code. We provide a brief overview of the main components of each package below, as well as commentary on the frequency with which each package has been used in relevant published works.
% general info to cover: team that worked on the project; background and objectives; domain(s); language(s) it's implemented in; community it's targeted for; nice/useful features with positive implications for future use; relative frequency of use within published literature
\subsubsection{NDLib}
NDLib is an open-source package developed by a research team from the Knowledge Discovery and Data Mining Laboratory (KDD-lab) at the University of Pisa, and written in Python on top of the NetworkX library. NDLib is intended to aid social scientists, computer scientists, and biologists in modeling/simulating the dynamics of diffusion processes in social, biological, and infrastructure networks \cite{NDlib1, NetworkX}. NDLib includes built-in implementations of many common epidemiological models (e.g., SIR, SEIR, SEIS, etc.), as well as models of opinion dynamics (e.g., Voter, Q-Voter, Majority Rule, etc.). In addition, there are several features intended to make NDLib available to non-developer domain experts, including an abstract Network Diffusion Query Language (NDQL), an experiment server that is query-able through a RESTful API to allow for remote execution, and a web-based GUI that can be used to visualize and run epidemic simulations \cite{NDlib1}.
The primary disadvantage of NDLib is that it is relatively new: the associated repository on GitHub was created in 2016, with the majority of commits beginning in 2017; two supporting software system architecture papers were published in 2017-2018 \cite{ndlibDocs, NDlib1, NDlib2}. As such, while there are several factors which bode well for future adoption (popularity of Python for data science workflows and computer science education; user-friendliness of the package, particularly for users already familiar with NetworkX, etc.), the majority of published works citing NDLib are papers written by the package authors themselves, and focus on information diffusion.
\subsubsection{EpiModels}
EpiModel is an R package, written by researchers at Emory University and The University of Washington, that provides tools for simulating and analyzing mathematical models of infectious disease dynamics. Supported epidemic model classes include deterministic compartmental models, stochastic individual contact models, and stochastic network models. Disease types include SI, SIR, and SIS epidemics with and without demography, with utilities available for expansion to construct and simulate epidemic models of arbitrary complexity. The network model class is based on the statistical framework of temporal exponential random graph models (ERGMs) implemented in the Statnet suite of software for R~\cite{JSSv084i08}. The library is widely used and the source code is available, so it would make a strong addition to the system we are building once integrated. EpiModel's development has been funded by several grants from the National Institutes of Health (NIH), and several publications in well-regarded journals, including PLoS ONE, Infectious Diseases, and the Journal of Statistical Software, utilize the library.
\subsubsection{NetLogo}
NetLogo, according to its User Manual, is a programmable modeling environment for simulating natural and social phenomena. It was authored by Uri Wilensky in 1999 and has been in continuous development ever since at the Center for Connected Learning and Computer-Based Modeling. NetLogo is particularly well suited for modeling complex systems developing over time. Modelers can give instructions to hundreds or thousands of ``agents'' all operating independently, which makes it possible to explore the connection between the micro-level behavior of individuals and the macro-level patterns that emerge from their interaction. NetLogo lets students open simulations and ``play'' with them, exploring their behavior under various conditions. It is also an authoring environment which enables students, teachers, and curriculum developers to create their own models. NetLogo is simple enough for students and teachers, yet advanced enough to serve as a powerful tool for researchers in many fields. NetLogo has extensive documentation and tutorials, and it comes with the Models Library, a large collection of pre-written simulations that can be used and modified. These simulations address content areas in the natural and social sciences, including biology and medicine, physics and chemistry, mathematics and computer science, and economics and social psychology. Several model-based inquiry curricula using NetLogo are available and more are under development. NetLogo is the next generation in a series of multi-agent modeling languages that includes StarLogo and StarLogoT. NetLogo runs on the Java Virtual Machine, so it works on all major platforms (Mac, Windows, Linux, et al.). It is run as a desktop application, and command line operation is also supported~\cite{tisue2004netlogo, nlweb}.

NetLogo has been widely used by the simulation research community at large for nearly two decades. Although there is a rich literature that mentions its use, it may be more difficult to identify scripts that pair with published research papers using the library, both because of the amount of time that has passed and because researchers may no longer monitor the email addresses listed on their publications.
\subsubsection{EMOD}
Epidemiological MODeling (EMOD) is an open-source agent-based modeling software package developed by the Institute for Disease Modeling (IDM), and written in C++ \cite{emodRepo, emodDocs}. The primary use case that EMOD is intended to support is the stochastic agent-based modeling of disease transmission over space and time. EMOD has built-in support for modeling malaria, HIV, tuberculosis, sexually transmitted infections (STIs), and vector-borne diseases; in addition, a generic modeling class is provided, which can be inherited from and/or modified to support the modeling of diseases that are not explicitly supported \cite{emodDocs, emodRepo}.
The documentation provided is thorough, and the associated GitHub repo has commits starting in July 2015; the most recent commit was made in July 2018 \cite{emodRepo, emodDocs}. EMOD also includes a regression test suite, so that stochastic simulation results can be compared to a reference set of results and assessed for statistical similarity within an acceptable range. In addition, EMOD leverages Message Passing Interface (MPI) to support within- and among-simulation(s)-level parallelization, and outputs results as JSON blobs. The IDM conducts research, and as such, there are a relatively large number of publications associated with the institute that leverage EMOD and make their data and code accessible. One potential drawback of EMOD relative to more generic agent-based modeling packages is that domain-wise, coverage is heavily slanted toward epidemiological models; built-in support for information diffusion models is not included.
\subsubsection{Pathogen}
Pathogen is an open-source epidemiological modeling package written in Julia \cite{pathogenRepo}. Pathogen is intended to allow researchers to model the spread of infectious disease using stochastic, individual-level simulations, and perform Bayesian inference with respect to transmission pathways \cite{pathogenRepo}. Pathogen includes built-in support for SEIR, SEI, SIR, and SI models, and also includes example Jupyter notebooks and methods to visualize simulation results (e.g., disease spread over a graph-based network, where vertices represent individual agents). With respect to the maturity of the package, the first commit to an alpha version of Pathogen occurred in 2015, and the master branch contains commits within the last month (e.g., November 2018) \cite{pathogenRepo}. Pathogen is appealing because it could be integrated into our Julia-based meta-modeling approach without incurring the overhead associated with wrapping non-Julia-based packages. However, one of the potential disadvantages of the Pathogen package is that there is no associated software or system architecture paper; as such, it is difficult to locate papers that use this package.
% the guy who wrote this package (Justin Angevaare ) is a PhD student and has written/forked other related libraries written in Julia and Python. per research gate he only has one publication and it cites Julia but does not cite Pathogen; none of his public github repos appear to be linked to this paper. we can reach out to him directly if we want to pursue this package further.
% new (first commit in May 2018; under active development)
% animated visualization of disease transmission over the network
% Julia -> performance +; used in scientific computing so good chance of linking to other models (?)
% relatively new; no citations
\subsubsection{FRED}
FRED, which stands for a Framework for Reconstructing Epidemic Dynamics, is an open-source, agent-based modeling software package written in C++, developed by the Pitt Public Health Dynamics Laboratory for the purpose of modeling the spread of disease(s) and assessing the impact of public health intervention(s) (e.g., vaccination programs, school closures, etc.) \cite{pittir24611, fredRepo}. FRED is notable for its use of synthetic populations that are based on U.S. Census Data, and as such, allow for the instantiation of agents whose spatiotemporal and sociodemographic characteristics, including household membership and location, as well as income level and patterns of employment and/or school attendance, reflect the actual distribution of the population in the selected geographic area(s) within the United States \cite{pittir24611}. FRED is modular and parameterized to allow for support of different diseases, and the associated software paper, as well as the GitHub repository, provide clear, robust documentation for use. One advantage of FRED relative to some of the other packages we have reviewed is that it is relatively mature. Commits range from 2014--2016, and the associated software paper was published in 2013; as such, epidemiology researchers have had more time to become familiar with the software and cite it in their works \cite{pittir24611, fredRepo}. A related potential disadvantage is that FRED does not appear to be under active development \cite{pittir24611, fredRepo}.
% A corpora of papers across three domains including OSS frameworks that are used by papers, where, validated by humans: 1) at least 1 or more OSS frameworks have been utilized, 2) sufficient equations, discussion of model parameters, etc. have been identified for use in extraction activities in later tasks.
% justify choice of libraries and languages; offer analysis of the frequency with which each is used, and the domains that are covered, and whether or not these domains are disjoint. to what extent is there cross-over potential; where do we see these libraries in the future, what features make them more/less likely to gain ground (language they're implemented in; use by people from certain fields; GUI/API tools for non-developers, etc.)
% potentially a trade-off re: useability versus maturity
%Initial survey of papers/libraries we think we will use.
\subsection{Evaluation}
The packages outlined in the preceding section are all open-source, and written in Turing-complete programming languages; thus, we believe any subset of them would satisfy the open-source and complexity requirements for artifact selection outlined in the solicitation. As such, the primary dimensions along which we have evaluated and compared our initial set of packages include: (1) the frequency with which a given package has been cited in published papers that include links or references to their code; (2) the potential trend of increasing adoption/citation over the near-to-medium term; (3) the existence of thorough documentation; and (4) the feasibility of cross-platform and/or cross-domain integration.
% mention complementary vs. substitutable vs. disjoint re: models implemented and/or domains covered
% our objective is to maximize extensibility of our metamodeling package; however, we do need to ensure the domain(s) represented are robust enough to do non-trivial modeling/sim tasks (e.g., we don't want to completely sacrifice depth for the sake of breadth)
%% They put criteria in BAA for use to evaluate on
% our evaluation of the packages and/or our criteria; we don't want to make a hard decision until talking to the PM
% CH: from the quote below, these appear to be the most directly relevant to evaluating packages 'The reference scientific model should be (1) open-source, (2) widely used in the specified scientific domain, and (3) sufficiently complex (i.e., have variable declarations, type definitions, loops, sections of well documented code as well as undocumented code, etc.).'
% sections of well documented code as well as undocumented code, etc.
% ^ is a weird requirement
% \begin{quote}
% In addition to the technical details of the approach, proposals to TA1 must specify the scientific domain, and explicitly state the reference scientific models (or journal articles) that will be used to generate semantic representations. The reference scientific model should be open-source, widely used in the specified scientific domain, and sufficiently complex (i.e., have variable declarations, type definitions, loops, sections of well documented code as well as undocumented code, etc.). Approaches should directly address model sensitivity and uncertainty quantification. Approaches that are agnostic to the programming language and enable multi-modal model execution are preferred.
% \end{quote}
%\subsection{Options}
With respect to the selection of specific papers and associated knowledge artifacts, our intent at this point in the knowledge discovery process is to prioritize the packages outlined above based on their relative maturity, and to conduct additional, augmenting bibliometric exploration in the months ahead. Our view is that EMOD, EpiModels, NetLogo, and FRED can be considered established packages, given their relative maturity and the availability of published papers citing them. Pathogen and NDLib can be considered newer packages: they currently lack citations, but have several positive features that bode well for an uptick in use and associated citation in the near to medium term. It is worth noting that while the established packages provide a larger corpus of work from which to select a set of knowledge artifacts, the newer packages are more modern, and as such, we expect them to be easier to integrate into the type of data science/meta-modeling pipelines we will develop. Additionally, we note that should the absence of published works prove to be an obstacle for a package we ultimately decide to support via integration into our framework, we are able to generate feasible examples by writing them ourselves.
For purposes of development and testing, we will need to use simple or contrived models that are presented in a homogeneous framework. Pedagogical textbooks~\cite{voit_first_2012} and lecture notes\footnote{\url{http://alun.math.ncsu.edu/wp-content/uploads/sites/2/2017/01/epidemic_notes.pdf}} will be a resource for these simple models that are well characterized.
\section{Information Extraction}
In order to construct the knowledge graph that we will traverse to generate metamodel directed acyclic graphs (DAGs), we will begin by defining a set of knowledge artifacts and implementing (in both code and process/system design) an iterative, expert-in-the-loop knowledge extraction pipeline. The term ``knowledge artifacts'' is intended to refer to the set of open-source software packages (e.g., their code-bases), as well as a curated subset of published papers in the scientific domains of epidemiology and/or information diffusion that cite one or more of these packages and make their own code and/or data (if relevant) freely available. Our approach to the selection of packages and papers has been outlined in the preceding section, and is understood to be both iterative and flexible to the incorporation of additional criteria/constraints, and/or the inclusion/exclusion of (additional) works as the knowledge discovery process proceeds.
Given a set of knowledge artifacts, we plan to proceed with information extraction as follows. First, we will leverage an expert-system-based approach to derive rules that automatically recognize and extract relevant phenomena; see Table~\ref{table:info_extract} for details. The rules will be built using the outputs of language parsers and applied to specific areas of source code that meet other heuristic criteria, e.g., length or association with other functions/methods. Next, we will also experiment with supervised approaches (mentioned in our proposal) and utilize information from static code analysis tools, programming language parsers, and lexical and orthographic features of the source code and documentation. For example, variables that are calculated as a result of running a for loop within code, and whose characters, lexically speaking, occur within source code documentation and/or associated research publications, are likely related to specific models being proposed or extended in publications.
We will also perform natural language parsing~\cite{manning} on the research papers themselves to provide cues for how we perform information extraction on associated scripts that reference known libraries. A research paper must reference a library that our system is able to reason about and extend models from; if no known library is identified, the system will not attempt to engage in further pipeline steps. For example, a paper that leverages the EpiModels library will contain references to the library itself and, in some cases, to a particular family of models, e.g., ``Stochastic Individual Contact Models''. The paper will likely not mention the actual library functions/methods that were used, but will reference particular circumstances related to using a particular model, such as the model parameters that were the focus of the paper's investigation. These kinds of information will be used in supervised learning to build the case for different kinds of extractions. In order to do supervised learning, we will develop ground truth annotations to train models with. To gain a better sense of the kinds of knowledge artifacts we will be working with, below we present an example paper from which a metamodel can be built and from which information can be extracted to help in the creation of that metamodel.
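To make the rule-based portion of this pipeline concrete, the following sketch shows the kind of heuristic we have in mind: scanning a script for calls to functions exported by known libraries and declining to proceed when none are found. The library registry and function patterns shown here are illustrative placeholders, not a committed design.
\begin{verbatim}
# Illustrative heuristic (Julia): detect calls to known modeling libraries.
# The library/function patterns below are placeholders, not a committed registry.
const KNOWN_LIBRARIES = Dict(
    "EpiModel" => [r"netest\(", r"netsim\(", r"param\.net\("],
    "NDlib"    => [r"SIRModel\(", r"CompositeModel\("],
)

function detect_libraries(script_text::AbstractString)
    hits = Dict{String,Int}()
    for (lib, patterns) in KNOWN_LIBRARIES
        n = sum(length(collect(eachmatch(p, script_text))) for p in patterns)
        n > 0 && (hits[lib] = n)
    end
    return hits   # an empty result means: do not engage further pipeline steps
end
\end{verbatim}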
\subsection{EpiModels Example}
In \cite{doi:10.1111/oik.04527} the authors utilize the EpiModels library and provide scripts for running the experiments they describe. We believe this is an example of the kind of material we will be able to perform useful information extractions on to inform the development of metamodels. Figure \ref{fig:covar_paper1} is an example of script code from \cite{doi:10.1111/oik.04527}:
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{covar_fig1.jpg}
\caption{Example script excerpt associated with \cite{doi:10.1111/oik.04527} setting parameters for use in an ERGM model implemented by EpiModels library.}
\label{fig:covar_paper1}
\end{figure}
Table \ref{table:info_extract} is a non-exhaustive list of the kinds of information extractions we are currently planning and the purposes they serve in supporting later steps:
%\vspace{1cm}
\begin{table}[htbp]
\centering
\begin{tabular}{ p{3.5cm} p{7cm} p{3cm} }
\toprule
Extraction Type & Description & Sources\\
\midrule
Code References & Creation and selection of metamodels to extend or utilize depending on user goals & Papers, Scripts\\
Model Parameters & Natural language variable names, function parameters & Papers, Scripts\\
Function Names & Names of library model functions used to run experiments in papers & Scripts\\
Library Names & Include statements to use specific libraries. Identification of libraries & Scripts \\
\bottomrule
\end{tabular}
\caption{Planned information extractions. A non-exhaustive list of information extractions, their purposes, and sources.}
\label{table:info_extract}
\end{table}
The information extractions we produce here will be included as annotations in the knowledge representations we describe next.
\section{Knowledge Representation}
To reduce the dimensionality and complexity of the knowledge we represent, we will begin by passing the code associated with each knowledge artifact through static analysis tools. Static analysis tools include linters intended to help programmers debug their code and correct syntax, stylistic, and/or security-related errors. As the knowledge artifacts in our set are derived from already published works, we do not anticipate syntax errors. Rather, our objective is to use the abstract syntax trees (ASTs), call graphs, control flow graphs, and/or dependency graphs that are produced during static analysis to extract both discrete model instantiation(s) (along with arguments, which can be mapped back to parameters that may have associated metadata, including required object type and units) and sequential function call information.
The former can be thought of as contributing a connected subgraph to the knowledge graph, such that $G_i \subseteq G$, in which model classes and variable data/unit types are represented as vertices and connected by directed ``requires/accepts'' edges. The latter can be thought of as contributing information about the mathematical and/or domain-specific legality and/or frequency with which a given subset of model types can be sequentially linked; this information can be used to weight edges connecting model nodes in the knowledge graph.
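The following is a minimal sketch of what such a contributed subgraph could look like in code; we use plain Julia structures here purely for illustration, and the eventual choice of graph library remains open.
\begin{verbatim}
# Sketch: typed vertices and "requires/accepts" edges of a knowledge subgraph.
struct KGVertex
    id::Symbol
    kind::Symbol               # :Model, :Variable, :Unit, :Paper, ...
    metadata::Dict{Symbol,Any}
end

struct KGEdge
    src::Symbol
    dst::Symbol
    relation::Symbol           # :requires, :accepts, :cites, ...
    weight::Float64            # e.g., frequency of sequential chaining
end

vertices = [KGVertex(:SIRModel, :Model, Dict(:library => "EpiModels")),
            KGVertex(:population, :Variable, Dict(:unit => "person"))]
edges    = [KGEdge(:SIRModel, :population, :requires, 1.0)]
\end{verbatim}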
% frequency/co-ocurrence patterns w/respect to (chained) and/or recursive function calls can be used to determine edge weights \in G;
The knowledge graph approach will help identify relevant pieces of context. For example, the domain of a scientific paper or codebase will be necessary for correct resolution of scientific terms that are used to refer to multiple phenomena in different contexts. In a paper about biological cell signalling pathways, the term ``death'' is likely to refer to the death of individual cells, while in a paper about disease prevalence in at-risk populations, the same term is likely referring to the death of individual people. This is further complicated by figurative language in the expository parts of a paper, where ``death'' might be used as a metaphor when a cultural behavior or meme ``dies out'' because people stop spreading the behavior to their social contacts.
% Here we discuss techniques for information extraction.
% pass code through language-specific linter;
% parse the resulting AST;
% general expectation is that useful/unitful parameters will be at the base of the stack represented by a series of function calls;
% use scope to help determine where to focus our efforts AND to understand sequential dependencies between function calls
% as well as between functions, variables (data types), and units.
% Could potentially train a model here to probablistically predict the next function call given a sequence of previous function calls.
% get text from pdf (section headers) -> dependency parse tree -> {subject, verb, obj} tuples
% variables, model definitions, units, domain research question, etc.
% human in the loop for review of these dependencies
% extract as much as possible automatically and then pass it to a human to validate/fill in gaps, etc.
\subsection{Schema Design}
We will represent the information extracted from the artifacts using a knowledge graph. While knowledge graphs are very flexible in how they represent data, it helps to have a schema describing the vertex and edge types, along with the metadata that will be stored on the vertices and edges.
In our initial approach, the number of libraries that models can be implemented with will be small enough that schema design can be done by hand. We expect that this schema will evolve as features are added to the system, but remain mostly stable as new libraries, models, papers, and artifacts are added.
When a new paper or codebase comes in, we will automatically extract edges and vertices with code that represents them in the predefined schema.
Many of the connections will be from artifacts to their components, which will connect to concepts. When papers are connected to other papers, they are connected indirectly (e.g., via other vertices), except for edges that represent citations directly between papers.
\begin{figure}
\centering
\includegraphics[width=\textwidth]{schema.pdf}
\caption{An example of the knowledge graph illustrating the nature of the schema.}
\label{fig:schema}
\end{figure}
It is an open question for this research whether the knowledge graph should contain models with the parameters bound to values, or the general concept of a model with parameters available for instantiation. Our initial approach will be to model both the general concept of a model such as \texttt{HookesLawModel} along with the specific instantiation \texttt{HookesLawModel\{k=5.3\}} from specific artifacts.
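A minimal sketch of how both levels could coexist in Julia follows; the type and field names are illustrative rather than a settled design.
\begin{verbatim}
# Sketch: the general concept of a model versus a parameter-bound instantiation.
abstract type ModelConcept end

struct HookesLawModel <: ModelConcept end          # concept vertex in the graph

struct BoundModel{M<:ModelConcept}
    concept::M
    params::NamedTuple          # e.g., (k = 5.3,)
    artifact::String            # which artifact the binding was extracted from
end

instance = BoundModel(HookesLawModel(), (k = 5.3,), "example-artifact")
\end{verbatim}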
\subsection{Data Sets in the Knowledge Graph}
A major signal that two papers refer to the same physical phenomenon is that they use the same data sets. These common datasets, which become benchmarks leveraged widely in the research community, are concentrated in a small number of widely cited papers. This is good for our task because if two papers use the same dataset, we know they are talking about the same phenomenon.
The most direct way to identify dataset overlap is to go through the papers that provide the canonical source for each dataset. But we can also think of similarity of datasets in terms of the schema(s) of the datasets. This requires a more detailed dataset representation than just the column names commonly found in CSV files. Google's Dataset Search has done a lot of the work necessary for annotating the semantics of dataset features.
The \jlpkg{DataDeps} system includes programmatic ways to access this information for many of the common open science data access protocols.\footnote{\url{http://white.ucc.asn.au/DataDeps.jl/latest/z20-for-pkg-devs.html\#Registering-a-DataDep-1}}
By linking dataset feature (column) names to knowledge graph concepts, we will be able to compare datasets for similarity and conceptual overlap. The fact that two models are connected to the same dataset or concept is an important indicator that the two models are compatible or interchangeable.
\subsection{Schema.org}
Schema.org is one of the largest and most diverse knowledge graph systems, yet it includes virtually no coverage of scientific concepts: there are no schema.org nodes for Variable, Function, or Equation. The most relevant schema.org concepts are given in the following list.
\begin{itemize}
\tightlist
\item \schemaorg{ScholarlyArticle}
\item \schemaorg{SoftwareSourceCode}
\item \schemaorg{ComputerLanguage}
\item \schemaorg{variableMeasured}
\item \metaschemaorg{Property}
\item \schemaorg{DataFeedItem}
\item \schemaorg{Quantity}, which has more specific types:
\begin{itemize}
\item \schemaorg{Distance}
\item \schemaorg{Duration}
\item \schemaorg{Energy}
\item \schemaorg{Mass}
\end{itemize}
\end{itemize}
The focus of schema.org is driven by its adoption in the web document community. Schema.org concepts are used for tagging documents in order for search engines or automated information extraction systems to find structured information in documents. Often it is catalogue or indexing sites that use schema.org concepts to describe the items or documents in their collections.
The lack of coverage for scientific concepts is surprising, given that academic research on publication mining tends to focus on the researchers' own fields; for example, papers about mining bibliographic databases often use databases of database researchers as their examples.
You could model the relationships between papers using this schema.org schema, but that takes place at the bibliometric level instead of the model semantics level. There are no entries for expressing that two papers solve the same equation or model the same physical phenomena. Of course schema.org is organized so that everything can be expressed as a \schemaorg{Thing}, but there is no explicit representation for these concepts. There is a schema.org schema for health and life science, \url{https://health-lifesci.schema.org/}. As we define the schema of our knowledge graph, we will link to the schema.org concepts as much as possible and could add an extension to schema.org in order to represent scientific concepts.
\section{Model Representation and Execution}
Representation of models occurs at four levels:
\begin{itemize}\tightlist
\item \textbf{Executable}: the level of machine or byte-code instructions
\item \textbf{Lexical}: the traditional code representation of assignments, functions, and loops
\item \textbf{Semantic}: a declarative language or computation graph representation with nodes linked to the knowledge graph
\item \textbf{Human}: a description in natural language as in a research paper or textbook
\end{itemize}
The current method of scientific knowledge extraction is to take a Human level description and have a graduate student build a Lexical level description by reading papers and implementing new codes. We aim to introduce the Semantic level which is normally stored only in the brains of human scientists, but must be explicitly represented in machines in order to automate scientific knowledge extraction. A scientific model represented at the Semantic level will be easy to modify computationally and be describable for the automatic description generation component. The Semantic level representation of a model is a computation DAG. One possible description is to represent the DAG in a human-friendly way, such as in Figure~\ref{fig:flu}.
\begin{figure}[hbtp]
\centering
\includegraphics[width=\textwidth]{flu_pipeline.pdf}
\caption{An example pipeline and knowledge graph elements for a flu response model.}
\label{fig:flu}
\end{figure}
\subsection{Scientific Workflows (Pipelines)}
Our approach will need to be differentiated from scientific workflow managers that are based on conditional evaluation tools like Make. Some examples include \href{https://swcarpentry.github.io/make-novice/}{Make for scientists}, \href{http://scipipe.org/}{Scipipe}, and \href{https://galaxyproject.org/}{the Galaxy project}. These scientific workflows focus on representing the relationships between intermediate data products without getting into the model semantics. While scientific workflow managers are a useful tool for organizing the work of a scientist, they do not have a particularly detailed representation of the modeling tasks. Workflow tools generally accept the UNIX wisdom that text is the universal interface and communicate between programs using files on disk or in memory pipes, sockets, or channels that contain lines of text.
Our approach will track a higher fidelity representation of the model semantics in order to enable computational reasoning over the viability of combined models. Ideas from static analysis of computer programs will enable better verification of metamodels before we run them.
\subsection{Metamodels as Computation Graphs}
Our position is that if you have a task currently solved with a general purpose programming language, you cannot replace that solution with anything less powerful than a general purpose programming language. The set of scientific modeling codes is just too diverse, with each part a custom solution, to be solved with a limited scope solution like a declarative model specification. Thus we embed our solution into the general purpose programming language Julia.
We use high-level programming techniques such as abstract types and multiple dispatch in order to create a hierarchical structure that represents a model composed of sub-models. These hierarchies can lead to static or dynamic DAGs of models. Every system that relies on building an execution graph and then executing it finds the need for dynamically generated DAGs at some point: for sufficiently complicated systems, the designer does not know the set of nodes and dependencies until execution has started. Examples include recursive usage of the make build tool, which led to tools such as \texttt{cmake}, \texttt{Luigi}, and \texttt{Airflow}, and deep learning frameworks, which have both static and dynamic computation graph implementations, for example TensorFlow and PyTorch. There is a tradeoff between the static analysis that helps optimize and validate static representations and the ease of use of dynamic representations. We will explore this tradeoff as we implement the system.
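The sketch below illustrates the intended use of abstract types and multiple dispatch; the concrete stage types and the \texttt{execute} function are illustrative stand-ins rather than our settled API.
\begin{verbatim}
# Sketch: abstract types and multiple dispatch for a hierarchy of models
# whose composition forms a (here static) DAG. Names are illustrative.
abstract type AbstractModel end

struct OdeStage <: AbstractModel
    rhs::Function
    u0::Vector{Float64}
end

struct RegressionStage <: AbstractModel
    formula::String
end

struct Pipeline <: AbstractModel
    stages::Vector{AbstractModel}      # children of this node in the model DAG
end

execute(m::OdeStage)        = "integrate the ODE"        # placeholder behaviours
execute(m::RegressionStage) = "fit the regression"
execute(m::Pipeline)        = [execute(s) for s in m.stages]

flu = Pipeline([OdeStage((du, u, p, t) -> du .= -0.1 .* u, [0.99, 0.01]),
                RegressionStage("vaccines_produced ~ flu_patients")])
execute(flu)
\end{verbatim}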
For a thorough example of how to use our library to build a metamodel, see the notebook \texttt{FluExample.ipynb}. This example uses the Julia type system to build a model DAG that represents all of the component models in a machine-readable form. This DAG is represented in Figure~\ref{fig:flu}. Code snippets and rendered plots appear in the notebook.
%\input{flumodel.tex}
\subsection{Metamodel Constraints}
When assembling a metamodel, it is important to eliminate possible combinations of models that are scientifically or logically invalid. One type of constraint is provided by units and dimensional analysis. Our flu example pipeline uses \href{https://github.com/ajkeller34/Unitful.jl}{Unitful.jl} to represent the quantities in the models, including $C,s,d,person$ for Celsius, second, day, and person. While $C,s,d$ are SI-defined units that come with Unitful.jl, person is a user-defined unit that was created for this model. These unit constraints enable a dynamic analysis tool (the Julia runtime system) to invalidate combinations of models that fail to use unitful numbers correctly, i.e., in accordance with the rules of dimensional analysis taught in high school chemistry and physics. In order to make rigorous combinations of models, more information will need to be captured about the component models. It is necessary but not sufficient for a metamodel to be dimensionally consistent. We will investigate the additional constraints necessary to check metamodels for correctness.
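A minimal sketch of the kind of run-time check we rely on is shown below, using only units bundled with \jlpkg{Unitful}; the user-defined \texttt{person} unit from the flu pipeline is registered separately and omitted here for brevity.
\begin{verbatim}
# Sketch: dimensional analysis as a dynamic constraint on model composition.
using Unitful

rate     = 0.1u"d^-1"      # recovery rate, per day
duration = 30.0u"d"        # length of a simulation step, in days
rate * duration            # dimensionless: a legal combination

temperature = 310.0u"K"
# rate + temperature       # throws a DimensionError: an invalid coupling of models
\end{verbatim}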
\subsection{Metamodel Transformations}
Metamodel transformations describe high-level operations the system will perform based on the user's request and the information available to it, in conjunction with a particular set of open-source libraries. Examples include:
\begin{enumerate}\tightlist
\item utilizing an existing metamodel and modifying its parameters;
\item modifying the functional form of a model, such as adding terms to an equation;
\item changing the structure of the metamodel by modifying the structure of the computation graph;
\item introducing new nodes to the model.\footnote{New model nodes must first be ingested into the system in order to be made available to users.}
\end{enumerate}
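As a sketch of the first and simplest transformation, the snippet below reuses an existing metamodel with modified parameters; the types are placeholders for the real metamodel representation.
\begin{verbatim}
# Sketch: transformation (1), parameter modification on an existing metamodel.
struct SIRParams
    beta::Float64     # transmission rate
    gamma::Float64    # recovery rate
end

struct MetaModel
    params::SIRParams
    graph::Vector{Symbol}     # placeholder for the computation graph
end

# Return a copy of `m` with selected parameters replaced.
with_params(m::MetaModel; beta = m.params.beta, gamma = m.params.gamma) =
    MetaModel(SIRParams(beta, gamma), m.graph)

base    = MetaModel(SIRParams(0.30, 0.10), [:ode, :regression])
variant = with_params(base; beta = 0.45)   # only the parameters change
\end{verbatim}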
%% TODO: we need to have instantiated models and model classes
\subsection{Types}
This project leverages the Julia type system and code generation toolchain extensively.
Many Julia libraries define an abstract interface for representing the problems they can solve, for example: \begin{itemize}
\item DifferentialEquations.jl \url{https://github.com/JuliaDiffEq/DiffEqBase.jl} defines \texttt{DiscreteProblem}, \texttt{ODEProblem}, \texttt{SDEProblem}, \texttt{DAEProblem} which represent different kinds of differential equation models that can be used to represent physical phenomena. Higher level concepts such as a \texttt{MonteCarloProblem} can be composed of subproblems in order to represent more complex computations. For example a \texttt{MonteCarloProblem} can be used to represent situations where the parameters or initial conditions of an \texttt{ODEProblem} are random variables, and a scientist aims to interrogate the distribution of the solution to the ODE over that distribution of input.
\item MathematicalSystems.jl \url{https://juliareach.github.io/MathematicalSystems.jl/latest/lib/types.html} defines an interface for dynamical systems and controls such as \texttt{LinearControlContinuousSystem} and \texttt{ConstrainedPolynomialContinuousSystem} which can be used to represent Dynamical Systems including hybrid systems which combine discrete and continuous phenomena. Hybrid systems are of particular interest to scientists examining complex phenomena at the interface of human designed systems and natural phenomena.
\item Another dynamical systems library is \url{https://juliadynamics.github.io/DynamicalSystems.jl/}, which takes a timeseries and physics approach to dynamical systems, as compared to the engineering and controls approach taken in MathematicalSystems.jl.
\item MADs \url{http://madsjulia.github.io/Mads.jl/} offers a modeling framework that supports many of the model analysis and decision support tasks that will need to be performed on metamodels that we create.
\end{itemize}
Each of these libraries will need to be integrated into the system by understanding the types that are used to represent problems and developing constraints for how to create hierarchies of problems that fit together. We think that the number of libraries that the system understands will be small enough that the developers can do a small amount of work per library to integrate it into the system, but that the number of papers will be too large for manual tasks per paper.
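As an illustration of the first interface listed above, an SIR compartmental model can be expressed directly as an \texttt{ODEProblem}; this is a sketch assuming \jlpkg{DifferentialEquations} is available, with arbitrary parameter values.
\begin{verbatim}
# Sketch: an SIR compartmental model as a DifferentialEquations.jl ODEProblem.
using DifferentialEquations

function sir!(du, u, p, t)
    S, I, R = u
    beta, gamma = p
    du[1] = -beta * S * I
    du[2] =  beta * S * I - gamma * I
    du[3] =  gamma * I
end

u0   = [0.99, 0.01, 0.0]                  # initial S, I, R fractions
p    = (0.3, 0.1)                         # beta, gamma
prob = ODEProblem(sir!, u0, (0.0, 100.0), p)
sol  = solve(prob)                        # solver selected automatically
\end{verbatim}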
When a new paper or code snippet is ingested by the system, we may need to generate new leaf types for that paper automatically.
\subsection{User Interface}
%% TODO: add figure of how to use it.
Our system is intended for expert scientists who want to reduce their time spent writing code and plumbing models together. As input, it takes a set of things known or measured by the scientist and a set of variables or quantities of interest that are unknown. The output is a program that calculates the unknowns as a function of the known input(s) provided by the user, potentially with holes that require expert knowledge to fill in.
\subsection{Generating New Models}
% Mentioned that the meta-models are indeed being built, partially, using information extracted earlier
We will use metaprogramming to build a library that takes as input data structures representing models, derived partially from information previously extracted from research publications and associated scripts, transforms and combines them into new models, and then generates executable code based on these new, potentially larger models.
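The sketch below shows the flavor of this code generation using Julia metaprogramming; the intermediate representation in \texttt{spec} is hypothetical and stands in for the richer model data structures described above.
\begin{verbatim}
# Sketch: generating an executable method from a data structure describing a model.
spec = (name = :flu_growth, args = (:u, :r), body = :(r * u))

# Splice the name, argument list, and body into a function definition.
@eval $(spec.name)($(spec.args...)) = $(spec.body)

flu_growth(100.0, 0.2)    # => 20.0
\end{verbatim}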
One foreseeable technical risk is that the Julia compiler and type inference mechanism could be overwhelmed by the number of methods and types that our system defines. In a fully static language like C++ the number of types defined in a program is fixed at compile time and the compile burden is paid once for many executions of the program. In a fully dynamic language like Python, there is no compilation time and the cost of type checking is paid at run time. However, in Julia, there is both compile time analysis and run time type computations.
In Julia, changing the argument types to a function causes a round of LLVM compilation for the new method of that function. When using Unitful numbers in calculations, changes to the units of the numbers create new types and thus additional compile time overhead. This overhead is necessary to provide unitful numbers that are no slower for calculations than bare bitstypes provided by the processor. As we push more information into the type system, this tradeoff of additional compiler overhead will need to be managed.
\section{Validation}
There are many steps to this process, and each step requires a different approach to validating the system.
\begin{itemize}
\item \emph{Extraction of knowledge elements from artifacts}: we will need to assess the accuracy of knowledge elements extracted from text, code, and documentation to ensure that the knowledge graph is correct. This will require some manual annotation of data from artifacts and quality measures such as precision and recall. The precision is the fraction of edges in the knowledge graph that are correct, and the recall is the fraction of correct edges that were recovered by the information extraction approach.
\item \emph{Metamodel construction}: once we have a knowledge graph, we will need to ensure that combinations of metamodels are valid and optimal. We will aim to produce the simplest metamodel that relates the queried concepts; this will be measured in terms of the number of metamodel nodes, the number of metamodel dependency connections, and the number of adjustment or transformation functions. We will design test cases that increase in complexity from pipelines with no need to transform variables, to pipelines with variable transformations, to directed acyclic graphs (DAGs).
\item \emph{Model Accuracy}: as the metamodels are combinations of models that are imperfect, there will be compounding error within the metamodel. We will need to validate that our metamodel execution engine does not add error unnecessarily. This will involve numerical accuracy related to finite precision arithmetic, as well as statistical accuracy related to the ability to learn parameters from data. Additionally, since we are by necessity doing some amount of domain adaptation when reusing models, we will need to quantify the domain adaptation error generated by applying a model developed for one context in a different context. These components of errors can be thought of as compounding loss in a signal processing system where each component of the design introduces loss with a different response to the input.
\end{itemize}
Our view is to analogize the metamodel construction error and the model accuracy to the error and residual in numerical solvers.
For a given root-finding problem, such as solving $f(x)=0$ for $x$, the most common way to measure the quality of the solution is to measure both the error and the residual.
The error is defined as $|x-x^\star|$, the difference from the correct solution in the domain of $x$, and the residual is $|f(x) - f(x^\star)|$, the difference from the correct solution in the codomain.
We will frame our validation in terms of error and residual, where the error is how close we came to the best metamodel and the residual is the difference between the observed and predicted phenomena.
These techniques need to produce simple, explainable models of physical phenomena that are easy for scientists to generate, probe, and understand, while remaining the best possible models of the phenomena under investigation.
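For the root-finding analogy, the two quantities can be computed as follows (a toy example with $f(x) = x^2 - 2$):
\begin{verbatim}
# Toy example: error and residual for f(x) = x^2 - 2 with root x* = sqrt(2).
f(x)  = x^2 - 2.0
xstar = sqrt(2.0)            # true solution
x     = 1.4142               # candidate produced by some solver

err = abs(x - xstar)         # error: distance in the domain of x
res = abs(f(x) - f(xstar))   # residual: distance in the codomain (f(xstar) == 0)
\end{verbatim}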
% validation of model components AND of edges connecting different models; unit checking
% validation of meta-model X:F(X) pairings
% validation of actual F(X) results compared to empirically observed outcomes
% error at diff levels of the stack (e.g., numerical precision; statistical error; misapplication of model intended for one domain to another, etc.) -> how does error get propagated as metamodel is constructed?
% For metamodels we need to choose pairs of inputs and outputs to run the metamodeling planner on.
% take existing pairs of papers that build on each other to get sets of inputs and outputs.
% Given two related papers that build on each other, can we replicate their results with less effort?
% parameter modification vs. chaining of models from discrete domains
\section{Next Steps}
Our intended path forward following the review of this report is as follows:
\begin{enumerate}
\item Incorporation of feedback received from DARPA PM, including information related to: the types of papers we consider to be in scope (e.g., those with and without source code); domain coverage and desired extensibility; expressed preference for inclusion/exclusion of particular package(s) and/or knowledge artifact(s).
\item Construction of a proof-of-concept version of our knowledge graph and end-to-end pipeline, in which we begin with a motivating example and supporting documents (e.g., natural language descriptions of the research questions and mathematical relationships modeled; source code), show how these documents can be used to construct a knowledge graph, and show how traversal of this knowledge graph can approximately reproduce a hand-written Julia meta-modeling pipeline. The flu example outlined above is our intended motivating example, although we are open to tailoring this motivating example to domain(s) and/or research questions that are of interest to DARPA.
\item A feature of the system not yet explored is automatic transformation of models at the Semantic Level. These transformations will be developed in accordance with interface expectations from downstream consumers including the TA2 performers.
\end{enumerate}
Executing on this proof-of-concept deliverable will allow us to experience the iterative development and research life-cycle that end-users of our system will ultimately participate in. We anticipate that this process will help us to identify gaps in our knowledge and framing of the problem at hand, and/or shortcomings in our methodological approach that we can enhance through the inclusion of curated domain-expert knowledge (e.g., to supplement the lexical nodes and edges we are able to extract from source code). In addition, we expect the differences between our hand-produced meta-model and our system-produced meta-model to be informative and interpretable as feedback which can help us to improve the system architecture and associated user experience.
% I think this point is worth making but not sure it goes in this section as currently slated, and not sure where else to put it right now.
It is also worth noting that over the medium term, we anticipate that holes in the knowledge graph (e.g., missing vertices and/or edges; missing conversion steps to go from one unit of analysis to another, etc.) may help us to highlight areas where either additional research and/or expert human input is needed.
%Develop set of transformations on semantic models.
\printbibliography
\end{document}
% papers we've found w/good open source docs; how we can use them to create metamodels / to build a knowledge graph and associated schema
% what are atomic pieces of the model and how do we define them; how do they relate to one another/fit together (e.g., models, agents, variables, parameters, equations, etc.)

215
doc/refs.bib

@@ -0,0 +1,215 @@
@InProceedings{manning,
author = {Manning, Christopher D. and Surdeanu, Mihai and Bauer, John and Finkel, Jenny and Bethard, Steven J. and McClosky, David},
title = {The {Stanford} {CoreNLP} Natural Language Processing Toolkit},
booktitle = {Association for Computational Linguistics (ACL) System Demonstrations},
year = {2014},
pages = {55--60},
url = {http://www.aclweb.org/anthology/P/P14/P14-5010}
}
@article{doi:10.1111/oik.04527,
author = {White, Lauren A. and Forester, James D. and Craft, Meggan E.},
title = {Covariation between the physiological and behavioral components of pathogen transmission: host heterogeneity determines epidemic outcomes},
journal = {Oikos},
volume = {127},
number = {4},
pages = {538-552},
doi = {10.1111/oik.04527},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/oik.04527},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/oik.04527},
abstract = {Although heterogeneity in contact rate, physiology, and behavioral response to infection have all been empirically demonstrated in host–pathogen systems, little is known about how interactions between individual variation in behavior and physiology scale-up to affect pathogen transmission at a population level. The objective of this study is to evaluate how covariation between the behavioral and physiological components of transmission might affect epidemic outcomes in host populations. We tested the consequences of contact rate covarying with susceptibility, infectiousness, and infection status using an individual-based, dynamic network model where individuals initiate and terminate contacts with conspecifics based on their behavioral predispositions and their infection status. Our results suggest that both heterogeneity in physiology and subsequent covariation of physiology with contact rate could powerfully influence epidemic dynamics. Overall, we found that 1) individual variability in susceptibility and infectiousness can reduce the expected maximum prevalence and increase epidemic variability; 2) when contact rate and susceptibility or infectiousness negatively covary, it takes substantially longer for epidemics to spread throughout the population, and rates of epidemic spread remained suppressed even for highly transmissible pathogens; and 3) reductions in contact rate resulting from infection-induced behavioral changes can prevent the pathogen from reaching most of the population. These effects were strongest for theoretical pathogens with lower transmissibility and for populations where the observed variation in contact rate was higher, suggesting that such heterogeneity may be most important for less infectious, more chronic diseases in wildlife. Understanding when and how variability in pathogen transmission should be modelled is a crucial next step for disease ecology.}
}
@article{NDlib1,
author = {Giulio Rossetti and
Letizia Milli and
Salvatore Rinzivillo and
Alina S{\^{\i}}rbu and
Fosca Giannotti and
Dino Pedreschi},
title = {NDlib: a Python Library to Model and Analyze Diffusion Processes Over
Complex Networks},
journal = {CoRR},
volume = {abs/1801.05854},
year = {2018},
url = {http://arxiv.org/abs/1801.05854},
archivePrefix = {arXiv},
eprint = {1801.05854},
timestamp = {Mon, 13 Aug 2018 16:46:00 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/abs-1801-05854},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@conference{NDlib2,
title = {NDlib: Studying Network Diffusion Dynamics},
booktitle = {IEEE International Conference on Data Science and Advanced Analytics, DSA},
year = {2017},
address = {Tokyo},
abstract = {Nowadays the analysis of diffusive phenomena occurring on top of complex networks represents a hot topic in the Social Network Analysis playground. In order to support students, teachers, developers and researchers in this work we introduce a novel simulation framework, ND LIB . ND LIB is designed to be a multi-level ecosystem that can be fruitfully used by different user segments. Upon the diffusion library, we designed a simulation server that allows remote execution of experiments and an online visualization tool that abstract the programmatic interface and makes available the simulation platform to non-technicians.},
author = {Giulio Rossetti and Letizia Milli and Salvatore Rinzivillo and Alina Sirbu and Dino Pedreschi and Fosca Giannotti}
}
@article{JSSv084i08,
author = {Samuel Jenness and Steven Goodreau and Martina Morris},
title = {EpiModel: An R Package for Mathematical Modeling of Infectious Disease over Networks},
journal = {Journal of Statistical Software, Articles},
volume = {84},
number = {8},
year = {2018},
keywords = {mathematical model; infectious disease; epidemiology; networks; R},
abstract = {Package EpiModel provides tools for building, simulating, and analyzing mathematical models for the population dynamics of infectious disease transmission in R. Several classes of models are included, but the unique contribution of this software package is a general stochastic framework for modeling the spread of epidemics on networks. EpiModel integrates recent advances in statistical methods for network analysis (temporal exponential random graph models) that allow the epidemic modeling to be grounded in empirical data on contacts that can spread infection. This article provides an overview of both the modeling tools built into EpiModel, designed to facilitate learning for students new to modeling, and the application programming interface for extending package EpiModel, designed to facilitate the exploration of novel research questions for advanced modelers.},
issn = {1548-7660},
pages = {1--47},
doi = {10.18637/jss.v084.i08},
url = {https://www.jstatsoft.org/v084/i08}
}
@inproceedings{NetworkX,
  author = {Aric A. Hagberg and Daniel A. Schult and Pieter J. Swart},
  title = {Exploring network structure, dynamics, and function using {NetworkX}},
  booktitle = {Proceedings of the 7th Python in Science Conference (SciPy 2008)},
  year = {2008},
  pages = {11--15}
}
@inproceedings{tisue2004netlogo,
title={Netlogo: A simple environment for modeling complexity},
author={Tisue, Seth and Wilensky, Uri},
booktitle={International conference on complex systems},
volume={21},
pages={16--21},
year={2004},
organization={Boston, MA}
}
@misc{nlweb,
title = {NetLogo User Manual},
howpublished = {\url{https://ccl.northwestern.edu/netlogo/docs/}},
note = {Accessed: 2018-11-27}
}
@misc{ndlibDocs,
  title = {{NDlib 4.0.0 documentation}},
  howpublished = {\url{https://ndlib.readthedocs.io/en/latest/}},
  note = {(Accessed on 11/21/2018)}
}
@misc{emodRepo,
  title = {Institute for Disease Modeling: Source files for building the IDM EMOD disease transmission model},
  howpublished = {\url{https://github.com/InstituteforDiseaseModeling/EMOD}},
  note = {(Accessed on 11/21/2018)}
}
@misc{emodDocs,
  title = {Documentation home: Institute for Disease Modeling},
  howpublished = {\url{http://www.idmod.org/documentation}},
  note = {(Accessed on 11/21/2018)}
}
@misc{pathogenRepo,
  author = {Justin Angevaare},
  title = {{jangevaare/Pathogen.jl}: Simulation and inference utilities for the spread of infectious diseases with Julia 1.0},
  howpublished = {\url{https://github.com/jangevaare/Pathogen.jl}},
  note = {(Accessed on 11/21/2018)}
}
@article{DeepCoder,
author = {Matej Balog and
Alexander L. Gaunt and
Marc Brockschmidt and
Sebastian Nowozin and
Daniel Tarlow},
title = {DeepCoder: Learning to Write Programs},
journal = {CoRR},
volume = {abs/1611.01989},
year = {2016},
url = {http://arxiv.org/abs/1611.01989},
archivePrefix = {arXiv},
eprint = {1611.01989},
timestamp = {Mon, 13 Aug 2018 16:47:48 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/BalogGBNT16},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{cohen_efficient_2003,
title = {Efficient {{Immunization Strategies}} for {{Computer Networks}} and {{Populations}}},
volume = {91},
issn = {0031-9007, 1079-7114},
doi = {10.1103/PhysRevLett.91.247901},
language = {en},
number = {24},
journal = {Physical Review Letters},
author = {Cohen, Reuven and Havlin, Shlomo and {ben-Avraham}, Daniel},
month = dec,
year = {2003},
}
@inproceedings{budak_limiting_2011,
address = {New York, NY, USA},
series = {WWW '11},
title = {Limiting the {{Spread}} of {{Misinformation}} in {{Social Networks}}},
isbn = {978-1-4503-0632-4},
doi = {10.1145/1963405.1963499},
abstract = {In this work, we study the notion of competing campaigns in a social network and address the problem of influence limitation where a "bad" campaign starts propagating from a certain node in the network and use the notion of limiting campaigns to counteract the effect of misinformation. The problem can be summarized as identifying a subset of individuals that need to be convinced to adopt the competing (or "good") campaign so as to minimize the number of people that adopt the "bad" campaign at the end of both propagation processes. We show that this optimization problem is NP-hard and provide approximation guarantees for a greedy solution for various definitions of this problem by proving that they are submodular. We experimentally compare the performance of the greedy method to various heuristics. The experiments reveal that in most cases inexpensive heuristics such as degree centrality compare well with the greedy approach. We also study the influence limitation problem in the presence of missing data where the current states of nodes in the network are only known with a certain probability and show that prediction in this setting is a supermodular problem. We propose a prediction algorithm that is based on generating random spanning trees and evaluate the performance of this approach. The experiments reveal that using the prediction algorithm, we are able to tolerate about 90\% missing data before the performance of the algorithm starts degrading and even with large amounts of missing data the performance degrades only to 75\% of the performance that would be achieved with complete data.},
booktitle = {Proceedings of the 20th {{International Conference}} on {{World Wide Web}}},
publisher = {{ACM}},
author = {Budak, Ceren and Agrawal, Divyakant and El Abbadi, Amr},
year = {2011},
keywords = {social networks,misinformation,competing campaigns,information cascades,submodular functions,supermodular functions},
pages = {665--674},
}
@article{pittir24611,
volume = {13},
number = {1},
month = {10},
title = {FRED (A Framework for Reconstructing Epidemic Dynamics): An open-source software system for modeling infectious diseases and control strategies using census-based populations},
author = {JJ Grefenstette and ST Brown and R Rosenfeld and J Depasse and NTB Stone and PC Cooley and WD Wheaton and A Fyshe and DD Galloway and A Sriram and H Guclu and T Abraham and DS Burke},
year = {2013},
journal = {BMC Public Health},
url = {http://d-scholarship.pitt.edu/24611/},
abstract = {Background: Mathematical and computational models provide valuable tools that help public health planners to evaluate competing health interventions, especially for novel circumstances that cannot be examined through observational or controlled studies, such as pandemic influenza. The spread of diseases like influenza depends on the mixing patterns within the population, and these mixing patterns depend in part on local factors including the spatial distribution and age structure of the population, the distribution of size and composition of households, employment status and commuting patterns of adults, and the size and age structure of schools. Finally, public health planners must take into account the health behavior patterns of the population, patterns that often vary according to socioeconomic factors such as race, household income, and education levels. Results: FRED (a Framework for Reconstructing Epidemic Dynamics) is a freely available open-source agent-based modeling system based closely on models used in previously published studies of pandemic influenza. This version of FRED uses open-access census-based synthetic populations that capture the demographic and geographic heterogeneities of the population, including realistic household, school, and workplace social networks. FRED epidemic models are currently available for every state and county in the United States, and for selected international locations. Conclusions: State and county public health planners can use FRED to explore the effects of possible influenza epidemics in specific geographic regions of interest and to help evaluate the effect of interventions such as vaccination programs and school closure policies. FRED is available under a free open source license in order to contribute to the development of better modeling tools and to encourage open discussion of modeling tools being used to evaluate public health policies. We also welcome participation by other researchers in the further development of FRED. {\copyright} 2013 Grefenstette et al.; licensee BioMed Central Ltd.}
}
@misc{fredRepo,
  author = {University of Pittsburgh Public Health Dynamics Laboratory},
  title = {PublicHealthDynamicsLab/FRED: The FRED Repository},
  howpublished = {\url{https://github.com/PublicHealthDynamicsLab/FRED}},
  note = {(Accessed on 11/30/2018)}
}
@book{voit_first_2012,
edition = {1st},
title = {A {{First Course}} in {{Systems Biology}}},
isbn = {978-0-8153-4467-4},
abstract = {A First Course in Systems Biology is a textbook designed for advanced undergraduate and graduate students. Its main focus is the development of computational models and their applications to diverse biological systems. Because the biological sciences have become so complex that no individual can acquire complete knowledge in any given area of specialization, the education of future systems biologists must instead develop a student's ability to retrieve, reformat, merge, and interpret complex biological information. This book provides the reader with the background and mastery of methods to execute standard systems biology tasks, understand the modern literature, and launch into specialized courses or projects that address biological questions using theoretical and computational means. The format is a combination of instructional text and references to primary literature, complemented by sets of small-scale exercises that enable hands-on experience, and larger-scale, often open-ended questions for further reflection.},
publisher = {{Garland Science}},
author = {Voit, Eberhard},
year = {2012}
}

65
doc/schema.dot

@ -0,0 +1,65 @@
\neatograph[scale=0.45]{schema}{
rankdir = "LR"
node [shape=box]
subgraph cluster_0 {
style=filled;
color=lightgrey;
node [style=filled,color=white];
a0; a2; a3;
label = "Artifacts";
a0[label="Paper 1"];
a2[label="Space.docs.io"];
a3[label="github.com SpaceModels/Space.jl"];
}
subgraph cluster_1 {
node [style=filled];
b0; b1; b2; b3;
label = "Components";
color=blue;
b1[label="height"];
b2[label="area=height*width"];
b3[label="width"];
b4[label="area"];
b1 -> b2 [label="term of"];
b3 -> b2;
subgraph cluster_10{b0[label="space"]; label="Models";
}
subgraph cluster_11{b1; b3; b4; label="Variables"};
subgraph cluster_12{b2; label="Equations"};
}
subgraph cluster_2 {
node [style=filled];
c0; c1; c2; c3;
c0[label="measurement"];
c1[label="meters"];
c2[label="unit"];
c3[label="grams"];
label = "Concepts";
c1 -> c2 [label="subconcept"];
color=green;
}
subgraph cluster_3 {
node [style=filled];
d1 d2 d3;
d1[label="5m"];
d2[label="3m"];
d3[label="15m"];
label = "Values";
color=red;
}
a2 -> b0 [label="describes"];
a0 -> b0 [label="consumes"];
b0 -> b1 [label="composed of"];
b0 -> b4 [label="returns"];
b0 -> b3;
b3 -> c0 [label="implements"];
b3 -> d1 [label="assigned"];
b1 -> d2 [label=""];
b4 -> d3 [label=""];
b3 -> c1;
c1 -> d1 [label="instantiates"];
a3 -> b0 [label="implements"];
}

BIN
doc/schema.pdf
