building SQL categorically
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

142 lines
6.6 KiB

11 months ago
6 months ago
11 months ago
6 months ago
6 months ago
6 months ago
6 months ago
6 months ago
6 months ago
6 months ago
6 months ago
6 months ago
11 months ago
  1. # <img src="docs/src/assets/logo.png" width="15%"> AlgebraicRelations.jl
  2. ![Tests](https://github.com/AlgebraicJulia/AlgebraicRelations.jl/workflows/Tests/badge.svg)
  3. AlgebraicRelations.jl is a Julia library built to provide an intuitive and
  4. elegant method for generating and querying a scientific database. This
  5. package provides tooling for defining database schemas,
  6. generating query visualizations, and connecting directly up to a PostgreSQL
  7. server. This package is built on top of
  8. [Catlab.jl](https://github.com/AlgebraicJulia/Catlab.jl) which is the
  9. powerhouse behind its functions.
  10. ## Learning by Doing
  11. The functions of this library may be best explained by showing an example
  12. of how it can be used. This will be done in the steps of [Defining a
  13. Schema](#defining-a-schema), [Creating Queries](#creating-queries), and
  14. [Connecting to PostgreSQL](#connecting-to-postgresql).
  15. ### Defining a Schema
  16. Within this library, we define database schemas based on the *presentation* of a
  17. workflow (more generally, the presentation of a symmetric monoidal category).
  18. The presentation of a workflow includes the **data types** of products in the
  19. workflow (objects in an SMC) and the **processes** that transform these products
  20. (homomorphisms in an SMC). We will give an example of defining the schema of a
  21. traditional computer vision workflow. This involves extracting images from a
  22. file, performing a test/train split on images, training a neural network on
  23. images, and finally evaluating a network on images. This example is also
  24. presented in [this notebook](examples/ml_workflow_demo/ml_demo.ipynb).
  25. #### Defining Types
  26. In order to define types for the presentation, we need to provide the name of
  27. the type (e.g. `File` for compressed files of images) and then the Julia
  28. datatype which can store this type (The filename can be stored uniquely as a
  29. `String`). The definition of all types that we will need for our example is as
  30. follows:
  31. ```julia
  32. # Initialize presentation object
  33. present = Presentation()
  34. # Add types to presentation
  35. File, Images, NeuralNet,
  36. Accuracy, Metadata = add_types!(present, [(:File, String),
  37. (:Images, String),
  38. (:NeuralNet, String),
  39. (:Accuracy, Real),
  40. (:Metadata, String)]);
  41. ```
  42. #### Defining Processes
  43. To define processes that operate on these types, we need three pieces of
  44. information. First, we need the name of the processes (`extract` for the
  45. process that extracts images from files), the input types (`File` for the file
  46. to extract) and the output types (`Images` for the images which were
  47. extracted). The symbol `⊗` (monoidal product) joins two types, allowing for multiple types
  48. in the inputs and outputs of processes. To the schema, this means nothing more than that,
  49. for the process `train` there are two objects need for the input, the first of
  50. type `NeuralNet` and the second of type `Images`.
  51. ```julia
  52. # Add Processes to presentation
  53. extract, split, train,
  54. evaluate = add_processes!(present, [(:extract, File, Images),
  55. (:split, Images, Images⊗Images),
  56. (:train, NeuralNet⊗Images, NeuralNet⊗Metadata),
  57. (:evaluate, NeuralNet⊗Images, Accuracy⊗Metadata)]);
  58. ```
  59. #### Generating the Schema
  60. Once this presentation is defined, the database schema can be generated as follows:
  61. ```julia
  62. # Convert to Schema
  63. TrainDB = present_to_schema(present);
  64. print(generate_schema_sql(TrainDB()))
  65. ```
  66. ```sql
  67. CREATE TABLE evaluate (NeuralNet1 text, Images2 text, Accuracy3 real, Metadata4 text);
  68. CREATE TABLE extract (File1 text, Images2 text);
  69. CREATE TABLE split (Images1 text, Images2 text, Images3 text);
  70. CREATE TABLE train (NeuralNet1 text, Images2 text, NeuralNet3 text, Metadata4 text);
  71. ```
  72. ### Creating Queries
  73. In order to create queries, we use the `@query` macro (based on the `@relation`
  74. macro in Catlab). For this, we must specify a list of objects to get as results
  75. of the query, list of all objects used in the query, and finally a list of
  76. relationships between these objects (based on the primitives defined for the
  77. workflow). In this case, the relationships between objects are the processes
  78. from the presentation and the types of objects are the types defined in the
  79. presentation. Following is an example workflow
  80. ```julia
  81. q = @query TrainDB() (im_train, nn, im_test, acc, md2) where (im_train, im_test, nn,
  82. nn_trained, acc, md,
  83. md2, _base_acc, im) begin
  84. train(nn, im_train, nn_trained, md)
  85. evaluate(nn_trained, im_test, acc, md2)
  86. split(im, im_train, im_test)
  87. >=(acc, _base_acc)
  88. end
  89. print(to_sql(q))
  90. ```
  91. This produces the following query:
  92. ```sql
  93. SELECT t1.Images2 AS im_train, t1.NeuralNet1 AS nn, t2.Images2 AS im_test, t2.Accuracy3 AS acc, t2.Metadata4 AS md2
  94. FROM train AS t1, evaluate AS t2, split AS t3
  95. WHERE t2.NeuralNet1=t1.NeuralNet3 AND t3.Images2=t1.Images2 AND t3.Images3=t2.Images2 AND t2.Accuracy3>=$1
  96. ```
  97. ### Connecting to PostgreSQL
  98. The connection to PostgreSQL is fairly straightforward. We first create a
  99. connection using the [LibPQ.jl](https://invenia.github.io/LibPQ.jl/stable/)
  100. library:
  101. ```Julia
  102. conn = Connection("dbname=test_db");
  103. ```
  104. We then can prepare statements and run them with arguments like:
  105. ```Julia
  106. statement = prepare(conn,q)
  107. execute(statement, [0.6])
  108. ```
  109. which will obtain all of the rows from the previous query which contain
  110. an accuracy of greater than 0.6.
  111. The `execute` function will return a `DataFrame` object (from the
  112. [`DataFrames.jl`](http://juliadata.github.io/DataFrames.jl/stable/) library)
  113. ## Theory
  114. Some excellent resources for understanding how Bicategories of Relations relate
  115. to SQL queries (and inspiriation for this library) are as follows:
  116. * ["Knowledge Representation in Bicategories of Relations"](https://arxiv.org/abs/1706.00526)
  117. * This work does an excellent job of elucidating operations on the Bicategories of Relations and how that relates to methods of knowledge representation like SQL
  118. * ["The operad of wiring diagrams: formalizing a graphical language for databases, recursion, and plug-and-play circuits"](https://arxiv.org/abs/1305.0297)
  119. * This work presents the concepts behind converting undirected wiring diagrams to queries (as well as the limitations present in this conversion)
  120. * Category Theory for Scientists by Spivak
  121. * While generally a very useful introduction to Category Theory, this book elaborates on the categorization of databases in Chapter 3 (in the online version)