Pikchr XML Classes With ~class-name Token

(1) By sam atman (mnemnion) on 2022-08-28 16:39:36 [link] [source]

After a reasonable amount of tinkering around, I'm glad to report that I've added a mechanism for tagging Pikchr elements with class names.

I would be happy to sign a Contributor's Agreement so I can share the branch without creating complications. There aren't impediments to sharing the code on my end, and I'm happy to just do that if no paperwork is considered necessary.

This was a fun project and I'm pleased with the results; I hope you'll like it as well.

I write parsers for a living, but those are Parsing Expression Grammars in Lua: top-down, single-scan (no lexer), no left recursion, and memory-managed, rather than bottom-up, scanner-driven, with manually managed memory and idiomatic left recursion. Really an excellent shift of frame for me.

The first thing I discovered is that % is an impractically difficult prefix for the job, because the lexer isn't expected to have enough context to know when it means percent and when it's leading a %classname. Not saying it can't be done, but I am saying that a phrase like linewid 120 %chop is legal, however hideous, and educating the parser would be a breaking change, not to mention the additional complexity.
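To make the lexer point concrete, here is a minimal sketch of how a dedicated prefix keeps the decision local to the tokenizer. The helper name class_token_len and the set of characters allowed in a class name are assumptions for illustration; none of this is taken from pikchr.y.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not from pikchr.y: return the length of a
** ~class-name token starting at z, or 0 if z does not start one.
** Assumption: a name starts with a letter or '_' and continues with
** letters, digits, '-' or '_'. */
static size_t class_token_len(const char *z){
  size_t n;
  assert( z!=NULL );
  if( z[0]!='~' ) return 0;          /* Only '~' introduces a class */
  if( !isalpha((unsigned char)z[1]) && z[1]!='_' ) return 0;
  n = 2;
  /* The hyphen is consumed as part of the name, so it never reaches
  ** the parser as a minus sign. */
  while( isalnum((unsigned char)z[n]) || z[n]=='-' || z[n]=='_' ) n++;
  return n;
}
```

Because nothing else in the language begins with ~, the tokenizer needs no context to classify the token, which is exactly what % could not offer.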

I have to say, giving the user two whole userspaces with $ and @ was pretty generous! I hoard the glyphs you can type with an ISO keyboard jealously, since I know I'm going to run out.

But little ~ wasn't doing anything and you know what, I like it. The result is lightweight. I put the PToken on a struct field for the PXmlClass type, made a little note that the PToken might be gone by the time I need it, and whipped up a single-linked list for appending classes as I encounter them.

I then had a few hours of thinking very hard about the implication of the number of parse conflicts changing when I would insert and remove a rule for a nonterminal class_list(X), or derivations of it. For whatever reason, the conflicts I was generating didn't have tags in the .out file, as some noncanonical online sources led me to expect.

After a couple of hours of these shenanigans, and reading what there is online about the Lemon parser, I realized that I was trying to avoid left recursion, and that an unnamed_statement rule with a leading XML_CLASS(C) token can and will match as many ~class-name tokens as it sees. All I needed was another slot on PObj to put the linked list, and I was off to the races.

Straight into a segmentation fault, which was a bit harder to find than it should have been, since I'd forgotten my suspicion that keeping a PToken around until render-time was going to get me in trouble. Once I found my own note to that effect, momentum was restored.
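The lifetime problem can be sketched like this: copy the name out of the token when the class is parsed, so the list owns its own storage. The struct layouts and the names class_list_append and class_list_free are stand-ins for illustration, not the branch's actual code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the types named in the post. */
typedef struct PToken { const char *z; unsigned int n; } PToken;
typedef struct PXmlClass PXmlClass;
struct PXmlClass {
  char *zName;        /* Owned copy of the class name, without the '~' */
  PXmlClass *pNext;   /* Next class on the same object, or NULL */
};

/* Append a class to the end of *ppList, copying the name out of the
** token so the list stays valid after the token text is gone.
** Returns 0 on success, 1 on out-of-memory. */
static int class_list_append(PXmlClass **ppList, const PToken *pTok){
  PXmlClass *pNew;
  unsigned int n;
  assert( ppList && pTok && pTok->n>=1 && pTok->z[0]=='~' );
  n = pTok->n - 1;                    /* Skip the leading '~' */
  pNew = malloc(sizeof(*pNew));
  if( pNew==NULL ) return 1;
  pNew->zName = malloc(n+1);
  if( pNew->zName==NULL ){ free(pNew); return 1; }
  memcpy(pNew->zName, pTok->z+1, n);
  pNew->zName[n] = 0;
  pNew->pNext = NULL;
  while( *ppList ) ppList = &(*ppList)->pNext;  /* Walk to the tail */
  *ppList = pNew;
  return 0;
}

/* Release every node and its copied name. */
static void class_list_free(PXmlClass *p){
  while( p ){
    PXmlClass *pNext = p->pNext;
    free(p->zName);
    free(p);
    p = pNext;
  }
}
```

Walking to the tail on every append is quadratic, but an object carries a handful of classes at most, so the simpler code seems a fair trade.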

With the semantic portion of the string copied to a const char *, it was time to render the classes. The first step here was to factor out a pik_append_tag_open(p, pObj, "tag") function, for every case where a tag is being created for an object. That turned out not to be a 1-1 relationship (I'll get back to that), but it created a hook for pik_append_xml_class.
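As a rough illustration of the hook, here is a sketch of a tag-open helper that appends a class attribute only when classes are present. It writes into a caller-supplied buffer rather than Pikchr's output stream, and the name tag_open and its signature are assumptions, not the signatures of pik_append_tag_open or pik_append_xml_class.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: write `<zTag class="a b"` (without the closing
** '>') into buf.  Returns the number of characters written, or -1 if
** the result does not fit. */
static int tag_open(char *buf, size_t cap, const char *zTag,
                    const char *const *azClass, int nClass){
  size_t used;
  int i, rc;
  assert( buf && cap>0 && zTag );
  rc = snprintf(buf, cap, "<%s", zTag);
  if( rc<0 || (size_t)rc>=cap ) return -1;
  used = (size_t)rc;
  for(i=0; i<nClass; i++){
    /* The first class opens the attribute; later ones are space-separated */
    rc = snprintf(buf+used, cap-used, "%s%s",
                  i==0 ? " class=\"" : " ", azClass[i]);
    if( rc<0 || (size_t)rc>=cap-used ) return -1;
    used += (size_t)rc;
  }
  if( nClass>0 ){
    if( used+2>cap ) return -1;       /* Room for '"' and the NUL */
    buf[used++] = '"';
    buf[used] = 0;
  }
  return (int)used;
}
```

With no classes the output is just the bare tag, which matches the observation that unclassed diagrams come out essentially unchanged.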

Which was great! My little test ditty was rendering the way I wanted it to.

But when I went to try it out on an SQLite diagram, I had somehow broken all the arrows. This turned out to be because the line was rendering by first allocating a path, then checking for arrowheads, appending them, doing path stuff to the path, and then appending the path.

Which wasn't a bug until I started changing things: drawing the arrowhead(s) first has no effect on the appearance of the diagram, but it was a surprising execution order.

Moving the arrowhead draw to the end fixed my regression, but the classes were only written to the line, not the polygon representing the arrowhead, which is unsatisfying (imagine changing the color).

After some more tinkering I have the code generating a group for the types composing multiple SVG elements, which attaches the classes to this group if there are any, and which doesn't do this twice if the pObj has a name, since we already group those.
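My reading of that grouping rule, as a sketch: an object gets at most one group, opened when it has a name or renders to more than one SVG element, and any classes ride on that group. The SketchObj struct, the group_open name, and the use of the label as the id value are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified view of an object at render time. */
typedef struct SketchObj {
  const char *zName;    /* Label such as "CR", or NULL if unnamed */
  const char *zClasses; /* Space-separated class names, or NULL */
  int nElem;            /* Number of SVG elements the object renders to */
} SketchObj;

/* Write at most one <g ...> open tag into buf.
** Returns 1 if a group was opened (the caller must emit </g> later),
** 0 if no group is needed, -1 if buf is too small. */
static int group_open(char *buf, size_t cap, const SketchObj *p){
  int rc;
  assert( buf && cap>0 && p );
  buf[0] = 0;
  /* A single unnamed element carries its classes directly; no group */
  if( p->zName==NULL && p->nElem<2 ) return 0;
  rc = snprintf(buf, cap, "<g%s%s%s%s%s%s>",
      p->zName ? " id=\"" : "",
      p->zName ? p->zName : "",
      p->zName ? "\"" : "",
      p->zClasses ? " class=\"" : "",
      p->zClasses ? p->zClasses : "",
      p->zClasses ? "\"" : "");
  if( rc<0 || (size_t)rc>=cap ) return -1;
  return 1;
}
```

Putting the classes on the group means a CSS rule for a class reaches both the line and its arrowhead polygon.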

This gives me the feature I need, and raises enough questions that it's worth pausing on symbols, while confirming for me the worth of it. I'll put off mulling over the implications of the current state of the code until after I show the results off.

I'll reply with some of the results, and a comparison with the SVG generated by trunk.

(2) By sam atman (mnemnion) on 2022-08-28 16:59:12 in reply to 1 [link] [source]

I picked the create-table clause out of the SQLite diagrams, as a reference with reasonable complexity.

I've eyeballed the test cases, and didn't spot any changes visually, but I wasn't especially careful about it. The right thing is to write a script to generate a page with both outputs side by side, maybe use the .hidden mechanism to switch between them by clicking.

Rather than clutter up the forum, I'm including these as gists.

The original SVG output from trunk should of course look familiar.

Which is from this Pikchr code:

     linerad = 10px
     linewid *= 0.5
     $h = 0.21

     circle radius 10%
A0:  arrow 2*arrowht
CR:  oval "CREATE" fit
T1:  oval "TEMP" fit with .w at (linewid right of CR.e,.8*$h below CR)
T2:  oval "TEMPORARY" fit with .w at 1.25*$h below T1.w
TBL: oval "TABLE" fit with .w at (linewid right of T2.e,CR)
     arrow from CR.e right even with T2; arrow to TBL.w
     arrow from CR.e right linerad then down even with T1 then to T1.w
     arrow from CR.e right linerad then down even with T2 then to T2.w
     line from T2.e right linerad then up even with TBL \
        then to arrowht left of TBL.w
     line from T1.e right even with linerad right of T2.e then up linerad
     arrow from TBL.e right
     oval "IF" fit
     arrow right 2*arrowht
     oval "NOT" fit
     arrow 2*arrowht
ETS: oval "EXISTS" fit

     # IF NOT EXISTS bypass
Y1:  .5*$h below T2.s  # vertical position of back-arrow
     arrow from TBL.e right linerad then down even with Y1 then left even with T2
     arrow from ETS.e right linerad then down even with Y1 \
        then left even with ETS.w
     line left even with TBL.w

     # second row
     arrow left even with first circle then down $h*1.25 then right 2*arrowht
SN:  oval "schema-name" fit
     arrow 2*arrowht
DOT: oval "." bold fit
     arrow
TN:  oval "table-name" fit

     # schema-name bypass
     arrow from (first circle,SN.n) down even with $h below SN \
       then right even with SN
     line right even with arrowht right of DOT.e then up even with DOT \
        then right linerad

     # Loop back from table-name 
     arrow from TN.e right linerad then down even with DOT.s-(0,2*$h) \
       then left even with DOT

     # third row
     arrow left even with first circle then down $h*1.25 then right 2*arrowht
LP:  oval "(" bold fit
     arrow
CD:  box "column-def" fit
TC:  box "table-constraint" fit with .w at CD.e+(1.5*linewid,-1.25*$h)
     arrow <- from TC.e right 1.5*arrowht
C2:  oval "," bold fit
RP:  oval ")" bold fit at (2*linewid right of C2,LP)
     arrow from RP.e right 3*arrowht
TO:  box "table-options" fit

     # column-def loop
C1:  oval "," bold fit at 1.25*$h below CD
     arrow from CD.e right linerad then down even with C1 then to C1.e
     line from C1.w left even with 2*arrowht left of CD.w then up even with CD \
       then to arrowht left of CD.w

     # table-constraint bypass
     arrow from CD.e right
     arrow to RP.w

     # table-constraint loop
     arrow from (C2.e,RP) right 2*arrowht then down even with C2 then to C2.e
     line from TC.w left linerad then up even with RP then right 2*arrowht

     # exit circle and table-options bypass
     arrow from RP.e right linerad then up 1.5*$h then right even with TO.n
     arrow right even with TO.e then right 3*arrowht
EC:  circle same

     # table-options exit
     arrow from TO.e right linerad then up 1.5*$h then right even with EC.w

     # AS select clause
     arrow from TN.e right 250%
     oval "AS" fit
     arrow 2*arrowht
     box "select-stmt" fit
     arrow right
     line right even with linerad right of TO.e then down even with last circle \
        then right linerad

The SVG output from the branch isn't that different, and appears identical when rendered. You'll see the semantic grouping, but no classes, since none were present.

Here's the pikchr with the new token class, so that won't render correctly (yet!):

/*    2 */       linewid *= 0.5
/*    3 */       $h = 0.21
/*    4 */  
/*    5 */       circle radius 10%
/*    6 */  A0:  arrow 2*arrowht
/*    7 */  CR:  ~keyword ~token ~always [oval "CREATE" fit]
                 ^
ERROR: unrecognized token

I happen to think this is a legible change.

You can see the SVG output is still familiar.

But if you paste that SVG into a minimal HTML document, and hover over it, you'll see something cool.

I'm going to try and find a site which will host the SVG without disabling the hover feature, and reply with that.

(3) By drh on 2022-08-29 10:34:53 in reply to 1 [link] [source]

Thank you for taking the time to try to improve Pikchr and PIC.

PIC is a 40-year-old language. Any changes and/or enhancements need to be well justified. I think you will need to make a very strong case that the lack of class tags in the current implementation is a serious impediment before something like this could be added. Furthermore, the new "~class-name" syntax seems un-PIC-like. Are you sure this is the best approach?

(4) By sam atman (mnemnion) on 2022-08-29 16:12:51 in reply to 3 [link] [source]

Thanks for your reply, and for considering the case.

I agree that change should not be hasty. I'm interested in sharing the branch to enrich that discussion, without any expectation that it will land on trunk soon or ever. I would just upload it somewhere, but I know that SQLite is closed-contribution as a matter of policy, and Fossil uses a Contributor's Agreement. My purpose in offering to sign is to leave you free to consider, change, reject, reimplement from scratch, without issues.

The ~class-name syntax was an expedient choice, which has grown on me in the last, er, four days. I would like a better sense of your impression that it's un-PIC-like, because if we can come up with something which is PIC-like and provides classes, so much the better. I like prefixed-prefix because it can't interfere with existing code and because, the way I think, classification is front loaded between the name and the contents. PIC is fairly postfix oriented, which I take as justification for treating the statement as postfix to classes, and statements a postfix to labels. I'm interested in your thoughts.

I will take a satisfactory syntax as a given in the meantime, since there are more important questions. Such as whether the additional complexity is justified, and more centrally, if it's taking Pikchr in a direction that it shouldn't go. Nothing should compromise the core principles of the language, and you're right to be skeptical of any radical proposal, and this qualifies. Will the result be un-PIC-like? Good question.

There are a few steps to justifying something well. I have a proof of concept, and I'll make the case on the merits here. I'll start by saying that these are necessary changes for the purposes I intend. I'm an enthusiastic languages hacker, yes, but in this case as a means to an end. The powerful case must be made by demonstration, a proof of concept is a precursor to that.

But a case on the merits needs to justify itself by situating itself within the history of the project, and respecting its aims and existing uses.

Let me start with some observations about the difference between PIC and Pikchr. That will lay the groundwork for why extending Pikchr might be a good idea. I'll call the extension ~~Pikchr because: ++ has a bad reputation, ~ isn't postfix, and two gives you back your original value.

The biggest change is to the output format: instead of troff or LaTeX, Pikchr emits SVG. I've taken it as a given that SVG is all that Pikchr will ever support; that assumption is pervasive in the code, and there aren't contexts where SVG won't work, nor will there be. It's the lingua franca for vectors.

Indeed, the first point in the differences from PIC is 'designed for the Web'. Markdown, SVG, and CSS are all name-checked in this paragraph. The omission of troff commands is, to my point of view, much like the inclusion of classes, in that both features are intended for use by the underlying format, to get something simple to use, which retains access to the full power of the system it's rendered to.

It's not surprising or bad that Pikchr treats SVG like troff. You should see what Adobe spits out. The semantics are quite different between them, and I've been able to take advantage of that.

troff doesn't have a place to put an invisible label, and wouldn't have anything useful to do with it. SVG does: it has an id="" attribute which can be filled in with anything.

What ~~Pikchr is doing is moving as much of the semantics of the .pik file as it can to the SVG output. Which gets back to classes: they are the mechanism in XML for traits, concerns which cut across inheritance and elemental category.

Classes are the one thing it isn't practical to add after Pikchr is done rendering. The only way to maintain that mapping while changing the pik file is to have it in the file.

Demo

Almost by definition, a change to Pikchr allows it to do things that SQLite hasn't felt a compelling need for. So I can't imagine you would say "Wow! I needed my diagrams to do this, thanks!". It's still the right before/after, partly because it shows that without the semantic markup, the SVG is visually identical and barely larger.

I've made a gist with an HTML file containing the SVG. The only thing added to the ~~Pikchr output is a CSS block. If you open that file locally and mouse hover, the elements of the statement will change color.

Without the CSS, it's the same inert diagram; it just has a bunch of classes in it.

Your question was whether the lack of classes was a serious impediment to something: imagine doing this to Pikchr output, rather than by adding a few CSS rules to ~~Pikchr. It would be impaired!

Given that, I would humbly suggest the code has more to say than prose can, at this point. I may even continue to tinker, as I have time and attention. Thanks for your consideration and good works.

(5) By drh on 2022-08-30 15:52:32 in reply to 3 [source]

Something like:

class class-name

Seems more PIC-like to me. You might also add:

id identifier-name

In other words: keyword value.

(6) By drh on 2022-08-30 15:55:57 in reply to 5 [link] [source]

What are the security implications of adding arbitrary class names to the SVG generated by Pikchr? Could a hostile user leverage other class names from elsewhere on the page in order to cause mischief? Would we need to prepend a unique prefix to each class name (perhaps "pikchr-") in order to avoid collisions with other classes from other parts of the document in which the pikchr is embedded?

(8) By sam atman (mnemnion) on 2022-09-01 09:45:57 in reply to 6 [link] [source]

That's a good question I can't answer with confidence; there's just a lot that goes into it.

Mostly, no, if they can't add to the page in other ways. We can get really hypothetical about it and imagine transmuting an SVG into something which looks like a Windows login form, but when you start reasoning about how that attack might progress, the level of implied control suggests that they won't need that asset. I'm going to run this question by some colleagues who know more so I can expand on when this is unlikely, which is most of the time.

It might be okay to have a 'desktop' Pikchr and a 'server' Pikchr, which are the same codebase with a compile-time flag or Make target, where the only differences are small: higher token limit for the desktop, and unique prefixes for classes on the server. I don't think the prefixing is necessary for security, but if there are plausible reasons to do it, this is one of a few ways to handle it.

Prefixing every class a tool produces in the same way is also not uncommon at all.

This is a situation where 'risk of collision' and 'deliberately naming classes so they match the rest of the document' are somewhat in tension, and one bit of configuration may be justified.
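For the sake of discussion, here is a sketch of what that one bit of configuration could look like: a compile-time prefix plus a conservative check on the characters in a class name. The macro SKETCH_CLASS_PREFIX, both function names, and the whitelist itself are assumptions, and the character check is my own addition rather than something proposed above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical compile-time knob: a server build keeps the prefix,
** a desktop build could define SKETCH_CLASS_PREFIX as "". */
#ifndef SKETCH_CLASS_PREFIX
# define SKETCH_CLASS_PREFIX "pikchr-"
#endif

/* Return 1 if zName starts with a letter and otherwise contains only
** letters, digits, '-' and '_'.  This whitelist is an assumption and
** is stricter than what XML permits in a class attribute. */
static int class_name_ok(const char *zName){
  size_t i;
  if( zName==NULL || !isalpha((unsigned char)zName[0]) ) return 0;
  for(i=1; zName[i]; i++){
    unsigned char c = (unsigned char)zName[i];
    if( !isalnum(c) && c!='-' && c!='_' ) return 0;
  }
  return 1;
}

/* Write the prefixed class name into buf.  Returns 0 on success,
** -1 if the name is rejected or does not fit. */
static int class_name_emit(char *buf, size_t cap, const char *zName){
  int rc;
  assert( buf!=NULL && cap>0 );
  if( !class_name_ok(zName) ) return -1;
  rc = snprintf(buf, cap, "%s%s", SKETCH_CLASS_PREFIX, zName);
  return (rc<0 || (size_t)rc>=cap) ? -1 : 0;
}
```

Rejecting quotes and angle brackets outright keeps a class name from closing the attribute early, whatever one concludes about the prefix.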

(7.1) By sam atman (mnemnion) on 2022-09-01 10:18:56 edited from 7.0 in reply to 5 [link] [source]

Ok, that makes sense to me as a syntax for classes. Although I wonder whether two ways of adding an id would be useful, since one can label any statement and the automatic mapping between label and id is something I consider a strength.

If this followed the XML convention of class "a-class secondClass class03" that would be lightweight enough, although it's irregular syntax because strings generally show up in the output. An ordinary function call class(a-class, secondClass, class03) might fit better.

I would want to include some variadic way to add many classes after saying class once. There are real-life uses for passing small handfuls of classes to things, and repeating the keyword for each one would feel noisy.

Another possibility, which is kind of shell-like, so same school as PIC, would be class one-class for one class and class 'a-class secondClass class03', with single quotes, for several. I like this the best: single-quoted strings aren't textual elements in Pikchr, or indeed legal syntax, unless I've misread the tokenizer this morning. As you know better than most, single quotes are vernacular for telling a simple parser that spaces are part of the value in a key-value pair (or elsewhere). So (to me at least) it doesn't feel magical to allow the quotes to be left off for a single value, though I'm willing to be persuaded otherwise.
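A sketch of how the value of such a class attribute might be split, working on raw text and accepting either a bare word or a single-quoted list. The ClassSpan type and split_class_value are hypothetical names, and this sidesteps the real tokenizer entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical span pointing into the source text; nothing is copied. */
typedef struct ClassSpan { const char *z; size_t n; } ClassSpan;

/* zArg points at the text following the `class` keyword: either a bare
** word or a single-quoted, space-separated list.  Stores up to nMax
** spans in aOut and returns the number of class names found, or -1 if
** a quoted list is unterminated. */
static int split_class_value(const char *zArg, ClassSpan *aOut, int nMax){
  const char *z = zArg;
  const char *zEnd;
  int n = 0;
  assert( zArg && aOut && nMax>=0 );
  if( z[0]=='\'' ){
    z++;
    zEnd = strchr(z, '\'');           /* Closing quote ends the list */
    if( zEnd==NULL ) return -1;
  }else{
    zEnd = z + strcspn(z, " \t\n");   /* Bare form: one word only */
  }
  while( z<zEnd ){
    const char *zStart;
    while( z<zEnd && (*z==' ' || *z=='\t') ) z++;
    zStart = z;
    while( z<zEnd && *z!=' ' && *z!='\t' ) z++;
    if( z>zStart ){
      if( n<nMax ){ aOut[n].z = zStart; aOut[n].n = (size_t)(z-zStart); }
      n++;
    }
  }
  return n;
}
```

For the input 'a-class secondClass class03' fit this yields three spans and leaves fit for the ordinary attribute parser.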

Let me check an assumption: this would be in the attribute list, so MyLabel: oval class 'fancy-oval tooltip' fit or in any other such order, rather than living ahead of the object class? Class is an attribute, so I'm seeing your point here. Thanks for the feedback.

Edit: as I think about it more, one of the reasons I chose a prefix is so that hyphens wouldn't have a way to collide with the minus sign. Idiomatic classes make heavy use of hyphens, and XML doesn't allow us to treat underscores as equivalent. This implies that leaving off the single quotes in that syntax isn't an option: allowing it, but only for names without hyphens, is exactly the kind of spooky magic no one wants to deal with debugging.

It seems to me that classes would be a distinct reduction which comes directly after an object name or textual element, this would cooperate best with define and the proposed symbol extension.