Designing a Tree Diff Algorithm Using Dynamic Programming and A*

Created on 2021-10-21T11:14:31-05:00

Initial version walked both trees and ran a patience diff when finding a node that's child count did not match.

Patience diffing

Attempts at using tree diffing algorithms lead to problems of mechanically and semantically correct but derpy answers.

Reframing the problem as using dynamic programming to find the optimal place to insert switch nodes.

Basically rewriting as a system that tries various operations (ex. insert, delete, recurse in to, ...) and returns a change score.

Change scores for given subtrees are calculated and the best found result is cached; when later attempts come by they can reuse the best solved option for the subtree.

Used two grids to handle an 'inside' and 'outside' mode since different changes happen inside and outside a special date block.

Unlike Levenstein Distance and Knuth-Plass the nodes also carry s-expressions for the generated trees as well as a badness score; the algorithm tries to create a new tree/code to write which has the least badness.

Example of badness is strategically placing switch branches so the generated code has the least number of characters; ex. whether to insert more date branches to avoid repeating long identifier names or when to repeat one or two lines of code rather than inject a new branch point.

Bunch of optimizations for specific use cases, implementation details.

Adding pathfinding to dynamic programming

Using a heap to store a priority list of the most promising paths to pursue

Using A* to estimate remaining work for a given path

Path takes current known path and adds estimated remaining work to the heap