About
This website uses Machine Learning to learn Non-trivial Regular Expressions (regexes) from example strings.
These learned Regular Expressions are
- descriptive,
- non-trivial,
- optimal, or near optimal,
- executable,
- matching, or exact matching, and
- learned from positive example input strings.
This has not been done before -- it is a breakthrough in Computer Science and Machine Learning.
This solves the Regular Expression Induction (REI) problem for the first time in a significant and practical way.
Up to 19 regexes are learned for each input set of strings -- providing a choice between Optimality, Readability and Abstractions.
Definitions
Descriptive means that the original input strings can be reconstructed by examining the learned regex.
Optimal means that the learned regex is the shortest regex describing the input string set, as measured by the Significant Length of the regex.
Executable means that most normal regular expression engines can execute the learned regex.
Matching means that the learned regex matches all the input strings.
Exact Matching means that the learned regex matches all and only the input strings.
Abstractions use Character Classes (\d and \w), as well as computed Character Ranges (e.g. [3-5bg-j]).
Plain Length of a regex means the total number of characters in the regex.
Significant Length of a regex means the number of occurrences of original input string characters in the regex.
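The Matching, Exact Matching and Character Range definitions above can be illustrated with a short Python sketch. It is not the website's implementation: the helper names (`char_class`, `matches_all`, `matches_only`), the rule that a run of three or more consecutive characters becomes a range, and the use of whole-string matching are all assumptions made for illustration. Exact matching is only approximated here, by checking the regex against a finite list of candidate strings.

```python
import re
import string


def _escape_in_class(ch):
    # Escape the few characters that are special inside [...].
    return "\\" + ch if ch in "\\]^-" else ch


def char_class(chars, min_run=3):
    """Build a character class such as [3-5bg-j] from a set of characters.

    Assumption: runs of at least `min_run` consecutive characters are
    written as a range; shorter runs are listed character by character.
    """
    codes = sorted({ord(c) for c in chars})
    parts = []
    i = 0
    while i < len(codes):
        j = i
        while j + 1 < len(codes) and codes[j + 1] == codes[j] + 1:
            j += 1
        run = codes[i:j + 1]
        if len(run) >= min_run:
            parts.append(_escape_in_class(chr(run[0])) + "-" + _escape_in_class(chr(run[-1])))
        else:
            parts.extend(_escape_in_class(chr(c)) for c in run)
        i = j + 1
    return "[" + "".join(parts) + "]"


def matches_all(regex, strings):
    """Matching: the regex matches every input string (whole-string match assumed)."""
    pattern = re.compile(regex)
    return all(pattern.fullmatch(s) for s in strings)


def matches_only(regex, strings, candidates):
    """Exact matching, approximated over a finite list of candidate strings."""
    pattern = re.compile(regex)
    accepted = {c for c in candidates if pattern.fullmatch(c)}
    return accepted == set(strings)


inputs = list("345bghij")
regex = char_class(inputs)
print(regex)                       # [3-5bg-j]
print(matches_all(regex, inputs))  # True
print(matches_only(regex, inputs, string.ascii_letters + string.digits))  # True
```

A regex such as `\w` would also pass `matches_all` for these inputs, but it would fail `matches_only`, because it accepts many candidates that are not input strings.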
Notes
- The purpose of these regexes is to be:
  - as close to optimal as possible, and
  - both executable and readable (and modifiable) by humans.
- Thus they also allow humans to analyze strings/sequences.
- It is a new form of explainable machine learning.
- The shortest regex is determined using the Significant Length of the regex.
- Significant Length vs. Plain Length for input string "aab", with the counted characters listed after each count:
  Id  Regex   Significant Length  Plain Length
  1   aab     3: a a b            3: a a b
  2   a{2}b   2: a b              5: a { 2 } b
- Shortest regex, using Plain Length, is Id 1 in the above table: "aab", since (3 < 5).
- Shortest regex, using Significant Length, is Id 2 in the above table: "a{2}b", since (2 < 3).
- Reason for using Significant Length: it promotes the presentation of input string structure in the regex.
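The Significant Length vs. Plain Length comparison above can be reproduced with a minimal Python sketch. The function names are hypothetical, and the counting rule is an assumption: counted quantifiers such as `{2}` and the common metacharacters `( ) | ? * + . ^ $` are treated as not being input string characters. Escapes and character classes are deliberately left out, since the text does not say how they are counted.

```python
import re

# Assumption: counted quantifiers like {2} or {1,3} contribute nothing
# to the Significant Length.
_QUANTIFIER = re.compile(r"\{\d+(?:,\d*)?\}")

# Assumption: these metacharacters are not input string characters.
_META = set("()|?*+.^$")


def plain_length(regex):
    """Plain Length: the total number of characters in the regex."""
    return len(regex)


def significant_length(regex):
    """Significant Length: occurrences of input string characters in the regex.

    Sketch only: handles literals, simple metacharacters and counted
    quantifiers; escapes and character classes are not handled.
    """
    without_quantifiers = _QUANTIFIER.sub("", regex)
    return sum(1 for ch in without_quantifiers if ch not in _META)


candidates = ["aab", "a{2}b"]
for r in candidates:
    print(r, significant_length(r), plain_length(r))
# aab 3 3
# a{2}b 2 5

print(min(candidates, key=plain_length))        # aab
print(min(candidates, key=significant_length))  # a{2}b
```

Ranking by `significant_length` prefers `a{2}b`, the form that exposes the repetition in the input, whereas ranking by `plain_length` prefers the literal `aab`.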
