About

This website performs Machine Learning of Non-trivial Optimal Regular Expressions.

These learned Regular Expressions are:
  1. Descriptive
  2. Non-trivial
  3. Optimal or near-optimal
  4. Executable
  5. Matching or Exact Matching
  6. Learned from positive example input strings

This has not been done before; it is a breakthrough in Computer Science and Machine Learning.

This solves the Regular Expression Induction (REI) problem in a significant and practical way.

Up to 17 regexes are learned for each input set, offering trade-offs between Optimality, Readability, and Abstractions.

Definitions

  • Descriptive: The input strings can be reconstructed from the learned regex.
  • Optimal: The shortest regex based on Significant Length.
  • Executable: Compatible with standard regex engines.
  • Matching: Matches all input strings.
  • Exact Matching: Matches all and only the input strings.
  • Abstractions: Use of character classes (e.g., \d, \w) and ranges (e.g., [3-5bg-j]).
  • Plain Length: Total characters in the regex.
  • Significant Length: Number of characters in the regex that come from the input strings. Regex syntax, such as the quantifier {2}, is not counted.
  • Expansion Factor: (matched strings count) ÷ (original input strings count). For exact matches: 1.0X.
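
The Matching, Exact Matching, and Expansion Factor definitions above can be checked mechanically. The following is a minimal sketch, not the site's implementation: the function names, the alphabet, and the length bound are assumptions chosen for illustration, and the brute-force enumeration only works for regexes that match a small, finite set of strings.

```python
import re
from itertools import product


def matches_all(pattern, inputs):
    """Matching: every input string is matched in full by the pattern."""
    compiled = re.compile(pattern)
    return all(compiled.fullmatch(s) is not None for s in inputs)


def matched_strings(pattern, alphabet, max_len):
    """Brute-force every string over `alphabet` up to `max_len` characters
    and keep those the pattern matches in full (hypothetical helper)."""
    compiled = re.compile(pattern)
    found = set()
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if compiled.fullmatch(candidate):
                found.add(candidate)
    return found


def is_exact_match(pattern, inputs, alphabet, max_len):
    """Exact Matching: the pattern matches all and only the input strings."""
    return matched_strings(pattern, alphabet, max_len) == set(inputs)


def expansion_factor(pattern, inputs, alphabet, max_len):
    """(matched strings count) / (original input strings count)."""
    return len(matched_strings(pattern, alphabet, max_len)) / len(set(inputs))


inputs = ["a1", "a2", "a3"]
alphabet = "a0123456789"  # assumed universe of characters for the enumeration

for pattern in (r"a[1-3]", r"a\d"):
    factor = expansion_factor(pattern, inputs, alphabet, max_len=2)
    exact = is_exact_match(pattern, inputs, alphabet, max_len=2)
    print(f"{pattern}: exact={exact}, expansion={factor:.2f}X")
# a[1-3]: exact=True, expansion=1.00X
# a\d: exact=False, expansion=3.33X
```

Here a[1-3] is an Exact Matching regex for the three inputs, while a\d is only Matching: it also accepts a0 and a4 through a9, so it matches 10 strings for 3 inputs.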

Notes

  • Purpose:
    1. As close to optimal as possible
    2. Executable and readable by humans
  • Supports human analysis of strings and sequences
  • Introduces a new form of explainable machine learning
  • Shortest regex determined by Significant Length
  • Significant vs. Plain Length example for input string aab:
    Id  Regex  Significant Length  Plain Length
    1   aab    3 (a, a, b)         3
    2   a{2}b  2 (a, b)            5
    1. Using Plain Length, shortest is aab (3 < 5)
    2. Using Significant Length, shortest is a{2}b (2 < 3)
  • Why Significant Length? It emphasizes structure in the regex.
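
The aab comparison above can be reproduced with a short sketch. This is an approximation under stated assumptions, not the site's scoring code: significant_length is a hypothetical helper that drops {m} and {m,n} quantifiers and then ignores common regex operator characters. It does not handle character classes or escapes such as \d.

```python
import re

# Characters treated as regex syntax rather than input characters (assumption).
REGEX_OPERATORS = set("()[]|*+?^$.\\")


def plain_length(regex):
    """Plain Length: total characters in the regex."""
    return len(regex)


def significant_length(regex):
    """Approximate Significant Length: count characters that come from the
    input strings, skipping quantifiers and operator characters."""
    without_quantifiers = re.sub(r"\{\d+(?:,\d*)?\}", "", regex)
    return sum(1 for ch in without_quantifiers if ch not in REGEX_OPERATORS)


candidates = ["aab", "a{2}b"]
for regex in candidates:
    print(regex, significant_length(regex), plain_length(regex))
# aab 3 3
# a{2}b 2 5

print(min(candidates, key=plain_length))        # aab
print(min(candidates, key=significant_length))  # a{2}b
```

Ranking by Plain Length selects the literal aab, while ranking by Significant Length selects a{2}b, the regex that expresses the repetition.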