| | MAINTENANCE README FOR PCRE2 |
| | ============================ |
| |
|
| | The files in the "maint" directory of the PCRE2 source contain data, scripts, |
| | and programs that are used for the maintenance of PCRE2, but which do not form |
| | part of the PCRE2 distribution tarballs. This document describes these files |
| | and also contains some notes for maintainers. Its contents are: |
| |
|
| | Files in the maint directory |
| | Updating to a new Unicode release |
| | Preparing for a PCRE2 release |
| | Updating version info for libtool |
| | Long-term ideas (wish list) |
| |
|
| | For a description of the way PCRE2 works, see the file called HACKING in the |
| | top directory. |
| |
|
| |
|
| | Files in the maint directory |
| | ============================ |
| |
|
| | 132html |
| | A Perl script to convert man pages to HTML (.1 and .3 files "two" HTML), |
| | used by UpdateAlways. |
| |
|
| | CheckMan |
| | A Perl script to validate the syntax in PCRE2 man pages, used by |
| | UpdateAlways. |
| |
|
| | CleanTxt |
| | A Perl script to clean up the nroff output in PCRE2 man pages, used by |
| | UpdateAlways. |
| |
|
| | Detrail |
| | A Perl script to remove trailing whitespace from PCRE2 files, used by |
| | UpdateAlways. |
| |
|
| | GenerateCommon.py |
| | A Python module containing data and functions that are used by the other |
| | Generate scripts. |
| |
|
| | GenerateTest.py |
| | A Python script that generates input and expected output test data for tests |
| | 26 or 27, which test certain aspects of Unicode property support. |
| |
|
| | GenerateUcd.py |
| | A Python script that generates the file pcre2_ucd.c from GenerateCommon.py |
| | and Unicode data files, which are themselves downloaded from the Unicode web |
| | site. The generated file contains the tables for a 2-stage lookup of Unicode |
| | properties, along with some auxiliary tables. The script starts with a long |
| | comment that gives details of the tables it constructs. |
| |
|
| | GenerateUcpHeader.py |
| | A Python script that generates the file pcre2_ucp.h from GenerateCommon.py |
| | and Unicode data files. The generated file defines constants for various |
| | Unicode property values. |
| |
|
| | GenerateUcpTables.py |
| | A Python script that generates the file pcre2_ucptables_inc.h from |
| | GenerateCommon.py and Unicode data files. The generated file contains tables |
| | for looking up Unicode property names. |
| |
|
| | FetchUcd.sh |
| | A shell script to download the UCD data from the Unicode website into |
| | the Unicode.tables directory. |
| |
|
| | FilterCoverage.py |
| | A small helper used by the RunCoverage script. |
| |
|
| | LintMan |
| | A Perl script to check and update magic numbers in the documentation that |
| | correspond to configurable settings in the codebase. |
| |
|
| | manifest-* |
| | Data files used to verify the contents of the distribution tarball and |
| | `make install` file lists. |
| |
|
| | ManyConfigTests |
| | A shell script that runs "configure, make, test" a number of times with |
| | different configuration settings. |
| |
|
| | UpdateAlways |
| | A shell script to ensure that all auto-generated outputs are ready for |
| | release. |
| |
|
| | It should be run often (by CI on each commit) to ensure that the repository |
| | is in a clean and consistent state. |
| |
|
| | UpdateDates.py |
| | UpdateRelease.py |
| | Python scripts to be run less frequently than UpdateAlways. These should only |
| | be needed immediately before a release, when finalising the repository. |
| | UpdateDates.py checks in the last-updated times on documentation pages. |
| | UpdateRelease.py is needed after any change to the version number in |
| | configure.ac. |
| |
|
| | pcre2_chartables.c.non-standard |
| | This is a set of character tables that came from a Windows system. It has |
| | characters greater than 128 that are set as spaces, amongst other things. I |
| | kept it so that it can be used for testing from time to time. |
| |
|
| | README |
| | This file. |
| |
|
| | RunCoverage |
| | A script used to generate the coverage report using Clang. It is called by |
| | the GitHub CI actions, and can also be run by a developer locally. |
| |
|
| | RunManifestTest |
| | RunManifestTest.ps1 |
| | Scripts to generate and verify a list of files against an expected 'manifest' |
| | detailing what the directory should contain. |
| |
|
| | Unicode.tables |
| | The files in this directory were downloaded from the Unicode web site. They |
| | contain information about Unicode characters and scripts, and are used by the |
| | Generate scripts. UnicodeData.txt is also kept here: although it is no longer |
| | used by any script, it is occasionally useful for manually looking up the |
| | details of certain characters. However, note that character names in this |
| | file such as "Arabic sign sanah" do NOT mean that the character is in a |
| | particular script (in this case, Arabic). Scripts.txt and |
| | ScriptExtensions.txt are where to look for script information. |
| |
|
| | ucptest.c |
| | A program for testing the Unicode property macros that do lookups in the |
| | pcre2_ucd.c data, mainly useful after rebuilding the Unicode property tables. |
| | Compile and run this in the "maint" directory (see comments at its head). |
| | This program can also be used to find characters with specific properties and |
| | to list which properties are supported. |
| |
|
| | ucptestdata |
| | A directory containing four files, testinput{1,2} and testoutput{1,2}, for |
| | use in conjunction with the ucptest program. |
| |
|
| | utf8.c |
| | A short, freestanding C program for converting a Unicode code point into a |
| | sequence of bytes in the UTF-8 encoding, and vice versa. If its argument is a |
| | hex number such as 0x1234, it outputs a list of the equivalent UTF-8 bytes. |
| | If its argument is a sequence of concatenated UTF-8 bytes (e.g. 12e188b4) it |
| | treats them as a UTF-8 string and outputs the equivalent code points in hex. |
| | See comments at its head for details. |
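
As a rough illustration of the two conversions that utf8.c performs, here is a
minimal Python sketch. It is not the C program itself: the function names are
hypothetical, and it relies on Python's built-in UTF-8 codec rather than the
logic in utf8.c, so edge cases (for example surrogates or invalid byte
sequences) may be handled differently.

```python
def code_point_to_utf8_hex(code_point: int) -> str:
    """Return the UTF-8 bytes of one code point as a hex string."""
    return chr(code_point).encode("utf-8").hex()


def utf8_hex_to_code_points(hex_bytes: str) -> list:
    """Decode concatenated UTF-8 bytes (given in hex) into code points."""
    return [ord(ch) for ch in bytes.fromhex(hex_bytes).decode("utf-8")]


if __name__ == "__main__":
    print(code_point_to_utf8_hex(0x1234))                           # e188b4
    print([hex(cp) for cp in utf8_hex_to_code_points("12e188b4")])  # ['0x12', '0x1234']
```

The second call mirrors the 12e188b4 example above: the byte 0x12 decodes to
U+0012 and the bytes e1 88 b4 decode to U+1234.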
| |
|
| |
|
| | Updating to a new Unicode release |
| | ================================= |
| |
|
| | When there is a new release of Unicode, the files in Unicode.tables must be |
| | refreshed from the Unicode web site, which can be done with the script |
| | FetchUcd.sh. Once that is done, the four Python scripts that generate files from |
| | the Unicode data can be run from within the "maint" directory. Note that the |
| | format used for those files is not stable, and therefore changes to the scripts |
| | might be needed to support new versions. |
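
For orientation, files such as Scripts.txt use a simple layout: a code point or
a range, a semicolon, a value, and an optional "#" comment. The sketch below
parses a few sample lines in that layout. It is only an illustration, not code
taken from the Generate scripts; the function names and the embedded sample are
hypothetical stand-ins for reading a real file from Unicode.tables.

```python
SAMPLE = """\
# Illustrative lines in the layout used by Scripts.txt
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
"""


def parse_ranges(text):
    """Yield (first, last, value) for each data line; skip comments and blanks."""
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        span, value = (field.strip() for field in data.split(";", 1))
        first, _, last = span.partition("..")
        # A single code point has no "..", so it is both first and last.
        yield int(first, 16), int(last or first, 16), value


def find_value(code_point, ranges):
    """Return the value whose range contains code_point, or None."""
    for first, last, value in ranges:
        if first <= code_point <= last:
            return value
    return None


if __name__ == "__main__":
    ranges = list(parse_ranges(SAMPLE))
    print(find_value(ord("Q"), ranges))
    print(find_value(0x0371, ranges))
```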
| |
|
| | Note: Previously, it was necessary to update lists of scripts and their |
| | abbreviations by hand before running the Python scripts. This is no longer |
| | necessary because the scripts have been upgraded to extract this information |
| | themselves. Also, there used to be explicit lists of scripts in two of the man |
| | pages. This is no longer the case; the pcre2test program can now output a list |
| | of supported scripts, and the command to do so is part of the documentation. |
| |
|
| | You can give an output file name as an argument to the following scripts, but |
| | by default: |
| |
|
| | GenerateUcd.py        creates pcre2_ucd.c           ) |
| | GenerateUcpHeader.py  creates pcre2_ucp.h           ) in the current directory |
| | GenerateUcpTables.py  creates pcre2_ucptables_inc.h ) |
| |
|
| | These files can be compared against the existing versions in the src directory |
| | to check on any changes before replacing the old files, but you can also |
| | generate directly into the final location by running: |
| |
|
| | ./GenerateUcd.py ../src/pcre2_ucd.c |
| | ./GenerateUcpHeader.py ../src/pcre2_ucp.h |
| | ./GenerateUcpTables.py ../src/pcre2_ucptables_inc.h |
| |
|
| | Once the .c and .h files are in the ../src directory, the ucptest program can |
| | be compiled and used to check that the new tables work properly. The data files |
| | in ucptestdata are set up to check a number of test characters. See the |
| | comments at the start of ucptest.c. Depending on the type of changes, adding |
| | tests for new scripts, properties or characters to the files in ucptestdata |
| | is recommended. Make sure to regenerate and validate the output files |
| | afterwards. |
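
To make the phrase "2-stage lookup" concrete when checking regenerated tables,
the following sketch builds and queries a toy two-stage table. It does not
reproduce the layout of pcre2_ucd.c: the block size, the function names, and
the toy property are assumptions chosen for illustration. The long comment at
the start of GenerateUcd.py describes the real tables.

```python
BLOCK_SIZE = 128  # assumed block size for this sketch


def build_two_stage(values):
    """Split values into blocks and store each distinct block only once."""
    assert len(values) % BLOCK_SIZE == 0, "pad values to a whole number of blocks"
    stage1, stage2, seen = [], [], {}
    for start in range(0, len(values), BLOCK_SIZE):
        block = tuple(values[start:start + BLOCK_SIZE])
        if block not in seen:
            seen[block] = len(stage2) // BLOCK_SIZE
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def two_stage_lookup(stage1, stage2, code_point):
    """Stage 1 selects the block; stage 2 selects the entry within it."""
    block = stage1[code_point // BLOCK_SIZE]
    return stage2[block * BLOCK_SIZE + code_point % BLOCK_SIZE]


if __name__ == "__main__":
    # Toy property over 1024 code points: 1 for ASCII digits, otherwise 0.
    values = [1 if 0x30 <= cp <= 0x39 else 0 for cp in range(1024)]
    stage1, stage2 = build_two_stage(values)
    print(len(stage1), len(stage2))
    print(two_stage_lookup(stage1, stage2, ord("7")))
```

Identical blocks are stored once, which is why the second stage ends up much
smaller than the full list of values.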
| |
|
| | Finally, you should run the GenerateTest.py script to regenerate new versions |
| | of the input and expected output from a series of Unicode property tests that |
| | are automatically generated from the Unicode data files. By default, the files |
| | are written to testinput and testoutput in the current directory, but they |
| | should be moved into the main testdata directory to replace the files that |
| | are used for tests 27 or 26. |
| |
|
| | In summary: |
| |
|
| | ``` |
| | ./GenerateUcd.py ../src/pcre2_ucd.c |
| | ./GenerateUcpHeader.py ../src/pcre2_ucp.h |
| | ./GenerateUcpTables.py ../src/pcre2_ucptables_inc.h |
| | ./GenerateTest.py |
| | mv testinput ../testdata/testinput27 |
| | mv testoutput ../testdata/testoutput27 |
| |
|
| | # Compile ucptest.c (see the comments at its head), then: |
| | for i in 1 2; do |
| | ./ucptest < ucptestdata/testinput$i > testoutput$i |
| | diff -U3 testoutput$i ucptestdata/testoutput$i |
| | done |
| | ``` |
| |
|
| |
|
| | Preparing for a PCRE2 release |
| | ============================= |
| |
|
| | This section contains a checklist of things that I (NCW) do before building a |
| | new release. |
| |
|
| | * First of all, make sure that the main branch is in good condition. |
| |
|
| | - Basically, test it. The CI jobs should all be passing. This ensures that |
| | pcre2tests are passing, that the build is warning-free, and that all our |
| | platforms are running correctly. The CI jobs should be running the Perl |
| | tests (which assert that `testdata/testinput1` tests give the same results |
| | using Perl's regex engine). The ManyConfigTests exercise a variety of |
| | build options and combinations. The Autoconf and CMake builds must pass. |
| |
|
| | - If new build options have been added, ensure that they are added to the |
| | CMake files as well as to the Autoconf files. |
| |
|
| | - Run perltest.sh on the test data for tests 1 and 4. The output should match |
| | the PCRE2 test output, apart from the version identification at the start of |
| | each test. Sometimes there are other differences in test 4 if PCRE2 and Perl |
| | are using different Unicode releases. The other tests are not |
| | Perl-compatible (they use various PCRE2-specific features or options). The |
| | maint/RunPerlTest shell script can be used to do this testing in a Unix-like |
| | environment. |
| |
|
| | - Check the external testing tools. CodeQL & Clang Static Analyzer report |
| | their results to the GitHub "Security" dashboard. Coverity has its own |
| | external dashboard, as does OSS-Fuzz. Since we have these tools, we should |
| | at least confirm they haven't flagged anything. |
| |
|
| | - Documentation: check that the documentation is ready, as well as LICENCE, |
| | NON-AUTOTOOLS-BUILD, and README. Many of these won't need changing, but over |
| | the long term things do change. |
| |
|
| | * Ensure the AUTHORS file is up-to-date with any new contributors since the |
| | last release. I use this simple command: |
| |
|
| | ```sh |
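| | # Assumed --format: one "RealAuthor:" line per commit, then the commit body |
| | git log --format='RealAuthor: %an <%ae>%n%b' $GIT_TAG..HEAD | \ |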
| | grep -E '^RealAuthor: .*|Co-authored-by:' | \ |
| | sed -E -e 's/RealAuthor: |.*Co-authored-by:\s*//' | \ |
| | sort -u |
| | ``` |
| |
|
| | * Ensure the ChangeLog and NEWS files are updated with everything that you want |
| | to announce in the new release. |
| |
|
| | This command helps dump the Git commits: |
| |
|
| | ```sh |
| | git log |
| | |
| | ``` |
| |
|
| | * Update the library version numbers in configure.ac according to the rules |
| | given below. |
| |
|
| | * Add the new library version to the src/libpcre2-*.sym files (even if no new |
| | symbols have been added since the last release). |
| |
|
| | * Push all these changes to main. |
| |
|
| | * Take a branch off main, named "release/pcre2-10.XX-RC1" or |
| | "release/pcre2-10.XX". |
| |
|
| | All releases should come from main. The final release isn't branched off |
| | from the RC branch; the RC branch is a "throwaway" release which can be pruned |
| | from the linear history of the trunk of PCRE2's tree. |
| |
|
| | * In the new branch, remove the "-DEV" suffix from the version number and set |
| | the release date in configure.ac. Update the release date in the ChangeLog and |
| | NEWS files. |
| |
|
| | ``` |
| | vim configure.ac |
| | vim NEWS |
| | vim ChangeLog |
| | git add -u configure.ac NEWS ChangeLog |
| | git commit -m"Update version number and release date" |
| | ``` |
| |
|
| | * Perform updates of the automatically-generated files. |
| |
|
| | ``` |
| | ./autogen.sh && ./configure |
| | rm src/config.h.generic src/pcre2.h.generic |
| | make src/config.h.generic src/pcre2.h.generic |
| | git clean -idx . |
| | git add -u src/config.h.generic src/pcre2.h.generic |
| | git commit -m"Automatic update of .generic files" |
| |
|
| | maint/UpdateRelease.py |
| | maint/UpdateDates.py |
| | maint/UpdateAlways |
| | git add -u |
| | git commit -m"Automatic update of doc files #noupdate" |
| | ``` |
| |
|
| | * Commit the Autoconf files to the branch. This is required so that users can |
| | check out the Git tag, and receive the same contents as the tarball users. |
| |
|
| | ``` |
| | ./autogen.sh |
| | git add -f Makefile.in aclocal.m4 ar-lib compile config.guess config.sub \ |
| | configure depcomp install-sh ltmain.sh m4/libtool.m4 m4/ltoptions.m4 \ |
| | m4/ltsugar.m4 m4/ltversion.m4 m4/lt~obsolete.m4 missing src/config.h.in \ |
| | test-driver |
| | git commit -m"Commit autogen.sh output" |
| | ``` |
| |
|
| | * Now, wait for the CI job to build the tarball. We can't do this locally: we |
| | want to be releasing a tarball which is signed by GitHub. The GitHub signature |
| | says, "Yes, the developer did not tamper with this tarball, we certify that it |
| | was derived solely from the contents of the Git repository at this commit |
| | hash". |
| |
|
| | * Create the tag locally. We can't do this via the GitHub UI: it has no way to |
| | create signed tags (since my GPG key lives on a Yubikey). |
| |
|
| | ``` |
| | git config user.signingkey 'FB63B406!' |
| | git tag -s pcre2-10.XX -m"Release 10.XX" |
| | git tag -v pcre2-10.XX |
| | ``` |
| |
|
| | * Download the tarballs from the CI artifacts. Sign these using the GPG key. |
| |
|
| | ``` |
| | KEYID=FB63B406 |
| | for i in pcre2-10.XX.{zip,tar.gz,tar.bz2}; do |
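| | # Assumed options: write a detached signature ($i.sig), then verify it |
| | gpg --local-user "$KEYID" --detach-sign "$i" |
| | gpg --verify "$i.sig" "$i" |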
| | done |
| | ``` |
| |
|
| | * In the GitHub UI, create a "release" from the tag (which must have been |
| | already pushed). Add the tarballs and GPG signatures. |
| |
|
| | * Announce the release on the mailing list. |
| |
|
| | * Bump the version number on main to the next release, plus -DEV. |
| |
|
| | * After issuing a final release, merge the release tag back to main with: |
| |
|
| | ``` |
| | git merge -s ours release/pcre2-10.XX |
| | ``` |
| |
|
| | Do not do this for -RC releases, which are not included in the linear history |
| | of the PCRE2 development trunk. |
| |
|
| | We want users with forks of PCRE2 to be able to update from one release to |
| | the next by simply doing a `git merge` in their fork. If the release tag is |
| | not merged back to main, then users will see unnecessary Git conflicts when |
| | trying to fast-forward from one release to the next. |
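
The naming conventions used in this checklist can be summarised in a small
sketch. It is only an illustration: the function name and the version number
10.47 are hypothetical, the version is handled as a plain "major.minor-DEV"
string rather than read from configure.ac, and it assumes that the next release
simply increments the minor number.

```python
def release_names(dev_version, rc=None):
    """Derive release names from a development version such as '10.47-DEV'."""
    suffix = "-DEV"
    if not dev_version.endswith(suffix):
        raise ValueError("expected a version ending in -DEV")
    version = dev_version[:-len(suffix)]      # remove the -DEV suffix
    major, minor = version.split(".")
    rc_part = "" if rc is None else f"-RC{rc}"
    return {
        "version": version,
        "branch": f"release/pcre2-{version}{rc_part}",
        # The checklist only shows a tag name for final releases, so this
        # sketch does not guess one for release candidates.
        "tag": f"pcre2-{version}" if rc is None else None,
        # Assumption: the next release increments the minor number.
        "next_dev": f"{major}.{int(minor) + 1}-DEV",
    }


if __name__ == "__main__":
    print(release_names("10.47-DEV"))
    print(release_names("10.47-DEV", rc=1)["branch"])
```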
| |
|
| |
|
| | Updating version info for libtool |
| | ================================= |
| |
|
| | This set of rules for updating library version information came from a web page |
| | whose URL I have forgotten. The version information consists of three parts: |
| | (current, revision, age). |
| |
|
| | 1. Start with version information of 0:0:0 for each libtool library. |
| |
|
| | 2. Update the version information only immediately before a public release of |
| | your software. More frequent updates are unnecessary, and only guarantee |
| | that the current interface number gets larger faster. |
| |
|
| | 3. If the library source code has changed at all since the last update, then |
| | increment revision; c:r:a becomes c:r+1:a. |
| |
|
| | 4. If any interfaces have been added, removed, or changed since the last |
| | update, increment current, and set revision to 0. |
| |
|
| | 5. If any interfaces have been added since the last public release, then |
| | increment age. |
| |
|
| | 6. If any interfaces have been removed or changed since the last public |
| | release, then set age to 0. |
| |
|
| | The following explanation may help in understanding the above rules a bit |
| | better. Consider that there are three possible kinds of reaction from users to |
| | changes in a shared library: |
| |
|
| | 1. Programs using the previous version may use the new version as a drop-in |
| | replacement, and programs using the new version can also work with the |
| | previous one. In other words, no recompiling nor relinking is needed. In |
| | this case, increment revision only, don't touch current or age. |
| |
|
| | 2. Programs using the previous version may use the new version as a drop-in |
| | replacement, but programs using the new version may use APIs not present in |
| | the previous one. In other words, a program linking against the new version |
| | may fail if linked against the old version at run time. In this case, set |
| | revision to 0, increment current and age. |
| |
|
| | 3. Programs may need to be changed, recompiled, relinked in order to use the |
| | new version. Increment current, set revision and age to 0. |
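
The numbered rules above can be expressed as a short function. This is a
sketch for checking one's reasoning, not part of the PCRE2 build: the function
and parameter names are hypothetical, and it assumes that any interface change
also counts as a source change for rule 3.

```python
def next_version_info(current, revision, age, *, source_changed,
                      interfaces_added, interfaces_removed_or_changed):
    """Apply rules 3 to 6 in order; return the new (current, revision, age)."""
    interfaces_touched = interfaces_added or interfaces_removed_or_changed
    if source_changed or interfaces_touched:
        revision += 1                  # rule 3: c:r:a becomes c:r+1:a
    if interfaces_touched:
        current += 1                   # rule 4: increment current,
        revision = 0                   #         and set revision to 0
    if interfaces_added:
        age += 1                       # rule 5
    if interfaces_removed_or_changed:
        age = 0                        # rule 6
    return current, revision, age


if __name__ == "__main__":
    # The three kinds of change described above, starting from 3:2:1.
    print(next_version_info(3, 2, 1, source_changed=True,
                            interfaces_added=False,
                            interfaces_removed_or_changed=False))  # (3, 3, 1)
    print(next_version_info(3, 2, 1, source_changed=True,
                            interfaces_added=True,
                            interfaces_removed_or_changed=False))  # (4, 0, 2)
    print(next_version_info(3, 2, 1, source_changed=True,
                            interfaces_added=True,
                            interfaces_removed_or_changed=True))   # (4, 0, 0)
```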
| |
|
| |
|
| | Long-term ideas (wish list) |
| | =========================== |
| |
|
| | This section records a list of ideas so that they do not get forgotten. They |
| | vary enormously in their usefulness and potential for implementation. Some are |
| | very sensible; some are rather wacky. Some have been on this list for many |
| | years. |
| |
|
| | . Optimization |
| |
|
| | There are always ideas for new optimizations so as to speed up pattern |
| | matching. Most of them try to save work by recognizing a non-match without |
| | having to scan all the possibilities. These are some that I've recorded: |
| |
|
| | * /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very |
| | slow, though Perl is fast. Can we speed up somehow? Convert to {0,125}? |
| | OTOH, this is pathological - the user could easily fix it. |
| |
|
| | * Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems |
| | to have little effect, and maybe makes things worse. |
| |
|
| | * "Ends with literal string" - note that a single character doesn't gain much |
| | over the existing "required code unit" feature that just remembers one code |
| | unit. |
| |
|
| | * Remember an initial string rather than just 1 code unit. |
| |
|
| | * A required code unit from alternatives - not just the last unit, but an |
| | earlier one if common to all alternatives. |
| |
|
| | * Friedl contains other ideas. |
| |
|
| | * The code does not set initial code unit flags for Unicode property types |
| | such as \p; I don't know how much benefit there would be for, for example, |
| | setting the bits for 0-9 and all values >= xC0 (in 8-bit mode) when a |
| | pattern starts with \p{N}. |
| |
|
| | . Perl and PCRE2 sometimes differ in the settings of capturing subpatterns |
| | inside repeats. One example of the difference is the matching of |
| | /(main(O)?)+/ against mainOmain, where PCRE2 leaves $2 set. In Perl, it's |
| | unset. Changing this in PCRE2 will be very hard because I think it needs much |
| | more state to be remembered. |
| |
|
| | . A feature to suspend a match via a callout was once requested. |
| |
|
| | . An option to convert results into character offsets and character lengths. |
| |
|
| | . A (non-Unix) user wanted pcregrep options to (a) list a file name just once, |
| | preceded by a blank line, instead of adding it to every matched line, and (b) |
| | support |
| |
|
| | . Define a union for the results from pcre2_pattern_info(). |
| |
|
| | . Provide a "random access to the subject" facility so that the way in which it |
| | is stored is independent of PCRE2. For efficiency, it probably isn't possible |
| | to switch this dynamically. It would have to be specified when PCRE2 was |
| | compiled. PCRE2 would then call a function every time it wanted a character. |
| |
|
| | . pcre2grep: add -rs for a sorted recurse. Having to store file names and sort |
| | them will of course slow it down. |
| |
|
| | . Someone suggested |
| | never wanted. This seems rather marginal. |
| |
|
| | . A user suggested a parameter to limit the length of string matched, for |
| | example if the parameter is N, the current match should fail if the matched |
| | substring exceeds N. This could apply to both match functions. The value |
| | could be a new field in the match context. Compare the offset_limit feature, |
| | which limits where a match must start. |
| |
|
| | . Write a function that generates random matching strings for a compiled |
| | pattern. |
| |
|
| | . pcre2grep: an option to specify the output line separator, either as a string |
| | or select from a fixed list. This is not straightforward, because at the |
| | moment it outputs whatever is in the input file. |
| |
|
| | . Improve the code for duplicate checking in pcre2_dfa_match(). An incomplete, |
| | non-thread-safe patch showed that this can help performance for patterns |
| | where there are many alternatives. However, a simple thread-safe |
| | implementation that I tried made things worse in many simple cases, so this |
| | is not an obviously good thing. |
| |
|
| | . PCRE2 cannot at present distinguish between subpatterns with different names, |
| | but the same number (created by the use of ?|). In order to do so, a way of |
| | remembering *which* subpattern numbered n matched is needed. |
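
As an illustration of the character-offset idea listed above, the sketch below
converts byte offsets in a UTF-8 subject into character offsets. It is a
standalone Python illustration, not a proposed PCRE2 API: the function name is
hypothetical, and it assumes a valid UTF-8 subject with offsets that fall on
character boundaries.

```python
def byte_to_char_offsets(subject: bytes, byte_offsets):
    """Map byte offsets in a UTF-8 subject to character offsets."""
    # A character starts at every byte that is not a continuation byte
    # (continuation bytes have the bit pattern 10xxxxxx).
    starts_before = [0]
    for byte in subject:
        is_start = (byte & 0xC0) != 0x80
        starts_before.append(starts_before[-1] + is_start)
    # starts_before[i] counts the characters that begin before byte i.
    return [starts_before[offset] for offset in byte_offsets]


if __name__ == "__main__":
    subject = "a\u00e9\u20acb".encode("utf-8")   # 1-, 2-, 3- and 1-byte characters
    start, end = byte_to_char_offsets(subject, [1, 6])
    print(start, end - start)                    # character offset and length
```

Building the prefix counts once keeps each later conversion a simple index
operation, which matters when many offsets come from the same subject.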