Conversation
…Java 17 - Change skipTests from true to false so mvn test actually runs - Update maven-compiler-plugin source/target from 1.8 to 17 (matches runtime) - Add missing compile dependencies: jmzml 1.7.11, fastutil 8.5.12, slf4j-api 1.7.36, logback-classic 1.2.12, commons-io 2.15.1 (master code references these classes but they were not declared) - @ignore TestMzML test that requires Windows-specific DMS files Result: 120 tests run, 53 active, 67 skipped, 0 failures Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…shold In MZIdentMLGen.addSpectrumIdentificationResults(), change `break` to `continue` when a match has DeNovoScore below the minimum threshold. The `break` was incorrectly stopping emission of all subsequent matches for that spectrum, silently dropping valid PSMs from the mzid output. Also add null safety check for spectrum index lookup — if a spectrum index is not found in the spectrum file, log a warning and skip instead of throwing a NullPointerException. Add TestMZIdentMLGen with two integration tests: - testMzidScoreCompleteness: runs MSGF+ search, verifies every SII has all 4 score CVParams (RawScore, DeNovoScore, SpecEValue, EValue) - testMzidStructuralValidity: verifies output mzid has required mzIdentML structure elements Closes MSGFPlus#157 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add new -msLevel CLI parameter to filter spectra by MS level. Accepts single value (e.g., -msLevel 2) or comma-separated range (e.g., -msLevel 2,3). Default is 2 (MS2 only). Changes: - ParamManager: add MS_LEVEL enum and registration - IntRangeParameter: enable single-value parsing, fix typo - SearchParams: add minMSLevel/maxMSLevel fields - SpecKey: filter spectra by MS level in getSpecKeyList() - SpectraAccessor: add setMSLevelRange(), wire to parsers - MzMLAdapter/MzXMLSpectraMap: fix maxMSLevel to be inclusive - MSGFPlus/MSGFDB/MSGFDBLib: wire MS level parameters - pom.xml: remove fastutil shade filter (jmzml 1.7.11 needs full fastutil) Tests: TestIntRangeParameter (9 tests), TestMSLevelFiltering (6 tests) Benchmark (TMT 1.1GB, TDA): Baseline: 1245s, 6654 PSMs@1%FDR -msLevel 2: 957s (-23%), 6936 PSMs@1%FDR (+4.2%) Closes MSGFPlus#159 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(MSGFPlus#159): add -msLevel parameter for MS level filtering
fix(MSGFPlus#157): preserve PSM scores when DeNovoScore is below threshold
fix: enable test suite and fix broken build dependencies
Remove standalone scripts, legacy tools, and unused classes that are not referenced by the core MSGF+ search pipeline, reducing codebase by ~22,000 lines. Deleted entire packages: - ims/ (9 files) — legacy IMS utilities - ipa/ (5 files) — unused isotope pattern analysis - msgf2d/ (8 files) — abandoned 2D scoring experiment - msdictionary/ (7 files) — unused genome dictionary tool - mstag/ (3 files) — unused sequence tagging - scripts/ (6 files) — standalone CLI utilities - msutil/test/ (3 files) — misplaced test classes - msgf/test/ (2 files) — legacy test stubs - msgf/analysis/ (1 file) — unused ROC generator Cleaned mixed packages: - misc/: removed 59 standalone scripts, kept 5 core utilities - msgf/: removed 6 unused graph/scoring classes - msutil/: removed 9 unused filter/annotation classes - msdbsearch/: removed 4 standalone DB tools - parser/: removed 9 legacy format parsers (InsPecT, Mascot, etc.) - ui/: removed 6 legacy entry points (MSGF, MSGFLib, etc.) - mzid/: removed 1 unused adapter stub - msscorer/: removed 1 unused stats class - suffixarray/: removed 1 unused mass array class Also removed dead test methods and cleaned dangling imports. Tests: 119 run, 0 failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite README.md: - Full parameter reference tables covering all 30+ flags organized by category (core search, fragmentation, enzyme, filtering, etc.) - Quick start examples for basic and TMT searches - Modification file format documentation with examples - Build-from-source instructions - Updated requirements to Java 17+ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add parameter docs to README and CI/CD workflow
Remove dead code: 150 unused classes, -22K lines
Write search results directly to TSV from in-memory objects, bypassing mzIdentML serialization. Output is column-identical to MzIDToTsv (verified by diff on test.mgf search). This avoids generating large .mzid files when only TSV is needed downstream (e.g. OpenMS MSGFPlusAdapter, Percolator). - New DirectTSVWriter class with same score/protein/mod logic as MZIdentMLGen but streaming tab-delimited output - New -outputFormat parameter: 0=mzid (default), 1=tsv, 2=both - Includes fixed + variable mods, MGF Title column, decoy filtering - Backwards compatible: default remains mzid Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When -addFeatures 1 is used with -outputFormat tsv, the TSV now includes all PSMFeatureFinder columns needed for Percolator: ExplainedIonCurrentRatio, NTermIonCurrentRatio, CTermIonCurrentRatio, MS2IonCurrent, MS1IonCurrent, IsolationWindowEfficiency, NumMatchedMainIons, and all error statistics (MeanError/StdevError for All and Top7, both absolute and relative). These features were previously only available as UserParams in mzid and were not extracted by OpenMS's addMSGFFeatures() — now they are directly accessible as TSV columns. The peptide modification format (M+15.995) is already compatible with OpenMS MSGFPlusAdapter's modifySequence_() converter which transforms it to bracket notation M[+15.995] for AASequence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the jmzml JAXB-based MzMLUnmarshaller with a lightweight StAX streaming parser that extracts only the 11 fields MSGF+ needs. The new parser builds a spectrum index in a single pass, then preloads all spectra into memory on first random access, eliminating repeated XML parsing during the search phase. Benchmark (TMT 1.1GB mzML, target-decoy, 4 threads): - Wall time: 957s -> 853s (-10.9%) - PSMs at 1% FDR: 6,936 (unchanged) - Score completeness: 100% (unchanged) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port - Remove jmzml (JAXB-based mzML parser) dependency from pom.xml - Delete old jmzml-dependent classes: MzMLAdapter, MzMLSpectraMap, MzMLSpectraIterator, SpectrumConverter - Add referenceableParamGroupRef resolution to StaxMzMLParser: builds a map of param groups during index pass, resolves refs during spectrum parsing (critical for files that define polarity, MS level, etc. in referenceable groups) - Move turnOffLogs() utility to StaxMzMLParser, update all callers - Keep fastutil dependency (needed by jmzidml at runtime) JAR size reduced from 39.5MB to 38MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jmzReader (uk.ac.ebi.pride.tools:jmzreader:2.0.6) had zero imports anywhere in the codebase — a dead dependency from earlier development. All spectrum file format parsing uses custom implementations: mzML (StaxMzMLParser), mzXML (embedded jrap/stax), MGF/MS2/PKL (custom parsers). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove mzXML file format support entirely: - Delete embedded jrap/stax library (20 files, ~5,800 lines) - Delete MzXMLSpectraMap, MzXMLSpectraIterator, MzXMLToMgfConverter - Delete MzXMLToMgf utility and mzXML test resources (38MB) - Remove MZXML from SpecFileFormat enum, SpectraAccessor, ParamManager - Update misc/scripts/ui classes to remove mzXML code paths mzXML is a legacy format superseded by mzML. Users with mzXML files can convert to mzML using msconvert (ProteoWizard). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- StaxMzMLParser: use ConcurrentHashMap for thread-safe spectrum cache, fix class-level doc (preload-all, not bounded LRU), check index before preloading, propagate exceptions instead of returning null - StaxMzMLSpectraIterator: throw NoSuchElementException when exhausted - SpectraAccessor: throw exception instead of System.exit(-1), validate specFormat is non-null in constructor - SelectSpectra: update stale .mzXML reference to .mzML - pom.xml: fix duplicate <manifest>, remove stale comments, note fastutil is required by jmzidentml at runtime Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Write search results directly to TSV from in-memory objects, bypassing mzIdentML serialization. Output is column-identical to MzIDToTsv (verified by diff on test.mgf search). This avoids generating large .mzid files when only TSV is needed downstream (e.g. OpenMS MSGFPlusAdapter, Percolator). - New DirectTSVWriter class with same score/protein/mod logic as MZIdentMLGen but streaming tab-delimited output - New -outputFormat parameter: 0=mzid (default), 1=tsv, 2=both - Includes fixed + variable mods, MGF Title column, decoy filtering - Backwards compatible: default remains mzid Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When -addFeatures 1 is used with -outputFormat tsv, the TSV now includes all PSMFeatureFinder columns needed for Percolator: ExplainedIonCurrentRatio, NTermIonCurrentRatio, CTermIonCurrentRatio, MS2IonCurrent, MS1IonCurrent, IsolationWindowEfficiency, NumMatchedMainIons, and all error statistics (MeanError/StdevError for All and Top7, both absolute and relative). These features were previously only available as UserParams in mzid and were not extracted by OpenMS's addMSGFFeatures() — now they are directly accessible as TSV columns. The peptide modification format (M+15.995) is already compatible with OpenMS MSGFPlusAdapter's modifySequence_() converter which transforms it to bracket notation M[+15.995] for AASequence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ConvertToMgf-based tests (class removed in PR #7) with StaxMzMLParser and SpectraAccessor mzML parsing tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: native TSV output — bypass mzIdentML for OpenMS/Percolator pipelines
perf: replace jmzml JAXB parser with StAX-based mzML reader
* chore: add CI/release packaging and benchmark scaffolding Split infra and repository maintenance updates into a dedicated reviewable change set, including workflow automation, Docker packaging, benchmark scripts/docs, and project documentation updates. Exclude large local benchmark artifacts and keep this PR focused on non-hot-path code organization and release hygiene. Made-with: Cursor * chore: keep benchmark folder local-only Remove benchmark scripts/docs from this branch and ignore the entire benchmark directory so local benchmarking assets do not appear in review PRs. Made-with: Cursor * docs: keep single canonical primitives plan Fold memory-reduction guidance into the balanced primitives plan and remove the old duplicate plan file so review and maintenance use one canonical document. Made-with: Cursor * chore: narrow PR1 plans to scope-only docs Remove unrelated strategy and optimization plan documents from PR1 so this branch stays focused on infra/packaging cleanup. Keep only the plans index file in this PR. Made-with: Cursor * chore: remove legacy ZippedReleases folder Delete the obsolete Windows release helper scripts and reference files under ZippedReleases from the repository. Made-with: Cursor * chore: remove legacy extlib dependency jar Delete the obsolete jrap/stax legacy jar under extlib as part of repository cleanup. Made-with: Cursor * fix: address copilot review feedback for PR11 Align docs with actual supported legacy formats, update release pipeline to build from tag version with tests, and fix Docker build JDK requirement. Made-with: Cursor * chore: minor packaging/docs hygiene for PR1 Normalize ignore files, shrink Docker build context, align agent README with dev/CI, and clarify release workflow step naming. Made-with: Cursor * docs: trim examples folder to small referenced artifacts Remove duplicate Tryp_Pig_Bov DB/index copies (tests use src/test/resources), drop large unlinked Excel/PNG teaching files, and add docs/examples/README.md so the directory purpose is obvious. Link the index from the main README. Made-with: Cursor * chore: remove IntelliJ IDEA tips screenshots from docs Made-with: Cursor * docs: replace legacy HTML manuals with Markdown Convert docs/*.html to GitHub-flavored Markdown (pandoc), fix internal links, add docs/README.md as the documentation index, and remove unused style.css. Made-with: Cursor * docs: strip leftover HTML span wrappers from converted Markdown Made-with: Cursor
Add a workflow-dispatch benchmark pipeline on a fixed self-hosted runner profile, with public-data download, metrics emission, and baseline TSV comparison under benchmark/ci/PXD001819 for future dataset expansion. Made-with: Cursor
Use uppercase PXD001819 naming in workflow-visible labels/artifacts and update README to state mzXML is not available in this fork. Made-with: Cursor
Made-with: Cursor
- run_ci.sh: count only opening <SpectrumIdentificationItem> tags for sii_count (prior substring match double-counted closing tags and picked up SpectrumIdentificationItemRef) - run_ci.sh: always emit peak_rss_kb and cpu_percent (NA when GNU time does not expose them) so metrics file format is consistent - compare_metrics.py: support an `optional` column; optional missing/NA metrics warn instead of failing CI - baseline.tsv: add optional column, mark peak_rss_kb optional, fix ubuntu-latest note to reference the self-hosted runner, widen sii_count floor to match the de-duplicated count - README pointers: update stale references to a non-existent benchmark/run_pxd001819_benchmark.sh script - benchmark/README.md: describe the actual committed CI scaffold instead of an uncommitted local harness layout Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- extract_metrics.py: stream-parse mzIdentML with ElementTree.iterparse so SII counting and PSM 1% FDR counting no longer rely on line-shaped regex matches over XML - run_ci.sh: use a bash array for SEARCH_ARGS (safe against future flags with spaces), atomic .part downloads, validate cached gzip, default MSGFPLUS_THREADS to 8 to match the workflow, drop the always-zero java_exit metric, and emit integer wall_time_sec - workflow: pin Python via actions/setup-python@v5 so self-hosted runners have a known 3.11 interpreter for the helper scripts - compare_metrics.py: add test_compare_metrics.py covering in-range pass, out-of-range fail, missing required/optional, NA, non-numeric, and empty-range rows (7 tests, all passing) - .gitignore: drop redundant benchmark/** patterns (already covered by benchmark/* + ci/ allowlist); add __pycache__/ and *.pyc - docs: describe new helper and test scripts in both READMEs
fix(benchmark): harden PXD001819 scaffold per review feedback
…ull sink Three small follow-ups to the per-task-buffer commit that pay back the allocation churn the buffer change shifted around: - ConcurrentMSGFPlus: cache a static NULL_PRINT_STREAM instead of allocating one per RunMSGFPlus.run() invocation. Same NullOutputStream sink either way; we just stop newing it 12+ times per search. - ConcurrentMSGFPlus.RunMSGFPlus: add drainResultsTo(dest) and getResultCount(). Drain transfers the task-local buffer's contents to the destination AND clears the local list, so the per-task heap is released immediately after merge instead of dangling on the task reference until the next major GC. - MSGFPlus.runMSGFPlus: pre-size the merged ArrayList based on sum(t.getResultCount()) before draining, avoiding the resize-and-copy cycle that addAll otherwise triggers as it doubles the backing array. Then submittedTasks.clear() drops the strong refs to all 12 task instances right after the wall summary, so their per-task heap (specScanner, scanner, aaMass arrays, etc.) is collectible before the FDR / write phase. TestConcurrentMSGFPlus: new drainsTaskLocalResultsIntoCallerBuffer test pins the new drain semantics — destination receives all entries, source is empty afterwards. Astral 3-arm parity (clean re-run, post-Rancher-shutdown system): armA wall=848.8s targets=89479 decoys=46792 armB wall=752.2s targets=89479 decoys=46792 (-11% vs A) armC wall=798.2s targets=89360 decoys=46913 (-6% vs A) Percolator at 1% FDR: armB 35818 / armC 35767 — both bit-identical to dev. All 8 parity numbers match exactly. Scoped tests: 59/59 green.
…list MSGFDB.java mirrored the same pattern PR #25 fixed in MSGFPlus.java: each task's ScoredSpectraMap was constructed with a Collections.synchronizedList(specKeyList.subList(...)) wrapper. The sublist is a read-only view of the parent specKeyList, accessed only from the single worker thread that owns the task's ScoredSpectraMap; the synchronization wrapper served no purpose. This is the legacy MS-GFDB CLI path, kept consistent with the matching cleanup landed in MSGFPlus. Deliberately NOT changed in this commit: the global resultList at MSGFDB.java:296 is still Collections.synchronizedList(...). Unlike MSGFPlus, the MSGFDB path still has N worker threads concurrently calling scanner.addDBSearchResults(sharedList, ...). The previous PR #25 commit (957f6c2) removed the `synchronized` modifier from DBScanner.addDBSearchResults; the synchronizedList wrapper is now what keeps individual gen.add(...) calls atomic. Removing it without also doing the per-task-buffer refactor that MSGFPlus got would introduce a real data race on the gen list. The full per-task-buffer refactor for MSGFDB (matching what MSGFPlus has via getResults() + drainResultsTo()) is a follow-up PR. MSGFDB doesn't have an Astral-style parity benchmark to validate against, so it deserves its own focused validation cycle rather than piggy-backing here.
… change) Cleanup pass over the perf/search-sync-cleanup branch. No behavior change; tests still pass; net -43 LOC. ConcurrentMSGFPlus.java - TaskWallStats: 14-line final class → 2-line Java 17 record. Same immutable shape, less ceremony. - Drop volatile on the wallStats field — the only reader is the main thread after executor.awaitTermination, which establishes happens- before per JLS §17.4.5. The volatile-plus-comment was self- contradictory. - Trim restate-the-code Javadocs from getResults / getResultCount / drainResultsTo / getWallStats and the resultList field. Replaced with a one-line WHY comment on the wallStats field about the happens-before rationale. - Use the short-form ArrayList import instead of the fully-qualified java.util.ArrayList<>(). MSGFPlus.java - Promote "msgfplus.useForkJoin" to USE_FORK_JOIN_PROPERTY constant alongside the existing TASKS_PER_THREAD_PROPERTY pattern. Move both constants near DISABLE_THREADING at the top of the class to match CompactSuffixArray.SA_BUILD_THREADS_PROPERTY convention. - Extract shutdownPoolNow(executor, fjp) helper. Three identical catch-block ternaries collapse to one call each. - Add fjp.shutdownNow() on the FJP early-return-on-Future-failure path so any still-running tasks get interrupted instead of running to completion in the background. - Drop the T1/T2/T3 planning-doc tags from comments — those are internal task-numbers from the work plan that don't belong in source per the project's CLAUDE.md "no comments referencing the task / caller" rule. - Trim "default thread pool", "FJP-mode pool", "retain task references for the tail summary", and "T1: tail-imbalance summary" labels that restated the code. Kept the JLS §17.4.5 happens-before comment before the drain loop (genuine non-obvious WHY) and the deferred- ScoredSpectraMap-construction comment. Skipped from the review: - SysProps.intOr extraction across resolveTasksPerThread + CompactSuffixArray.resolveSortThreads — only two copies; per CLAUDE.md "three similar lines is better than premature abstraction." - Median helper extraction — MassErrorStat.median takes List<Float> not List<Long>; wrapping is more code than the inline version. - Per-task ArrayList pre-size from subListSize — would re-add a constructor parameter that the per-task-buffer commit deliberately removed. Scoped tests green: 66/66.
The sysprop introduced in commit c3afe11 (T2) duplicated functionality the existing CLI flag -tasks -N already provides. Both produce numTasks = numThreads * N when no -tasks is specified explicitly. Having two parallel knobs for the same thing is clutter, not feature. Removed: TASKS_PER_THREAD_PROPERTY constant, resolveTasksPerThread() helper. The default multiplier survives as DEFAULT_TASKS_PER_THREAD constant, used inline. Users who want a different multiplier should use -tasks -N. The ParamManager already documents this (negative argument means "tasks per thread"). The -Dmsgfplus.useForkJoin sysprop is kept — it's not a tuning knob, it's an A/B measurement toggle for an alternative executor we're not committing to. CLI exposure would imply users should reach for it; sysprop is the right level for an opt-in evaluation flag. Scoped tests green: 70/70.
Move classes into clearer top-level packages so the CLI entry points, write-side outputs, and parser support are colocated: - ui.MSGFPlus, ui.MSGFDB -> cli.* - mzid.DirectPinWriter, DirectTSVWriter, Unimod, UnimodComposition -> output.* - net.pempek.unicode.UnicodeBOMInputStream -> parser.UnicodeBOMInputStream - mslibsearch.ProcessedSpectrum deleted (no remaining refs) Update the executable jar wiring to follow the new MSGFPlus location: - META-INF/MANIFEST.MF Main-Class - pom.xml mainClass entries (assembly + shade plugin) Pure rename + import-fix; no behavior change. clean compile + scoped tests pass (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc, TestRunManifestWriter -- 40 tests, 0 failures, 0 errors).
- search-sync-cleanup.md: ship record for PR #25 (search-path sync cleanup + per-task result buffers). - parameter-modernization.md: proposed plan to migrate the custom edu.ucsd.msjava.params hierarchy to picocli with a thin MSGFPlus-specific compat layer for legacy aliases, config files, and repeated mod/CustomAA entries. Status: proposed. The parameter-modernization plan is documentation only on this branch -- it explicitly recommends a separate refactor branch for implementation, not bundling with this performance cleanup.
Picocli is the foundation for the parameter-modernization plan (see .claude/plans/parameter-modernization.md): replace the custom edu.ucsd.msjava.params hierarchy with declarative @option flags, auto-generated help, and a thin compat layer for the legacy config-file format. The inventory artifact enumerates every MSGFPlus CLI flag (34 total: 27 visible, 7 hidden) plus the parsing semantics each one currently relies on (asymmetric tolerance, dynamic-enum registries, range inclusivity flips, trailing-! number quirk, ~13 config-file aliases, etc.). It is the foundation for the typed cli.MSGFPlusOptions class in the next commit. No code changes yet -- dep + docs only.
Declarative replacement for the imperative addParameter() calls in ParamManager.addMSGFPlusParams(). All 34 MSGFPlus flags (27 visible + 7 hidden) are declared as picocli @option fields with typed Java fields where the Java type is natural (File, Integer, Double, String). Complex domain-typed flags (precursor tolerance, isotope-error range, ms-level range, output format, precursor-cal mode, dynamic enums for fragmentation/instrument/enzyme/protocol) are captured as raw String/Integer for now; the Phase 1c adapter will round-trip them through the existing params.Parameter#parse(String) hierarchy to preserve current behavior. Phase 3 collapses that round-trip once the old hierarchy is deleted. No wiring change yet -- MSGFPlus.main() still uses ParamManager.parseParams(). Adapter and main() switch land in follow-up commits. Compile passes.
Phase 1 of parameter-modernization. All non-config-file MSGFPlus
invocations now flow through:
argv -> picocli (MSGFPlusOptions) -> MSGFPlusOptionsAdapter -> ParamManager
instead of imperative ParamManager.parseParams. -conf still falls
through to the legacy path; Phase 2 ports the config-file reader.
The adapter round-trips each typed field through the existing
params.Parameter#parse(String) hierarchy so the downstream
SearchParams.parse(ParamManager) build path is unchanged. Phase 3
collapses that round-trip when the old hierarchy is deleted.
New equivalence test (MSGFPlusOptionsAdapterTest) asserts that the
picocli path populates the same ParamManager state as legacy
parseParams for a representative CLI corpus, including asymmetric
tolerance ("-t 0.5Da,2.5Da"). Picocli's required-flag enforcement
also verified.
Scoped test sweep (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc,
TestRunManifestWriter, SearchParamsTest, TestPercolator,
MSGFPlusOptionsAdapterTest): 45 tests, 0 failures, 0 errors.
Drop the special-case "-conf in argv -> legacy parseParams" fallback from MSGFPlus.parseArgs. The picocli path now handles every MSGFPlus invocation, including config-file inputs. Two changes: 1. -s and -d are no longer picocli-required. They were already declared isOptional=true on the legacy ParamManager side (FileParameter.setAsOptional in addSpecFileParam(true) / addDBFileParam(true)), so requiring them at the picocli level was stricter than legacy. Without this change, a valid "-conf params.txt" invocation (where SpectrumFile / DatabaseFile live in the config file) would be rejected by picocli before the adapter could populate ParamManager. 2. The config-file overlay is unchanged. SearchParams.parseConfigParamFile already runs after CLI parsing and only fills in !commandLineParam.isValueAssigned() entries (SearchParams.java:588), so CLI flags continue to override config-file values transparently. The flow is now: argv -> picocli -> MSGFPlusOptions -> adapter -> ParamManager -> SearchParams.parse (config-file overlay happens here). No legacy parseParams call remains in the MSGFPlus entry point. Test: picocliPathRejectsMissingRequiredFlags replaced with picocliPathAcceptsConfigOnlyInvocation (picocli must accept "-conf X" without -s/-d, and the adapter must record the config file path so SearchParams.parse can read it). Scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc, TestRunManifestWriter, SearchParamsTest, TestPercolator, MSGFPlusOptionsAdapterTest): 45 tests, 0 failures, 0 errors.
Net -874 / +2 lines. Removed: - src/main/java/edu/ucsd/msjava/cli/MSGFDB.java -- deprecated since 2012 (version 8091, "08/06/2012"), no external references in src, docs, pom.xml, or manifest. The class was @deprecated and only referenced its own legacy entry point. - ParamManager.addMSGFDBParams() -- only caller was the deleted MSGFDB.main. - ParamManager.addMSGFParams() and addMSGFLibParams() -- no entry points existed at all (cli.MSGF, cli.MSGFLib never created); both were pure dead code accumulating since the original mzid pipeline removal. - ParamManager.addOutputFileParam() / addDBFileParam(String, String, boolean) -- private helpers used only by the deleted methods. - ParamNameEnum.OUTPUT_FILE -- the "-o for MSGF and MS-GFDB" variant; MSGFPlus uses SEARCH_OUTPUT_FILE (same key "o", different label). ParamManager.getOutputFileParam() now reads SEARCH_OUTPUT_FILE directly so SearchParams.parse continues to resolve the user's -o. - ParamNameEnum.C13, NNET, UNIFORM_AA_PROBABILITY -- MSGFDB-only. - docs/ms-gfdb.md -- documentation for the removed tool. - TestMinSpectraPerThread.msgfdbEntryPointAlsoRegistersTheParam -- test for the removed addMSGFDBParams. The remaining 3 tests in TestMinSpectraPerThread continue to verify MSGFPlus's MIN_SPECTRA_PER_THREAD param. Scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc, TestRunManifestWriter, SearchParamsTest, TestPercolator, MSGFPlusOptionsAdapterTest, TestMinSpectraPerThread): 48 tests, 0 failures, 0 errors.
Replace String/Integer holders for the four range-shaped flags with
small typed value classes carrying their own parse/converter logic:
- cli/PrecursorTolerance (Tolerance left, Tolerance right) -- handles
both symmetric ("20ppm") and asymmetric ("0.5Da,2.5Da") forms,
validates unit-match and non-negative, exposed via picocli converter.
- cli/IntRange (int min, int max) -- inclusive range, accepts "min,max"
or single "n" (interpreted as "n,n"), used by -ti, -msLevel, -index.
MSGFPlusOptions fields now use these typed objects directly:
precursorTolerance: String -> PrecursorTolerance
isotopeErrorRange: String -> IntRange
msLevel: String -> IntRange
specIndexRange: String -> IntRange
The adapter still round-trips them through Parameter.parse(String)
via toString() for now -- that round-trip goes away in Phase 4c
step 3 when SearchParams.parse reads from MSGFPlusOptions directly
and the legacy params/ hierarchy is deleted.
Existing equivalence test still passes (asymmetric "0.5Da,2.5Da"
case included). Build green.
SearchParams now reads directly from MSGFPlusOptions; the legacy ParamManager + Parameter.parse round-trip is gone. What changed: 1. MSGFPlusOptions grew effective*() resolvers for every flag (with defaults: 6/40 lengths, 2/3 charges, proton mass, etc.), typed lookups for ActivationMethod/InstrumentType/Enzyme/Protocol from the existing registries, and an applyConfigFile(File) overlay that reproduces the legacy alias rewrites (IsotopeError -> IsotopeErrorRange, FragmentationMethod -> FragmentationMethodID, etc., 14 total) plus collects DynamicMod/StaticMod/CustomAA into ordered lists. CLI fields take precedence -- config-file values only fill in null fields. 2. SearchParams.parse(MSGFPlusOptions) replaces parse(ParamManager). The 35+ paramManager.getXxx() call sites translate one-for-one to opts.effective*() / opts.* reads. parseConfigParamFile is gone (its work is now opts.applyConfigFile). Spectrum-format "isSupported" check is now a small whitelist instead of routing through FileParameter. 3. AminoAcidSet.getAminoAcidSetFromModFile and the new getAminoAcidSetFromModEntries (replacing getAminoAcidSetFromList) take MSGFPlusOptions and call opts.setMaxNumModsFromMetadata when the loaded mod metadata declares a different value, mirroring the former paramManager.setMaxNumMods bidirectional handshake. 4. MSGFPlus.main no longer creates a ParamManager. argv -> picocli -> MSGFPlusOptions -> SearchParams.parse(opts), with picocli's own usage()/printToolInfo()/printJVMInfo() replacing the ParamManager equivalents. 5. Adapter (MSGFPlusOptionsAdapter) deleted -- the round-trip via Parameter.parse(String) is no longer needed. 6. Tests migrated to construct MSGFPlusOptions directly: SearchParamsTest, TestRunManifestWriter, TestDirectPinWriter, TestPrecursorCalScaffolding, TestPrecursorCalIntegration, TestPercolator, TestMSUtils, TestIPRG, TestCollaboration, TestCandidatePeptideGrid(ConsideringMetCleavage), TestSA, TestMinSpectraPerThread. MSGFPlusOptionsAdapterTest deleted (adapter is gone). TestIntRangeParameter deleted (params/ IntRangeParameter is the next thing to go in Phase 3). Validation: scoped sweep -- 73 tests, 0 failures, 0 errors, 5 skipped. The legacy params/ hierarchy still compiles but has no live callers on the MS-GF+ path; Phase 3 deletes it next.
After Phase 4c routed SearchParams + AminoAcidSet through MSGFPlusOptions directly, the entire params/ package became unreferenced on the live MSGFPlus path. Deleting: - ParamManager (1,059 LOC after Phase 4a/4b cleanup) -- replaced by cli.MSGFPlusOptions + its effective*() resolvers and applyConfigFile(). - Parameter, NumberParameter, RangeParameter (abstract bases) - IntParameter, FloatParameter, DoubleParameter, IntRangeParameter, FloatRangeParameter (typed leaf classes) - StringParameter, FileParameter, FileListParameter (file/string types) - ToleranceParameter (replaced by cli.PrecursorTolerance) - EnumParameter, ObjectEnumParameter (enum machinery; the dynamic enum dispatch now lives inline in MSGFPlusOptions.effective*()) - ParamParser (legacy config-file reader; replaced by MSGFPlusOptions.applyConfigFile) - CaseInsensitiveLinkedHashMapParam, CaseInsensitiveMap (the Parameter map that backed ParamManager) Two small helper types (ParamObject interface + UserParam config-file helper) are still consumed by msutil's runtime registries (ActivationMethod, InstrumentType, Enzyme, Protocol). They have no dependency on the rest of the params/ hierarchy, so they relocate to edu.ucsd.msjava.msutil where their consumers already live. Validation: - Clean compile (main + tests) passes. - Scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc, TestRunManifestWriter, SearchParamsTest, TestPercolator, TestMinSpectraPerThread, TestPrecursorCalScaffolding, TestCandidatePeptideGrid + ConsideringMetCleavage): 73 tests, 0 failures, 0 errors, 5 skipped. The legacy MSGFPlus CLI surface is fully preserved via the typed picocli @option fields + applyConfigFile()'s alias rewrites; this is purely a maintainability cleanup that drops ~2,100 LOC of custom parameter-parsing scaffolding.
Remove legacy text-format parsers; only MGF and mzML are retained. - Delete MS2SpectrumParser, PklSpectrumParser, PNNLSpectrumParser, PNNLSpectraIterator, PNNLSpectraMap from parser/ - Delete dead-code SPTxtParser, SpectrumParserWithTitle, TSVParser, TSVResultParser, FullyBufferedLineReader from parser/ - Delete SpectraMapByTitle and SpectrumAccessorByTitle from msutil/ (only callers were in the deleted SPTxtParser) - Remove MS2/PKL/DTA_TXT/MZDATA entries from SpecFileFormat - Prune corresponding branches from SpectraAccessor.getSpecMap() and getSpecItr() and getSpectrumIDFormatCvParam() - Prune Spectrum.getSpectrumFileFormat() to mzML + MGF only - Update SearchParams.isSupportedSpectrumFormat() to mzML + MGF only - Update MSGFPlusOptions -s description to list only *.mzML, *.mgf - Remove @ignore generateTRexPRMSpectra test (used deleted TSVParser)
Pure file-move + package/import update; no behaviour change. - git mv the 6 remaining files under parser/ into mgf/ (BufferedLineReader, BufferedRandomAccessLineReader, LineReader, MgfSpectrumParser, SpectrumParser, UnicodeBOMInputStream) - Update package declaration in each moved file from edu.ucsd.msjava.parser → edu.ucsd.msjava.mgf - Update import edu.ucsd.msjava.parser.* → edu.ucsd.msjava.mgf.* across 13 callers in fdr/, mzml/, msutil/, msscorer/, msdbsearch/
The Phase 4c review surfaced four issues that all bottomed out in small details of the new typed-options path. Bundling the fixes plus a regression test: 1. **CustomAA= crash (critical, was code review issue #1).** AminoAcidSet.getAminoAcidSetFromModEntries was prepending "CustomAA=" to each entry before handing it to parseConfigEntry, but parseConfigEntry only strips the "nummods=" prefix -- every other line is split on commas and modInfo[0] is parsed as a mass or empirical formula. With the prefix attached, modInfo[0] became "CustomAA=C3H5NO" which fails Double.parseDouble + Composition.getMass and triggers the System.exit(-1) in the caller. Any -conf file with a CustomAA= line crashed the process. MSGFPlusOptions.applyConfigEntry already strips the "Key=" prefix when populating opts.customAAs; the fix is to drop the literal at AminoAcidSet:826 so the bare value reaches parseConfigEntry, matching how staticMods/dynamicMods are passed at line 831. New regression test MSGFPlusOptionsConfigFileTest.configFileWithCustomAAParsesWithoutCrashing pins this with a tiny synthetic config file. 2. **-decoy default doc.** @option description now says "Default: XXX" (the actual value returned by effectiveDecoyPrefix and the same constant the legacy code used via MSGFPlus.DEFAULT_DECOY_PROTEIN_PREFIX). The "Default: DECOY_" string was wrong since Phase 1a. 3. **Single-file spectrum-format null check.** The single-file branch in SearchParams.parse used to silently store a null SpecFileFormat into DBSearchIOFiles when the user supplied -s file.bogus, which later NPE'd at MSGFPlus:305 (specFormat.getPSIName()). It now short-circuits with the same message the directory branch's isSupportedSpectrumFormat filter implies: "Spectrum file extension does not match a supported format (*.mzML, *.mgf): <name>". 4. **Unrecognized config-key URL hint.** The legacy parseConfigParamFile tracked an invalid-parameter counter and, after closing the file, printed the example-params URL hint exactly once if the count was non-zero. MSGFPlusOptions.applyConfigFile now restores that behaviour with a private unrecognizedConfigEntries counter incremented inside the default branch of applyConfigEntry, plus the same end-of-file hint. Scoped tests pass (68 tests, 0 failures, 0 errors).
The Phase 4c MSGFPlusOptions.effectiveActivationMethod() switch hardcoded indices 0..3 (ASWRITTEN/CID/ETD/HCD) and threw IllegalArgumentException for index 4. The legacy addFragMethodParam(ActivationMethod.ASWRITTEN, doNotAddMergeMode=true) hid the registry's FUSION (slot 4), so the user-facing menu was 0..3 + UVPD at index 4. The Phase 4c rewrite silently dropped UVPD support; this commit restores it. - Add case 4: return ActivationMethod.UVPD; to the switch. - Update the @option description to enumerate 4=UVPD. - New unit test (MSGFPlusOptionsActivationMethodTest) pins the full 0..4 mapping plus the default and the out-of-range guard. docs/msgfplus.md already documents -m 4 = UVPD; this brings the code in line with the doc.
Remove stale references that no longer match the modernized code: - README.md: Quick-Start examples now write .pin (default) or use -outputFormat tsv; the deleted MzIDToTsv conversion step is gone. The "What is MS-GF+?" paragraph and "What is different in this fork?" bullets describe the actual current state (mzIdentML output removed; spectrum input narrowed to mzML + mgf only; picocli-based CLI). The required-input table now shows *.mzML / *.mgf only and the -o default is [input].pin. - docs/msgfplus.md: -s synopsis trimmed to "*.mzML or *.mgf" in three places; -allowDenseCentroidedPeaks no longer mentions mzXML; the duplicated "mzML, mzXML, mzML" typo is fixed. - docs/examples/MSGFPlus_Params.txt: SpectrumFile comment trimmed to *.mzML / *.mgf. - docs/readme.md: Input/Output summary now matches the fork's actual format support; obsolete ms-gfdb.md link removed (the doc and entry point were deleted in 5a2ec4e). - docs/troubleshooting.md: FASTA-split workaround now describes concatenating .pin / .tsv outputs (mzIdentML and MzidMerger no longer apply). The OpenMS TOPPAS workaround now feeds the .pin via PercolatorAdapter instead of importing a non-existent .mzid. - src/test/resources/MSGFDB_Param.txt: removed showDecoy=1 and uniformAAProb=auto (both ParamNameEnum entries were dropped in 5a2ec4e and now produce "unrecognized parameter" warnings on every test run). Normalized ParentMassTolerance -> PrecursorMassTolerance and IsotopeError -> IsotopeErrorRange to use the canonical keys; the canonicalConfigKey aliases keep the old names working but the test fixture should be self-documenting. No code changes; the existing scoped test sweep continues to pass.
The user code-reviewed the unpushed branch tip and surfaced three behaviour regressions that all date back to the Phase 4c ParamManager retire (commit 03f32c1) plus a few small cleanups. P1 -- config-file keys are now matched case-insensitively, restoring the legacy ParamManager.parseConfigParamFile semantics. The Phase 4c switch was exact-case so test fixtures using lowercase first-letter keys (e.g. "minCharge=2", "maxCharge=5" in MSGFDB_Param.txt) were silently dropped to defaults instead of overriding them. The fix lowercases canonicalConfigKey output and makes every applyConfigEntry case label lowercase. New regression test configFileKeysAreMatchedCaseInsensitively pins the contract with a mix of canonical, lowercased-first-letter, and ALLCAPS forms. P2 -- invalid enum-like CLI indices (-m 99, -inst 99, -e 99, -protocol 99) and out-of-range numerics now produce a clean user-facing error from SearchParams.parse instead of an IllegalArgumentException stack trace from the resolver. validate() is now invoked in place of validateRequired() and runs the bounds checks up-front. P2 -- restored the legacy IntParameter.minValue/maxValue range checks for: -thread, -tasks, -minSpectraPerThread, -minLength, -maxLength, -minCharge, -maxCharge, -n, -ntt, -tda, -verbose, -addFeatures, -allowDenseCentroidedPeaks, -edgeScore, -ignoreMetCleavage, -maxMissedCleavages, -numMods, -ccm, -u, -m, -inst, -e, -protocol, plus the hidden flags. New unit test validateRejectsOutOfRangeFlags pins a representative set. Polish: - Drop the dead Phase 1 / MSGFPlusOptionsAdapter / ParamManager rollout narrative from the MSGFPlusOptions class header (both the adapter class and ParamManager were deleted in earlier commits; the comment was stale). - Collapse the unrecognizedConfigEntries field + local probe counter in applyConfigFile into a single counter reset at start and read at end -- one piece of state, simpler control flow. - Strip the CRLF (\r before \n) on docs/examples/MSGFPlus_Params.txt and src/test/resources/MSGFDB_Param.txt, which git diff --check was flagging as trailing whitespace. The rest of the codebase is LF-only. Verified: - mvn -B -o test on the scoped sweep (77 tests, 0 failures, 0 errors, 3 skipped). - SearchParamsTest no longer warns "unrecognized parameter 'minCharge=2'" / "'maxCharge=5'" when loading MSGFDB_Param.txt. - git diff --check on this commit is clean.
Drop the lingering String + numeric backcompat for the two enum-shaped flags whose values are real names rather than IDs (per user direction: 'unless are options like 1 2 3 we should do only string'). After this commit: - -outputFormat accepts only `pin` / `tsv` (case-insensitive). The legacy numeric forms `0` / `1` are no longer recognised; users on those invocations should switch to the named values. This is a deliberate breaking change called out in the parameter-modernization cleanup -- consistency over backcompat in this corner. - -precursorCal continues to accept only `auto` / `on` / `off` (case-insensitive), but now via picocli's typed enum matcher rather than a String + fromString fallback. Invalid values fail fast at parse time instead of silently mapping to AUTO. Numeric-ID flags (-m, -inst, -e, -protocol) and 0/1 boolean-style flags (-tda, -verbose, -addFeatures, -allowDenseCentroidedPeaks, -edgeScore, -ignoreMetCleavage, -u, -ntt) keep their integer types -- those values are IDs, not names. Implementation: - New cli.OutputFormat enum (PIN, TSV). - MSGFPlusOptions.outputFormat: String -> OutputFormat. - MSGFPlusOptions.precursorCalMode: String -> SearchParams.PrecursorCalMode. - effectiveOutputFormat() now returns OutputFormat (was int 0/1). - effectivePrecursorCalRaw() collapsed into effectivePrecursorCal() returning the typed enum. - applyConfigEntry parses both flags via Enum.valueOf so config-file values like `OutputFormat=pin` and `PrecursorCal=auto` flow through the same case-insensitive contract as the CLI. - SearchParams.outputFormat field: int -> OutputFormat. writePin() / writeTsv() helpers retained (callers in MSGFPlus.runMSGFPlus). - SearchParams.PrecursorCalMode.fromString() deleted -- no callers after the resolver returns the typed enum directly. - New static factory MSGFPlusOptions.commandLine(opts) returns a CommandLine with caseInsensitiveEnumValuesAllowed(true). All call sites (MSGFPlus.main + 5 test files) routed through it so enum case-insensitivity is uniform. - docs/output.md updated to show `-outputFormat pin` / `tsv` and notes the numeric forms are no longer accepted. Tests: TestDirectPinWriter.outputFormatAcceptsOnlyPinAndTsv pins the new contract (numeric/legacy values rejected, named values accepted case-insensitively). TestPrecursorCalScaffolding migrated to enum constants and to a picocli rejection check for invalid values. The old fromString-fallback test is replaced by the rejection test. Scoped sweep: 78 tests, 0 failures, 0 errors, 4 skipped.
Expand the vNEXT entry to cover the full PR #25 modernization stack: - CLI parser modernisation: picocli-driven MSGFPlusOptions; -conf flow with case-insensitive keys + 13 legacy aliases preserved; numeric/enum range validation surface restored. - Breaking changes: -outputFormat numeric forms (0/1) removed in favour of named pin/tsv (typed enum); -precursorCal switched to typed enum (rejects unknown values instead of silently mapping to AUTO); spectrum input narrowed to mzML + mgf only (mzXML, MS2, PKL, _dta.txt parsers deleted); deprecated MSGFDB entry point and its dead MSGF/MSGFLib siblings removed. - Internal refactor: edu.ucsd.msjava.params package deleted (~2,100 LOC across 18 classes); package reorg (ui/ -> cli/, mzid/ -> output/, parser/ -> mgf/, net.pempek.unicode -> mgf); new typed value classes (MSGFPlusOptions, PrecursorTolerance, IntRange, OutputFormat); picocli 4.7.6 dep added. - Bench gate: prior Astral 3-arm run confirmed bit-identical PSM target/decoy counts (89,479 / 46,792) between baseline and new branch in -precursorCal off mode. Table embedded in the entry. - Earlier in cycle: precursor calibration + Percolator pin output bullets retained. No code change; pure docs.
Combined sweep of cuts surfaced by the post-modernization LOC audit. Net change: 13 files, +57 / -2,131 = -2,074 LOC. ### Deleted dead classes (-2,051 LOC, 6 files) Verified zero callers across src/main and src/test before removal: - msscorer/ScoringParameterGenerator (732 LOC) and ScoringParameterGeneratorWithErrors (880 LOC) -- standalone main() scoring-param tools from the pre-modernization era. Pre-built scoring model .tsv files are committed in resources; these generators are not invoked at search time and have no remaining consumers. - output/Unimod (85 LOC) and output/UnimodComposition (133 LOC) -- residue from the deleted mzIdentML write side. Atom.java's "Unimod mod bricks" comment is the only remaining reference and refers to upstream data, not the deleted classes. - msgf/ToolLauncher (154 LOC) -- abstract launcher with no concrete implementations. - msutil/ScoredString (67 LOC) -- duplicate of fdr/ScoredString with no live callers (all usages resolve to the fdr.* version). ### Inlined effective*() resolvers + helper collapse (-23 LOC net) - ~20 of the trivial `field != null ? field : default` resolvers in MSGFPlusOptions are inlined at the SearchParams.parse call sites. The non-trivial registry-resolving ones (effectiveActivationMethod, effectiveInstrumentType, effectiveEnzyme, effectiveProtocol) and a handful of frequent ones (effectiveOutputFormat, effectiveMin/Max PeptideLength etc.) stay since their length pays for itself. - ActivationMethodAvailability nested class collapsed -- it only hid one InstrumentType.getAllRegisteredInstrumentTypes().length call that is now inline in validate(). - SearchParams.getOutputFormat() and SearchParams.writePin() removed -- writePin() had two callers in MSGFPlus.java which now use !writeTsv() instead; getOutputFormat() had zero callers. ### stripComment dedup (-7 LOC) The two implementations of "split-on-#-and-trim" (SearchParams. getConfigLineWithoutComment and MSGFPlusOptions.stripComment) collapsed: stripComment is the canonical version (package-public), SearchParams.getConfigLineWithoutComment delegates to it, and AminoAcidSet.parseConfigEntry calls stripComment directly. ### Validation surface expanded MSGFPlusOptions.validate() now also rejects a -mod / ModificationFile= path that does not exist, returning a user-facing error string. New regression test (validateRejectsMissingModificationFile) pins both the CLI path and the config-file path. Verified: scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA, TestMisc, TestRunManifestWriter, SearchParamsTest, TestPercolator, TestMinSpectraPerThread, TestPrecursorCalScaffolding, TestCandidatePeptideGrid + ConsideringMetCleavage, MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest): 78 tests, 0 failures, 0 errors, 3 skipped.
…quences
Net change: 21 files, +24 / -409 = -385 LOC. Combines the audit's
medium-value pass with a parallel sweep across the codebase for
methods and fields with no live callers.
### Dependencies dropped from pom.xml
- commons-text: zero direct usages.
- commons-io: NullOutputStream replaced by Java 11's
OutputStream.nullOutputStream() in ConcurrentMSGFPlus;
FilenameUtils.removeExtension replaced by an inline lastIndexOf('.')
+ substring in BuildSA.
Removing commons-text also drops its transitive commons-lang3, which
was used by mgf/BufferedRandomAccessLineReader for Pair<String,
Integer>. Replaced with a small local record BomStripResult
(text, bomLength) — single-purpose, no API consumers, simpler at
the use sites (result.text() / result.bomLength() vs result.getKey()
/ result.getValue()).
### Dead code removed across packages
Verified zero remaining callers for each:
- fdr/TargetDecoyAnalysis: ~38 lines of commented-out legacy
constructors that referenced a non-existent TargetDecoyPSMSet
class. Pure comment cleanup.
- fdr/TSVPSMSet: 14 LOC of dead methods.
- mgf/MgfSpectrumParser: 35 LOC of dead methods.
- msgf/AAFrequencyCounter: 44 LOC of dead helpers.
- msgf/MassListComparator: 15 LOC.
- msscorer/NewRankScorer + NewScorerFactory: 31 LOC of dead
helpers (kept all live scoring-path code per CLAUDE.md invariants).
- msutil/AminoAcidSet, Composition, CompositionFactory, IonType,
Peptide: ~110 LOC of dead helpers and fields.
- sequences/FastaSequence + FastaSequences + ProteinFastaSequence
+ ProteinFastaSequences: ~76 LOC of dead methods.
### Other cleanup
- SearchParams.toString: drop 6 lines of commented-out spectrum-list
output and switch StringBuffer (synchronized, single-threaded
caller) to StringBuilder.
Verified: scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA,
TestMisc, TestRunManifestWriter, SearchParamsTest, TestPercolator,
TestMinSpectraPerThread, TestPrecursorCalScaffolding,
TestCandidatePeptideGrid + ConsideringMetCleavage,
MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest):
78 tests, 0 failures, 0 errors, 3 skipped.
Four test files (TestDirectPinWriter, TestPrecursorCalScaffolding, TestPrecursorCalIntegration, TestRunManifestWriter) had nearly identical buildOpts()/parsedSearchParams() helpers loading the standard MSGFDB_Param.txt + test.mgf + human-uniprot-contaminants.fasta fixture set. Collapse into a single shared SearchTestFixtures.standardOpts() helper in src/test/java/edu/ucsd/ msjava/cli; the two tests that need extra fields (output file, maxMissedCleavages) call standardOpts() and tweak the result. Net change: -25 LOC, plus tests now stale-import-free. Verified: scoped sweep (42 tests across the 4 affected files + MSGFPlusOptionsConfigFileTest + MSGFPlusOptionsActivationMethodTest + SearchParamsTest): 0 failures, 0 errors.
Scrub 31 Java source files of five comment categories: - single-line javadoc restating the method/field name - block javadoc containing only @param/@return lines - /* (non-Javadoc) */ markers - section headers (GETTERS / SETTERS / CONSTRUCTORS / etc.) - commented-out code blocks (dead debug printlns, old algorithms, inactive fields, alternative scorer variants) Non-obvious WHY comments, license headers, performance invariants, CLI contract javadoc, and Picocli @option descriptions are preserved unchanged. LOC delta: -2262 lines, +42 lines across 31 files. Scoped test sweep: 78 tests run, 0 failures, 0 errors, 3 skipped (TestMSUtils×1, TestSA×2 — pre-existing skips unrelated to this change).
Continues the comment cleanup pass with a focused removal of
disabled-but-kept Java code masquerading as comments. Net change:
32 files, +1 / -323 = -322 LOC.
What was removed:
- fdr/TargetDecoyAnalysis: dead 10-line if/else block at the tail of
getFDRMap() that re-set fdrMap edges no longer needed after the
q-value conversion.
- fdr/ComputeFDR: 2-line commented-out specFileCol guard.
- msdbsearch/{CandidatePeptideGrid,CandidatePeptideGridConsideringMetCleavage,
CompactFastaSequence, CompactSuffixArray, DBScanner, MassErrorStat,
PSMFeatureFinder} and msgf/{ProfileGF, FlexAminoAcidGraph,
ScoredSpectrumSum}: scattered // foo() / /* alt impl */ blocks
documenting old algorithm variants and disabled debug printlns.
- msutil/{AminoAcid, Composition, Constants, IonType, VolatileAminoAcid}:
~90 LOC of commented-out static initializers, alternate composition
tables, and debug-only branches.
- sequences/{FastaSequence, FastaSequences, ProteinFastaSequences}
and suffixarray/SuffixArray: ~37 LOC of dead disabled stubs.
Genuine prose comments (TODOs, performance invariants, why-explanations,
URLs to upstream issues, the `applyShift` bit-identical-path guard)
are preserved.
Verified: scoped sweep (TestDirectPinWriter, TestMSUtils, TestSA,
TestMisc, TestRunManifestWriter, SearchParamsTest, TestPercolator,
TestMinSpectraPerThread, TestPrecursorCalScaffolding,
TestCandidatePeptideGrid + ConsideringMetCleavage,
MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest):
78 tests, 0 failures, 0 errors, 3 skipped.
Two follow-ups from the PR #25 review thread: 1. BOM strip in BufferedLineReader (Copilot finding). The constructor wrapped the FileInputStream in UnicodeBOMInputStream but never called skipBOM() -- BOM bytes leaked into line 1, breaking config / mod / FASTA files saved by editors that prepend a UTF-8 BOM (Windows Notepad, some VS Code configurations, etc.). Fix: chain .skipBOM() in the super(...) call. The BOM-stripping path is per-instance; the wrapper still detects the BOM in its constructor, we just now consume it. New test BufferedLineReaderTest pins the contract: a file with a UTF-8 BOM followed by `ParentMassTolerance=20ppm` returns the bare key=value as line 1 (not `ParentMassTolerance=20ppm`), and a no-BOM file is unchanged. 2. Delete unused MSGFResult.java. Verified zero callers in src/main and src/test before removal. The reviewer flagged this as a small follow-up to land alongside the modernization sweep. Verified: scoped sweep (BufferedLineReaderTest, MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest, SearchParamsTest, TestPrecursorCalScaffolding, TestRunManifestWriter, TestDirectPinWriter, TestMSUtils, TestMisc, TestMinSpectraPerThread): 55 tests, 0 failures, 0 errors, 2 skipped.
Convert eight value-shaped types to Java records, per the audit: small immutable carriers with no inheritance constraints, where the record's auto-generated equals/hashCode/toString and accessor methods replace hand-rolled boilerplate. No behavioral change; net trim of ~80 LOC plus a clearer surface. Converted: - cli.IntRange -- record IntRange(int min, int max). Compact constructor keeps the min<=max validation. parse() and Converter (picocli ITypeConverter) preserved. - cli.PrecursorTolerance -- record PrecursorTolerance(Tolerance left, Tolerance right). Compact constructor enforces the matching-unit and non-negative invariants in-place; parse() / Converter retained. - msutil.CvParamInfo -- record CvParamInfo(String accession, String name, String value, String unitAccession, String unitName). The stored hasUnit field is gone; hasUnit() is now derived from unitAccession != null. Compatibility getters (getAccession(), getValue(), getUnitAccession(), getUnitName(), getHasUnit()) keep existing call sites untouched. - msutil.Annotation -- record Annotation(AminoAcid prevAA, Peptide peptide, AminoAcid nextAA). Setters (which had no live callers) removed; the parsing constructor delegates via this(...). isProteinNTerm/isProteinCTerm and the dotted toString are kept. - msgf.ProfilePeak<T extends Matter> -- record. Setters were unused; Comparable<ProfilePeak<T>> is preserved. - msutil.Atom -- record Atom(String code, double mass, int nominalMass, String name). Static atomArr table + atomMap + static initializer block all kept verbatim. getCode() / getName() / getMass() / getNominalMass() retained as compatibility wrappers. - msdbsearch.CompactSuffixArray.RangeMetadata -- record (4 file/int fields). Three call sites updated from md.numEntries to md.numEntries() etc. - msgf.MassListComparator.MatchedPair<T extends Matter> -- record. getMass1/getMass2 retained as compatibility wrappers. Call-site updates in SearchParams.parse: tol.left/.right, isotope.min/.max, specIdx.min/.max, ms.min/.max all switched to record accessor methods (.left()/.right()/.min()/.max()). Verified: scoped sweep (incl. BufferedLineReaderTest, MSGFPlusOptionsConfigFileTest, MSGFPlusOptionsActivationMethodTest, SearchParamsTest, TestPrecursorCalScaffolding, TestRunManifestWriter, TestDirectPinWriter, TestMinSpectraPerThread, TestMSUtils, TestSA, TestMisc, TestCandidatePeptideGrid + ConsideringMetCleavage): 80 tests, 0 failures, 0 errors, 3 skipped.
perf(search): instrumentation + dead-sync removal + per-task buffers + ForkJoinPool toggle
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request adds comprehensive project documentation and internal investigation records to the MS-GF+ codebase. It introduces a high-level project context file, detailed documentation of command-line flags, and a new system for tracking technical investigations and plans. The main themes are improved onboarding/documentation, enhanced transparency into known issues and technical debt, and groundwork for future development and parameter modernization.
Project documentation and onboarding:
.claude/CLAUDE.mdwith an overview of the MS-GF+ project, directory structure, build instructions, conventions, and performance-sensitive invariants. This serves as a central onboarding and context document for the codebase..claude/plans/README.mdand.claude/plans/parameter-modernization-flag-inventory.mdto document implementation plans, design documents, and a detailed inventory of all command-line flags and their parsing semantics. This supports the upcoming parameter modernization effort. [1] [2]Investigations and technical debt tracking:
.claude/investigations/README.mdas an index for tracked issues, bugs, and behaviors needing further analysis, establishing a process for technical investigations.