How-to: Create New Parsers¶
This how-to guide explains how to create a custom parser that reads raw files (CSV, Excel, JSON, XML, etc.) and transforms them into the bam-masterdata format. By following these steps, your parser can be integrated into the Data Store workflow and used in the Parser app.
This allows you to bring custom or third-party data sources into the existing masterdata workflows without manual conversion.
Prerequisites
- Python ≥ 3.10 installed
- Knowledge of the bam-masterdata schema definitions in `bam_masterdata/datamodel/`. Learn more in Schema Definitions.
Use the GitHub parser example¶
- Go to masterdata-parser-example.
- Either fork it (keep your own version) or use it as a template to start a new repository.
- Clone your fork/template locally.
- Verify the folder structure includes `src/`, `tests/`, `pyproject.toml`, and `README.md`:

```
[your repo name]
├── LICENSE
├── pyproject.toml
├── README.md
├── src
│   └── masterdata_parser_example
│       ├── __init__.py
│       ├── parser.py
│       └── _version.py
└── tests
    ├── __init__.py
    ├── conftest.py
    └── test_parser.py
```

- `src/` → contains the parser package code
- `tests/` → contains test files to check your parser works correctly
- `pyproject.toml` → defines dependencies and project configuration
- `README.md` → instructions and documentation
Forking or using the template
You can read more details in the GitHub docs on forking a repository and on creating a repository from a template.
Either way, you should end up with your own repository in which you can work on the definition and logic behind the parser.
Set up a Virtual Environment¶
It is recommended to create a virtual environment named `.venv` (already included in `.gitignore`) to manage dependencies. You have two options to create it:

- Using `venv`
- Using `conda`
Verify that everything is set up correctly by running `pytest tests` inside the repo.
You should see all tests passing before you start customizing.
Faster pip installation

We recommend installing `uv` before installing the package; it significantly speeds up dependency installation.
Modify the project structure and files¶
Since everything in the template project is named masterdata_parser_example (and derivatives), you will need to replace that with your own parser name.
This ensures that your parser has a unique and consistent package name.
For the purposes of this guide, we will rename everything using the fictitious code name SupercodeX.
Python naming conventions
- Packages / modules: lowercase, underscores allowed (e.g., `my_parser`)
- Classes: CapWords / PascalCase (e.g., `MyParser`)
- Variables / functions: lowercase with underscores (e.g., `file_name`, `parse_file`)
See the official Python style guide: PEP 8 – Naming Conventions
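Applied to the fictitious SupercodeX parser used in this guide, the conventions map out as follows (illustrative only; `parse_file` and `file_name` are just the example names from the list above):

```python
# Illustrative only: PEP 8 naming applied to the SupercodeX example.
#
# Package / module name: lowercase with underscores, e.g. src/supercode_x/parser.py


class SupercodeXParser:  # class name in PascalCase
    def parse_file(self, file_name: str) -> str:  # functions and variables in snake_case
        return file_name
```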
Rename project folder and parser class¶
1. Rename the `src` package from `masterdata_parser_example` to your package name (e.g., `supercode_x`). This affects how your users install the package later on with `pip install` (e.g., `pip install supercode_x`).
2. Update all occurrences of `masterdata_parser_example` in `pyproject.toml` to the new package name (e.g., `supercode_x`).
3. Update the `[project]` section in `pyproject.toml` with your specific information.
4. Go to `src/supercode_x/parser.py` and change the class name from `MasterdataParserExample` to your own (e.g., `SupercodeXParser`).
5. Update the imports of this class in `src/supercode_x/__init__.py` and `tests/conftest.py`.
6. Update the entry point dictionary in `src/supercode_x/__init__.py`.
7. Verify that the project still works by running `pytest tests`. If everything is correct, the tests should pass.
Rename entry point¶
1. Go to `src/supercode_x/__init__.py`.
2. Rename `masterdata_parser_example_entry_point` to your new entry point variable name (e.g., `supercode_x_entry_point`).
3. Update all occurrences of `masterdata_parser_example_entry_point` in `pyproject.toml` to the new entry point name (e.g., `supercode_x_entry_point`).
Add parser logic¶
Open the src/.../parser.py file. After renaming your parser class to SupercodeXParser, you should have:
```python
from bam_masterdata.datamodel.object_types import ExperimentalStep
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        synthesis = ExperimentalStep(name="Synthesis")
        synthesis_id = collection.add(synthesis)
        measurement = ExperimentalStep(name="Measurement")
        measurement_id = collection.add(measurement)
        _ = collection.add_relationship(synthesis_id, measurement_id)
        logger.info(
            "Parsing finished: Added example synthesis and measurement experimental steps."
        )
```
Writing the parser logic involves a series of steps:

1. Import the object type classes from bam_masterdata (in the example above, `ExperimentalStep`).
2. Open the files with Python and read the metainformation from them.
3. Instantiate the object types and add the metainformation to the corresponding fields.
4. Add those object types and their relationships to the `collection`.
Optionally, you can add log messages (info, warning, error, or critical) to debug the logic of your parser.
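For instance, a small helper that reads one JSON file and logs at different levels depending on the outcome could look like this (`read_metadata` is a hypothetical illustration, not part of bam-masterdata; in a real parser the framework passes `logger` into `parse()`):

```python
import json
import logging

logger = logging.getLogger("supercode_x")  # in a real parser, `logger` is passed into parse()


def read_metadata(file_path):
    """Hypothetical helper: read one JSON file, logging at the appropriate level."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError:
        logger.error(f"File not found: {file_path}")
        return None
    except json.JSONDecodeError:
        logger.warning(f"Skipping {file_path}: contents are not valid JSON")
        return None
    logger.info(f"Read {len(data)} top-level keys from {file_path}")
    return data
```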
Example¶
As an example, imagine we are expecting to pass a `super.json` file to our `SupercodeXParser` to read certain metadata, with fields such as `program_name` and `program_version`.

We recommend placing the files you use to test the parser in a `tests/data/` folder.
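For example, you could generate such a test file under `tests/data/` with a short script. The field names `program_name` and `program_version` match what the parser reads in the steps below; the values themselves are placeholders:

```python
import json
from pathlib import Path

# Assumed contents of super.json; the field names match what the parser
# reads below, the values are made-up placeholders.
sample = {
    "program_name": "SupercodeX",
    "program_version": "1.0.0",
}

data_dir = Path("tests/data")
data_dir.mkdir(parents=True, exist_ok=True)
(data_dir / "super.json").write_text(json.dumps(sample, indent=2), encoding="utf-8")
```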
Step 1: Import necessary classes¶
At the top of parser.py, ensure you import:
```python
# Step 1: import necessary classes
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser
```
Step 2: Modify the parse() method¶
- Iterate over the `files` argument.
- Open each file and read the JSON content.
- Optionally, log progress using `logger.info()`.
```python
# Step 1: import necessary classes
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        for file_path in files:
            # Step 2: read files metainformation
            logger.info(f"Parsing file: {file_path}")
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)
```
Step 3: Instantiate objects and add metadata¶
- Instantiate `SoftwareCode` objects and fill in the fields with the JSON data.
- Optionally, log progress using `logger.info()`.
```python
# Step 1: import necessary classes
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        for file_path in files:
            # Step 2: read files metainformation
            logger.info(f"Parsing file: {file_path}")
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)

            # Step 3: Instantiate and populate classes metadata
            software = SoftwareCode(
                name=data.get("program_name"),
                version=data.get("program_version")
            )
```
Step 4: Add objects and relationships to the collection¶
- Add the object to the collection using `collection.add(object)`.
- You can also add relationships between objects using `collection.add_relationship(parent_id, child_id)`.
```python
# Step 1: import necessary classes
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        for file_path in files:
            # Step 2: read files metainformation
            logger.info(f"Parsing file: {file_path}")
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)

            # Step 3: Instantiate and populate classes metadata
            software = SoftwareCode(
                name=data.get("program_name"),
                version=data.get("program_version")
            )

            # Step 4: Add to collection
            software_id = collection.add(software)
            logger.info(f"Added SoftwareCode with ID {software_id}")
```
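If you want to exercise this logic without bam-masterdata installed or an OpenBIS connection, you can sketch the moving parts with simple local stand-ins. `SoftwareCodeStub` and `FakeCollection` below are test doubles, not the real API; the sample field values are made up:

```python
import json
import logging
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SoftwareCodeStub:
    """Local stand-in for bam_masterdata's SoftwareCode."""
    name: str
    version: str


class FakeCollection:
    """Local stand-in for the collection passed to parse()."""
    def __init__(self):
        self.objects = []

    def add(self, obj):
        self.objects.append(obj)
        return len(self.objects) - 1  # return a simple integer ID


def parse(files, collection, logger):
    """Same flow as SupercodeXParser.parse above, using the stand-ins."""
    for file_path in files:
        logger.info(f"Parsing file: {file_path}")
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        software = SoftwareCodeStub(
            name=data.get("program_name"),
            version=data.get("program_version"),
        )
        software_id = collection.add(software)
        logger.info(f"Added SoftwareCode with ID {software_id}")


# Usage: write a sample super.json and run the parse logic over it
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "super.json").write_text(
    json.dumps({"program_name": "SupercodeX", "program_version": "1.0.0"})
)
collection = FakeCollection()
parse([tmp_dir / "super.json"], collection, logging.getLogger(__name__))
```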
Referencing Existing Objects in OpenBIS¶
When parsing data, you may want to update an existing object in OpenBIS rather than creating a new one. This is useful when you're importing updated metadata for objects that already exist in the system.
To reference an existing object, set the code attribute on your object instance before adding it to the collection:
```python
# Step 1
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        for file_path in files:
            # Step 2
            logger.info(f"Parsing file: {file_path}")
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)

            # Step 3
            software = SoftwareCode(
                name=data.get("program_name"),
                version=data.get("program_version")
            )

            # Reference an existing object by setting its `code`
            # This code should match the object's code in OpenBIS
            if data.get("existing_identifier"):
                software.code = data.get("existing_identifier")
                logger.info(f"Referencing existing object: {software.code}")

            # Step 4
            software_id = collection.add(software)
```
- Without `code` set (default): A new object is created in OpenBIS with an automatically generated code based on the object type's `generated_code_prefix`.
- With `code` set: The parser looks for an existing object with that code in OpenBIS. If found, it updates the object's properties with the new values from your parser. If not found, the behavior depends on the OpenBIS configuration.
There are several scenarios to consider when deciding whether to set `code`:

- Creating new objects: Leave the `code` attribute unset. The system will generate unique codes automatically.
- Updating existing objects: Set the `code` attribute to match the code of an existing object in OpenBIS. For example, if you have a sample with code `SAMPLE_001` in OpenBIS, set `sample.code = "SAMPLE_001"` in your parser to update it. Note that this is a static assignment.
- Mixed workflow: You can create some new objects while updating others in the same parsing operation by setting `code` only where needed.
Code Format
The code must match the exact format used in OpenBIS, including any prefixes or separators. Codes are typically uppercase with underscores (e.g., SAMPLE_001, EXP_2024_01).
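If your raw files contain identifiers in other styles, a small normalization helper can bring them into this format before assigning them to `code`. `normalize_code` is a hypothetical sketch, not part of bam-masterdata:

```python
import re


def normalize_code(raw: str) -> str:
    """Hypothetical helper: convert a raw identifier into the
    uppercase-with-underscores style typically used for OpenBIS codes."""
    # Collapse every run of non-alphanumeric characters into a single underscore,
    # uppercase the result, and trim stray underscores at the ends.
    return re.sub(r"[^A-Za-z0-9]+", "_", raw.strip()).upper().strip("_")


# e.g. normalize_code("sample 001")  -> "SAMPLE_001"
#      normalize_code("exp-2024/01") -> "EXP_2024_01"
```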
Identifier Construction
When referencing an existing object, the full identifier is constructed as:
- With collection: `/{space_name}/{project_name}/{collection_name}/{code}`
- Without collection: `/{space_name}/{project_name}/{code}`
Make sure the object exists at the expected location in the OpenBIS hierarchy.
Tips¶
- Use the logger to provide useful messages during parsing, but bear in mind this can clutter the app if you plan to parse hundreds or more files.
- Test your parser incrementally by adding one object at a time to the collection and verifying the results. You can do this by modifying the `tests/test_parser.py` test file.
- When updating existing objects, log which objects are being updated to help with debugging and traceability.
Final steps¶
You now have all the core components of your custom parser in place:
- Project structure set up.
- Package renamed to your parser name.
- Parser class created, with logic that reads the metainformation from your specific files.
- Entry points updated.
What’s left?¶
- Update `pyproject.toml`
    - Make sure the package name, version, and entry points match your parser.
    - Adjust dependencies if your parser requires additional libraries (e.g., `pandas`).
- Update the `README.md`
    - Replace the `README.md` content with a description of your parser.
    - Document how to install it and how to run it.
- Create a new release in GitHub
    - Go to your repository on GitHub.
    - Click on the Releases tab (or navigate to `https://github.com/[your-username]/[your-repo]/releases`).
    - Click Create a new release.
    - Choose a tag version (e.g., `v1.0.0`) and add a release title.
    - Optionally, add release notes describing changes or new features.
    - Click Publish release to make it available.
Updating the Parser¶
Once your parser is implemented and tested, future updates are usually minimal and follow a clear process.
- Modify only `parser.py`
    - All changes should be contained within your parser class and helper functions.
    - Avoid renaming packages or changing the project structure unless absolutely necessary.
- Notify the Admin for a new release
    - After updates, inform the administrator or the person responsible for releases.
    - Provide details of the changes and any new dependencies.