How-to: Create New Parsers¶
This how-to guide explains how to create a custom parser that reads raw files (CSV, Excel, JSON, XML, etc.) and transforms them into the bam-masterdata format. By following these steps, your parser can be integrated into the Data Store workflow and used in the Parser app.
This allows you to bring custom or third-party data sources into the existing masterdata workflows without manual conversion.
Prerequisites
- Python ≥ 3.10 installed
- Knowledge of the bam-masterdata schema definitions in `bam_masterdata/datamodel/`. Learn more in Schema Definitions.
Use the GitHub parser example¶
- Go to masterdata-parser-example.
- Either fork it (keep your own version) or use it as a template to start a new repository.
- Clone your fork/template locally.
- Verify the folder structure includes `src/`, `tests/`, `pyproject.toml`, and `README.md`:

```
[your repo name]
├── LICENSE
├── pyproject.toml
├── README.md
├── src
│   └── masterdata_parser_example
│       ├── __init__.py
│       ├── parser.py
│       └── _version.py
└── tests
    ├── __init__.py
    ├── conftest.py
    └── test_parser.py
```

- `src/` → contains the parser package code
- `tests/` → contains test files to check your parser works correctly
- `pyproject.toml` → defines dependencies and project configuration
- `README.md` → instructions and documentation
Forking or using the template
You can read more details in the GitHub docs on forking a repository and on creating a repository from a template.
Either way, you should end up with your own repository in which you can work on the definition and logic behind the parser.
Set up a Virtual Environment¶
It is recommended to create a virtual environment named `.venv` (already included in `.gitignore`) to manage dependencies. You have two options to create it:

- Using `venv`
- Using `conda`
Verify that everything is set up correctly by running `pytest tests` inside the repo.
You should see all tests passing before you start customizing.
Faster pip installation

We recommend installing `uv` before installing the package; it significantly speeds up dependency installation.
Modify the project structure and files¶
Since everything in the template project is named masterdata_parser_example (and derivatives), you will need to replace that with your own parser name.
This ensures that your parser has a unique and consistent package name.
For the purposes of this guide, we will rename everything using the fictitious code name SupercodeX.
Python naming conventions
- Packages / modules: lowercase, underscores allowed (e.g., `my_parser`)
- Classes: CapWords / PascalCase (e.g., `MyParser`)
- Variables / functions: lowercase with underscores (e.g., `file_name`, `parse_file`)
See the official Python style guide: PEP 8 – Naming Conventions
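Applied to the fictitious SupercodeX parser used in this guide, the conventions map out as follows (illustrative only; `parse_file` and `file_name` are just the example names from the list above):

```python
# Illustrative only: PEP 8 naming applied to the SupercodeX example.
#
# Package / module name: lowercase with underscores, e.g. src/supercode_x/parser.py


class SupercodeXParser:  # class name in PascalCase
    def parse_file(self, file_name: str) -> str:  # functions and variables in snake_case
        return file_name
```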
Rename project folder and parser class¶
1. Rename the `src` package from `masterdata_parser_example` to your package name (e.g., `supercode_x`). This affects how your users install the package later on with `pip install` (e.g., `pip install supercode_x`).
2. Update all occurrences of `masterdata_parser_example` in `pyproject.toml` to the new package name (e.g., `supercode_x`).
3. Update the `[project]` section in `pyproject.toml` with your specific information.
4. Go to `src/supercode_x/parser.py` and change the class name from `MasterdataParserExample` to your own (e.g., `SupercodeXParser`).
5. Update the imports of this class in `src/supercode_x/__init__.py` and `tests/conftest.py`.
6. Update the entry point dictionary in `src/supercode_x/__init__.py`.
7. Verify that the project still works by running `pytest tests`. If everything is correct, the tests should pass.
Rename entry point¶
1. Go to `src/supercode_x/__init__.py`.
2. Rename `masterdata_parser_example_entry_point` to your new entry point variable name (e.g., `supercode_x_entry_point`).
3. Update all occurrences of `masterdata_parser_example_entry_point` in `pyproject.toml` to the new entry point name (e.g., `supercode_x_entry_point`).
Add parser logic¶
Open the src/.../parser.py file. After renaming your parser class to SupercodeXParser, you should have:
```python
from bam_masterdata.datamodel.object_types import ExperimentalStep
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        synthesis = ExperimentalStep(name="Synthesis")
        synthesis_id = collection.add(synthesis)
        measurement = ExperimentalStep(name="Measurement")
        measurement_id = collection.add(measurement)
        _ = collection.add_relationship(synthesis_id, measurement_id)
        logger.info(
            "Parsing finished: Added example synthesis and measurement experimental steps."
        )
```
Writing the parser logic involves a series of steps:

1. Import the object type classes from bam_masterdata (in the example above, `ExperimentalStep`).
2. Open the files with Python and read the metainformation from them.
3. Instantiate the object types and add the metainformation to the corresponding fields.
4. Add those object types and their relationships to the `collection`.
Optionally, you can add log messages (info, warning, error, or critical) to debug the logic of your parser.
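For instance, a small helper that reads one JSON file and logs at different levels depending on the outcome could look like this (`read_metadata` is a hypothetical illustration, not part of bam-masterdata; in a real parser the framework passes `logger` into `parse()`):

```python
import json
import logging

logger = logging.getLogger("supercode_x")  # in a real parser, `logger` is passed into parse()


def read_metadata(file_path):
    """Hypothetical helper: read one JSON file, logging at the appropriate level."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError:
        logger.error(f"File not found: {file_path}")
        return None
    except json.JSONDecodeError:
        logger.warning(f"Skipping {file_path}: contents are not valid JSON")
        return None
    logger.info(f"Read {len(data)} top-level keys from {file_path}")
    return data
```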
Example¶
As an example, imagine we are expecting to pass a `super.json` file to our `SupercodeXParser` to read certain metadata, with fields such as `program_name` and `program_version`.

We recommend placing the files you use to test the parser in a `tests/data/` folder.
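For example, you could generate such a test file under `tests/data/` with a short script. The field names `program_name` and `program_version` match what the parser reads in the steps below; the values themselves are placeholders:

```python
import json
from pathlib import Path

# Assumed contents of super.json; the field names match what the parser
# reads below, the values are made-up placeholders.
sample = {
    "program_name": "SupercodeX",
    "program_version": "1.0.0",
}

data_dir = Path("tests/data")
data_dir.mkdir(parents=True, exist_ok=True)
(data_dir / "super.json").write_text(json.dumps(sample, indent=2), encoding="utf-8")
```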
Step 1: Import necessary classes¶
At the top of parser.py, ensure you import:
```python
# Step 1: import necessary classes
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser
```
Step 2: Modify the parse() method¶
- Iterate over the `files` argument.
- Open each file and read the JSON content.
- Optionally, log progress using `logger.info()`.
```python
# Step 1: import necessary classes
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        for file_path in files:
            # Step 2: read files metainformation
            logger.info(f"Parsing file: {file_path}")
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)
```
Step 3: Instantiate objects and add metadata¶
- Instantiate `SoftwareCode` objects and fill in the fields with the JSON data.
- Optionally, log progress using `logger.info()`.
```python
# Step 1: import necessary classes
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        for file_path in files:
            # Step 2: read files metainformation
            logger.info(f"Parsing file: {file_path}")
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)

            # Step 3: Instantiate and populate classes metadata
            software = SoftwareCode(
                name=data.get("program_name"),
                version=data.get("program_version")
            )
```
Step 4: Add objects and relationships to the collection¶
- Add the object to the collection using `collection.add(object)`.
- You can also add relationships between objects using `collection.add_relationship(parent_id, child_id)`.
```python
# Step 1: import necessary classes
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        for file_path in files:
            # Step 2: read files metainformation
            logger.info(f"Parsing file: {file_path}")
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)

            # Step 3: Instantiate and populate classes metadata
            software = SoftwareCode(
                name=data.get("program_name"),
                version=data.get("program_version")
            )

            # Step 4: Add to collection
            software_id = collection.add(software)
            logger.info(f"Added SoftwareCode with ID {software_id}")
```
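If you want to exercise this logic without bam-masterdata installed or an OpenBIS connection, you can sketch the moving parts with simple local stand-ins. `SoftwareCodeStub` and `FakeCollection` below are test doubles, not the real API; the sample field values are made up:

```python
import json
import logging
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SoftwareCodeStub:
    """Local stand-in for bam_masterdata's SoftwareCode."""
    name: str
    version: str


class FakeCollection:
    """Local stand-in for the collection passed to parse()."""
    def __init__(self):
        self.objects = []

    def add(self, obj):
        self.objects.append(obj)
        return len(self.objects) - 1  # return a simple integer ID


def parse(files, collection, logger):
    """Same flow as SupercodeXParser.parse above, using the stand-ins."""
    for file_path in files:
        logger.info(f"Parsing file: {file_path}")
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        software = SoftwareCodeStub(
            name=data.get("program_name"),
            version=data.get("program_version"),
        )
        software_id = collection.add(software)
        logger.info(f"Added SoftwareCode with ID {software_id}")


# Usage: write a sample super.json and run the parse logic over it
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "super.json").write_text(
    json.dumps({"program_name": "SupercodeX", "program_version": "1.0.0"})
)
collection = FakeCollection()
parse([tmp_dir / "super.json"], collection, logging.getLogger(__name__))
```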
Referencing Existing Objects in OpenBIS¶
When parsing data, you may want to update an existing object in OpenBIS rather than creating a new one. This is useful when you're importing updated metadata for objects that already exist in the system.
To reference an existing object, set the code attribute on your object instance before adding it to the collection:
```python
# Step 1
import json

from bam_masterdata.datamodel.object_types import SoftwareCode
from bam_masterdata.parsing import AbstractParser


class SupercodeXParser(AbstractParser):
    def parse(self, files, collection, logger):
        for file_path in files:
            # Step 2
            logger.info(f"Parsing file: {file_path}")
            with open(file_path, "r", encoding="utf-8") as f:
                data = json.load(f)

            # Step 3
            software = SoftwareCode(
                name=data.get("program_name"),
                version=data.get("program_version")
            )

            # Reference an existing object by setting its `code`
            # This code should match the object's code in OpenBIS
            if data.get("existing_identifier"):
                software.code = data.get("existing_identifier")
                logger.info(f"Referencing existing object: {software.code}")

            # Step 4
            software_id = collection.add(software)
```
- Without `code` set (default): A new object is created in OpenBIS with an automatically generated code based on the object type's `generated_code_prefix`.
- With `code` set: The parser looks for an existing object with that code in OpenBIS. If found, it updates the object's properties with the new values from your parser. If not found, the behavior depends on the OpenBIS configuration.
There are several scenarios to consider when deciding whether to set `code`:

- Creating new objects: Leave the `code` attribute unset. The system will generate unique codes automatically.
- Updating existing objects: Set the `code` attribute to match the code of an existing object in OpenBIS. For example, if you have a sample with code `SAMPLE_001` in OpenBIS, set `sample.code = "SAMPLE_001"` in your parser to update it. Note that this is a static assignment.
- Mixed workflow: You can create some new objects while updating others in the same parsing operation by setting `code` only where needed.
Code Format
The code must match the exact format used in OpenBIS, including any prefixes or separators. Codes are typically uppercase with underscores (e.g., SAMPLE_001, EXP_2024_01).
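If your raw files contain identifiers in other styles, a small normalization helper can bring them into this format before assigning them to `code`. `normalize_code` is a hypothetical sketch, not part of bam-masterdata:

```python
import re


def normalize_code(raw: str) -> str:
    """Hypothetical helper: convert a raw identifier into the
    uppercase-with-underscores style typically used for OpenBIS codes."""
    # Collapse every run of non-alphanumeric characters into a single underscore,
    # uppercase the result, and trim stray underscores at the ends.
    return re.sub(r"[^A-Za-z0-9]+", "_", raw.strip()).upper().strip("_")


# e.g. normalize_code("sample 001")  -> "SAMPLE_001"
#      normalize_code("exp-2024/01") -> "EXP_2024_01"
```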
Identifier Construction
When referencing an existing object, the full identifier is constructed as:
- With collection: `/{space_name}/{project_name}/{collection_name}/{code}`
- Without collection: `/{space_name}/{project_name}/{code}`
Make sure the object exists at the expected location in the OpenBIS hierarchy.
Tips¶
- Use the logger to provide useful messages during parsing, but bear in mind this can clutter the app if you plan to parse hundreds or more files.
- Test your parser incrementally by adding one object at a time to the collection and verifying the results. You can do this by modifying the `tests/test_parser.py` test file.
- When updating existing objects, log which objects are being updated to help with debugging and traceability.
Final steps¶
You now have all the core components of your custom parser in place:
- Project structure set up.
- Package renamed to your parser name.
- Parser class created, with logic that reads the metainformation from your specific files.
- Entry points updated.
What’s left?¶
- Update `pyproject.toml`
    - Make sure the package name, version, and entry points match your parser.
    - Adjust dependencies if your parser requires additional libraries (e.g., `pandas`).
- Update the `README.md`
    - Replace the `README.md` content with a description of your parser.
    - Document how to install it and how to run it.
- Create a new release in GitHub
    - Go to your repository on GitHub.
    - Click on the Releases tab (or navigate to `https://github.com/[your-username]/[your-repo]/releases`).
    - Click Create a new release.
    - Choose a tag version (e.g., `v1.0.0`) and add a release title.
    - Optionally, add release notes describing changes or new features.
    - Click Publish release to make it available.
Updating the Parser¶
Once your parser is implemented and tested, future updates are usually minimal and follow a clear process.
- Modify only `parser.py`
    - All changes should be contained within your parser class and helper functions.
    - Avoid renaming packages or changing the project structure unless absolutely necessary.
- Notify the Admin for a new release
    - After updates, inform the administrator or the person responsible for releases.
    - Provide details of the changes and any new dependencies.