Modifying files in python

Simple text files like parameters.txt

1P1=5
2P2=17
3P5=12

can easily be read ("r") via

with open("parameters.txt", "r") as prm_file:
    for line in prm_file:
        name, value = line.split("=")
        if name == "P2":
            print("[parameters.txt]  value of P2 is", float(value))

Other file formats like YAML or JSON offer a more convenient access. For example, modifying the parameters.yaml

 1Material parameters:
 2    E:
 3        name: "Young's modulus"
 4        value: 20000.
 5        unit: "MPa"
 6    nu:
 7        name: "Poission's ratio"
 8        value: 0.2
 9        unit: "-"
10    ft:
11        name: "tensile strength"
12        value: 4.
13        unit: "MPa"
14        
15Simulation parameters:
16    dt:
17        name: "time increment"
18        value: 0.1
19        unit: "s"
20

requires pip3 install pyyaml and is done by

import yaml

with open("parameters.yaml", "r") as prm_file:
    prm = yaml.load(prm_file, Loader=yaml.FullLoader)
prm_E = prm["Material parameters"]["E"]
assert prm_E["unit"] == "MPa"
print("[parameters.yaml] value of E is", prm_E["value"])

Writing YAML is done via

prm_E["value"] = 17840.2
with open("parameters_modified.yaml", "w") as prm_file:
    yaml.dump(prm, prm_file)
 1Material parameters:
 2  E:
 3    name: Young's modulus
 4    unit: MPa
 5    value: 17840.2
 6  ft:
 7    name: tensile strength
 8    unit: MPa
 9    value: 4.0
10  nu:
11    name: Poission's ratio
12    unit: '-'
13    value: 0.2
14Simulation parameters:
15  dt:
16    name: time increment
17    unit: s
18    value: 0.1

Interacting with MS Excel

A popular package that provides a pythonic representation of MS Excel workbooks/worksheets is openpyxl.

> pip3 install openpyxl

Open a worksheet

../_images/prm_xlsx.png
import openpyxl

workbook = openpyxl.load_workbook("parameters.xlsx")
worksheet = workbook.active

you can also select the worksheet by name

worksheet_by_name = workbook["Tabelle1"]
assert worksheet == worksheet_by_name

Reading from a worksheet

You can either access cells by name …

print("Cell A6 contains:", worksheet["A6"].value)

… or by index.

print("Cell (6,1) contains:", worksheet.cell(6, 1).value)
assert worksheet["A6"].value is not None
assert worksheet["A6"] == worksheet.cell(6, 1)

Note that openpyxl uses one-based-indexing. So in contrast to python lists or numpy, where the first entry is list[0] or numpy.array[0,0], the first cell in the worksheet is cell(1,1).

Openpyxl lets you conveniently loop through the cells of the worksheet, e.g. to find the cell that containts “P2” and to access its right neighbor.

for row in worksheet:
    for cell in row:
        if cell.value == "P2":
            i, j = cell.row, cell.column
            print("Found P2 at cell:", i, j)
            print("Cell right of P2 has value:", worksheet.cell(i, j + 1).value)

Modifying a worksheet

Similar to reading cells, you can assign values by name or index.

worksheet["B11"] = "P_new"
worksheet["C11"] = 17.0
assert worksheet.cell(11, 3).value == 17.0  # C-->3
worksheet.cell(11, 3).value = 6174.0

Don’t forget to save the modified workbook.

workbook.save("parameters_modified.xlsx")

Use cases

You can now get creative and write a function that finds the position of a string within a worksheet …

def row_col_of(ws, parameter_name):
    for row in ws:
        for cell in row:
            if cell.value == parameter_name:
                return cell.row, cell.column
    raise RuntimeError(f'There is no cell containing "{parameter_name}" in {ws}.')

… to extract parameter values from the worksheet …

i, j = row_col_of(worksheet, "P2")
P2_value = worksheet.cell(i, j + 1).value

and write it to a different file, e.g. by replacing a known placeholder with the value. We therefore load a base_input_file.dat (containing the placeholders),

 1* This is a made-up input file for the 
 2* Tool <ANSYS, ABACUS, you name it...>
 3* that contains some fixed parameters,
 4* and some that can be defined within
 5* a workflow.
 6
 7MESH     = "bridge_fine.xdmf";
 8E        = 20000.0;
 9nu       = 0.3;
10P2       = P2_PLACEHOLDER;
11delta_t  = 0.1;
12boundary = "periodic";
13
14...

replace the placeholder with our value from the Excel sheet and save it to a different file.

s = open("base_input_file.dat", "r").read()
s_modified = s.replace("P2_PLACEHOLDER", str(P2_value))
open("input_file_modified.dat", "w").write(s_modified)

Further reading

Automate the boring stuff, e.g.

Basic web searches like

  • “python read file”

  • “python loop through file line by line”

  • “python replace string in file”

  • will often lead you to stackoverflow.com

  • are often faster than reading whole tutorials

  • can result in ugly code, but that is OK.

Interacting with MS Word

One use case in reproducible science that works great in the context of LaTeX files is the separation of text and images, where the images are generated automatically in separate workflows. For MS Word, the challenge is to update images of a document automatically.

A MS Word document is basically a zip file containing a text layer and a picture layer. Those pictures are, if unzipped, stored in word/media and can be accessed via

import zipfile

z = zipfile.ZipFile("report.docx")
all_files = z.namelist()
images = [f for f in all_files if f.startswith("word/media/")]
print(images)

You can now modify the contents (actually creating a new zip file) to replace one of the images with another one. This is conveniently done by the provided class UpdatableZipFile (web search: python replace file in zip lead to https://stackoverflow.com/a/35435548 )

You can direcly overwrite the report.docx, but we chose to work on a copy of it (report_modified.docx) here.

from updateable_zip_file import *

shutil.copy("report.docx", "report_modified.docx")

new_images = [  # order matters!
    "dummy_images/red.png",
    "dummy_images/blue.png",
    "dummy_images/gray.png",
    "dummy_images/green.png",
]

with UpdateableZipFile("report_modified.docx", "a") as o:
    for new, old in zip(new_images, images):
        o.write(new, old)

The big inconvenience is that you cannot recover the image names from the *.docx file, so you have to rely on the ordering.

Comment (TTitscher): As the MS Excel/Word is not part of my daily work, I have no experience how the methods above perform in practice. However, I believe that this topic is important and I suggest forming a working group (lead by someone else) that investigates the methods further.