An AI-enabled hydrologic modelling buddy - part 3

Purpose

This post is the output of a dialogue on the solveit platform, a first attempt at making an AI (Claude Sonnet, I think) competent in a “niche” area of hydrologic modelling in Python: the swift2 package. It follows the previous blog post(s) on AI dialogue engineering.

Note that this post results from a notebook execution, but at this stage we cannot run swift2 from this notebook, so no figures yet, sorry.

Plan

In this notebook we try to distill a Python API, formatted as a markdown document, into content suitable as context for a large language model, so that the model becomes competent at programming against this API. We will see how far we get in ingesting this API content, and ideally coax some competent programming out of the AI, aiming to get a basic model simulation working.

We start from a large-ish markdown file previously produced with griffe2md (see prior post). The markdown file is also under csiro-hydroinformatics/swift-py-doc/tree/main/src for reference.

from bs4 import BeautifulSoup as bs
fn = "./static/swift2_api.md"
with open(fn, 'r') as file:
    content = file.read()
text = bs(content, 'html.parser').get_text()
print(text[:2000])

## swift2

Tools for manipulating LakeOneD models and data and for running
SWIFT from Python.

**Modules:**

- [**classes**](#swift2.classes) –
- [**common**](#swift2.common) –
- [**const**](#swift2.const) –
- [**doc_helper**](#swift2.doc_helper) –
- [**helpers**](#swift2.helpers) –
- [**internal**](#swift2.internal) –
- [**model_definitions**](#swift2.model_definitions) –
- [**parameteriser**](#swift2.parameteriser) –
- [**play_record**](#swift2.play_record) –
- [**proto**](#swift2.proto) – Prototypes
- [**prototypes**](#swift2.prototypes) –
- [**simulation**](#swift2.simulation) –
- [**statistics**](#swift2.statistics) –
- [**system**](#swift2.system) –
- [**utils**](#swift2.utils) –
- [**vis**](#swift2.vis) –
- [**wrap**](#swift2.wrap) – CFFI based wrapper for the SWIFT2 native library

### swift2.classes

**Classes:**

- [**CandidateFactorySeed**](#swift2.classes.CandidateFactorySeed) –
- [**CompositeParameteriser**](#swift2.classes.CompositeParameteriser) – A parameteriser defined as the concatenation of several parameterisers
- [**ConstraintParameteriser**](#swift2.classes.ConstraintParameteriser) –
- [**EnsembleForecastSimulation**](#swift2.classes.EnsembleForecastSimulation) –
- [**EnsembleSimulation**](#swift2.classes.EnsembleSimulation) – A simulation designed to facilitate model runs over ensemble of inputs
- [**ErrisStagedCalibration**](#swift2.classes.ErrisStagedCalibration) –
- [**FilteringParameteriser**](#swift2.classes.FilteringParameteriser) –
- [**FunctionsParameteriser**](#swift2.classes.FunctionsParameteriser) –
- [**HypercubeParameteriser**](#swift2.classes.HypercubeParameteriser) –
- [**MaerrisStagedCalibration**](#swift2.classes.MaerrisStagedCalibration) –
- [**MemoryStates**](#swift2.classes.MemoryStates) –
- [**ObjectiveEvaluator**](#swift2.classes.ObjectiveEvaluator) – Objective Evaluator
- [**ObjectiveScores**](#swift2.classes.ObjectiveScores) –
- [**Optimiser**](#swift2.classes.Optimiser) –
- [**Parameteriser**](#swift2.classes.Parameter

Ingesting the markdown API into python objects

This is where the magic happens, but if you want to see generated modelling code, skip to the next section.

We will now use toolslm to manipulate what is essentially a hierarchical document. We parse the markdown document into a Python structure of nested dictionaries. This lets us use context variable substitution, a feature that is specific to solveit as far as I know, so that the AI can know about information content without maxing out the context length. A bit of magic I don’t fully understand yet, but I’ve seen it work.

from toolslm.md_hier import *
hdgs = create_heading_dict(text)

The above is a hierarchy of level 1, 2, 3 etc. headers, not the content of the API. Let’s first get the content (not only the headers) of the section for some basic simulation steps. We use another function, markdown_to_dict, from the toolslm package.

hdict = markdown_to_dict(text)

type(hdict), len(hdict)

(fastcore.basics.AttrDict, 1228)
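To make the idea of a "heading path to content" mapping concrete, here is a minimal standard-library sketch. It is not the toolslm implementation: the function name `sections_by_path`, the `" / "` separator and the sample text are all made up for illustration, and unlike the toolslm output it stores only the text below each heading, not the heading line itself. It also ignores fenced code blocks, so a `#` comment inside a fence would be mistaken for a heading.

```python
import re

def sections_by_path(md):
    """Map a path of headings to the text found directly under each heading."""
    result, stack, current = {}, [], None
    for line in md.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            level, title = len(m.group(1)), m.group(2).strip()
            # Keep only the ancestors: headings at a strictly shallower level
            stack = [(lvl, t) for lvl, t in stack if lvl < level]
            stack.append((level, title))
            current = " / ".join(t for _, t in stack)
            result[current] = ""
        elif current is not None:
            result[current] += line + "\n"
    return result

# Made-up document: B and C are siblings, D is nested under C
sample_md = "# A\nintro\n## B\nb text\n## C\nc text\n### D\nd text\n"
sections = sections_by_path(sample_md)
print(list(sections.keys()))
```

In this sketch two headings at the same level always end up as siblings under their common parent, which is the behaviour one would expect from a hierarchical key.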

list(hdict.keys())[:20]

['swift2',
 'swift2.swift2classes',
 'swift2.swift2classes.swift2classesCandidateFactorySeed',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriseraddparametertohypercube',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriseraddtohypercube',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserappend',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserapplysysconfig',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserasdataframe',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserbacktransform',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserclone',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserconcatenate',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameterisercreateparametersampler',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriseremptycomposite',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserfilteredparameters',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserfromdataframe',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserhideparameters',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameterisermakestateinitparameteriser',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameterisernumfreeparameters',
 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriserscoreforobjective']

Hmmm. OK. This may work as designed and would work for some APIs, but given that subheaders in the markdown structure are fully qualified names, this leads to very long flattened keys for subsections. I also note that CompositeParameteriser sits under CandidateFactorySeed in the hierarchy, whereas both are level 4 headers in the raw text, so there may be a bug in toolslm, or at least it trips up on the text content.
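One quick way to check header levels in the raw text is to scan for ATX headings with a regular expression. The sketch below runs on a made-up excerpt that mimics the structure of the griffe2md output; to check the real document one would pass `text` instead of `sample`. Note that a `#` comment at the start of a line inside a fenced code block would also match, so the result is only indicative.

```python
import re

# Made-up excerpt mimicking the structure of the griffe2md output
sample = """## swift2
### swift2.classes
#### swift2.classes.CandidateFactorySeed
#### swift2.classes.CompositeParameteriser
##### swift2.classes.CompositeParameteriser.add_parameter_to_hypercube
"""

# ATX heading: 1 to 6 '#' characters, whitespace, then the title
heading_re = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)

# Map each heading title to its level (the number of '#' characters)
levels = {title.strip(): len(hashes) for hashes, title in heading_re.findall(sample)}

for title, level in levels.items():
    print(level, title)
```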

key = 'swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriseraddparametertohypercube'
print(hdict[key])

##### swift2.classes.CompositeParameteriser.add_parameter_to_hypercube

```python
add_parameter_to_hypercube(name, value, min, max)
```

Add a parameter to a hypercube. Must be a type of object that is expandable, otherwise may raise an exception

AI Prompt

OK, hdict is some sort of dictionary that has keys such as ‘swift2.swift2classes.swift2classesCandidateFactorySeed.swift2classesCompositeParameteriser.swift2classesCompositeParameteriseraddparametertohypercube’. Can you propose a transformation from this dict to another dict, where keys are shortened such that only the last element separated by the ‘.’ character remains, i.e. only ‘swift2classesCompositeParameteriseraddparametertohypercube’, but the value of the dictionary is the same?

AI Response

Here’s a simple way to transform the dictionary to use only the last part of each key:

short_dict = {k.split('.')[-1]: v for k, v in hdict.items()}

Would you like me to show you what some of the transformed keys look like? We can check a few examples to make sure this works as expected.

short_dict = {k.split('.')[-1]: v for k, v in hdict.items()}

len(short_dict)
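One caveat with this kind of dict comprehension is that if two long keys share the same last segment, the later entry silently overwrites the earlier one. Comparing `len(short_dict)` with `len(hdict)` is a cheap check. The sketch below shows the idea on a made-up dictionary (the keys and values are hypothetical, not taken from the swift2 documentation). In the real keys above, the last segment still looks fully qualified, so collisions seem unlikely, but it costs little to verify.

```python
from collections import Counter

# Made-up nested keys standing in for hdict
nested = {
    "a.b.clone": "doc for b.clone",
    "a.c.clone": "doc for c.clone",
    "a.b.append": "doc for b.append",
}

# Count how often each shortened key occurs
counts = Counter(k.split(".")[-1] for k in nested)
collisions = sorted(k for k, n in counts.items() if n > 1)

# Same transformation as above: later duplicates overwrite earlier ones
short = {k.split(".")[-1]: v for k, v in nested.items()}

print(len(nested), len(short), collisions)  # prints: 3 2 ['clone']
```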

Can the AI competently write swift2 workflows?

AI Prompt

I want to create a sample hydrologic workflow using the swift2 python API described in $hdgs. What might be some nice examples that we could put together. Tell me which parts of the docs would have the relevant info.

AI Response

Looking at the structure in hdgs, I can suggest a few potential hydrological workflows we could explore. Let me highlight some key components:

Basic Simulation Setup:

swift2.simulation module appears to have core functions for creating and running simulations
swift2.model_definitions would help with setting up catchment structures

Calibration Workflow:

swift2.classes.ErrisStagedCalibration and swift2.classes.MaerrisStagedCalibration suggest we can do model calibration
swift2.parameteriser module would help manage model parameters
swift2.statistics would help evaluate model performance

Ensemble Forecasting:

swift2.classes.EnsembleForecastSimulation suggests we can do ensemble forecasting
swift2.play_record module would help manage inputs and outputs

Would you like me to elaborate on any of these potential workflows? Which aspect interests you most - model setup, calibration, or forecasting?

s_system = short_dict['swift2system']

Let us now see if we can use the API context to create sensible, preferably working, code snippets. I do know where to get the relevant context, so I cheat a bit here, where a step looking for the right documentation would otherwise be needed, but let’s limit the “cheating” to the minimum.
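For what it is worth, the "looking for the right documentation" step could start with a plain keyword search over the shortened keys. The sketch below is an assumption-laden illustration: `find_sections` is a hypothetical helper, and the dictionary is a made-up subset standing in for `short_dict`, with placeholder values.

```python
# Made-up subset of shortened keys standing in for short_dict
docs = {
    "swift2system": "placeholder text",
    "swift2simulation": "placeholder text",
    "swift2classesSimulation": "placeholder text",
    "swift2classesSimulationgetrecorded": "placeholder text",
    "swift2playrecord": "placeholder text",
}

def find_sections(sections, keyword):
    """Return the keys that contain the keyword, ignoring case."""
    kw = keyword.lower()
    return [k for k in sections if kw in k.lower()]

matches = find_sections(docs, "record")
print(matches)
```

The matching sections could then be assigned to a variable and referenced in a prompt, in the same way as `s_system` above.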

AI Prompt

From the API documentation in $s_system, can you suggest a standalone python code snippet with which we can query what are the supported types of rainfall-runoff models in this API?

AI Response

Yes! Looking at s_system, I can see there’s a function called runoff_model_ids() that returns the names of known runoff models. Here’s a simple code snippet to query this:

from swift2.system import runoff_model_ids

available_models = runoff_model_ids()
print("Available rainfall-runoff models:")
print(available_models)

Would you like to try running this code to see what models are available? We can then look up the variables each model exposes using the runoff_model_var_ids() function if you’re interested.

AI Prompt

OK, that worked (I have to run in another notebook). Tell me how I can check which variables are exposed by a model, for instance one called SACSMA from the previous code snippet.

AI Response

From the s_system documentation, we can use the runoff_model_var_ids() function, which takes a model_id parameter. Here’s how to check the variables for the SACSMA model:

from swift2.system import runoff_model_var_ids

sacsma_vars = runoff_model_var_ids('SACSMA')
print("Variables exposed by SACSMA model:")
print(sacsma_vars)

This should return a list of all the variables that can be queried or set in the SACSMA model. Would you like to try this and see what variables are available?

AI Prompt

I think the API includes sample data as well as functions to create lumped model simulations. Looking at the API header titles in the dictionary hdict, can you suggest which functions I may use to create a lumped rainfall runoff model I can execute on that sample data.

AI Response

Let me look through the dictionary keys. One promising path appears to be in the proto module, as I can see there’s a PbmModelFactory class with a new_monthly_lumped_model method.

Let me check if there are other relevant sections before suggesting a specific approach. Would you like me to show the documentation for the relevant functions I find?

The most relevant sections might be:
1. swift2.proto.PbmModelFactory for creating the model
2. swift2.simulation module for running it
3. swift2.doc_helper which seems to have some sample data functions

Which of these would you like to explore first?

AI Prompt

I don’t know about this proto module, but tell me more about the simulation and doc_helper submodules and what looks relevant to get a model up and running.

AI Response

Let me check the relevant documentation for those modules. Let’s start with one at a time.

In swift2.simulation, some key functions that look relevant for getting started are:
- create_catchment
- create_subarea
- exec_simulation
- set_simulation_span
- set_simulation_time_step

In swift2.doc_helper, there appear to be some helpful setup functions:
- configure_test_simulation
- create_test_catchment_structure
- sample_catchment_model
- sample_series

Would you like me to show the detailed documentation for any of these functions? The doc_helper functions look particularly useful for getting started with a sample simulation.

simdoc = short_dict["swift2simulation"]

AI Prompt

OK, looking in the API documentation of the simulation submodule in $simdoc, can you suggest the most straightforward way I can get a simulation model that I can execute and get an output runoff time series, with minimum code?

AI Response

Looking at simdoc, the create_subarea_simulation function appears to be the most straightforward way to get started. It’s specifically designed for creating sample simulations and comes with sensible defaults. Here’s what it needs:

from swift2.simulation import create_subarea_simulation, exec_simulation

# Create a simple simulation with default values:
sim = create_subarea_simulation(
    data_id='MMH',  # Default sample data
    model_id='GR4J', # Default model
    tstep='daily',  # Time step
    simul_start='1990-01-01',  # Default start
    simul_end='2005-12-31'     # Default end
)

# Execute the simulation
exec_simulation(sim)

Would you like to try this? After execution, we can look at how to access the output runoff time series if you’re interested.

AI Prompt

That worked, it seems. Now tell me more about how to access the output runoff time series as you suggest. How can I do that?

AI Response

Looking at the API documentation, we can use functions from the get_recorded family to access simulation outputs. According to simdoc, we’ll need to use functions to get the variable names first, then access their values.

Here’s how we can do it step by step:

# First, let's see what variables were recorded
from swift2.simulation import get_recorded_varnames

recorded_vars = get_recorded_varnames(sim)
print("Recorded variables:", recorded_vars)

Would you like to try this first to see what variables are available? Then we can proceed with getting the actual time series data for the variables you’re interested in.

AI Prompt

Hmmm no I get “ImportError: cannot import name ‘get_recorded_varnames’ from ‘swift2.simulation’”. Looking back into hdgs, suggest the sections I likely need to look into to find the correct portions of detailed API documentation to provide to you so that you can create working code to get the runoff output.

AI Response

Looking at the structure in hdgs, we should look at the swift2.play_record module, as it seems to contain functions for accessing recorded simulation outputs. The name suggests it handles “playing” (input) and “recording” (output) of data.

Would you like me to see the detailed documentation for the swift2.play_record module? This should help us find the correct function names for accessing the simulated runoff time series.

Also, I notice there’s a get_all_recorded function in the Simulation class itself (under swift2.classes.Simulation), so that might be another approach.

Which of these would you like me to look into first?

simobjectdoc = short_dict["swift2classesSimulation"]

AI Prompt

Right, working with the simulation object feels like the best approach. Looking into $simobjectdoc can you suggest how I can access and perhaps visualise the outputs of the simulation?

AI Response

Looking at simobjectdoc, there are two main approaches to get the simulation outputs:

First, we need to tell the simulation what to record using record_state(). By default it records something called ‘CATCHMENT_FLOWRATE_VARID’.
Then after running the simulation, we can get the outputs using either:
- get_all_recorded() - gets all recorded time series
- get_recorded() - more specific, can get particular variables

Would you like me to show you how to use these functions step by step? We should probably start by recording the state and then retrieving it.

The sequence would be:
1. Record the state we want
2. Execute the simulation (which you’ve already done)
3. Retrieve and plot the results

Would you like me to show you the first step?

AI Prompt

Propose me some code to do the lot in one go, using our sim object.

AI Response

Here’s a complete code snippet to record, retrieve, and plot the simulation results:

import matplotlib.pyplot as plt

# Record the default streamflow state
sim.record_state()  # Uses CATCHMENT_FLOWRATE_VARID by default

# Get the recorded time series
flow = sim.get_recorded()

# Create a simple plot
plt.figure(figsize=(10,6))
plt.plot(flow)
plt.title('Simulated Streamflow')
plt.xlabel('Time')
plt.ylabel('Flow')
plt.grid(True)
plt.show()

Would you like to try this and see what it shows? Then we can discuss how to customize the plot or look at different variables if needed.

AI Prompt

This appears syntactically correct, but this does not execute correctly. I get a message “SwiftError: b’to_date_time_to_second cannot handle value not_a_datetime’”. Can you guess what step is missing in the code you proposed?

AI Response

Ah yes! Looking back at simobjectdoc, we need to set the simulation time span before we can run the simulation. Let’s add that step:

import matplotlib.pyplot as plt

# Set the simulation time period
sim.set_simulation_span('1990-01-01', '2005-12-31')  # Using the same dates as in create_subarea_simulation

# Record the default streamflow state
sim.record_state()  # Uses CATCHMENT_FLOWRATE_VARID by default

# Get the recorded time series
flow = sim.get_recorded()

# Create a simple plot
plt.figure(figsize=(10,6))
plt.plot(flow)
plt.title('Simulated Streamflow')
plt.xlabel('Time')
plt.ylabel('Flow')
plt.grid(True)
plt.show()

Would you like to try this version? The set_simulation_span function should help initialize the time dimension properly.

AI Prompt

Still the same error message. Hint: where do we run the simulation object in all this?

AI Response

Ah of course! We need to actually execute the simulation using exec_simulation() after setting up the recording but before trying to get the results. Here’s the corrected sequence:

import matplotlib.pyplot as plt

# Set the simulation time period
sim.set_simulation_span('1990-01-01', '2005-12-31')

# Record the default streamflow state
sim.record_state()  # Uses CATCHMENT_FLOWRATE_VARID by default

# Execute the simulation
sim.exec_simulation()

# Get the recorded time series
flow = sim.get_recorded()

# Create a simple plot
plt.figure(figsize=(10,6))
plt.plot(flow)
plt.title('Simulated Streamflow')
plt.xlabel('Time')
plt.ylabel('Flow')
plt.grid(True)
plt.show()

Would you like to try this version with the exec_simulation() step added?

Wrap up

That last bit was correct, though the plotting does not quite work due to an incorrect assumption about what the flow output is and what shape it is, which is fair enough: a human would also struggle at the beginning.

This is an early proof of concept, but this seems promising to me. It is a far cry from the total hallucinations we get without this additional API context.

We basically cover the very beginning of the “Getting Started” tutorial, so there is work ahead. Crawl before you walk…
