Lithology classification using Hugging Face, part 3

First working lithology classification training on the Namoi data
hugging-face
NLP
lithology
Author

J-M

Published

July 3, 2022

About

This is a continuation of Lithology classification using Hugging Face, part 2.

The “Part 2” post ended on an error when calling trainer.train: incompatible tensor dimensions in a tensor multiplication. It was not at all clear (to me) what the root issue was. After getting back to basics and looking at the HF Text classification how-to, I noticed that my Dataset contained PyTorch tensors, or lists thereof, where the how-to just had simple data types.

Long story short, I removed the tokenizer’s parameter return_tensors="pt", did not call tok_ds.set_format("torch"), and, to my surprise, it worked. I had added these because an initial trial complained about a mix of GPU and CPU data.
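In other words, the tokeniser is left to return plain Python lists, and the Trainer’s default data collator takes care of turning them into batched tensors on the right device. A minimal sketch of the working variant (using the tokz tokeniser loaded later in the walkthrough):

# The tokeniser returns plain lists; the Trainer's default collator builds the tensors
enc = tokz("sandy clay", padding="max_length", truncation=True, max_length=128)
type(enc["input_ids"])  # a plain list of ints, not a torch.Tensor
# i.e. no return_tensors="pt" here, and no tok_ds.set_format("torch") afterwards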

Plan

At this stage, it is worthwhile laying out a roadmap of where this line of work may go:

  • Complete a classification on at least a subset of the Namoi dataset (this post)
  • Upload a trained model to Hugging Face Hub, or perhaps fastai X Hugging Face Group 2022
  • Set up a Gradio application on HF Spaces
  • Project proposal at work. Weekend self-teaching can only go so far.

Walkthrough

Much of the code in this section is very similar to Lithology classification using Hugging Face, part 2, so the code blocks are commented more sparsely.

import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from pathlib import Path
from datasets import ClassLabel
from transformers import TrainingArguments, Trainer
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve,confusion_matrix,auc
import matplotlib.pyplot as plt
from collections import Counter

# Some column string identifiers
MAJOR_CODE = "MajorLithCode"
MAJOR_CODE_INT = "MajorLithoCodeInt"  # We will create a numeric representation of labels, which is (I think?) required by HF.
MINOR_CODE = "MinorLithCode"
DESC = "Description"

fn = Path("~").expanduser() / "data/ela/shp_namoi_river/NGIS_LithologyLog.csv"
litho_logs = pd.read_csv(
    fn, dtype={"FromDepth": str, "ToDepth": str, MAJOR_CODE: str, MINOR_CODE: str}
)

def token_freq(tokens, n_most_common=50):
    list_most_common = Counter(tokens).most_common(n_most_common)
    return pd.DataFrame(list_most_common, columns=["token", "frequency"])

litho_classes = litho_logs[MAJOR_CODE].values
df_most_common = token_freq(litho_classes, 50)

NUM_CLASSES_KEPT=17

labels_kept = df_most_common["token"][:NUM_CLASSES_KEPT].values 
labels_kept = labels_kept[labels_kept != "None"]  # "None" is among the most frequent tokens but is not a lithology class, so 16 classes remain
labels_kept
array(['CLAY', 'GRVL', 'SAND', 'SHLE', 'SDSN', 'BSLT', 'TPSL', 'SOIL',
       'ROCK', 'GRNT', 'SDCY', 'SLSN', 'CGLM', 'MDSN', 'UNKN', 'COAL'],
      dtype=object)
kept = [x in labels_kept for x in litho_classes]
litho_logs_kept = litho_logs[kept].copy()  # avoid warning messages down the track.
labels = ClassLabel(names=labels_kept)
int_labels = np.array([
    labels.str2int(x) for x in litho_logs_kept[MAJOR_CODE].values
])
int_labels = int_labels.astype(np.int8) # to mimic chapter 3 of the HF course, so far as I can see
litho_logs_kept[MAJOR_CODE_INT] = int_labels
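As a quick sanity check, the datasets.ClassLabel object gives a two-way mapping between the string codes and the integer ids just stored (int2str is used again below when exploring predictions):

# Round-trip between lithology codes and integer ids
labels.str2int("CLAY")                   # 0, since CLAY is the most frequent class
labels.int2str(labels.str2int("CLAY"))   # 'CLAY'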

We will fine-tune a smaller version of DeBERTaV3 (DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing), available on the Hugging Face model repository.

STARTING_MODEL = "microsoft/deberta-v3-small"

Dealing with imbalanced classes with weights

sorted_counts = litho_logs_kept[MAJOR_CODE].value_counts()
class_weights = (1 - sorted_counts / sorted_counts.sum()).values
class_weights = torch.from_numpy(class_weights).float().to("cuda")
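The weighting is simply one minus each class’s relative frequency, so rare classes weigh more. Note that torch.nn.CrossEntropyLoss expects the weights in label-id order; here this should hold because both the ClassLabel names and sorted_counts follow the same frequency ordering. A quick way to eyeball the weights against their class codes:

# Pair each class code with its weight; frequent classes (e.g. CLAY) get the smallest weights
dict(zip(sorted_counts.index, class_weights.cpu().numpy().round(3)))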

Tokenisation

p = Path("./tokz_pretrained")
pretrained_model_name_or_path = p if p.exists() else STARTING_MODEL

# Tokenizer max length
max_length = 128

# https://discuss.huggingface.co/t/sentence-transformers-paraphrase-minilm-fine-tuning-error/9612/4
tokz = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, use_fast=True, max_length=max_length, model_max_length=max_length)
if not p.exists():
    tokz.save_pretrained("./tokz_pretrained")

We know from the previous post that we should work with lowercase descriptions to get a more sensible tokenisation.

litho_logs_kept[DESC] = litho_logs_kept[DESC].str.lower()
litho_logs_kept_mini = litho_logs_kept[[MAJOR_CODE_INT, DESC]]
litho_logs_kept_mini.sample(n=10)
        MajorLithoCodeInt  Description
88691                   3  shale
77323                  11  siltstone
42318                   0  clay fine sandy water supply
85089                   1  gravel; as above, except gravels 70% 2-10mm, 3...
112223                  0  clay; 70%, light brown. coarse sand to fine gr...
35510                   0  clay
106351                  0  clay
80478                   0  clay; ligth grey with brown streaks - with som...
20290                   1  gravel
23426                   0  clay, gravelly, blueish
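To illustrate the casing point made above, one can compare how the tokeniser splits a lowercase versus an uppercase version of a description (a small check, not part of the original run; the exact sub-words depend on the DeBERTa vocabulary):

# Lowercase descriptions tend to map onto whole-word sub-tokens
tokz.tokenize("clay fine sandy water supply")
# whereas the uppercase form typically fragments into many short pieces
tokz.tokenize("CLAY FINE SANDY WATER SUPPLY")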

Create dataset and tokenisation

For now, for the sake of execution speed, we will train on a random subset (a quarter) of the full dataset.

len(litho_logs_kept_mini)
123657
litho_logs_kept_mini_subset = litho_logs_kept_mini.sample(len(litho_logs_kept_mini) // 4)
len(litho_logs_kept_mini_subset)
30914
ds = Dataset.from_pandas(litho_logs_kept_mini_subset)

def tok_func(x):
    return tokz(
        x[DESC],
        padding="max_length",
        truncation=True,
        max_length=max_length,
        # return_tensors="pt", ## IMPORTANT not to use return_tensors="pt" here, perhaps conter-intuitively
    )
tok_ds = ds.map(tok_func)
num_labels = len(labels_kept)
Parameter 'function'=<function tok_func at 0x7f49f6e17a60> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
# NOTE: the local caching may be superfluous
p = Path("./model_pretrained")

model_name = p if p.exists() else STARTING_MODEL
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels, max_length=max_length)
                                                           # label2id=label2id, id2label=id2label).to(device) 
if not p.exists():
    model.save_pretrained(p)
print(type(model))
<class 'transformers.models.deberta_v2.modeling_deberta_v2.DebertaV2ForSequenceClassification'>
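The commented-out label2id and id2label arguments above are worth a mention: storing the class names in the model config makes downstream pipelines report e.g. CLAY rather than LABEL_0. A hedged sketch of how those dicts could be built from the ClassLabel defined earlier (not applied in this run):

# Optional: carry the human-readable class names in the model config
id2label = {i: name for i, name in enumerate(labels.names)}
label2id = {name: i for i, name in id2label.items()}
# model = AutoModelForSequenceClassification.from_pretrained(
#     model_name, num_labels=num_labels, id2label=id2label, label2id=label2id)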
tok_ds = tok_ds.rename_columns({MAJOR_CODE_INT: "labels"})
# Keep the description column, which will be handy later despite warnings at training time.
tok_ds = tok_ds.remove_columns(['__index_level_0__'])
# tok_ds = tok_ds.remove_columns(['Description', '__index_level_0__'])
# Not sure why, but setting the labels feature to the ClassLabel makes `train_test_split` complain, so it is left commented out:
# tok_ds.features['labels'] = labels
dds = tok_ds.train_test_split(test_size=0.25, seed=42)
# Defining the Trainer to compute Custom Loss Function, adapted from [Simple Training with the 🤗 Transformers Trainer, around 840 seconds](https://youtu.be/u--UVvH-LIQ?t=840)
class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Feed inputs to model and extract logits
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Extract Labels
        labels = inputs.get("labels")
        # Define loss function with class weights
        loss_func = torch.nn.CrossEntropyLoss(weight=class_weights)
        # Compute loss
        loss = loss_func(logits, labels)
        return (loss, outputs) if return_outputs else loss
def compute_metrics(eval_pred):
    labels = eval_pred.label_ids
    predictions = eval_pred.predictions.argmax(-1)
    f1 = f1_score(labels, predictions, average="weighted")
    return {"f1": f1}
output_dir = "./hf_training"
batch_size = 64 # 128 causes a CUDA out-of-memory exception... Maybe I should consider dynamic padding instead. Later.
epochs = 3 # low, but for didactic purposes this will do.
lr = 8e-5  # inherited, no idea whether appropriate. Is there an lr_find equivalent in Hugging Face?
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=epochs,
    learning_rate=lr,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size * 2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_steps=len(dds["train"]),
    fp16=True,
    push_to_hub=False,
    report_to="none",
)
model = model.to("cuda:0")

The above may not be strictly necessary, depending on your version of transformers. I bumped into the following issue, which was probably the transformers 4.11.3 bug: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
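If in doubt, it is easy to check where the model parameters actually live before handing the model to the Trainer:

# Quick check that the model weights are on the GPU before training
next(model.parameters()).device  # expect device(type='cuda', index=0)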

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dds["train"],
    eval_dataset=dds["test"],
    tokenizer=tokz,
    compute_metrics=compute_metrics,
)
Using amp half precision backend
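One thing to flag before training: the instantiation above uses the plain Trainer, so the WeightedLossTrainer (and therefore the class weights) defined earlier is not actually exercised in this run. Swapping it in would be, roughly:

# To apply the class-weighted loss instead (not done for the run recorded below):
# trainer = WeightedLossTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=dds["train"],
#     eval_dataset=dds["test"],
#     tokenizer=tokz,
#     compute_metrics=compute_metrics,
# )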

Training

trainer.train()
The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: Description. If Description are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
/home/abcdef/miniconda/envs/hf/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 23185
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 1089
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: Description. If Description are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 7729
  Batch size = 128
Saving model checkpoint to ./hf_training/checkpoint-500
Configuration saved in ./hf_training/checkpoint-500/config.json
Model weights saved in ./hf_training/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./hf_training/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./hf_training/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: Description. If Description are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 7729
  Batch size = 128
Saving model checkpoint to ./hf_training/checkpoint-1000
Configuration saved in ./hf_training/checkpoint-1000/config.json
Model weights saved in ./hf_training/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./hf_training/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./hf_training/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: Description. If Description are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 7729
  Batch size = 128


Training completed. Do not forget to share your model on huggingface.co/models =)

[1089/1089 04:57, Epoch 3/3]
Epoch   Training Loss   Validation Loss   F1
1       No log          0.072295          0.983439
2       No log          0.063188          0.985492
3       No log          0.061934          0.986534

TrainOutput(global_step=1089, training_loss=0.16952454397938907, metrics={'train_runtime': 297.9073, 'train_samples_per_second': 233.479, 'train_steps_per_second': 3.655, 'total_flos': 2304099629568000.0, 'train_loss': 0.16952454397938907, 'epoch': 3.0})

Exploring results

This part is new compared to the previous post, so I will elaborate a bit more.

I am not across the high level facilities to assess model predictions (visualisation, etc.) so what follows may be sub-optimal and idiosyncratic.

test_pred = trainer.predict(trainer.eval_dataset)
The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: Description. If Description are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 7729
  Batch size = 128
[61/61 00:08]
test_pred
PredictionOutput(predictions=array([[-0.519 , -2.127 ,  0.61  , ..., -1.751 , -1.145 , -2.832 ],
       [ 0.8223,  2.123 ,  9.27  , ..., -3.254 , -0.8325, -1.929 ],
       [-1.003 , -0.469 , -1.233 , ..., -1.084 , -0.7856, -0.4966],
       ...,
       [-0.53  , -1.396 , -0.5615, ..., -2.506 , -1.985 , -3.44  ],
       [-0.453 , -1.442 , -0.621 , ..., -2.424 , -1.973 , -3.44  ],
       [-1.388 , -2.346 , -1.186 , ..., -1.94  , -0.6084, -2.22  ]],
      dtype=float16), label_ids=array([4, 2, 8, ..., 5, 5, 6]), metrics={'test_loss': 0.061934199184179306, 'test_f1': 0.9865336898918051, 'test_runtime': 8.7643, 'test_samples_per_second': 881.873, 'test_steps_per_second': 6.96})

This is lower level than I anticipated. The predictions array appears to contain the logits. Note that I was not sure what label_ids was: it is not the predicted label, but the “true” label.

test_df = trainer.eval_dataset.to_pandas()
y_true = test_df.labels.values.astype(int)
y_true
array([4, 2, 8, ..., 5, 5, 6])

To get the predicted labels, I seem to need to do the following song and dance:

preds_tf = torch.asarray(test_pred.predictions, dtype=float)
predictions = torch.nn.functional.softmax(preds_tf, dim=-1)
highest = np.argmax(predictions, axis=1)
y_pred = np.array(highest)
y_pred
array([4, 2, 8, ..., 5, 5, 6])
differ = np.logical_not(y_true == y_pred)
print("There are {0} records in the validation data set that differ from true labels".format(np.sum(differ)))
There are 104 records in the validation data set that differ from true labels
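Before looking at individual failures, a per-class breakdown complements the single weighted F1 figure. A minimal sketch using scikit-learn’s classification_report (restricting target_names to the label ids actually present in the validation split):

from sklearn.metrics import classification_report

present = np.unique(y_true)  # label ids occurring in the validation split
print(classification_report(
    y_true, y_pred,
    labels=present,
    target_names=[labels.int2str(int(i)) for i in present],
    zero_division=0,
))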

Let’s look at where we fail to match the labels:

differing = test_df[differ]
lbl_true = labels.int2str(differing.labels.values)
descriptions = differing.Description.values 
lbl_pred = labels.int2str(y_pred[differ])
pd.options.display.max_colwidth = 150
pd.options.display.max_rows = 110
pd.DataFrame.from_dict({
    "label_true": lbl_true,
    "label_pred": lbl_pred,
    "desc": descriptions,
})
label_true label_pred desc
0 TPSL GRNT topoil; granite, grey
1 TPSL CLAY none
2 SDSN CLAY none
3 SHLE CLAY clay multicoloured sandy
4 TPSL CLAY none
5 SAND GRNT granite sand
6 SDSN CLAY gray
7 TPSL CLAY none
8 CLAY SDCY sandy clay
9 SDSN CLAY none
10 TPSL CLAY clay - brown, silty
11 SDSN CLAY brown
12 CLAY SHLE grey
13 SLSN SDSN grey soft silstone
14 SAND SDCY sandy bands, brown
15 SDCY CLAY clay sandy
16 BSLT CLAY none
17 UNKN CLAY carbonaceous wood - dark bluish black to black with associated associated minor khaki and dark grey clay. sample retained for analysis
18 GRNT SAND grantie; ligh pinkish grey, medium, fragments of quartz, hornblende & mica, increased pink feldspar
19 UNKN CLAY none
20 SDSN CLAY none
21 ROCK UNKN missing
22 SLSN SDSN silstone
23 CLAY GRVL light grey medium to coarse sandy gravel - 30%, and gravelly clay - 70%. gravel mainly basalt and jasper
24 CLAY SDCY sandy clay
25 SDSN CLAY none
26 GRVL SAND sand and gravel
27 SAND SOIL soil + sand
28 UNKN CLAY none
29 SHLE CLAY brown
30 CLAY SAND clayey sand (brown) - fine-medium
31 UNKN CLAY white puggy some slightly hard
32 GRNT CLAY none
33 SHLE CLAY none
34 BSLT CLAY none
35 GRVL CLAY none
36 SHLE CLAY none
37 CLAY SDCY sandy and gravel aquifer with bands of clay
38 SDCY CLAY clay sandy
39 CLAY SDSN silty
40 CLAY SDCY sandy clay, light grey, fine
41 BSLT SHLE blue bassalt
42 CLAY SDCY sandy brown clay
43 CLAY SAND sand - silty up to 1mm, clayey
44 SOIL CLAY none
45 GRVL SAND wash alluvial
46 CLAY SDCY sandy clay
47 GRNT CLAY none
48 UNKN CLAY none
49 SDSN SAND sand - mostly white very fine to very coarse gravel
50 GRVL CLAY gravelly sandy clay
51 SOIL CLAY none
52 BSLT SDSN brown weathered
53 GRVL SAND brown sand and fine gravel
54 GRVL SAND course sand and gravel, w/b
55 SDCY GRNT silt, sandy/silty sand
56 TPSL BSLT blue basalt
57 GRVL CLAY stones clay
58 ROCK CLAY ochrs yellow
59 GRVL ROCK stone, clayed to semi formed sandstone
60 UNKN SAND soak water bearing
61 BSLT SDSN balsalt: weathered
62 SOIL CLAY none
63 UNKN CLAY very
64 GRVL CLAY gravelly sandy clay
65 SDSN GRNT granite sand
66 SHLE CLAY none
67 ROCK UNKN water bearing
68 BSLT SAND h/frac, quartz
69 MDSN COAL coal 80% & mudstone, 20%; dark grey, strong, carbonaceous
70 GRVL CLAY as above
71 CLAY GRVL gravelly clay
72 GRVL SAND sand + gravel (water)
73 CLAY GRVL with gravel
74 SAND SDCY sandy yellow
75 CLAY SOIL brown soil and clay
76 CLAY SDCY sandy clay
77 UNKN CLAY hard slightly stoney
78 SDSN CLAY none
79 SDSN ROCK sandsstone
80 TPSL CLAY none
81 SOIL CLAY none
82 SHLE BSLT shae (brown)
83 BSLT CLAY none
84 CLAY SDCY sandy clay
85 SAND SOIL surface soil
86 GRVL CLAY none
87 SAND CLAY none
88 SDSN CLAY none
89 CLAY SHLE grey
90 SDCY CLAY clay sandy water supply
91 GRVL BSLT blue/dark mixed
92 GRVL SAND sand + gravel + white clay
93 UNKN SHLE grey very hard
94 UNKN CLAY white fine, and clay, nodular
95 CLAY SLSN yellow clayey siltstone
96 SDSN CLAY none
97 SDSN SAND brown sand + stones (clean)
98 SDSN CLAY yellow
99 BSLT UNKN broken
100 CLAY SDCY sandy clay stringers
101 SAND CLAY none
102 SDSN ROCK bedrock - sandstone; whitish greyish blue, highly weathered, fine grains, angular to subangular, predominantly clear quartz. very small amounts of...
103 TPSL CLAY none

Observations

The error rate is rather low for a first trial, though admittedly we know that many descriptions are fairly unambiguous. If we examine the failed predictions, we can make a few observations:

  • There are many none descriptions that are picked up as CLAY, but given that the true labels are not necessarily UNKN for these, one cannot complain too much about the model. The fact that some true labels are set to CLAY for these hints at the use of contextual information, perhaps nearby lithology log entries being classified as CLAY.
  • The model picks up several sandy clay as SDCY, which is a priori more suitable than the true labels, at least without other contextual information explaining why the “true” classification ends up in another category such as CLAY.
  • Typographical errors such as silstone and sandsstone throw the model off, which is expected. A production pipeline would need an orthographic correction step (a sketch follows this list).
  • Grammatically unusual expressions such as clay sandy and clayey/gravel brown are also a challenge for the model.
  • More nuanced descriptions are harder still, e.g. light grey medium to coarse sandy gravel - 30%, and gravelly clay - 70%. gravel mainly basalt and jasper, where a human reads that the major class is clay, not gravel, or that broken rock is more akin to gravel than to rock.
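As a rough illustration of what such an orthographic correction step could look like (purely a sketch, with a small hand-picked vocabulary invented for the example; a real pipeline would need a curated lithology lexicon):

import difflib

litho_vocab = ["sandstone", "siltstone", "mudstone", "basalt", "granite", "shale", "clay", "gravel", "sand", "soil", "topsoil", "coal"]

def correct_tokens(description, vocab=litho_vocab, cutoff=0.8):
    # Replace each word by its closest vocabulary entry, if one is similar enough
    corrected = []
    for word in description.split():
        match = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)

correct_tokens("grey soft silstone")  # 'grey soft siltstone'
correct_tokens("blue bassalt")        # 'blue basalt'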

Still, the confusion matrix is overall really encouraging. Let’s have a look:

import seaborn as sns

from matplotlib.ticker import FixedFormatter

def plot_cm(y_true, y_pred, title, figsize=(10,10), labels=None):
    '''
    Draw a confusion matrix, with per-class percentages, for a better view of how the model behaves.

    input: y_true - ground truth labels
           y_pred - predicted labels
           title  - title to give to the confusion matrix
           labels - class names used to annotate the axes

    return: None
    '''
    cm = confusion_matrix(y_true, y_pred, labels=np.unique(y_true))
    cm_sum = np.sum(cm, axis=1, keepdims=True)
    cm_perc = cm / cm_sum.astype(float) * 100
    annot = np.empty_like(cm).astype(str)
    nrows, ncols = cm.shape
    for i in range(nrows):
        for j in range(ncols):
            c = cm[i, j]
            p = cm_perc[i, j]
            if i == j:
                s = cm_sum[i][0]  # cm_sum has shape (n, 1); take the scalar for string formatting
                annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s)
            elif c == 0:
                annot[i, j] = ''
            else:
                annot[i, j] = '%.1f%%\n%d' % (p, c)
    cm = pd.DataFrame(cm, index=np.unique(y_true), columns=np.unique(y_true))
    cm.index.name = 'Actual'
    cm.columns.name = 'Predicted'
    fig, ax = plt.subplots(figsize=figsize)
    ff = FixedFormatter(labels)
    ax.yaxis.set_major_formatter(ff)
    ax.xaxis.set_major_formatter(ff)
    plt.title(title)
    sns.heatmap(cm, cmap= "YlGnBu", annot=annot, fmt='', ax=ax)

def roc_curve_plot(fpr,tpr,roc_auc):
    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' %roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
plot_cm(y_true, y_pred, title="Test set confusion matrix", figsize=(16,16), labels=labels.names)
/tmp/ipykernel_29992/2038836972.py:37: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.yaxis.set_major_formatter(ff)
/tmp/ipykernel_29992/2038836972.py:38: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.xaxis.set_major_formatter(ff)

Conclusion, Next

Despite quite a few arbitrary shortcuts in the overall pipeline, we have a working template to fine-tune a pre-trained classification model to classify primary lithologies.

I’ll probably have to pause this work for a few weeks, though a teaser Gradio app on Hugging Face Spaces, in the same vein as this one, is perhaps doable with relatively little work.

Appendix

# Later on, in another post, for predictions on the CPU:
# model_cpu = model.to("cpu")

# from transformers import TextClassificationPipeline
# tokenizer = tokz

# pipe = TextClassificationPipeline(model=model_cpu, tokenizer=tokenizer, return_all_scores=True)
# # outputs a list of dicts like [[{'label': 'NEGATIVE', 'score': 0.0001223755971295759},  {'label': 'POSITIVE', 'score': 0.9998776316642761}]]

# pipe("clayey sand")

# raw_inputs = [
#     "I've been waiting for a HuggingFace course my whole life.",
#     "I hate this so much!",
# ]
# inputs = tokz(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# print(inputs)

# pipe("I've been waiting for a HuggingFace course my whole life.")