Integrating machine learning with Cartesi
In this machine learning tutorial, we will predict a classification based on the Titanic dataset, which shows the characteristics of people onboard the Titanic and whether those people survived the disaster.
You can submit inputs describing a person's features to determine if that person is likely to have survived.
Set up your environment
Install these to set up your environment for quick building.
- Cartesi CLI: a simple tool for building applications on Cartesi. Install Cartesi CLI for your OS of choice.
- Docker Desktop: the tool you need to run the Cartesi Machine and its dependencies. Install Docker Desktop 4.x.
- Python 3.x: used to write your backend application logic. Install Python.
Understanding the dApp
ML Model Generation: The dApp generates a logistic regression model using scikit-learn, NumPy, and pandas.
m2cgen Transpilation: The dApp uses the m2cgen (Model to Code Generator) library to transpile the ML model into pure Python code without external dependencies. This translation simplifies the execution process, particularly in the Cartesi Machine environment.
The practical goal of the application is to predict a classification based on the Titanic dataset.
Users can submit inputs describing a person’s features (e.g., age, sex, embarked port), and the application predicts whether that person is likely to have survived the Titanic disaster.
The model currently considers only three characteristics of a person to predict their survival, even though other attributes are available in the dataset:
- Age.
- Sex, which can be `male` or `female`.
- Embarked, which corresponds to the port of embarkation and can be `C` (Cherbourg), `Q` (Queenstown), or `S` (Southampton).
As such, inputs to the dApp should be given as a JSON string such as the following:
```json
{ "Age": 37, "Sex": "male", "Embarked": "S" }
```
The predicted classification result will be `0` (did not survive) or `1` (did survive).
Clone the repo for this project, and let’s go through it:
```shell
git clone https://github.com/Mugen-Builders/m2cgen.git
```
The `m2cgen` folder contains a `model` folder with a Python script and a `requirements.txt` file. The `build_model.py` file contains the logic for creating the model for our solution, while `requirements.txt` lists the libraries the script needs.

You can think of `build_model.py` as a Jupyter notebook file that we use to experiment with and create models before deploying them. Let's look at what `build_model.py` does.
```python
import pandas as pd
import m2cgen as m2c
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
train_csv = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
include = ["Age", "Sex", "Embarked", "Survived"]
dependent_var = "Survived"

# Load the dataset and keep only the features we care about
train_df = pd.read_csv(train_csv)
if include:
    train_df = train_df[include]

# Split features into categoricals (to be one-hot encoded) and numerics
independent_vars = train_df.columns.difference([dependent_var])
categoricals = []
for col, col_type in train_df[independent_vars].dtypes.items():
    if col_type == "O":
        categoricals.append(col)
    else:
        train_df[col] = train_df[col].fillna(0)

# One-hot encode categorical features, adding a column for missing values
train_df_ohe = pd.get_dummies(train_df, columns=categoricals, dummy_na=True)

x = train_df_ohe[train_df_ohe.columns.difference([dependent_var])]
y = train_df_ohe[dependent_var]
model.fit(x, y)

# Transpile the trained model into dependency-free Python code
model_to_python = m2c.export_to_python(model)
model_columns = list(x.columns)
model_classes = train_df[dependent_var].unique().tolist()

with open("model.py", "w") as text_file:
    print(f"{model_to_python}", file=text_file)
    print(f"columns = {model_columns}", file=text_file)
    print(f"classes = {model_classes}", file=text_file)

print("Model exported successfully")
```
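What does the exported `model.py` look like? For a binary logistic regression, m2cgen emits a plain `score()` function, and the script appends the `columns` and `classes` lists after it. The sketch below illustrates the shape only; the coefficients are placeholders, not the values actually learned from the Titanic data:

```python
# Illustrative sketch of model.py's contents.
# The coefficients below are made up, NOT the real trained values.
def score(input):
    return (-0.48
            + input[0] * -0.01   # Age
            + input[1] * 0.19    # Embarked_C
            + input[2] * -0.03   # Embarked_Q
            + input[3] * -0.24   # Embarked_S
            + input[4] * 0.0     # Embarked_nan
            + input[5] * 1.25    # Sex_female
            + input[6] * -1.25   # Sex_male
            + input[7] * 0.0)    # Sex_nan

# Appended by build_model.py so the dApp knows the feature layout:
columns = ['Age', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
           'Embarked_nan', 'Sex_female', 'Sex_male', 'Sex_nan']
classes = [0, 1]
```

A positive score maps to class `1` (survived) and a negative score to class `0`, which is exactly how the application code interprets it.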
The script primarily aims to export and prepare the ML model for integration into our Cartesi application. It includes the necessary libraries and functions to read data, preprocess it, train a model, and export it to a `model.py` file.

The `m2cgen.py` file in the root folder contains the application logic.
```python
from os import environ
import model
import json
import traceback
import logging
import requests

# Cartesi API definitions
logging.basicConfig(level="INFO")
logger = logging.getLogger(__name__)
rollup_server = environ["ROLLUP_HTTP_SERVER_URL"]
logger.info(f"HTTP rollup_server url is {rollup_server}")

# Util functions
def hex2str(hex):
    return bytes.fromhex(hex[2:]).decode("utf-8")

def str2hex(str):
    return "0x" + str.encode("utf-8").hex()

def classify(input):
    score = model.score(input)
    class_index = None
    if isinstance(score, list):
        class_index = score.index(max(score))
    else:
        if score > 0:
            class_index = 1
        else:
            class_index = 0
    return model.classes[class_index]

def format(input):
    formatted_input = {}
    for key in input.keys():
        if key in model.columns:
            formatted_input[key] = input[key]
        else:
            ohe_key = key + "_" + str(input[key])
            ohe_key_unknown = key + "_nan"
            if ohe_key in model.columns:
                formatted_input[ohe_key] = 1
            else:
                formatted_input[ohe_key_unknown] = 1
    output = []
    for column in model.columns:
        if column in formatted_input:
            output.append(formatted_input[column])
        else:
            output.append(0)
    return output

# Cartesi API
def handle_advance(data):
    status = "accept"
    try:
        input = hex2str(data["payload"])
        input_json = json.loads(input)
        input_formatted = format(input_json)
        predicted = classify(input_formatted)
        output = str2hex(str(predicted))
        response = requests.post(rollup_server + "/notice", json={"payload": output})
    except Exception as e:
        status = "reject"
        msg = f"Error processing data {data}\n{traceback.format_exc()}"
        response = requests.post(rollup_server + "/report", json={"payload": str2hex(msg)})
    return status

def handle_inspect(data):
    response = requests.post(rollup_server + "/report", json={"payload": data["payload"]})
    return "accept"

handlers = {
    "advance_state": handle_advance,
    "inspect_state": handle_inspect,
}

finish = {"status": "accept"}

while True:
    response = requests.post(rollup_server + "/finish", json=finish)
    if response.status_code == 202:
        pass
    else:
        rollup_request = response.json()
        data = rollup_request["data"]
        handler = handlers[rollup_request["request_type"]]
        finish["status"] = handler(rollup_request["data"])
```
This script is the core of our application, responsible for interacting with the Cartesi Rollups infrastructure. It loads the pre-trained machine learning model generated by the `build_model.py` script and receives input data from the Cartesi Rollup server. The script then processes this data, applies the model to make predictions, and communicates the results back to the Rollup server.

The primary functions of this script include data conversion, model prediction, and communication with the Cartesi infrastructure. It ensures our ML-based application integrates seamlessly with Cartesi Rollups, allowing us to harness the power of machine learning within the blockchain environment.
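To make the data conversion concrete, here is a minimal, self-contained sketch of what `format` does to an incoming JSON input. The column list mirrors the one-hot layout that `build_model.py` produces (an assumed layout, shown for illustration only):

```python
# Assumed one-hot column layout produced by build_model.py (illustrative).
columns = ['Age', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
           'Embarked_nan', 'Sex_female', 'Sex_male', 'Sex_nan']

def format_input(raw):
    """Mirror of the dApp's format(): map a JSON input onto the model's columns."""
    encoded = {}
    for key, value in raw.items():
        if key in columns:                   # numeric feature, pass through
            encoded[key] = value
        elif f"{key}_{value}" in columns:    # known categorical value
            encoded[f"{key}_{value}"] = 1
        else:                                # unseen value -> the _nan column
            encoded[f"{key}_nan"] = 1
    # Emit the feature vector in the exact order the model expects
    return [encoded.get(col, 0) for col in columns]

print(format_input({"Age": 37, "Sex": "male", "Embarked": "S"}))
# -> [37, 0, 0, 1, 0, 0, 1, 0]
```

Values the model has never seen (say, an unknown port of embarkation) fall into the corresponding `_nan` column rather than causing an error.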
Now, let’s build and send inputs to the application.
Build and run the m2cgen application
Building the m2cgen container is straightforward. Run:

```shell
cartesi build
```
After the build process is complete, run the node with the command:
```shell
cartesi run
```
Your application is now ready to receive inputs.
Sending inputs to the application
To interact with the application, provide the input in JSON. The input should include key-value pairs for specific features the ML model uses for prediction. Here’s an example:
```json
{ "Age": 37, "Sex": "male", "Embarked": "S" }
```
The application responds with a predicted classification result, where 0 indicates the person did not survive, and 1 indicates survival.
To send inputs, run:
```shell
cartesi send generic
```
Example: Send an input to the application.
```shell
> cartesi send generic
? Chain Foundry
? RPC URL http://127.0.0.1:8545
? Wallet Mnemonic
? Mnemonic test test test test test test test test test test test junk
? Account 0xf39Fd6e51aad88F6F4ce6aB8827279cffFb92266 9999.970671818064986684 ETH
? Application address 0xab7528bb862fb57e8a2bcd567a2e929a0be56a5e
? Input String encoding
? Input (as string) { "Age": 37, "Sex": "male", "Embarked": "S" }
✔ Input sent: 0xe2a2ba347659e53c53f3089ff3268255842c03bafbbf185375f94c7a78f3f98a
```
Retrieving outputs
The `cartesi send generic` command sends an input to the application, which processes it and emits a notice with the prediction to the Rollup Server's `/notice` endpoint.

Notice payloads are returned in hexadecimal format; developers need to decode them to obtain the plain-text result.

We can query these notices using the GraphQL playground hosted at http://localhost:8080/graphql or with a custom frontend client.
You can retrieve all notices sent to the rollup server with the query:
```graphql
query notices {
  notices {
    edges {
      node {
        index
        input {
          index
        }
        payload
      }
    }
  }
}
```
Alternatively, you can query this on a frontend client:
```javascript
const response = await fetch("http://localhost:8080/graphql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: '{ "query": "{ notices { edges { node { payload } } } }" }',
});
const result = await response.json();
for (let edge of result.data.notices.edges) {
  let payload = edge.node.payload;
}
```
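Because the notice payload arrives hex-encoded, it must be decoded the same way the backend's `hex2str` helper encodes it, just in reverse. For example, in Python (the payload below is a hypothetical value carrying the classification result `1`):

```python
# A notice payload is a 0x-prefixed hex string; strip the prefix and decode.
payload = "0x31"  # hypothetical payload carrying the classification result "1"
decoded = bytes.fromhex(payload[2:]).decode("utf-8")
print(decoded)  # -> 1
```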
Changing the application
This dApp was created generically so that you can change the target dataset and the predictor algorithm. To do so, open the file `model/build_model.py` and change the following variables defined at the beginning of the script:
- `model`: defines the scikit-learn predictor algorithm to use. While it currently uses `sklearn.linear_model.LogisticRegression`, many other possibilities are available, from several types of linear regressions to solutions such as support vector machines (SVMs).
- `train_csv`: a URL or file path to a CSV file containing the dataset. It should include a first row with the feature names, followed by the data.
- `include`: an optional list indicating a subset of the dataset's features to be used in the prediction model.
- `dependent_var`: the feature to be predicted, such as the entry's classification.