After building a machine learning model, you need a place to run the model and serve predictions. If your organization is in the early stages of its AI journey or has budget constraints, you may struggle to find a deployment platform for your model. Building ML infrastructure and integrating ML models with the larger enterprise are major bottlenecks to AI adoption [1,2,3]. IBM Db2 can help solve these problems with its built-in ML infrastructure. Anyone with knowledge of SQL and access to a Db2 instance where the in-database ML feature is enabled can easily learn to build and use a machine learning model in the database.
In this post, I'll show how to develop, deploy, and use a decision tree model in a Db2 database.
These are the main steps in this tutorial:
- Set up Db2 tables
- Explore the ML dataset
- Preprocess the dataset
- Train a decision tree model
- Generate predictions using the model
- Evaluate the model
I performed these steps in a Db2 Warehouse on-prem database. Db2 Warehouse on Cloud also supports these ML features.
The machine learning use case
I'll use a dataset of historical flights in the US. For each flight, the dataset has information such as the flight's origin airport, departure time, flying time, and arrival time. A column in the dataset also indicates whether each flight arrived on time or late. Using examples from the dataset, we'll build a classification model with the decision tree algorithm. Once trained, the model can take unseen flight data as input and predict whether a flight will arrive on time or late at its destination.
1. Set up Db2 tables
The dataset I use in this tutorial is available here as a CSV file.
Creating a Db2 table
I use the following SQL to create a table for storing the dataset.
db2start
connect to <database_name>
db2 "CREATE TABLE FLIGHTS.FLIGHTS_DATA_V3 (
ID INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
YEAR INTEGER ,
QUARTER INTEGER ,
MONTH INTEGER ,
DAYOFMONTH INTEGER ,
DAYOFWEEK INTEGER ,
UNIQUECARRIER VARCHAR(50 OCTETS) ,
ORIGIN VARCHAR(50 OCTETS) ,
DEST VARCHAR(50 OCTETS) ,
CRSDEPTIME INTEGER ,
DEPTIME INTEGER ,
DEPDELAY REAL ,
DEPDEL15 REAL ,
TAXIOUT INTEGER ,
WHEELSOFF INTEGER ,
CRSARRTIME INTEGER ,
CRSELAPSEDTIME INTEGER ,
AIRTIME INTEGER ,
DISTANCEGROUP INTEGER ,
FLIGHTSTATUS VARCHAR(1) )
ORGANIZE BY ROW";
After creating the table, I use the following SQL to load the data from the CSV file into the table:
db2 "IMPORT FROM 'FLIGHTS_DATA_V3.csv' OF DEL COMMITCOUNT 50000 INSERT INTO FLIGHTS.FLIGHTS_DATA_V3"
I now have the ML dataset loaded into the FLIGHTS.FLIGHTS_DATA_V3 table in Db2. I'll copy a subset of the records from this table to a separate table for ML model development and evaluation, leaving the original copy of the data intact.
SELECT count(*) FROM FLIGHTS.FLIGHTS_DATA_V3
— — —
1000000
Creating a separate table with sample records
Create a table with a 10% sample of rows from the above table. Use Db2's RAND function for random sampling.
CREATE TABLE FLIGHT.FLIGHTS_DATA AS (SELECT * FROM FLIGHTS.FLIGHTS_DATA_V3 WHERE RAND() < 0.1) WITH DATA
Count the number of rows in the sample table.
SELECT count(*) FROM FLIGHT.FLIGHTS_DATA
— — —
99879
Look at the schema definition of the table.
SELECT NAME, COLTYPE, LENGTH
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'FLIGHT' AND TBNAME = 'FLIGHTS_DATA'
ORDER BY COLNO
FLIGHTSTATUS is the response, or target, column. The others are feature columns.
Find the DISTINCT values in the target column.
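A minimal query for this check, using only the sample table created above:
SELECT DISTINCT FLIGHTSTATUS FROM FLIGHT.FLIGHTS_DATA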
From these values, I can see that it's a binary classification task where each flight arrived either on time or late.
Find the frequencies of the distinct values in the FLIGHTSTATUS column.
SELECT FLIGHTSTATUS, count(*) AS FREQUENCY, CAST(count(*) AS DOUBLE) / (SELECT count(*) FROM FLIGHT.FLIGHTS_DATA) AS FRACTION
FROM FLIGHT.FLIGHTS_DATA fdf
GROUP BY FLIGHTSTATUS
From the above, I see that the classes are imbalanced. I won't gather any further insights from the entire dataset, as this could leak information into the modeling phase.
Creating train/test partitions of the dataset
Before gathering deeper insights into the data, I'll divide this dataset into train and test partitions using Db2's RANDOM_SAMPLE SP. I apply stratified sampling to preserve the ratio between the two classes in the generated training dataset.
Create a TRAIN partition.
call IDAX.RANDOM_SAMPLE('intable=FLIGHT.FLIGHTS_DATA, fraction=0.8, outtable=FLIGHT.FLIGHTS_TRAIN, by=FLIGHTSTATUS')
Copy the remaining records to a TEST partition.
CREATE TABLE FLIGHT.FLIGHTS_TEST AS (SELECT * FROM FLIGHT.FLIGHTS_DATA FDF WHERE FDF.ID NOT IN(SELECT FT.ID FROM FLIGHT.FLIGHTS_TRAIN FT)) WITH DATA
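To verify the split, a quick check like the following (a sketch using only the tables created above) shows the class frequencies in each partition; the ratio between the two classes should be roughly the same in both:
SELECT 'TRAIN' AS DATA_SPLIT, FLIGHTSTATUS, count(*) AS FREQUENCY
FROM FLIGHT.FLIGHTS_TRAIN
GROUP BY FLIGHTSTATUS
UNION ALL
SELECT 'TEST', FLIGHTSTATUS, count(*)
FROM FLIGHT.FLIGHTS_TEST
GROUP BY FLIGHTSTATUS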
2. Explore the data
In this step, I'll look at both sample records and the summary statistics of the training dataset to gain insights into the dataset.
Look at some sample records.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 10 ROWS ONLY
Some columns encode the time as numbers:
- CRSDEPTIME: Computer Reservation System (scheduled) Departure Time (hhmm)
- DEPTIME: Departure Time (hhmm)
- CRSARRTIME: Computer Reservation System (scheduled) Arrival Time (hhmm)
Now, I gather summary statistics from FLIGHTS_TRAIN using the SUMMARY1000 SP to get a global view of the characteristics of the dataset.
CALL IDAX.SUMMARY1000('intable=FLIGHT.FLIGHTS_TRAIN, outtable=FLIGHT.FLIGHTS_TRAIN_SUM1000')
Here, intable is the name of the input table from which I want the SUMMARY1000 SP to gather statistics. outtable is the name of the table where SUMMARY1000 will store the gathered statistics for the entire dataset. Besides the outtable, the SUMMARY1000 SP creates a few additional output tables, one with statistics for each column type. Our dataset has two types of columns, numeric and nominal, so SUMMARY1000 will generate two additional tables. These tables follow this naming convention: the name of the outtable followed by the column type. In our case, the column types are NUM, representing numeric, and CHAR, representing nominal. So, the names of these two additional tables will be as follows:
FLIGHTS_TRAIN_SUM1000_NUM
FLIGHTS_TRAIN_SUM1000_CHAR
Having the statistics available in separate tables for each data type makes it easier to view the statistics that apply to a particular data type and reduces the number of columns whose statistics are viewed together. This simplifies the analysis.
Check the summary statistics of the numeric columns.
SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_NUM
For the numeric columns, SUMMARY1000 gathers the following statistics:
- Missing value count
- Non-missing value count
- Average
- Variance
- Standard deviation
- Skewness
- Excess kurtosis
- Minimum
- Maximum
Each of these statistics can help uncover insights into the dataset. For instance, I can see that the DEPDEL15 and DEPDELAY columns have 49 missing values. There are large values in these columns: AIRTIME, CRSARRTIME, CRSDEPTIME, CRSELAPSEDTIME, DEPDELAY, DEPTIME, TAXIOUT, WHEELSOFF, and YEAR. Since I'll create a decision tree model, I don't have to deal with the large values or the missing values; Db2 handles both natively.
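As a quick cross-check of the missing-value counts reported by SUMMARY1000, the raw training table can also be queried directly (a sketch; it counts rows where either column is NULL):
SELECT count(*) AS ROWS_WITH_MISSING_VALUES
FROM FLIGHT.FLIGHTS_TRAIN
WHERE DEPDELAY IS NULL OR DEPDEL15 IS NULL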
Next, I check the summary statistics of the nominal columns.
select * from FLIGHT.FLIGHTS_TRAIN_SUM1000_CHAR
For the nominal columns, SUMMARY1000 gathers the following statistics:
- Number of missing values
- Number of non-missing values
- Number of distinct values
- Frequency of the most frequent value
3. Preprocess the data
From the above data exploration, I can see that the dataset has only a handful of missing values, which the decision tree implementation in Db2 handles natively. Four TIME columns have large values: AIRTIME, CRSARRTIME, DEPTIME, and WHEELSOFF. I'll also leave the nominal values in all columns as-is, since the decision tree implementation can deal with them natively.
Extract the hour part from the TIME columns CRSARRTIME, DEPTIME, and WHEELSOFF.
Looking up the description of the dataset, I see that the values in the CRSARRTIME, DEPTIME, and WHEELSOFF columns are hhmm encodings of the time values. I extract the hour part of these values to create, hopefully, better features for the learning algorithm.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TRAIN SET WHEELSOFF = WHEELSOFF / 100
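A quick sanity check that the scaling worked: after the updates, the three columns should contain only hour values, roughly in the 0 to 24 range.
SELECT min(CRSARRTIME) AS MIN_CRSARRTIME, max(CRSARRTIME) AS MAX_CRSARRTIME,
       min(DEPTIME) AS MIN_DEPTIME, max(DEPTIME) AS MAX_DEPTIME,
       min(WHEELSOFF) AS MIN_WHEELSOFF, max(WHEELSOFF) AS MAX_WHEELSOFF
FROM FLIGHT.FLIGHTS_TRAIN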
4. Train a decision tree model
Now the training dataset is ready for the decision tree algorithm.
I train a decision tree model using the GROW_DECTREE SP.
CALL IDAX.GROW_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TRAIN, id=ID, target=FLIGHTSTATUS')
I called this SP with the following parameters:
- model: the name I want to give to the decision tree model, FLIGHT_DECTREE
- intable: the name of the table where the training dataset is stored
- id: the name of the ID column
- target: the name of the target column
After the model training completes, the GROW_DECTREE SP generates several tables with metadata from the model and the training dataset. Here are some of the key tables:
- FLIGHT_DECTREE_MODEL: this table contains metadata about the model. Examples include the depth of the tree, the strategy for handling missing values, and the number of leaf nodes in the tree.
- FLIGHT_DECTREE_NODES: this table provides information about each node in the decision tree.
- FLIGHT_DECTREE_COLUMNS: this table provides information on each input column and its role in the trained model. The information includes the importance of a column in producing a prediction from the model.
This link has the complete list of model tables and their details.
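The metadata tables can be queried like any other Db2 tables. For example (a sketch, assuming the tables are created in the same FLIGHT schema as the model; see the linked documentation for their exact columns):
SELECT * FROM FLIGHT.FLIGHT_DECTREE_MODEL
SELECT * FROM FLIGHT.FLIGHT_DECTREE_COLUMNS
SELECT * FROM FLIGHT.FLIGHT_DECTREE_NODES FETCH FIRST 10 ROWS ONLY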
5. Generate predictions from the model
Since the FLIGHT_DECTREE model is trained and deployed in the database, I can use it to generate predictions on the test records from the FLIGHTS_TEST table.
First, I preprocess the test dataset using the same preprocessing logic that I applied to the TRAINING dataset.
Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:
UPDATE FLIGHT.FLIGHTS_TEST SET CRSARRTIME = CRSARRTIME / 100
Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:
UPDATE FLIGHT.FLIGHTS_TEST SET DEPTIME = DEPTIME / 100
Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:
UPDATE FLIGHT.FLIGHTS_TEST SET WHEELSOFF = WHEELSOFF / 100
Generating predictions
I use the PREDICT_DECTREE SP to generate predictions from the FLIGHT_DECTREE model:
CALL IDAX.PREDICT_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TEST, outtable=FLIGHT.FLIGHTS_TEST_PRED, prob=true, outtableprob=FLIGHT.FLIGHTS_TEST_PRED_DIST')
Here is the list of parameters I passed when calling this SP:
- model: the name of the decision tree model, FLIGHT_DECTREE
- intable: the name of the input table to generate predictions from
- outtable: the name of the table that the SP will create and store the predictions in
- prob: a boolean flag indicating whether to include the probability of each prediction in the output
- outtableprob: the name of the output table where the probability of each prediction will be stored
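To spot-check a few predictions against the actual flight status, the prediction table can be joined back to the test table on the ID column (a sketch; the prediction column names follow the SP's output format described in the documentation):
SELECT T.ID, T.FLIGHTSTATUS AS ACTUAL, P.*
FROM FLIGHT.FLIGHTS_TEST T
JOIN FLIGHT.FLIGHTS_TEST_PRED P ON P.ID = T.ID
FETCH FIRST 10 ROWS ONLY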
6. Evaluate the model
Using the generated predictions for the test dataset, I compute a few metrics to evaluate the quality of the model's predictions.
Creating a confusion matrix
I use the CONFUSION_MATRIX SP to create a confusion matrix based on the model's predictions on the TEST dataset.
CALL IDAX.CONFUSION_MATRIX('intable=FLIGHT.FLIGHTS_TEST, resulttable=FLIGHT.FLIGHTS_TEST_PRED, id=ID, target=FLIGHTSTATUS, matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
Here are some of the key parameters I passed when calling this SP:
- intable: the name of the table that contains the dataset and the actual values of the target column
- resulttable: the name of the table that contains the column with predicted values from the model
- target: the name of the target column
- matrixTable: the output table where the SP will store the confusion matrix
After the SP completes its run, we have the following output table with statistics for the confusion matrix.
FLIGHTS_TEST_CMATRIX:
This table has three columns. The REAL column has the actual flight status. The PREDICTION column has the predicted flight status. Since the flight status takes two values, 0 (on time) or 1 (delayed), there are four possible combinations of values in the REAL and PREDICTION columns:
- TRUE NEGATIVE (REAL: 0, PREDICTION: 0): the model correctly predicted the status of flights that arrived on schedule. From the CNT column, we see that 11795 rows from the TEST table belong to this combination.
- FALSE POSITIVE (REAL: 0, PREDICTION: 1): these flights actually arrived on time, but the model predicted them to be delayed. 671 is the count of such flights.
- FALSE NEGATIVE (REAL: 1, PREDICTION: 0): these flights arrived late, but the model predicted them to be on time. From the CNT column, we find their count to be 2528.
- TRUE POSITIVE (REAL: 1, PREDICTION: 1): the model correctly identified the flights that were late. The count is 4981.
I use these counts to compute a few evaluation metrics for the model. To do so, I use the CMATRIX_STATS SP as follows:
CALL IDAX.CMATRIX_STATS('matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')
The only parameter this SP needs is the name of the table that contains the statistics generated by the CONFUSION_MATRIX SP in the previous step. The CMATRIX_STATS SP generates two sets of output. The first shows the overall quality metrics of the model. The second includes the model's predictive performance for each class.
First output: the overall model metrics include correct predictions, incorrect predictions, overall accuracy, and weighted accuracy. From this output, I see that the model has an overall accuracy of 83.98% and a weighted accuracy of 80.46%.
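As a sanity check, the overall accuracy can be recomputed directly from the confusion matrix: (11795 + 4981) correct predictions out of 19975 test rows ≈ 83.98%. The same figure can be derived in SQL from the matrix table (a sketch; REAL is quoted because it is also a data type keyword):
SELECT CAST(SUM(CASE WHEN "REAL" = PREDICTION THEN CNT ELSE 0 END) AS DOUBLE) / SUM(CNT) AS ACCURACY
FROM FLIGHT.FLIGHTS_TEST_CMATRIX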
With classification tasks, it's usually helpful to view the model's quality metrics for each individual class. The second output from the CMATRIX_STATS SP includes these class-level quality metrics.
For each class, this output includes the True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV, or Precision), and F-measure (F1 score).
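As a worked example using the counts from the confusion matrix above, the metrics for class 1 (delayed) come out to approximately: TPR = 4981 / (4981 + 2528) ≈ 0.663, FPR = 671 / (671 + 11795) ≈ 0.054, PPV = 4981 / (4981 + 671) ≈ 0.881, and F1 = 2 × 0.881 × 0.663 / (0.881 + 0.663) ≈ 0.757. These should match the values reported by CMATRIX_STATS, up to rounding.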
Conclusions and key takeaways
If you want to build and deploy an ML model in a Db2 database using Db2's built-in stored procedures, I hope you'll find this tutorial useful. Here are the main takeaways:
- Demonstrated a complete workflow of creating and using a decision tree model in a Db2 database with in-database ML stored procedures.
- For each step in the workflow, provided concrete and usable SQL statements and stored procedure calls. For each code example, where applicable, explained intuitively what it does, along with its inputs and outputs.
- Included references to IBM Db2's documentation for the ML stored procedures used in this tutorial.
O'Reilly's 2022 AI Adoption survey [3] highlighted building technical infrastructure and the skills gap as two top bottlenecks to AI adoption in the enterprise. Db2 addresses the first by supplying end-to-end ML infrastructure inside the database. It also lessens the second, the skills gap, by providing a simple SQL API for developing and using ML models in the database. In the enterprise, SQL is a far more common skill than ML.
Check out the following resources to learn more about the ML features in IBM Db2 and to see additional examples of ML use cases implemented with these features.
Explore the Db2 ML product documentation
Explore the Db2 ML samples on GitHub
References
- Paleyes, A., Urma, R.G. and Lawrence, N.D., 2022. Challenges in deploying machine learning: a survey of case studies. ACM Computing Surveys, 55(6), pp. 1–29.
- Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T., 2019, May. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 291–300). IEEE.
- Loukides, Mike, AI Adoption in the Enterprise 2022. https://www.oreilly.com/radar/ai-adoption-in-the-enterprise-2022/