Machine Learning, Data and the Strange Business of Teaching Computers to Guess

index
Jun 17

Imagine we're sitting in a pub and I tell you I've built a machine that can predict what you'll order next.

It hasn't met you before. It doesn't know whether you prefer beer, wine or something involving an umbrella and a slice of pineapple. But I've shown it thousands of previous orders from people who arrived at a similar time, sat in a similar part of the pub, came with a similar group and spent a similar amount.

The machine looks at all that data and says, with 82 per cent confidence, that you'll order a pint.

You do.

At this point, someone will usually announce that the machine has learned to think.

It hasn't.

What it's done is find a pattern and make a statistically informed guess.

That's machine learning in its simplest form. It's not magic, consciousness or a tiny mathematician living inside a server. It's a computational system using historical data to estimate what's likely to happen next.

The interesting part is how often that turns out to be useful.

Traditional software follows rules

Most traditional software works because someone has explicitly told it what to do.

If a customer spends more than £100, apply a discount.
If the password is wrong three times, lock the account.
If the temperature rises above a threshold, switch on the cooling system.

A programmer defines the rules, the computer executes them, and everyone goes home reasonably happy.
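As a rough sketch, those three rules might look like this in Python. The function names, the 10 per cent discount rate and the 30 degree default threshold are assumptions made for illustration; the rules themselves are the ones listed above.

```python
def discount_rate(order_total_gbp: float) -> float:
    """Return the discount to apply when a customer spends more than £100."""
    # The 10% rate is an assumed value; the rule only says "apply a discount".
    return 0.10 if order_total_gbp > 100 else 0.0


def should_lock_account(failed_attempts: int) -> bool:
    """Lock the account once the password has been wrong three times."""
    return failed_attempts >= 3


def cooling_on(temperature_c: float, threshold_c: float = 30.0) -> bool:
    """Switch on cooling when the temperature rises above the threshold."""
    # The 30.0 default is an assumed threshold, chosen only for the example.
    return temperature_c > threshold_c
```

Every branch here was written by a person. Nothing was learned from data.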

Machine learning changes that arrangement.

Instead of writing every rule manually, we give the computer examples and ask it to discover a useful function for itself.

In mathematical terms, we're trying to learn a mapping:

f(x) = y

The input, x, might be a customer record, an image, a document or a collection of sensor readings.

The output, y, might be a fraud score, an object label, a recommendation or a predicted failure.

The model's job is to approximate the function connecting the two.

Of course, we rarely know the true function. If we did, we probably wouldn't need machine learning in the first place. So the system adjusts its internal parameters until its predictions become acceptably close to the examples it's been shown.

That's the learning bit.

A model is really a very complicated set of adjustable numbers

The word "model" sounds grander than it often is.

At its core, a machine learning model is a mathematical structure containing parameters. These parameters are adjusted during training to reduce error.

Take linear regression. We might try to predict house prices using:

price = weight₁ × size + weight₂ × location + bias

The weights determine how much each feature contributes to the final answer. During training, the system compares its prediction with the real price, calculates the error and adjusts those weights.
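Here's a minimal sketch of a single prediction and its error, before any training has happened. All the numbers are invented for illustration: the weights, the bias, the house and its price are assumptions, and "location" is treated as a simple numeric score.

```python
def predict_price(size, location, w_size, w_location, bias):
    """price = weight1 * size + weight2 * location + bias"""
    return w_size * size + w_location * location + bias


# Hypothetical house: 80 square metres, location score of 7, sold for £250,000.
size_m2 = 80.0
location_score = 7.0
actual_price = 250_000.0

# Hypothetical starting weights, not learned values.
predicted = predict_price(size_m2, location_score,
                          w_size=2_000.0, w_location=10_000.0, bias=5_000.0)

error = predicted - actual_price   # negative means the model guessed too low
squared_error = error ** 2         # a common way to score how wrong the guess was
```

With these made-up numbers the model predicts £235,000 and is £15,000 too low. Training is the process of changing the weights so that gap shrinks across many houses.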

A neural network uses the same broad principle, just at a much larger scale.

It contains layers of interconnected units, each applying weighted transformations and nonlinear activation functions. The network's output is compared with the expected answer, and backpropagation passes the error backwards through the layers to work out how much each parameter contributed to it.

Gradient descent then nudges the parameters in the direction that reduces the loss.

That's the elegant version.

The pub version is that the model guesses, gets told how wrong it was, adjusts itself slightly and tries again millions of times.
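The guess, check, adjust loop can be sketched in a few lines. This is a toy version that fits a straight line with gradient descent on mean squared error. The data, learning rate and step count are assumptions picked so the example converges; a real training run involves far more parameters and far more care.

```python
def train_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = weight * x + bias by repeatedly nudging the two parameters."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # 1. Guess.
        predictions = [weight * x + bias for x in xs]
        # 2. Get told how wrong the guess was.
        errors = [p - y for p, y in zip(predictions, ys)]
        # 3. Work out which direction reduces the mean squared error.
        grad_weight = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_bias = 2 * sum(errors) / n
        # 4. Adjust slightly and go round again.
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Toy data generated from y = 2x + 1, so we know what the answer should be.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
weight, bias = train_linear(xs, ys)
```

After training, weight ends up very close to 2 and bias very close to 1. The loop was never told those values; it only saw examples and an error signal.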

Data isn't just fuel

People often say data is the new oil.

It's a catchy phrase, but it's also slightly misleading.

Oil is broadly consistent. A barrel of oil doesn't suddenly contain duplicated records, contradictory labels, missing fields and a note from someone in operations saying, "Don't use this after March."

Data does.

Machine learning systems don't simply consume data. They inherit its structure, errors, assumptions and biases.

If a model is trained on incomplete data, it'll learn an incomplete view of the world.

If the examples are badly labelled, it'll learn the wrong relationships.

If the training data reflects historical bias, the model can reproduce that bias with mathematical efficiency.

This is where the familiar phrase "garbage in, garbage out" becomes too polite.

With modern machine learning, it's often garbage in, confidence score out.

That's more dangerous because the output looks scientific. It comes with percentages, graphs and decimal places. The presentation suggests certainty even when the underlying evidence is poor.
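A basic audit before training can catch some of this. The sketch below counts exact duplicates, records with missing fields and inputs that carry contradictory labels. The field names "text" and "label" and the sample records are assumptions for the example, and it says nothing about subtler problems such as historical bias.

```python
from collections import defaultdict


def audit_records(records, required_fields):
    """Count exact duplicates, incomplete records and inputs with conflicting labels."""
    seen = set()
    duplicates = 0
    missing = 0
    labels_by_input = defaultdict(set)

    for record in records:
        # An exact duplicate has the same value for every field.
        key = tuple(sorted(record.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)

        # A required field that is absent or None counts as missing.
        if any(record.get(field) is None for field in required_fields):
            missing += 1

        # Track every label seen for the same input text.
        labels_by_input[record.get("text")].add(record.get("label"))

    contradictory = sum(1 for labels in labels_by_input.values() if len(labels) > 1)
    return {
        "duplicates": duplicates,
        "missing_fields": missing,
        "contradictory_inputs": contradictory,
    }


records = [
    {"text": "refund please", "label": "billing"},
    {"text": "refund please", "label": "billing"},    # exact duplicate
    {"text": "refund please", "label": "complaint"},  # same input, different label
    {"text": "reset password", "label": None},        # missing label
]
report = audit_records(records, required_fields=["text", "label"])
```

On this tiny sample the report shows one duplicate, one record with a missing field and one input with contradictory labels. A model trained on it would have no way of knowing which label was right.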

Correlation is useful, but it's not understanding

A machine learning model is generally excellent at finding correlations.

It may discover that customers who exhibit a certain pattern of transactions are more likely to default. It may learn that a particular arrangement of pixels often corresponds to a cat. It may identify the language patterns associated with a support ticket that's likely to escalate.

What it doesn't necessarily know is why.

This matters because correlations can be fragile.

Imagine a model trained to identify military vehicles in photographs. If all the training images of tanks were taken on cloudy days, the model might quietly learn that grey skies are a useful predictor of tanks.

It would perform brilliantly during testing if the test data contained the same pattern.

Then someone shows it a tank in bright sunshine and everything falls apart.

This is called shortcut learning. The model has found an easy statistical signal that works on the training data but doesn't represent the real concept we're interested in.

Computer scientists describe this as a generalisation problem.

The model hasn't learned "tankness". It's learned something that happened to correlate with tanks in the dataset.
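The tank story can be reproduced with a deliberately tiny "model" that picks whichever single feature best matches the labels in training. Everything here is hypothetical: the feature names "cloudy" and "has_tracks", the rows and the one-feature learner are stand-ins, not a real image classifier.

```python
def fit_single_feature_rule(rows, feature_names):
    """Pick the one binary feature that best matches the label on the training rows."""
    def accuracy(name):
        return sum(row[name] == row["is_tank"] for row in rows) / len(rows)
    return max(feature_names, key=accuracy)


def evaluate(rule_feature, rows):
    """Fraction of rows where the chosen feature agrees with the true label."""
    return sum(row[rule_feature] == row["is_tank"] for row in rows) / len(rows)


# Hypothetical training set: every tank photo happens to be cloudy.
train = [
    {"cloudy": 1, "has_tracks": 1, "is_tank": 1},
    {"cloudy": 1, "has_tracks": 1, "is_tank": 1},
    {"cloudy": 1, "has_tracks": 0, "is_tank": 1},  # tracks hidden in this photo
    {"cloudy": 0, "has_tracks": 0, "is_tank": 0},
    {"cloudy": 0, "has_tracks": 0, "is_tank": 0},
    {"cloudy": 0, "has_tracks": 0, "is_tank": 0},
]

# Hypothetical real-world photos: tanks in sunshine, empty fields under cloud.
deployed = [
    {"cloudy": 0, "has_tracks": 1, "is_tank": 1},
    {"cloudy": 0, "has_tracks": 1, "is_tank": 1},
    {"cloudy": 1, "has_tracks": 0, "is_tank": 0},
    {"cloudy": 1, "has_tracks": 0, "is_tank": 0},
]

chosen = fit_single_feature_rule(train, ["cloudy", "has_tracks"])
train_accuracy = evaluate(chosen, train)
deployed_accuracy = evaluate(chosen, deployed)
```

The learner chooses "cloudy" because it scores perfectly on the training rows, then gets every one of the new photos wrong. The feature that actually describes a tank was available all along, but the shortcut looked better on the data it was given.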

Training performance isn't the same as real-world performance

A model can memorise its training data and appear highly accurate.

This is overfitting.

Think of a student who memorises every answer in a practice paper but hasn't understood the subject. Give them the same questions and they'll look brilliant. Change the wording and they're finished.

A good model needs to perform well on data it hasn't seen before.
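The memorising student can be written as code. This is an extreme caricature of overfitting, a lookup table rather than a real model, and the questions are made up for the example.

```python
class Memoriser:
    """A 'model' that stores every training example and recalls it word for word."""

    def __init__(self, default="unknown"):
        self.answers = {}
        self.default = default

    def fit(self, questions, answers):
        self.answers = dict(zip(questions, answers))

    def predict(self, question):
        # Anything not seen during training gets the default answer.
        return self.answers.get(question, self.default)


def accuracy(model, questions, answers):
    correct = sum(model.predict(q) == a for q, a in zip(questions, answers))
    return correct / len(questions)


train_q = ["What is 2 + 2?", "What is 3 + 5?"]
train_a = ["4", "8"]

# Same subject, slightly different wording.
test_q = ["What is 2 plus 2?", "What is 5 + 3?"]
test_a = ["4", "8"]

model = Memoriser()
model.fit(train_q, train_a)
train_acc = accuracy(model, train_q, train_a)
test_acc = accuracy(model, test_q, test_a)
```

Training accuracy is perfect and test accuracy is zero. Real overfitting is rarely this stark, but the gap between the two numbers is the thing to watch.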

That's why datasets are usually divided into training, validation and test sets.

The training set is used to fit the model.
The validation set helps tune choices such as architecture, regularisation and hyperparameters.
The test set provides a final check on unseen data.
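A minimal sketch of that three-way split, using only the standard library. The 70/15/15 proportions and the fixed seed are assumptions for the example; the right ratios depend on how much data you have.

```python
import random


def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle once, then cut into training, validation and test sets."""
    shuffled = list(examples)
    # A seeded generator makes the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # whatever is left over
    return train, validation, test


train, validation, test = split_dataset(range(100))
```

Each example lands in exactly one set. A random shuffle is not always appropriate: for time-based data, for instance, you would usually split by date so the test set comes after the training set.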

Even that isn't enough if all three sets come from the same flawed or unrepresentative source.

The real test comes after deployment, when the model meets changing behaviour, unusual edge cases and users who behave in ways no dataset designer anticipated.

That's when data drift and concept drift enter the conversation.

Data drift means the input distribution changes.

Concept drift means the relationship between inputs and outputs changes.

A fraud model trained on last year's behaviour may become less useful when criminals adopt new tactics. A customer-service classifier may degrade when the company launches a new product or changes its terminology.
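One crude way to watch for data drift is to compare live inputs with the training data. The sketch below measures how far the live mean has moved, in units of the training standard deviation. The transaction amounts and the threshold of 2.0 are assumptions, and production monitoring usually compares whole distributions rather than a single mean.

```python
from statistics import mean, stdev


def mean_shift_in_std(training_values, live_values):
    """Distance between the live mean and the training mean, in training standard deviations."""
    return abs(mean(live_values) - mean(training_values)) / stdev(training_values)


def has_drifted(training_values, live_values, threshold=2.0):
    # The threshold is an assumed cut-off, not a standard value.
    return mean_shift_in_std(training_values, live_values) > threshold


# Hypothetical transaction amounts seen during training.
training_amounts = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 22.0, 21.0]

similar_week = [21.0, 20.0, 22.0, 21.0]   # looks like the training data
unusual_week = [48.0, 52.0, 50.0, 47.0]   # customers are suddenly spending far more

similar_flag = has_drifted(training_amounts, similar_week)
unusual_flag = has_drifted(training_amounts, unusual_week)
```

This only looks at inputs, so it can catch data drift but not concept drift. Spotting a changed relationship between inputs and outputs generally needs fresh labelled outcomes to compare predictions against.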

The model hasn't broken in the traditional sense.

The world has moved.

Computational systems are pipelines, not isolated models

One of the biggest misunderstandings about machine learning is the idea that the model is the whole system.

It isn't.

The model may be the glamorous part, but it's usually one component in a much larger computational pipeline.

Data has to be collected, cleaned, normalised, transformed and stored.
Features may need to be engineered or embedded.
The model must be trained, evaluated, versioned, deployed and monitored.
Predictions need to be passed into other systems.
Human beings may need to review uncertain or high-risk outputs.
Feedback must be captured so the system can be improved.

A brilliant model inside a badly designed pipeline is still a bad system.
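To show how small the model's share of the work can be, here is a stripped-down pipeline. Every name, field and weight in it is hypothetical, and the "model" is a fixed formula standing in for something trained. Monitoring, versioning and feedback capture are left out entirely.

```python
def clean(record):
    """Reject records with no amount; tidy up the country code."""
    if record.get("amount") is None:
        return None
    return {**record, "country": record.get("country", "").strip().upper()}


def to_features(record):
    """Turn a cleaned record into the numbers the model expects."""
    scaled_amount = record["amount"] / 1000.0
    is_foreign = 1.0 if record["country"] != "GB" else 0.0
    return [scaled_amount, is_foreign]


def predict(features):
    """Stand-in for a trained model: hypothetical fixed weights, capped at 1.0."""
    score = 0.6 * features[0] + 0.3 * features[1]
    return min(score, 1.0)


def run_pipeline(records):
    results = []
    for record in records:
        cleaned = clean(record)
        if cleaned is None:
            # Bad input is handled explicitly instead of crashing the model step.
            results.append({"id": record["id"], "status": "rejected_bad_input"})
            continue
        score = predict(to_features(cleaned))
        results.append({"id": record["id"], "status": "scored", "score": score})
    return results


results = run_pipeline([
    {"id": 1, "amount": 500.0, "country": " gb "},
    {"id": 2, "amount": None, "country": "FR"},
])
```

The predict function is four lines. Most of the code, and most of the ways this can go wrong, sit around it.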

In production, data engineering, observability, latency, governance and failure handling often matter as much as raw predictive accuracy.

A model that's 98 per cent accurate but takes 20 seconds to answer may be useless in a real-time application.
A model that's 95 per cent accurate but reliably returns an answer in 100 milliseconds may create far more value.

Engineering always has trade-offs.

Generative AI is still prediction

Large language models can feel fundamentally different because their outputs are fluent, creative and often surprisingly coherent.

But underneath, they're still prediction systems.

A language model receives a sequence of tokens and estimates a probability distribution over which token should come next.

It doesn't retrieve a fully written answer from a hidden database. It generates the response one token at a time.

Given the sequence:

The capital of France is...

the model assigns high probability to "Paris".

Given a more complex prompt, it performs the same basic operation across a far richer context.
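A toy version makes the mechanics visible. The sketch below counts which word follows which in a tiny invented corpus, turns the counts into probabilities and generates one token at a time. It is a big simplification: it looks only at the previous word, whereas a large language model uses a neural network conditioned on the whole context.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus, already split into lowercase tokens.
corpus = (
    "the capital of france is paris . "
    "the capital of italy is rome . "
    "the capital of france is paris ."
).split()

# Count how often each token follows each other token.
next_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_counts[current][following] += 1


def next_token_distribution(token):
    """Probability of each possible next token, given only the previous token."""
    counts = next_counts[token]
    total = sum(counts.values())
    return {candidate: count / total for candidate, count in counts.items()}


def generate(prompt_tokens, max_new_tokens=5):
    """Greedy generation: append the most probable next token, one at a time."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = next_token_distribution(tokens[-1])
        if not distribution:
            break
        tokens.append(max(distribution, key=distribution.get))
        if tokens[-1] == ".":
            break
    return tokens


distribution = next_token_distribution("is")
completed = generate(["the", "capital", "of", "france", "is"])
```

After "is", this toy model puts two thirds of its probability on "paris" and one third on "rome", purely because of how often each appeared. It has no idea what a capital city is.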

Transformer architectures make this possible through attention mechanisms, which allow the model to calculate how strongly different tokens relate to one another within the context window.
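The core calculation, usually called scaled dot-product attention, fits in a short function. This sketch handles a single query vector in plain Python. The vectors are invented two-dimensional examples, and the learned projection matrices and multiple heads of a real transformer are omitted.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of tokens."""
    scale = math.sqrt(len(query))
    # How strongly does the query relate to each token's key?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the value vectors according to those weights.
    dim = len(values[0])
    output = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]
    return weights, output


# Three hypothetical tokens, each with a key and a value vector.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
```

The query points along the first dimension, so the first and third tokens receive most of the weight and the second receives the least. The output is a weighted blend of the values, which is how information from related tokens is mixed together.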

The result can look like reasoning.

Sometimes it is functionally similar to reasoning.

But the system is still producing outputs from learned statistical relationships. It doesn't automatically know whether a generated statement is true, current or supported by the organisation's approved knowledge.

That's why a language model can produce a beautifully written answer that happens to be completely wrong.

Fluency isn't proof.

The hidden issue is computational confidence

Human beings are easily impressed by confident answers.

We're even more impressed when those answers arrive instantly and are written in polished language.

Machine learning systems exploit that weakness without intending to.

A model may produce a high-confidence output because the input resembles patterns in its training data. That confidence doesn't necessarily mean the answer is correct in the real world.

It means the model is confident within its own learned representation.

Those aren't the same thing.

This distinction becomes critical in areas such as healthcare, finance, law, compliance and customer service.

A model can be computationally certain and operationally wrong.

That's why responsible systems need thresholds, uncertainty measures, source validation, escalation rules and human oversight.
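One simple uncertainty measure is the entropy of the model's predicted probabilities: low when one option dominates, high when the model is spreading its bets. The sketch below uses it to decide when to involve a person. The 0.8 bit cut-off and the example probabilities are assumptions.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def needs_human_review(probabilities, max_entropy_bits=0.8):
    # The cut-off is an assumed value; it would need tuning per application.
    return entropy_bits(probabilities) > max_entropy_bits


confident = [0.97, 0.02, 0.01]   # one class clearly dominates
uncertain = [0.40, 0.35, 0.25]   # the model is hedging between all three

confident_review = needs_human_review(confident)
uncertain_review = needs_human_review(uncertain)
```

A measure like this only describes the model's own spread of belief. It can't tell you whether those probabilities are well calibrated, which is why it complements source validation and oversight rather than replacing them.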

The goal isn't to remove uncertainty.

It's to recognise and manage it.

Where organisational knowledge enters the picture

This brings us to a less glamorous but more immediate problem.

Most businesses aren't training foundation models from scratch. They're connecting existing AI systems to their own documents, policies, knowledge bases and operational content.

That often means using retrieval-augmented generation, usually called RAG.

In a RAG system, the user's question is converted into a vector representation. The system searches for semantically similar content in a vector database, retrieves relevant chunks and gives them to the language model as context.

The model then produces an answer grounded, at least in theory, in the retrieved material.
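The retrieval half of that flow can be sketched without any external services. In this toy version the "embedding" is just a word count and the "vector database" is a Python list; a real system would use a learned embedding model and a proper vector store. The chunks, the question and the prompt wording are all made up, and the call to the language model itself is left out.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (an assumption, not a real embedding model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())


def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=1):
    """Rank stored chunks by similarity to the question and keep the best ones."""
    query_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, retrieved_chunks):
    """Assemble the context the language model would be given."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "Customers can cancel a contract within fourteen days.",
    "The office is closed on public holidays.",
    "Expense claims need a receipt.",
]
question = "How can customers cancel a contract?"

top = retrieve(question, chunks)
prompt = build_prompt(question, top)
```

Whatever ends up in that prompt is what the model treats as evidence. If the retrieved chunk is wrong or out of date, the answer is built on it all the same.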

It's a clever approach because it allows organisations to use current internal information without retraining the entire model.

But there's a catch.

RAG can only retrieve what's there.

If the knowledge base contains three conflicting refund policies, the system may retrieve all three.
If the correct document has a vague title and poor metadata, the system may not find it.
If outdated content remains highly ranked, the model may use it.
If a scanned PDF can't be parsed properly, that knowledge may be functionally invisible.

The AI isn't failing because its neural architecture is inadequate.

It's failing because the surrounding data and knowledge environment is unreliable.

This is where index fits in

index helps organisations improve the information their people and AI systems depend on.

index Scan examines content across platforms such as SharePoint, ServiceNow, Confluence and document repositories. It identifies contradictions, duplicates, outdated material, broken links, missing ownership, poor structure and content that's difficult for machines to interpret.

index Solve supports the remediation process. It helps teams correct, merge, retire or improve problematic content while keeping human review and governance in place.

index Sustain shows whether knowledge quality is improving. It provides evidence around AI readiness, risk, findability, compliance and the operational effect of better information.

Take a simple example:

Suppose a company has three documents describing how customers can cancel a contract.

One says seven days.
Another says fourteen.
The third doesn't specify a period at all.

A RAG system might retrieve any of them. A chatbot could then confidently give different answers to different customers.

index identifies the contradiction before the AI turns it into a customer-facing problem.

That's the point.

The model may be sophisticated, but the quality of the answer still depends on the quality of the evidence it receives.
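The cancellation example can be made concrete with a toy check. To be clear, this isn't a description of how index works internally; it's a simplified illustration of the kind of conflict involved. The document names, the wording, the regular expression and the small number-word table are all assumptions.

```python
import re

# Only the handful of number words this example needs.
NUMBER_WORDS = {"seven": 7, "fourteen": 14, "thirty": 30}


def extract_period_days(text):
    """Return the stated period in days, or None if the text doesn't give one."""
    match = re.search(r"(\d+|[a-z]+)\s+days", text.lower())
    if not match:
        return None
    raw = match.group(1)
    return int(raw) if raw.isdigit() else NUMBER_WORDS.get(raw)


def find_conflicts(documents):
    """Compare the period each document states and flag disagreement."""
    periods = {name: extract_period_days(text) for name, text in documents.items()}
    stated = {days for days in periods.values() if days is not None}
    unspecified = [name for name, days in periods.items() if days is None]
    return {"periods": periods, "conflict": len(stated) > 1, "unspecified": unspecified}


documents = {
    "policy_a": "Customers may cancel within seven days.",
    "policy_b": "Customers may cancel within 14 days of signing.",
    "policy_c": "Customers may cancel by contacting support.",
}
report = find_conflicts(documents)
```

The report shows two different periods and one document that gives none. Real content is far messier than three tidy sentences, but the principle is the same: find the disagreement before a chatbot picks a side.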

Machine learning doesn't remove the need for good systems

We're often tempted to imagine machine learning as a replacement for conventional software, structured data and human judgement.

In practice, it works best alongside them.

Rules are useful when the logic is known and must be consistent.

Machine learning is useful when patterns are complex and difficult to express manually.

Humans are useful when context, accountability and judgement matter.

The strongest computational systems combine all three.

A fraud platform may use hard rules to block impossible transactions, a machine learning model to identify suspicious patterns and a human investigator to review uncertain cases.
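That layering might be sketched as follows. The field names, the hard rules and the two score thresholds are hypothetical, and the model score is passed in rather than computed, since any scoring model could sit behind it.

```python
def assess_transaction(transaction, model_score):
    """Combine hard rules, a model score and a human-review band into one decision."""
    # Layer 1: hard rules for logic that is known and must be consistent.
    if transaction["amount"] <= 0 or transaction["card_expired"]:
        return "block"

    # Layer 2: the model's fraud score handles the fuzzy patterns.
    if model_score >= 0.9:      # assumed threshold
        return "block"

    # Layer 3: uncertain cases go to a person.
    if model_score >= 0.5:      # assumed threshold
        return "human_review"

    return "approve"


impossible = assess_transaction({"amount": -20.0, "card_expired": False}, model_score=0.01)
suspicious = assess_transaction({"amount": 900.0, "card_expired": False}, model_score=0.95)
borderline = assess_transaction({"amount": 300.0, "card_expired": False}, model_score=0.62)
ordinary = assess_transaction({"amount": 12.5, "card_expired": False}, model_score=0.03)
```

The negative amount is blocked by the rule even though the model scored it as harmless. Each layer covers a weakness in the others.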

An AI assistant may retrieve approved knowledge, generate a response and ask a human to verify it when confidence is low.

That's not a weakness.

It's sensible system design.

The real intelligence is in the architecture

Machine learning models are getting better very quickly.

But better models won't compensate for broken pipelines, poor data, weak governance or unmaintained knowledge.

The real challenge isn't simply building something that can predict.

It's building a system that can be trusted.

That means understanding where the data came from, how the model was trained, what the output represents, how performance is measured and what happens when the system is wrong.

It means monitoring the model after deployment rather than treating launch day as the finish line.

It means accepting that computational systems live inside organisations, and organisations are messy.

So yes, the machine may correctly predict that you'll order a pint.

But before we declare it intelligent, we should probably ask a few more questions.

What data did it use?
Was the data representative?
Did it learn your preferences or just notice that most people in that pub order beer?
What happens when you decide to order champagne?
And most importantly, who's checking whether the answer still makes sense?

Because machine learning isn't really about teaching computers to know.

It's about teaching them to estimate.

The hard part is knowing when we should trust the estimate.

by Paul Tucker - contact@index-ai.net