Cybergarden

8 Steps to Fine-Tune an Open-Source LLM on Your Company FAQs

Author: Almaz Khalilov

[Figure: Machine learning workflow diagram showing the 8-step process for fine-tuning open-source language models on company FAQ datasets for customer support automation]

What if your company's FAQ page could train its own AI assistant to answer customer questions 24/7? Fine-tuning an open-source LLM on your existing FAQs can turn this into reality for Australian SMEs. This guide walks you through eight actionable steps to fine-tune a Large Language Model (LLM) on your company's FAQ data, using cost-effective tools and following best practices. By the end, you'll know how to prepare your data, apply efficient LoRA fine-tuning, leverage affordable GPU platforms, and ensure compliance with Australian regulations – enabling you to deploy a custom AI FAQ bot that saves support time and delights users.

Why Fine-Tune an LLM for Your FAQs?

Fine-tuning aligns a pre-trained model with your domain-specific knowledge and tone. Unlike prompt engineering or retrieval hacks, a fine-tuned model internalizes your FAQ data to give direct, concise answers. The payoff can be significant in both performance and cost: "a fine-tuned Llama 7B model can be ~50× more cost-effective per token than an off-the-shelf model like GPT-3.5, with comparable performance". For an SME, this means you might replace expensive API calls with a one-time training investment and a lightweight model running cheaply on your infrastructure. Fine-tuning is great for embedding custom knowledge, style, or instructions into an AI model's responses (e.g. ensuring it uses your company's terminology or writing style). And critically for Australian businesses, using an open-source model keeps your data in-house – no customer information needs to be sent to a third-party SaaS model, helping meet Privacy Act 1988 obligations.

Value Proposition: Empower your own "ChatGPT" that knows your business – at a fraction of the cost and without data leaving your control. An open-source LLM fine-tuned on your FAQs can operate internally (or within your chosen cloud region), mitigating privacy risks and latency. In fact, companies have achieved up to 80× cost reductions by fine-tuning smaller "compact" models instead of relying on large proprietary LLMs. In one case study (from CFM), a fine-tuned small model (GLiNER) delivered 93.4% accuracy on its task – rivaling a 70B parameter model's performance – while costing only ~$0.10/hour on CPU (vs $8.00/hour for the 70B model). This illustrates the power of fine-tuning compact models for dramatic cost savings without sacrificing accuracy.

The 8-Step FAQ LLM Fine-Tuning Guide

Following is an eight-step process tailored for SMEs to fine-tune an open-source LLM on FAQ-style data. We'll use a running example of creating a customer support assistant from an FAQ document. Each step includes best practices and tips on tools.

Step 1: Prepare and Curate Your FAQ Dataset

Gather the set of Q&A pairs that you want the LLM to learn – for example, a list of frequently asked questions and their answers from your website or internal docs. Quality and formatting are key: ensure the answers are accurate and up-to-date, since the model will learn whatever patterns you provide. Clean the data to remove any irrelevant text (headers, URLs, etc.), and consider rephrasing questions into a consistent style if they come from multiple sources.

Next, format the dataset for training. The most common structure is a simple table or file (CSV or JSON Lines) with "prompt" and "response" fields for each Q&A pair. For instance, you might create a CSV with two columns: instruction (the question) and output (the answer). Each row would contain one FAQ question as the model's input and the expected answer as the target output. Make sure to split your data into a training set and a small validation set – e.g. 90% of the Q&As for training, 10% for validation – so you can evaluate how well the model learns to answer unseen questions. This helps avoid overfitting (when the model just memorizes answers without generalizing).
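To make this concrete, here is a minimal sketch of the formatting and splitting step in Python. The file names, field names (instruction/output), and the 90/10 split are illustrative assumptions – adapt them to whatever your fine-tuning tool expects.

```python
import json
import random

# Assume `faqs` holds your exported Q&A pairs, e.g. scraped from the website or helpdesk.
faqs = [
    {"instruction": "What is your refund policy?",
     "output": "We offer a full refund within 30 days of purchase, provided..."},
    # ... add the rest of your FAQ pairs here
]

random.seed(42)
random.shuffle(faqs)

split = int(len(faqs) * 0.9)              # 90% training, 10% validation
train_rows, val_rows = faqs[:split], faqs[split:]

# Write JSON Lines files, a format most fine-tuning tools accept directly.
for path, rows in [("faq_train.jsonl", train_rows), ("faq_val.jsonl", val_rows)]:
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```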

Privacy tip: If your FAQs include any personal or sensitive information, anonymize those details before training. The Privacy Act 1988 requires protecting personal data, so ensure that any customer info in training examples is either removed or that you have consent to use it for this purpose. In general, using public or internal non-personal FAQs keeps things straightforward from a compliance perspective.

Step 2: Choose an Open-Source Base LLM (Model Selection)

Choosing the right base model is crucial. You'll want an open-source LLM that balances performance with resource requirements. Popular choices in 2025 include Meta's Llama 2 (and its successors Llama 3 series), EleutherAI's Pythia, Mistral 7B, Falcon by TII, and Google's Flan-T5 (for a smaller sequence-to-sequence style model). For a FAQ chatbot, you'll typically pick a model that has been pre-trained on general language but not yet specialized for instruction/Q&A – unless you start with an already fine-tuned variant.

Consider model size: larger models (e.g. 13B, 30B, 70B parameters) can potentially give more fluent and accurate answers, but they require more GPU memory and time to fine-tune. Smaller models (7B or even 3B) train faster and can often suffice for straightforward FAQ responses. In many cases, a fine-tuned 7–13B model hits a sweet spot for SME applications. As a reference, fully fine-tuning a model can need roughly 16 GB of GPU VRAM per 1 billion parameters – so a 7B model might demand ~112 GB VRAM if you tried to update all weights (which is impractical for most SMBs!). We will address how to slash this requirement in Step 4 with efficient tuning.

You should also check the model's license and community usage: for commercial projects, ensure the model allows commercial use. (For example, Meta's LLaMA 2 license permits commercial use with certain conditions, while some older "research-only" models like the original LLaMA 1 were restricted). Models like Dolly 2.0 from Databricks are explicitly released for commercial use – Dolly is a 12B model fine-tuned on a high-quality instruction dataset written by Databricks employees. Such models prove that open LLMs can be adapted to useful assistants by fine-tuning on relevant data. If one of these already fine-tuned models aligns closely with your needs, you might even start from it (e.g. if you find an open model already fine-tuned on Q&A or customer support data, it could save you effort). But for full control and learning, many SMEs choose a base foundation model and fine-tune it themselves on their unique FAQs.

Model examples: If your FAQs are general and text-based, a general-purpose model like Llama-2 7B or 13B is a good start. If your FAQs involve code or technical terms (say you're a software company and FAQs include code snippets), a model like Code Llama (a Llama-2 fine-tuned on coding tasks) might be better. If you need multilingual support (e.g. FAQs in English and Chinese), check if the model supports those languages or consider a multilingual LLM. The key is to pick a model that you can reasonably fine-tune with your available compute (we address hardware next) and that doesn't overshoot your quality needs – remember, a smaller fine-tuned model can outperform a larger model that isn't specialized to your task, and it will be cheaper to run.

Step 3: Set Up a Cost-Effective Fine-Tuning Environment

You don't need an enterprise GPU farm to fine-tune a model on your FAQs – there are affordable ways to get the necessary compute:

  • Use a Cloud GPU Service: Services like Google Colab or Modal Labs let you rent GPUs by the hour (or even free in Colab's case, within limits). Colab Pro, for example, costs about A$15/month and gives access to a single GPU (often an NVIDIA T4 or better) for interactive sessions, with a monthly compute quota included.
  • Lease or Rent GPUs on-demand: Beyond Colab and Modal, there are platforms like AWS SageMaker, Azure ML, or community GPU marketplaces (e.g. Vast.ai, Paperspace). AWS SageMaker is a fully managed ML platform where you can fine-tune models on instances like ml.g5.4xlarge (which has an NVIDIA A10G 24GB) or more powerful instances. On-demand rates might be around A$4–6 per hour for a 24GB GPU in AWS Sydney, though spot instances or reserved plans can lower this. The upside is you can select an Australian data center to run your training, satisfying any requirements to keep data onshore. The downside is you'll need to handle more setup (SageMaker or cloud VMs require environment setup, though SageMaker has some built-in tooling for Hugging Face training jobs).
  • Use Consumer-Grade Hardware (On-Premises): If your team has a desktop with a high-end GPU (such as an NVIDIA RTX 4090 with 24GB VRAM, cost ~A$3,000), you can fine-tune smaller models on it. Many SMEs go this route for full control – no recurring cloud fees, and data never leaves your office. A 24GB card can handle fine-tuning a 7B or 13B parameter model with parameter-efficient methods (or even larger models if using advanced techniques like QLoRA, as we'll discuss). Just ensure you follow good IT security practices (patch the OS/drivers, restrict access to the machine running the training, etc.). According to the Australian Cyber Security Centre's Essential Eight, maintaining software patching and restricting admin privileges are critical – this applies to your AI training environment too. If you're fine-tuning on a local server, treat it as a sensitive system: keep it updated and control who can access the training data and model outputs.
  • Hybrid approach: Some SMEs start on Colab to prototype, then move to a more powerful setup (cloud or on-prem) for the full training run once everything is working. This saves costs and allows quick iteration on a small sample of data before committing to the complete fine-tune.

Cost Estimation: Fine-tuning time can range from under an hour (for very small models or small datasets) to multiple hours or days (for bigger models or large datasets). For example, a complete fine-tune of a 7B model on a few thousand QA pairs might take a couple of hours on a single A100 GPU. If an A100 costs ~US$3.50/hour (~A$5.30/hour), that training might cost only A$10–15 in cloud fees – far less than the monthly cost of an AI service subscription. Always monitor usage to avoid surprises (cloud platforms let you set budgets/alerts). The table below compares some compute options:

| Compute Option | Specs | Approx Cost (AUD) | Pros & Cons |
| --- | --- | --- | --- |
| Google Colab Pro | 1× Tesla T4 or P100 (16GB), up to ~12-hour sessions | ~$15/month (subscription) + included compute quota | + Low cost, easy start in notebook format. + No setup; great for experiments. - Limited session duration, might disconnect. - Data on Google's cloud (typically US servers). |
| Modal Labs (Serverless) | Various GPUs on-demand (e.g. A10 24GB, A100 40GB) | ~$5–7 per GPU-hour (pay-as-you-go) | + No infrastructure to manage; scales easily. + Only pay for what you use; good for one-off jobs. - Need to upload data to cloud (ensure compliance). - Requires writing a small script to launch jobs. |
| AWS/Azure Cloud VM | Choose instance (e.g. g5.2xlarge, V100/A100) in Sydney region | ~$4–6 per hour (on-demand); spot instances ~70% cheaper | + Full control of environment; pick an AU data region for compliance. + Can integrate with other cloud services (S3, etc.) easily. - Must manage VM or use ML platform; more DevOps overhead. - On-demand rates can be pricey without optimization. |
| On-Premises GPU (e.g. RTX 4090) | 24 GB VRAM (desktop/workstation) | ~$3,000 one-time (hardware) + electricity | + Data never leaves your premises – maximum privacy. + One-time cost if you'll do frequent training/inference. - High upfront investment; hardware may become outdated. - Need in-house expertise to maintain drivers, etc., and physical security of the machine. |

Table: Comparison of cost-effective GPU solutions for fine-tuning. Pricing is approximate in AUD. Colab's subscription gives a monthly quota of usage, while cloud platforms charge per hour (spot pricing can cut costs). On-premises is cheapest long-term if you already have hardware and the know-how.

Compliance note: If using cloud services, remember Australian Privacy Principle #8 (cross-border disclosure of personal information). If your FAQ data contains personal info, you should ensure the cloud service meets equivalent privacy protections or opt for an Australian region to host the training. In practice, for an FAQ chatbot we often use non-personal Q&As, so this may be a minor concern – but it's good governance to keep track of where your training data and fine-tuned model are stored. By fine-tuning an open-source model, you always have the option to bring the model back in-house after training (since you can download the weights) and then deploy it on a local server for inference. This way, even if you trained in the cloud, the final model serving can happen under your control.

Step 4: Leverage Parameter-Efficient Fine-Tuning (PEFT) – LoRA

One of the secret weapons that makes fine-tuning feasible on modest hardware is parameter-efficient fine-tuning (PEFT). Instead of updating all 6+ billion weights of a model (which, as we saw, would require enormous memory and compute), PEFT techniques adjust only a small subset of parameters or add small adapter modules, leaving the rest of the model untouched. The most popular PEFT method is LoRA (Low-Rank Adaptation).

How LoRA works: LoRA inserts small trainable matrices into the model's layers (often in the attention mechanism) and only trains those new matrices during fine-tuning. Think of it like adding a tiny adjustable "lens" on a fixed large camera – you fine-tune the lens rather than rebuilding the whole camera. Under the hood, LoRA takes the large weight matrices of the model and factorizes the changes as low-rank updates (hence the name). The original weights stay frozen, and the model learns the task by tweaking these low-rank adapter weights. This drastically reduces the number of trainable parameters (often by orders of magnitude) while achieving nearly the same performance as full fine-tuning. In practice, you might train only a few million parameters (the LoRA adapters) instead of billions, cutting GPU memory needs and training time. After training, you can either merge the LoRA adapters into the base model weights or keep them separate and use them on-the-fly. Keeping them separate is convenient – the adapter file is small, and you can apply it to the original model at runtime to get the same result as the merged model.
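To see why the low-rank factorization saves so much, here is a back-of-the-envelope calculation for a single weight matrix (the 4096 hidden size and rank 8 are illustrative values, not taken from any particular model):

```python
d = 4096       # width of one square weight matrix (illustrative)
r = 8          # LoRA rank

full_update = d * d              # parameters you'd touch with full fine-tuning of this matrix
lora_update = d * r + r * d      # parameters in the two low-rank factors (d x r and r x d)

print(f"full: {full_update:,}  LoRA: {lora_update:,}  ratio: {lora_update / full_update:.2%}")
# full: 16,777,216  LoRA: 65,536  ratio: 0.39%
```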

For our FAQ bot example, using LoRA means you could fine-tune a model like Llama-2 7B on a single 16–24 GB GPU without issues, where full fine-tuning would have been impossible on that hardware. As a bonus, LoRA doesn't overwrite the original model – you get an "add-on" with your company knowledge. This modularity is great if you want to maintain multiple versions or revert changes easily. Microsoft researchers (who introduced LoRA) and Hugging Face provide libraries to apply LoRA with just a few lines of code. In fact, the Hugging Face PEFT library (peft) implements LoRA so you can wrap a transformer model with LoRA layers in one command.
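Here is a hedged sketch of that wrapping step with peft. The base model name and LoRA hyperparameters (rank, alpha, target modules) are assumptions you would tune; q_proj/v_proj are common targets for Llama-style architectures.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"    # assumed base model (requires licence acceptance)
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,                          # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # reports a few million trainable params vs ~7B total
```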

LoRA in action: Let's say your base model is 6 billion parameters. A typical LoRA setup might introduce ~30 million new parameters (this number depends on the "rank" you choose for the low-rank updates, e.g. rank=8 or 16). Those 30M are the only ones that will get gradients and update – a mere ~0.5% of the model's size! The memory needed for gradients and optimizer states is therefore drastically lower than when training all 6B. As one source puts it, "LoRA updates only a small adapter matrix on top of the original weights… significantly faster than traditional fine-tuning, and the adapter can be saved separately for a small memory footprint."

Step 5: Fine-Tune the Model on Your Data

Now comes the actual fine-tuning run. With your dataset ready, model chosen, and a plan to use LoRA (or another PEFT method), you can start training. This generally involves:

  1. Loading the pretrained model and tokenizer: Using a framework like Hugging Face Transformers, load your base model. If using LoRA, you'd integrate the LoRA adapters at this point (e.g. via the peft library or a training framework that supports it).
  2. Preparing the data loader: Tokenize your FAQ questions and answers. Often you'll format each Q&A pair as a single concatenated text for the model. For example, you might use a prompt template: "Question: FAQ question\nAnswer:" as the input, and train the model to generate the answer after the prompt. Ensure that the model sees where the question ends and answer begins (some use delimiters like special tokens or newlines). The Hugging Face Datasets library can help create a dataset and dataloader from your CSV/JSON easily.
  3. Configuring training hyperparameters: Set a suitable learning rate (small, e.g. 1e-4 or 2e-5 for fine-tuning), batch size (depends on memory – maybe 1-4 per GPU if model is large), and epochs (how many passes through the data). FAQ datasets might be small (hundreds or thousands of pairs), so you may do multiple epochs. Monitor for overfitting – if the model memorizes answers verbatim and fails to generalize to rephrased questions, you might need to augment data or reduce epochs.
  4. Training loop: If using a high-level trainer (like Hugging Face's Trainer or PyTorch Lightning), this is handled for you. The trainer will feed the data to the model and update the LoRA adapter weights gradually. With LoRA, training is fast – potentially just minutes to a couple of hours for a small dataset on a decent GPU. You'll see the loss (error) hopefully decrease on both training and validation sets. It's good to use early stopping or at least save checkpoints: if the model stops improving or starts overfitting (val loss goes back up), you can stop early.

During training, keep an eye on GPU usage. If you run out of memory, you might need to lower batch size, use gradient accumulation (simulate larger batches by accumulating gradients over steps), or even use mixed precision (most frameworks do this by default now – FP16 reduces memory) or gradient checkpointing. These techniques can help fit larger models on smaller GPUs, albeit with some speed trade-off.
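Putting Steps 1–4 together with the memory-saving options just mentioned, here is a hedged training sketch using the Hugging Face Trainer, the LoRA-wrapped model from Step 4, and the JSONL files from Step 1. The hyperparameters, file names, and prompt template are illustrative defaults, not prescriptions.

```python
from datasets import load_dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# `model` and `tokenizer` are the LoRA-wrapped model and tokenizer from Step 4's sketch.
tokenizer.pad_token = tokenizer.eos_token

data = load_dataset("json", data_files={"train": "faq_train.jsonl",
                                        "validation": "faq_val.jsonl"})

def to_features(example):
    # One concatenated text per Q&A pair; the EOS token marks where the answer ends.
    text = f"Question: {example['instruction']}\nAnswer: {example['output']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(to_features, remove_columns=data["train"].column_names)

args = TrainingArguments(
    output_dir="faq-lora",
    per_device_train_batch_size=2,          # keep small on a 16-24 GB GPU
    gradient_accumulation_steps=8,          # simulate a larger effective batch size
    learning_rate=1e-4,
    num_train_epochs=3,
    fp16=True,                              # mixed precision to save memory
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
print(trainer.evaluate())                   # validation loss as a quick overfitting check
model.save_pretrained("faq-lora-adapter")   # saves only the small LoRA adapter
```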

Tools to simplify training: Instead of writing all the boilerplate code yourself, you can use specialized fine-tuning frameworks that handle a lot of this (we detail them in the next section). For example, Axolotl is a popular open-source tool where you just provide a config (YAML or Python dict) with your data path, model name, LoRA parameters, etc., and it runs the training for you. This can save you from debugging training loops and ensure best practices (like shuffling data, using the right LR schedulers) are applied.

Step 6: Evaluate the Fine-Tuned Model

After (or during) training, you need to assess how well the model learned your FAQs. There are a few ways to evaluate:

  • Validation Set Performance: If you set aside some FAQ pairs as validation, check the model's responses to those questions. Since it saw similar ones but not these exact ones in training, this is a good gauge of generalization. Are the answers correct and well-written? If you have ground-truth answers, you can calculate metrics. For instance, you could compute the exact match or F1 score if the answers are short and factual. For longer answers, a similarity metric or just manual review might be better.
  • Sample Q&A Testing: Try asking the fine-tuned model variations of your questions. A great benefit of fine-tuning is that the model should handle paraphrased questions, not just the exact wording from the FAQ. For example, if an FAQ was "What is your refund policy?" and the answer explains it, test the model with "How do refunds work?" or "Can I get my money back if not satisfied?" – it should ideally give a coherent answer drawn from the policy it learned. If it fails or hallucinates (makes up unrelated info), you may need to fine-tune more (maybe add those phrasings into the training data or adjust training parameters). A small spot-check script is sketched after this list.
  • Comparison to Base Model: It's insightful to ask the original base model the same questions and compare responses. A fine-tuned model should consistently output the specific correct answers, whereas a base model (even a big one) might answer in generalities or incorrectly. This was exemplified in the earlier case study: a compact fine-tuned model outperformed a much larger model on the specialized task because it had the relevant knowledge baked in. Fine-tuning should greatly improve accuracy on your domain-specific queries.
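To make the paraphrase spot-check concrete, here is a small sketch. It assumes the fine-tuned model and tokenizer from Step 5 are in memory; the questions listed are placeholders for your own FAQs.

```python
from transformers import pipeline

# If your transformers version doesn't accept a PEFT-wrapped model directly in a pipeline,
# merge the adapter into the base weights first: model = model.merge_and_unload()
qa = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

test_questions = [
    "What is your refund policy?",                      # original FAQ wording
    "How do refunds work?",                             # paraphrase not seen verbatim in training
    "Can I get my money back if I'm not satisfied?",
]

for q in test_questions:
    out = qa(f"Question: {q}\nAnswer:", max_new_tokens=120, do_sample=False)
    answer = out[0]["generated_text"].split("Answer:", 1)[-1].strip()
    print(f"Q: {q}\nA: {answer}\n")
```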

If the results aren't satisfactory, revisit your dataset (Step 1) and training setup (Step 5). Maybe you need more examples for certain tricky FAQs, or perhaps the model is too small to capture a particularly complex answer. It's not uncommon to do a few iterations: tweak some training prompts, add a few more Q&A pairs for questions it got wrong, and fine-tune again (possibly starting from the previous fine-tuned model for continuity).

Evaluation tip: In addition to raw Q&A accuracy, check the style and tone. If your answers need to follow a certain tone (formal vs casual, concise vs detailed), ensure the model's outputs align. Fine-tuning should have imparted some stylistic cues present in your answers. If not, you might explicitly add instructions in the prompt during inference (e.g. "Answer in a friendly tone: ...").

Step 7: Deploy and Integrate the Customized LLM

With a fine-tuned model that performs well, the next step is deployment. For an FAQ assistant, deployment could mean a few things:

  • Internal API or Chatbot: You can wrap the model in an API endpoint (e.g., using FastAPI or Flask in Python) or a chat interface on your website/intranet. The Hugging Face Transformers library makes it simple to load the model and serve answers: e.g. qa = pipeline('text-generation', model=your_model, tokenizer=your_tokenizer, device=0) then call qa("Question: ... Answer:") – a fuller FastAPI sketch follows this list. There are also frameworks like LangChain that help build conversational agents; you could integrate your fine-tuned model as the core of a QA chain without needing external calls.
  • Hugging Face Inference Endpoint or Space: If you prefer not to host yourself, Hugging Face offers Inference Endpoints – you upload your fine-tuned model (privately if you wish) to the Hugging Face Hub and spin up a managed API for it. This is paid but handles scaling. In the CFM case study mentioned earlier, the team deployed its models on Hugging Face endpoints to serve results at scale. For a smaller scale (e.g. just your team or a few hundred users), hosting on a single VM or even a powerful desktop might be enough.
  • On-device or Edge deployment: If your fine-tuned model is small (and perhaps further compressed via quantization), you could even run it on the edge – for example, on a customer support representative's laptop or a mobile app (some 7B models can run on phones now with 4-bit quantization!). This reduces latency and dependence on internet. However, for an SME's FAQ bot, usually a server or cloud function is simplest.
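As an example of the first option above, here is a hedged FastAPI sketch. The model path, endpoint name, and generation settings are assumptions; add authentication and rate limiting before exposing anything beyond your intranet.

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
qa = pipeline("text-generation", model="./faq-merged-model", device=0)  # illustrative local path

class Question(BaseModel):
    text: str

@app.post("/ask")
def ask(q: Question):
    prompt = f"Question: {q.text}\nAnswer:"
    out = qa(prompt, max_new_tokens=150, do_sample=False)
    answer = out[0]["generated_text"].split("Answer:", 1)[-1].strip()
    return {"answer": answer}

# Run locally with: uvicorn app:app --host 0.0.0.0 --port 8000
```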

When deploying, consider inference optimizations: you don't need full FP32 or FP16 precision for inference – you can quantize the model to 8-bit or even 4-bit weights to reduce memory and CPU/GPU usage. This often has minimal impact on answer quality. For instance, using the bitsandbytes library for 8-bit loading is common, and tools like GPTQ can create 4-bit quantized models that still perform well. A quantized fine-tuned model uses less RAM/VRAM, which might let you serve more concurrent users or even deploy on cheaper hardware – essential for cost efficiency in SME settings.
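For instance, a hedged sketch of 8-bit loading with bitsandbytes (the model path is illustrative; swapping to load_in_4bit=True works the same way):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "./faq-merged-model"                     # illustrative path to your fine-tuned model
bnb_config = BitsAndBytesConfig(load_in_8bit=True)    # or load_in_4bit=True for more savings

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",                                # place layers on available GPU/CPU automatically
)
```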

Compliance & Security in deployment: Now that the model is live answering questions, ensure it doesn't inadvertently expose any data it shouldn't. If the training FAQs were all public or internal non-sensitive info, you're fine. But if any sensitive data was included, test that the model doesn't reveal it in odd contexts (AI models can sometimes regurgitate training data verbatim if prompted the right way). Also, implement basic usage controls: for example, require an internal login to access an internal FAQ assistant so that the model's answers (which might include some proprietary info) aren't open to the public. This aligns with the Essential Eight strategies such as restricting access (only authorized staff should query an internal model). Monitor the system for any unusual activity or outputs, just like you would any new IT system introduced in your business.

Step 8: Monitor, Maintain, and Refine the Model

The project doesn't end at deployment. To get long-term value from your fine-tuned FAQ LLM, plan for ongoing maintenance:

  • Monitor performance: Gather feedback from users (or your own testing) on whether the answers remain correct and helpful. Track if certain questions make the bot confused or if users are asking things outside its training. This can inform future training.
  • Update FAQs and retrain as needed: Companies update policies, add products, or discover new frequently asked questions. Set up a process to periodically update the FAQ dataset and re-fine-tune the model. The good news is that with LoRA or adapter-based tuning, you can often start from your last fine-tuned model and incorporate new data with a bit more training – this incremental learning is usually faster than the initial training (a short sketch follows this list). If the change is small, sometimes just fine-tuning on a handful of new Q&As for a few epochs is enough to refresh the model.
  • Stay compliant: If regulations or company policies change, reflect that in your model's behavior. For example, if new privacy rules forbid the AI from answering certain types of questions, implement that either by adding those to your FAQs ("Sorry, we cannot provide that information.") or by adjusting the prompt at inference time to steer it away from problematic areas. Always document the data used in training and who approved it – this helps with governance and audit requirements. In Australia, if your industry has specific compliance obligations (e.g. the health sector must comply with the Health Records Act in addition to the Privacy Act), ensure the use of AI doesn't violate any data handling rules. Since your model is within your control (unlike a public AI service), you have the flexibility to purge or secure the data as needed – exercise that responsibility.
  • Measure ROI: Over time, measure how the fine-tuned model is benefiting your business. Perhaps you notice a reduction in support tickets or faster response times. This can justify further investment (like maybe fine-tuning a slightly larger model for even better quality, or adding more data sources). It's also useful to compare against alternatives occasionally – e.g., is your fine-tuned model performing better than a generic model with retrieval augmentation? In many cases it will for the targeted domain of your FAQs, but keeping an eye on it ensures you're using the best approach.
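As a sketch of that incremental update, assuming your previous adapter was saved to faq-lora-adapter and the new Q&A pairs are in a file like faq_new.jsonl (both names are placeholders from the earlier sketches):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

# Re-attach the existing adapter with its weights trainable, then continue training
# on the new Q&A pairs (reusing the Trainer setup from Step 5).
model = PeftModel.from_pretrained(base, "faq-lora-adapter", is_trainable=True)

# ... load "faq_new.jsonl" (optionally mixed with some of the original FAQs so the
#     model doesn't drift), tokenize as in Step 5, and call trainer.train() again ...

model.save_pretrained("faq-lora-adapter-v2")
```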

By following these steps, you've transformed an open-source LLM into a bespoke QA specialist for your company. Next, we'll overview some of the tools and frameworks that help streamline this journey, and how to choose the ones that fit your needs.

Tools and Frameworks for Fine-Tuning LLMs (and How to Choose)

Fine-tuning might sound complex, but the community has developed excellent tools to simplify the process. Many share common features to help with efficiency and usability:

  • Built on Hugging Face ecosystem: Almost all modern fine-tuning frameworks leverage the Hugging Face Transformers library under the hood for model loading and tokenization. They provide convenient wrappers or configurations so you don't have to write pure PyTorch unless you want to.
  • Support for PEFT techniques: The top tools support methods like LoRA and QLoRA natively, along with traditional full fine-tuning. This means they automatically apply the tricks to reduce memory usage and speed up training.
  • Out-of-the-box optimizations: Expect features like mixed precision (FP16/BF16 training), optimized kernels (some use DeepSpeed or FlashAttention under the hood), and gradient accumulation. These ensure you get the most out of limited hardware.
  • Easy data handling: They often accept multiple data formats (JSON, CSV, etc.) and can perform operations like text concatenation or sample packing for you. Some provide YAML/JSON config files to specify your dataset and parameters, avoiding code edits.
  • Checkpoints and Resuming: If your job crashes or you want to train in segments, these tools usually handle saving checkpoints and resuming training seamlessly.
  • Community recipes: Being popular in open-source means you can find community-provided configuration files or examples (for instance, "someone's Axolotl config for fine-tuning Llama-2 on a Q&A dataset"). This can accelerate your setup.

Below, we profile a few leading open-source frameworks that make fine-tuning easier, and a couple of low-code platforms, highlighting their features, ideal use-cases, and any costs. All the open-source libraries are free to use (you just need compute). We've also included links and notes on compliance where applicable:

  • Hugging Face Transformers + PEFT: Approach: Use the raw transformers library (Trainer API or custom PyTorch) along with Hugging Face's PEFT add-on for LoRA. Pros: Maximum flexibility – you can tweak the model or training loop as needed. Hugging Face provides extensive documentation and even example scripts (e.g., a script to fine-tune GPT-2 on text, which you can adapt to any model). Cons: Slightly more coding/dev work upfront compared to higher-level frameworks. Use if: you are comfortable with Python and want full control, or have a very custom fine-tuning scenario. Compliance: Since this isn't a service, compliance depends on where you run it – you can train fully offline on your secure environment, which is ideal for sensitive data. No cost except compute.
  • Axolotl: An easy-to-use wrapper for fine-tuning large language models, created specifically to streamline LLM fine-tuning with sane defaults and configurations. You write a short YAML config describing your model (it can fetch models from the Hugging Face Hub by name), your data path, and fine-tuning settings. Axolotl supports LoRA, QLoRA, full fine-tuning, multi-GPU, etc., with built-in optimizations like sample packing (combining shorter samples to utilize the context window). It's a great all-rounder: "If you don't want to get deep into the math and just want to fine-tune a model, use Axolotl." Pros: Beginner-friendly, yet supports advanced setups. Works with many model architectures (LLaMA 2/3, Falcon, Mistral, Pythia, etc.). Cons: Slight learning curve with the config format, but plenty of examples exist. Ideal for: Most users, especially beginners or anyone wanting a proven, community-backed tool. It's the recommended choice in most cases. Cost: Free tool; you run it on your own hardware or cloud (Axolotl itself doesn't charge). Compliance: Since you run Axolotl in your environment, you retain full control of the data (just be mindful if that environment is cloud – then the same caveats about data location apply).
  • Unsloth: A newer entry, focused on speed and efficiency for those with limited hardware. Created by a former NVIDIA engineer, it introduces low-level optimizations (like a custom Triton GPU kernel for attention) to fine-tune models "2–5× faster with ~80% less memory usage" compared to standard methods. Impressively, Unsloth achieves this without relying on quantization – it's more about algorithmic efficiency. Pros: Super useful if you only have a single GPU or an older GPU with less memory. It's tailored for scenarios like free Colab T4 GPUs (which have 16 GB) – "if you only have access to smaller GPUs, Unsloth might be the choice". It supports LLaMA, Mistral and others, but note it currently works single-GPU only (no multi-GPU training). Cons: Slightly less mature community than Axolotl; also, since it uses custom kernels, you may need to ensure compatibility with your drivers. Ideal for: Researchers or devs working on a shoestring GPU budget, e.g., trying to fine-tune on Colab or a single RTX 2060. It really lowers the barrier. Cost: Free to use; your cost is just whatever platform you run it on (which could be the free Colab tier!). Compliance: Again, self-hosted. The tool doesn't phone home; just manage your environment properly.
  • Torchtune: A PyTorch-native library that provides a minimalist, extensible interface for fine-tuning LLMs. Think of it as lightweight scaffolding on top of PyTorch – no heavy abstractions. It's designed to be hackable and to integrate well with other PyTorch ecosystem tools, and it comes with recipes for things like LoRA and QLoRA built in. Pros: Great if you are a power user who likes to see what's going on under the hood. It doesn't hide the training loop from you as much; you can modify pieces easily. It also prides itself on working on consumer GPUs (24GB VRAM) out-of-the-box – meaning it's tested to be memory-efficient as well. Cons: Not as plug-and-play as Axolotl; you should know some PyTorch. Documentation may be sparser since it's a newer project. Ideal for: Engineers who want a clean, flexible base to build on, possibly to integrate custom losses or data processing steps. Also, if Axolotl or others don't support a niche requirement you have, Torchtune might allow you to implement it without starting from scratch. Cost: Free, open-source. Compliance: Self-hosted, so no issues beyond where you run it.
  • Low-Code Platforms (OpenAI Fine-Tuning, Predibase, etc.): There are also platforms that offer a more graphical or managed way to fine-tune. For example, OpenAI's fine-tuning service (not open-source, and limited to OpenAI's own models) lets you upload training data to their API and fine-tune an instance of GPT-3.5 or similar. Similarly, Predibase is a platform that lets you fine-tune open models through a UI. These are easy in the sense that you don't have to code, but they have downsides: OpenAI's is limited to their models (it can be expensive and you give up some control of your data), while Predibase supports various models and private deployments but is a paid product and may not offer the flexibility of coding your own solution. For an SME with technical capability, using open-source libraries as above is often more cost-effective in the long run (and avoids vendor lock-in). However, if you truly have no ML engineers, a low-code service might be an option – just weigh the recurring costs and compliance (e.g. OpenAI's fine-tuning means your data and model weights sit on their servers, likely overseas, which could be a privacy consideration).

In summary, Axolotl tends to be the go-to for most because of its balance of ease and power (it's even recommended as the default choice for beginners by experts). Unsloth is a lifesaver if you're constrained by weak hardware. Torchtune is there for the hands-on folks wanting flexibility. And of course, the vanilla Hugging Face + PEFT approach is always available if none of the above perfectly suit your needs.

Here is a quick comparison of these tools/frameworks:

| Tool/Framework | Ease of Use | Notable Features | When to Use | Cost |
| --- | --- | --- | --- | --- |
| Hugging Face Transformers + PEFT | Medium (code-level) | Full flexibility; large community; needs coding | Use if you need custom logic or want to learn inner workings. | Free (open-source) |
| Axolotl | Easy (config-driven) | YAML configs, multi-GPU support, LoRA/QLoRA built-in, sample packing | Default choice for most fine-tunes (especially first-timers). | Free (open-source) |
| Unsloth | Easy/Medium (some config) | Highly optimized for single GPU (Triton kernels), 2-5× speed, ~80% less memory | When GPU memory is very limited or using Colab/free GPU. | Free (open-source) |
| Torchtune | Medium (code-focused) | PyTorch-native, minimal abstraction, supports LoRA/QLoRA, 24GB-GPU friendly | For experienced users needing flexibility and integration. | Free (open-source) |
| OpenAI Fine-Tuning | Easiest (fully managed) | Simple API/UI, no infra needed, but only OpenAI models | Non-critical use-cases where using an OpenAI model is acceptable and budget allows. (Data leaves your control.) | Paid per token (usage-based) |
| Predibase | Easy (low-code platform) | UI-driven, supports various open models, private deployments possible | If you want a managed solution but still use open models. Check if they have AUS hosting. | Subscription/usage (proprietary) |

Table: Fine-Tuning Tool Comparison. All open-source options (top four) are free to use – you pay only for compute. OpenAI/Predibase are proprietary services with usage fees. Each open-source tool keeps your fine-tuning process in your hands, which is better for compliance (you decide where to run it). Choose based on your team's skill and hardware: for most, Axolotl is a great starting point, Unsloth if you're really constrained, Torchtune if you have niche requirements.

How to Pick the Right Approach for Your SME

With several options on the table, here's a brief guide for decision-making:

  • No ML expertise in-house / want quickest result: Consider a low-code platform or consulting with a third-party. But if data sensitivity is a concern, at least use a platform that supports open-source models and on-prem deployment (to avoid sending data to third parties). Sometimes, a compromise is using a pre-fine-tuned open model (like an existing FAQ bot model on Hugging Face) and then using Retrieval-Augmented Generation (RAG) for any missing pieces – this avoids training altogether but results may not be as tailored. If you do have some coding ability, using Axolotl with guidance can be quicker than you think.
  • Single data scientist/engineer available: Go with Axolotl. It will allow that person to get the job done with minimal fuss. They can start on a smaller subset of data in Colab to validate the pipeline, then scale up to the full run on a better GPU via Modal or AWS. Axolotl's defaults will handle most optimizations. As noted, it's recommended in the majority of cases.
  • Very tight compute resources (no GPU > 16 GB): Try Unsloth or a smaller model. Unsloth will squeeze performance out of limited hardware at the cost of not supporting multi-GPU. If your dataset is not huge, this might be fine. Alternatively, look at QLoRA, which we haven't explicitly detailed above: QLoRA combines 4-bit quantization of the base model with LoRA adapters during fine-tuning. It was proven that "QLoRA can finetune a 65B model on a single 48GB GPU while preserving full 16-bit performance". In other words, even with a single 48GB GPU (rare outside servers – that's more like a top-end A6000 or data-center card), you could fine-tune the biggest models. Scale that down and it means a 13B model might be fine-tuned on a 12GB GPU with QLoRA. If Unsloth or Axolotl with LoRA still don't fit a model in memory, consider QLoRA (Axolotl supports QLoRA too). It's slightly more complex (it needs the bitsandbytes library and certain GPUs) but very powerful for memory savings.
  • Concerned about data leaving your environment: Rule out OpenAI's and maybe even Colab (since Colab is on Google's cloud). Instead, favor on-prem or controlled cloud. That could mean using Axolotl/Unsloth on a local GPU server, or spinning up an AWS VM in the Sydney region under your account. This way, all data and model artifacts remain in locations you control. Remember, open-source gives you full ownership of the model – no one else has your fine-tuned weights unless you choose to share them.
  • Need multi-GPU for a large model or dataset: Axolotl is a good pick as it handles multi-GPU (it can integrate with DeepSpeed or FSDP for distributed training). For example, if you decided to fine-tune a 30B model, you might use 2–4 GPUs. This is advanced but doable; Axolotl and HF Transformers can manage it with the right config (and hopefully you'd be on a cloud machine with fast interconnects). Alternatively, you could reduce precision (like 8-bit or QLoRA) and stick to a single GPU.
  • Budget considerations: If budget is near zero, use free Colab and Unsloth with a small model. If moderate, a one-time cloud expense of a few hundred AUD could get you a fine-tuned model that then runs virtually free on CPU for answering questions. Compare that to the ongoing cost of something like an OpenAI API for each query – fine-tuning often wins in cost over just months of usage. Always align the model size to your actual needs (don't fine-tune a 70B model if a 7B suffices) to keep inference cheap and fast.

Summary: Bringing It All Together

In 2025, Australian SMEs have an unprecedented opportunity to build their own intelligent FAQ assistants. Open-source LLMs and efficient fine-tuning techniques like LoRA put this capability within reach technically and financially. By following a structured approach – preparing a clean dataset, choosing the right model, using affordable GPU resources, and leveraging PEFT methods – you can teach an AI to understand and respond with your business's knowledge and tone. We saw that even a smaller fine-tuned model can outperform larger generic models on a specialized task, and do so at a fraction of the running cost. The key steps include careful data curation, using tools like Axolotl or Unsloth to simplify training, and considering compliance (Privacy Act and Essential Eight) at each stage, especially if using cloud services.

With your custom model ready, integration into your workflows – whether as a customer-facing chatbot, an internal helpdesk assistant, or a component of a broader application – is the final leg. The fine-tuned LLM can now deliver instant, accurate answers about your products and policies, boosting efficiency and consistency. And because you hold the model's reins (weights and hosting), you can ensure it's used responsibly and securely.

To conclude, fine-tuning an open LLM on your company's FAQs transforms scattered information into a powerful Q&A brain at the heart of your business. It's a smart investment into AI tailored to your needs, with controllable costs and compliance. Australian SMEs can innovate in this space confidently, knowing that the tools and community knowledge are mature. As the technology evolves (with even more efficient models and tools on the horizon), the process will only get easier. So start small – maybe fine-tune a pilot model on a handful of FAQs – and iterate. You might be surprised at how quickly your AI assistant becomes an essential team member!

FAQs

Q1: Do I need a supercomputer or expensive GPU to fine-tune an LLM on my FAQs?
A: No. With techniques like LoRA and QLoRA, you can fine-tune moderately sized models on a single modern GPU (or even a free Google Colab session). For example, a 7B parameter model can be fine-tuned on a consumer 24GB GPU thanks to low-rank adapters and 4-bit quantization. It's far from needing a supercomputer – many SMEs use a desktop with an RTX card or rent a cloud GPU for a few hours. The rule of thumb is ~16GB VRAM per 1B parameters for full fine-tuning, but LoRA defies that by only training a small portion. Start with the resources you have; you can always upgrade model size or get a cloud GPU if needed.

Q2: How large should my FAQ dataset be for fine-tuning?
A: Fine-tuning does not require millions of examples. If you have hundreds of high-quality Q&A pairs, that can be sufficient to significantly improve a model's performance on that domain. In fact, fine-tuning on a small, targeted dataset can yield very strong results – the QLoRA research found that fine-tuning on just a few thousand carefully curated samples led to near state-of-the-art performance across multiple model types and benchmarks. Focus on quality and representativeness: include variations of how questions might be asked. If your dataset is extremely small (say <50 examples), you might consider data augmentation or using a larger model, but generally a few hundred examples can make a noticeable difference.

Q3: What's the difference between fine-tuning an LLM and using retrieval (RAG) with a base model?
A: Fine-tuning integrates the knowledge into the model's weights. The model "remembers" the FAQ information and can produce an answer without external help. Retrieval (RAG), on the other hand, leaves the model unchanged but equips it with a knowledge base: when asked a question, the model fetches relevant documents (your FAQs) and then formulates an answer using them. RAG is great when you have a large corpus or frequently updated info, because you don't need to retrain the model for each update – it just searches the latest data. However, RAG requires a search/index step and a bigger prompt (which can be slower or costly if using an API), and the model might not integrate the info as smoothly as a fine-tuned model does. Fine-tuning is preferable when you have a well-defined set of Q&As and you want the fastest, most direct responses (and you're okay retraining when the data changes). In many cases, you can also combine them: use fine-tuning to give the model strong conversational skills and base domain knowledge, and use retrieval for long-tail or less common questions that weren't in training. For an SME FAQ bot, if the FAQ list is reasonably scoped, fine-tuning alone can handle it elegantly – whereas if you had an entire company wiki or hundreds of pages, you might lean towards or supplement with RAG.

Q4: Our company handles sensitive data. Is it safe to fine-tune an open-source model with it?
A: It can be, but you need to take precautions. The benefit of open-source is you can do the entire process in a secure environment (no external API calls). If you fine-tune on a machine in your control (or a cloud VM in a region and setup you control), the data and model stay with you. Ensure that any service you use (e.g., cloud provider) is reputable and compliant with standards you need (for instance, ISO 27001 or IRAP if relevant). Also consider data minimization: do you need to include actual sensitive data in the training, or can you use abstracted info? For example, instead of fine-tuning on an FAQ that contains a real customer's name or ID, you could generalize it (use placeholders). In deployment, restrict access to the model if it contains proprietary info. From a privacy law perspective, if the model was trained on personal info, you should treat the model's outputs as potentially containing that info. This is a gray area legally, but following best practices (documenting consent/use of data, securing the model) will put you in a good position. Many SMEs choose to fine-tune on non-personal FAQs (product info, policies, etc.) – in those cases, the risk is low. For anything that includes personal data, consult your privacy officer or guidelines to ensure compliance with the Privacy Act and Australian Privacy Principles. The bottom line: open-source fine-tuning can be done in a very secure, private way (unlike using a third-party AI API), but you must implement the security around it.

Q5: How do I handle new FAQs or changes after I've fine-tuned the model?
A: You have a few options. The simplest is to periodically retrain with the updated dataset. Because you're using LoRA/adapters, you can even start from your already fine-tuned model and just continue training on the new data (this can quickly adjust the model). If changes are small, you might get away with training for just a few epochs on the fresh entries. Another approach is "continual learning" – for example, keep an ongoing set of Q&A pairs and do incremental training every so often (but be careful of the model gradually distorting if you never revisit old data – you might want to mix some original FAQ data in so it doesn't forget). If the FAQs change very frequently or you want to avoid retraining too often, consider integrating a retrieval step as mentioned, so that for any question not explicitly in the model, it can pull from the updated info. However, for most typical FAQ updates (which might happen monthly or quarterly), retraining is quite feasible – fine-tuning on a small dataset can be done in minutes to an hour, so it's not a heavy lift. You can script this update process as well. Since you own the pipeline, it's up to you how to schedule it. Always test the model after an update, to ensure the new info is correctly learned and the old answers still look good (no regressions).

Q6: Can fine-tuning make the model worse at general capabilities or introduce biases?
A: It's possible if done poorly, but manageable. When you fine-tune, especially with a small dataset, the model might become highly specialized. This is usually what you want – e.g., it will prefer your style of answer and knowledge. But it could also mean the model's behavior shifts in unintended ways. For example, if all your FAQ answers are very short, the model might start giving overly curt answers even when a user might benefit from more detail. Or if your data has a certain bias (all questions assume a certain context), the model could reflect that. To mitigate this, consider the following:

  • Mix data if needed: Some practitioners mix a bit of original general data or instructions to retain general abilities. However, for a pure FAQ bot, you usually don't need to do this unless you see a problem.
  • Evaluation: test the fine-tuned model on some general questions or edge cases. See if it still behaves safely and coherently. The base models (especially if you chose an instruction-tuned base like Llama-2-chat) have a lot of safety training. Fine-tuning on a narrow set might override some safety guardrails (e.g., if your data inadvertently encourages the model to give out info it normally wouldn't). Be mindful if any of your answers could be sensitive. If you notice issues, you might want to add some safety instructions as part of prompts or even as additional training data (like Q: "Should I mix bleach and ammonia?" A: "I'm sorry, I cannot assist with that." to reinforce that it should refuse certain harmful queries).
  • Bias: If your FAQ data is unbiased and factual, the main bias introduced will just be towards your company's content (which is fine). But always consider inclusion – e.g., if your answers use exclusively masculine pronouns or a single dialect, the model will mirror that. Diversify your wording if that matters to your context.

In general, fine-tuning on a focused dataset will narrow a model's output distribution – that's the goal (make it focus on relevant info). As long as the dataset is well-curated, this specialization is a positive. Just keep an eye out and test thoroughly. The advantage is, unlike a closed API, you can adjust the model if you find issues – either by further fine-tuning, or by post-processing outputs, etc. You have the levers to shape it as needed.