Customize Engine Settings
In this guide, we'll walk you through customizing your engine settings by configuring the `nitro.json` file.
- Navigate to the App Settings > Advanced > Open App Directory > `~/jan/engines` folder.
  - MacOS: `cd ~/jan/engines`
  - Windows: `C:/Users/<your_user_name>/jan/engines`
  - Linux: `cd ~/jan/engines`
- Modify the `nitro.json` file based on your needs. The default settings are shown below.
`~/jan/engines/nitro.json`

```json
{
  "ctx_len": 2048,
  "ngl": 100,
  "cpu_threads": 1,
  "cont_batching": false,
  "embedding": false
}
```
The table below describes the parameters in the `nitro.json` file; an illustrative tuned configuration follows the table.
| Parameter | Type | Description |
|---|---|---|
| `ctx_len` | Integer | Typically set to 2048, `ctx_len` provides ample context for model operations, similar to GPT-3.5. (Minimum: 1, Maximum: 4096) |
| `ngl` | Integer | Defaults to 100; `ngl` determines how many layers are offloaded to the GPU. |
| `cpu_threads` | Integer | Determines the number of CPU threads used for inference, limited by your hardware and OS. (Maximum determined by your system) |
| `cont_batching` | Boolean | Controls continuous batching, enhancing throughput for LLM inference. |
| `embedding` | Boolean | Enables embedding utilization for tasks like document-enhanced chat in RAG-based applications. |
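For example, on a machine with spare CPU cores you might raise `cpu_threads` and use the maximum context window. The values below are only a sketch and should be adjusted to your own model and hardware.

```json
{
  "ctx_len": 4096,
  "ngl": 100,
  "cpu_threads": 4,
  "cont_batching": false,
  "embedding": false
}
```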
tip

- By default, the value of `ngl` is set to 100, which offloads all layers to the GPU. If you wish to offload only about 50% of the layers, you can set `ngl` to 15, because most Mistral or Llama models have around 30 layers.
- To utilize the embedding feature, include the JSON parameter `"embedding": true`. This enables Nitro to process inferences with embedding capabilities. Please refer to Embedding in the Nitro documentation for a more detailed explanation.
- To utilize the continuous batching feature for boosting throughput and minimizing latency in large language model (LLM) inference, include `"cont_batching": true`. For details, please refer to Continuous Batching in the Nitro documentation. A sample configuration combining these options is sketched below.
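As a rough sketch, this is what `nitro.json` might look like with roughly half the layers of a ~30-layer model offloaded to the GPU and both embedding and continuous batching enabled; the values are illustrative, not recommendations.

```json
{
  "ctx_len": 2048,
  "ngl": 15,
  "cpu_threads": 1,
  "cont_batching": true,
  "embedding": true
}
```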
Assistance and Support
If you have questions, please join our Discord community for support, updates, and discussions.